Robotics

RLHF for vision-language-action models: how preference data differs from LLM RLHF

VLA models require different preference data than LLMs. Learn trajectory ranking, action alignment, and why LLM RLHF patterns don't translate.

Author · Mark Pinnes

19 April 2026

9 min

ML engineer ranking action preferences for vision-language-action model training.

IndiVillage Operating Centre · Bengaluru

Reinforcement Learning from Human Feedback (RLHF) trained GPT-4 and Llama. But RLHF for vision-language-action (VLA) models, systems that see an image, read a language instruction, and output robot actions, is a different beast. The preference data structure, annotation rubrics, and evaluation signals don't transfer from LLM RLHF.

Why LLM RLHF patterns break down for VLA

In LLM RLHF, annotators typically compare two completions (text outputs) and pick the better one on an overall axis such as quality, safety, or helpfulness. Comparing "I don't know" with "Here's a detailed explanation" is usually straightforward: one is obviously better.

In VLA, you're comparing two action trajectories (sequences of robot movements). Trajectory A reaches the mug but knocks over a cup. Trajectory B takes longer but is collision-free. Which is "better"? The answer depends on your downstream task, your safety constraints, and your objective function — all of which are implicit in LLM ranking but explicit in robotics.

Preference data for VLA: trajectory ranking

Rather than pairwise text comparison, VLA RLHF uses trajectory ranking. Here's the structure:

Input: Robot egocentric video + language instruction ("Pick up the mug and place it on the shelf")

Candidate trajectories: 5–10 action sequences generated by the VLA model or behaviour cloning baseline

Ranking criteria:

Task success (does the trajectory accomplish the goal?)
Collision avoidance (does it hit obstacles?)
Efficiency (is it reasonably fast, or does it waste time?)
Safety margins (does it maintain distance from humans, delicate objects?)
Naturalness (does the movement look human-like, or are there jerky artifacts?)

Output: Rank these 5–10 trajectories from best to worst.

This is much more complex than LLM ranking because you're evaluating multi-dimensional properties of a continuous action sequence, not a discrete text output.
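As a rough illustration, the structure above could be held in a record like the one below. This is a minimal sketch, not a format the article prescribes: the class names, field names, and the completeness check are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trajectory:
    trajectory_id: str
    actions: List[List[float]]  # one action vector per timestep (assumed layout)
    video_path: str             # rendered rollout the annotator watches


@dataclass
class RankingTask:
    instruction: str            # e.g. "Pick up the mug and place it on the shelf"
    observation_video: str      # robot egocentric video shown with the instruction
    candidates: List[Trajectory]
    ranking: List[str] = field(default_factory=list)  # trajectory ids, best first

    def has_valid_candidate_count(self) -> bool:
        # The article suggests 5 to 10 candidates per instruction.
        return 5 <= len(self.candidates) <= 10

    def is_complete(self) -> bool:
        # A ranking is complete when it names every candidate exactly once.
        ids = sorted(c.trajectory_id for c in self.candidates)
        return sorted(self.ranking) == ids
```

Keeping the ranking as an ordered list of ids makes it easy to expand into pairwise preferences later.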

Annotation rubric design for trajectory RLHF

You need explicit, testable criteria:

Success: Did the robot reach the target state? For "pick up mug," the robot must have gripper closed around mug at end of trajectory. For "place on shelf," the mug must be on the shelf surface, not floating or fallen.

Collisions: Count collisions with the scene. Zero collisions = highest score. One minor collision = 0.5 deduction. Major collision (robot toppling) = automatic lowest rank.

Efficiency: Measure trajectory duration. Rank by duration, but weight collision and success first — an efficient failure is worse than a slow success.

Safety margins: For each frame, compute distance from end-effector to nearest obstacle. Minimum distance <5cm = 0.5 deduction. <2cm = disqualify.

Naturalness: This is the hardest criterion. Experienced annotators rate smoothness, joint acceleration limits, grip stability on a 1–10 scale. This requires domain expertise; crowd workers can't rate this reliably.
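The measurable parts of this rubric can be pre-computed before a specialist sees the batch. The sketch below is one possible encoding, with several assumptions: the field names are hypothetical, the 0.5 deduction is applied per minor collision, disqualified trajectories sort last, and naturalness is left out because it is a human rating.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrajectoryMetrics:
    trajectory_id: str
    success: bool            # target state reached at end of trajectory
    minor_collisions: int
    major_collision: bool    # e.g. the robot topples
    min_clearance_cm: float  # smallest end-effector-to-obstacle distance over all frames
    duration_s: float


def deductions(m: TrajectoryMetrics) -> float:
    total = 0.5 * m.minor_collisions  # assumption: 0.5 per minor collision
    if m.min_clearance_cm < 5.0:
        total += 0.5                  # safety-margin deduction
    return total


def is_disqualified(m: TrajectoryMetrics) -> bool:
    return m.major_collision or m.min_clearance_cm < 2.0


def sort_key(m: TrajectoryMetrics) -> Tuple[bool, bool, float, float]:
    # Order of precedence: disqualification, success, deductions, then duration.
    return (is_disqualified(m), not m.success, deductions(m), m.duration_s)


def provisional_ranking(metrics: List[TrajectoryMetrics]) -> List[str]:
    return [m.trajectory_id for m in sorted(metrics, key=sort_key)]
```

Because duration is the last element of the sort key, an efficient failure can never outrank a slow success.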

Common pitfalls in trajectory ranking

Conflicting preferences: Annotators may disagree on trade-offs. Trajectory A is fast but risky. Trajectory B is safe but slow. Fix the priority order in the rubric so every annotator resolves the trade-off the same way; if disagreement between annotators still exceeds 20%, refine the rubric.

Silent failures: A trajectory looks reasonable but the gripper state at frame 180 is wrong — the grasp is unstable but the video doesn't show it failing until frame 500. Annotators must watch full trajectories and flag these latent failures.

Sim-to-real blind spots: Trajectories that work in simulation often fail on real robots due to sensor noise or control delays. Rank synthetic trajectories conservatively if you don't have real-world validation data.

Inconsistent trajectory quality: If all candidate trajectories are terrible, annotators can't meaningfully rank them. Enforce minimum baseline quality: if all trajectories fail the task, reject the batch and regenerate candidates.
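Two of these pitfalls lend themselves to automatic checks. The sketch below assumes that disagreement is measured as the fraction of candidate pairs two annotators order differently (the article does not say how its 20% figure is computed), and it borrows the 30% minimum success rate from the FAQ as the default batch gate. Function names are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def pairwise_disagreement(rank_a: Sequence[str], rank_b: Sequence[str]) -> float:
    """Fraction of candidate pairs that the two rankings order differently."""
    pos_a = {t: i for i, t in enumerate(rank_a)}
    pos_b = {t: i for i, t in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    if not pairs:
        return 0.0
    discordant = sum(
        1 for x, y in pairs
        if (pos_a[x] < pos_a[y]) != (pos_b[x] < pos_b[y])
    )
    return discordant / len(pairs)


def accept_batch(success_flags: Sequence[bool], min_success_rate: float = 0.3) -> bool:
    """Reject a batch when too few candidates accomplish the task."""
    if not success_flags:
        return False
    return sum(success_flags) / len(success_flags) >= min_success_rate
```

A disagreement value above 0.2 on a shared batch would be the signal to revisit the rubric; a rejected batch goes back for regeneration rather than to annotators.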

Data collection and preference elicitation

Generate candidates: Use your current VLA model to generate 5–10 action trajectories per instruction. Generation is cheap compared with annotation (each candidate is a single model rollout); diversity comes from beam search or diverse beam search.

Annotator ranking: Have specialists watch each trajectory in video and rank them. This takes 5–10 minutes per instruction (one watch-through per trajectory, plus ranking decision).

Confidence scoring: Have annotators mark confidence: certain, fairly certain, unsure. Low-confidence rankings are down-weighted in reward model training.

Disagreement resolution: If two annotators rank the same trajectories differently, have them discuss and converge on a single ranking, or exclude that example from training.
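The confidence and disagreement steps can be combined when two annotators rank the same candidates. The following is only a sketch under loud assumptions: the numeric weights for "certain", "fairly certain", and "unsure" are invented for illustration, agreed pairs get the mean weight, and disputed pairs follow the more confident annotator with a reduced weight (or are dropped when confidence is equal).

```python
from itertools import combinations
from typing import List, Sequence, Tuple

# Hypothetical weights; the article only says low confidence is down-weighted.
CONFIDENCE_WEIGHT = {"certain": 1.0, "fairly certain": 0.7, "unsure": 0.3}


def weighted_pairs(
    rank_a: Sequence[str], conf_a: str,
    rank_b: Sequence[str], conf_b: str,
) -> List[Tuple[str, str, float]]:
    """Return (preferred_id, rejected_id, weight) triples from two rankings."""
    w_a, w_b = CONFIDENCE_WEIGHT[conf_a], CONFIDENCE_WEIGHT[conf_b]
    pos_b = {t: i for i, t in enumerate(rank_b)}
    out: List[Tuple[str, str, float]] = []
    # combinations() preserves rank_a order, so annotator A prefers x over y.
    for x, y in combinations(rank_a, 2):
        if pos_b[x] < pos_b[y]:        # annotator B agrees
            out.append((x, y, (w_a + w_b) / 2))
        elif w_a > w_b:                # disputed, A is more confident
            out.append((x, y, w_a - w_b))
        elif w_b > w_a:                # disputed, B is more confident
            out.append((y, x, w_b - w_a))
        # Disputed with equal confidence: exclude the pair from training.
    return out
```

In practice a discussion between the two annotators may still be the better resolution; the weighting above is only a fallback for pairs that never get reviewed.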

Reward model training on trajectory preferences

Unlike LLM RLHF, which trains a text-based reward model, VLA preference learning trains a video reward model:

Input: Video + language instruction

Output: Reward score (scalar) predicting trajectory quality

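The model learns to predict which trajectories humans prefer. This reward model is then used to fine-tune the VLA with a policy-gradient method. Alternatively, DPO (Direct Preference Optimisation) fine-tunes the policy on the preference pairs directly, without a separate reward model.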

Key difference from LLM RLHF: trajectory preferences are tied to visual observations and action sequences, not text semantics. The reward model must understand physics, collision geometry, and task structure — much harder than text quality scoring.
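The article does not name a training objective. One common choice, borrowed from LLM reward modelling, is the Bradley-Terry pairwise loss, which pushes the scalar reward of the preferred trajectory above that of the rejected one. The sketch below shows only the loss, in plain Python, with an optional per-pair weight for annotator confidence; the function names are hypothetical, and the reward values stand in for the output of whatever video model you train.

```python
import math
from typing import Iterable, Tuple


def preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Bradley-Terry loss: -log(sigmoid(r_preferred - r_rejected))."""
    margin = reward_preferred - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def weighted_preference_loss(pairs: Iterable[Tuple[float, float, float]]) -> float:
    """Weighted mean loss over (r_preferred, r_rejected, weight) triples."""
    pairs = list(pairs)
    total_weight = sum(w for _, _, w in pairs)
    if total_weight == 0:
        return 0.0
    return sum(w * preference_loss(rp, rr) for rp, rr, w in pairs) / total_weight
```

The loss is near zero when the model already scores the preferred trajectory well above the rejected one, and grows roughly linearly when it gets the order wrong.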

Scale and iteration

VLA RLHF programmes vary in scope from targeted experiments to comprehensive model refinement. Budget depends on programme scale, number of instruction types, and iteration cycles required.

Iteration is critical. Train a reward model on batch 1, use it to improve the VLA, generate new trajectories, rank them, retrain the reward model. This 3–4 cycle loop takes 6–9 months and is where most value comes from.

What this means for you

RLHF for embodied models is harder and more expensive than LLM RLHF, but it's essential for real-world performance. Gig-platform workers can't rank robot trajectories reliably. You need specialists who understand physics, robotics, and task semantics.

FAQ

Q: How many trajectory candidates should I generate per instruction? A: 5–10 is standard. Fewer than 5 limits the ranking decision; more than 10 adds annotation time without proportional value. Use diverse beam search to ensure variety across candidates.

Q: What if all candidate trajectories are bad? A: Reject the batch. Have the model regenerate with different parameters. A ranking between five failures teaches the reward model nothing useful. Enforce a minimum baseline: at least 30% of candidates must succeed at the task.

Q: How do I train annotators to rate "naturalness" consistently? A: Naturalness is hard. Provide video examples of what "natural" and "jerky" look like. Have annotators practise on 20 example trajectories and compare against reference rankings. Agreement <80% on these practice trajectories means the rubric needs refinement.

Q: Should I weight preference disagreement by annotator confidence? A: Yes. If annotator A is "certain" and annotator B is "unsure," trust A more. Weight the example accordingly when training the reward model. This improves convergence.

Q: Can I use LLM RLHF reward models, or do I need a custom model? A: You need a custom video reward model. LLM reward models score text; you're scoring video + action sequences. The two are structurally different. Train a separate model on trajectory preferences.

Q: How many preference examples do I need before the reward model is useful? A: 500–1,000 ranked instructions (5–10 candidate trajectories each) give you 2,500–10,000 ranked trajectories. A full ranking of k candidates expands into k(k−1)/2 pairwise preferences, so that is roughly 5,000–45,000 pairs. Train on most of the data and evaluate on a held-out set. If validation loss plateaus and model agreement on test trajectories is >80%, you have enough data.

Q: What if my robot simulator doesn't match the real robot? A: Rank synthetic trajectories conservatively. Mark lower confidence on trajectories that rely on simulation-only dynamics (e.g., contact timing). Real-world validation is essential — deploy early and measure gap.
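The 80% agreement check mentioned in the FAQ needs a concrete definition. One plausible reading, sketched below as an assumption rather than the article's stated metric, is the fraction of held-out candidate pairs where the reward model scores the human-preferred trajectory higher. The function name and the score dictionary are hypothetical.

```python
from itertools import combinations
from typing import Dict, Sequence


def pairwise_agreement(scores: Dict[str, float], human_ranking: Sequence[str]) -> float:
    """Fraction of pairs where the reward model matches the human ordering.

    scores maps trajectory id to predicted reward; human_ranking lists
    trajectory ids from best to worst.
    """
    pairs = list(combinations(human_ranking, 2))
    if not pairs:
        return 0.0
    agree = sum(1 for better, worse in pairs if scores[better] > scores[worse])
    return agree / len(pairs)
```

Averaging this value over all held-out instructions gives a single number to compare against the 80% threshold.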

Plan for 6–9 month RLHF programmes. Budget for specialist annotators throughout. Expect iteration — your rubric will evolve as you see edge cases. And remember: preference data quality directly impacts model quality in ways that standard supervised learning doesn't.
