RLHF for vision-language-action models: how preference data differs from LLM RLHF

VLA models require different preference data than LLMs. Learn trajectory ranking, action alignment, and why LLM RLHF patterns don't translate.
Author: Mark Pinnes · 19 April 2026 · 9 min read
Reinforcement Learning from Human Feedback (RLHF) helped align models such as GPT-4 and Llama. But RLHF for vision-language-action (VLA) models is a different beast: these systems see an image, read a language instruction, and output robot actions. The preference data structure, annotation rubrics, and evaluation signals don't transfer from LLM RLHF.

Why LLM RLHF patterns break down for VLA

In LLM RLHF, annotators typically compare two text completions and pick the better one, folding quality, safety, and helpfulness into a single overall judgment. Comparing "I don't know" with "Here's a detailed explanation" is usually straightforward: for most prompts, one is clearly better.

In VLA, you're comparing two action trajectories (sequences of robot movements). Trajectory A reaches the mug but knocks over a cup. Trajectory B takes longer but is collision-free. Which is "better"? The answer depends on your downstream task, your safety constraints, and your objective function. These can stay implicit in LLM ranking, but in robotics they have to be spelled out.

Preference data for VLA: trajectory ranking

Rather than pairwise text comparison, VLA RLHF uses trajectory ranking. Here's the structure:

Input: Robot egocentric video + language instruction ("Pick up the mug and place it on the shelf")

Candidate trajectories: 5–10 action sequences generated by the VLA model or a behaviour-cloning baseline

Ranking criteria:

  • Task success (does the trajectory accomplish the goal?)
  • Collision avoidance (does it hit obstacles?)
  • Efficiency (is it reasonably fast, or does it waste time?)
  • Safety margins (does it maintain distance from humans, delicate objects?)
  • Naturalness (does the movement look human-like, or are there jerky artifacts?)

Output: Rank these 5–10 trajectories from best to worst.
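One way to picture this structure is as a small record per ranking task. The sketch below is illustrative only: the class names, field names, and the `validate` helper are assumptions made for this example, not part of any particular annotation tool.

```python
from dataclasses import dataclass, field
from typing import List

# The five ranking criteria listed above (identifier names are hypothetical).
CRITERIA = (
    "task_success",
    "collision_avoidance",
    "efficiency",
    "safety_margins",
    "naturalness",
)


@dataclass
class Trajectory:
    trajectory_id: str
    actions: List[List[float]]  # one action vector per timestep
    source: str                 # e.g. "vla" or "bc_baseline"


@dataclass
class RankingTask:
    instruction: str                    # language instruction shown to the annotator
    video_path: str                     # robot egocentric video
    candidates: List[Trajectory]
    ranking: List[str] = field(default_factory=list)  # trajectory ids, best to worst

    def validate(self) -> None:
        """Check the 5-10 candidate range and that every candidate is ranked once."""
        if not 5 <= len(self.candidates) <= 10:
            raise ValueError("expected 5-10 candidate trajectories")
        ids = [c.trajectory_id for c in self.candidates]
        if sorted(ids) != sorted(self.ranking):
            raise ValueError("ranking must contain each candidate id exactly once")
```

Keeping the ranking as an ordered list of ids (rather than a score per trajectory) mirrors the annotator's actual output: an ordering from best to worst.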

This is much more complex than LLM ranking because you're evaluating multi-dimensional properties of a continuous action sequence, not a discrete text output.

Annotation rubric design for trajectory RLHF

You need explicit, testable criteria:

Success: Did the robot reach the target state? For "pick up mug," the gripper must be closed around the mug at the end of the trajectory. For "place on shelf," the mug must be resting on the shelf surface, not floating or fallen.

Collisions: Count collisions with the scene. Zero collisions = highest score. One minor collision = 0.5 deduction. Major collision (robot toppling) = automatic lowest rank.

Efficiency: Measure trajectory duration. Rank by duration, but weight collision and success first — an efficient failure is worse than a slow success.

Safety margins: For each frame, compute the distance from the end-effector to the nearest obstacle. Minimum distance below 5 cm = 0.5 deduction. Below 2 cm = disqualify.

Naturalness: This is the hardest criterion. Experienced annotators rate smoothness, adherence to joint acceleration limits, and grip stability on a 1–10 scale. This requires domain expertise; crowd workers can't rate this reliably.
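The rubric above can be sketched as a sort key. This is a minimal illustration under several assumptions that the rubric does not state: the 0.5 deduction is applied per minor collision, a major collision and a clearance below 2 cm both send a trajectory to the bottom of the ranking, naturalness is used only as a final tie-breaker, and positions are expressed in centimetres. All names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # assumed unit: centimetres


def min_clearance_cm(ee_positions: Sequence[Point], obstacles: Sequence[Point]) -> float:
    """Smallest end-effector-to-obstacle distance over all frames."""
    return min(math.dist(p, o) for p in ee_positions for o in obstacles)


@dataclass
class TrajectoryMetrics:
    trajectory_id: str
    success: bool
    minor_collisions: int
    major_collision: bool
    duration_s: float
    min_clearance_cm: float
    naturalness: int  # 1-10 rating from an experienced annotator


def deductions(m: TrajectoryMetrics) -> float:
    total = 0.5 * m.minor_collisions  # assumption: 0.5 per minor collision
    if m.min_clearance_cm < 5.0:
        total += 0.5
    return total


def is_bottom_ranked(m: TrajectoryMetrics) -> bool:
    # Assumption: disqualified trajectories are kept but ranked last.
    return m.major_collision or m.min_clearance_cm < 2.0


def sort_key(m: TrajectoryMetrics):
    # Lexicographic order: disqualification, then success, then deductions,
    # then duration, with naturalness as a tie-breaker (higher is better).
    return (is_bottom_ranked(m), not m.success, deductions(m), m.duration_s, -m.naturalness)


def rank(metrics: List[TrajectoryMetrics]) -> List[str]:
    return [m.trajectory_id for m in sorted(metrics, key=sort_key)]
```

The lexicographic key encodes the rule that an efficient failure ranks below a slow success: duration is only compared once success and deductions are equal.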

Common pitfalls in trajectory ranking

Conflicting preferences: Annotators may disagree on trade-offs. Trajectory A is fast but risky. Trajectory B is safe but slow. Settle the trade-off in the rubric so that annotators rank such cases the same way; if annotators still disagree on more than 20% of comparisons, refine the rubric.

Silent failures: A trajectory looks reasonable but the gripper state at frame 180 is wrong — the grasp is unstable but the video doesn't show it failing until frame 500. Annotators must watch full trajectories and flag these latent failures.

Sim-to-real blind spots: Trajectories that work in simulation often fail on real robots due to sensor noise or control delays. Rank synthetic trajectories conservatively if you don't have real-world validation data.

Inconsistent trajectory quality: If all candidate trajectories are terrible, annotators can't meaningfully rank them. Enforce minimum baseline quality: if all trajectories fail the task, reject the batch and regenerate candidates.
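Two of these checks are easy to automate. The sketch below assumes that "disagreement" is measured as the share of trajectory pairs that two annotators order differently (a choice made for this example; the rubric may define it otherwise), and that a batch is rejected when no candidate succeeds. Function names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def pairwise_disagreement(rank_a: Sequence[str], rank_b: Sequence[str]) -> float:
    """Fraction of trajectory pairs that two annotators order differently."""
    if sorted(rank_a) != sorted(rank_b):
        raise ValueError("both rankings must cover the same trajectories")
    if len(rank_a) < 2:
        raise ValueError("need at least two trajectories to compare")
    pos_a = {t: i for i, t in enumerate(rank_a)}
    pos_b = {t: i for i, t in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    discordant = sum(
        1 for x, y in pairs if (pos_a[x] < pos_a[y]) != (pos_b[x] < pos_b[y])
    )
    return discordant / len(pairs)


def batch_is_rankable(success_flags: List[bool]) -> bool:
    """Reject the batch (return False) when every candidate fails the task."""
    return any(success_flags)


def needs_rubric_review(rank_a: Sequence[str], rank_b: Sequence[str],
                        threshold: float = 0.20) -> bool:
    return pairwise_disagreement(rank_a, rank_b) > threshold
```

For example, two annotators who swap only one adjacent pair out of four trajectories disagree on one of six pairs, which stays under the 20% threshold.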

Data collection and preference elicitation

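Generate candidates: Use your current VLA model to generate 5–10 action trajectories per instruction. This is cheap compared with annotation, since it only takes a handful of model rollouts. Diversity comes from sampling (for example, at a higher temperature) or from diverse beam search; plain beam search tends to return near-duplicate sequences.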

Annotator ranking: Have specialists watch each trajectory in video and rank them. This takes 5–10 minutes per instruction (one watch-through per trajectory, plus ranking decision).

Confidence scoring: Have annotators mark confidence: certain, fairly certain, unsure. Low-confidence rankings are down-weighted in reward model training.

Disagreement resolution: If two annotators rank the same trajectories differently, have them discuss and converge on a single ranking, or exclude that example from training.
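A ranking with a confidence label can be unrolled into weighted pairwise preferences, which is the form most reward-model training code consumes. The sketch below is an assumption-laden illustration: the numeric weights for "certain", "fairly certain", and "unsure" are placeholder values chosen for this example, not recommended settings.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

# Hypothetical weights; tune these for your own programme.
CONFIDENCE_WEIGHT = {"certain": 1.0, "fairly certain": 0.6, "unsure": 0.3}


def ranking_to_pairs(ranking: Sequence[str], confidence: str) -> List[Tuple[str, str, float]]:
    """Turn a best-to-worst ranking into (preferred, rejected, weight) triples.

    combinations() preserves input order, so the first element of each pair
    is always the trajectory the annotator ranked higher.
    """
    weight = CONFIDENCE_WEIGHT[confidence]
    return [(winner, loser, weight) for winner, loser in combinations(ranking, 2)]
```

A ranking of n trajectories yields n(n-1)/2 pairs, so one 5-trajectory ranking gives 10 training comparisons.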

Reward model training on trajectory preferences

Unlike LLM RLHF, which trains a text-based reward model, VLA preference learning trains a video reward model:

Input: Video + language instruction

Output: Reward score (scalar) predicting trajectory quality

The model learns to predict which trajectories humans prefer. It is then used to fine-tune the VLA with a policy-gradient method. Alternatively, DPO (Direct Preference Optimisation) fine-tunes the VLA directly on the preference pairs, without a separate reward model.

Key difference from LLM RLHF: trajectory preferences are tied to visual observations and action sequences, not text semantics. The reward model must understand physics, collision geometry, and task structure — much harder than text quality scoring.
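The article does not specify a training objective. A common choice carried over from LLM RLHF is a pairwise (Bradley-Terry style) loss, which pushes the reward of the preferred trajectory above the reward of the rejected one. The sketch below shows only that loss on scalar rewards, in plain Python, with an optional per-pair weight for annotator confidence; the video-and-language network that produces the rewards is out of scope here, and all function names are hypothetical.

```python
import math
from typing import Iterable, Tuple


def preference_loss(reward_preferred: float, reward_rejected: float,
                    weight: float = 1.0) -> float:
    """Weighted -log(sigmoid(r_preferred - r_rejected)), computed stably."""
    margin = reward_preferred - reward_rejected
    if margin >= 0:
        loss = math.log1p(math.exp(-margin))
    else:
        # Same quantity rewritten to avoid overflow for large negative margins.
        loss = -margin + math.log1p(math.exp(margin))
    return weight * loss


def batch_loss(pairs: Iterable[Tuple[float, float, float]]) -> float:
    """Weighted mean loss over (reward_preferred, reward_rejected, weight) triples."""
    total, total_weight = 0.0, 0.0
    for r_pref, r_rej, w in pairs:
        total += preference_loss(r_pref, r_rej, w)
        total_weight += w
    if total_weight == 0:
        raise ValueError("no weighted pairs in batch")
    return total / total_weight
```

When the two rewards are equal the loss is log 2, and it shrinks toward zero as the preferred trajectory's reward pulls ahead.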

Scale and iteration

VLA RLHF programmes vary in scope from targeted experiments to comprehensive model refinement. Budget depends on programme scale, number of instruction types, and iteration cycles required.

Iteration is critical. Train a reward model on batch 1, use it to improve the VLA, generate new trajectories, rank them, retrain the reward model. This 3–4 cycle loop takes 6–9 months and is where most value comes from.

What this means for you

RLHF for embodied models is harder and more expensive than LLM RLHF, but it's essential for real-world performance. Gig-platform workers can't rank robot trajectories reliably. You need specialists who understand physics, robotics, and task semantics.

Plan for 6–9 month RLHF programmes. Budget for specialist annotators throughout. Expect iteration: your rubric will evolve as you see edge cases. And remember: preference data quality affects model quality more directly than label quality does in standard supervised learning.
