Playbook · Robotics
The Egocentric Annotation Playbook.
Twelve patterns from our humanoid-robotics annotation programme. Written for ML teams scoping egocentric video at training cadence, and the operators who have to staff the team doing it.
40-page illustrated PDF
Annotator rubrics, schema templates, QA checklist
Worked examples from programmes shipped with customers including Machani Robotics
Pattern 01 is below, unlocked. The other eleven are in the PDF.
Contents
Twelve patterns.
Each pattern is a rubric, a decision tree, or a failure-mode analysis. Written by our senior reviewers from programmes that actually shipped, not assembled from generic annotation-handbook advice.
- Pattern 01
Schema design for first-person action · Unlocked below
Hierarchical action taxonomies that hold up across kitchens, care environments, and workshop spaces. How to avoid the taxonomy debt that forces rewrites in month six.
- Pattern 02
Rubrics for object-in-hand versus out-of-hand
The single largest disagreement driver in egocentric annotation. A decision tree that cuts inter-annotator variance by half.
- Pattern 03
Partial grasps and re-grasps
Temporal boundary calls at grasp / release / adjustment. How we label them so the downstream action-recognition model is not confused by the continuous contact sequence.
- Pattern 04
Object hand-off labelling
The bidirectional giver/receiver problem. Person-to-person, person-to-robot, and robot-to-robot hand-offs need distinct schemas; here is the one we use.
- Pattern 05
Navigable-space parsing in clutter
Ground-plane segmentation that ignores transient clutter but respects permanent obstacles. Rules of thumb for edge cases.
- Pattern 06
Affordance mapping
Beyond object class: which surfaces support, which handles afford pull, which edges are safe contact. The labels VLA models actually learn from.
- Pattern 07
Temporal segmentation for action boundaries
When does 'reach' end and 'grasp' begin? A 30-frame rule that reduces ambiguity and improves action-recognition F1.
- Pattern 08
Multi-person egocentric disambiguation
When the robot's POV catches multiple humans interacting. How to label the subject of attention without losing context on the others.
- Pattern 09
Occlusion handling in first-person
The hand always occludes the object being manipulated. Pattern for consistent occluded-object labelling and downstream model robustness.
- Pattern 10
Calibration drift monitoring across training runs
The schema evolves; the annotator pool evolves. How to monitor for silent rubric drift across quarters without blocking throughput.
- Pattern 11
Safety-critical scene flagging
Annotators as the first line of safety review. Flags for hot surfaces, sharp edges, human contact with the robot. Escalation tree into senior review.
- Pattern 12
Disagreement-aware sampling for ambiguous transitions
Where annotators disagree, models learn brittle boundaries. Sampling strategy that routes high-disagreement clips to domain-expert reviewers for schema-level resolution.
Pattern 01 · Unlocked preview
Schema design for first-person action.
Most egocentric annotation schemas begin life as a flat list of actions — pick up, put down, push, pull, reach, grasp, release — and fall apart inside three months. The failure mode is always the same: real first-person video contains actions that compose, overlap, and interrupt each other in ways the flat list cannot express. Once annotators start stacking tags or inventing new ones inline, you have lost calibration and the training signal degrades.
Our schema is hierarchical. The top layer has five action families: contact, manipulation, locomotion, perception, idle. Beneath each family sits a closed list of primitives, and beneath those, optional modifiers. The primitives do not compose: at any moment a frame belongs to exactly one primitive. Modifiers, however, can stack. Manipulation / rotate-in-hand / one-handed / on-support is a valid four-token label. Manipulation / grasp / push is not, because grasp and push are both primitives; it would have to be split into two temporal segments.
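A minimal sketch of how the family / primitive / modifier rule could be checked automatically. Only the five family names and the tokens quoted above come from the schema described here; the other vocabulary entries, the placement of grasp and push under manipulation, and the function name parse_label are assumptions made for illustration.

```python
from typing import Tuple

# Hypothetical vocabulary. Only the five family names and the tokens quoted in
# the text (rotate-in-hand, grasp, push, one-handed, on-support) come from the
# playbook; the remaining entries are placeholders.
PRIMITIVES = {
    "contact": {"touch"},
    "manipulation": {"rotate-in-hand", "grasp", "push"},
    "locomotion": {"walk"},
    "perception": {"look-at"},
    "idle": {"idle"},
}
MODIFIERS = {"one-handed", "two-handed", "on-support"}
ALL_PRIMITIVES = set().union(*PRIMITIVES.values())


def parse_label(label: str) -> Tuple[str, str, Tuple[str, ...]]:
    """Validate 'family / primitive / modifier ...' and return its parts."""
    tokens = [t.strip().lower() for t in label.split("/")]
    if len(tokens) < 2:
        raise ValueError("a label needs at least a family and a primitive")
    family, primitive, *modifiers = tokens
    if family not in PRIMITIVES:
        raise ValueError(f"unknown family: {family}")
    if primitive not in PRIMITIVES[family]:
        raise ValueError(f"{primitive} is not a primitive of {family}")
    for token in modifiers:
        # A second primitive in the modifier slots is the stacking case
        # the schema bans; the annotator should cut a new segment instead.
        if token in ALL_PRIMITIVES:
            raise ValueError(f"primitive stacking: {token}; split into two segments")
        if token not in MODIFIERS:
            raise ValueError(f"unknown modifier: {token}")
    return family, primitive, tuple(modifiers)
```

Under these assumptions, parse_label("Manipulation / rotate-in-hand / one-handed / on-support") returns the parsed parts, while parse_label("Manipulation / grasp / push") raises a ValueError that tells the annotator to split the segment.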
This constraint is load-bearing. By banning primitive stacking we force annotators to draw temporal boundaries rather than paper over ambiguity with multiple tags. The model learns cleaner transition statistics. Inter-annotator agreement stabilises inside four weeks. And when the taxonomy evolves — which it will — the evolution happens at the modifier layer, not the primitive layer, so existing annotations stay valid.
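The one-primitive-per-frame constraint can also be checked on a clip's timeline. The sketch below assumes segments are stored as half-open frame ranges and that every frame must be covered (with idle filling any pauses); the Segment type and check_timeline function are hypothetical names, not part of the playbook.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: int       # first frame, inclusive
    end: int         # end frame, exclusive
    primitive: str


def check_timeline(segments: List[Segment], n_frames: int) -> List[str]:
    """Report frames that carry zero primitives (gap) or more than one (overlap)."""
    problems = []
    cursor = 0  # first frame not yet covered by any segment
    for seg in sorted(segments, key=lambda s: s.start):
        if seg.end <= seg.start:
            problems.append(f"empty segment: {seg.primitive}")
            continue
        if seg.start < cursor:
            problems.append(f"overlap: frames {seg.start}-{min(seg.end, cursor) - 1}")
        elif seg.start > cursor:
            problems.append(f"gap: frames {cursor}-{seg.start - 1}")
        cursor = max(cursor, seg.end)
    if cursor < n_frames:
        problems.append(f"gap: frames {cursor}-{n_frames - 1}")
    return problems
```

An empty result means every frame belongs to exactly one primitive. Any reported overlap marks a place where an annotator has effectively stacked primitives and a temporal boundary still needs to be drawn.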
Three heuristics for the primitive set itself:
- Primitives must be describable in under eight words. If the definition needs a paragraph, you are describing a composition. Split it.
- No primitive can depend on object class. “Pour-liquid” is a composition of manipulation / tilt-object plus an object-class modifier. Keeping action and object orthogonal is what makes the schema transferable across environments.
- Every primitive must have a reasonable idle counterpart. If there is no sensible “no-op equivalent,” the primitive is too specific.
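The three heuristics can be approximated as a lint pass over a draft primitive list. This is a rough sketch: the PrimitiveSpec fields, the OBJECT_TERMS vocabulary, and the example definitions are all hypothetical, and whether an idle counterpart is "reasonable" remains a reviewer's judgement that the code only records.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical object-class vocabulary used to spot action/object coupling.
OBJECT_TERMS = {"liquid", "cup", "knife", "door"}


@dataclass
class PrimitiveSpec:
    name: str
    definition: str
    idle_counterpart: Optional[str]  # filled in by a reviewer, or None


def lint(spec: PrimitiveSpec) -> List[str]:
    """Return the heuristics a draft primitive appears to violate."""
    issues = []
    words = spec.definition.split()
    # Heuristic 1: describable in under eight words.
    if len(words) >= 8:
        issues.append("definition is eight words or longer; likely a composition")
    # Heuristic 2: no dependence on object class, in the name or the definition.
    name_tokens = set(spec.name.lower().split("-"))
    definition_tokens = {w.strip(".,;").lower() for w in words}
    hits = (name_tokens | definition_tokens) & OBJECT_TERMS
    if hits:
        issues.append(f"depends on object class: {sorted(hits)}")
    # Heuristic 3: an idle counterpart must have been identified.
    if not spec.idle_counterpart:
        issues.append("no idle counterpart recorded")
    return issues
```

With these assumptions, a draft "pour-liquid" primitive with a long definition and no idle counterpart trips all three checks, while a short, object-neutral "tilt-object" entry passes.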
In the full PDF, Pattern 02 picks up from here with the rubric for disambiguating object-in-hand state, the single largest source of disagreement once the primitive set is stable. Patterns 03-12 cover the rest of the stack.