Annotation Timelines for Robotics at Scale

Speed and quality trade off. Eight weeks with minimal review produces noisy data. Twenty to thirty weeks with multi-pass QC and drift monitoring maintains 98%+ accuracy. The timeline depends on what you're willing to sacrifice in production.
Author · Mark Pinnes · 26 May 2026 · 10 min

The Timeline Question: Cost, Quality, and Urgency

You have 500,000 frames of egocentric video waiting to be annotated. Your deployment timeline is 6 months. You need a realistic answer to: "How long will this take?"

The answer depends on what you're optimising for. If you optimise for speed, you can annotate 500K frames in 8-10 weeks with a large team (50+ annotators) and minimal review. Quality will suffer; edge cases will be missed.

If you optimise for quality, you can annotate 500K frames with three-pass review, gold-set validation, and zero-drift monitoring in roughly 14-20 weeks; the lower end assumes a dedicated team working in parallel across geographies. Accuracy will be high and deployment risk low.

The choice between speed and quality determines your timeline and your downstream production risk.

Throughput Metrics: Baseline Expectations by Modality

Annotation speed depends entirely on the modality and complexity.

Simple 2D bounding boxes:

  • Throughput: 500–1,000 images/annotator/week (illustrative)
  • Ramp time: 1–2 weeks
  • 500K frames: 10–20 weeks (illustrative, speed-optimised)

Egocentric gripper-state sequences:

  • Throughput: 100–300 frames/annotator/week (illustrative)
  • Ramp time: 3–4 weeks (learning gripper mechanics and taxonomy)
  • 500K frames: 30–50 weeks (illustrative, quality-optimised with multi-pass review)

3D cuboid annotation (objects in 3D space):

  • Throughput: 200–400 frames/annotator/week (illustrative)
  • Ramp time: 2–3 weeks
  • 500K frames: 25–40 weeks (illustrative)

The difference is not arbitrary. Egocentric video requires temporal reasoning (frame N doesn't make sense without frames N-1 and N+1). 3D cuboids require spatial reasoning and multiple perspectives. Simple 2D boxes require neither.
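
The per-annotator figures above only become a calendar estimate once a team size is fixed, which the lists leave implicit. The following is a minimal sketch of that arithmetic, assuming steady-state output with no ramp or review overhead. The function names are hypothetical, and the team size of 50 is an assumption chosen because it reproduces the 10–20 week range quoted for 2D bounding boxes.

```python
import math


def weeks_to_annotate(total_frames, annotators, frames_per_annotator_week):
    """Steady-state weeks needed, ignoring ramp time and review overhead."""
    if annotators <= 0 or frames_per_annotator_week <= 0:
        raise ValueError("annotators and throughput must be positive")
    return total_frames / (annotators * frames_per_annotator_week)


def team_size_for_deadline(total_frames, weeks, frames_per_annotator_week):
    """Smallest whole team that finishes within `weeks` at the given rate."""
    if weeks <= 0 or frames_per_annotator_week <= 0:
        raise ValueError("weeks and throughput must be positive")
    return math.ceil(total_frames / (weeks * frames_per_annotator_week))


# 2D boxes at 500 and 1,000 images/annotator/week with an assumed team of 50.
print(weeks_to_annotate(500_000, 50, 500))    # 20.0
print(weeks_to_annotate(500_000, 50, 1_000))  # 10.0

# Gripper-state sequences at 200 frames/annotator/week, 40-week target.
print(team_size_for_deadline(500_000, 40, 200))  # 63
```

Running the same calculation for each modality before requesting quotes makes it easier to see which team size a quoted timeline implies.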

Key insight: Raw throughput is not the constraint; maintaining quality at scale is. A team can annotate 500K frames in 8 weeks if you allow quality to degrade. Sustaining 98%+ accuracy across all 500K frames takes 20-30 weeks.

The Ramp Timeline: Week-by-Week Breakdown

When a new team starts your project, there's a learning curve. This is not wasted time; it's necessary calibration.

Week 1: Onboarding and Setup

  • Review your taxonomy (gripper states, modality specifics, edge cases)
  • Each annotator labels 50-100 frames independently
  • Calibration against your gold set (how closely do they match?)
  • Team discussion of ambiguous cases
  • Deliverable: zero frames (this is training, not production work)
  • Success metric: all annotators achieve 85%+ agreement with gold set

Weeks 2-3: Ramp to Steady State

  • Begin production annotation (volume increases weekly)
  • Week 2: 5,000–10,000 frames/week (team-wide, illustrative)
  • Week 3: 15,000–25,000 frames/week (team-wide, illustrative)
  • Maintain multi-pass review (L1 primary, L2 consistency, L3 expert)
  • Success metric: inter-rater agreement κ ≥ 0.85 on randomly selected frames
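
The κ in the success metric above is commonly Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance. The article does not say which kappa variant is used, so treat this as a minimal sketch for the two-annotator case. The "open" and "closed" gripper labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same frames."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of frames where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected if each annotator labelled at random with
    # their own label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["open"] * 5 + ["closed"] * 5
annotator_2 = ["open"] * 4 + ["closed"] * 6  # disagrees on one frame

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))  # 0.8
print(kappa >= 0.85)    # False
```

In this example 90% raw agreement yields κ = 0.8, which would miss a 0.85 threshold. That gap is why raw agreement and κ should not be quoted interchangeably.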

Week 4: Full Capacity

  • Steady-state throughput achieved
  • 30,000–50,000 frames/week (team of 10–15 annotators, illustrative)
  • Three-pass review maintained without bottlenecks
  • Gold-set recalibration begins (monthly cycle starts)
  • Success metric: zero-drift monitoring active; no accuracy degradation over time
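
One way to make "no accuracy degradation over time" checkable is to compare each week's gold-set accuracy against the best value seen so far. This is a sketch of that idea only. The function name, the 0.002 tolerance, and the sample accuracies are assumptions, not IndiVillage's actual monitoring rule.

```python
def drift_weeks(weekly_accuracy, tolerance=0.002):
    """Return 1-based week numbers where accuracy fell more than
    `tolerance` below the best accuracy observed in earlier weeks."""
    flagged = []
    best = float("-inf")
    for week, accuracy in enumerate(weekly_accuracy, start=1):
        if accuracy < best - tolerance:
            flagged.append(week)
        best = max(best, accuracy)
    return flagged


# Hypothetical weekly gold-set accuracies.
history = [0.981, 0.983, 0.982, 0.975, 0.984]
print(drift_weeks(history))  # [4]
```

A small tolerance keeps ordinary sampling noise from triggering a recalibration every week. The right value depends on the size of the gold set.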

Timeline for 500K frames:

  • Weeks 1–3: ramp (20,000–35,000 frames complete, illustrative)
  • Weeks 4–19: steady state (remaining 465,000–480,000 frames at an illustrative 30,000/week)
  • Total: 19–20 weeks at quality-optimised speed (illustrative)
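
The ramp-plus-steady-state arithmetic can be sketched as a small function. The weekly ramp volumes passed in below are taken from the illustrative ranges in the week-by-week breakdown. The function name and the choice to round up to whole weeks are assumptions.

```python
def weeks_to_complete(total_frames, ramp_weekly_frames, steady_weekly_frames):
    """Total weeks to finish, given frames delivered in each ramp week
    and a constant steady-state rate afterwards."""
    if steady_weekly_frames <= 0:
        raise ValueError("steady-state rate must be positive")
    done = 0
    for week, frames in enumerate(ramp_weekly_frames, start=1):
        done += frames
        if done >= total_frames:
            return week  # finished during the ramp itself
    remaining = total_frames - done
    # Integer ceiling division: a partial week still occupies a calendar week.
    steady_weeks = -(-remaining // steady_weekly_frames)
    return len(ramp_weekly_frames) + steady_weeks


# Week 1 is training only (0 frames); weeks 2-3 use the upper ramp figures.
print(weeks_to_complete(500_000, [0, 10_000, 25_000], 30_000))  # 19
```

Changing the steady-state rate in the last argument shows how sensitive the total is to that single number.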

This assumes one dedicated team. If you're using multiple geographies (IndiVillage has 11 offices), you can parallelize:

  • Rotate batches across geographies
  • Maintain consistent gold set across all locations
  • Reduce single-location bottlenecks
  • Potential improvement: 20–30% acceleration (500K frames in 14–16 weeks, illustrative)

Scaling Factors: Team Size, Volume, Complexity

Factor 1: Team Size

  • 5 annotators: 10,000 frames/week steady state (illustrative) → 500K frames in 50 weeks (post-ramp)
  • 15 annotators: 30,000 frames/week steady state (illustrative) → 500K frames in about 17 weeks (post-ramp)
  • 50 annotators: 100,000 frames/week on paper (illustrative) → 500K frames in 5 weeks (but quality control becomes the bottleneck)

Larger teams do NOT scale linearly. Adding a 50th annotator is not as productive as adding a 5th because:

  • Multi-pass review bottlenecks (L2 and L3 review can't keep pace)
  • Coordination overhead (more people = more QA meetings, more drift detection)
  • Taxonomy communication (harder to maintain consistency across 50 people)

Optimal scaling: 10-20 annotators per major language/domain. Beyond that, add new teams (new geographies, new languages) rather than growing a single team.
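
The review bottleneck described above can be illustrated with a simple serial-pipeline model: if every frame must pass L1, L2, and L3, weekly output is capped by the slowest stage. This is a sketch under that assumption. The `Stage` class, the head counts, and the per-person rates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    people: int
    frames_per_person_week: int  # assumed rate, not a figure from the article

    @property
    def capacity(self) -> int:
        return self.people * self.frames_per_person_week


def pipeline_throughput(stages):
    """Return (frames_per_week, bottleneck_name) for a serial review pipeline."""
    bottleneck = min(stages, key=lambda stage: stage.capacity)
    return bottleneck.capacity, bottleneck.name


def build_team(l1_annotators):
    # Only the L1 head count varies; review staffing stays fixed.
    return [
        Stage("L1", l1_annotators, 3_000),
        Stage("L2", 3, 12_000),
        Stage("L3", 2, 20_000),
    ]


print(pipeline_throughput(build_team(10)))  # (30000, 'L1')
print(pipeline_throughput(build_team(30)))  # (36000, 'L2')
```

In this toy configuration, tripling the L1 team raises output by only 20% because L2 review becomes the limit.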

Factor 2: Volume Elasticity

500K frames is a single project. What if you need 2M frames over 12 months?

  • Linear ramp model: onboard teams sequentially as quarterly volume increases
  • Parallel model (preferred): onboard all teams at once, vary weekly volume based on project needs

IndiVillage's model for Taranis (4.5M+ images/year) is parallel: core team of 20 annotators with base allocation, elastic capacity to absorb 50% seasonal volume spikes (pre-season peaks in Northern hemisphere Feb-Mar, Southern Oct-Nov).

Factor 3: Modality Complexity

Gripper-state annotation is slower than bounding boxes. If your project mixes modalities:

  • 60% gripper-state sequences (slow: 150 frames/annotator/week)
  • 40% 3D cuboids (faster: 350 frames/annotator/week)
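  • Blended throughput: ~230 frames/annotator/week if the 60/40 split is by annotator time; closer to ~195 if it is by frame count, because the slower modality then takes a larger share of the hours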
  • 500K frames: 20-25 weeks
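
The blended figure depends on what the 60/40 split measures. A weighted arithmetic mean of the two rates gives ~230, which is correct when the percentages describe annotator time. When they describe the share of frames, the time per frame is what averages, so the harmonic form applies. The sketch below shows both; the function names are hypothetical.

```python
def blended_by_time(mix):
    """mix: list of (share_of_annotator_time, frames_per_annotator_week)."""
    return sum(share * rate for share, rate in mix)


def blended_by_frames(mix):
    """mix: list of (share_of_frames, frames_per_annotator_week)."""
    total_share = sum(share for share, _ in mix)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    # Annotator-weeks needed per frame, averaged over the frame mix.
    weeks_per_frame = sum(share / rate for share, rate in mix)
    return 1 / weeks_per_frame


mix = [(0.6, 150), (0.4, 350)]  # gripper-state sequences, 3D cuboids
print(round(blended_by_time(mix)))    # 230
print(round(blended_by_frames(mix)))  # 194
```

The gap of roughly 15% between the two readings is large enough to shift a project estimate by several weeks, so it is worth confirming which split a quote assumes.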

Communicate complexity upfront. A vendor quoting 8 weeks for mixed-modality 500K frames is either low-balling quality or misunderstanding the task.

Retention Advantage: Predictability as Timeline Benefit

This is the hidden advantage of high-retention vendors like IndiVillage (96% over 16 years).

High-churn scenario (50-70% annual turnover):

  • Month 1-3: team ramps to 80% productivity
  • Month 4-6: productive output; some team turnover begins
  • Month 7-9: 30% of team has turned over; new hires are ramping; productivity dips to 70%
  • Month 10-12: more turnover; another re-onboarding cycle; timeline slips

Low-churn scenario (IndiVillage: 96% retention):

  • Month 1-3: team ramps to 80% productivity
  • Month 4-12: same core team maintains productivity; no ramp restarts

For a 12-month project, the retention advantage is 4-6 weeks of timeline predictability. For long-term partnerships (Taranis 8+ years, Machani Robotics ongoing), the advantage compounds: year 2 is faster and higher-quality than year 1 because the team retains domain context.

Pilot to Production Pathway

Most projects follow a phased approach, not monolithic annotation:

Phase 1: Pilot (3-4 weeks)

  • 5,000 frames representative of your distribution
  • Small team (3-5 annotators)
  • All three review passes (L1, L2, L3)
  • Success criteria: 98%+ accuracy on gold set, zero taxonomy violations
  • Cost: £5-10K
  • Output: validated approach, team trained on your domain

Phase 2: Production Ramp (4-6 weeks)

  • 50,000-100,000 frames
  • Expand team (10-15 annotators)
  • Maintain three-pass review, add drift monitoring
  • Success criteria: zero-drift sustained, inter-rater agreement ≥ 0.85
  • Output: scaled delivery proof, team confidence high

Phase 3: Full Scale (ongoing)

  • 500,000 frames or more
  • Full team deployed across geographies
  • Steady-state throughput maintained
  • Success criteria: zero-drift sustained, SLA compliance 99%+

Total timeline: pilot (4w) + ramp (6w) + scale (15-20w) = 25-30 weeks for 500K frames via phased approach.

This is longer than monolithic annotation (20 weeks direct) but lower risk because pilots catch problems early. A failed pilot costs £5-10K (sunk cost). A failed full-scale project costs £50-100K+ (retraining, rework, timeline slips).

Common Timeline Mistakes

Mistake 1: Confusing ramp time with steady-state time. "We need 500K frames in 8 weeks. At 100K/week steady state, that's 5 weeks." This ignores the 3-4 week ramp. Realistic timeline is 8-9 weeks, not 5.

Mistake 2: Assuming linear scalability. "If 10 annotators produce 300K frames in 10 weeks, then 30 annotators produce 300K frames in 3.3 weeks." False. Multi-pass review becomes the bottleneck at larger scales. More realistic: 30 annotators produce 300K frames in 4-5 weeks because L2/L3 reviews can't scale as fast as L1 annotation.

Mistake 3: Underestimating edge-case discovery time. "We annotated the frames; we're done." If you discover edge cases in QA (L3 expert review), those frames need re-annotation. A 5% re-annotation rate adds 1-2 weeks to the project. Budget for it upfront.

Mistake 4: Not accounting for taxonomy evolution. "The taxonomy is locked." Taxonomy changes mid-project if edge cases reveal gaps. Version 1.0 → 1.1 (new state added, historical re-annotation required). Budget 10-15% rework time for taxonomy evolution.

Mistake 5: Confusing calendar time with work time. "We have 6 months, so we have plenty of time." Calendar time includes holidays, team availability constraints, and other projects. A 5-week project might take 6-7 calendar weeks. A 20-week project might take 24-26 calendar weeks. Add a 20-30% calendar buffer.
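
The rework and calendar allowances from Mistakes 4 and 5 can be combined into one planning figure. The article gives the two percentages separately; applying them multiplicatively, as below, is an assumption, and the function name is hypothetical.

```python
def planned_calendar_weeks(work_weeks, rework_fraction=0.10, calendar_buffer=0.20):
    """Inflate pure work weeks by a rework allowance (taxonomy changes,
    re-annotation) and then by a calendar buffer (holidays, availability)."""
    if work_weeks < 0 or rework_fraction < 0 or calendar_buffer < 0:
        raise ValueError("inputs must be non-negative")
    return work_weeks * (1 + rework_fraction) * (1 + calendar_buffer)


# 20 work weeks with the low and high ends of both allowances.
low = planned_calendar_weeks(20, rework_fraction=0.10, calendar_buffer=0.20)
high = planned_calendar_weeks(20, rework_fraction=0.15, calendar_buffer=0.30)
print(round(low, 1), round(high, 1))  # 26.4 29.9
```

Stacking both allowances pushes a nominal 20-week project towards 26-30 calendar weeks, which is the range to compare against a deployment deadline.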

SLA Reality and Predictability

A good vendor will commit to throughput and accuracy SLAs:

Example SLA:

  • Throughput: 40,000 frames/week, ±10%
  • Accuracy: 98%+ sustained (measured weekly against gold set)
  • Turnaround: 2-4 weeks from receipt to initial deliverable
  • Escalation: if accuracy drops below 97%, pause and re-calibrate (adds 1-2 weeks)

Why these numbers matter:

  • 40,000/week means you can reliably plan for 500K frames in 12-13 weeks (post-ramp)
  • 98%+ accuracy means your model training is not corrupted by systematic errors
  • 2-4 week turnaround means you can iterate on taxonomy changes without multi-month delays
  • Escalation clause means quality is not sacrificed for timeline
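
The example SLA can be turned into a weekly check. The thresholds below are the ones listed in the example SLA. The function name, the three status strings, and the choice to treat only under-delivery as a throughput miss are assumptions.

```python
def check_week(frames_delivered, accuracy,
               target_frames=40_000, tolerance=0.10,
               accuracy_target=0.98, escalation_floor=0.97):
    """Classify one week of delivery against the example SLA.

    Returns "escalate" when accuracy is below the escalation floor,
    "warn" when accuracy or throughput misses its target, else "ok".
    """
    if accuracy < escalation_floor:
        return "escalate"  # pause and re-calibrate
    minimum_frames = target_frames * (1 - tolerance)
    if accuracy < accuracy_target or frames_delivered < minimum_frames:
        return "warn"
    return "ok"


print(check_week(40_000, 0.985))  # ok
print(check_week(35_000, 0.985))  # warn (below the 36,000-frame floor)
print(check_week(40_000, 0.975))  # warn (between 97% and 98%)
print(check_week(40_000, 0.965))  # escalate
```

Checking accuracy before throughput mirrors the escalation clause: a quality breach pauses work regardless of how many frames were delivered.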

IndiVillage's SLA on Machani Robotics is zero-drift sustained (accuracy improvement or flat, never degradation) over 18 months. This is the gold standard. Not all vendors can commit to it, but this is what you should target.

The Stakes: Timeline vs. Deployment Risk

A 500K-frame project can be annotated in 8 weeks with speed optimisation and degraded QA. Or in 20 weeks with quality-optimised processes.

The 12-week difference has a cost:

  • Speed path (8 weeks, degraded QA): Lower cost per frame, faster to market, higher production risk (edge cases missed, accuracy degradation in deployment)
  • Quality path (20 weeks, rigorous QA): Higher cost per frame, slower to market, lower production risk (edge cases caught, accuracy sustained)

Which matters more depends on your deployment criticality and timeline constraints. For safety-critical robotics (collaborative robots, surgical applications), the quality path is non-negotiable. For non-critical applications with flexible timelines, the speed path might be acceptable.

The key is making this trade-off consciously, not accidentally defaulting to fast vendors because they quote shorter timelines without explaining quality implications.


Frequently Asked Questions

Q: Can you parallelize annotation across geographies to speed it up?
A: Yes, with coordination. Maintain a single gold set across all locations. Batch work rotates across geographies (1,000 frames per location per week). Quality is maintained through a consistent taxonomy and regular syncs. Potential acceleration: 20-30%. Coordination overhead: two to three sync meetings per week.

Q: What's the fastest you can realistically achieve for 500K frames?
A: 10-12 weeks post-ramp with a 30-50 person distributed team, if you're willing to accept slightly degraded edge-case coverage (97% accuracy instead of 98.5%). The quality-optimised path is 20-25 weeks.

Q: If I need 500K frames urgently, should I start with a cheaper vendor?
A: No. A cheaper vendor that delivers corrupted labels forces you into retraining cycles (8-12 weeks). A more expensive vendor that delivers clean labels means 12 weeks to deployment, then stable production. Total timeline is often similar, but quality risk is higher with cheap vendors.

Q: Can annotation teams overlap across projects?
A: Not for robotics. Context-switching between different robot domains (gripper types, gripper states, task types) introduces errors. Dedicated teams per project are the standard for high-quality work.

Q: How much does timeline accuracy matter in practice?
A: A great deal. A 6-month deployment target with a 5-month annotation commitment is risky. A 6-month deployment target with a 4-month annotation commitment is comfortable (it leaves 1-2 months for model training and testing). Overcommitting on the annotation timeline leads to project delays.


JSON-LD Schema

{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How Long Does It Take to Annotate Robot Training Data at Scale?",
  "description": "Realistic timelines for egocentric video annotation, ramp periods, and throughput expectations across team sizes and modality complexity.",
  "author": {
    "@type": "Organization",
    "name": "IndiVillage"
  },
  "datePublished": "2026-05-26"
}