How to Establish Quality Assurance in White-Label Annotation

You cannot assume quality. You have to measure it. White-label partnerships live or die on QC discipline. Spot-check 5-10% of work, track inter-rater agreement, measure defect rates, run monthly recalibrations, and escalate when signals drift. The governance burden falls on you.

Author · Mark Pinnes

26 May 2026

14 min

QA lead assessing white-label partner annotation output against scorecard.

IndiVillage Operating Centre · Bengaluru

A white-label partner delivers images with labels. The labels look reasonable. Accuracy seems high. A month later, you build a model and discover that the annotations were systematically wrong in ways that spot-check samples did not catch. The model fails. The customer relationship cracks. The partner says "we annotated to your spec." You say "your annotations are wrong." Nobody is wrong. You are both missing a QC framework.

This is the most common white-label failure mode. Companies sign contracts, assume quality, and discover downstream that the assumption was wrong. Quality is not implicit. It has to be explicit, measured, and enforced.

The governance burden falls on you, not the partner. You define what "correct" is. You measure whether the partner is delivering it. You escalate if they drift. The partner executes. You verify.

The three levels of quality measurement in white-label partnerships

QC in white-label annotation has three distinct levels, and all three are necessary. Skipping any one creates risk.

Level 1: Spot-check sampling (ongoing, weekly)

You inspect a random sample of completed work. Sample size: 5% of deliverables, minimum 50 images per check.

The process:

Partner delivers 1,000 annotated images
You randomly select 50 images
You re-annotate them independently (or have a second expert review them)
You compare your annotations to the partner's
You measure agreement using Cohen's kappa or equivalent (not raw percentage, because kappa controls for chance agreement)
You calculate a defect rate (images where annotation quality is below acceptable)
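
The selection step above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed tool: the function name, the `img_0000`-style IDs, and the seed are hypothetical, and the 5% rate with a 50-image floor simply mirrors the figures in this section.

```python
import math
import random

def draw_spot_check_sample(image_ids, rate_percent=5, minimum=50, seed=None):
    """Return a uniformly random spot-check sample of image IDs."""
    ids = list(image_ids)
    # 5% of the batch, never fewer than the minimum, never more than the batch itself.
    target = max(minimum, math.ceil(len(ids) * rate_percent / 100))
    size = min(len(ids), target)
    # A dedicated generator keeps the draw reproducible for your own audit trail.
    rng = random.Random(seed)
    return rng.sample(ids, size)

# Hypothetical batch of 1,000 delivered images.
batch = [f"img_{i:04d}" for i in range(1000)]
sample = draw_spot_check_sample(batch, seed=7)
print(len(sample))
```

If you fix a seed for reproducibility, keep it on your side. The point of random selection is that the partner cannot predict which images will be checked.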

The metrics:

Inter-rater agreement: target 95% or higher for commodity work, 90%+ for complex work
Defect rate: track how many images have annotation errors that would affect model performance
Defect types: categorise errors (missed object, wrong classification, wrong boundary, etc.)
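
As a rough sketch of how these three metrics could be computed from one spot-check, assuming each image carries a single class label (bounding boxes or segmentation masks would need an overlap-based comparison instead). The function names and toy data are illustrative only.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who each assigned one label per image."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same class independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical class throughout
    return (observed - expected) / (1.0 - expected)

def defect_summary(review_notes):
    """review_notes maps image ID -> defect type, or None if the image is acceptable."""
    defects = Counter(d for d in review_notes.values() if d is not None)
    rate = sum(defects.values()) / len(review_notes)
    return rate, defects

# Toy data for illustration only.
yours   = ["weed", "weed", "crop", "crop", "weed", "crop", "crop", "crop", "weed", "crop"]
partner = ["weed", "weed", "crop", "crop", "crop", "crop", "crop", "crop", "weed", "crop"]
kappa = cohens_kappa(yours, partner)

notes = {f"img_{i}": None for i in range(10)}
notes["img_4"] = "wrong classification"
notes["img_7"] = "wrong boundary"
defect_rate, defect_types = defect_summary(notes)

print(f"kappa={kappa:.2f} defect_rate={defect_rate:.0%} types={dict(defect_types)}")
```

In this toy sample the two label sets match on 9 of 10 images, yet kappa is noticeably lower than 0.9 because some of that agreement would occur by chance.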

Spot-check sampling is your early-warning system. It is fast enough to run weekly without heavy overhead, and it catches systematic drift before it spreads across thousands of images.

Common mistake: Spot-checking only "easy" or "clear" images. If you cherry-pick your sample, you will get false confidence in quality. Use random sampling. Some of your sample will be ambiguous. That is the point: you want to find edge cases and measure whether the partner is handling them consistently.

Level 2: Agreement tracking across the partner's team

If the partner has 30 annotators working in parallel, they are likely to apply different standards. One annotator is conservative (marks fewer objects). Another is liberal (marks more). An agronomist marks disease; a technician misses it. These differences drift silently unless you measure them.

The process:

Have the partner assign 500-1,000 images to be annotated by at least two independent annotators
Calculate agreement between the two annotations on the same image
Track agreement by annotator pair, by annotator, by geography (if multi-region)
Flag if agreement on a specific annotator pair drops below 85%
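
One way to sketch the pair-level tracking above, assuming one class label per image and raw percentage agreement per pair (the 85% threshold is the one stated in this section). Annotator names and labels are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def pairwise_agreement(labels_by_annotator):
    """Map each annotator pair to their raw agreement on the images both labelled."""
    results = {}
    for a, b in combinations(sorted(labels_by_annotator), 2):
        shared = labels_by_annotator[a].keys() & labels_by_annotator[b].keys()
        if not shared:
            continue  # this pair has no overlapping images to compare
        matches = sum(labels_by_annotator[a][i] == labels_by_annotator[b][i] for i in shared)
        results[(a, b)] = matches / len(shared)
    return results

def flag_pairs(agreement, threshold=0.85):
    """Return the pairs whose agreement falls below the threshold."""
    return sorted(pair for pair, rate in agreement.items() if rate < threshold)

# Toy overlap set: 20 images, three hypothetical annotators.
base = ["weed"] * 10 + ["crop"] * 10
labels = {
    "ann_1": dict(enumerate(base)),
    "ann_2": dict(enumerate(["crop"] + base[1:])),      # differs from base on 1 image
    "ann_3": dict(enumerate(["crop"] * 5 + base[5:])),  # differs from base on 5 images
}

agreement = pairwise_agreement(labels)
flagged = flag_pairs(agreement)
# Count how often each annotator appears in a flagged pair to spot the outlier.
outliers = Counter(name for pair in flagged for name in pair)
print(flagged, outliers.most_common(1))
```

Here both flagged pairs include `ann_3`, which points at one annotator rather than a team-wide problem. The same grouping could be done by region instead of by annotator.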

This is more overhead than spot-check sampling, but it is not optional if the partner has a large team. A single bad annotator can poison thousands of images before spot-checking catches the problem. Measuring agreement within the partner's team catches it sooner.

The metrics:

Annotator pair agreement: 85%+ for commodity work, 80%+ for complex work
Annotator consistency: does one annotator's work disagree with most others on the same images?
Retraining signals: if an annotator's agreement drops, that signals need for retraining before they continue production work

Common mistake: Not measuring agreement until there is a problem. Proactive measurement is prevention. Reactive measurement is firefighting.

Level 3: Model performance validation

The ultimate QC gate: does the model trained on the partner's annotations perform as expected? This is slower to measure (requires model training, test-set evaluation) but it is the ground truth.

The process:

Build a model on the partner's annotated data
Evaluate the model on a held-out test set (imagery not in training)
Compare the model's performance to expected baselines or previous models
If performance drops unexpectedly, investigate whether the annotation quality degraded or whether there is a distribution shift (new imagery type, new geography, seasonal variation)

This is not a daily check — it is monthly or quarterly. But it is crucial because it connects the annotation quality to actual business outcome.

Example: A partner delivers 50,000 agtech annotations. You build a model. Expected accuracy is 95%. Actual accuracy is 88%. Spot-check sampling showed 96% inter-rater agreement. This mismatch signals one of three things: (1) the spot-check was unreliable, either because the sample was not representative or because your validator was too lenient, (2) the taxonomy definition left out edge cases that matter to the model, or (3) there is a systematic issue with a subset of the annotations that spot-checking did not catch. Dig into the model's confusion matrix and the low-performing subsets. Usually the cause is a category of images that the partner treated differently from your instructions.
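
The "dig into the low-performing subsets" step can be sketched as below. This assumes you can export test-set results as (subset, true label, predicted label) records; the subset names `region_a` and `region_b` and the counts are invented for illustration.

```python
from collections import Counter, defaultdict

def accuracy_by_subset(records):
    """records: iterable of (subset, true_label, predicted_label) from the held-out test set."""
    totals, correct = defaultdict(int), defaultdict(int)
    for subset, truth, predicted in records:
        totals[subset] += 1
        correct[subset] += int(truth == predicted)
    # Worst subsets first, so the weakest slice is the first thing you read.
    return sorted((correct[s] / totals[s], s) for s in totals)

def top_confusions(records, k=3):
    """Most frequent (true_label, predicted_label) mistakes across all subsets."""
    errors = Counter((t, p) for _, t, p in records if t != p)
    return errors.most_common(k)

# Toy results: region_b is where the model struggles.
records = (
    [("region_a", "weed", "weed")] * 9 + [("region_a", "weed", "crop")] * 1
    + [("region_b", "weed", "weed")] * 6 + [("region_b", "weed", "crop")] * 3
    + [("region_b", "crop", "weed")] * 1
)

print(accuracy_by_subset(records))
print(top_confusions(records))
```

A subset that scores well below the rest is a good candidate for a targeted spot-check of the annotations behind it.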

How to structure a QC workflow in a white-label partnership

A mature QC workflow has these components, in order:

Weekly spot-check sampling

5% of deliverables
Random selection, blind to the partner
48-hour turnaround on the check
Results documented and shared with the partner

Monthly inter-team agreement report

Pairwise agreement across the partner's team
Flagged if agreement on any pair drops below 85%
Root-cause investigation if multiple pairs show low agreement
Retraining plan if drift is systematic

Monthly recalibration meeting

Partner team joins (or a representative sample)
Review edge cases from the spot-checks
Discuss any interpretation differences
Update the taxonomy or decision rules if new edge cases emerged
Train everyone on the updates

Quarterly model performance validation

Train a model on the partner's annotations
Evaluate on held-out test set
Compare to baseline performance
Investigate if performance is below expected

Annual comprehensive audit

Full review of the partnership's performance metrics
Discuss contract terms, pricing, and scaling plans
Update SLAs if needed
Decision: continue, improve, or transition

This structure scales from one partner with 5 annotators to a large partnership with 100+. The workload is roughly constant — it does not scale linearly with team size because most QC work is sampling-based, not 100% review.

The partnership escalation protocol

QC is useless if you do not act on the signals. Define an escalation protocol upfront.

Escalation levels:

Level 1 — Spot-check agreement drops to 90-95%

Action: flag the result, share with partner
Partner response: within 2 weeks, identify the source and propose a fix
Examples: a taxonomy edge case was misinterpreted, an annotator needs retraining, the imagery quality changed

Level 2 — Spot-check agreement drops below 90%, or defect rate exceeds 5%

Action: pause accepting new work from the partner on the affected category
Partner response: within 1 week, pause annotation on that category, investigate, propose corrective action, retrain team
Example: weed identification agreement dropped to 87%. The partner finds that a new weed species appeared in imagery that was not in the taxonomy. They add it to the manual, retrain, re-annotate the affected batch

Level 3 — Monthly inter-team agreement shows systematic disagreement (one region vs. another, or one annotator consistently different)

Action: initiate a root-cause investigation call
Partner response: within 3 days, attend a call to diagnose the issue
Example: the India team has 91% agreement with the US team on pest classification. The partner discovers that a common Indian pest looks visually similar to a rare US pest. They add regional guidance to the manual and retrain both teams

Level 4 — Quarterly model performance validation shows unexpected drops

Action: do not release the model. Investigate whether the annotation data or the model approach has an issue.
Partner response: within 5 days, provide detailed analysis of their annotation work (spot-checking their own data, comparing to previous quarters)
Example: model accuracy dropped from 95% to 89%. You investigate. The partner's annotations look reasonable, but the imagery in this batch has a different resolution (high-altitude drone vs. low-altitude). This is a distribution shift, not an annotation quality issue. Add the new resolution to the QC monitoring.

Level 5 — Persistent escalation issues that cannot be resolved within 30 days

Action: initiate partnership review and potential transition planning
This is rare if the partnership is well-governed, but it can happen
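
The two spot-check triggers (Levels 1 and 2) are numeric, so they can be encoded directly. The sketch below makes one assumption the text leaves open: "90-95%" is read as at least 90% and under 95%, so a result of exactly 95% does not escalate. Levels 3 to 5 depend on judgement and are not covered.

```python
def spot_check_escalation(agreement, defect_rate):
    """Return 0 (no escalation), 1, or 2 from a single spot-check result.

    agreement and defect_rate are fractions, e.g. 0.92 and 0.04.
    """
    # Level 2: agreement below 90%, or defect rate above 5%.
    if agreement < 0.90 or defect_rate > 0.05:
        return 2
    # Level 1: agreement under the 95% target but still at least 90%.
    if agreement < 0.95:
        return 1
    return 0

for agreement, defect_rate in [(0.96, 0.04), (0.92, 0.04), (0.87, 0.02), (0.94, 0.06)]:
    level = spot_check_escalation(agreement, defect_rate)
    print(f"agreement={agreement:.0%} defects={defect_rate:.0%} -> level {level}")
```

Writing the rule down this way forces the boundary cases to be decided before the first dispute, which is the purpose of agreeing the protocol upfront.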

Common QC failures in white-label partnerships

No formal QC definition upfront. A company signs a contract with vague language ("maintain 95% accuracy") without defining how accuracy is measured. Later, disagreement on what "accuracy" means causes conflict. Fix: define QC metrics in the contract. Specify inter-rater agreement target, defect rate tolerance, and measurement methodology before work starts.

Spot-checking without randomisation. A company hand-picks 50 "representative" images each week for spot-checking. The partner learns which images are checked and maintains quality on those. Other images drift. Fix: use a random sampling mechanism that the partner cannot anticipate.

Measuring raw agreement percentage instead of kappa. Two annotators agree 94% of the time, but 90% of the images are "healthy with no issues." If each annotator labels about 90% of images "healthy", they would agree on roughly 82% of images by chance alone (0.9 × 0.9 + 0.1 × 0.1), so most of that 94% comes for free. Kappa for this case is about 0.67, far weaker than the raw figure suggests. Fix: use Cohen's kappa or Fleiss' kappa, which control for agreement by chance.
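
To make the chance-agreement arithmetic concrete for the 94% example, here is a small sketch for a two-class task. It assumes both annotators label 90% of images "healthy" and that chance agreement is computed as in Cohen's kappa; the function names are illustrative.

```python
def chance_agreement(share_a, share_b):
    """Expected agreement on a two-class task if both raters labelled independently.

    share_a and share_b are the fractions of images each rater marks as the majority class.
    """
    return share_a * share_b + (1 - share_a) * (1 - share_b)

def kappa_from_rates(observed, expected):
    """Cohen's kappa from observed and chance agreement rates."""
    return (observed - expected) / (1 - expected)

expected = chance_agreement(0.90, 0.90)
kappa = kappa_from_rates(0.94, expected)
print(f"chance agreement: {expected:.2f}, kappa: {kappa:.2f}")
# prints: chance agreement: 0.82, kappa: 0.67
```

Only 12 percentage points of the 94% go beyond what chance would produce, out of a possible 18, which is why kappa lands near 0.67.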

Not acting on escalation signals. Spot-checks show agreement dropping for three weeks in a row, but the company does not escalate. By week 8, the drift has spread and requires re-annotation of thousands of images. Fix: escalate at Level 1 immediately. Do not wait for a pattern to emerge across multiple weeks.

Confusing "meets the SLA" with "meets quality standards." The partner delivers 10,000 images in 8 weeks (meets the timeline SLA). Spot-checking shows agreement at 91% (below your 95% target). The partner says "we did our job" but you say "the quality is not acceptable." Fix: include quality targets in the SLA, not just timeline and volume. The SLA should be "deliver 10,000 images in 8 weeks at 95%+ inter-rater agreement."

Tools and workflows for white-label QC

You do not need complex software to run good QC. Spreadsheets and shared documents work, as long as the workflow is documented and followed.

Spot-check template:

Week	Batch_ID	Sample_Size	Agreement_Rate	Defect_Count	Defect_Rate	Kappa	Status	Notes
1	Batch_001	50	96%	2	4%	0.94	OK
2	Batch_002	50	94%	3	6%	0.91	FLAG	Weed classification drift
3	Batch_003	50	95%	2	4%	0.93	OK
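
If you keep the template as a tab-separated file, a row can be derived from raw counts instead of typed by hand. The sketch below is one possible approach: it assumes agreement is counted per image, and the OK/FLAG rule (agreement of at least 95% and defect rate of at most 5%) is an assumption based on the targets in this article. It writes to an in-memory buffer as a stand-in for a real file.

```python
import csv
import io

COLUMNS = ["Week", "Batch_ID", "Sample_Size", "Agreement_Rate", "Defect_Count",
           "Defect_Rate", "Kappa", "Status", "Notes"]

def spot_check_row(week, batch_id, sample_size, agreed, defects, kappa, notes=""):
    """Build one template row from the counts recorded during a spot-check."""
    agreement = agreed / sample_size
    defect_rate = defects / sample_size
    # Assumed rule: flag when agreement is under 95% or defect rate is over 5%.
    status = "OK" if agreement >= 0.95 and defect_rate <= 0.05 else "FLAG"
    return [week, batch_id, sample_size, f"{agreement:.0%}", defects,
            f"{defect_rate:.0%}", f"{kappa:.2f}", status, notes]

buffer = io.StringIO()  # replace with open("spot_checks.tsv", "a", newline="") in practice
writer = csv.writer(buffer, delimiter="\t")
writer.writerow(COLUMNS)
writer.writerow(spot_check_row(1, "Batch_001", 50, 48, 2, 0.94))
writer.writerow(spot_check_row(2, "Batch_002", 50, 47, 3, 0.91, "Weed classification drift"))
print(buffer.getvalue())
```

Deriving the percentages from counts keeps the log internally consistent and makes it obvious when a reported rate does not match the sample size.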

Monthly recalibration meeting notes:

Date, attendees, topics discussed
Edge cases reviewed: [description of each case and decision]
Taxonomy updates: [what was changed and why]
Retraining plan: [who is retrained on what]
Agreement issues: [any specific annotators or regions with systematic disagreement]

Escalation log:

Date	Level	Issue	Partner Response	Resolution	Days_to_Resolve
2026-05-01	1	Agreement 92% on weed ID	Identified new species	Added to manual, retrained	7
2026-05-20	2	Defect rate 6% on disease	Annotator lack of expertise	Reassigned to senior team	5

The tools are secondary to the discipline. A spreadsheet with consistent tracking beats sophisticated software run inconsistently.

What good QC looks like at scale

A mature white-label partnership with 50+ annotators across multiple geographies and project types:

Weekly spot-check sampling: 5-10% of work, 95%+ agreement, documented
Monthly inter-team reports: agreement tracked by annotator, by region, by crop type
Quarterly model validation: expected performance achieved or root-causes identified
Escalation log: fewer than two Level 2 escalations per quarter
Recalibration meetings: monthly, documented, all major edge cases adjudicated

In this state, the partner is trustworthy because you are measuring and the measurements are stable. Quality is not assumed. It is verified.

For Taranis, our QC framework tracked agreement across multiple regional teams and 460+ weed species. Regular recalibration meetings brought together agronomists from different regions to align on edge cases. When a new weed species appeared mid-season, we added it to the taxonomy, retrained teams, and deployed the updated standard quickly. The model trained on these annotations stayed accurate across seasons and geographies because the annotation standards evolved with the imagery.

The cost of this QC framework is roughly 10-15% of the annotation project cost. It is not overhead. It is the cost of quality.

FAQ

Q: Can we rely on the partner's internal QC instead of running our own?

A: No. The partner has incentive to report quality numbers that justify their pricing. You have incentive to verify that the quality meets your standards. These are different. Run your own spot-checks. The partner's internal QC is a supplement, not a replacement.

Q: How often should we re-check the same image?

A: Once is enough. Random sampling means each image is checked once. Checking the same image twice is waste. The exception is when you are investigating a specific issue — you might re-check images from a particular annotator or category to understand the root cause of low agreement.

Q: What if the partner claims our spot-check methodology is biased?

A: It is a fair point to discuss. If the partner says "you are spot-checking the hardest images and ignoring the easy ones," review your sampling method. But bias is not an excuse to skip QC. If sampling methodology is the issue, fix the methodology, not the QC.

Q: How do we handle genuine disagreement on what is correct?

A: This is common in edge cases (is this plant stressed or diseased?). The solution: have a tie-breaking protocol upfront. Examples: (a) escalate to a third expert and use their judgment, (b) consult literature for the technical answer, (c) for ambiguous cases, label as "uncertain" and do not use for training. Document the rule.

Q: Should QC get stricter or looser as the partnership matures?

A: Stricter initially (catch systematic issues and drift early), then stabilise. After 12+ months of stable, high-agreement results, you might reduce the spot-check sample from 5% to 3% of deliverables. But never relax the metrics themselves. You are changing how much you sample, not the standard you hold the work to.

Q: What happens if the partner fails to meet QC escalations?

A: Follow the contract. Usually: on the first failure, send a formal notice and set a cure period (30 days). On the second failure, reduce volume or invoke termination. Do not let failures slide. If you do, the partner learns that QC does not matter, and quality will degrade further.
