How we built a Data Agent that cleans millions of recruitment records

Thousands of agencies. Millions of records. Every change tracked, auditable, reversible. Here's how we built it.

Lokesh

Updated 2 Apr 2026

We run the Data Agent across thousands of agencies. Millions of candidate records. Every single change tracked, auditable, reversible. If something goes wrong on record 847,331, you can see exactly what happened, when it happened, and roll it back without touching any other record.

That's the bar we set for ourselves. When you're letting AI touch a recruitment agency's most valuable asset, their candidate database, you don't get to move fast and break things. You move carefully, you track everything, and you give the agency full control over what happens to their data.

Here's why we built it and how it works.

The problem every agency has

Every recruitment agency I've worked with has the same problem. They've been importing candidates for years. LinkedIn scrapes, job board applications, CV uploads, manual entries. Tens of thousands of records. And the data quality is terrible.

Names in all caps. Phone numbers in five different formats. Job titles like "🚀 Talent Acquisition Ninja 🎯 | Open to Work | DM me". Skills listed as "JS" and "JavaScript" and "java script" on the same profile. Industry field empty on 60% of records. Half the email addresses are personal Hotmail accounts from 2015.

One agency can't fix this manually. So imagine the scale when hundreds of agencies all need it done, each with 20,000 to 200,000 records, each with different master data lists, different field conventions, different quality standards. That's the problem the Data Agent solves.

What the Data Agent actually does

The Data Agent processes candidates, contacts, and companies in bulk. You pick a saved search (or your entire database), choose what you want fixed, and let it run. It works across six categories of operations.

Parse. Got 10,000 candidates with CV text but no structured data? The agent reads the CV and extracts name, job title, location, skills, employment history, education, languages, years of experience. Not keyword matching. Actually reading the CV and understanding it.

Cleanup. Phone numbers get reformatted to international standard (+447912345678 instead of 07912 345678). Job titles get stripped of emojis and LinkedIn nonsense. Names get proper capitalisation (McDonald, O'Brien, van der Berg). Placeholder values like "N/A" and "TBD" and "-" get removed. Skills get deduplicated. Dates get standardised. Whitespace gets trimmed.
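
To make the name and placeholder rules concrete, here is a minimal rule-based sketch. It is an illustration, not the production implementation: the placeholder set, the particle list, and the function names are all assumptions, and real names have far more edge cases than this handles.

```python
import re

# Hypothetical placeholder set; a real deployment would likely make this configurable.
PLACEHOLDERS = {"n/a", "na", "tbd", "-", "none", "null"}
# Surname particles kept lowercase (assumed list, not exhaustive).
PARTICLES = {"van", "der", "de", "den", "von"}


def strip_placeholder(value):
    """Return None for empty or placeholder values, else the trimmed value."""
    if value is None:
        return None
    trimmed = " ".join(value.split())  # trim and collapse whitespace
    if not trimmed or trimmed.lower() in PLACEHOLDERS:
        return None
    return trimmed


def _cap_word(word):
    lower = word.lower()
    if lower in PARTICLES:
        return lower
    # Capitalise each piece around apostrophes and hyphens: o'brien -> O'Brien.
    pieces = re.split(r"(['\-])", lower)
    capped = "".join(piece.capitalize() for piece in pieces)
    # Mc prefix: Mcdonald -> McDonald.
    if capped.startswith("Mc") and len(capped) > 2:
        capped = "Mc" + capped[2:].capitalize()
    return capped


def normalise_name(name):
    """Apply simple capitalisation rules to a full name or surname."""
    return " ".join(_cap_word(word) for word in name.split())
```

Rules like these are cheap to run before or after the AI step, and anything they cannot handle confidently can be left for the model.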

Classify. The agent matches each record against your CRM's master data lists. Industry, sector, tags. A candidate with "Goldman Sachs" in their employment history gets classified as Financial Services without anyone doing it manually.

Generate. AI writes professional summaries for each candidate. A 2-3 sentence overview that a client would actually want to read. Not the candidate's own summary from LinkedIn, which is usually either empty or three paragraphs of buzzwords.

LinkedIn enrichment. If a candidate has a LinkedIn URL on their record, the agent fetches the latest profile data and merges it. New job title, new company, updated skills. It can also generate a proper CV from the LinkedIn data.

Contact finder. Missing email or phone? The agent searches external data providers to find them. This is where sourcing meets data quality. You imported 5,000 candidates from LinkedIn but only 500 had email addresses. Now you've got contact details for most of them.

The engineering challenge

Processing 100,000 records sounds straightforward until you actually try it. Each record needs to be fetched from the CRM API, sent to an AI model for analysis, and written back with the changes. That's three API calls minimum per record. At 100,000 records, you're making 300,000+ API calls. You can't just fire them all at once and hope for the best.

We built a queue system. Records get batched and processed in controlled chunks. If the CRM API rate-limits us, we back off and retry. If an AI call fails on one record, it doesn't kill the entire job. That record gets marked as failed and the agent moves on. At the end, you get a summary: 98,500 succeeded, 1,200 had no data to update, 300 failed. You can retry the failures separately.
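
A rough sketch of that loop, under stated assumptions: `RateLimited`, `process`, the batch size, and the retry limits are all hypothetical names and values chosen for illustration, and a real queue would hand each batch to a worker rather than loop in-process.

```python
import time
from dataclasses import dataclass, field


class RateLimited(Exception):
    """Hypothetical error raised when the CRM API asks us to slow down."""


@dataclass
class JobSummary:
    succeeded: int = 0
    unchanged: int = 0
    failed: list = field(default_factory=list)  # record ids to retry later


def with_backoff(fn, record_id, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(record_id), retrying with exponential backoff on rate limits."""
    for attempt in range(max_attempts):
        try:
            return fn(record_id)
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of retries; the caller marks the record as failed
            sleep(base_delay * 2 ** attempt)


def run_job(record_ids, process, batch_size=100, sleep=time.sleep):
    """Process ids in fixed-size batches; one bad record never stops the job.

    `process(record_id)` is a hypothetical callable that returns True if the
    record was updated and False if there was nothing to change.
    """
    summary = JobSummary()
    for start in range(0, len(record_ids), batch_size):
        for record_id in record_ids[start:start + batch_size]:
            try:
                changed = with_backoff(process, record_id, sleep=sleep)
            except Exception:
                summary.failed.append(record_id)
                continue
            if changed:
                summary.succeeded += 1
            else:
                summary.unchanged += 1
    return summary
```

Retrying the failures is then just another call to `run_job` with `summary.failed` as the input list.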

Every single change is tracked. For every record, we store the original data and the enriched data side by side. You can see exactly what changed, what the value was before, and what it is now. If the AI got something wrong (it happens), you can see it immediately. We store these changesets in R2 so they're always available for audit.
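
As an illustration of what a per-record changeset could look like, here is a small sketch. The field names and the before/after shape are assumptions, not the stored format, and it treats a missing field and a `None` value as the same thing.

```python
def build_changeset(original, enriched):
    """Return {field: {"before": ..., "after": ...}} for fields that differ."""
    changes = {}
    for key in sorted(set(original) | set(enriched)):
        before, after = original.get(key), enriched.get(key)
        if before != after:
            changes[key] = {"before": before, "after": after}
    return changes


def rollback(current, changeset):
    """Restore the 'before' values for one record, leaving other fields alone."""
    restored = dict(current)
    for key, change in changeset.items():
        if change["before"] is None:
            restored.pop(key, None)  # field was empty before enrichment
        else:
            restored[key] = change["before"]
    return restored
```

Because the changeset is scoped to a single record, rolling back one record never needs to read or write any other.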

The approval workflow

This was non-negotiable. You do not let an AI modify 100,000 records without a human saying yes first. When someone creates a Data Agent job, it goes into "pending approval" status. An approval email goes out with a summary: what records, what actions, who requested it. Someone with authority clicks approve, and then it runs.
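
The gate itself can be modelled as a tiny state machine. The sketch below is hypothetical: the class, the status names beyond "pending approval", and the fields are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JobStatus(Enum):
    PENDING_APPROVAL = "pending approval"
    APPROVED = "approved"
    RUNNING = "running"


@dataclass
class DataAgentJob:
    requested_by: str
    record_count: int
    actions: list
    status: JobStatus = JobStatus.PENDING_APPROVAL
    approved_by: Optional[str] = None

    def approval_summary(self):
        """Text for the approval email: who asked, what actions, how many records."""
        return (f"{self.requested_by} requested {', '.join(self.actions)} "
                f"on {self.record_count:,} records")

    def approve(self, approver):
        if self.status is not JobStatus.PENDING_APPROVAL:
            raise ValueError("job is not awaiting approval")
        self.status = JobStatus.APPROVED
        self.approved_by = approver

    def start(self):
        # The only path to RUNNING goes through APPROVED.
        if self.status is not JobStatus.APPROVED:
            raise PermissionError("job has not been approved")
        self.status = JobStatus.RUNNING
```

Putting the check inside `start` means no caller can skip the approval step by accident.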

We debated whether to make this optional. Some agencies want to schedule weekly cleanup jobs that run automatically. But the risk of something going wrong on a bulk operation is too high. One bad AI interpretation applied to 50,000 records would be a nightmare to undo. The approval step takes 30 seconds and prevents catastrophic mistakes, so it stays mandatory.

What the AI actually sees

For each record, we build a context window with the candidate's existing data, their CV text (if available), and the specific actions requested. The AI doesn't get the entire database. It gets one record at a time with clear instructions: "Here's a candidate. Here's their CV text. Extract the current job title. If you can't determine it confidently, return null."
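
A sketch of how such a per-record prompt might be assembled and its reply read back. The wording, the JSON reply convention, and both function names are assumptions for illustration; the model call itself is left out.

```python
import json


def build_prompt(record, cv_text, actions):
    """Assemble the context for ONE record: existing data, CV text, actions."""
    sections = [
        "Here's a candidate.",
        json.dumps(record, indent=2, ensure_ascii=False),
    ]
    if cv_text:
        sections += ["Here's their CV text.", cv_text]
    sections.append("Actions:\n" + "\n".join(f"- {action}" for action in actions))
    sections.append(
        "Reply with a JSON object. If you can't determine a value "
        "confidently, return null for it."
    )
    return "\n\n".join(sections)


def parse_reply(reply_text, expected_fields):
    """Keep only the requested fields; anything missing is treated as null."""
    data = json.loads(reply_text)
    return {name: data.get(name) for name in expected_fields}
```

Filtering the reply down to the requested fields keeps a chatty model response from writing to fields nobody asked it to touch.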

We use Claude Haiku for most operations because it's fast and cheap. At $0.001 per candidate, processing 100,000 records costs about $100 in AI compute. That's nothing compared to the value of having clean, structured, classified data across your entire database.
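
The arithmetic is simple enough to check in a few lines. The per-candidate figure is the one quoted above; the function name is just for illustration.

```python
from decimal import Decimal

COST_PER_CANDIDATE = Decimal("0.001")  # USD per candidate, figure quoted above


def estimate_cost(record_count):
    """Estimated AI compute cost in USD for a job of record_count records."""
    return COST_PER_CANDIDATE * record_count


estimate_cost(100_000)  # Decimal('100.000')
```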

The classification actions are interesting. The AI doesn't just guess the industry. It gets your CRM's actual master data list and matches against it. So if your master list has "Financial Services" but not "Finance", every finance candidate gets tagged as "Financial Services" consistently. No variations, no duplicates, no "banking/finance/finserv" mess.
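
One way to enforce that consistency is to validate whatever the model returns against the master list before writing it back. This is a minimal sketch with a hypothetical function name, not a description of the actual pipeline.

```python
def match_master_value(ai_value, master_list):
    """Return the canonical master-list entry for ai_value, or None.

    Matching ignores case and extra whitespace. Anything that is not on the
    list is rejected rather than written to the CRM.
    """
    if not ai_value:
        return None
    canonical = {" ".join(entry.lower().split()): entry for entry in master_list}
    return canonical.get(" ".join(ai_value.lower().split()))
```

With a master list of `["Financial Services", "Technology"]`, a reply of "financial  services" resolves to "Financial Services", while "Finance" resolves to `None` and never reaches the record.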

Real examples

Phone cleanup. "07912 345678" becomes "+447912345678". "00971 50 123 4567" becomes "+971501234567". "N/A" gets removed entirely. Consistent, international format across every record.
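
A rule-based sketch that reproduces the three examples above. It assumes a default country code of 44 and that a single leading 0 is a national trunk prefix, which holds for UK numbers but not for every country. The function name is hypothetical.

```python
import re


def normalise_phone(raw, default_country_code="44"):
    """Return +<country><number>, or None when no confident result exists."""
    if not raw:
        return None
    text = raw.strip().replace("(0)", "")  # "+44 (0)7912..." style
    digits = re.sub(r"\D", "", text)
    if len(digits) < 7:
        return None  # placeholders like "N/A" or "-" end up here
    if text.startswith("+"):
        return "+" + digits
    if digits.startswith("00"):
        return "+" + digits[2:]  # international dialling prefix
    if digits.startswith("0"):
        return "+" + default_country_code + digits[1:]  # national trunk prefix
    return None  # ambiguous: no trunk or international prefix to go on
```

For production use, a dedicated library such as `phonenumbers` (the Python port of Google's libphonenumber) validates per-country rules that these few lines ignore.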

Job title cleanup. "🚀 Head of Talent @ Google | Hiring! | Ex-Amazon | Open to Opportunities 🎯" becomes "Head of Talent". The signal stays. The noise goes.
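
For the common patterns, a few string rules get surprisingly far. The sketch below is a rule-based stand-in for illustration, not the AI step described in this article, and the emoji ranges are an approximation rather than a complete list.

```python
import re

# Approximate emoji ranges (pictographs, dingbats, variation selector, joiner).
EMOJI = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]")


def clean_job_title(raw):
    """Strip emojis, '| ...' tails, and '@ Company' suffixes from a title."""
    text = EMOJI.sub("", raw)
    first = text.split("|")[0]   # drop "| Hiring! | Open to Work" tails
    first = first.split("@")[0]  # drop "@ Company"
    return " ".join(first.split()) or None
```

Rules like these break on titles that legitimately contain "|" or "@", which is one reason to let a model make the final call.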

Skills deduplication. A candidate with "JavaScript", "JS", "java script", "Javascript", and "JAVASCRIPT" in their skills list ends up with one entry: "JavaScript". Multiply that by 50,000 candidates and your search results go from messy to precise.
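
A minimal sketch of that deduplication, assuming a small alias table. The table and the function names are hypothetical; in practice the canonical spellings might come from the CRM's skills master data.

```python
# Hypothetical alias table mapping normalised keys to canonical spellings.
ALIASES = {"js": "JavaScript", "javascript": "JavaScript"}


def _key(skill):
    """Comparison key: lowercase with all whitespace removed."""
    return "".join(skill.lower().split())


def dedupe_skills(skills):
    """Collapse spelling variants, keeping the first-seen order."""
    seen = set()
    result = []
    for skill in skills:
        canonical = ALIASES.get(_key(skill), " ".join(skill.split()))
        canonical_key = _key(canonical)
        if canonical_key and canonical_key not in seen:
            seen.add(canonical_key)
            result.append(canonical)
    return result
```

Removing internal whitespace in the key is what lets "java script" collide with "JavaScript". It is a deliberate trade-off that could occasionally merge two skills that really are different.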

Industry classification. A candidate record has no industry field but their employment history shows Goldman Sachs, Barclays, and Deloitte. The AI classifies them as "Financial Services" and tags them with "Banking" and "Professional Services". Now they show up in the right searches.

Why this matters for agencies

A recruitment agency's database is its most valuable asset. But only if the data is usable. 50,000 candidates with bad data is worse than 5,000 candidates with good data because the bad data pollutes your search results, wastes time, and makes your AI matching less accurate.

Agencies that run the Data Agent typically see their sourcing results improve immediately. Searches return more relevant candidates because the data they're searching against is clean and classified. Time to fill drops because recruiters spend less time manually filtering through bad matches.

The Data Agent runs as part of the AI Agents suite in Recruitly. It's available on every plan. If you've got years of accumulated data that needs cleaning up, this is the fastest way to do it.

If you want to see it in action, book a demo and we'll run it on a sample of your actual data.
