How we figure out two candidate records are the same person
Inside Recruitly's entity-resolution stack: which signals we trust, what wins when they disagree, and why dedup matters at 50,000 records.
I get asked a lot of questions about Recruitly's sourcing layer, but the one that comes up least is also the one that took the most engineering effort: how do you know two candidate records are the same person?
On the surface it sounds trivial. Same name, same email, same LinkedIn URL, done. In reality none of those signals are clean. Names get misspelled. Emails change every time someone leaves a job. LinkedIn vanity URLs get edited. CVs from different agencies describe the same person in completely different ways. If you take this seriously, the engineering problem isn't matching. It's deciding which signals to trust when they disagree.
I'm going to walk through how we do that inside Recruitly's candidate index. This post goes deeper than most recruitment blogs because the trade-offs only make sense when you can see the actual decisions.
Why candidate dedup is harder than customer dedup
People assume candidate dedup is the same problem as contact dedup in a sales CRM. It isn't. A B2B contact has one employer at a time, one work email, one work phone, and the record changes slowly. Two records with matching email almost certainly belong to the same person.
A candidate is different. Over a 15-year career they will have worked at six employers, used three or four work emails, kept one or two personal emails, switched phone numbers twice, edited their LinkedIn vanity at least once after a name change or rebrand, and shown up in maybe a dozen agency databases. If you key dedup on email alone, the first time a candidate changes jobs you create a duplicate. Do that across 100,000 candidates and you have a database your team stops trusting.
This is the gap most recruitment platforms quietly leave open. They borrow contact-dedup logic from a CRM template and tell themselves it's good enough. It isn't, and you only realise after a year or two when the database has thousands of soft duplicates and recruiters have stopped trusting search results.
Why "same email equals same person" falls apart at around 10,000 records
The first version of any dedup pipeline keys on email. It works fine for the first few thousand records. Then it breaks in three predictable ways.
The first is job changes. Mark Robertson at company A leaves and joins company B six months later. Same human, different work email. A naive pipeline creates a new record. Now you have two Mark Robertsons, both with partial histories, and your team reaches out to him twice with conflicting roles.
The second is reused addresses. Shared family inboxes are more common than people think, especially with older candidates. Agency relay addresses are worse. We've seen agencies where every CV submission gets forwarded from one shared team mailbox, so dozens of candidates end up with the same sender email in the submission record.
The third is throwaway addresses. A small share of candidates apply with temporary emails. If you key on email, every applicant from one of those domains gets shoved into the same record. We saw exactly this happen to one customer's import job before we tightened the rules. Sixty applicants, one record, no warning. They didn't notice for two months.
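To make the first two failure modes concrete, here is a minimal sketch of an email-keyed pipeline. It is an illustration, not Recruitly's production code: the record shape, the `group_by_email` helper, the example domains, and every name other than Mark Robertson are made up for the example.

```python
from collections import defaultdict


def group_by_email(records):
    """Naive dedup: every record sharing a lowercased email becomes one person."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["email"].strip().lower()].append(rec["name"])
    return dict(groups)


# Hypothetical sample data for illustration only.
records = [
    # Same human, before and after a job change.
    {"name": "Mark Robertson", "email": "mark@company-a.example"},
    {"name": "Mark Robertson", "email": "mark.robertson@company-b.example"},
    # Two different humans submitted through one shared agency mailbox.
    {"name": "Priya Shah", "email": "cvs@agency.example"},
    {"name": "Tom Okafor", "email": "CVs@agency.example"},
]

groups = group_by_email(records)
for email, names in groups.items():
    print(email, names)
```

The one person ends up split across two groups, and the two unrelated people collapse into one. Both errors come from the same decision to treat email as identity.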
The signal stack we actually use
The fix isn't to pick a better single signal. There isn't one. The fix is to use a stack of signals and weight them by trust. Inside Recruitly the stack looks roughly like this, in descending order of trust.
LinkedIn vanity URL, normalised. We lowercase it, strip the trailing slash, and treat it as canonical. If two records share the same vanity, that's the strongest possible signal. LinkedIn enforces uniqueness at their end and vanities only get edited deliberately. We do keep a small alias table for vanities we've seen change for the same person, but that's rare.
Phone number in E.164 format. A real phone number, normalised to the international format, is almost as good as a LinkedIn vanity. The catch is that people change phone numbers when they switch countries, so we treat a phone match as strong but not absolute.
Verified email. Lowercased, trimmed, MX-checked. We treat verified email as a strong signal but not as strong as vanity or phone, for the job-change reasons above. We also flag known agency relay domains and refuse to use email from those as a dedup key at all.
Name plus employer-history fingerprint. Two records with the same full name and at least two overlapping employer-and-date pairs are very likely the same person. We don't auto-merge on this alone. We surface it as a suggested merge and let the recruiter confirm.
CV text shingle hash. When two CVs share enough overlapping multi-word shingles, the underlying person is almost certainly the same. We use this to catch the case where a CV gets imported through two channels with no other matching fields.
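A rough sketch of what some of these signals can look like in code, using only the Python standard library. Everything here is an assumption made for illustration rather than our actual implementation: the function names, the `RELAY_DOMAINS` blocklist, the five-word shingle size, and the use of plain Jaccard overlap on shingle sets instead of a hash. Phone normalisation to E.164 and the MX check are left out because they need region-aware parsing and DNS lookups respectively.

```python
import re

# Hypothetical blocklist; the real list of agency relay domains is not shown here.
RELAY_DOMAINS = {"relay.agency.example"}


def vanity_key(url):
    """Lowercase a LinkedIn profile URL and return the vanity slug, or None."""
    match = re.search(r"linkedin\.com/in/([^/?#]+)", url.strip().lower())
    return match.group(1) if match else None


def email_key(email):
    """Lowercase and trim; return None for malformed or relay-domain addresses."""
    email = email.strip().lower()
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain or domain in RELAY_DOMAINS:
        return None
    return email


def employer_overlap(history_a, history_b):
    """Count employer-and-date pairs that appear in both histories."""
    return len(set(history_a) & set(history_b))


def shingles(text, k=5):
    """Return the set of overlapping k-word shingles in a text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def shingle_similarity(text_a, text_b, k=5):
    """Jaccard similarity of the two shingle sets; 0.0 if either text is too short."""
    sa, sb = shingles(text_a, k), shingles(text_b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


print(vanity_key("https://www.LinkedIn.com/in/Mark-Robertson/"))
print(email_key("cvs@relay.agency.example"))
print(employer_overlap([("Acme", 2015), ("Globex", 2018)],
                       [("Acme", 2015), ("Globex", 2018), ("Initech", 2021)]))
```

Returning `None` from a key function is deliberate in this sketch: a missing key can neither agree nor disagree with anything, which is the behaviour you want from a relay address.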
What wins when signals disagree
The interesting work isn't matching. It's resolving conflicts. Two records can have matching name and matching phone but completely different LinkedIn URLs. Either you're looking at one person who edited their vanity, or you're looking at two different humans with overlapping data. The pipeline has to pick one.
The rule we settled on is that auto-merge requires at least one strong signal agreeing and no strong signal actively disagreeing. If LinkedIn vanities differ, we don't auto-merge even if phone and email match. LinkedIn enforces uniqueness server-side, and two different live vanities almost certainly means two different people. The record gets flagged for human review instead.
If only soft signals match (name plus employer history, or CV shingle alone) we never auto-merge. We surface a suggested merge in the UI and let the recruiter make the call. The cost of a wrong auto-merge is much higher than the cost of a duplicate that lingers for a week.
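The conflict rules above can be sketched as a small decision function. This is a simplified illustration, not the production pipeline. The record shape, the `decide` function, and the `Decision` names are hypothetical. One assumption matters most: only a differing LinkedIn vanity is treated as "actively disagreeing" here, because that is the one case spelled out above and because a differing email is expected after any job change. Which strong signals can veto is a parameter, not a claim about how Recruitly configures it.

```python
from enum import Enum

STRONG = ("vanity", "phone", "email")  # the strong signals named in the post
VETO = ("vanity",)  # assumption: only a vanity mismatch blocks an auto-merge


class Decision(Enum):
    AUTO_MERGE = "auto_merge"
    HUMAN_REVIEW = "human_review"
    SUGGEST = "suggest_merge"
    NO_MATCH = "no_match"


def decide(a, b, soft_match=False, veto=VETO):
    """Apply the rule: one strong agreement, no strong signal actively disagreeing."""
    agree = False
    vetoed = False
    for key in STRONG:
        left, right = a.get(key), b.get(key)
        if left is None or right is None:
            continue  # a missing value neither agrees nor disagrees
        if left == right:
            agree = True
        elif key in veto:
            vetoed = True
    if agree and not vetoed:
        return Decision.AUTO_MERGE
    if agree and vetoed:
        return Decision.HUMAN_REVIEW  # strong signals conflict: a person decides
    if soft_match:
        return Decision.SUGGEST  # soft signals can suggest, never decide
    return Decision.NO_MATCH


# Hypothetical records: same vanity, different work email after a job change.
before = {"vanity": "mark-robertson", "email": "mark@company-a.example"}
after = {"vanity": "mark-robertson", "email": "mark.robertson@company-b.example"}
print(decide(before, after))

# Same phone, different live vanities: flagged for review, not merged.
left = {"vanity": "j-smith", "phone": "+442079460000"}
right = {"vanity": "john-smith-2", "phone": "+442079460000"}
print(decide(left, right))
```

Passing `veto=STRONG` gives the stricter reading, where any differing strong signal sends the pair to review. That trades more review work for fewer wrong merges.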
The case that broke our first version
We rebuilt this pipeline twice. The first version used a fingerprint of full name plus current employer plus seniority. It worked on test data and shipped to production. Within a fortnight we had reports of merged records that obviously weren't the same person.
The case that broke it was twins. Two siblings in finance, same surname, similar first names that one customer's CRM had abbreviated identically, both at firms in the City of London at the same time. Our fingerprint collapsed them into one record. The candidate noticed because we'd emailed him about a role his brother had applied for.
The fix wasn't a smarter fingerprint. It was a rule that no auto-merge happens without at least one strong signal: LinkedIn vanity, phone, or verified email. Soft fingerprints can suggest, never decide. That rule has held for the two years since, across millions of records.
What we deliberately don't try to dedupe
There are three classes of duplicate we leave alone on purpose.
The first is the same person across different customer tenants. If two of our customer agencies both have the same candidate in their database, we never merge those records. They live in separate logical databases and that boundary doesn't move, ever. Cross-tenant signal is useful internally for things like profile view tracking, but it never causes a merge.
The second is historical name changes. If a candidate changed their surname five years ago and we have records under both names with no overlapping employer history or contact details, we don't try to chase that. We flag possible alias links when we have evidence (matching phone, matching LinkedIn vanity that survived the rename) but we don't guess.
The third is transliterations of the same name. Mohammed, Muhammad, and Mohamed are sometimes the same person and sometimes three different people. The cost of getting this wrong in a recruitment context is real, so we flag it as a soft suggestion and never auto-merge.
Why this matters even if you only have 50,000 records
People assume entity resolution is a problem you only think about at hundreds of millions of records. It isn't. The pain shows up much earlier than that.
A solo recruiter with 50,000 candidates already has a duplicate problem if dedup wasn't built in from the start. The symptoms are familiar. Search returns three versions of the same person. You send the same candidate to a client twice. A BD person calls a contact who is already sitting in your candidate pipeline under a different name. Every one of those incidents costs trust with either the candidate or the client.
The reason we built the stack this way is that the agencies on Recruitly range from one-person shops to teams of fifty, and the pain shows up at every size. Our data agent that cleans millions of recruitment records is the customer-facing surface of this work. It uses the same signal stack to suggest merges in your own database, not just in the sourcing index.
If you want the broader picture of why we treat candidate data as the most valuable thing an agency owns, the post on LinkedIn plus your own database covers the philosophy. The dedup work is what makes that philosophy actually pay off. The sourcing and AI pages show the surfaces that sit on top of the candidate index, and the background on how to build a candidate database is a good companion read.



