
How we picked the best OCR for reading CVs at scale

We tested six OCR engines on real candidate CVs to find the one that does not break emails, phone numbers, and umlauts. Here is how it went.


Every CV that hits Recruitly goes through one decision before anything else happens: can we read the text out of it directly, or do we need to send it through OCR? About 75% of the time we can read it directly. The other 25% are scanned, photographed, or rendered as images by whatever software the candidate used. Those go through OCR, and the engine we pick for that job quietly decides whether the downstream parser sees "Sebastian Müller, +49 30 12345" or "Sebastion Muller, 1 49 30 12345".
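
That first branch is simple to sketch. A minimal version using pypdf, though our production extraction code differs and the 200-character threshold here is illustrative:

```python
from pypdf import PdfReader

def needs_ocr(path: str, min_chars: int = 200) -> bool:
    """Return True if the PDF has no usable text layer and must go to OCR.

    min_chars is an illustrative cutoff: scanned PDFs often carry a few
    stray characters of metadata text, so "any text at all" is too lenient.
    """
    reader = PdfReader(path)
    extracted = "".join(page.extract_text() or "" for page in reader.pages)
    return len(extracted.strip()) < min_chars
```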

One character wrong in an email and the candidate never gets the job alert. The plus dropped from a phone number and WhatsApp cannot dial it. The umlaut read as a plain u and the recruiter searching for "Müller" never finds them. We have watched all three happen, on real candidates, in production. So we sat down and tested every serious OCR engine we could find on the same set of CVs, and picked one that does not make those mistakes.

Here is how we did it and what we ended up with.

What "good" looks like for CV OCR

OCR for a CV is not the same problem as OCR for a receipt or a scanned book. The text on a CV has a few hostile properties. It mixes languages (German names, English job titles, Arabic phone codes). It is full of symbols that look like text but are not (+44, €60k, 10+, &). It uses diacritics that change meaning (Müller is not the same person as Muller). It has two-column layouts where the engine has to decide what is a sidebar and what is the main story. And the most important fields, the email and phone number, sit in a tiny block of text the engine has only one shot at.

We did not care about prose accuracy. The downstream Llama parser is good enough to clean up paragraph quirks. What we cared about was the five fields that decide whether a recruiter can ever contact this candidate again. Email. Phone. Name. Date of birth where present. Location. Get those five right and the rest sorts itself out.

So the test was not "which engine produces the cleanest transcript". The test was "which engine never breaks the five things that matter".
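
That test translates into a handful of containment checks against hand-labelled ground truth. A minimal sketch of the idea (not our production grader; the field names and whitespace handling are illustrative):

```python
import re

def normalise(text: str) -> str:
    # Collapse whitespace so a line break inside the contact block
    # does not register as a false defect.
    return re.sub(r"\s+", " ", text)

def critical_field_defects(ocr_text: str, truth: dict[str, str]) -> list[str]:
    """Grade one engine's output on the fields that matter.

    truth holds hand-labelled values, e.g. {"email": "...", "phone": "+49 ...",
    "name": "Sebastian Müller", "dob": "19.04.1979", "location": "Berlin"}.
    A field fails if its exact string (diacritics, plus sign and all) is
    missing from the output.
    """
    haystack = normalise(ocr_text)
    return [
        f"{field} broken: expected {expected!r}"
        for field, expected in truth.items()
        if expected and normalise(expected) not in haystack
    ]
```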

The six engines we put on the bench

The baseline was PaddleOCR running locally through RapidOCR-ONNX, which is what we had been using up to that point. Free, no network, bundled into the container. Tesseract was the second free baseline, older but battle-tested.

For the cloud engines we tested four. GLM-OCR, the Chinese open model with the strong layout-detection story. Mistral OCR, which markets itself specifically as a document-OCR product. Datalab, the premium engine that big-document workflows tend to converge on. And Google's Gemini 3.1 Flash Lite, which is a general multimodal model rather than a dedicated OCR product, but cheap enough to throw at every page.

Six engines total: PaddleOCR, Tesseract, GLM-OCR, Mistral OCR, Datalab, and Gemini 3.1 Flash Lite. We sent the same PDFs to each and graded the output by hand. No automated scoring, no benchmark suite. Real CVs, real eyes.
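
The harness itself was nothing clever: run every engine on the same bytes, write every output to disk, read them side by side. A sketch, where each engine wrapper (vendor SDK or HTTP call, omitted here) takes PDF bytes and returns text:

```python
from pathlib import Path
from typing import Callable

def run_bench(pdf_paths: list[str],
              engines: dict[str, Callable[[bytes], str]],
              out_dir: str = "ocr_bench") -> None:
    """Write one text file per (CV, engine) pair for manual grading."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for pdf in pdf_paths:
        data = Path(pdf).read_bytes()
        for name, engine in engines.items():
            (out / f"{Path(pdf).stem}.{name}.txt").write_text(engine(data))
```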

The two CVs that mattered

You cannot benchmark on synthetic data. We picked two real CVs that had broken our pipeline at some point in the past and used them as the gold-standard test set.

The first was a German DaF teacher's CV. Two-column layout. Left sidebar held her name, photo, contact details, and date of birth. Right column held experience. Heavy umlauts throughout (Über, Prüferin, Pädagogische). German typographical quirks like the low opening quote („Grünen Diploms“). DOB written as 19.04.1979 in German date order. The kind of CV a Berlin agency sees ten times a day and an OCR engine trained mostly on English receipts trips over.
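
Once the digits survive OCR, the German date order itself is a one-line fix downstream. A hypothetical helper:

```python
from datetime import date, datetime

def parse_german_dob(raw: str) -> date:
    # German CVs put the day first: 19.04.1979 is 19 April 1979,
    # not month 19 of anything.
    return datetime.strptime(raw.strip(), "%d.%m.%Y").date()

assert parse_german_dob("19.04.1979") == date(1979, 4, 19)
```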

The second was an English single-column tech CV with a photo placeholder header. Name set in full capitals with a space. LinkedIn handle in the contact block (alexmorgan, not Alex Morgan). Numbers that look like text (10+ years, 50M+ users, 2TB data). The hostile thing about this CV is the numbers, because OCR engines love to read "10+" as "1o+" and "50M" as "5oM".

Two CVs, both real, both representative of a class of CV we see every week. If an engine could read both cleanly it would handle the long tail. If it could not, we knew exactly where it would fail.

What we measured, and what the results looked like

We graded each engine on five things. Email accuracy. Phone accuracy including the leading plus. Diacritics and umlauts preserved. Numbers and dates not corrupted. And cost per 100k pages, because at our volume that decides whether something is viable.

PaddleOCR. Got most of the prose right but failed where it mattered. Phone numbers came back without the leading plus. Umlauts were stripped wholesale. The DOB came out as "1g.04.1979" because the engine read 9 as g. The English CV had the name as "ALEXMORGAN" with no space, and the LinkedIn handle came back as "alexmorgai". Free and fast, but the failure modes were exactly the wrong ones.

Tesseract. Similar story. Reliable for what it does, but it was trained on clean scans and could not cope with modern CV layouts.

GLM-OCR. Catastrophic failure on the photo-header CV. The model decided the photo and the contact block next to it were one element and skipped both. An OCR engine that loses the candidate's email and phone has failed at its only job.

Mistral OCR. Got close. Diacritics preserved, prose accurate, reading order sensible. But it read the leading plus on every phone number as a 1. Every German mobile came back as 149 30 something. A small error that breaks every phone dial in the database.

Datalab. Perfect. Every field correct on both CVs. Diacritics intact, plus signs preserved, numbers clean. Five seconds per page. The catch was cost: $325 to $475 per 100k pages.

Gemini 3.1 Flash Lite. Also perfect on the critical fields. Email and phone with plus signs, umlauts preserved including the German low-opening quotes, DOB clean, numbers like 10+ and 50M+ all read correctly. 3.5 seconds per page. Cost: $25 to $30 per 100k pages. Roughly ten to fifteen times cheaper than Datalab for the same accuracy on the things we cared about.

Why Gemini won

The decision was simpler than the test made it look. Two engines had zero critical defects on our test CVs: Datalab and Gemini Flash Lite. Both correctly handled the email, phone with plus, umlauts, and numbers. Gemini was ten to fifteen times cheaper. That was the whole argument.

Reading order did not matter for us. Gemini reads two-column CVs main-column-first, which is the "wrong" order if you are trying to reproduce the document visually. Datalab and the dedicated OCR engines read sidebar-first, which is technically correct. But the downstream Llama parser does not care which order the text arrives in. It re-extracts the structured fields from the text regardless. So we would have been paying ten times more for a feature that did not change the outcome.

The other thing that helped Gemini was the model itself. It is a general multimodal model, not a dedicated OCR product. That sounds like a disadvantage. In practice it is the opposite. The model knows that German names take umlauts. It knows phone numbers start with a plus. It knows "10+" is a number followed by a plus sign, not "1o+". Domain knowledge baked into the model corrects errors that pure OCR engines make blindly.
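
Using a multimodal model as an OCR engine is a single-request affair. A sketch with the google-genai Python SDK; the model ID is a placeholder (substitute whichever Flash Lite variant you use) and the prompt wording is an assumption, not our production prompt:

```python
from google import genai
from google.genai import types

client = genai.Client()  # picks up GEMINI_API_KEY from the environment

def gemini_ocr(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """One page image in, plain text out."""
    response = client.models.generate_content(
        model="gemini-2.0-flash-lite",  # placeholder: use your Flash Lite variant
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type=mime_type),
            "Transcribe every piece of text on this page exactly as written. "
            "Preserve diacritics, leading plus signs on phone numbers, and "
            "number formatting. Return plain text only.",
        ],
    )
    return response.text or ""
```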

Why we kept the local engines anyway

We did not rip out PaddleOCR and Tesseract. They are still in the container, sitting behind Gemini as fallbacks. Gemini runs first. If anything goes wrong (missing API key, HTTP error, network timeout, empty response, unexpected JSON shape) the call falls through to PaddleOCR. If Paddle fails, it falls through to Tesseract.
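
The cascade itself is a dozen lines. A sketch of the shape, where gemini_ocr, paddle_ocr, and tesseract_ocr stand in for the real wrappers:

```python
from typing import Callable, Sequence

def ocr_with_fallback(image_bytes: bytes,
                      engines: Sequence[Callable[[bytes], str]]) -> str:
    """Try each engine in order; any failure drops to the next one."""
    for engine in engines:
        try:
            text = engine(image_bytes)
            if text.strip():  # an empty response counts as a failure too
                return text
        except Exception:
            # Missing API key, HTTP error, timeout, bad JSON: same answer.
            # Fall through rather than crash the document pipeline.
            continue
    return ""  # every engine failed; the caller logs it and moves on

# Order encodes the policy: paid cloud primary, free local fallbacks.
# text = ocr_with_fallback(page, [gemini_ocr, paddle_ocr, tesseract_ocr])
```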

The reason is simple. If Google has an outage and we have no fallback, our document pipeline stops. Candidates cannot be onboarded. CVs cannot be parsed. Agencies lose hours of work. With PaddleOCR sitting one layer down, the worst case is "OCR quality degrades to what it used to be for the duration of the outage, then recovers". Service stays up. Nothing crashes. Nothing fails silently.

This pattern, paid cloud primary with a free local fallback, is something we use elsewhere in the stack and it has earned its keep more than once. Cloud services go down. Local fallbacks do not, because they are sitting on disk in your own container. Cost of running both: roughly zero, because the local engines only fire when the cloud one fails.

What this actually costs at our scale

At our current volume we process about 100,000 CVs a month across all agencies on the platform. Around 75% of those are text PDFs that we can read directly without any OCR. That leaves around 25,000 CVs a month that hit the OCR cascade. At an average of 2.5 pages per CV, that is about 62,500 pages going to Gemini each month. At $0.00025 per page, the total monthly bill comes to between $15 and $20.
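
Spelled out, with the per-page price at the low end of the range quoted earlier:

```python
cvs_per_month  = 100_000
ocr_share      = 0.25      # scanned or photographed CVs that need OCR
pages_per_cv   = 2.5
price_per_page = 0.00025   # Gemini Flash Lite, low end of our observed range

ocr_pages = cvs_per_month * ocr_share * pages_per_cv  # 62,500 pages
monthly   = ocr_pages * price_per_page                # $15.63
per_cv    = monthly / (cvs_per_month * ocr_share)     # $0.000625 per CV
print(f"{ocr_pages:,.0f} pages -> ${monthly:.2f}/month, ${per_cv:.6f}/CV")
```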

For a feature that decides whether 25,000 candidate records have correct contact details every month, $15 is not a number we lose sleep over. It works out at roughly $0.0006 per CV (less than a tenth of a cent) to guarantee the candidate's email and phone come out of OCR correctly. The alternative was either accepting the PaddleOCR failure modes for free, or paying ten times more for Datalab for no extra benefit. Gemini sat exactly where we needed it on the cost-accuracy curve.

If Google goes dark for an hour, maybe 30 CVs fall through to PaddleOCR with the old quality, and everything resumes the moment Gemini comes back. That is a trade-off we can live with.

What this means for agencies on Recruitly

Most of this work is invisible. You upload a CV (scanned, photographed, taken on a phone, sent over WhatsApp) and the Parse Any CV pipeline turns it into a clean candidate record with the email and phone intact. That is the entire user-facing surface of months of OCR work. But it matters, because clean contact data is what makes the rest of the stack actually work. Sourcing searches return real people you can dial. The AI copilot can run outreach without bouncing on bad email addresses. The Data Agent does not have to spend its time cleaning up OCR-introduced errors before it can do its real job.

The general principle this slots into is one we have written about before. Your own database is the asset, and AI on your data is the winning formula. None of that works if the data going into the database is corrupted at the point of entry. OCR is the boundary where most CRM databases pick up half their data-quality problems. Worth getting right.

If you want to see how this looks in practice, book a demo and we will run a stack of your actual CVs through the pipeline. If you would rather get hands-on, start a trial and upload a batch.
