How to Extract Text from a Scanned PDF (OCR Guide)



You've got a scanned PDF. Maybe it's a contract someone faxed you (yes, people still fax things in 2026). Maybe it's a receipt you photographed, or a textbook chapter your professor uploaded as a flat image. Either way, you try to select the text and... nothing happens. You can't copy it, can't search it, can't do anything useful with it.
That's because a scanned PDF isn't really a document. It's a picture of a document. Your computer sees pixels, not words. To get the actual text out, you need OCR — Optical Character Recognition. It's been around for decades, but it's gotten shockingly good in the last few years.
This guide walks you through how OCR works, the best ways to extract text from scanned PDFs, and what to watch out for so you don't end up with garbled nonsense.
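Before reaching for OCR, it can help to confirm that the PDF really has no text layer. The sketch below assumes the pdftotext utility from Poppler is installed. The helper names and the 20-character threshold are illustrative choices, not a standard.

```python
import subprocess
from pathlib import Path


def extract_text_layer(pdf: Path) -> str:
    """Return whatever embedded text pdftotext finds (empty for pure scans)."""
    result = subprocess.run(
        ["pdftotext", str(pdf), "-"],  # "-" sends the extracted text to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def looks_scanned(text_layer: str, min_chars: int = 20) -> bool:
    """Heuristic: almost no non-whitespace characters suggests image-only pages."""
    visible = "".join(text_layer.split())
    return len(visible) < min_chars
```

Calling looks_scanned(extract_text_layer(Path("contract.pdf"))) on a hypothetical file would return True for a flat scan and False for a PDF whose text you can already select.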
What OCR Actually Does
OCR software looks at an image and tries to figure out which shapes are letters. It sounds simple, but think about all the ways text can appear: different fonts, sizes, handwriting, smudged ink, weird angles, low resolution. The software has to handle all of it.
Modern OCR engines use machine learning models trained on millions of document images. Tesseract, which is open source and free, can recognize over 100 languages; it began at Hewlett-Packard and was later developed with Google's sponsorship. Adobe's OCR engine handles complex layouts with columns and tables. Apple has built-in OCR in macOS and iOS through its Live Text feature.
The accuracy you'll get depends on three things: the quality of your scan, the complexity of the layout, and which OCR tool you're using. A clean, high-resolution scan of typed text? You're looking at 98-99% accuracy. A blurry photo of a handwritten note? Maybe 60-70%, and you'll be doing a lot of manual cleanup.
Method 1: Use an Online OCR Tool
The fastest option for most people. No software to install, no accounts to create (usually), and it works on any device with a browser.
How it works:
- Upload your scanned PDF to an online OCR service
- The tool processes the image and identifies text
- You download the extracted text or a searchable PDF
Popular free options:
- OnlyDocs (onlydocs.net) — Upload your PDF and use the built-in text tools to work with your document. The editor handles common PDF tasks without requiring signups or downloads.
- Google Drive — Upload any PDF to Google Drive, then open it with Google Docs. Google automatically runs OCR on scanned pages. The formatting won't be perfect, but the text extraction is surprisingly accurate.
- SmallPDF — Has an OCR feature in their PDF-to-Word converter. Free tier limits you to two files per day.
The tradeoff with online tools is privacy. You're uploading your document to someone else's server. For a restaurant menu or a textbook page, that's fine. For sensitive legal or medical documents, think twice.
Method 2: Adobe Acrobat's Built-In OCR
If you're already paying for Adobe Acrobat Pro (around $23/month), it has one of the best OCR engines available. Adobe calls it "Recognize Text" or "Scan & OCR."
Steps in Adobe Acrobat Pro:
- Open your scanned PDF
- Click Tools → Scan & OCR
- Select Recognize Text → In This File
- Choose your language and output setting ("Searchable Image" keeps the visual look; "Editable Text and Images" tries to reconstruct the layout)
- Click Recognize Text and wait
Adobe handles multi-column layouts, tables, and mixed content (text plus images) better than most free tools. It also preserves the original appearance while adding a searchable text layer underneath — so the PDF looks the same, but now you can Ctrl+F through it.
The downside? Cost. If you don't already have Acrobat Pro, $23/month just to OCR a few documents is hard to justify. There are cheaper ways.
Method 3: Google Docs (Free, No Software)
This one surprises people. Google Drive has had built-in OCR for years, and it's genuinely good.
- Go to drive.google.com
- Upload your scanned PDF
- Right-click the file → Open with → Google Docs
- Google automatically extracts the text
The document that opens will have the extracted text below each page image. Formatting gets rough — tables usually fall apart, columns merge together, and headers might end up in weird places. But the actual text recognition is solid, especially for English documents.
What Google's OCR handles well: Standard typed text, common fonts, reasonably clear scans, documents in major languages.
Where it struggles: Handwriting, heavily stylized fonts, very low-resolution images, complex multi-column layouts.
This is my go-to recommendation for people who need to OCR something quickly and don't want to install anything or pay for anything.
Method 4: Tesseract OCR (Free, Open Source, Powerful)
For anyone comfortable with the command line, Tesseract is hard to beat. It's open source, completely free, and handles over 100 languages. Google sponsored its development for years, and researchers and developers have relied on it for a long time.
Install on Mac:
brew install tesseract
Install on Ubuntu/Debian:
sudo apt install tesseract-ocr
Basic usage (on an image of a single page):
tesseract scanned-page.png output-text
Tesseract adds the .txt extension itself, so this writes output-text.txt.
For a searchable PDF output:
tesseract scanned-page.png output pdf
Tesseract works best when you preprocess the image first — convert to grayscale, increase contrast, straighten any skew. Tools like ImageMagick can handle that automatically. The combination of ImageMagick preprocessing plus Tesseract OCR gives you results that rival paid software.
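The preprocessing steps above can be sketched as a small wrapper around ImageMagick. This assumes ImageMagick 7, where the executable is named magick (on version 6 it is convert). The function names are hypothetical, and the 40% deskew threshold is a commonly used starting value rather than a tuned setting.

```python
import subprocess
from pathlib import Path


def preprocess_command(source: Path, cleaned: Path) -> list[str]:
    """Build an ImageMagick command: grayscale, contrast stretch, deskew."""
    return [
        "magick", str(source),
        "-colorspace", "Gray",  # drop color information
        "-normalize",           # stretch contrast across the full range
        "-deskew", "40%",       # straighten small rotations
        str(cleaned),
    ]


def preprocess(source: Path, cleaned: Path) -> None:
    """Run the command; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(preprocess_command(source, cleaned), check=True)
```

Keeping the command in its own function makes it easy to print and inspect before running it on a whole batch of pages.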
One thing to know: Tesseract processes images, not PDFs directly. You'll need to extract the images from your PDF first (using a tool like pdfimages or pdftoppm) and then run Tesseract on each image. There are wrapper scripts that automate this, like ocrmypdf.
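As a rough illustration of that two-step flow, the sketch below renders each page with pdftoppm and then runs Tesseract on every image. It assumes both tools are on your PATH. The function names are made up for this example, and the 300 DPI default simply follows the scanning advice later in this guide.

```python
import subprocess
import tempfile
from pathlib import Path


def pdftoppm_command(pdf: Path, prefix: Path, dpi: int = 300) -> list[str]:
    """Render every page of the PDF to PNG files named <prefix>-<page>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def tesseract_command(image: Path, out_base: Path, lang: str = "eng") -> list[str]:
    """OCR one image; Tesseract writes <out_base>.txt on its own."""
    return ["tesseract", str(image), str(out_base), "-l", lang]


def page_number(image: Path) -> int:
    """Extract the page number from a name such as page-07.png."""
    return int(image.stem.rsplit("-", 1)[1])


def ocr_scanned_pdf(pdf: Path, lang: str = "eng", dpi: int = 300) -> str:
    """Return the recognized text of all pages, in page order."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        subprocess.run(pdftoppm_command(pdf, workdir / "page", dpi), check=True)
        pages = []
        for image in sorted(workdir.glob("page-*.png"), key=page_number):
            out_base = image.with_suffix("")  # page-07.png -> page-07
            subprocess.run(tesseract_command(image, out_base, lang), check=True)
            pages.append(out_base.with_suffix(".txt").read_text(encoding="utf-8"))
        return "\n".join(pages)
```

The lang argument maps to Tesseract's -l flag, so a German document would use "deu" once that language pack is installed.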
Method 5: OCRmyPDF (The Best Free Option for Batch Processing)
If Tesseract is the engine, OCRmyPDF is the car. It wraps Tesseract in a convenient command-line tool that handles PDFs directly, preserves the original formatting, and adds a searchable text layer.
Install:
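pip install ocrmypdf
pip installs only the Python package. Tesseract and Ghostscript must already be on your system, so install them through your package manager first.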
Usage:
ocrmypdf scanned-document.pdf searchable-document.pdf
That's it. One command, and you get a PDF that looks identical to the original but now has selectable, searchable text. OCRmyPDF can also rotate pages automatically, deskew them, and clean up background noise when you pass the matching options (--rotate-pages, --deskew, and --clean).
For batch processing a whole folder of scanned PDFs:
for f in *.pdf; do ocrmypdf "$f" "searchable_$f"; done
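One thing to watch with a loop like this: if you run it a second time in the same folder, the searchable_ files from the first run match *.pdf too and get processed again. A minimal sketch of a rerun-safe variant follows. It assumes ocrmypdf is on your PATH, the function names are hypothetical, and --skip-text tells OCRmyPDF to leave pages alone if they already contain text.

```python
import subprocess
from pathlib import Path


def plan_batch(pdf_names: list[str], prefix: str = "searchable_") -> list[tuple[str, str]]:
    """Pair each unprocessed input name with the output name it should get."""
    existing = set(pdf_names)
    jobs = []
    for name in sorted(pdf_names):
        if name.startswith(prefix):
            continue  # this file is the output of an earlier run
        target = prefix + name
        if target in existing:
            continue  # already converted, nothing to do
        jobs.append((name, target))
    return jobs


def run_batch(folder: Path) -> list[str]:
    """OCR every pending PDF in the folder; return the names that failed."""
    names = [p.name for p in folder.glob("*.pdf")]
    failed = []
    for source, target in plan_batch(names):
        result = subprocess.run(
            ["ocrmypdf", "--skip-text", str(folder / source), str(folder / target)]
        )
        if result.returncode != 0:
            failed.append(source)
    return failed
```

Collecting failures instead of stopping at the first one means a single damaged file does not block the rest of the folder.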
I've used this on folders of 200+ scanned receipts and it worked through every one of them without issues. The accuracy was good enough that I could search for specific vendor names and amounts without manually checking each file.
Tips for Better OCR Results
OCR accuracy depends heavily on input quality. Here's how to get the best results:
Scan at 300 DPI or higher. Many scanners and scanning apps default to a lower resolution such as 150 DPI, which looks fine on screen but gives OCR engines less data to work with. 300 DPI is the sweet spot; going higher gives diminishing returns unless you're dealing with very small text.
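If you only have an image file and aren't sure what resolution it amounts to, the arithmetic is simple: pixels across divided by inches across. The sketch below uses hypothetical helper names and takes the 300 DPI target from the tip above.

```python
def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Resolution implied by an image that spans the full page width."""
    if page_width_inches <= 0:
        raise ValueError("page width must be positive")
    return pixel_width / page_width_inches


def meets_ocr_target(pixel_width: int, page_width_inches: float,
                     target_dpi: float = 300.0) -> bool:
    """True when the image carries at least target_dpi worth of detail."""
    return effective_dpi(pixel_width, page_width_inches) >= target_dpi


# US Letter paper is 8.5 inches wide.
print(effective_dpi(1275, 8.5))     # 150.0
print(meets_ocr_target(2550, 8.5))  # True
```

For a phone photo, use the width of the paper itself, not the width of the whole frame, since any background around the page adds pixels but no detail.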
Use black and white mode for text documents. Color scans create larger files and can confuse OCR engines with background patterns or colored paper. If your document is just text, scan in grayscale or black and white.
Straighten before processing. Even a 2-3 degree tilt can hurt accuracy. Most scanner software has auto-deskew. If you're photographing a document, try to keep the camera perpendicular to the page.
Check the language setting. OCR engines use language-specific dictionaries to improve accuracy. If you're processing a German document but your OCR tool is set to English, you'll get worse results — especially with characters like ü, ö, and ß.
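Proofread the output. Even the best OCR isn't perfect. Common errors include confusing "l" with "1", "O" with "0", and "rn" with "m". A targeted find-and-replace pass catches many of these, as long as you limit it to places where the swap is unambiguous, such as letters inside numbers.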
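A cleanup pass for those confusions might look like the sketch below. It is deliberately conservative: letters are turned into digits only inside tokens that already contain a digit, and "rn" becomes "m" only when a word list you supply confirms the result. The function names and the vocabulary argument are assumptions for illustration, and a few false positives are still possible (a chemical formula like O2, for instance).

```python
import re

# Letters that OCR output often contains where a digit was printed.
DIGIT_LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})

# A run of digits, lookalike letters, and separators that is not glued to
# other letters or digits on either side.
NUMBER_LIKE = re.compile(r"(?<![A-Za-z0-9])[0-9OolI][0-9OolI.,]*(?![A-Za-z0-9])")


def fix_digit_lookalikes(text: str) -> str:
    """Replace O/o/l/I with digits, but only inside number-like tokens."""
    def repair(match: re.Match) -> str:
        token = match.group(0)
        # Leave the token alone unless it already contains a real digit.
        if any(ch.isdigit() for ch in token):
            return token.translate(DIGIT_LOOKALIKES)
        return token
    return NUMBER_LIKE.sub(repair, text)


def fix_rn_confusions(text: str, vocabulary: set[str]) -> str:
    """Turn 'rn' into 'm' when only the repaired word is in the vocabulary."""
    def repair(match: re.Match) -> str:
        word = match.group(0)
        candidate = word.replace("rn", "m")
        if word.lower() not in vocabulary and candidate.lower() in vocabulary:
            return candidate
        return word
    return re.sub(r"[A-Za-z]*rn[A-Za-z]*", repair, text)
```

With a vocabulary containing "document" and "modern", the second function rewrites "docurnent" but leaves "modern" untouched.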
When OCR Won't Work Well
Let's be honest about the limitations. OCR isn't magic, and some documents will give you trouble no matter which tool you use.
Handwritten text is still hit-or-miss. Printed handwriting (block letters) works okay. Cursive? Forget it for most tools. Google's and Apple's handwriting recognition are the best options here, but expect to do significant manual correction.
Documents with complex backgrounds — think watermarks, colored paper, stamps overlapping text — can confuse OCR engines. Preprocessing to remove the background helps, but it's extra work.
Very old or degraded documents with faded ink, stains, or physical damage will always be challenging. Libraries and archives use specialized software (and sometimes manual transcription) for these.
Mathematical formulas and special notation don't OCR well with general-purpose tools. There are specialized tools like Mathpix for math, but they're a separate category.
Frequently Asked Questions
Is OCR 100% accurate?
No. Even under ideal conditions (clean scan, standard font, good resolution), OCR accuracy tops out around 99% for the best engines. That means roughly 1 error per 100 characters, so a full page of 2,000 to 3,000 characters might have 20-30 small mistakes. Always proofread important documents after OCR processing.
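The arithmetic behind that estimate is short enough to write out. The 2,500-character page length below is an assumed figure for a dense page of text, not a measured one.

```python
def expected_errors(char_count: int, accuracy: float) -> float:
    """Expected number of misread characters at a per-character accuracy."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return char_count * (1.0 - accuracy)


# An assumed 2,500-character page at 99% accuracy:
print(round(expected_errors(2500, 0.99)))  # 25
```

The same function shows why small accuracy differences matter: at 98% the same page would carry about twice as many errors.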
Can I OCR a PDF on my phone?
Yes. Both iPhone and Android have built-in OCR capabilities. On iPhone, the Files app and Notes app can scan documents with OCR. On Android, Google Lens and Google Drive both handle OCR. Microsoft Lens (free, both platforms) is another solid option that produces good results.
Does OCR work on handwritten documents?
It depends on the handwriting. Neat, printed handwriting in block letters works reasonably well with modern tools like Google Lens or Apple's Live Text. Cursive or messy handwriting still defeats most OCR software. For historical handwritten documents, services like Transkribus use specialized AI models trained specifically on handwriting.
What file formats can I get from OCR?
Most OCR tools can output plain text (.txt), searchable PDF (looks like the original but with a text layer), Word documents (.docx), or rich text (.rtf). The searchable PDF option is usually the best choice because it preserves the visual layout while making the text accessible.
The Bottom Line
Extracting text from scanned PDFs used to require expensive software and a lot of patience. Now you've got free options that handle 90% of use cases without any hassle. For quick one-off jobs, upload to Google Drive and open with Google Docs. For regular processing, install OCRmyPDF and automate the whole thing.
If you're working with PDFs regularly — editing, annotating, converting — check out OnlyDocs. It handles the common PDF tasks right in your browser, no installs or subscriptions required.
Whatever method you pick, remember: scan quality matters more than tool choice. A clean 300 DPI scan with a free tool will beat a blurry phone photo with the most expensive software every time.