You need to get text out of a PDF and it should be simple. Read the file, get the text, done. Except PDFs weren't built for text extraction—they were built for printing. And that distinction is the source of every problem you're about to run into.
A PDF doesn't store paragraphs. It doesn't store sentences. It stores individual text fragments positioned at exact x,y coordinates on a page. The word "Hello" might be five separate glyphs placed at five separate positions. "Reading order" is something your eyes reconstruct—the PDF format has no concept of it.
This means every text extraction library is actually doing two things: pulling the raw text content, and then guessing how those fragments assemble into the thing humans would call "the text." Some libraries do this well. Most don't. And for scanned PDFs (about 30% of PDFs in the wild), none of the pure JavaScript parsing libraries below can do anything at all, because the pages are images, not text.
This guide covers four approaches with working code, honest tradeoffs, and a decision framework so you can stop reading Stack Overflow threads and start shipping.
Method 1: pdf-parse (pdf.js wrapper)
The most popular npm package for PDF text extraction. Over 2 million weekly downloads. It's a thin wrapper around Mozilla's pdf.js that gives you a dead-simple API: pass in a buffer, get text back.
npm install pdf-parse
import fs from 'fs'
import pdfParse from 'pdf-parse'
const buffer = fs.readFileSync('./report.pdf')
const data = await pdfParse(buffer)
console.log(data.text) // all text, all pages
console.log(data.numpages) // total page count
console.log(data.info.Title) // PDF metadata
Need text per page? Use the pagerender callback:
const pageTexts = []
function renderPage(pageData) {
  return pageData.getTextContent().then(content => {
    const text = content.items
      .map(item => item.str)
      .join(' ')
    pageTexts.push(text)
    return text
  })
}
await pdfParse(buffer, { pagerender: renderPage })
console.log(pageTexts[0]) // page 1 text
console.log(pageTexts[1]) // page 2 text
The problems
- Abandoned. The last publish to npm was in 2020. The repo is essentially archived. You're building on a library that won't get security patches, bug fixes, or compatibility updates with newer Node versions.
- Layout reconstruction is naive. pdf-parse joins text fragments in the order the PDF stores them, with spaces between. For single-column documents this is fine. For multi-column layouts, tables, or any complex structure, the output is scrambled gibberish.
- No scanned PDF support. If the PDF contains images instead of text (scanned documents, photographed pages), pdf-parse returns an empty string. No OCR. No fallback. Just silence.
- Encoding edge cases. Some PDFs use custom font encodings where character "A" is stored as an arbitrary glyph index. The underlying pdf.js handles most of this, but old Acrobat-generated PDFs and CJK text can produce garbled output.
For quick scripts that extract text from simple, modern PDFs—single-column, digitally created—pdf-parse is the fastest path. For anything else, it's a trap. The API is seductively simple, so you don't discover the problems until they hit production.
What scrambled extraction actually looks like
On a two-column annual report, pdf-parse does not hand back column 1 followed by column 2. The fragments from the two columns get interleaved because pdf-parse doesn't understand spatial layout. It just concatenates in document order.
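To make the interleaving concrete, here is a minimal sketch with made-up fragments. The text, the coordinates, and the 300-point column split are all hypothetical, not output from a real report. It assumes the PDF stores fragments row by row across the page, which is one common way a two-column file ends up scrambled.

```javascript
// Hypothetical fragments in the order a two-column PDF might store them:
// left to right across the page, one visual row at a time.
const fragments = [
  { text: 'Revenue grew 12%', x: 56, y: 700 },
  { text: 'Our headcount rose', x: 320, y: 700 },
  { text: 'year over year.', x: 56, y: 686 },
  { text: 'to 450 employees.', x: 320, y: 686 }
]

// What linear concatenation in stored order produces
const documentOrder = fragments.map(f => f.text).join(' ')

// What a reader expects: all of the left column, then all of the right.
// The 300-point split is an assumption that fits this made-up page.
const COLUMN_SPLIT_X = 300
const left = fragments.filter(f => f.x < COLUMN_SPLIT_X)
const right = fragments.filter(f => f.x >= COLUMN_SPLIT_X)
const readingOrder = [...left, ...right].map(f => f.text).join(' ')

console.log(documentOrder)
// Revenue grew 12% Our headcount rose year over year. to 450 employees.
console.log(readingOrder)
// Revenue grew 12% year over year. Our headcount rose to 450 employees.
```

Every word survives in the first output, which is why the problem is easy to miss: a length check or a keyword search passes, and only a human reading the sentences notices they no longer make sense.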
Method 2: pdf.js directly (pdfjs-dist)
Instead of using the pdf-parse wrapper, you can use Mozilla's pdf.js directly via the pdfjs-dist npm package. The payoff: you get x/y coordinates for every text fragment, which means you can reconstruct layout yourself.
npm install pdfjs-dist
import { getDocument } from 'pdfjs-dist/legacy/build/pdf.mjs'
async function extractText(pdfPath) {
  const doc = await getDocument(pdfPath).promise
  const pages = []
  for (let i = 1; i <= doc.numPages; i++) {
    const page = await doc.getPage(i)
    const content = await page.getTextContent()
    // Each item has str plus a transform matrix [a, b, c, d, e, f]:
    // transform[4] is x, transform[5] is y, and for unrotated text
    // transform[0] approximates the font size
    const items = content.items.map(item => ({
      text: item.str,
      x: item.transform[4],
      y: item.transform[5],
      fontSize: item.transform[0],
      fontName: item.fontName
    }))
    pages.push({ page: i, items })
  }
  return pages
}
const result = await extractText('./report.pdf')
// Now you have position data for each fragment:
// { text: "Revenue", x: 56.7, y: 702.3, fontSize: 12, fontName: "g_d0_f1" }
// Note: fontName is pdf.js's internal font ID, not the typeface name
With the position data, you can build layout reconstruction logic: group fragments by y-coordinate into lines, sort the fragments within each line by x-coordinate, detect columns by clustering x-values, and identify paragraphs by spacing gaps.
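The first two steps (group by y, sort by x) can be sketched in a few lines. This is a minimal illustration, not a full layout engine: the function name and the 2-point tolerance are assumptions, it takes the `{ text, x, y }` items produced by `extractText` above, and it makes no attempt at column or paragraph detection.

```javascript
// Group positioned fragments into lines, then order them for reading.
// yTolerance is an assumed value; tune it to your documents' line spacing.
function reconstructLines(items, yTolerance = 2) {
  const lines = []
  // PDF y grows upward, so descending y walks the page top to bottom
  const sorted = [...items].sort((a, b) => b.y - a.y)

  for (const item of sorted) {
    const current = lines[lines.length - 1]
    // Compare against the y of the first fragment that opened the line
    if (current && Math.abs(current.y - item.y) <= yTolerance) {
      current.items.push(item)
    } else {
      lines.push({ y: item.y, items: [item] })
    }
  }

  // Within each line, read left to right
  return lines.map(line =>
    line.items
      .sort((a, b) => a.x - b.x)
      .map(item => item.text)
      .join(' ')
  )
}

// Made-up fragments, deliberately out of order
const sample = [
  { text: 'Report', x: 120, y: 700.4 },
  { text: 'Q4', x: 56, y: 686 },
  { text: 'Annual', x: 56, y: 700 },
  { text: 'Results', x: 80, y: 686.3 }
]

console.log(reconstructLines(sample))
// [ 'Annual Report', 'Q4 Results' ]
```

The tolerance exists because fragments on the same visual line rarely share an identical y value. On a multi-column page this sketch would still merge both columns into one line, which is where the x-value clustering step comes in.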
The tradeoffs
- You get raw power. Position data, font info, font sizes. You can detect headings, identify columns, and reconstruct tables. Nothing else in the Node.js ecosystem gives you this level of control.
- But you're writing a layout engine. That "group by y, sort by x, detect columns" logic? It's hundreds of lines of code. And every edge case (rotated text, overlapping elements, variable line spacing) adds more. You're building something that already exists inside proper PDF extraction tools.
- Actively maintained. Unlike pdf-parse, pdfjs-dist gets regular updates from Mozilla. Security patches ship. New PDF spec features get implemented.
- Still no scanned PDF support. Same limitation as pdf-parse. If the page is an image, you get nothing.
- Node.js setup quirks. You need the legacy build path for Node.js (not the browser build). The import path has changed across versions, and the docs assume browser usage.
Use this approach when you need positional data—maybe you're extracting specific regions of a known PDF format (always the same template), or building a custom parser for a specific document type. For general-purpose "give me the text" extraction, it's overkill.
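For the known-template case, region extraction is mostly a filter over the positioned items. A minimal sketch, assuming the `{ text, x, y }` item shape from `extractText` above and coordinates in PDF points with the origin at the bottom-left of the page (the usual default, though rotated or offset pages differ). The function name, the box coordinates, and the invoice data are hypothetical.

```javascript
// Return the text of fragments whose origin falls inside a rectangle.
function textInRegion(items, region) {
  const { xMin, xMax, yMin, yMax } = region
  return items
    .filter(it => it.x >= xMin && it.x <= xMax && it.y >= yMin && it.y <= yMax)
    .sort((a, b) => b.y - a.y || a.x - b.x) // top to bottom, then left to right
    .map(it => it.text)
    .join(' ')
}

// Hypothetical template: the invoice total always sits in the bottom-right box
const TOTAL_BOX = { xMin: 400, xMax: 560, yMin: 80, yMax: 110 }

const pageItems = [
  { text: 'Invoice #1042', x: 56, y: 740 },
  { text: 'Total:', x: 410, y: 95 },
  { text: '$1,250.00', x: 470, y: 95 },
  { text: 'Thank you for your business', x: 56, y: 40 }
]

console.log(textInRegion(pageItems, TOTAL_BOX)) // Total: $1,250.00
```

This only tests each fragment's origin point, so a fragment that starts outside the box and runs into it is skipped. For a fixed template that is usually acceptable; measure the box once from real output and leave a few points of margin.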
Method 3: pdf-lib + custom extraction
Let's be upfront: pdf-lib is not a text extraction library. It's a PDF creation and editing library. But it shows up in search results for "extract text from PDF Node.js" because people try it, and you should know why it doesn't work for this use case.
import { PDFDocument } from 'pdf-lib'
import fs from 'fs'
const bytes = fs.readFileSync('./form.pdf')
const pdfDoc = await PDFDocument.load(bytes)
// You can read form fields...
const form = pdfDoc.getForm()
const fields = form.getFields()
fields.forEach(field => {
  console.log(field.getName(), field.constructor.name)
})
// ...and read metadata
console.log(pdfDoc.getTitle())
console.log(pdfDoc.getAuthor())
console.log(pdfDoc.getPageCount())
// But there's no getTextContent() or extractText() method.
// The page body text is NOT accessible through pdf-lib.
Why people try this (and why it fails)
pdf-lib can load an existing PDF, read its structure, extract form field values, and pull metadata. So developers think: "surely I can get the page text too." You can't. pdf-lib operates on PDF objects at a structural level but doesn't include a text content parser. It can add text to a PDF, but it can't read the text that's already there.
Technically, you could dig into the raw content stream via page.node.Contents(), decode it, and parse the text operators manually (Tj and TJ to show text, Tf to select the font). But at that point you're writing a PDF parser from scratch, and you might as well use pdfjs-dist, which already does this.
When pdf-lib is the right choice
- Reading form field values from fillable PDFs (tax forms, applications, surveys)
- Extracting metadata (title, author, creation date, page count)
- Modifying existing PDFs (add watermarks, merge documents, fill forms)
For body text extraction, use one of the other three methods.
Roughly 30% of PDFs you'll encounter in the wild are scanned documents: the pages are images, not text. Methods 1 and 2 will return an empty string for them, and pdf-lib (Method 3) never reads body text in the first place. No error, no warning, just nothing. Your code will "work" in testing with digitally-created PDFs and silently fail on scanned ones.
To extract text from scanned PDFs, you need OCR (Optical Character Recognition). In Node.js, that means either Tesseract.js (slow, mediocre accuracy on complex layouts) or an API with built-in OCR. PDFBase's AI extraction endpoint handles this automatically—it detects scanned pages and runs OCR as part of the extraction pipeline.
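Because the failure is silent, it helps to check extraction output before trusting it. Here is a minimal sketch of such a guard, assuming you already have one string per page (for example, the pageTexts array from the pagerender example earlier). The function name and the 20-character threshold are illustrative choices, not part of any library.

```javascript
// Flag pages whose extracted text is suspiciously short.
// minChars is an assumed threshold; blank or cover pages will also trip it.
function findLikelyScannedPages(pageTexts, minChars = 20) {
  return pageTexts
    .map((text, index) => ({
      page: index + 1,
      chars: text.replace(/\s+/g, '').length // ignore whitespace-only output
    }))
    .filter(p => p.chars < minChars)
    .map(p => p.page)
}

// Made-up extraction results for a four-page document
const samplePages = [
  'Annual Report 2023: revenue, costs, and outlook for the year.',
  '',
  '  \n ',
  'Page 4 has real text as well, more than twenty characters.'
]

console.log(findLikelyScannedPages(samplePages)) // [ 2, 3 ]
```

A non-empty result is a signal to route those pages to OCR or at least log a warning, rather than storing an empty string as if it were the document's content.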
Method 4: PDFBase API
Send a PDF, get structured text back. No layout reconstruction. No scanned PDF anxiety. No encoding nightmares. The API handles all of it.
import PDFBase from 'pdfbase'
import fs from 'fs'
const client = new PDFBase('pk_live_...')
// Basic text extraction
const result = await client.extract.text({
  file: fs.readFileSync('./report.pdf')
})
console.log(result.data.text) // full extracted text
console.log(result.data.pages) // total page count
Per-page extraction
Need text broken down by page? The response includes it:
const result = await client.extract.text({
  file: fs.readFileSync('./report.pdf'),
  perPage: true
})
// With perPage: true, result.data.pages is an array rather than a count:
// [
//   { page: 1, text: "Annual Report..." },
//   { page: 2, text: "Q4 Results..." },
//   ...
// ]
result.data.pages.forEach(p => {
  console.log(`Page ${p.page}: ${p.text.slice(0, 100)}...`)
})
AI extraction for scanned PDFs
For scanned documents or PDFs with complex layouts where standard extraction falls short, use the AI extraction endpoint. It combines OCR with layout-aware parsing:
// AI-powered extraction: OCR + layout reconstruction
const result = await client.extract.ai({
  file: fs.readFileSync('./scanned-contract.pdf'),
  prompt: 'Extract all text, preserving paragraph structure'
})
// Structured output with paragraphs, not a blob
console.log(result.data.content)
// Works on:
// - Scanned documents (OCR)
// - Multi-column layouts
// - PDFs with custom font encodings
// - Mixed text + image pages
Why an API for extraction
- Layout reconstruction is a solved problem—on the server. PDFBase runs spatial analysis to correctly separate columns, identify paragraph boundaries, and reconstruct reading order. You don't have to write that logic.
- Scanned PDFs just work. The API detects image-only pages and runs OCR automatically. No Tesseract installation, no language data downloads, no accuracy tuning.
- Encoding nightmares handled. Custom font CMaps, Identity-H encodings, ToUnicode tables: the kind of edge cases that make pdf.js produce garbled output on certain PDFs. The API normalizes all of this.
- Error handling in batch pipelines. When you're processing thousands of PDFs, some will be malformed, password-protected, or corrupt. The API returns structured errors instead of crashing your process. That matters at scale.
For the API reference and advanced options (structured data extraction, table extraction, format conversion), see the PDFBase docs.
A note on table extraction
If what you actually need is tabular data from a PDF, none of the text extraction methods above will give you clean results. Text extraction flattens tables into lines of text—columns merge, alignment is lost, and you're left trying to regex your way through a mess.
Table extraction is a fundamentally different problem that requires spatial analysis: detecting cell boundaries, understanding row/column structure, and mapping content to a grid. If that's your use case, check our PDF Table Extractor tool which returns structured JSON or CSV from tables in any PDF.
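To show what "mapping content to a grid" means in practice, here is a deliberately simplified sketch. It assumes positioned `{ text, x, y }` fragments (the shape pdfjs-dist gives you in Method 2) and, importantly, that you already know the left edge of each column. Supplying the edges by hand skips the hard part, cell boundary detection, and the sketch ignores merged cells and rows that wrap onto a second line. The function name and sample data are hypothetical.

```javascript
// Map positioned fragments onto a grid, given known column boundaries.
// columnEdges lists the left edge of each column in ascending order.
function toGrid(items, columnEdges, yTolerance = 2) {
  const rows = []
  const sorted = [...items].sort((a, b) => b.y - a.y) // top to bottom

  for (const item of sorted) {
    let row = rows[rows.length - 1]
    if (!row || Math.abs(row.y - item.y) > yTolerance) {
      row = { y: item.y, cells: columnEdges.map(() => []) }
      rows.push(row)
    }
    // Pick the last column whose left edge is at or before the fragment
    let col = 0
    columnEdges.forEach((edge, i) => {
      if (item.x >= edge) col = i
    })
    row.cells[col].push(item)
  }

  // Join the fragments inside each cell left to right
  return rows.map(row =>
    row.cells.map(cell =>
      cell.sort((a, b) => a.x - b.x).map(it => it.text).join(' ')
    )
  )
}

const tableItems = [
  { text: 'Region', x: 56, y: 500 },
  { text: 'Q3', x: 200, y: 500 },
  { text: 'Q4', x: 300, y: 500 },
  { text: 'North', x: 56, y: 484 },
  { text: 'America', x: 90, y: 484 },
  { text: '1.2M', x: 200, y: 484 },
  { text: '1.5M', x: 300, y: 484 }
]

console.log(toGrid(tableItems, [56, 200, 300]))
// [ [ 'Region', 'Q3', 'Q4' ], [ 'North America', '1.2M', '1.5M' ] ]
```

Compare this with plain text extraction, which would flatten the second row to "North America 1.2M 1.5M" with nothing to say where one cell ends and the next begins.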
Comparison
Here's how all four methods stack up across the dimensions that matter for production text extraction.
| | pdf-parse | pdfjs-dist | pdf-lib | PDFBase API |
|---|---|---|---|---|
| Scanned PDFs | No | No | No | Yes (OCR built-in) |
| Layout Accuracy | Poor (linear concat) | Good (with custom code) | N/A | Excellent (auto) |
| Table Data | No | Manual (from positions) | No | Yes (dedicated endpoint) |
| Maintained | No (last: 2020) | Yes (Mozilla) | Yes (community) | Yes (managed) |
| Speed (10-page doc) | ~100ms | ~150ms | N/A | ~300ms (incl. network) |
| Setup | npm install | npm install + config | npm install | npm install + API key |
| Best For | Quick scripts, simple PDFs | Custom parsers, position data | Form fields, metadata only | Production pipelines, any PDF |
On speed: pdf-parse and pdfjs-dist are faster on a per-call basis because they run locally with no network round-trip. But for batch processing, the API wins because it handles errors gracefully—a corrupt PDF won't crash your process. And when you factor in the code you'd need to write for layout reconstruction, encoding normalization, and OCR fallback, the "slower" API gives you output quality that would take weeks to replicate locally.
The encoding nightmare (and why it matters)
Here's something most tutorials skip: not all PDFs store text as text. Some PDFs—especially older ones generated by Adobe Acrobat, InDesign, or professional typesetting tools—use custom font encodings where the character "A" isn't stored as the Unicode codepoint for "A." It's stored as an arbitrary glyph index that means "A" only in the context of that specific embedded font.
pdf.js (and by extension pdf-parse) handles most of this through ToUnicode mapping tables embedded in the PDF. But when those tables are missing or incorrect—which happens more than you'd like with CJK text, mathematical symbols, and ligatures—you get output like %$#@ where you expected readable text.
This is the kind of edge case that's invisible in testing (your test PDFs were probably created with modern tools) and surfaces in production when a customer uploads a PDF exported from a 2008 version of InDesign.
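One way to make this failure visible before a customer reports it is to score extracted text for characters that rarely appear in real prose. The sketch below is a heuristic under stated assumptions: it treats U+FFFD replacement characters, Private Use Area code points, and stray control characters as suspicious, since broken mappings often surface that way. It will not catch text that is mis-mapped to ordinary printable symbols (the %$#@ case), which needs a different check such as a dictionary-word ratio. The function name is hypothetical.

```javascript
// Estimate how much of a string looks like encoding debris.
// Counts U+FFFD replacement characters, Private Use Area code points,
// and control characters other than tab, newline, and carriage return.
function suspiciousCharRatio(text) {
  const chars = [...text] // iterate by code point, not UTF-16 unit
  if (chars.length === 0) return 0
  const suspicious = chars.filter(ch => {
    const cp = ch.codePointAt(0)
    return (
      cp === 0xfffd ||
      (cp >= 0xe000 && cp <= 0xf8ff) ||
      (cp < 0x20 && cp !== 0x09 && cp !== 0x0a && cp !== 0x0d)
    )
  })
  return suspicious.length / chars.length
}

console.log(suspiciousCharRatio('Quarterly revenue')) // 0
console.log(suspiciousCharRatio('\ue001\ue002\ue003ok')) // 0.6
```

What ratio counts as "too high" depends on your documents, so treat any cutoff as something to calibrate against files you know are good.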
Which method should you use?
Don't overthink it. Match your situation:
Simple text from modern, single-column PDFs
Use pdf-parse. Five lines of code, works great for digitally-created PDFs with straightforward layouts. Just know its limits—and know it hasn't been updated since 2020.
Need position data, font info, or building a custom parser
Use pdfjs-dist directly. You get x/y coordinates for every text fragment. Ideal when you're parsing a known PDF template where layout positions are predictable.
Reading form fields or metadata from PDFs
Use pdf-lib. It's the right tool for interacting with PDF structure—forms, metadata, merging, watermarks. Just don't expect it to extract body text.
Scanned PDFs, complex layouts, production pipeline, or you just want it to work
Use an API like PDFBase. Layout reconstruction, OCR, encoding normalization, error handling for malformed files—all handled. The 300ms round-trip is a trade you should make for output quality and reliability.
Performance considerations
Raw speed isn't the only performance metric that matters. Here's the full picture:
- pdf-parse (~100ms for 10 pages): Fastest because it does the least work. Linear concatenation without layout analysis. If accuracy matters, this speed is misleading—you'll spend time post-processing the scrambled output.
- pdfjs-dist (~150ms for 10 pages): Slightly slower because you're iterating pages and processing text items individually. Still fast, but your custom layout code adds more.
- PDFBase API (~300ms for 10 pages): Includes network round-trip. For single-document extraction, this is "slower." For batch processing hundreds or thousands of PDFs, the API's error handling, retry logic, and consistent output format save you far more time than the 200ms difference per call.
If you're building an ETL pipeline processing 10,000 PDFs, the question isn't "which is faster per call." It's "which one won't crash at PDF #4,237 because of a malformed cross-reference table." The API approach handles that gracefully. A local library throws, and unless every call is wrapped in its own error handling, that one file takes the rest of the batch down with it.
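With a local library, that per-file isolation is something you write yourself. A minimal sketch, assuming any async function that takes a file and resolves to text (a wrapper around pdf-parse or pdfjs-dist, say). The names extractBatch and fakeExtract are hypothetical, and the stand-in extractor exists only to simulate one corrupt file.

```javascript
// Run an extraction function over many files without letting one failure
// stop the batch. extractFn is any async (file) => text function.
async function extractBatch(files, extractFn) {
  const succeeded = []
  const failed = []
  for (const file of files) {
    try {
      succeeded.push({ file, text: await extractFn(file) })
    } catch (err) {
      // Record the failure and keep going
      failed.push({ file, error: err instanceof Error ? err.message : String(err) })
    }
  }
  return { succeeded, failed }
}

// Stand-in extractor for demonstration: rejects on one hypothetical file name
async function fakeExtract(file) {
  if (file === 'corrupt.pdf') throw new Error('Invalid cross-reference table')
  return `text of ${file}`
}

const batchResult = extractBatch(['a.pdf', 'corrupt.pdf', 'b.pdf'], fakeExtract)
batchResult.then(({ succeeded, failed }) => {
  console.log(succeeded.length, failed.length) // 2 1
})
```

This catches thrown errors and rejected promises. It does not protect against a parse that hangs or exhausts memory; for those you would need timeouts or a separate worker process per file, which is part of the operational work a hosted API takes off your hands.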
Wrapping up
PDF text extraction in Node.js sounds like a solved problem. It isn't. The format was designed for visual fidelity, not data extraction, and that fundamental mismatch creates every issue you'll encounter: scrambled layout order, missing text from scanned pages, encoding failures, and table data that turns to soup.
For quick-and-dirty scripts on simple PDFs, pdf-parse is still the fastest path. For position-aware parsing of known templates, pdfjs-dist gives you the control you need. For everything else—especially production pipelines where you can't predict what PDFs users will upload—an API that handles the hard parts is the pragmatic choice.
Try the free PDF Text Extractor tool to see the output quality firsthand, or grab 100 free API credits to integrate it into your project. The docs cover all extraction endpoints, including structured data extraction and table parsing.