extracting tables from PDFs using Python
Source: k1ito-tech · Author: k1ito · 2 months ago

Here’s a practical, up-to-date comparison of the main Python options for extracting tables from PDFs—what they’re best at, trade-offs, and quickstart code.

#TL;DR — Which should you use?

  • Digital PDFs with clear ruling/borders: Start with Camelot (lattice/stream modes). (Camelot, Read the Docs)
  • Digital PDFs (simple/fast, Java OK): tabula-py (wrapper of tabula-java). Great defaults, very fast. Requires Java. (Tabula-Py)
  • Mixed/quirky layouts & you want control: pdfplumber (fine-grained, tunable table finding). (GitHub)
  • Scanned/image PDFs: Pre-OCR with OCRmyPDF, then run Camelot/tabula/pdfplumber. (OCRmyPDF, GitHub, Ubuntu Manpages)
  • Complex documents (forms, multi-span/merged cells) or images with no ruling: ML approach like Microsoft Table Transformer (TaTR) via Hugging Face / layoutparser. Heavier, but most robust for detection/structure. (GitHub)

##Side-by-side comparison

| Library | Best for | Accuracy | Speed | Setup friction | Notes |
|---|---|---|---|---|---|
| Camelot | Digital PDFs with lines or consistent spacing | High (lattice); good (stream) | Fast | Pure Python | Two modes: lattice (uses ruling lines) and stream (white-space). Exports DataFrame/CSV easily. (Camelot) |
| tabula-py | Bulk extraction from consistent PDFs | High on “regular” tables | Very fast | Needs Java (tabula-java) | Simple API (read_pdf, convert_into), good for pipelines. (Tabula-Py) |
| pdfplumber | Tricky layouts & fine control | Good→High (with tuning) | Medium | Pure Python | Exposes characters/lines/rects; you can tune table settings per page. (GitHub) |
| OCRmyPDF (pre-step) | Scanned/image PDFs | n/a | Medium | External CLI | Adds searchable text layer; then use one of the above. (OCRmyPDF, GitHub) |
| Table Transformer (TaTR) | Complex or image-only tables | High (best on hard layouts) | Slowest | ML model + deps | Detects table regions & structure; works on PDFs/images. (GitHub) |

##Minimal working examples

###1) Camelot

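```python
import camelot

# flavor="lattice" uses ruling lines; use flavor="stream" for whitespace-separated tables
tables = camelot.read_pdf("file.pdf", pages="1-3", flavor="lattice")
df = tables[0].df  # first table as a pandas DataFrame
tables.export("out.csv", f="csv")  # writes one CSV per table
```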

Camelot’s lattice mode relies on ruling lines; use stream for whitespace-based tables. (Camelot)
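Since the right mode depends on the page, one option is to run both flavors and keep whichever parses more cleanly. The sketch below is only an illustration under stated assumptions: `mean_accuracy`, `pick_flavor`, `extract_with_fallback`, and the 80.0 threshold are hypothetical names and values, not part of Camelot. It relies on the `accuracy` field that Camelot exposes in each table's `parsing_report`.

```python
from statistics import mean


def mean_accuracy(reports):
    """Average the 'accuracy' field over a list of Camelot parsing reports."""
    return mean(r["accuracy"] for r in reports) if reports else 0.0


def pick_flavor(reports_by_flavor, min_accuracy=80.0):
    """Return the first flavor whose mean accuracy clears the threshold,
    otherwise the flavor with the highest mean accuracy."""
    for flavor, reports in reports_by_flavor.items():
        if mean_accuracy(reports) >= min_accuracy:
            return flavor
    return max(reports_by_flavor, key=lambda f: mean_accuracy(reports_by_flavor[f]))


def extract_with_fallback(path, pages="1", min_accuracy=80.0):
    """Run both flavors (lattice listed first) and return the preferred TableList."""
    import camelot  # imported lazily so the helpers above work without it

    results = {
        flavor: camelot.read_pdf(path, pages=pages, flavor=flavor)
        for flavor in ("lattice", "stream")
    }
    reports = {f: [t.parsing_report for t in tl] for f, tl in results.items()}
    return results[pick_flavor(reports, min_accuracy)]
```

A call such as `extract_with_fallback("file.pdf", pages="1-3")` would prefer lattice when it parses well and otherwise fall back to stream. The threshold is a guess and probably needs tuning per document family.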

###2) tabula-py

```python
import tabula

# returns a list of DataFrames, one per detected table
dfs = tabula.read_pdf("file.pdf", pages="all", multiple_tables=True)

# or write everything straight to a CSV file
tabula.convert_into("file.pdf", "out.csv", output_format="csv", pages="all")
```

Requires Java because it wraps tabula-java. Great for batch jobs. (Tabula-Py)
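For a batch job, a loop over a folder of PDFs might look like the following sketch. The helper names `csv_path_for` and `extract_folder` and the output naming scheme are assumptions made for illustration; only `tabula.read_pdf` and its `area` option come from tabula-py.

```python
from pathlib import Path


def csv_path_for(pdf_path, table_index, out_dir):
    """Build an output name like out/report_t0.csv for table 0 of report.pdf."""
    return Path(out_dir) / f"{Path(pdf_path).stem}_t{table_index}.csv"


def extract_folder(in_dir, out_dir, area=None):
    """Write every table found in every PDF under in_dir to its own CSV."""
    import tabula  # lazy import: needs tabula-py and a Java runtime

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(in_dir).glob("*.pdf")):
        # area is [top, left, bottom, right] in PDF points; None scans the whole page
        dfs = tabula.read_pdf(str(pdf), pages="all", multiple_tables=True, area=area)
        for i, df in enumerate(dfs):
            target = csv_path_for(pdf, i, out_dir)
            df.to_csv(target, index=False)
            written.append(target)
    return written
```

Passing a fixed `area` is mainly useful when every report places its table in the same region of the page.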

###3) pdfplumber

```python
import pdfplumber
import pandas as pd

with pdfplumber.open("file.pdf") as pdf:
    page = pdf.pages[0]
    # extract_table returns a list of rows, or None if no table is found
    table = page.extract_table(
        table_settings={"vertical_strategy": "lines", "horizontal_strategy": "lines"}
    )

if table:
    # treat the first row as the header
    df = pd.DataFrame(table[1:], columns=table[0])
```

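table_settings lets you adapt to non-standard layouts: each strategy can be "lines", "lines_strict", "text", or "explicit", and tolerances such as snap_tolerance and join_tolerance control how nearby edges are merged. (GitHub)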
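As a rough illustration of tuning these settings for a table without ruling lines, the sketch below switches both strategies to "text" and tidies the raw output. The function names `clean_table` and `extract_borderless` are hypothetical, and the tolerance values are simply pdfplumber's defaults written out as a starting point.

```python
def clean_table(raw):
    """Normalize pdfplumber's list-of-rows output: None becomes '', whitespace
    (including in-cell newlines) is collapsed, and fully empty rows are dropped."""
    cleaned = []
    for row in raw or []:
        cells = [" ".join((cell or "").split()) for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def extract_borderless(path, page_number=0):
    """Extract the largest table on one page using text alignment instead of lines."""
    import pdfplumber  # lazy import so clean_table can be used on its own

    settings = {
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
        "snap_tolerance": 3,   # how close edges must be to snap together
        "join_tolerance": 3,   # how close segments must be to be joined
    }
    with pdfplumber.open(path) as pdf:
        raw = pdf.pages[page_number].extract_table(table_settings=settings)
    return clean_table(raw)
```

Keeping the cleanup in a separate function makes it easy to reuse on output from the "lines" strategy as well.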

###4) Scanned PDFs: OCRmyPDF → (Camelot/tabula/pdfplumber)

```bash
ocrmypdf --deskew --clean input_scanned.pdf input_searchable.pdf
```

Then run Camelot/tabula/pdfplumber on input_searchable.pdf. (OCRmyPDF, GitHub)
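If the pipeline receives a mix of digital and scanned files, the OCR step can be made conditional. This is a minimal sketch under assumptions: `needs_ocr`, `build_ocr_command`, `ensure_searchable`, and the 20-character threshold are illustrative choices, and it assumes the `ocrmypdf` CLI is on the PATH (its `--clean` option additionally needs unpaper installed).

```python
import subprocess


def needs_ocr(page_texts, min_chars=20):
    """Flag the file for OCR if any page yields fewer than min_chars of text."""
    return any(len((text or "").strip()) < min_chars for text in page_texts)


def build_ocr_command(src, dst, skip_text=True):
    """Assemble the ocrmypdf command line as a list of arguments."""
    cmd = ["ocrmypdf", "--deskew", "--clean"]
    if skip_text:
        # leave pages that already carry a text layer untouched
        cmd.append("--skip-text")
    return cmd + [src, dst]


def ensure_searchable(src, dst):
    """Return a path to a PDF with a text layer, running OCR only when needed."""
    import pdfplumber  # lazy import; used only to probe for existing text

    with pdfplumber.open(src) as pdf:
        texts = [page.extract_text() for page in pdf.pages]
    if not needs_ocr(texts):
        return src
    subprocess.run(build_ocr_command(src, dst), check=True)
    return dst
```

The returned path can then be handed to Camelot, tabula-py, or pdfplumber as usual.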

###5) ML route: Microsoft Table Transformer (TaTR)

  • Typical stack: layoutparser/transformers + TaTR weights → detect tables → reconstruct structure → export CSV/DF. Best on complex images and messy scans; heavier to set up and slower. (GitHub)
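The detection step of that stack could look roughly like the sketch below, which uses the Hugging Face transformers classes for Table Transformer. Several things here are assumptions not stated in the original: the checkpoint name `microsoft/table-transformer-detection`, the 0.9 score threshold, the 10-pixel padding, and the helper names `pad_box` and `detect_tables`. It covers only region detection on a page image; structure recognition and CSV export would be further steps.

```python
def pad_box(box, pad, width, height):
    """Grow an (x0, y0, x1, y1) box by pad pixels, clamped to the image bounds."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - pad), max(0, y0 - pad),
            min(width, x1 + pad), min(height, y1 + pad))


def detect_tables(image_path, threshold=0.9, pad=10):
    """Return cropped PIL images, one per detected table region."""
    # heavy dependencies are imported lazily
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, TableTransformerForObjectDetection

    name = "microsoft/table-transformer-detection"  # assumed checkpoint
    processor = AutoImageProcessor.from_pretrained(name)
    model = TableTransformerForObjectDetection.from_pretrained(name)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # target_sizes expects (height, width); PIL's size is (width, height)
    sizes = torch.tensor([image.size[::-1]])
    result = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=sizes
    )[0]

    crops = []
    for box in result["boxes"].tolist():
        rounded = [round(v) for v in box]
        crops.append(image.crop(pad_box(rounded, pad, *image.size)))
    return crops
```

A little padding around each detected box tends to help downstream structure recognition or OCR, since tight crops can clip border cells.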

##Recommended decision flow

  1. Is it scanned? → Yes: OCRmyPDF, then try Camelot (lattice) → Camelot (stream) → pdfplumber. (OCRmyPDF)
  2. Digital with visible lines? → Camelot (lattice) first; fall back to stream. (Camelot)
  3. Bulk and consistent layout, Java OK? → tabula-py for speed and simplicity. (Tabula-Py)
  4. Weird/irregular layouts / need control? → pdfplumber with tuned settings. (GitHub)
  5. Still failing (merged cells, rotated, camera shots)? → TaTR (ML). (GitHub)
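The flow above can be written down as a small routing function that returns the tools to try, in order. This is a sketch only: `choose_extractors`, its flag names, and the string labels are hypothetical, and the default branch for PDFs that match none of the questions is an assumption rather than part of the flow.

```python
def choose_extractors(scanned=False, ruled_lines=False,
                      bulk_with_java=False, irregular=False):
    """Return an ordered list of tools to try, following the decision flow."""
    if scanned:
        # step 1: OCR first, then the usual digital-PDF candidates
        plan = ["ocrmypdf", "camelot-lattice", "camelot-stream", "pdfplumber"]
    elif ruled_lines:
        # step 2: visible lines favour lattice, with stream as fallback
        plan = ["camelot-lattice", "camelot-stream"]
    elif bulk_with_java:
        # step 3: consistent bulk jobs where a Java runtime is acceptable
        plan = ["tabula-py"]
    elif irregular:
        # step 4: layouts that need per-page tuning
        plan = ["pdfplumber"]
    else:
        # assumed default when none of the questions applies
        plan = ["camelot-stream", "pdfplumber"]
    # step 5: the ML route is always the last resort
    return plan + ["table-transformer"]
```

For example, `choose_extractors(scanned=True)` yields the OCR step followed by Camelot, pdfplumber, and finally the ML fallback.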

##Practical tips

  • Try multiple modes (Camelot lattice vs stream) per page; some pages switch styles. (Camelot)
  • For rotated pages, rotate first (pdfplumber can detect/rotate). (GitHub)
  • With tabula-py, you can pass area or templates for consistent cropping; great for reports where tables live in fixed regions. (Tabula-Py)
  • For tables without ruling lines, rely on whitespace clustering (Camelot stream / pdfplumber text strategies) and increase the tolerances if rows or columns are being split. (Camelot, GitHub)
  • For end-to-end pipelines, standardize outputs to CSV/Parquet and add validation (row/column counts, headers present, totals match).
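The validation idea in the last tip can be sketched with plain lists, independent of which extractor produced them. Everything named here is hypothetical (`validate_table`, `to_number`, the "Total" row label, the 0.01 tolerance), and it assumes the first row is the header and that the totals row is labelled in the first column.

```python
def to_number(cell):
    """Parse a numeric cell such as '1,002.75' into a float."""
    return float(cell.replace(",", ""))


def validate_table(rows, expected_headers, total_column=None,
                   total_label="Total", tol=0.01):
    """Return a list of problems found in an extracted table (empty if none)."""
    if not rows:
        return ["table is empty"]
    problems = []
    header, body = rows[0], rows[1:]

    # headers present
    missing = [h for h in expected_headers if h not in header]
    if missing:
        problems.append(f"missing headers: {missing}")

    # consistent column count (row numbers are 1-based within the body)
    ragged = [i for i, r in enumerate(body, start=1) if len(r) != len(header)]
    if ragged:
        problems.append(f"rows with wrong column count: {ragged}")

    # totals match: compare the stated total with the sum of the data rows
    if total_column in header and not ragged:
        col = header.index(total_column)
        totals = [r for r in body if r[0] == total_label]
        data = [r for r in body if r[0] != total_label]
        if totals:
            stated = to_number(totals[-1][col])
            summed = sum(to_number(r[col]) for r in data)
            if abs(stated - summed) > tol:
                problems.append(f"total mismatch: stated {stated}, summed {summed}")
    return problems
```

Running a check like this on every extracted table before writing CSV or Parquet makes silent extraction failures much easier to spot.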

