Here’s a practical, up-to-date comparison of the main Python options for extracting tables from PDFs—what they’re best at, trade-offs, and quickstart code.

#TL;DR — Which should you use?

Digital PDFs with clear ruling/borders: Start with Camelot (lattice/stream modes). (Camelot, Read the Docs)
Digital PDFs (simple/fast, Java OK): tabula-py (wrapper of tabula-java). Great defaults, very fast. Requires Java. (Tabula-Py)
Mixed/quirky layouts & you want control: pdfplumber (fine-grained, tunable table finding). (GitHub)
Scanned/image PDFs: Pre-OCR with OCRmyPDF, then run Camelot/tabula/pdfplumber. (OCRmyPDF, GitHub, Ubuntu Manpages)
Complex documents (forms, multi-span/merged cells) or images with no ruling: ML approach like Microsoft Table Transformer (TaTR) via Hugging Face / layoutparser. Heavier, but most robust for detection/structure. (GitHub)

##Side-by-side comparison

| Library | Best for | Accuracy | Speed | Setup friction | Notes |
|---|---|---|---|---|---|
| Camelot | Digital PDFs with lines or consistent spacing | High (lattice); good (stream) | Fast | Python package; lattice mode relies on OpenCV, and older releases also need Ghostscript | Two modes: lattice (uses ruling lines) and stream (whitespace). Exports DataFrame/CSV easily. (Camelot) |
| tabula-py | Bulk extraction from consistent PDFs | High on “regular” tables | Very fast | Needs Java (tabula-java) | Simple API (`read_pdf`, `convert_into`), good for pipelines. (Tabula-Py) |
| pdfplumber | Tricky layouts & fine control | Good→High (with tuning) | Medium | Pure Python | Exposes characters/lines/rects; you can tune table settings per page. (GitHub) |
| OCRmyPDF (pre-step) | Scanned/image PDFs | n/a | Medium | External CLI | Adds a searchable text layer; then use one of the above. (OCRmyPDF, GitHub) |
| Table Transformer (TaTR) | Complex or image-only tables | High (best on hard layouts) | Slowest | ML model + deps | Detects table regions & structure; works on page images, so PDFs must be rendered first. (GitHub) |

##Minimal working examples

###1) Camelot

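```python
import camelot

# flavor="lattice" uses ruling lines; switch to flavor="stream" for whitespace-separated tables
tables = camelot.read_pdf("file.pdf", pages="1-3", flavor="lattice")
df = tables[0].df  # first detected table as a pandas DataFrame
tables.export("out.csv", f="csv")  # writes one CSV file per table
```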

Camelot’s lattice mode relies on ruling lines; use stream for whitespace-based tables. (Camelot)
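
A minimal sketch of the lattice-then-stream fallback, assuming each table returned by `camelot.read_pdf` exposes a `parsing_report` dict with an `accuracy` entry. The function name `read_with_fallback`, the `min_accuracy` threshold of 80, and the injectable `reader` parameter are illustrative choices, not part of Camelot.

```python
def read_with_fallback(path, pages="1", min_accuracy=80.0, reader=None):
    """Try Camelot's lattice flavor first, then stream.

    Returns (flavor, tables) for the first flavor that yields at least one
    table whose reported accuracy reaches min_accuracy, or (None, []) if
    neither does. `reader` is a hypothetical hook for testing; by default
    camelot.read_pdf is used.
    """
    if reader is None:
        import camelot  # imported lazily so the helper can be tested without it
        reader = camelot.read_pdf

    for flavor in ("lattice", "stream"):
        tables = reader(path, pages=pages, flavor=flavor)
        # Keep only tables whose parsing report meets the accuracy threshold
        good = [t for t in tables if t.parsing_report.get("accuracy", 0) >= min_accuracy]
        if good:
            return flavor, good
    return None, []
```

Calling `read_with_fallback("file.pdf", pages="1-3")` would then return the flavor that worked along with its tables; the threshold is a starting point to tune against your own documents.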

###2) tabula-py

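```python
import tabula

# With multiple_tables=True, read_pdf returns a list of DataFrames
dfs = tabula.read_pdf("file.pdf", pages="all", multiple_tables=True)
# Convert every table in the PDF straight to a CSV file
tabula.convert_into("file.pdf", "out.csv", output_format="csv", pages="all")
```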

Requires Java because it wraps tabula-java. Great for batch jobs. (Tabula-Py)
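
For reports whose tables sit in a fixed region, cropping with `area` might look like the sketch below. tabula-py takes `area` as `[top, left, bottom, right]` in PDF points (1/72 inch); the helpers `inches_to_area` and `read_region` are hypothetical names introduced here for illustration.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def inches_to_area(top, left, bottom, right):
    """Convert a region measured in inches from the page's top-left corner
    into the [top, left, bottom, right] list of points used for `area`."""
    if bottom <= top or right <= left:
        raise ValueError("bottom/right must be greater than top/left")
    return [edge * POINTS_PER_INCH for edge in (top, left, bottom, right)]


def read_region(path, area, pages="all"):
    """Read tables only from the given region (requires Java and tabula-py)."""
    import tabula  # lazy import: only needed when actually reading a PDF

    return tabula.read_pdf(path, pages=pages, area=area, multiple_tables=True)
```

For example, `read_region("file.pdf", inches_to_area(1, 0.5, 5, 8))` would restrict extraction to a band starting one inch below the top of each page.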

###3) pdfplumber

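```python
import pandas as pd
import pdfplumber

with pdfplumber.open("file.pdf") as pdf:
    page = pdf.pages[0]
    # extract_table returns the largest table on the page as a list of rows,
    # or None if no table is found
    table = page.extract_table(
        table_settings={"vertical_strategy": "lines", "horizontal_strategy": "lines"}
    )

if table is None:
    raise ValueError("No table found on the first page")
# Treat the first row as the header
df = pd.DataFrame(table[1:], columns=table[0])
```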

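`table_settings` lets you adapt to non-standard layouts: each axis can use the `lines`, `lines_strict`, `text`, or `explicit` strategy, and tolerances such as `snap_tolerance` and `join_tolerance` control how nearby edges are merged. (GitHub)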
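
When a page has no ruling lines, one option is to switch both strategies to `text` and tidy up the result. The sketch below assumes `extract_table` returns a list of rows in which empty cells may be `None`; `clean_rows` and `extract_text_table` are hypothetical helper names, and the `snap_tolerance` value is an arbitrary starting point.

```python
def clean_rows(raw_rows):
    """Normalize a raw table: None -> "", strip whitespace, drop blank rows."""
    cleaned = []
    for row in raw_rows or []:
        cells = [(cell or "").strip() for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def extract_text_table(path, page_index=0):
    """Extract the largest table on a page using whitespace-based detection."""
    import pdfplumber  # lazy import so clean_rows can be used on its own

    settings = {
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
        "snap_tolerance": 4,  # assumed value; tune per document
    }
    with pdfplumber.open(path) as pdf:
        raw = pdf.pages[page_index].extract_table(table_settings=settings)
    return clean_rows(raw)
```

Keeping the cleanup separate from the extraction makes it easy to reuse on rows produced with other settings.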

###4) Scanned PDFs: OCRmyPDF → (Camelot/tabula/pdfplumber)

```bash
# --clean needs the external unpaper tool to be installed
ocrmypdf --deskew --clean input_scanned.pdf input_searchable.pdf
```

Then run Camelot/tabula/pdfplumber on input_searchable.pdf. (OCRmyPDF, GitHub)
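
To drive the OCR step from a Python pipeline, one possibility is a thin wrapper around the CLI. This is a sketch that assumes the `ocrmypdf` executable is on the PATH; `build_ocr_command` and `run_ocr` are hypothetical helper names, and `--skip-text` is included as an optional flag for files that mix scanned and digital pages.

```python
import shutil
import subprocess


def build_ocr_command(src, dst, deskew=True, clean=False, skip_text=False):
    """Build the ocrmypdf command line as a list of arguments."""
    cmd = ["ocrmypdf"]
    if deskew:
        cmd.append("--deskew")
    if clean:
        cmd.append("--clean")  # needs the external unpaper tool
    if skip_text:
        cmd.append("--skip-text")  # leave pages that already contain text alone
    cmd += [str(src), str(dst)]
    return cmd


def run_ocr(src, dst, **options):
    """Run ocrmypdf and raise if it is missing or exits with an error."""
    if shutil.which("ocrmypdf") is None:
        raise FileNotFoundError("ocrmypdf executable not found on PATH")
    subprocess.run(build_ocr_command(src, dst, **options), check=True)
    return dst
```

After `run_ocr("input_scanned.pdf", "input_searchable.pdf")` succeeds, the output file can be passed to any of the extractors above.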

###5) ML route: Microsoft Table Transformer (TaTR)

Typical stack: layoutparser/transformers + TaTR weights → detect tables → reconstruct structure → export CSV/DF. Best on complex images and messy scans; heavier to set up and slower. (GitHub)
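
As a rough sketch of the detection step only, the following uses the Hugging Face `transformers` classes with the `microsoft/table-transformer-detection` checkpoint on a page that has already been rendered to a PIL image. It assumes `torch`, `transformers`, and Pillow are installed; `pad_box` and `detect_tables` are hypothetical helper names, and the 0.9 threshold and 10-pixel padding are arbitrary defaults. Reconstructing rows and columns would need a further structure-recognition pass, which is not shown.

```python
def pad_box(box, image_size, padding=10):
    """Expand an (x0, y0, x1, y1) box by `padding` pixels, clamped to the image."""
    x0, y0, x1, y1 = box
    width, height = image_size
    return (
        max(0, x0 - padding),
        max(0, y0 - padding),
        min(width, x1 + padding),
        min(height, y1 + padding),
    )


def detect_tables(image, threshold=0.9):
    """Return padded bounding boxes of tables detected in an RGB PIL image."""
    import torch  # heavy dependencies are imported lazily
    from transformers import AutoImageProcessor, TableTransformerForObjectDetection

    name = "microsoft/table-transformer-detection"
    processor = AutoImageProcessor.from_pretrained(name)
    model = TableTransformerForObjectDetection.from_pretrained(name)

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # target_sizes is (height, width); PIL's image.size is (width, height)
    target_sizes = torch.tensor([image.size[::-1]])
    result = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]
    return [pad_box(box.tolist(), image.size) for box in result["boxes"]]
```

Each returned box can be passed to `image.crop(...)` to isolate a table before structure recognition or OCR.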

##Recommended decision flow

Is it scanned? → Yes: OCRmyPDF, then try Camelot (lattice) → Camelot (stream) → pdfplumber. (OCRmyPDF)
Digital with visible lines? → Camelot (lattice) first; fall back to stream. (Camelot)
Bulk and consistent layout, Java OK? → tabula-py for speed and simplicity. (Tabula-Py)
Weird/irregular layouts / need control? → pdfplumber with tuned settings. (GitHub)
Still failing (merged cells, rotated, camera shots)? → TaTR (ML). (GitHub)
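
The flow above can be sketched as a small helper that returns the tools to try in order. The function name, its flags, and the step labels are illustrative; treating "none of the above" as the pdfplumber case is an assumption of this sketch.

```python
def choose_extractors(scanned=False, ruled=False, bulk_java_ok=False):
    """Return an ordered list of steps to try, following the decision flow."""
    steps = []
    if scanned:
        # OCR first, then the usual text-layer extractors
        steps += ["ocrmypdf", "camelot-lattice", "camelot-stream", "pdfplumber"]
    elif ruled:
        steps += ["camelot-lattice", "camelot-stream"]
    elif bulk_java_ok:
        steps.append("tabula-py")
    else:
        # Irregular layouts, or nothing else applies (assumed default)
        steps.append("pdfplumber")
    steps.append("table-transformer")  # last resort if everything above fails
    return steps
```

For instance, `choose_extractors(scanned=True)` puts `"ocrmypdf"` first and the ML route last.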

##Practical tips

Try multiple modes (Camelot lattice vs stream) per page; some pages switch styles. (Camelot)
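For rotated pages, fix the orientation before extracting (pdfplumber reads each page's rotation setting, and OCRmyPDF's `--rotate-pages` option can correct scans during OCR). (GitHub)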
With tabula-py, you can pass area or templates for consistent cropping; great for reports where tables live in fixed regions. (Tabula-Py)
For tables without ruling lines, rely on whitespace clustering and loosen the tolerances: Camelot's stream flavor exposes `row_tol` and `column_tol`, and pdfplumber offers the `text` strategy. (Camelot, GitHub)
For end-to-end pipelines, standardize outputs to CSV/Parquet and add validation (row/column counts, headers present, totals match).
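
A minimal sketch of the validation checks mentioned in the last tip, working on a table given as a list of rows with the header first. The function name `validate_table`, its parameters, and the convention that the last row holds the total are assumptions made for illustration.

```python
def _to_number(value):
    """Parse a cell like "1,250.5" into a float."""
    return float(str(value).replace(",", ""))


def validate_table(rows, expected_headers, total_column=None, min_rows=1):
    """Return a list of problems found in the table; an empty list means it passed."""
    if not rows:
        return ["table is empty"]

    problems = []
    header, body = rows[0], rows[1:]

    missing = [h for h in expected_headers if h not in header]
    if missing:
        problems.append(f"missing headers: {missing}")
    if len(body) < min_rows:
        problems.append(f"expected at least {min_rows} data rows, got {len(body)}")
    ragged = [i for i, row in enumerate(body, start=1) if len(row) != len(header)]
    if ragged:
        problems.append(f"rows with wrong column count: {ragged}")

    # Optional check: the last row's value should equal the sum of the rows above it
    if total_column is not None and total_column in header and len(body) >= 2 and not ragged:
        col = header.index(total_column)
        try:
            values = [_to_number(row[col]) for row in body]
        except ValueError:
            problems.append(f"non-numeric value in column {total_column!r}")
        else:
            if abs(sum(values[:-1]) - values[-1]) > 0.01:
                problems.append(f"total mismatch in column {total_column!r}")
    return problems
```

Running a check like this after each extraction makes silent failures, such as a dropped column or a shifted row, show up before the data reaches downstream jobs.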

If you share a representative PDF (one digital, one scanned), I can run a quick bake-off with 2–3 approaches and give you tuned code per page type.
