Judicial Scraper
LegalTech · AI for lawyers
Large-scale extraction of Brazilian court records — 23 million cases from four courts, built for a LegalTech AI platform.
Delivered in one month for a client training legal AI. No public court APIs — only manual portals with pagination, CAPTCHAs, and anti-bot defenses.
2024
The client was building an AI assistant for lawyers and needed a massive dataset of court records: case numbers, parties, subjects, rulings, procedural history, and attached documents. This data lives on court websites with no public API, only manual queries, one case at a time, behind pagination, CAPTCHAs, and anti-bot measures. Collecting 23 million records by hand was not feasible.
We built a distributed, resilient scraping system that extracts court data at scale from the public portals of four Brazilian courts: TJSP (5 million cases), TJBA, TJRJ, and TRF3. Up to 12 headless browser instances query OAB (Brazilian Bar Association) record ranges in parallel and persist results to SQLite in real time, so a failure never loses completed work. The pipeline also opens individual case pages to extract status, procedural history, rulings, and PDF attachments, which are stored as BLOBs for direct ingestion by the client's AI stack.
- 23 million structured court cases across TJSP, TJBA, TJRJ, and TRF3
- Automatic audio CAPTCHA solving without external services
- PDF documents stored for offline querying and ingestion by AI pipelines
- Parallel headless browser farm with real-time SQLite checkpoints
Parallel collection without data loss
The system runs multiple simultaneous threads, each querying different OAB ranges. Results flush to SQLite continuously so a crashed worker or blocked session does not wipe hours of progress.
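The checkpointing idea can be sketched roughly as follows. This is a minimal illustration, not the production code: the table layout, the `fetch_cases_for_oab` stand-in for a browser query, and the one-transaction-per-OAB-number granularity are all assumptions made for the example.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical schema: scraped cases plus one checkpoint row per OAB range.
SCHEMA = """
CREATE TABLE IF NOT EXISTS cases (
    case_number TEXT PRIMARY KEY,
    oab         INTEGER NOT NULL,
    payload     TEXT
);
CREATE TABLE IF NOT EXISTS progress (
    range_start INTEGER PRIMARY KEY,
    range_end   INTEGER NOT NULL,
    last_done   INTEGER NOT NULL
);
"""


def fetch_cases_for_oab(oab):
    """Hypothetical stand-in for one headless-browser query.

    Returns a list of (case_number, payload) tuples.
    """
    return [(f"{oab:07d}-0001", "{}")]


def scrape_range(db_path, start, end, fetch=fetch_cases_for_oab):
    """Scrape OAB numbers start..end inclusive, resuming after the last checkpoint."""
    conn = sqlite3.connect(db_path, timeout=30)  # wait if another worker is writing
    try:
        rows = conn.execute(
            "SELECT last_done FROM progress WHERE range_start = ?", (start,)
        ).fetchall()
        resume_from = rows[0][0] + 1 if rows else start
        for oab in range(resume_from, end + 1):
            found = fetch(oab)
            with conn:  # cases and checkpoint commit together, or not at all
                conn.executemany(
                    "INSERT OR IGNORE INTO cases VALUES (?, ?, ?)",
                    [(number, oab, payload) for number, payload in found],
                )
                conn.execute(
                    "INSERT OR REPLACE INTO progress VALUES (?, ?, ?)",
                    (start, end, oab),
                )
    finally:
        conn.close()


def run_workers(db_path, ranges, max_workers=12, fetch=fetch_cases_for_oab):
    """Run one scrape_range task per (start, end) range on a thread pool."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    conn.close()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(scrape_range, db_path, s, e, fetch) for s, e in ranges]
        for future in futures:
            future.result()  # re-raise the first worker error, if any
```

Because each worker commits the scraped rows and its checkpoint in the same transaction, re-running `run_workers` with the same ranges after a crash picks up from the first unfinished OAB number instead of starting over.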
Audio CAPTCHA cracking with ML
TJRJ uses audio CAPTCHAs — five spoken digits. We built an in-house decoder: amplitude envelope segmentation, MFCC feature extraction per digit, and cosine-similarity classification against a labeled sample bank — no paid third-party CAPTCHA APIs.
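The decoder stages described above can be sketched in plain Python. This is a simplified illustration under stated assumptions: the frame size, thresholds, and function names are invented for the example, and `zero_crossing_features` is a toy stand-in for the MFCC step, which in practice would come from an audio library such as librosa.

```python
import math


def segment_by_envelope(samples, frame=2, rel_threshold=0.2, min_silence=2):
    """Split a signal into loud regions using its amplitude envelope.

    Returns (start, end) sample indices, end exclusive. A region closes
    after `min_silence` consecutive frames fall below the threshold.
    """
    n_frames = len(samples) // frame
    envelope = [
        sum(abs(x) for x in samples[i * frame:(i + 1) * frame]) / frame
        for i in range(n_frames)
    ]
    peak = max(envelope, default=0.0)
    if peak == 0:
        return []
    segments, start, silence = [], None, 0
    for i, level in enumerate(envelope):
        if level >= rel_threshold * peak:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence:
                segments.append((start * frame, (i - silence + 1) * frame))
                start, silence = None, 0
    if start is not None:  # signal ended while a region was still open
        segments.append((start * frame, (n_frames - silence) * frame))
    return segments


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def zero_crossing_features(segment):
    """Toy stand-in for MFCC extraction: [zero crossings, segment length]."""
    crossings = sum(1 for a, b in zip(segment, segment[1:]) if a * b < 0)
    return [float(crossings), float(len(segment))]


def classify_digits(samples, bank, feature_fn=zero_crossing_features):
    """Label each segment with the best-matching entry of the sample bank.

    `bank` maps a digit label to a list of reference feature vectors.
    """
    digits = []
    for start, end in segment_by_envelope(samples):
        features = feature_fn(samples[start:end])
        best = max(
            bank,
            key=lambda label: max(
                cosine_similarity(features, ref) for ref in bank[label]
            ),
        )
        digits.append(best)
    return "".join(digits)
```

Keeping the feature extractor as a parameter means the segmentation and nearest-reference matching stay the same when the toy features are swapped for real MFCC vectors.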
Verdict and document extraction
Beyond metadata, crawlers navigate into each case to pull status, procedural history, and attached PDFs (petitions, decisions, sentences), stored as BLOBs in per-case databases for offline querying and AI ingestion.
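Storing attachments as BLOBs might look like the sketch below. The `documents` table, its columns, and the choice to key rows by SHA-256 digest are assumptions for illustration; the only check performed is the standard `%PDF-` file signature.

```python
import hashlib
import sqlite3


def open_case_db(path):
    """Open (or create) a per-case database with a hypothetical documents table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               sha256   TEXT PRIMARY KEY,
               kind     TEXT NOT NULL,   -- e.g. petition, decision, sentence
               filename TEXT NOT NULL,
               content  BLOB NOT NULL
           )"""
    )
    return conn


def store_pdf(conn, kind, filename, data):
    """Store one PDF as a BLOB and return its SHA-256 digest.

    Keying by digest makes a repeated download of the same file a no-op.
    """
    if not data.startswith(b"%PDF-"):
        raise ValueError(f"{filename} does not look like a PDF")
    digest = hashlib.sha256(data).hexdigest()
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO documents VALUES (?, ?, ?, ?)",
            (digest, kind, filename, data),
        )
    return digest


def load_pdf(conn, digest):
    """Return the stored PDF bytes for a digest, or raise KeyError."""
    row = conn.execute(
        "SELECT content FROM documents WHERE sha256 = ?", (digest,)
    ).fetchone()
    if row is None:
        raise KeyError(digest)
    return bytes(row[0])
```

A downstream consumer can then read documents straight from the database file, with no dependency on the court portal being reachable.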
Four court portals, one pipeline
Each court portal has different HTML flows and defenses. The architecture isolates court-specific adapters while sharing persistence, CAPTCHA, and document-download primitives.
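One way to picture that separation is the sketch below. The class names, the `needs_captcha` flag, and the HTML snippets the adapters parse are all invented for illustration and do not reflect the real portal markup; only the fact that TJRJ presents a CAPTCHA comes from the project itself.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class CaseRecord:
    court: str
    case_number: str


class CourtAdapter(ABC):
    """Court-specific logic: everything that differs between portals."""

    court = ""
    needs_captcha = False

    @abstractmethod
    def parse_listing(self, html):
        """Return the case numbers found on one results page."""


class TJSPAdapter(CourtAdapter):
    court = "TJSP"

    def parse_listing(self, html):
        # Hypothetical markup, not the real TJSP page structure.
        return re.findall(r'<a class="processo">([^<]+)</a>', html)


class TJRJAdapter(CourtAdapter):
    court = "TJRJ"
    needs_captcha = True

    def parse_listing(self, html):
        # Hypothetical markup, not the real TJRJ page structure.
        return re.findall(r'<td data-numero="([^"]+)"', html)


class Pipeline:
    """Shared primitives, injected once and reused by every adapter."""

    def __init__(self, fetch_page, solve_captcha, save):
        self.fetch_page = fetch_page        # (court, oab) -> html
        self.solve_captcha = solve_captcha  # (court) -> None
        self.save = save                    # (CaseRecord) -> None

    def run(self, adapter, oab):
        """Fetch, parse, and persist one OAB query; return the record count."""
        if adapter.needs_captcha:
            self.solve_captcha(adapter.court)
        html = self.fetch_page(adapter.court, oab)
        numbers = adapter.parse_listing(html)
        for number in numbers:
            self.save(CaseRecord(adapter.court, number))
        return len(numbers)
```

With this shape, supporting another court means writing one new adapter class; the fetching, CAPTCHA, and persistence code stays untouched.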