Report Passer: bulk PDF-to-Excel data extraction for quality control

Step 01

Problem

Each concrete testing laboratory (Horysu, ICEC, LTE, Torralba...) sends its PDF reports in a radically different layout. Extracting strength values, W/C ratios, cement names, and slump values by hand meant opening each document and transcribing it. With hundreds of reports per batch, the task was tedious and highly prone to typing errors.

Step 02

Context and constraints

Historical data loads and quality audits at AW Certificación needed a robust offline tool that could process large PDF batches quickly on the local machine. It had to respect each provider's original data structure and never upload sensitive customer information to the network.

Role: Full desktop application development, covering GUI design with drag-and-drop support (TkinterDnD), implementation of 7 independent regex-based extraction engines, thread management to keep the UI from freezing, and standalone packaging with PyInstaller.

Step 03

Key decisions

Migrate from cloud-based scripts to a local Windows desktop application (.exe) to process reports locally, keeping client information off the network.
Multi-threaded architecture (Python's threading module) that isolates batch processing from the main UI thread, letting users track progress and cancel tasks at any time without freezes.
Use of robust, tailored regex patterns per lab to isolate key fields such as delivery note code, cement type, W/C ratio, and specimen strengths.
Design of an intuitive dropdown selector in the UI to choose the correct extraction engine matching the batch of reports loaded.
Offline packaging with PyInstaller including the custom application icon and bundled dependencies, requiring no pre-installed Python or external libraries.
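
The per-lab regex engines and the dropdown selector can be pictured as a registry that maps an engine name to its own set of patterns. The sketch below is only an illustration: the lab names, field labels, patterns, and sample text are all invented, since the real report layouts and regexes are not shown here.

```python
import re

# Hypothetical patterns for two made-up layouts. The application's real
# per-lab regexes are not published in this write-up.
LAB_A_PATTERNS = {
    "delivery_note": re.compile(r"Delivery note[:\s]+(\S+)", re.IGNORECASE),
    "cement": re.compile(r"Cement[:\s]+(CEM [^\n]+)", re.IGNORECASE),
    "wc_ratio": re.compile(r"W/C[:\s]+(\d+[.,]\d+)", re.IGNORECASE),
}
LAB_B_PATTERNS = {
    "delivery_note": re.compile(r"Note No\.\s*(\d+)"),
    "cement": re.compile(r"Binder\s*=\s*([^\n]+)"),
    "wc_ratio": re.compile(r"water/cement\s*=\s*(\d+[.,]\d+)"),
}

# The dropdown would offer the keys of this registry (names are placeholders).
ENGINES = {
    "Lab A (hypothetical)": LAB_A_PATTERNS,
    "Lab B (hypothetical)": LAB_B_PATTERNS,
}


def extract(engine_name, text):
    """Run every pattern of the chosen engine; unmatched fields stay empty."""
    row = {}
    for field, pattern in ENGINES[engine_name].items():
        match = pattern.search(text)
        row[field] = match.group(1).strip() if match else ""
    return row


# Invented text standing in for what a PDF text extractor might return.
SAMPLE_LAB_A = "Delivery note: A-1042\nCement: CEM II/A-L 42.5 R\nW/C: 0,52\n"
```

Running `extract("Lab A (hypothetical)", SAMPLE_LAB_A)` fills all three fields, while the Lab B engine finds nothing in the same text. That is the reason the user has to pick the engine that matches the batch.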

Step 04

Outcomes

Batch processing of dozens of reports in seconds, turning days of transcription work into a single click.
7 integrated extraction engines covering the main formats in the concrete sector (Horysu, ICEC, LTE/Horpresol, Torralba, etc.).
Clean and automated export to a single consolidated Excel file with all key fields extracted per column.

Metrics

7 laboratories and formats supported in a single executable.
Offline batch processing on a background thread, finishing in seconds per batch.
Zero external network dependencies: works 100% locally.
Irreversible redaction of sensitive data in sample PDFs: the text is removed from the file, not just covered.

Step 05

Learnings and next improvement

Application Architecture

The application was designed for usability in technical offices and speed of local extraction:

PDFs → [Tkinter GUI / Drop Area] → [Processing Thread] → [Regex Engines per Lab] → Excel

1. Intuitive User Interface

Drag-and-Drop: Allows dragging folders or PDF files directly into the window to list them and start processing.
Dropdown Selection Menu: The user can select the extraction engine corresponding to the format of the batch to be processed (e.g., LTE, ICEC, Horysu).
Real-time Log Console: Shows detailed reading progress, skipped files, errors, and Excel saving confirmations.
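
Because the drop area accepts both folders and individual files, the dropped paths have to be expanded into a flat list of PDFs before processing. A minimal sketch of that step follows, using only the standard library. The function name `collect_pdfs` is hypothetical, and searching subfolders recursively is an assumption; how the paths are read from the drop event is outside this sketch.

```python
from pathlib import Path


def collect_pdfs(dropped_paths):
    """Expand dropped files and folders into a sorted list of unique PDF paths."""
    found = set()
    for raw in dropped_paths:
        path = Path(raw)
        if path.is_dir():
            # Assumption: subfolders are searched too; extension match is case-insensitive.
            found.update(
                p for p in path.rglob("*")
                if p.is_file() and p.suffix.lower() == ".pdf"
            )
        elif path.is_file() and path.suffix.lower() == ".pdf":
            found.add(path)
        # Anything else (other file types, missing paths) is silently skipped here;
        # a real UI would probably report it in the log console.
    return sorted(found)
```

Collecting into a set means a file dropped both directly and through its parent folder is listed only once.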

2. Secure Multi-threaded Processing

Extraction runs on an independent worker thread. This keeps the GUI responsive, allowing the user to cancel the process if needed.
All dialog popups and saving confirmations are safely scheduled on the main thread via self.after to prevent race conditions or UI lockups.
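
The worker-thread pattern described above can be sketched without any GUI code. In this sketch the class name `BatchWorker` and its callbacks are hypothetical. The `schedule` argument stands in for whatever hands a call back to the UI thread; in a Tkinter window that could be something like `lambda fn, *args: self.after(0, fn, *args)`.

```python
import threading


class BatchWorker:
    """Process files on a background thread, with cooperative cancellation."""

    def __init__(self, files, process_file, schedule):
        self.files = list(files)
        self.process_file = process_file  # callable: file name -> extracted row
        self.schedule = schedule          # callable: (fn, *args) -> run fn on the UI thread
        self.cancel_event = threading.Event()
        self.thread = None

    def start(self, on_progress, on_done):
        # daemon=True so a stuck batch never keeps the application alive on exit.
        self.thread = threading.Thread(
            target=self._run, args=(on_progress, on_done), daemon=True
        )
        self.thread.start()

    def cancel(self):
        # Only sets a flag; the worker checks it before starting the next file.
        self.cancel_event.set()

    def _run(self, on_progress, on_done):
        rows = []
        total = len(self.files)
        for index, name in enumerate(self.files, start=1):
            if self.cancel_event.is_set():
                break
            rows.append(self.process_file(name))
            # The worker never touches widgets directly; UI updates are handed over.
            self.schedule(on_progress, index, total)
        self.schedule(on_done, rows, self.cancel_event.is_set())
```

Cancellation here is cooperative: the file currently being read always finishes, and the loop stops before the next one. The completion callback receives the rows gathered so far plus a flag saying whether the batch was cancelled, so the UI can decide whether to save a partial Excel file.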

Downloads and Demo Resources

⚠️ IMPORTANT: This is NOT a program bug

The application does not invent data. If the Excel output has empty fields (slump cone values, minimum cement content, or any other blank cell), it is because the original PDF report did not contain that data. The program extracts exactly what is in each report, nothing more and nothing less. That includes human errors in the original PDFs, such as "HA‑30/F/16/XD2/HIDROFUGO" (a designation mistakenly run together in the original report); the application reproduces it exactly as it was sent.

The tool works correctly and actually helps you detect these inconsistencies in the reports you receive.
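
The blank-cell behaviour can be sketched as a mapping from extracted fields onto a fixed column order, where anything the report did not contain becomes an empty cell and found values pass through unchanged. The column names and both helper functions below are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical column order for the consolidated sheet.
COLUMNS = ["file", "delivery_note", "cement", "wc_ratio", "slump_mm", "strength_mpa"]


def to_row(file_name, extracted):
    """Place extracted values in column order; missing fields become empty cells."""
    row = [file_name]
    for column in COLUMNS[1:]:
        value = extracted.get(column)
        # Nothing is guessed or defaulted: a missing value is written as "".
        row.append("" if value is None else value)
    return row


def blank_report(rows):
    """List (file, column) pairs that are empty, to review gaps in the source reports."""
    return [
        (row[0], COLUMNS[i])
        for row in rows
        for i, cell in enumerate(row)
        if cell == ""
    ]
```

A list like the one returned by `blank_report` is one way to turn the empty cells into a checklist of reports that arrived incomplete.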

You can test the application by downloading the executable and the censored test batch:

Download Desktop Application (.exe from Google Drive)
Download Batch of 49 Censored Test PDFs (.zip) (contains real test reports from the LTE lab, redacted to protect client confidentiality)

📖 Demo Instructions

Download and extract the PDFs from the .zip file above.
Open the Pasador de Informes desktop application.
In the dropdown selector on the UI, select the “LTE (Horpresol)” engine (it is critical to select this engine because the test files follow this specific layout; other engines will not interpret fields correctly).
Drag the test PDFs to the green drop zone of the application and click PROCESAR Y GUARDAR EXCEL.

Confidentiality and Redaction Notice

For strict business confidentiality reasons, real customer reports cannot be distributed. To still allow testing on realistic material, we provide a package of 49 test reports in PDF format that have been permanently censored by a redaction script.

This process irreversibly removes text streams containing company names, technician details, addresses, and specific project references from the PDF’s internal file structure, replacing them with black bands. The laboratory’s layout and numerical test values are preserved to demonstrate the precision of the automation.

Step 06

Visual resources

Project visual for Report Passer: bulk PDF-to-Excel data extraction for quality control