PDF extraction and report automation for concrete quality control
Python system to review reports, extract data from technical PDFs, and accelerate large concrete test data loads.
Step 01
Problem
Reviewing and reloading self-control and verification tests depended on PDFs, inconsistent spreadsheets, and manual checks across hundreds of records per company.
Step 02
Context and constraints
During an internal software migration and the 2025/2026 historical data load, many laboratories submitted strength, water penetration, and related test reports without consistent tabulated data.
Role: Designed and developed Python extractors, validators, and provider-specific templates, including executable-style packaging for operational use.
Step 03
Key decisions
- Build deterministic template-based extractors for recurring providers when the PDF structure was stable enough.
- Prioritize traceability, reproducible rules, and data cleaning over external AI APIs, whose operating cost would have been paid out of pocket.
- Add validations to detect nonconformities, self-control/verification inconsistencies, and errors in names, certificates, or codes.
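As a minimal sketch of the template-based approach in the first decision, the example below applies a per-provider set of regular expressions to text that has already been extracted from a PDF. The template name `TEMPLATE_LAB_A`, the field names, and the report labels are hypothetical and chosen only for illustration; the actual templates and PDF text-extraction step are not shown in this write-up.

```python
import re
from dataclasses import dataclass

# Hypothetical provider template: field name -> regex with one capture group.
# The labels are invented for illustration, not taken from a real report.
TEMPLATE_LAB_A = {
    "sample_code": re.compile(r"Sample code:\s*(\S+)"),
    "test_type": re.compile(r"Test:\s*(.+)"),
    "strength_mpa": re.compile(r"Compressive strength:\s*([\d.,]+)\s*MPa"),
}


@dataclass
class ExtractionResult:
    fields: dict   # field name -> extracted string
    missing: list  # field names the template could not find


def extract_with_template(text, template):
    """Apply every pattern in the template to already-extracted PDF text."""
    fields, missing = {}, []
    for name, pattern in template.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1).strip()
        else:
            # Record the gap instead of guessing, so the report can be reviewed.
            missing.append(name)
    return ExtractionResult(fields=fields, missing=missing)


sample_text = """Sample code: HC-0042
Test: Compressive strength at 28 days
Compressive strength: 31,5 MPa
"""
result = extract_with_template(sample_text, TEMPLATE_LAB_A)
print(result.fields)
print(result.missing)
```

Because the same text and the same template always produce the same fields, a result can be reproduced and audited later, and the `missing` list shows when a provider has changed its layout.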
Step 04
Outcomes
- Work that took a full day by hand became an automated workflow that runs in under one hour once the batch is prepared.
- Extraction becomes more productive as report volume grows, because each template's setup cost is amortized across the batch.
- Data cleaning reduced errors from spacing, prefixes, groups, inconsistent certificates, and weak document standardization.
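The kind of cleaning described in the last outcome could look roughly like the sketch below, which collapses stray spacing and strips a leading prefix so that differently typed certificate numbers map to one canonical key. The function names, the prefix pattern, and the sample values are assumptions made for illustration, not the project's actual rules.

```python
import re
import unicodedata

# Hypothetical prefix pattern: "CERT", "CERT.", "CERTIFICATE NO:", and similar.
_CERT_PREFIX = re.compile(r"^(?:CERTIFICATE|CERT)\b\.?\s*(?:NO\b\.?)?\s*[:\-]?\s*")


def normalize_spaces(value):
    """Collapse runs of whitespace (including non-breaking spaces) to one space."""
    value = unicodedata.normalize("NFKC", value)
    return re.sub(r"\s+", " ", value).strip()


def normalize_certificate(value):
    """Uppercase, drop a known leading prefix, and remove separators."""
    cleaned = normalize_spaces(value).upper()
    cleaned = _CERT_PREFIX.sub("", cleaned)
    return re.sub(r"[\s\-./]", "", cleaned)


def group_by_certificate(raw_values):
    """Group raw spellings under their canonical certificate key."""
    groups = {}
    for raw in raw_values:
        groups.setdefault(normalize_certificate(raw), []).append(raw)
    return groups


groups = group_by_certificate([" cert. no: ab - 123 / 24 ", "AB-123/24"])
print(groups)
```

Keeping the raw spellings next to the canonical key preserves traceability: a reviewer can see which original values were merged.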
Metrics
- Hundreds of tests processed per batch in seconds once the right template is configured.
- Estimated operational reduction from about 8 hours to under 1 hour on comparable data-load tasks.
- Coverage across concrete strength, water penetration, self-control, verification, and document review workflows.
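The amortization argument behind these figures can be made concrete with a small break-even calculation. This is a sketch only: the function name and the input values are hypothetical placeholders, not measurements from the project.

```python
import math


def break_even_reports(setup_minutes, manual_minutes_per_report, auto_minutes_per_report):
    """Smallest batch size at which a template's setup time is recovered.

    Returns None when automation saves no time per report.
    """
    saving = manual_minutes_per_report - auto_minutes_per_report
    if saving <= 0:
        return None
    return math.ceil(setup_minutes / saving)


# Hypothetical inputs: 90 min to build a template, 2 min per report by hand,
# 0.5 min per report with the template (including spot checks).
batch_size = break_even_reports(90, 2, 0.5)
print(batch_size)
```

Below the break-even batch size, manual review is still cheaper; above it, every additional report from the same provider adds to the saving.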
Step 05
Learnings and next improvement
Learnings
- The highest-value quality automation is not only fast extraction; it is traceable review logic that reduces manual error.
- Provider-specific templates are pragmatic when PDF structures repeat and API costs are not justified for every batch.
- Data cleaning belongs inside the quality system: inconsistent names, certificates, and codes can block progress as much as a wrong calculation.
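One way to keep review logic traceable, sketched under assumptions, is to have every rule emit a named issue rather than silently fixing or rejecting a record. The record fields, rule names, and the pairing rule between self-control and verification tests below are hypothetical examples, not the project's actual checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    record_id: str
    rule: str    # name of the rule that fired, kept for traceability
    detail: str  # human-readable explanation for the reviewer


def validate_record(record, known_certificates):
    """Return every issue found in one test record (empty list if clean)."""
    issues = []
    if record["strength_mpa"] < record["required_mpa"]:
        issues.append(Issue(
            record["id"], "strength_below_required",
            f"{record['strength_mpa']} MPa < {record['required_mpa']} MPa",
        ))
    if record["certificate"] not in known_certificates:
        issues.append(Issue(
            record["id"], "unknown_certificate",
            f"certificate {record['certificate']!r} is not registered",
        ))
    return issues


def unmatched_verifications(self_control, verification):
    """Flag verification records whose sample code has no self-control record."""
    codes = {r["sample_code"] for r in self_control}
    return [
        Issue(r["id"], "verification_without_self_control", r["sample_code"])
        for r in verification
        if r["sample_code"] not in codes
    ]


record = {"id": "R-1", "strength_mpa": 22.0, "required_mpa": 25.0, "certificate": "AB12324"}
issues = validate_record(record, known_certificates={"ZZ999"})
for issue in issues:
    print(issue.rule, "->", issue.detail)
```

Since each issue carries the rule name and the record identifier, a reviewer can see why a record was flagged and rerun the same rule after the data is corrected.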
Next improvement
- Turn the templates into an internal interface with previews, issue logging, and a future AI option when the cost makes sense.
Step 06