Introduction
In document processing and testing, access to poor-quality scans is crucial for developing and testing robust document analysis systems. However, real-world bad scans are hard to obtain and inconsistent from one sample to the next. This tool was inspired by the challenges described in Properly Handling Criminal Discovery, where legal professionals often face degraded document quality in discovery materials.
Example Output
Here’s an example of the kind of degraded document quality that bad_scanner can simulate:
Example output: A simulated poor-quality scan of a police report demonstrating various artifacts including dust spots, scratches, and reduced contrast.
This example demonstrates several key features of the tool:
- Realistic paper texture and aging effects
- Subtle rotation of the document
- Dust and particle artifacts
- Reduced contrast and clarity
- Light scratches across the surface
Enter bad_scanner, a Python tool that simulates various scanner artifacts and degradation effects on PDF documents. It’s particularly valuable for:
- Testing document processing systems
- Developing OCR systems that need to handle degraded documents
- Creating training data for machine learning models
- Simulating real-world scanning conditions
- Testing legal document processing systems against common discovery document issues
What is Bad Scanner?
bad_scanner is a command-line tool that takes a PDF document and applies a series of visual effects to simulate a poor-quality scanner.
Key Features
1. Dust and Particle Effects
The tool simulates dust particles and spots that commonly appear on scanner glass:
# Example dust spot parameters
dust_density = 1.0
dust_min_radius = 1
dust_max_radius = 5
dust_alpha_min = 30
dust_alpha_max = 100
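To make these parameters concrete, here is a minimal sketch of how dust spots could be drawn with Pillow. It is not taken from the bad_scanner source: the function name add_dust, the seed argument, and especially the mapping from dust_density to a spot count (spots per 10,000 pixels) are assumptions made for illustration.

```python
import random

from PIL import Image, ImageDraw


def add_dust(page, dust_density=1.0, dust_min_radius=1, dust_max_radius=5,
             dust_alpha_min=30, dust_alpha_max=100, seed=None):
    """Return a copy of `page` with random semi-transparent dark spots."""
    rng = random.Random(seed)
    base = page.convert("RGBA")
    # Draw on a transparent overlay so the alpha values blend with the page.
    overlay = Image.new("RGBA", base.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)

    # Assumption: density is interpreted as spots per 10,000 pixels of page.
    spot_count = round(dust_density * base.width * base.height / 10_000)
    for _ in range(spot_count):
        radius = rng.randint(dust_min_radius, dust_max_radius)
        x = rng.randrange(base.width)
        y = rng.randrange(base.height)
        alpha = rng.randint(dust_alpha_min, dust_alpha_max)
        draw.ellipse(
            (x - radius, y - radius, x + radius, y + radius),
            fill=(0, 0, 0, alpha),
        )

    return Image.alpha_composite(base, overlay).convert("RGB")
```

Passing a fixed seed makes the spot pattern reproducible, which is useful when the same degraded page needs to be regenerated for a test case.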
2. Scratches and Marks
Adds realistic scratches and marks that might appear on scanned documents:
# Example scratch parameters
scratch_count = 3
scratch_line_width_min = 1
scratch_line_width_max = 3
scratch_alpha_min = 50
scratch_alpha_max = 150
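The scratch parameters can be illustrated in the same way. The sketch below is an assumption about how such an effect might be built with Pillow, not the tool's actual code: the name add_scratches, the straight-line shape, the random endpoints, and the gray scratch color are all hypothetical choices.

```python
import random

from PIL import Image, ImageDraw


def add_scratches(page, scratch_count=3, scratch_line_width_min=1,
                  scratch_line_width_max=3, scratch_alpha_min=50,
                  scratch_alpha_max=150, seed=None):
    """Return a copy of `page` with random semi-transparent line scratches."""
    rng = random.Random(seed)
    base = page.convert("RGBA")
    overlay = Image.new("RGBA", base.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)

    for _ in range(scratch_count):
        # Assumption: each scratch is a straight line between two random points.
        start = (rng.randrange(base.width), rng.randrange(base.height))
        end = (rng.randrange(base.width), rng.randrange(base.height))
        width = rng.randint(scratch_line_width_min, scratch_line_width_max)
        alpha = rng.randint(scratch_alpha_min, scratch_alpha_max)
        # Assumption: a dark gray color, so scratches show up on white paper.
        draw.line([start, end], fill=(80, 80, 80, alpha), width=width)

    return Image.alpha_composite(base, overlay).convert("RGB")
```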
3. Image Quality Adjustments
Fine-tune various aspects of the image quality:
- Blur effects
- Contrast adjustments
- Sharpness modifications
- Brightness control
- Paper rotation
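Each adjustment in this list maps onto a standard Pillow operation. The following sketch shows one plausible way to chain them. The function name degrade_quality, the default values, and the order of operations are assumptions and may differ from what bad_scanner does internally.

```python
from PIL import Image, ImageEnhance, ImageFilter


def degrade_quality(page, blur_radius=1.5, contrast=0.8, sharpness=0.7,
                    brightness=0.9, rotation_degrees=0.7):
    """Apply rotation, blur, contrast, sharpness and brightness changes."""
    out = page.convert("RGB")
    # Slight skew, as if the sheet was not aligned on the scanner glass.
    # The exposed corners are filled with white instead of black.
    out = out.rotate(rotation_degrees, resample=Image.BICUBIC, fillcolor="white")
    out = out.filter(ImageFilter.GaussianBlur(radius=blur_radius))
    # For the enhancers, 1.0 leaves the image unchanged; lower values reduce
    # the corresponding property.
    out = ImageEnhance.Contrast(out).enhance(contrast)
    out = ImageEnhance.Sharpness(out).enhance(sharpness)
    out = ImageEnhance.Brightness(out).enhance(brightness)
    return out
```

Keeping every factor as a keyword argument makes it easy to sweep one property at a time, for example to find the contrast level at which an OCR engine starts to fail.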
Usage Examples
Basic Usage
python bad_scanner.py input.pdf output.pdf
Advanced Configuration
python bad_scanner.py input.pdf output.pdf \
--blur-radius 3.0 \
--dust-density 2.0 \
--scratch-count 5 \
--contrast 0.8 \
--brightness 0.9
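For readers curious how flags like these are typically wired up, here is a hedged argparse sketch. It only covers the flags shown in the command above. The dust and scratch defaults reuse the example parameter values from earlier in this post, while the remaining defaults and the help texts are assumptions. The real bad_scanner.py may define its options differently.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in the example command."""
    parser = argparse.ArgumentParser(
        prog="bad_scanner.py",
        description="Simulate a poor-quality scan of a PDF.",
    )
    parser.add_argument("input_pdf", help="path to the source PDF")
    parser.add_argument("output_pdf", help="path for the degraded PDF")
    # Defaults for blur, contrast and brightness are illustrative guesses.
    parser.add_argument("--blur-radius", type=float, default=1.0)
    parser.add_argument("--dust-density", type=float, default=1.0)
    parser.add_argument("--scratch-count", type=int, default=3)
    parser.add_argument("--contrast", type=float, default=1.0)
    parser.add_argument("--brightness", type=float, default=1.0)
    return parser


# Parse the same arguments as the command-line example above.
args = build_parser().parse_args([
    "input.pdf", "output.pdf",
    "--blur-radius", "3.0",
    "--dust-density", "2.0",
    "--scratch-count", "5",
    "--contrast", "0.8",
    "--brightness", "0.9",
])
print(args.blur_radius, args.dust_density, args.scratch_count)
```

argparse converts the dashes in flag names to underscores, so --blur-radius is read back as args.blur_radius.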
Batch Processing
For processing multiple files:
./run_batch.sh input_directory output_directory
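The contents of run_batch.sh are not shown here, so the following is only a Python sketch of what a batch run could look like: it calls the basic command from above once per PDF. The function names plan_batch and run_batch are hypothetical, and the sketch assumes that output files keep their input file names.

```python
import subprocess
import sys
from pathlib import Path


def plan_batch(input_dir, output_dir, script="bad_scanner.py"):
    """Return one command per PDF in `input_dir`, in sorted order."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    return [
        [sys.executable, script, str(pdf), str(output_dir / pdf.name)]
        for pdf in sorted(input_dir.glob("*.pdf"))
    ]


def run_batch(input_dir, output_dir, script="bad_scanner.py"):
    """Run the planned commands, stopping at the first failure."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for command in plan_batch(input_dir, output_dir, script):
        subprocess.run(command, check=True)
```

Separating planning from execution makes it possible to inspect or log the commands before any file is processed.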
Technical Details
The tool is built using:
- Python 3.x
- pdf2image for PDF processing
- Pillow (PIL) for image manipulation
- Poppler for PDF rendering (a system dependency that pdf2image calls; pip does not install it)
Installation
- Clone the repository:
git clone https://github.com/LucidTruthTechnologies/bad_scanner.git
cd bad_scanner
- Set up the virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
Real-World Applications
Legal Document Processing
The tool was specifically developed to address common issues in legal discovery documents, where poor scanning quality can significantly impact document analysis and OCR accuracy. This is particularly relevant for:
- Criminal defense document review
- Legal discovery processing
- Court document analysis
- Evidence processing
Document Processing Testing
When developing document processing systems, it’s crucial to test against various input conditions. bad_scanner helps create consistent test cases with known degradation levels.
OCR Development
OCR systems need to handle poor-quality inputs. This tool can generate training data with controlled degradation levels.
System Validation
Use bad_scanner to validate that your document processing pipeline can handle real-world scanning conditions.
Conclusion
bad_scanner fills an important niche in document processing development and testing. By providing consistent, controllable degradation effects, it helps developers create more robust systems that can handle real-world scanning conditions. This is particularly valuable in the legal field, where document quality can significantly impact case outcomes.
Resources
- GitHub Repository
- Issue Tracker
- Properly Handling Criminal Discovery - The article that inspired this tool
Note: This tool is open source and available under the MIT License. Contributions are welcome!