Introduction

In document processing and testing, access to poor-quality scans is crucial for developing and testing robust document analysis systems. However, real-world bad scans are hard to obtain, and the ones that are available are inconsistent. This tool was inspired by the challenges described in “Properly Handling Criminal Discovery”, where legal professionals often face degraded document quality in discovery materials.

Example Output

Here’s an example of the kind of degraded document quality that bad_scanner can simulate:

Figure: Example of a degraded police report generated using bad_scanner. A simulated poor-quality scan demonstrating various artifacts, including dust spots, scratches, and reduced contrast.

This example demonstrates several key features of the tool:

  • Realistic paper texture and aging effects
  • Subtle rotation of the document
  • Dust and particle artifacts
  • Reduced contrast and clarity
  • Light scratches across the surface

Enter bad_scanner, a Python tool that simulates various scanner artifacts and degradation effects on PDF documents. It’s particularly valuable for:

  • Testing document processing systems
  • Developing OCR systems that need to handle degraded documents
  • Creating training data for machine learning models
  • Simulating real-world scanning conditions
  • Testing legal document processing systems against common discovery document issues

What is Bad Scanner?

bad_scanner is a command-line tool that takes a PDF document and applies a series of visual effects to simulate a poor-quality scanner.

Key Features

1. Dust and Particle Effects

The tool simulates dust particles and spots that commonly appear on scanner glass:

# Example dust spot parameters
dust_density = 1.0
dust_min_radius = 1
dust_max_radius = 5
dust_alpha_min = 30
dust_alpha_max = 100
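
As a rough illustration of how parameters like these could drive an effect, the sketch below scatters semi-transparent specks over a page image with Pillow. It is not taken from bad_scanner’s source: the function name add_dust, the seed argument, and the reading of dust_density as “specks per 10,000 pixels” are assumptions made for this example.

```python
import random

from PIL import Image, ImageDraw


def add_dust(page, density=1.0, min_radius=1, max_radius=5,
             alpha_min=30, alpha_max=100, seed=None):
    """Overlay semi-transparent dark specks on a page image (illustrative only)."""
    rng = random.Random(seed)
    base = page.convert("RGBA")
    overlay = Image.new("RGBA", base.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)

    # Assumption: density is the number of specks per 10,000 pixels of page area.
    count = int(density * base.width * base.height / 10_000)
    for _ in range(count):
        x = rng.randrange(base.width)
        y = rng.randrange(base.height)
        radius = rng.randint(min_radius, max_radius)
        alpha = rng.randint(alpha_min, alpha_max)  # 0 = invisible, 255 = opaque
        draw.ellipse((x - radius, y - radius, x + radius, y + radius),
                     fill=(0, 0, 0, alpha))

    # Blend the speck layer onto the page and drop the alpha channel again.
    return Image.alpha_composite(base, overlay).convert("RGB")
```

Passing a fixed seed makes the speck pattern reproducible, which is useful when a test case needs to be regenerated exactly.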

2. Scratches and Marks

Adds realistic scratches and marks that might appear on scanned documents:

# Example scratch parameters
scratch_count = 3
scratch_line_width_min = 1
scratch_line_width_max = 3
scratch_alpha_min = 50
scratch_alpha_max = 150
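
One way such scratch parameters might be applied is to draw a few random, semi-transparent line segments over the page. The sketch below does this with Pillow; the function name add_scratches, the seed argument, the dark-gray scratch color, and the choice of random straight segments are all assumptions for illustration, not a description of how bad_scanner draws scratches.

```python
import random

from PIL import Image, ImageDraw


def add_scratches(page, count=3, width_min=1, width_max=3,
                  alpha_min=50, alpha_max=150, seed=None):
    """Draw random semi-transparent line segments over a page (illustrative only)."""
    rng = random.Random(seed)
    base = page.convert("RGBA")
    overlay = Image.new("RGBA", base.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)

    for _ in range(count):
        start = (rng.randrange(base.width), rng.randrange(base.height))
        end = (rng.randrange(base.width), rng.randrange(base.height))
        alpha = rng.randint(alpha_min, alpha_max)
        width = rng.randint(width_min, width_max)
        # Assumption: scratches show up as dark gray marks on a light page.
        draw.line([start, end], fill=(60, 60, 60, alpha), width=width)

    # Blend the scratch layer onto the page and drop the alpha channel again.
    return Image.alpha_composite(base, overlay).convert("RGB")
```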

3. Image Quality Adjustments

Fine-tune various aspects of the image quality:

  • Blur effects
  • Contrast adjustments
  • Sharpness modifications
  • Brightness control
  • Paper rotation
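
The adjustments listed above map naturally onto Pillow’s ImageFilter and ImageEnhance modules. The following is a minimal sketch under that assumption; the function name degrade_quality, its default values, and the order in which the steps run are illustrative choices rather than the tool’s actual pipeline.

```python
from PIL import Image, ImageEnhance, ImageFilter


def degrade_quality(page, blur_radius=1.5, contrast=0.8, brightness=0.9,
                    sharpness=0.7, rotation_degrees=0.5):
    """Apply blur, contrast, brightness, sharpness and a slight rotation."""
    out = page.convert("RGB")
    out = out.filter(ImageFilter.GaussianBlur(radius=blur_radius))
    # For the enhancers, 1.0 leaves the image unchanged and lower values reduce the effect.
    out = ImageEnhance.Contrast(out).enhance(contrast)
    out = ImageEnhance.Brightness(out).enhance(brightness)
    out = ImageEnhance.Sharpness(out).enhance(sharpness)
    # Rotate around the center, keep the page size, and fill exposed corners with white.
    out = out.rotate(rotation_degrees, resample=Image.BICUBIC, fillcolor="white")
    return out
```

Values below 1.0 for contrast and brightness give the washed-out, slightly dim look typical of a poor scan, while a rotation of well under one degree is usually enough to mimic a page that was not placed squarely on the glass.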

Usage Examples

Basic Usage

python bad_scanner.py input.pdf output.pdf

Advanced Configuration

python bad_scanner.py input.pdf output.pdf \
    --blur-radius 3.0 \
    --dust-density 2.0 \
    --scratch-count 5 \
    --contrast 0.8 \
    --brightness 0.9

Batch Processing

For processing multiple files:

./run_batch.sh input_directory output_directory
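
The contents of run_batch.sh are not shown here. For readers who prefer to drive the batch from Python, the sketch below builds one basic-usage command per PDF in a directory and runs them in turn. The helper names build_batch_commands and run_batch are hypothetical, and the sketch assumes it is run from the repository root so that bad_scanner.py resolves.

```python
import subprocess
import sys
from pathlib import Path


def build_batch_commands(input_dir, output_dir, script="bad_scanner.py"):
    """Return one 'python bad_scanner.py input.pdf output.pdf' command per PDF."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    commands = []
    for pdf in sorted(input_dir.glob("*.pdf")):
        # Each output keeps the input file name inside the output directory.
        commands.append([sys.executable, script, str(pdf), str(output_dir / pdf.name)])
    return commands


def run_batch(input_dir, output_dir):
    """Run the commands one after another, stopping at the first failure."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for command in build_batch_commands(input_dir, output_dir):
        subprocess.run(command, check=True)
```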

Technical Details

The tool is built using:

  • Python 3.x
  • pdf2image for converting PDF pages to images
  • Pillow (PIL) for image manipulation
  • Poppler for PDF rendering (the backend that pdf2image calls, installed separately from the Python packages)
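
To show how these pieces could fit together, here is a minimal sketch of a rasterize, degrade, and reassemble loop. It relies on pdf2image’s convert_from_path and on Pillow’s multi-page PDF writer; the function names load_pages, save_pages, and process_pdf, the 150 dpi default, and the effect callback are assumptions for illustration and may not match how bad_scanner is organized.

```python
from PIL import Image


def load_pages(pdf_path, dpi=150):
    """Rasterize each PDF page to a PIL image (needs pdf2image and Poppler)."""
    # Imported lazily so the rest of the sketch can be used without Poppler.
    from pdf2image import convert_from_path
    return convert_from_path(pdf_path, dpi=dpi)


def save_pages(pages, target):
    """Write a non-empty list of PIL images as one multi-page PDF."""
    first, *rest = [page.convert("RGB") for page in pages]
    first.save(target, "PDF", save_all=True, append_images=rest)


def process_pdf(src, dst, effect=lambda page: page):
    """Apply `effect` to every page of `src` and write the result to `dst`."""
    save_pages([effect(page) for page in load_pages(src)], dst)
```

Because the output is rebuilt from page images, it contains no text layer, which is consistent with what a real scanner produces.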

Installation

  1. Clone the repository:
    git clone https://github.com/LucidTruthTechnologies/bad_scanner.git
    cd bad_scanner
    
  2. Set up the virtual environment:
    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    
  3. Install dependencies:
    pip install -r requirements.txt
    

Real-World Applications

The tool was specifically developed to address common issues in legal discovery documents, where poor scanning quality can significantly impact document analysis and OCR accuracy. This is particularly relevant for:

  • Criminal defense document review
  • Legal discovery processing
  • Court document analysis
  • Evidence processing

Document Processing Testing

When developing document processing systems, it’s crucial to test against various input conditions. bad_scanner helps create consistent test cases with known degradation levels.
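
One possible way to keep degradation levels “known” is to name a few presets and turn each into a command line. The sketch below uses only the flags shown under Advanced Configuration; the preset names and their values are hypothetical, chosen for illustration rather than recommended by the tool.

```python
import sys

# Hypothetical presets: the names and numbers are illustrative assumptions.
LEVELS = {
    "light": {"blur-radius": 1.0, "dust-density": 0.5, "scratch-count": 1,
              "contrast": 0.95, "brightness": 1.0},
    "medium": {"blur-radius": 2.0, "dust-density": 1.0, "scratch-count": 3,
               "contrast": 0.9, "brightness": 0.95},
    "heavy": {"blur-radius": 3.0, "dust-density": 2.0, "scratch-count": 5,
              "contrast": 0.8, "brightness": 0.9},
}


def build_command(src, dst, level, script="bad_scanner.py"):
    """Build the argument list for one input file at a named degradation level."""
    command = [sys.executable, script, src, dst]
    for flag, value in LEVELS[level].items():
        command += [f"--{flag}", str(value)]
    return command


if __name__ == "__main__":
    # Print one command per preset; pass each list to subprocess.run to execute it.
    for name in LEVELS:
        print(" ".join(build_command("input.pdf", f"output_{name}.pdf", name)))
```

Storing the presets next to the test suite means a failing case can be traced back to the exact settings that produced its input.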

OCR Development

OCR systems need to handle poor-quality inputs. This tool can generate training data with controlled degradation levels.

System Validation

Use bad_scanner to validate that your document processing pipeline can handle real-world scanning conditions.

Conclusion

bad_scanner fills an important niche in document processing development and testing. By providing consistent, controllable degradation effects, it helps developers create more robust systems that can handle real-world scanning conditions. This is particularly valuable in the legal field, where document quality can significantly impact case outcomes.

Note: This tool is open source and available under the MIT License. Contributions are welcome!