RyanEiri/Scans

Scans

A workspace for digitizing and OCR-processing scanned historical documents, primarily Norwegian and Icelandic texts.

Dependencies

  • Ghostscript (gs) — PDF extraction and combining
  • Tesseract + language packs — OCR
  • Scan Tailor — GUI tool for cleaning and splitting page images
  • ImageMagick (convert) — JPG to TIFF conversion (if needed)
  • tiff2pdf — TIFF to PDF conversion (if needed)

Install on macOS with MacPorts:

sudo port install ghostscript tesseract ImageMagick tiff
sudo port install scantailor
# Install Tesseract language packs as needed, e.g.:
sudo port install tesseract-nor tesseract-isl tesseract-dan tesseract-eng
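After installing, it can help to confirm that the command-line tools are actually on your PATH. The following is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper, and the tool names are taken from the Dependencies list above (Scan Tailor is a GUI application and is not checked).

```shell
#!/bin/sh
# Print the name of every given tool that cannot be found on PATH.
# Hypothetical helper for illustration; not one of the repository scripts.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tool names as listed under Dependencies.
missing=$(missing_tools gs tesseract convert tiff2pdf)
if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing" >&2
fi

# Show which language packs Tesseract can see (only if it is installed).
if command -v tesseract >/dev/null 2>&1; then
    tesseract --list-langs
fi
```

If `nor`, `isl`, `dan`, or `eng` is absent from the list printed by Tesseract, install the corresponding language pack before running the OCR step.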

Workflow

Each project directory holds the source scans plus an out/ subdirectory for Scan Tailor output. Run all scripts from the repository root, passing the project directory as the first argument.

1. Extract

Extract pages from source PDFs as PNG images at 400 DPI:

./getit "Project Dir"               # process all *.pdf files in the directory
./getit "Project Dir" path/to.pdf   # process specific file(s)
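The contents of `getit` are not shown here, so the following is only a sketch of how a 400 DPI PNG extraction could be done with Ghostscript. The function name `extract_pages`, the `GS` override variable, the 24-bit colour `png16m` device, and the `name-0001.png` output naming are all assumptions for illustration.

```shell
#!/bin/sh
# Render every page of one PDF to PNG files at 400 DPI.
# Hypothetical helper; the real getit script may differ in every detail.
# Set GS to use a different Ghostscript binary (defaults to "gs").
extract_pages() {
    pdf=$1
    outdir=$2
    base=$(basename "$pdf" .pdf)
    mkdir -p "$outdir"
    # png16m = 24-bit colour; pnggray would be an alternative for
    # grayscale scans. %04d is expanded by Ghostscript to the page number.
    "${GS:-gs}" -dNOPAUSE -dBATCH -dSAFER -sDEVICE=png16m -r400 \
        -sOutputFile="$outdir/$base-%04d.png" "$pdf"
}

# Example: process every PDF in a project directory.
# for pdf in "Project Dir"/*.pdf; do
#     extract_pages "$pdf" "Project Dir"
# done
```

Zero-padded page numbers keep the output files in page order even under plain lexical sorting.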

If the source material is JPGs rather than PDFs, convert them to TIFF first:

./convertit "Project Dir"
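As a rough illustration of the JPG to TIFF step, the sketch below loops over the JPGs in a directory and calls ImageMagick's `convert` on each one. It is not the `convertit` script itself: `convert_jpgs` and the `CONVERT` override variable are hypothetical, and writing the TIFF next to its source file is an assumption.

```shell
#!/bin/sh
# Convert every .jpg/.jpeg file in a directory to a .tif file beside it.
# Hypothetical helper; the real convertit script may behave differently.
# Set CONVERT to use another binary, e.g. "magick" on ImageMagick 7.
convert_jpgs() {
    dir=$1
    for jpg in "$dir"/*.jpg "$dir"/*.jpeg; do
        [ -e "$jpg" ] || continue      # skip patterns that matched nothing
        tif="${jpg%.*}.tif"            # same name, .tif extension
        "${CONVERT:-convert}" "$jpg" "$tif"
    done
}

# Example:
# convert_jpgs "Project Dir"
```

The glob patterns are case-sensitive, so files named `*.JPG` would need an extra pattern.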

2. Clean in Scan Tailor

Open the extracted PNGs in Scan Tailor to deskew, crop margins, and split two-page spreads into individual pages. Export the cleaned pages as TIFFs to the project's out/ subdirectory.

3. OCR and Combine

Run these from the repository root once the cleaned TIFFs are in the project's out/ subdirectory:

# Searchable PDF with OCR (recommended)
./makeit-pdf "Project Dir" output.pdf nor   # Norwegian
./makeit-pdf "Project Dir" output.pdf isl   # Icelandic
./makeit-pdf "Project Dir" output.pdf dan   # Danish
./makeit-pdf "Project Dir" output.pdf eng   # English

# Plain text output (Icelandic)
./makeit-txt "Project Dir" output.txt

# PDF without OCR (image-only)
./makeit-no_tesseract "Project Dir" output.pdf

# Combine pre-existing per-page PDFs
./makeit "Project Dir" output.pdf
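For orientation, the commands above could be built from per-page calls like the ones sketched here. This is an assumption about the approach, not the scripts' actual contents: the function names and the `TESSERACT`, `TIFF2PDF`, and `GS` override variables are hypothetical, while the command-line options themselves are standard Tesseract, libtiff, and Ghostscript usage.

```shell
#!/bin/sh
# Per-page building blocks for the OCR and combine step.
# Hypothetical helpers; the repository scripts may work differently.

# OCR one TIFF into a searchable PDF next to it (page.tif -> page.pdf).
# The second argument is a Tesseract language code; "nor" is the default here.
ocr_page() {
    tif=$1
    lang=${2:-nor}
    "${TESSERACT:-tesseract}" "$tif" "${tif%.*}" -l "$lang" pdf
}

# Wrap one TIFF in an image-only PDF, without OCR.
image_only_page() {
    tif=$1
    "${TIFF2PDF:-tiff2pdf}" -o "${tif%.*}.pdf" "$tif"
}

# Concatenate per-page PDFs, in the order given, into one output file.
combine_pdfs() {
    out=$1
    shift
    "${GS:-gs}" -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$out" "$@"
}

# Example:
# ocr_page "Project Dir/out/page1.tif" isl
# combine_pdfs output.pdf "Project Dir/out/page1.pdf" "Project Dir/out/page2.pdf"
```

Dropping the trailing `pdf` argument from the Tesseract call produces a plain-text file instead of a PDF.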

Pages are combined in natural numeric order, with title pages first.
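To illustrate what that ordering means, the sketch below sorts file names so that `page2` comes before `page10` and any name containing "title" comes first. It is a simplified stand-in for natural ordering that keys only on the first number in each name. The function `order_pages` and the "title" naming convention are assumptions, not taken from the scripts.

```shell
#!/bin/sh
# Read file names on stdin (one per line) and print them with title pages
# first, then the rest in ascending order of the first number in the name.
# Hypothetical helper; names containing tab characters are not supported.
order_pages() {
    while IFS= read -r name; do
        case "$name" in
            *[Tt]itle*) rank=0 ;;   # assumed convention: "title" in the name
            *)          rank=1 ;;
        esac
        # Extract the first run of digits (empty if the name has none).
        num=$(printf '%s\n' "$name" | sed 's/[^0-9]*\([0-9]*\).*/\1/')
        printf '%s\t%s\t%s\n' "$rank" "${num:-0}" "$name"
    done | sort -t "$(printf '\t')" -k1,1n -k2,2n | cut -f3-
}

# Example:
# printf '%s\n' page10.tif page2.tif title.tif page1.tif | order_pages
```

A plain lexical sort would place `page10.tif` before `page2.tif`, which is why a numeric key is needed when page numbers are not zero-padded.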

Other Utilities

# Combine PDFs in a project directory whose filenames match a keyword
./compactit "Project Dir" keyword output.pdf
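One way such keyword-based combining could work is to collect the matching PDFs with a glob and hand them to Ghostscript's `pdfwrite` device. The sketch below assumes exactly that. `combine_matching` and the `GS` override variable are hypothetical, and the actual `compactit` script may select or order files differently.

```shell
#!/bin/sh
# Combine all PDFs in a directory whose file names contain a keyword.
# Hypothetical helper; files are taken in the shell's glob (lexical) order.
combine_matching() {
    dir=$1
    keyword=$2
    out=$3
    set --                                # reuse "$@" as the list of matches
    for pdf in "$dir"/*"$keyword"*.pdf; do
        [ -e "$pdf" ] || continue         # skip a pattern that matched nothing
        set -- "$@" "$pdf"
    done
    if [ "$#" -eq 0 ]; then
        printf 'No PDFs matching "%s" in %s\n' "$keyword" "$dir" >&2
        return 1
    fi
    "${GS:-gs}" -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$out" "$@"
}

# Example:
# combine_matching "Project Dir" keyword output.pdf
```

Writing the output file outside the project directory avoids it being picked up as an input on a later run.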

About

Scantailor and Tesseract pipeline for scans.
