A workspace for digitizing and OCR-processing scanned historical documents, primarily Norwegian and Icelandic texts.
- Ghostscript (gs): PDF extraction and combining
- Tesseract + language packs: OCR
- Scan Tailor: GUI tool for cleaning and splitting page images
- ImageMagick (convert): JPG to TIFF conversion (if needed)
- tiff2pdf: TIFF to PDF conversion (if needed)
Install on macOS with MacPorts:
sudo port install ghostscript tesseract ImageMagick tiff
sudo port install scantailor
# Install Tesseract language packs as needed, e.g.:
sudo port install tesseract-nor tesseract-isl tesseract-dan tesseract-eng
Each project directory contains source scans and an out/ subdirectory for Scan Tailor output. All scripts are run from this root directory.
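Before running any of the scripts, it may help to confirm that the command-line tools listed above are on the PATH. The following is a minimal sketch and not part of the workspace scripts; the function name `missing_tools` is hypothetical, and the executable names it would be called with (`gs`, `convert`, `tiff2pdf`, `tesseract`) are taken from the tool list.

```shell
# Hypothetical helper: print the name of each given tool that is not on PATH.
# Prints nothing when every tool is available.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}
```

Calling `missing_tools gs tesseract convert tiff2pdf` prints one line per tool that still needs installing, and no output if all four are present.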
Extract pages from source PDFs as PNG images at 400 DPI:
./getit "Project Dir" # process all *.pdf files in the directory
./getit "Project Dir" path/to.pdf # process specific file(s)
If the source material is JPGs rather than PDFs, convert them to TIFF first:
./convertit "Project Dir"
Open the extracted PNGs in Scan Tailor to deskew, crop margins, and split two-page spreads into individual pages. Export the cleaned pages as TIFFs to the project's out/ subdirectory.
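For orientation, the extraction and JPG conversion steps that precede Scan Tailor could be carried out directly with Ghostscript and ImageMagick roughly as follows. This is a sketch under assumptions, not the contents of `getit` or `convertit`: the function names, the `png16m` output device, and the `name-0001.png` naming pattern are all illustrative choices. Only the 400 DPI resolution comes from the description above.

```shell
# Hypothetical helper: build the Ghostscript output pattern for a source PDF.
# $1 = source PDF, $2 = output directory. %04d is expanded by gs per page.
page_pattern() {
  base=$(basename "$1" .pdf)
  printf '%s/%s-%%04d.png\n' "$2" "$base"
}

# Hypothetical: render every page of one PDF as a 400 DPI colour PNG.
# The png16m device is an assumption; a grey or mono device may suit scans better.
extract_pages() {
  pdf=$1
  outdir=$2
  mkdir -p "$outdir"
  gs -q -dNOPAUSE -dBATCH -sDEVICE=png16m -r400 \
     -sOutputFile="$(page_pattern "$pdf" "$outdir")" "$pdf"
}

# Hypothetical: convert each *.jpg in a directory to a TIFF beside it.
jpg_to_tiff() {
  for jpg in "$1"/*.jpg; do
    [ -e "$jpg" ] || continue   # skip the literal pattern when nothing matches
    convert "$jpg" "${jpg%.jpg}.tif"
  done
}
```

Quoting every path matters here because project directory names such as "Project Dir" contain spaces.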
Run from the root once TIFFs are in project-dir/out/:
# Searchable PDF with OCR (recommended)
./makeit-pdf "Project Dir" output.pdf nor # Norwegian
./makeit-pdf "Project Dir" output.pdf isl # Icelandic
./makeit-pdf "Project Dir" output.pdf dan # Danish
./makeit-pdf "Project Dir" output.pdf eng # English
# Plain text output (Icelandic)
./makeit-txt "Project Dir" output.txt
# PDF without OCR (image-only)
./makeit-no_tesseract "Project Dir" output.pdf
# Combine pre-existing per-page PDFs
./makeit "Project Dir" output.pdf
Pages are combined in natural numeric order, with title pages first.
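The ordering rule can be sketched as a small filter. This is an illustration under assumptions rather than the logic used by the `makeit` scripts: it assumes title pages are files whose names begin with `title` (the description above does not say how they are identified), and it relies on `sort -V`, which GNU sort and recent macOS versions of sort provide. The `ocr_pages` function shows how such an ordering might feed Tesseract one page at a time. Both function names are hypothetical.

```shell
# Hypothetical filter: read filenames on stdin and print title pages first,
# then all remaining pages, each group in natural numeric order.
order_pages() {
  names=$(cat)
  [ -n "$names" ] || return 0
  printf '%s\n' "$names" | { grep -i '^title' || true; } | sort -V
  printf '%s\n' "$names" | { grep -vi '^title' || true; } | sort -V
}

# Hypothetical OCR loop: $1 = directory of cleaned TIFFs, $2 = language code.
# Writes one searchable PDF next to each TIFF (page1.tif -> page1.pdf).
ocr_pages() {
  dir=$1
  lang=$2
  ls "$dir" | grep -i '\.tif$' | order_pages | while IFS= read -r name; do
    tesseract "$dir/$name" "$dir/${name%.*}" -l "$lang" pdf </dev/null
  done
}
```

Natural ordering is what keeps `page10.tif` after `page2.tif`; a plain lexical sort would place it directly after `page1.tif`.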
# Combine PDFs in a project directory whose filenames match a keyword
./compactit "Project Dir" keyword output.pdf
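A keyword-based merge along these lines can be expressed with Ghostscript's `pdfwrite` device. The sketch below is an assumption about how such a step might work, not the implementation of `compactit`: the function names are hypothetical, matching is a case-sensitive substring test on the filename, and files are merged in the shell's glob order rather than natural numeric order.

```shell
# Hypothetical helper: list PDFs in a directory whose names contain a keyword.
# $1 = directory, $2 = keyword (glob characters in it are not escaped).
list_matching_pdfs() {
  for f in "$1"/*"$2"*.pdf; do
    if [ -e "$f" ]; then
      printf '%s\n' "$f"
    fi
  done
}

# Hypothetical: merge the matching PDFs into one file.
# $1 = directory, $2 = keyword, $3 = output PDF.
combine_matching() {
  dir=$1
  keyword=$2
  out=$3
  set --                        # reuse the positional parameters as a file list
  for f in "$dir"/*"$keyword"*.pdf; do
    if [ -e "$f" ]; then
      set -- "$@" "$f"
    fi
  done
  if [ "$#" -eq 0 ]; then
    echo "no PDFs matching '$keyword' in $dir" >&2
    return 1
  fi
  gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$out" "$@"
}
```

Running `list_matching_pdfs "Project Dir" keyword` first gives a quick preview of which files a merge would include.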