Scans

A workspace for digitizing and OCR-processing scanned historical documents, primarily Norwegian and Icelandic texts.

Dependencies

Ghostscript (gs) — PDF extraction and combining
Tesseract + language packs — OCR
Scan Tailor — GUI tool for cleaning and splitting page images
ImageMagick (convert) — JPG to TIFF conversion (if needed)
tiff2pdf — TIFF to PDF conversion (if needed)

Install on macOS with MacPorts:

sudo port install ghostscript tesseract ImageMagick tiff
sudo port install scantailor
# Install Tesseract language packs as needed, e.g.:
sudo port install tesseract-nor tesseract-isl tesseract-dan tesseract-eng

Workflow

Each project directory contains the source scans and an out/ subdirectory for Scan Tailor output. Run all scripts from the repository root.

1. Extract

Extract pages from source PDFs as PNG images at 400 DPI:

./getit "Project Dir"               # process all *.pdf files in the directory
./getit "Project Dir" path/to.pdf   # process specific file(s)
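
For orientation, the extraction step can be sketched with a plain Ghostscript call. This is a minimal sketch, not the contents of getit: the function name extract_pages, the output naming pattern, and the GS override variable are all assumptions made for illustration.

```shell
# Hypothetical helper; the real getit script may differ.
# Usage: extract_pages input.pdf output-dir
extract_pages() {
  local pdf="$1" outdir="$2"
  local base
  base="$(basename "$pdf" .pdf)"
  mkdir -p "$outdir"
  # -r400 renders at 400 DPI; png16m writes 24-bit colour PNGs.
  # %04d in the output name is replaced by the page number.
  # GS is an assumed override so another gs binary can be substituted.
  "${GS:-gs}" -dNOPAUSE -dBATCH -dSAFER -q \
    -sDEVICE=png16m -r400 \
    -sOutputFile="$outdir/$base-%04d.png" \
    "$pdf"
}
```

Calling extract_pages "Project Dir/book.pdf" "Project Dir" would write book-0001.png, book-0002.png, and so on into the project directory.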

If the source material consists of JPG images rather than PDFs, convert them to TIFF first:

./convertit "Project Dir"

2. Clean in Scan Tailor

Open the extracted PNGs in Scan Tailor to deskew, crop margins, and split two-page spreads into individual pages. Export the cleaned pages as TIFFs to the project's out/ subdirectory.

3. OCR and Combine

Once the cleaned TIFFs are in the project's out/ subdirectory, run one of the following from the repository root:

# Searchable PDF with OCR (recommended)
./makeit-pdf "Project Dir" output.pdf nor   # Norwegian
./makeit-pdf "Project Dir" output.pdf isl   # Icelandic
./makeit-pdf "Project Dir" output.pdf dan   # Danish
./makeit-pdf "Project Dir" output.pdf eng   # English

# Plain text output (Icelandic)
./makeit-txt "Project Dir" output.txt

# PDF without OCR (image-only)
./makeit-no_tesseract "Project Dir" output.pdf

# Combine pre-existing per-page PDFs
./makeit "Project Dir" output.pdf
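
As a rough illustration of what the OCR-and-combine step involves, the sketch below runs Tesseract on each TIFF to produce a searchable per-page PDF, then merges the pages with Ghostscript. It is not the makeit-pdf script itself: the function name ocr_and_combine, the .tif extension, and the TESSERACT and GS override variables are assumptions. The shell glob also sorts lexicographically, so this sketch does not reproduce the natural ordering the real scripts apply.

```shell
# Hypothetical helper; the real makeit-pdf script may differ.
# Usage: ocr_and_combine project-dir output.pdf lang
ocr_and_combine() {
  local dir="$1" output="$2" lang="$3"
  local tmp tif base
  tmp="$(mktemp -d)"
  for tif in "$dir"/out/*.tif; do
    [ -e "$tif" ] || continue   # skip the literal pattern if nothing matches
    base="$(basename "$tif" .tif)"
    # The trailing "pdf" selects Tesseract's searchable-PDF output,
    # which is written to $tmp/$base.pdf.
    "${TESSERACT:-tesseract}" "$tif" "$tmp/$base" -l "$lang" pdf
  done
  # pdfwrite concatenates the per-page PDFs in argument order.
  "${GS:-gs}" -dNOPAUSE -dBATCH -q -sDEVICE=pdfwrite \
    -sOutputFile="$output" "$tmp"/*.pdf
  rm -rf "$tmp"
}
```

For example, ocr_and_combine "Project Dir" output.pdf isl would OCR every TIFF in "Project Dir/out" as Icelandic.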

Pages are combined in natural numeric order, with title pages first.
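
One way to get this ordering is sketched below. It assumes that title pages can be recognized by a file name starting with "title" (a hypothetical convention, since the naming scheme is not stated here) and that the local sort supports -V (version sort), as GNU coreutils and recent BSD/macOS versions do.

```shell
# Hypothetical helper: read file names on stdin, print title pages first,
# then the remaining pages, each group in natural (version) order so that
# page-2 sorts before page-10.
order_pages() {
  local names
  names="$(cat)"
  [ -n "$names" ] || return 0
  printf '%s\n' "$names" | grep -i '^title' | sort -V
  printf '%s\n' "$names" | grep -iv '^title' | sort -V
}
```

A plain lexicographic sort would place page-10.tif before page-2.tif; version sort compares the embedded numbers instead.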

Other Utilities

# Combine PDFs in a project directory whose filenames match a keyword
./compactit "Project Dir" keyword output.pdf
