This Python project automates the extraction of key data from invoice images using Tesseract OCR and OpenCV. It is designed to handle batch processing of multiple invoices stored in a folder, extracting essential fields like Invoice Number, Date, and Total Amount, validating completeness, and saving all results into a structured Excel file. The script supports English and Arabic invoices, performs basic image preprocessing, and provides a summary of missing fields for incomplete invoices. It is ideal for businesses or developers who need a quick and reliable way to digitize invoices for record-keeping, reporting, or integration with other systems.
- Batch process multiple invoice images from a folder
- Extract essential invoice data:
- Invoice Number
- Date
- Total Amount
- Validate required fields and flag incomplete invoices
- Support for English and Arabic invoices
- Image preprocessing to improve OCR accuracy (grayscale, Gaussian blur, thresholding)
- Append results to existing Excel file or create a new one
- Track processing timestamp and source file name
- Python 3.8+
- Libraries:
pip install opencv-python pytesseract pandas openpyxl- Tesseract OCR installed and path configured:
pytesseract.pytesseract.tesseract_cmd = r"C:\tesseract\tesseract.exe"
- Place all invoice images (PNG, JPG, JPEG, TIFF) in a folder, e.g., images/.
- Set the
INPUT_FOLDERandOUTPUT_EXCELpaths in the script. - After processing, check the
data/invoices.xlsxfile for extracted data.
| Invoice Number | Date | Total Amount | File Name | Processed At | Status | Missing Fields |
|---|---|---|---|---|---|---|
| INV-001 | 2026-01-31 | 450.00 | invoice1.png | 2026-02-03 13:00:00 | Completed | None |
| INV-002 | 780.00 | invoice2.png | 2026-02-03 13:01:00 | Incomplete | Date |
- The script loads each image and converts it to grayscale.
- Applies Gaussian blur and thresholding to enhance text for OCR.
- Uses Tesseract OCR to extract text from the image.
- Validates if the document contains the word invoice (English or Arabic).
- Extracts
Invoice Number,Date, andTotal Amountusing regex. - Marks missing fields and sets invoice status as
CompletedorIncomplete. - Saves all results to an Excel spreadsheet, appending new invoices to existing records.