Welcome to the AI-Powered OCR Solution

Upload, Scan, and Extract Text from Documents Instantly

About This Project

This project, "Intelligently Extract Text & Data from Documents using OCR & NER", was developed by Subrat Gupta as a practical implementation of modern Computer Vision and NLP techniques. The goal is to extract key entities such as names, phone numbers, emails, and organizations from scanned documents, with a focus on business cards. This demonstration uses business cards for data privacy reasons, but the same framework can be adapted to other documents such as invoices, shipping bills, or financial records.

To build this project, I integrated two core technologies in Data Science:

Computer Vision
Natural Language Processing (NLP)

In the Computer Vision module, the system scans the uploaded document, enhances its quality, and detects the position of textual elements. Then, using NLP techniques, it processes the extracted text, cleans it, and uses a custom-trained NER model to identify structured data.

Python libraries used in the Computer Vision module:

OpenCV – for image preprocessing and contour detection
NumPy – for numerical operations
Pytesseract – for Optical Character Recognition (OCR)

Python libraries used in the Natural Language Processing module:

spaCy – for training and using the custom NER model
Pandas – for data manipulation and structuring
Regular Expressions (the built-in re module) – for pattern-based text cleaning
String (the built-in string module) – for character-level processing
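
As a rough illustration of how regular expressions and the string module can work together to clean OCR output, here is a minimal sketch. The names `clean_token` and `clean_text`, and the set of punctuation characters that is kept, are assumptions made for this example and not the project's actual rules.

```python
import re
import string

# Assumption: keep punctuation that often carries meaning on business cards
# (e.g. "@" and "." in emails, "+" and "-" in phone numbers).
KEEP = "@.+-/:"
REMOVE = "".join(ch for ch in string.punctuation if ch not in KEEP)
_STRIP_TABLE = str.maketrans("", "", REMOVE)


def clean_token(token: str) -> str:
    """Normalize whitespace and drop punctuation outside the KEEP set."""
    token = re.sub(r"\s+", " ", token).strip()
    return token.translate(_STRIP_TABLE)


def clean_text(raw: str) -> list[str]:
    """Split raw OCR text on whitespace and return the non-empty cleaned tokens."""
    tokens = (clean_token(t) for t in raw.split())
    return [t for t in tokens if t]


print(clean_text("John  Doe|\n(CEO)  john.doe@example.com  +1-555-0100"))
```

Tokens that consist only of removed characters (a stray "|", for instance) become empty and are dropped, which is one simple way to discard OCR noise before annotation.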

The entire project is divided into multiple development stages for better understanding and modular implementation:

Stage 1: Project Setup

Install Python and necessary libraries
Organize the file structure

Stage 2: Data Preparation

Collect document images (e.g., business cards)
Extract raw text using Pytesseract
Clean and prepare text for annotation
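
The extraction step above can be sketched as follows. When called with `output_type=pytesseract.Output.DICT`, Pytesseract's `image_to_data` returns a dictionary of parallel lists with keys such as `text`, `conf`, `left`, `top`, `width`, and `height`. The helper below works on a dictionary of that shape, so it can be tried without Tesseract installed. The function name `ocr_words`, the confidence threshold, and the sample values are assumptions for illustration only.

```python
def ocr_words(data: dict, min_conf: float = 30.0) -> list[dict]:
    """Turn a pytesseract-style dict of parallel lists into one record per word.

    `data` is assumed to have the shape returned by
    pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT).
    """
    words = []
    for i, text in enumerate(data["text"]):
        text = text.strip()
        # conf may arrive as a string or a number; -1 marks non-word rows
        conf = float(data["conf"][i])
        if not text or conf < min_conf:
            continue
        words.append({
            "text": text,
            "conf": conf,
            "box": (data["left"][i], data["top"][i],
                    data["width"][i], data["height"][i]),
        })
    return words


# Hand-written sample in the same shape, standing in for real OCR output.
sample = {
    "text": ["", "Jane", "Smith", "~"],
    "conf": ["-1", "96", "91", "12"],
    "left": [0, 10, 60, 120],
    "top": [0, 12, 12, 14],
    "width": [200, 40, 50, 5],
    "height": [50, 14, 14, 10],
}
print([w["text"] for w in ocr_words(sample)])
```

Keeping each word's box next to its text is what later makes it possible to draw rectangles around predicted entities.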

Stage 3: Manual Data Labeling (BIO Tagging)

Label entities using the BIO format:
- B – Beginning of an entity
- I – Inside of an entity
- O – Outside or not part of any entity
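
To make the BIO scheme concrete, here is a small sketch that collapses token-level tags back into whole entities. The labels `NAME` and `ORG` and the function name `bio_to_entities` are illustrative assumptions; the project's actual label set may differ.

```python
def bio_to_entities(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Group consecutive B-/I- tagged tokens into (label, text) pairs."""
    entities = []
    current_label, current_tokens = None, []

    def flush():
        nonlocal current_label, current_tokens
        if current_label is not None:
            entities.append((current_label, " ".join(current_tokens)))
        current_label, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()                       # a B- tag always opens a new entity
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)  # continue the open entity
        else:
            flush()                       # "O" or a stray I- tag closes any open entity
    flush()
    return entities


tokens = ["Jane", "Smith", "CEO", "Acme", "Corp"]
tags = ["B-NAME", "I-NAME", "O", "B-ORG", "I-ORG"]
print(bio_to_entities(tokens, tags))
# [('NAME', 'Jane Smith'), ('ORG', 'Acme Corp')]
```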

Stage 4: Data Preprocessing

Convert the labeled text into a spaCy-compatible training format
Split data into training and testing sets
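
One way to picture this stage: BIO-tagged tokens are converted into the character-offset form commonly used for spaCy NER training data, `(text, {"entities": [(start, end, label)]})`, and the examples are then shuffled and split. This is a sketch under assumptions: the helper names `to_offsets` and `split_data`, the 80/20 ratio, and the fixed seed are hypothetical choices, not the project's confirmed settings.

```python
import random


def to_offsets(tokens, tags):
    """Join tokens with single spaces and convert BIO tags to (start, end, label) offsets."""
    text_parts, entities = [], []
    pos = 0
    open_ent = None  # [start, end, label] of the entity being built
    for token, tag in zip(tokens, tags):
        start, end = pos, pos + len(token)
        if tag.startswith("B-"):
            if open_ent:
                entities.append(tuple(open_ent))
            open_ent = [start, end, tag[2:]]
        elif tag.startswith("I-") and open_ent and open_ent[2] == tag[2:]:
            open_ent[1] = end  # extend the open entity across the joining space
        else:
            if open_ent:
                entities.append(tuple(open_ent))
            open_ent = None
        text_parts.append(token)
        pos = end + 1  # +1 for the space that will follow this token
    if open_ent:
        entities.append(tuple(open_ent))
    return " ".join(text_parts), {"entities": entities}


def split_data(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


example = to_offsets(["Jane", "Smith", "CEO", "Acme", "Corp"],
                     ["B-NAME", "I-NAME", "O", "B-ORG", "I-ORG"])
print(example)
```

Seeding the shuffle keeps the split reproducible, so accuracy figures from different training runs are measured on the same held-out cards.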

Stage 5: Model Training

Configure and train the custom NER model using spaCy
Validate model accuracy on test samples

Stage 6: Deployment and Prediction

Load and serve the trained model using Flask
Render results with displaCy visualization
Draw bounding boxes around predicted entities
Build a complete document scanner web app
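
Drawing a box around a predicted entity typically means merging the word-level boxes of the tokens that make it up. The sketch below shows that merge in plain Python. The function name `union_box` and the sample coordinates are assumptions; the input boxes use the `(left, top, width, height)` layout that OCR word data usually provides.

```python
def union_box(boxes):
    """Return the smallest (x1, y1, x2, y2) rectangle covering all (left, top, width, height) boxes."""
    if not boxes:
        raise ValueError("at least one box is required")
    x1 = min(left for left, _, _, _ in boxes)
    y1 = min(top for _, top, _, _ in boxes)
    x2 = max(left + width for left, _, width, _ in boxes)
    y2 = max(top + height for _, top, _, height in boxes)
    return x1, y1, x2, y2


# Word boxes for the tokens of one predicted entity, e.g. "Jane" and "Smith".
name_boxes = [(10, 12, 40, 14), (60, 12, 50, 15)]
print(union_box(name_boxes))  # (10, 12, 110, 27)
```

With OpenCV, the resulting corners could then be passed to `cv2.rectangle(image, (x1, y1), (x2, y2), color, thickness)` to draw the box on the scanned image.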

All of these stages come together in a fully functional web application built using Flask, where users can upload document images and get real-time predictions through the browser interface.

Developed By: Subrat Gupta
B.Tech CSE, GITAM University