End-to-end OCR and data-extraction system that processes healthcare PDFs using a modern, scalable backend architecture.

🏥 Clinical Document Intelligence Pipeline (CDIP)

Transform healthcare document chaos into organized, actionable insights — automatically.

Version 1.0 🎉

Healthcare clinics and pharmacies receive hundreds of documents every day. Manually sorting, classifying, and filing them is time-consuming, error-prone, and takes staff away from what matters most: patient care.

CDIP automates this entire process. Upload a PDF, and watch as our intelligent pipeline automatically reads, understands, classifies, and matches documents to the right patients — in seconds, not hours.

🚀 Version 1.0 Highlights

We've achieved production-ready performance with comprehensive load testing and optimization:

  • Scalable Architecture - Handles 200+ concurrent users with healthy performance
  • Optimized Queue Processing - Eliminated queue overload through multi-worker scaling
  • Excellent Response Times - Average 155ms API response time, P95 under 250ms
  • 99.4% Job Completion Rate - Reliable processing even under high load
  • System Health Score: 80/100 - Production-ready metrics

📊 See our Load Test Performance Analysis for detailed metrics and optimization journey.

What's Next: Version 2.0

Version 2.0 will transform CDIP into a full-featured application with:

  • Modern web UI for document management
  • Enhanced user experience and workflows
  • Advanced features and integrations
  • And much more coming soon!


About the Project

CDIP is an intelligent document processing pipeline designed for healthcare environments. It automatically:

  • Reads documents using PDF text extraction (pdfjs-dist) and OCR (planned for scanned documents)
  • Classifies document types (prescriptions, lab reports, clinical notes) using rule-based keyword matching (see the sketch after this list)
  • Extracts structured data (patient names, health card numbers, medications, provider info, dates, etc.)
  • Matches documents to the correct patient records using health card numbers and fuzzy name/DOB matching
  • Stores everything in a searchable, organized format
  • Monitors system health and performance with a real-time metrics dashboard

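Conceptually, the classification and matching stages work roughly like the sketch below. The keyword lists, confidence formula, and function names are illustrative assumptions rather than the project's exact code (which lives in classify.processor.ts and match.processor.ts), and the real matcher uses fuzzy name comparison where this sketch uses an exact one.

// Illustrative sketch only - keyword lists, scoring, and names are assumptions.
type DocType = 'prescription' | 'lab_report' | 'clinic_note';

const KEYWORDS: Record<DocType, string[]> = {
  prescription: ['rx', 'medication', 'dispense', 'refill'],
  lab_report: ['specimen', 'reference range', 'result'],
  clinic_note: ['chief complaint', 'assessment', 'plan'],
};

// Score each type by the share of its keywords found in the extracted text.
function classify(text: string): { label: DocType; confidence: number } {
  const lower = text.toLowerCase();
  let best = { label: 'clinic_note' as DocType, confidence: 0 };
  for (const [label, words] of Object.entries(KEYWORDS) as [DocType, string[]][]) {
    const hits = words.filter((w) => lower.includes(w)).length;
    const confidence = Math.round((hits / words.length) * 100);
    if (confidence > best.confidence) best = { label, confidence };
  }
  return best;
}

interface Patient { _id: string; fullName: string; dob: string; healthCard: string; }

// Prefer an exact health-card match; fall back to name + DOB (fuzzy in the real pipeline).
function matchPatient(
  extracted: { healthCard?: string; patientName?: string; dob?: string },
  patients: Patient[],
): Patient | undefined {
  const byCard = patients.find((p) => p.healthCard === extracted.healthCard);
  if (extracted.healthCard && byCard) return byCard;
  return patients.find(
    (p) =>
      extracted.patientName !== undefined &&
      extracted.dob !== undefined &&
      p.fullName.toLowerCase() === extracted.patientName.toLowerCase() &&
      p.dob.startsWith(extracted.dob),
  );
}
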
Built with a production-ready architecture featuring:

  • Stateless backend API
  • Queue-based processing for scalability
  • Object storage for files
  • Metadata in MongoDB for fast queries
  • Real-time metrics dashboard and monitoring
  • Complete document processing pipeline (OCR, classification, extraction, matching)

Tech Stack

Backend & API

  • NestJS - Modern Node.js framework
  • TypeScript - Type-safe development
  • MongoDB - Document database
  • Mongoose - MongoDB ODM

Processing & Queue

  • BullMQ - Job queue system
  • Redis - Queue backend and metrics pub/sub
  • pdfjs-dist - PDF text extraction
  • Tesseract.js - OCR engine (for scanned documents - planned)
  • Sharp - Image processing (planned)

Storage

  • MinIO - S3-compatible object storage

Architecture

  • pnpm workspaces - Monorepo management
  • Docker Compose - MinIO container orchestration
  • Socket.IO - WebSocket server for real-time metrics

Monitoring & Metrics

  • Real-time Dashboard - Live metrics visualization
  • WebSocket Gateway - Real-time updates
  • Metrics API - Programmatic access to system metrics

Prerequisites

Before you begin, ensure you have the following installed:

  • Node.js v18 or later
  • pnpm 8.0 or later
  • Docker 20.0 or later (with Docker Compose)

Verify Installation

node --version    # Should be v18+
pnpm --version    # Should be 8.0+
docker --version  # Should be 20.0+
docker-compose --version

Installation

1. Clone the Repository

git clone https://github.com/ankurrokad/document-classifier.git
cd document-classifier

2. Install Dependencies

pnpm install

This will install all dependencies for all packages in the monorepo.

3. Start Infrastructure Services

Start MinIO (Docker)

Start MinIO using Docker Compose:

docker-compose up -d

This starts:

  • MinIO S3 API on port 9000
  • MinIO Console on port 9001 (UI for managing buckets)

Access MinIO Console at http://localhost:9001 (login: minioadmin / minioadmin123)

Setup MongoDB Atlas

  1. Create a MongoDB Atlas account at mongodb.com/cloud/atlas
  2. Create a new cluster (free tier available)
  3. Create a database user and get your connection string
  4. Add your connection string to .env (see Configuration section below)

Setup Local Redis

Install and start Redis locally:

Windows:

  • Download from redis.io/download or use WSL
  • Or install via Chocolatey: choco install redis-64

Mac:

brew install redis
brew services start redis

Linux:

sudo apt-get install redis-server
sudo systemctl start redis

Verify Redis is running:

redis-cli ping
# Should return: PONG

4. Configure Environment Variables

Create a .env file in the root directory:

cp .env.example .env  # If you have an example file
# Or create .env manually

Add the following environment variables:

# MongoDB Atlas
MONGO_URI=mongodb+srv://username:password@cluster.mongodb.net/document-classifier?retryWrites=true&w=majority

# Redis (Local)
REDIS_HOST=localhost
REDIS_PORT=6379

# MinIO (Docker)
MINIO_ENDPOINT=localhost
MINIO_PORT=9000
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=minioadmin123
MINIO_BUCKET_ORIGINAL=documents-original
MINIO_SSL=false

# Optional: Synthetic Data Generation
# Note: Synthetic data is now stored locally in .data/ folder for load testing
SYNTH_DOC_COUNT=200

Notes:

  • Replace MONGO_URI with your actual MongoDB Atlas connection string
  • Redis runs locally on default port 6379
  • MinIO credentials match the default Docker Compose setup
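
The storage library reads the MinIO settings from these variables. As a rough illustration of how they might be consumed (the actual implementation is packages/libs/storage/src/minio.client.ts and may differ), assuming the minio npm client:

import { Client } from 'minio';

// Build the S3-compatible client from the environment; defaults mirror docker-compose.
const minio = new Client({
  endPoint: process.env.MINIO_ENDPOINT ?? 'localhost',
  port: Number(process.env.MINIO_PORT ?? 9000),
  useSSL: process.env.MINIO_SSL === 'true',
  accessKey: process.env.MINIO_ACCESS_KEY ?? 'minioadmin',
  secretKey: process.env.MINIO_SECRET_KEY ?? 'minioadmin123',
});

// Create the originals bucket on startup if it does not exist yet.
export async function ensureBucket(
  bucket = process.env.MINIO_BUCKET_ORIGINAL ?? 'documents-original',
): Promise<void> {
  if (!(await minio.bucketExists(bucket))) {
    await minio.makeBucket(bucket, 'us-east-1');
  }
}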

Running the Application

The application consists of two main processes that need to run simultaneously:

Option 1: Development Mode (Using pnpm)

For development and testing, you can run both processes using pnpm:

1. Start the API Server

In your first terminal:

pnpm dev:api

This will:

  • Build the storage library
  • Start the NestJS API server
  • Run on http://localhost:3000

You should see:

API running on http://localhost:3000

2. Start the Worker

In a second terminal:

pnpm dev:worker

This will:

  • Build the storage library
  • Start a single BullMQ worker process
  • Connect to Redis and begin processing jobs

You should see:

Worker started...
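
Under the hood, the worker is a BullMQ consumer on the doc:process queue (the same queue name referenced in Troubleshooting). A simplified sketch of its shape, with the job payload and stage calls assumed for illustration:

import { Worker } from 'bullmq';

const worker = new Worker(
  'doc:process',
  async (job) => {
    // The payload shape is assumed here; the real processors live in
    // packages/pipeline/src/processors (ocr, classify, extract, match).
    const { documentId } = job.data;
    // await ocr(documentId); await classify(documentId);
    // await extract(documentId); await match(documentId);
    return { documentId, status: 'completed' };
  },
  {
    connection: {
      host: process.env.REDIS_HOST ?? 'localhost',
      port: Number(process.env.REDIS_PORT ?? 6379),
    },
  },
);

worker.on('completed', (job) => console.log(`Job ${job.id} completed`));
worker.on('failed', (job, err) => console.error(`Job ${job?.id} failed: ${err.message}`));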

Option 2: Production Mode with PM2 (Recommended for Performance)

For better performance and throughput, use PM2 to run multiple worker instances and manage both apps:

Prerequisites:

  • Install PM2 globally: npm install -g pm2
  • Build all packages: pnpm build:all

Start all apps:

pm2 start ecosystem.config.js

This will:

  • Start the API server (1 instance)
  • Start 4 worker instances in parallel
  • Auto-restart apps on crashes
  • Log all output to logs/ directory

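The repo ships its own ecosystem.config.js; for orientation, a config for this layout typically looks like the sketch below (the log file names and cluster setting are assumptions; the app names, script paths, and instance counts follow this README):

module.exports = {
  apps: [
    {
      name: 'doc-api',
      script: 'packages/backend/dist/main.js',
      instances: 1,
      autorestart: true,
      out_file: 'logs/doc-api.out.log',
      error_file: 'logs/doc-api.err.log',
    },
    {
      name: 'doc-worker',
      script: 'packages/pipeline/dist/worker.js',
      instances: 4,            // four parallel workers, per the v1.0 scaling notes
      exec_mode: 'cluster',
      autorestart: true,
      out_file: 'logs/doc-worker.out.log',
      error_file: 'logs/doc-worker.err.log',
    },
  ],
};
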
Start individual apps:

pm2 start ecosystem.config.js --only doc-api      # Start only API server
pm2 start ecosystem.config.js --only doc-worker    # Start only workers

Manage apps:

pm2 status                    # Check status of all apps
pm2 logs                      # View logs from all apps
pm2 logs doc-api              # View API logs only
pm2 logs doc-worker           # View worker logs only
pm2 restart doc-api           # Restart API server
pm2 restart doc-worker        # Restart all workers
pm2 stop doc-api              # Stop API server
pm2 stop doc-worker           # Stop all workers
pm2 stop all                  # Stop all apps
pm2 delete all                # Remove all apps from PM2

Note: PM2 mode is recommended when processing high volumes of documents or when you need more throughput; development mode is fine for local testing. You can also mix approaches, for example running the API with pnpm dev:api and the workers with PM2.

3. Verify Everything is Running

  • API: Visit http://localhost:3000 (should respond or show 404 for unknown routes)
  • Metrics Dashboard: Visit http://localhost:3000/dashboard (real-time monitoring dashboard)
  • MinIO Console: Visit http://localhost:9001 (login with minioadmin / minioadmin123)
  • MongoDB Atlas: Check your Atlas dashboard to verify cluster is running
  • Redis: Run redis-cli ping (should return PONG)

Project Structure

document-classifier/
├── packages/
│   ├── backend/              # NestJS API server
│   │   ├── src/
│   │   │   ├── modules/
│   │   │   │   ├── documents/    # Document upload & retrieval
│   │   │   │   ├── patients/     # Patient management
│   │   │   │   └── metrics/      # Metrics dashboard & monitoring
│   │   │   └── schemas/           # MongoDB schemas (legacy, using DAL now)
│   │   └── package.json
│   │
│   ├── pipeline/            # Worker processes
│   │   ├── src/
│   │   │   ├── processors/        # Pipeline stage processors
│   │   │   │   ├── ocr.processor.ts
│   │   │   │   ├── classify.processor.ts
│   │   │   │   ├── extract.processor.ts
│   │   │   │   └── match.processor.ts
│   │   │   ├── metrics/           # Metrics reporting
│   │   │   └── worker.ts          # BullMQ worker
│   │   └── package.json
│   │
│   ├── libs/
│   │   ├── storage/         # Shared MinIO client
│   │   │   ├── src/
│   │   │   │   └── minio.client.ts
│   │   │   └── package.json
│   │   └── dal/             # Data Access Layer
│   │       ├── src/
│   │       │   ├── schemas/       # MongoDB schemas
│   │       │   ├── models/        # Data models
│   │       │   └── connection.ts
│   │       └── package.json
│   │
│   └── synth-data/          # Synthetic data generator
│       ├── src/
│       │   ├── templates/         # Document templates
│       │   └── generate.ts
│       └── package.json
│
├── docker-compose.yaml      # Infrastructure services
├── package.json            # Root package.json (workspace config)
├── pnpm-workspace.yaml     # Workspace configuration
└── .env                    # Environment variables (create this)

API Endpoints

Document Endpoints

Upload Document

Upload a PDF document for processing.

POST /documents
Content-Type: multipart/form-data

# Using curl
curl -X POST http://localhost:3000/documents \
  -F "file=@path/to/document.pdf"

# Response
{
  "documentId": "507f1f77bcf86cd799439011"
}
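
If you prefer to script uploads instead of using curl, here is a small Node 18+ example (not part of the repo) that uses the built-in fetch, FormData, and Blob globals:

import { readFile } from 'node:fs/promises';

async function uploadPdf(path: string): Promise<string> {
  const form = new FormData();
  form.append('file', new Blob([await readFile(path)], { type: 'application/pdf' }), 'document.pdf');

  const res = await fetch('http://localhost:3000/documents', { method: 'POST', body: form });
  if (!res.ok) throw new Error(`Upload failed with HTTP ${res.status}`);

  const { documentId } = await res.json();
  return documentId;
}

uploadPdf('path/to/document.pdf').then((id) => console.log('Uploaded as', id)).catch(console.error);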

Get Document

Retrieve a single document by ID with full details.

GET /documents/:id

# Using curl
curl http://localhost:3000/documents/507f1f77bcf86cd799439011

# Response
{
  "_id": "507f1f77bcf86cd799439011",
  "status": "completed",
  "originalObjectKey": "documents/original/507f1f77bcf86cd799439011.pdf",
  "classification": {
    "label": "prescription",
    "confidence": 85
  },
  "extractedData": {
    "patientName": "John Doe",
    "healthCard": "1234567890",
    "dob": "1990-01-01T00:00:00.000Z",
    "medications": [...]
  },
  "matchedPatientId": {...},
  "processingLogs": [...]
}

List Documents

List all documents with pagination and filtering.

GET /documents?limit=50&skip=0&status=completed&type=prescription

# Query Parameters:
# - limit: Number of results (default: 50)
# - skip: Number of results to skip (default: 0)
# - status: Filter by status (uploaded, processing, completed, failed)
# - type: Filter by document type (prescription, lab_report, clinic_note)

# Response
{
  "documents": [...],
  "total": 150,
  "limit": 50,
  "skip": 0
}

Patient Endpoints

List Patients

List all patients with pagination.

GET /patients?limit=50&skip=0

# Response
{
  "patients": [...],
  "total": 75,
  "limit": 50,
  "skip": 0
}

Get Patient

Retrieve a single patient by ID with associated documents.

GET /patients/:id

# Response
{
  "_id": "507f1f77bcf86cd799439012",
  "firstName": "John",
  "lastName": "Doe",
  "fullName": "John Doe",
  "dob": "1990-01-01T00:00:00.000Z",
  "healthCard": "1234567890",
  "documents": [...]
}

Create Patient

Create a new patient record.

POST /patients
Content-Type: application/json

# Request Body
{
  "firstName": "John",
  "lastName": "Doe",
  "dob": "1990-01-01",
  "healthCard": "1234567890"
}

# Response
{
  "_id": "507f1f77bcf86cd799439012",
  "firstName": "John",
  "lastName": "Doe",
  ...
}

Metrics Endpoints

Metrics Dashboard

Access the real-time metrics dashboard in your browser.

GET /dashboard

Visit http://localhost:3000/dashboard to see:

  • Queue metrics (waiting, active, completed jobs)
  • Processing metrics (avg time, percentiles, job counts)
  • System metrics (memory, CPU, event loop lag)
  • API metrics (requests per minute, response times)
  • Active worker count

The dashboard uses WebSocket for real-time updates (updates every 2 seconds by default).

Metrics API

Get metrics data programmatically.

GET /api/metrics

# Response
{
  "timestamp": "2024-01-15T10:30:00.000Z",
  "queue": {
    "waiting": 5,
    "active": 2,
    "completed": 150,
    "failed": 3
  },
  "processing": {
    "avgTime": 8500,
    "p50": 8000,
    "p95": 12000,
    "p99": 15000,
    "totalProcessed": 150
  },
  "system": {
    "memory": {...},
    "cpu": {...},
    "eventLoopLag": 2.5
  },
  "api": {
    "requestsPerMinute": 12.5,
    "avgResponseTime": 45,
    "totalRequests": 500
  },
  "workers": {
    "active": 2
  }
}
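
For automation (alerting, scaling experiments), you can poll this endpoint. A minimal consumer, not part of the repo, with an arbitrary threshold:

const METRICS_URL = 'http://localhost:3000/api/metrics';

async function checkQueueDepth(): Promise<void> {
  const res = await fetch(METRICS_URL);
  const metrics = await res.json();
  const { waiting, active, failed } = metrics.queue;
  console.log(`waiting=${waiting} active=${active} failed=${failed}`);
  if (waiting > 100) {
    console.warn('Queue is backing up; consider adding workers (see the PM2 section).');
  }
}

setInterval(() => checkQueueDepth().catch(console.error), 10_000); // poll every 10s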

Development

Available Scripts

Root Level

# Development
pnpm dev:api          # Start API server in watch mode
pnpm dev:worker       # Start worker in watch mode

# Build
pnpm build:dal        # Build DAL library
pnpm build:storage    # Build storage library
pnpm build:synth-data # Build synthetic data generator
pnpm build:backend    # Build backend (includes dependencies)
pnpm build:pipeline   # Build pipeline (includes dependencies)
pnpm build:all        # Build all packages

# Data Generation
pnpm gen:data         # Generate and upload synthetic documents

# Code Quality
pnpm lint             # Run ESLint
pnpm format           # Format code with Prettier

Package Level

You can also run scripts in specific packages:

# From root
pnpm --filter backend start:dev
pnpm --filter pipeline start:dev
pnpm --filter @doc-clf/synth-data gen

Building for Production

# Build all packages
pnpm build:backend
pnpm build:pipeline

# Run production builds
node packages/backend/dist/main.js
node packages/pipeline/dist/worker.js

Generating Synthetic Data

To generate test documents for load testing:

# Generate patients first (if not already done)
pnpm gen:patients 100

# Generate synthetic documents (saves to .data/documents/original/)
pnpm gen:documents

This will:

  • Generate PDFs (prescriptions, lab reports, clinic notes)
  • Save them locally to .data/documents/original/ folder at project root
  • Include JSON metadata files
  • Note: Files are stored locally (not in MinIO) for optimal load testing performance

The load test reads from this local .data directory, eliminating MinIO dependency during load tests.


Configuration

Environment Variables

Required Variables

  • MONGO_URI - MongoDB Atlas connection string (e.g. mongodb+srv://user:pass@cluster.mongodb.net/db)
  • REDIS_HOST - Redis hostname (local), e.g. localhost
  • REDIS_PORT - Redis port (local), e.g. 6379
  • MINIO_ENDPOINT - MinIO server hostname (Docker), e.g. localhost
  • MINIO_PORT - MinIO server port (Docker), e.g. 9000
  • MINIO_ACCESS_KEY - MinIO access key, e.g. minioadmin
  • MINIO_SECRET_KEY - MinIO secret key, e.g. minioadmin123
  • MINIO_BUCKET_ORIGINAL - Bucket for original documents, e.g. documents-original
  • MINIO_SSL - Use SSL for MinIO, e.g. false

Optional Variables

  • SYNTH_DOC_COUNT - Number of synthetic documents to generate (default: 200)
  • METRICS_HISTORY_SIZE - Maximum number of metrics records to keep in memory (default: 1000)
  • METRICS_UPDATE_INTERVAL - Dashboard WebSocket update interval in ms (default: 2000)

MinIO Bucket Setup

The storage library automatically creates buckets if they don't exist. However, you can also manage them via the MinIO Console:

  1. Visit http://localhost:9001
  2. Login with minioadmin / minioadmin123
  3. Create buckets manually if needed:
    • documents-original (for uploaded PDFs)
    • documents-processed (for processed artifacts - future)

MongoDB Atlas Setup

  1. Log in to MongoDB Atlas
  2. Create a new cluster (free M0 tier is sufficient for development)
  3. Create a database user:
    • Go to Database Access → Add New Database User
    • Choose password authentication
    • Save the username and password
  4. Whitelist your IP:
    • Go to Network Access → Add IP Address
    • Add 0.0.0.0/0 for development (or your specific IP)
  5. Get your connection string:
    • Go to Clusters → Connect → Connect your application
    • Copy the connection string
    • Replace <password> with your database user password
    • Add your database name before the query string, e.g. .../document-classifier?retryWrites=true&w=majority
  6. Add the connection string to your .env file as MONGO_URI

Troubleshooting

Common Issues

1. Port Already in Use

If you see port conflicts:

# Check what's using the port
# Windows
netstat -ano | findstr :3000
# Mac/Linux
lsof -i :3000

# Stop the conflicting process or change ports in .env

2. MongoDB Atlas Connection Failed

# Verify your connection string in .env
# Format: mongodb+srv://username:password@cluster.mongodb.net/database

# Check if your IP is whitelisted in Atlas
# Go to Network Access in Atlas dashboard

# Test connection string
mongosh "your-connection-string-here"

3. Redis Connection Failed

# Check if Redis is running locally
redis-cli ping
# Should return: PONG

# If not running, start Redis:
# Mac: brew services start redis
# Linux: sudo systemctl start redis
# Windows: Start Redis service or use WSL

4. MinIO Connection Failed

# Check if MinIO is running
docker ps | grep minio

# Verify MinIO is accessible
curl http://localhost:9000/minio/health/live

5. Missing Environment Variables

The application will exit with a clear error message if required environment variables are missing. Check your .env file is in the root directory and contains all required variables.
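
Such a check is typically a simple guard at startup; roughly (a sketch only, the repo's actual validation may differ):

const REQUIRED = ['MONGO_URI', 'REDIS_HOST', 'REDIS_PORT', 'MINIO_ENDPOINT', 'MINIO_ACCESS_KEY', 'MINIO_SECRET_KEY'];

const missing = REQUIRED.filter((name) => !process.env[name]);
if (missing.length > 0) {
  console.error(`Missing required environment variables: ${missing.join(', ')}`);
  process.exit(1);
}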

6. Build Errors

If you see build errors:

# Clean and rebuild
rm -rf node_modules packages/*/node_modules packages/*/dist
pnpm install
pnpm build:storage

7. Worker Not Processing Jobs

  • Verify Redis is running and accessible
  • Check worker logs for errors
  • Ensure the queue name matches (doc:process)
  • Verify MongoDB connection in worker

Getting Help

  • Check the Design Document for architecture details
  • Review package-specific README files (if available)
  • Check Docker logs: docker-compose logs
  • Verify all services are running: docker-compose ps

Current Status - Version 1.0

✅ Completed Features

  • Complete Pipeline: OCR, classification, extraction, and matching processors
  • API Endpoints: Document upload, retrieval, listing, and patient management
  • Metrics Dashboard: Real-time monitoring with WebSocket updates
  • Data Access Layer: Centralized MongoDB schemas and models
  • Queue Processing: Full BullMQ integration with Redis
  • Performance Optimization: Multi-worker scaling with PM2 cluster mode
  • Load Testing & Validation: Comprehensive performance testing with 200+ concurrent users
  • Production-Ready Scalability: System health score of 80/100, handles high-volume loads

🎯 Performance Achievements (v1.0)

  • Queue Optimization: Reduced queue depth from 265 to manageable levels (96% improvement)
  • Worker Scaling: Implemented 4-worker parallel processing architecture
  • CPU Efficiency: Improved from 99.4% to 86.1% CPU usage (better resource distribution)
  • Job Completion: Achieved 99.4% completion rate (497/500 jobs) under load
  • Response Times: Maintained excellent API performance (155ms average, 238ms P95)
  • System Health: Improved from 60/100 (DEGRADED) to 80/100 (HEALTHY)

📊 Detailed performance analysis: See Load Test Analysis

🚧 In Progress

  • ML classification model training
  • Processed artifacts storage
  • Enhanced error handling and retry mechanisms

📋 Planned for Version 2.0

  • Full-Featured Application: Complete web UI and user experience
  • Tesseract OCR for scanned documents
  • Image preprocessing (deskew, grayscale)
  • Advanced ML models
  • Batch processing optimizations
  • Unit and integration tests
  • Enhanced workflows and integrations

License

This project is private and for educational/demonstration purposes.


Ready to get started? Follow the Installation steps above! 🚀