Transform healthcare document chaos into organized, actionable insights — automatically.
Version 1.0 🎉
Healthcare clinics and pharmacies receive hundreds of documents every day. Manually sorting, classifying, and filing them is time-consuming, error-prone, and takes staff away from what matters most: patient care.
CDIP automates this entire process. Upload a PDF, and watch as our intelligent pipeline automatically reads, understands, classifies, and matches documents to the right patients — in seconds, not hours.
We've achieved production-ready performance with comprehensive load testing and optimization:
- ✅ Scalable Architecture - Handles 200+ concurrent users with healthy performance
- ✅ Optimized Queue Processing - Eliminated queue overload through multi-worker scaling
- ✅ Excellent Response Times - Average 155ms API response time, P95 under 250ms
- ✅ 99.4% Job Completion Rate - Reliable processing even under high load
- ✅ System Health Score: 80/100 - Production-ready metrics
📊 See our Load Test Performance Analysis for detailed metrics and optimization journey.
Version 2.0 will transform CDIP into a full-featured application with:
- Modern web UI for document management
- Enhanced user experience and workflows
- Advanced features and integrations
- And much more coming soon!
- About the Project
- Tech Stack
- Prerequisites
- Installation
- Configuration
- Running the Application
- Project Structure
- API Endpoints
- Development
- Troubleshooting
CDIP is an intelligent document processing pipeline designed for healthcare environments. It automatically:
- Reads documents using PDF text extraction (pdfjs-dist) and OCR (planned for scanned documents)
- Classifies document types (prescriptions, lab reports, clinical notes) using rule-based keyword matching (see the sketch after this list)
- Extracts structured data (patient names, health card numbers, medications, provider info, dates, etc.)
- Matches documents to the correct patient records using health card numbers and fuzzy name/DOB matching
- Stores everything in a searchable, organized format
- Monitors system health and performance with a real-time metrics dashboard
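To make the classification and matching steps above concrete, here is a minimal sketch of rule-based keyword classification and health-card/fuzzy patient matching. The keyword lists, confidence formula, and function names are illustrative assumptions, not the project's exact implementation.

```typescript
// Illustrative sketch only: keyword lists, scoring, and matching rules are assumptions.
type DocumentType = 'prescription' | 'lab_report' | 'clinic_note';

const KEYWORDS: Record<DocumentType, string[]> = {
  prescription: ['rx', 'sig', 'refills', 'dispense as written'],
  lab_report: ['specimen', 'reference range', 'result'],
  clinic_note: ['chief complaint', 'assessment', 'plan'],
};

// Score each label by how many of its keywords appear in the extracted text.
export function classify(text: string): { label: DocumentType; confidence: number } {
  const lower = text.toLowerCase();
  let best = { label: 'clinic_note' as DocumentType, confidence: 0 };
  for (const label of Object.keys(KEYWORDS) as DocumentType[]) {
    const words = KEYWORDS[label];
    const hits = words.filter((w) => lower.includes(w)).length;
    const confidence = Math.round((hits / words.length) * 100);
    if (confidence > best.confidence) best = { label, confidence };
  }
  return best;
}

interface PatientRecord { _id: string; fullName: string; dob: string; healthCard: string }

// Prefer an exact health-card match; otherwise fall back to a normalized name + DOB
// comparison (a stand-in for the fuzzier matching described above).
export function matchPatient(
  extracted: { patientName?: string; dob?: string; healthCard?: string },
  patients: PatientRecord[],
): PatientRecord | null {
  if (extracted.healthCard) {
    const exact = patients.find((p) => p.healthCard === extracted.healthCard);
    if (exact) return exact;
  }
  if (extracted.patientName && extracted.dob) {
    const name = extracted.patientName.trim().toLowerCase();
    return (
      patients.find(
        (p) => p.dob.startsWith(extracted.dob!) && p.fullName.trim().toLowerCase() === name,
      ) ?? null
    );
  }
  return null;
}
```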
Built with a production-ready architecture featuring:
- Stateless backend API
- Queue-based processing for scalability (see the enqueue sketch after this list)
- Object storage for files
- Metadata in MongoDB for fast queries
- Real-time metrics dashboard and monitoring
- Complete document processing pipeline (OCR, classification, extraction, matching)
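As a producer-side sketch of the queue-based flow above: after the API stores an upload in MinIO and writes its metadata to MongoDB, it only has to enqueue a job and return. The queue name `doc:process` comes from the troubleshooting notes later in this README; the job payload and options shown here are assumptions.

```typescript
// Sketch: the stateless API hands work to BullMQ instead of processing inline.
// Payload shape and job options are illustrative assumptions.
import { Queue } from 'bullmq';

const processingQueue = new Queue('doc:process', {
  connection: {
    host: process.env.REDIS_HOST ?? 'localhost',
    port: Number(process.env.REDIS_PORT ?? 6379),
  },
});

export async function enqueueDocument(documentId: string, originalObjectKey: string) {
  await processingQueue.add(
    'process-document',
    { documentId, originalObjectKey },
    {
      attempts: 3,                                   // retry transient failures
      backoff: { type: 'exponential', delay: 5000 }, // back off between retries
      removeOnComplete: true,                        // keep Redis tidy under load
    },
  );
}
```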
- NestJS - Modern Node.js framework
- TypeScript - Type-safe development
- MongoDB - Document database
- Mongoose - MongoDB ODM
- BullMQ - Job queue system
- Redis - Queue backend and metrics pub/sub
- pdfjs-dist - PDF text extraction
- Tesseract.js - OCR engine (for scanned documents - planned)
- Sharp - Image processing (planned)
- MinIO - S3-compatible object storage
- pnpm workspaces - Monorepo management
- Docker Compose - MinIO container orchestration
- Socket.IO - WebSocket server for real-time metrics
- Real-time Dashboard - Live metrics visualization
- WebSocket Gateway - Real-time updates
- Metrics API - Programmatic access to system metrics
Before you begin, ensure you have the following installed:
- Node.js v18 or higher (Download)
- pnpm package manager (Installation Guide)
- Docker and Docker Compose (Download Docker Desktop) - For MinIO
- MongoDB Atlas Account (Sign up) - Cloud database
- Redis - Installed locally (Installation Guide)
- PM2 (optional, for production worker management) - Install globally: `npm install -g pm2`
- Git (for cloning the repository)
Verify your installations:

```bash
node --version             # Should be v18+
pnpm --version             # Should be 8.0+
docker --version           # Should be 20.0+
docker-compose --version
```

Clone the repository:

```bash
git clone https://github.com/ankurrokad/document-classifier.git
cd document-classifier
```

Install dependencies:

```bash
pnpm install
```

This will install all dependencies for all packages in the monorepo.
Start MinIO using Docker Compose:
```bash
docker-compose up -d
```

This starts:

- MinIO S3 API on port `9000`
- MinIO Console on port `9001` (UI for managing buckets)

Access the MinIO Console at http://localhost:9001 (login: `minioadmin` / `minioadmin123`)
- Create a MongoDB Atlas account at mongodb.com/cloud/atlas
- Create a new cluster (free tier available)
- Create a database user and get your connection string
- Add your connection string to `.env` (see the Configuration section below)
Install and start Redis locally:
Windows:
- Download from redis.io/download or use WSL
- Or install via Chocolatey: `choco install redis-64`
Mac:
```bash
brew install redis
brew services start redis
```

Linux:

```bash
sudo apt-get install redis-server
sudo systemctl start redis
```

Verify Redis is running:

```bash
redis-cli ping
# Should return: PONG
```

Create a `.env` file in the root directory:

```bash
cp .env.example .env   # If you have an example file
# Or create .env manually
```

Add the following environment variables:
```env
# MongoDB Atlas
MONGO_URI=mongodb+srv://username:password@cluster.mongodb.net/document-classifier?retryWrites=true&w=majority

# Redis (Local)
REDIS_HOST=localhost
REDIS_PORT=6379

# MinIO (Docker)
MINIO_ENDPOINT=localhost
MINIO_PORT=9000
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=minioadmin123
MINIO_BUCKET_ORIGINAL=documents-original
MINIO_SSL=false

# Optional: Synthetic Data Generation
# Note: Synthetic data is now stored locally in the .data/ folder for load testing
SYNTH_DOC_COUNT=200
```

Notes:
- Replace `MONGO_URI` with your actual MongoDB Atlas connection string
- Redis runs locally on the default port `6379`
- MinIO credentials match the default Docker Compose setup
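For reference, the MinIO values above map onto the `minio` client roughly like this. The shared storage library in `packages/libs/storage` wraps this setup; the exact wrapper may differ, so treat this as a sketch.

```typescript
// Sketch: constructing a MinIO client from the .env values above.
import { Client } from 'minio';

export const minioClient = new Client({
  endPoint: process.env.MINIO_ENDPOINT ?? 'localhost',
  port: Number(process.env.MINIO_PORT ?? 9000),
  useSSL: process.env.MINIO_SSL === 'true',
  accessKey: process.env.MINIO_ACCESS_KEY ?? 'minioadmin',
  secretKey: process.env.MINIO_SECRET_KEY ?? 'minioadmin123',
});
```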
The application consists of two main processes that need to run simultaneously:
For development and testing, you can run both processes using pnpm:
In your first terminal:
```bash
pnpm dev:api
```

This will:

- Build the storage library
- Start the NestJS API server
- Run on http://localhost:3000

You should see:

```
API running on http://localhost:3000
```
In a second terminal:
```bash
pnpm dev:worker
```

This will:

- Build the storage library
- Start a single BullMQ worker process
- Connect to Redis and begin processing jobs

You should see:

```
Worker started...
```
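Conceptually, each worker is a BullMQ `Worker` bound to the `doc:process` queue that runs the pipeline stages in order. The sketch below uses placeholder stage functions; the real processors live in `packages/pipeline/src/processors/`.

```typescript
// Simplified consumer-side sketch; the stage functions are placeholders for the
// real processors in packages/pipeline/src/processors/.
import { Worker, Job } from 'bullmq';

const runOcr = async (documentId: string) => `extracted text for ${documentId}`;
const classifyText = async (text: string) => ({ label: 'prescription', confidence: 85 });
const extractFields = async (text: string) => ({ patientName: 'John Doe' });
const matchToPatient = async (documentId: string, fields: { patientName?: string }) => undefined;

const worker = new Worker(
  'doc:process',
  async (job: Job<{ documentId: string }>) => {
    const { documentId } = job.data;
    const text = await runOcr(documentId);        // pdfjs-dist extraction (OCR planned)
    await classifyText(text);                     // rule-based keyword matching
    const fields = await extractFields(text);     // structured data extraction
    await matchToPatient(documentId, fields);     // health card / name + DOB matching
  },
  {
    connection: {
      host: process.env.REDIS_HOST ?? 'localhost',
      port: Number(process.env.REDIS_PORT ?? 6379),
    },
  },
);

worker.on('completed', (job) => console.log(`Job ${job.id} completed`));
worker.on('failed', (job, err) => console.error(`Job ${job?.id} failed: ${err.message}`));
```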
For better performance and throughput, use PM2 to run multiple worker instances and manage both apps:
Prerequisites:
- Install PM2 globally: `npm install -g pm2`
- Build all packages: `pnpm build:all`
Start all apps:
```bash
pm2 start ecosystem.config.js
```

This will:

- Start the API server (1 instance)
- Start 4 worker instances in parallel
- Auto-restart apps on crashes
- Log all output to the `logs/` directory
Start individual apps:
```bash
pm2 start ecosystem.config.js --only doc-api     # Start only API server
pm2 start ecosystem.config.js --only doc-worker  # Start only workers
```

Manage apps:

```bash
pm2 status               # Check status of all apps
pm2 logs                 # View logs from all apps
pm2 logs doc-api         # View API logs only
pm2 logs doc-worker      # View worker logs only
pm2 restart doc-api      # Restart API server
pm2 restart doc-worker   # Restart all workers
pm2 stop doc-api         # Stop API server
pm2 stop doc-worker      # Stop all workers
pm2 stop all             # Stop all apps
pm2 delete all           # Remove all apps from PM2
```

Note: PM2 mode is recommended when processing high volumes of documents or when you need better performance. Development mode is suitable for testing and development. You can also mix approaches - for example, run the API with `pnpm dev:api` and the workers with PM2.
- API: Visit http://localhost:3000 (should respond or show 404 for unknown routes)
- Metrics Dashboard: Visit http://localhost:3000/dashboard (real-time monitoring dashboard)
- MinIO Console: Visit http://localhost:9001 (log in with `minioadmin` / `minioadmin123`)
- MongoDB Atlas: Check your Atlas dashboard to verify the cluster is running
- Redis: Run `redis-cli ping` (should return `PONG`)
```
document-classifier/
├── packages/
│   ├── backend/                  # NestJS API server
│   │   ├── src/
│   │   │   ├── modules/
│   │   │   │   ├── documents/    # Document upload & retrieval
│   │   │   │   ├── patients/     # Patient management
│   │   │   │   └── metrics/      # Metrics dashboard & monitoring
│   │   │   └── schemas/          # MongoDB schemas (legacy, using DAL now)
│   │   └── package.json
│   │
│   ├── pipeline/                 # Worker processes
│   │   ├── src/
│   │   │   ├── processors/       # Pipeline stage processors
│   │   │   │   ├── ocr.processor.ts
│   │   │   │   ├── classify.processor.ts
│   │   │   │   ├── extract.processor.ts
│   │   │   │   └── match.processor.ts
│   │   │   ├── metrics/          # Metrics reporting
│   │   │   └── worker.ts         # BullMQ worker
│   │   └── package.json
│   │
│   ├── libs/
│   │   ├── storage/              # Shared MinIO client
│   │   │   ├── src/
│   │   │   │   └── minio.client.ts
│   │   │   └── package.json
│   │   └── dal/                  # Data Access Layer
│   │       ├── src/
│   │       │   ├── schemas/      # MongoDB schemas
│   │       │   ├── models/       # Data models
│   │       │   └── connection.ts
│   │       └── package.json
│   │
│   └── synth-data/               # Synthetic data generator
│       ├── src/
│       │   ├── templates/        # Document templates
│       │   └── generate.ts
│       └── package.json
│
├── docker-compose.yaml           # Infrastructure services
├── package.json                  # Root package.json (workspace config)
├── pnpm-workspace.yaml           # Workspace configuration
└── .env                          # Environment variables (create this)
```
Upload a PDF document for processing.
```bash
POST /documents
Content-Type: multipart/form-data

# Using curl
curl -X POST http://localhost:3000/documents \
  -F "file=@path/to/document.pdf"

# Response
{
  "documentId": "507f1f77bcf86cd799439011"
}
```
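If you prefer a script to curl, the following TypeScript sketch uploads a PDF and then polls `GET /documents/:id` until processing finishes. It assumes Node 18+ (global `fetch`, `FormData`, `Blob`); the file path is just an example.

```typescript
// Sketch: upload a PDF, then poll the document until the pipeline finishes.
import { readFile } from 'node:fs/promises';

const API = 'http://localhost:3000';

async function uploadAndWait(path: string) {
  const form = new FormData();
  form.append('file', new Blob([await readFile(path)], { type: 'application/pdf' }), 'document.pdf');

  const upload = await fetch(`${API}/documents`, { method: 'POST', body: form });
  const { documentId } = (await upload.json()) as { documentId: string };

  // Poll until the worker marks the document completed or failed.
  for (;;) {
    const res = await fetch(`${API}/documents/${documentId}`);
    const doc = (await res.json()) as { status: string };
    if (doc.status === 'completed' || doc.status === 'failed') return doc;
    await new Promise((resolve) => setTimeout(resolve, 2000));
  }
}

uploadAndWait('path/to/document.pdf').then((doc) => console.log(doc.status));
```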
Retrieve a single document by ID with full details.

```bash
GET /documents/:id

# Using curl
curl http://localhost:3000/documents/507f1f77bcf86cd799439011

# Response
{
  "_id": "507f1f77bcf86cd799439011",
  "status": "completed",
  "originalObjectKey": "documents/original/507f1f77bcf86cd799439011.pdf",
  "classification": {
    "label": "prescription",
    "confidence": 85
  },
  "extractedData": {
    "patientName": "John Doe",
    "healthCard": "1234567890",
    "dob": "1990-01-01T00:00:00.000Z",
    "medications": [...]
  },
  "matchedPatientId": {...},
  "processingLogs": [...]
}
```

List all documents with pagination and filtering.
```bash
GET /documents?limit=50&skip=0&status=completed&type=prescription

# Query Parameters:
# - limit: Number of results (default: 50)
# - skip: Number of results to skip (default: 0)
# - status: Filter by status (uploaded, processing, completed, failed)
# - type: Filter by document type (prescription, lab_report, clinic_note)

# Response
{
  "documents": [...],
  "total": 150,
  "limit": 50,
  "skip": 0
}
```

List all patients with pagination.
```bash
GET /patients?limit=50&skip=0

# Response
{
  "patients": [...],
  "total": 75,
  "limit": 50,
  "skip": 0
}
```

Retrieve a single patient by ID with associated documents.
```bash
GET /patients/:id

# Response
{
  "_id": "507f1f77bcf86cd799439012",
  "firstName": "John",
  "lastName": "Doe",
  "fullName": "John Doe",
  "dob": "1990-01-01T00:00:00.000Z",
  "healthCard": "1234567890",
  "documents": [...]
}
```

Create a new patient record.
```bash
POST /patients
Content-Type: application/json

# Request Body
{
  "firstName": "John",
  "lastName": "Doe",
  "dob": "1990-01-01",
  "healthCard": "1234567890"
}

# Response
{
  "_id": "507f1f77bcf86cd799439012",
  "firstName": "John",
  "lastName": "Doe",
  ...
}
```
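The same request from TypeScript (Node 18+ `fetch`; the field values are the sample data above):

```typescript
// Sketch: create a patient record via POST /patients.
async function createPatient() {
  const res = await fetch('http://localhost:3000/patients', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      firstName: 'John',
      lastName: 'Doe',
      dob: '1990-01-01',
      healthCard: '1234567890',
    }),
  });
  const patient = (await res.json()) as { _id: string };
  console.log(`Created patient ${patient._id}`);
}

createPatient();
```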
Access the real-time metrics dashboard in your browser.

```bash
GET /dashboard
```

Visit http://localhost:3000/dashboard to see:
- Queue metrics (waiting, active, completed jobs)
- Processing metrics (avg time, percentiles, job counts)
- System metrics (memory, CPU, event loop lag)
- API metrics (requests per minute, response times)
- Active worker count
The dashboard uses WebSocket for real-time updates (updates every 2 seconds by default).
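To consume the same live metrics outside the browser, a `socket.io-client` subscriber works as well. The `metrics` event name below is an assumption; check the gateway in `packages/backend` for the actual event names.

```typescript
// Sketch: subscribe to real-time metrics pushed by the WebSocket gateway.
// The 'metrics' event name is an assumption, not confirmed from the source.
import { io } from 'socket.io-client';

const socket = io('http://localhost:3000');

socket.on('connect', () => console.log('Connected to metrics gateway'));
socket.on('metrics', (snapshot: unknown) => {
  // Pushed roughly every METRICS_UPDATE_INTERVAL ms (2000 by default).
  console.log(JSON.stringify(snapshot, null, 2));
});
```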
Get metrics data programmatically.
```bash
GET /api/metrics

# Response
{
  "timestamp": "2024-01-15T10:30:00.000Z",
  "queue": {
    "waiting": 5,
    "active": 2,
    "completed": 150,
    "failed": 3
  },
  "processing": {
    "avgTime": 8500,
    "p50": 8000,
    "p95": 12000,
    "p99": 15000,
    "totalProcessed": 150
  },
  "system": {
    "memory": {...},
    "cpu": {...},
    "eventLoopLag": 2.5
  },
  "api": {
    "requestsPerMinute": 12.5,
    "avgResponseTime": 45,
    "totalRequests": 500
  },
  "workers": {
    "active": 2
  }
}
```
```bash
# Development
pnpm dev:api             # Start API server in watch mode
pnpm dev:worker          # Start worker in watch mode

# Build
pnpm build:dal           # Build DAL library
pnpm build:storage       # Build storage library
pnpm build:synth-data    # Build synthetic data generator
pnpm build:backend       # Build backend (includes dependencies)
pnpm build:pipeline      # Build pipeline (includes dependencies)
pnpm build:all           # Build all packages

# Data Generation
pnpm gen:data            # Generate and upload synthetic documents

# Code Quality
pnpm lint                # Run ESLint
pnpm format              # Format code with Prettier
```

You can also run scripts in specific packages:
```bash
# From root
pnpm --filter backend start:dev
pnpm --filter pipeline start:dev
pnpm --filter @doc-clf/synth-data gen
```
```bash
# Build all packages
pnpm build:backend
pnpm build:pipeline

# Run production builds
node packages/backend/dist/main.js
node packages/pipeline/dist/worker.js
```

To generate test documents for load testing:
```bash
# Generate patients first (if not already done)
pnpm gen:patients 100

# Generate synthetic documents (saves to .data/documents/original/)
pnpm gen:documents
```

This will:
- Generate PDFs (prescriptions, lab reports, clinic notes)
- Save them locally to the `.data/documents/original/` folder at the project root
- Include JSON metadata files
- Note: Files are stored locally (not in MinIO) for optimal load-testing performance

The load test reads from this local `.data` directory, eliminating the MinIO dependency during load tests.
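A minimal sketch of how a load script can consume that local folder (the directory layout comes from the steps above; the rest is illustrative and not the project's actual load-test tool):

```typescript
// Sketch: feed locally generated PDFs from .data/documents/original/ to the upload endpoint.
import { readdir, readFile } from 'node:fs/promises';
import { join } from 'node:path';

const DIR = '.data/documents/original';

async function run() {
  const files = (await readdir(DIR)).filter((name) => name.endsWith('.pdf'));
  for (const name of files) {
    const form = new FormData();
    form.append('file', new Blob([await readFile(join(DIR, name))], { type: 'application/pdf' }), name);
    const res = await fetch('http://localhost:3000/documents', { method: 'POST', body: form });
    console.log(`${name}: ${res.status}`);
  }
}

run();
```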
| Variable | Description | Example |
|---|---|---|
| `MONGO_URI` | MongoDB Atlas connection string | `mongodb+srv://user:pass@cluster.mongodb.net/db` |
| `REDIS_HOST` | Redis hostname (local) | `localhost` |
| `REDIS_PORT` | Redis port (local) | `6379` |
| `MINIO_ENDPOINT` | MinIO server hostname (Docker) | `localhost` |
| `MINIO_PORT` | MinIO server port (Docker) | `9000` |
| `MINIO_ACCESS_KEY` | MinIO access key | `minioadmin` |
| `MINIO_SECRET_KEY` | MinIO secret key | `minioadmin123` |
| `MINIO_BUCKET_ORIGINAL` | Bucket for original documents | `documents-original` |
| `MINIO_SSL` | Use SSL for MinIO | `false` |
| Variable | Description | Default |
|---|---|---|
| `SYNTH_DOC_COUNT` | Number of synthetic documents to generate | `200` |
| `METRICS_HISTORY_SIZE` | Maximum number of metrics records to keep in memory | `1000` |
| `METRICS_UPDATE_INTERVAL` | Dashboard WebSocket update interval (ms) | `2000` |
The storage library automatically creates buckets if they don't exist. However, you can also manage them via the MinIO Console:
- Visit http://localhost:9001
- Log in with `minioadmin` / `minioadmin123`
- Create buckets manually if needed:
  - `documents-original` (for uploaded PDFs)
  - `documents-processed` (for processed artifacts - future)
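The automatic bucket creation mentioned above boils down to something like the following sketch with the `minio` client (the real storage library may organize this differently):

```typescript
// Sketch: ensure a bucket exists before use, roughly what the storage library does on startup.
import { Client } from 'minio';

export async function ensureBucket(client: Client, bucket: string) {
  const exists = await client.bucketExists(bucket);
  if (!exists) {
    // Region is required by the S3 API; any value works for a local MinIO instance.
    await client.makeBucket(bucket, 'us-east-1');
    console.log(`Created bucket: ${bucket}`);
  }
}
```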
- Log in to MongoDB Atlas
- Create a new cluster (free M0 tier is sufficient for development)
- Create a database user:
- Go to Database Access → Add New Database User
- Choose password authentication
- Save the username and password
- Whitelist your IP:
- Go to Network Access → Add IP Address
- Add `0.0.0.0/0` for development (or your specific IP)
- Get your connection string:
- Go to Clusters → Connect → Connect your application
- Copy the connection string
- Replace `<password>` with your database user password
- Add your database name: `?retryWrites=true&w=majority` → `document-classifier?retryWrites=true&w=majority`
- Add the connection string to your `.env` file as `MONGO_URI`
If you see port conflicts:
```bash
# Check what's using the port
# Windows
netstat -ano | findstr :3000

# Mac/Linux
lsof -i :3000

# Stop the conflicting process or change ports in .env
```
```bash
# Verify your connection string in .env
# Format: mongodb+srv://username:password@cluster.mongodb.net/database

# Check if your IP is whitelisted in Atlas
# Go to Network Access in Atlas dashboard

# Test connection string
mongosh "your-connection-string-here"
```
```bash
# Check if Redis is running locally
redis-cli ping
# Should return: PONG

# If not running, start Redis:
# Mac: brew services start redis
# Linux: sudo systemctl start redis
# Windows: Start Redis service or use WSL
```
```bash
# Check if MinIO is running
docker ps | grep minio

# Verify MinIO is accessible
curl http://localhost:9000/minio/health/live
```

The application will exit with a clear error message if required environment variables are missing. Check that your `.env` file is in the root directory and contains all required variables.
If you see build errors:
```bash
# Clean and rebuild
rm -rf node_modules packages/*/node_modules packages/*/dist
pnpm install
pnpm build:storage
```

If the worker is not processing jobs:

- Verify Redis is running and accessible
- Check worker logs for errors
- Ensure the queue name matches (`doc:process`)
- Verify MongoDB connection in worker
- Check the Design Document for architecture details
- Review package-specific README files (if available)
- Check Docker logs: `docker-compose logs`
- Verify all services are running: `docker-compose ps`
- Complete Pipeline: OCR, classification, extraction, and matching processors
- API Endpoints: Document upload, retrieval, listing, and patient management
- Metrics Dashboard: Real-time monitoring with WebSocket updates
- Data Access Layer: Centralized MongoDB schemas and models
- Queue Processing: Full BullMQ integration with Redis
- Performance Optimization: Multi-worker scaling with PM2 cluster mode
- Load Testing & Validation: Comprehensive performance testing with 200+ concurrent users
- Production-Ready Scalability: System health score of 80/100, handles high-volume loads
- Queue Optimization: Reduced queue depth from 265 to manageable levels (96% improvement)
- Worker Scaling: Implemented 4-worker parallel processing architecture
- CPU Efficiency: Improved from 99.4% to 86.1% CPU usage (better resource distribution)
- Job Completion: Achieved 99.4% completion rate (497/500 jobs) under load
- Response Times: Maintained excellent API performance (155ms average, 238ms P95)
- System Health: Improved from 60/100 (DEGRADED) to 80/100 (HEALTHY)
📊 Detailed performance analysis: See Load Test Analysis
- ML classification model training
- Processed artifacts storage
- Enhanced error handling and retry mechanisms
- Full-Featured Application: Complete web UI and user experience
- Tesseract OCR for scanned documents
- Image preprocessing (deskew, grayscale)
- Advanced ML models
- Batch processing optimizations
- Unit and integration tests
- Enhanced workflows and integrations
This project is private and for educational/demonstration purposes.
Ready to get started? Follow the Installation steps above! 🚀