An AI-powered semantic document search platform for learning, research, and real-world experimentation
This project is a hands-on learning tool and experimentation platform that demonstrates the complete journey from data ingestion to query filtering using modern vector search technology. It's designed for:
- Students & College Grads: Learn how semantic search works in practice, not just theory
- Researchers: Test and experiment with your own documents and models
- Developers: Understand the fundamentals of vector databases and embeddings
- Data Scientists: Perform A/B testing on different models and document processing techniques
Think of it as your personal laboratory for exploring how modern search engines actually understand what you're looking for, rather than just matching keywords.
- 🎨 Google-Style Interface: Clean, professional search experience with centered search box
- ⚙️ Fully Configurable: Customize models, chunking, search parameters via UI
- 📁 Document Upload: Drag & drop PDF, Word, and text files
- 🔍 Smart Search: Type-ahead search that understands context (3+ characters, 500ms debounce)
- 🧠 AI-Powered: Supports any HuggingFace sentence-transformer model (default: DistilBERT)
- ⚡ Fast Search: FAISS vector database for lightning-quick similarity search
- 📊 Enhanced Results: View document name, page number, chunk number, and source
- 🔄 Sort & Filter: Sort results by relevance or recent uploads
- 📱 Fully Responsive: Works perfectly on desktop, tablet, and mobile
- 💾 Smart Storage: Timestamp-based file organization (YYYYMMDD/filename_HHMMSS.ext; a sketch of this layout follows the list)
- 🔒 Privacy First: Everything runs locally on your machine
- 📈 Version Tracking: Built-in version display and changelog
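One of the feature bullets above mentions the timestamped storage layout; here is a minimal sketch of how such a `YYYYMMDD/filename_HHMMSS.ext` path can be built. The `build_upload_path` helper is hypothetical; the project's actual logic lives in `app.py`:

```python
from datetime import datetime
from pathlib import Path

def build_upload_path(uploads_dir: str, filename: str) -> Path:
    """Hypothetical helper mirroring the YYYYMMDD/filename_HHMMSS.ext layout."""
    now = datetime.now()
    name = Path(filename)
    folder = Path(uploads_dir) / now.strftime("%Y%m%d")   # date-based folder
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{name.stem}_{now.strftime('%H%M%S')}{name.suffix}"

print(build_upload_path("uploads", "report.pdf"))
# e.g. uploads/20250101/report_142233.pdf
```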
- See the Process: Watch how documents get chunked, embedded, and indexed
- Experiment Freely: Upload your own documents and test different queries
- Model Comparison: Swap between different embedding models with a few clicks
- Real-World Testing: Compare search results with different document types
- A/B Testing Ready: Perfect for testing different chunking strategies or models
- Transparent Architecture: Clean, readable code that's easy to understand and modify
- Configuration Playground: Tune parameters and see immediate effects on search quality
The application follows a simple, modular architecture perfect for learning and experimentation:
- Document Upload: User uploads documents (PDF, DOCX, TXT)
- Text Extraction: System extracts plain text from documents (with page tracking for PDFs)
- Configurable Chunking: Text is split into chunks (default: 500 words with 50-word overlap)
- Embedding Generation: Each chunk is converted to vectors using configured model (default: DistilBERT 768-dim)
- FAISS Indexing: Vectors are stored in FAISS index for fast similarity search
- Search Processing: User queries are vectorized and matched against indexed chunks
- Smart Results: Top results returned with metadata (page, chunk, relevance score)
- Flexible Sorting: Results can be sorted by relevance or recency
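To make the chunking step concrete, here is a minimal word-based chunker using the default 500-word window and 50-word overlap. It is a sketch of the idea, not the exact code in `document_processor.py`:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows (UI defaults: 500 words, 50 overlap).
    The overlap keeps sentences that straddle a boundary present in both chunks."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

chunks = chunk_text("word " * 1200)        # 1200 words -> windows at 0, 450, 900
print([len(c.split()) for c in chunks])    # [500, 500, 300]
```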
The system is fully configurable through the UI:
- Model Selection: Any HuggingFace sentence-transformer model
- Chunk Size: 100-2000 words (default: 500)
- Overlap: 0-500 words (default: 50)
- Result Count: 1-20 results (default: 5)
- Top K: 5-50 candidates (default: 10)
- Dimension: Match your model's output (default: 768)
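Under the defaults above (`distilbert-base-uncased`, 768 dimensions), the embedding and indexing steps of the pipeline boil down to a few lines. Mean pooling in the sketch below is an assumption (the experiments section later treats pooling as a tunable choice), and the exact `IndexFlatL2` index stands in for whatever index type the project actually builds:

```python
import faiss                      # faiss-cpu
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID, DIM = "distilbert-base-uncased", 768   # DIM must match the Dimension setting

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def embed(texts: list[str]) -> np.ndarray:
    """Mean-pool the last hidden state into one 768-dim vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state     # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)     # masked mean over tokens
    return pooled.numpy().astype("float32")           # FAISS expects float32

index = faiss.IndexFlatL2(DIM)                        # exact L2 similarity search
index.add(embed(["first chunk of text", "second chunk of text"]))
dists, ids = index.search(embed(["my query"]), 2)     # nearest chunks for the query
print(ids[0], dists[0])
```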
```
context-search-engine/
│
├── app.py                   # Flask web application & API endpoints
├── document_processor.py    # Document processing, embedding & indexing
├── config.py                # Configuration management system
├── search_engine.py         # Legacy (can be removed)
├── create_index.py          # Legacy (can be removed)
│
├── templates/
│   └── index.html           # Main UI (Google-style search interface)
│
├── static/
│   ├── css/
│   │   └── style.css        # Modern minimalist styling
│   └── images/              # UI assets & screenshots
│
├── uploads/                 # Document storage (auto-organized)
│   └── YYYYMMDD/            # Date-based folders
│       └── filename_HHMMSS.ext  # Timestamped files
│
├── faiss_index.idx          # FAISS vector index (generated)
├── index_to_chunk.pkl       # Chunk-to-text mapping (generated)
├── document_metadata.pkl    # Document metadata (generated)
├── app_config.json          # User configuration (generated)
│
├── requirements.txt         # Python dependencies
├── .gitignore               # Git ignore rules
├── CHANGELOG.md             # Version history & release notes
├── LICENSE                  # MIT License
└── README.md                # This file
```
- Python: 3.8 or higher
- pip: Python package manager
- RAM: 4GB minimum (8GB recommended for larger documents)
- Internet: First run only (downloads embedding model ~250MB)
- Disk Space: ~500MB for models + your documents
1. Clone the repository

   ```bash
   git clone https://github.com/inboxpraveen/context-search-engine.git
   cd context-search-engine
   ```

2. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

   This installs:

   - `Flask 3.1.2` - Web framework
   - `transformers 4.57.3` - HuggingFace transformers for embeddings
   - `torch` - PyTorch for model inference
   - `faiss-cpu 1.13.0` - Vector similarity search
   - `PyPDF2 3.0.1` - PDF text extraction
   - `python-docx 1.2.0` - Word document processing
   - `numpy 2.3.5` - Numerical operations
   - Other supporting libraries

3. Run the application

   ```bash
   python app.py
   ```

4. Open your browser

   Navigate to: http://localhost:5000
- You'll see a Google-style search interface
- If no documents exist, you'll see "Please upload some documents to start searching"
- Click "Upload Documents" button (top right)
- Drag & drop your files or click to browse
- Upload PDF, Word (.docx), or text (.txt) files
- Wait for indexing to complete (progress shown)
- Start searching with the main search box!
- Type at least 3 characters to trigger search
- Search happens automatically as you type (500ms delay)
- Press Enter for immediate search
- Results show:
- Rank (#1, #2, etc.)
- Document Name
- Page Number (for PDFs)
- Chunk Number
- Matched Text
- Relevance Score
- "View Source" button
Use the dropdown to sort by:
- Relevance: Best semantic matches first (default)
- Recent: Newest documents first
- Click the "Documents" stat card to view all documents
- Table shows: Sr. No, Name, Type, Pages, Upload Date, Actions
- View: See full document content
- Delete: Remove document (with confirmation)
Click "Configure" button (top right) to customize:
- Model: Enter any HuggingFace repo ID
  - Examples: `distilbert-base-uncased`, `sentence-transformers/all-MiniLM-L6-v2`
  - Must be compatible with sentence-transformers
- Chunk Size: Words per chunk (100-2000)
  - Smaller = more precise, more chunks
  - Larger = more context, fewer chunks
- Overlap: Words of overlap between chunks (0-500)
  - Prevents information loss at chunk boundaries
- Search Results: Number of results to display (1-20)
- Top K: Size of candidate pool (5-50)
  - Higher = better quality, slower search
- Dimension: Model output dimension
  - Must match your chosen model's embedding size
Note: Changing model or chunking requires rebuilding the index (automatic prompt).
- Upload a textbook chapter and search for concepts
- See how similar concepts are grouped together
- Compare semantic vs keyword matching
- Try different phrasings of the same query
- Observe how context affects results
- Experiment with synonyms and related terms
- Upload multiple documents on the same topic
- See how the system ranks relevance
- Learn about precision and recall
- Compare different chunk sizes (300 vs 500 vs 1000 words)
- Experiment with overlap percentages
- Analyze impact on retrieval quality
Test different embedding models:
- `distilbert-base-uncased` (fast, good quality)
- `sentence-transformers/all-MiniLM-L6-v2` (balanced)
- `sentence-transformers/all-mpnet-base-v2` (high quality)
- `bert-base-uncased` (standard BERT)
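A quick way to A/B models like these is to push a fixed query set through each one and compare dimensions and wall-clock time. The sketch below repeats a mean-pooling helper so it runs standalone; note that the first run of each model also includes its one-off download:

```python
import time

import torch
from transformers import AutoModel, AutoTokenizer

QUERIES = ["what is gradient descent", "risk factors for diabetes"]

def mean_embed(model_id: str, texts: list[str]):
    tok = AutoTokenizer.from_pretrained(model_id)
    mod = AutoModel.from_pretrained(model_id)
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = mod(**batch).last_hidden_state       # (B, T, D)
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

for model_id in ["distilbert-base-uncased",
                 "sentence-transformers/all-MiniLM-L6-v2",
                 "sentence-transformers/all-mpnet-base-v2"]:
    start = time.perf_counter()
    vecs = mean_embed(model_id, QUERIES)
    print(f"{model_id}: {vecs.shape[1]}-dim in {time.perf_counter() - start:.1f}s")
```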
- Upload domain-specific documents (medical, legal, technical)
- Test if general models work or if fine-tuning is needed
- Measure accuracy on domain queries
- Use as microservice in larger applications
- Test API endpoints programmatically
- Integrate with existing document workflows
- Test with different document volumes (10, 100, 1000+ docs)
- Measure indexing time vs document size
- Optimize chunk size for your use case
- Modify the frontend for specific use cases
- Add custom features (filters, tags, etc.)
- Create domain-specific variants
1. Model Comparison
   - Test: DistilBERT vs BERT vs Sentence-BERT
   - Measure: Query response time, relevance scores
   - Goal: Find the best speed/accuracy tradeoff

2. Chunk Size Impact
   - Test: 300 vs 500 vs 1000 word chunks
   - Measure: Result precision, context coverage
   - Goal: Optimal chunk size for your documents

3. Overlap Effects
   - Test: 0% vs 10% vs 20% overlap
   - Measure: Information continuity, duplicate results
   - Goal: Best overlap to prevent context loss

4. Pooling Methods
   - Modify the code to test: Mean vs Max vs CLS pooling (see the sketch below)
   - Measure: Semantic understanding quality
   - Goal: Best pooling for your document type
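For the pooling experiment, the change amounts to swapping one function. Here is a sketch of the three strategies over a transformer's `last_hidden_state`; wiring it into `document_processor.py` is left as the exercise:

```python
import torch

def pool(hidden: torch.Tensor, mask: torch.Tensor, method: str = "mean") -> torch.Tensor:
    """hidden: (B, T, D) last_hidden_state; mask: (B, T) attention mask."""
    m = mask.unsqueeze(-1).float()                    # (B, T, 1)
    if method == "mean":                              # average over real tokens
        return (hidden * m).sum(1) / m.sum(1)
    if method == "max":                               # per-dimension max, padding ignored
        return hidden.masked_fill(m == 0, float("-inf")).max(1).values
    if method == "cls":                               # first ([CLS]) token only
        return hidden[:, 0]
    raise ValueError(f"unknown pooling method: {method}")
```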
To use a custom model:
- Click "Configure" → Enter HuggingFace repo ID
- Set correct dimension for your model
- Save configuration
- Rebuild index when prompted
Popular Models:
- `sentence-transformers/all-MiniLM-L6-v2` (384-dim, fast)
- `sentence-transformers/all-mpnet-base-v2` (768-dim, accurate)
- `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` (384-dim, multilingual)
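To avoid a dimension mismatch, you can read the embedding width straight from a model's config before entering it in the UI. A small sketch:

```python
from transformers import AutoConfig

for repo_id in ["sentence-transformers/all-MiniLM-L6-v2",
                "sentence-transformers/all-mpnet-base-v2",
                "distilbert-base-uncased"]:
    cfg = AutoConfig.from_pretrained(repo_id)
    print(f"{repo_id}: set Dimension = {cfg.hidden_size}")   # e.g. 384, 768, 768
```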
The application exposes these endpoints:
- `GET /` - Main UI
- `POST /search` - Search documents
- `POST /upload` - Upload files
- `GET /documents` - List all documents
- `GET /documents/<id>` - Get document content
- `DELETE /documents/<id>` - Delete document
- `GET /config` - Get configuration
- `POST /config` - Update configuration
- `POST /rebuild-index` - Rebuild index
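The request and response shapes aren't documented in this README, so the field names below (`files`, `query`, `results`) are assumptions to verify against `app.py`. A sketch using the third-party `requests` package:

```python
import requests

BASE = "http://localhost:5000"

# Upload a document (multipart field name "files" is an assumption; check app.py).
with open("notes.pdf", "rb") as f:
    resp = requests.post(f"{BASE}/upload", files={"files": f})
print(resp.status_code)

# Search (JSON key "query" and response key "results" are assumptions as well).
resp = requests.post(f"{BASE}/search", json={"query": "vector similarity"})
for hit in resp.json().get("results", []):
    print(hit)
```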
Settings are stored in `app_config.json`:

```json
{
  "model_repo_id": "distilbert-base-uncased",
  "chunk_size": 500,
  "chunk_overlap": 50,
  "num_search_results": 5,
  "top_k": 10,
  "dimension": 768
}
```

You can edit this file directly or use the UI.
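If you script against this file, a defensive pattern is to merge the saved values over the defaults so missing keys don't break anything. The `load_config` helper below is hypothetical; the project's real logic lives in `config.py`:

```python
import json

DEFAULTS = {
    "model_repo_id": "distilbert-base-uncased",
    "chunk_size": 500,
    "chunk_overlap": 50,
    "num_search_results": 5,
    "top_k": 10,
    "dimension": 768,
}

def load_config(path: str = "app_config.json") -> dict:
    """Hypothetical loader: saved values win, defaults fill the gaps."""
    try:
        with open(path) as f:
            return {**DEFAULTS, **json.load(f)}
    except FileNotFoundError:
        return dict(DEFAULTS)

print(load_config())
```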
- Multi-User Support: User authentication and personal document spaces
- Document Folders: Organize documents into categories
- Search History: Track and revisit previous searches
- Export Results: Download search results as CSV/JSON
- Query Analytics: Track popular queries and patterns
- Batch Upload: Upload entire folders at once
- Advanced Filters: Filter by date range, document type, custom tags
- Text Highlighting: Highlight matched text within results
- Query Suggestions: Autocomplete and related queries
- Document Preview: Quick preview without full view
- RESTful API: Full API with authentication
- Docker Support: Container images and deployment guides
- Cloud Templates: Deploy to AWS, GCP, Azure
- Analytics Dashboard: Visualize search patterns and document stats
- Multilingual Support: UI and search in multiple languages
- OCR Integration: Extract text from scanned PDFs and images
- GPU Support: FAISS GPU for faster indexing and search
- Incremental Indexing: Add documents without full rebuild
- Advanced Chunking: Smart chunking based on document structure
- Query Expansion: Automatic query reformulation
- Result Caching: Cache frequent queries
- Hybrid Search: Combine semantic and keyword search
See CHANGELOG.md for detailed version history.
We welcome contributions from everyone! Whether you're fixing a bug, adding a feature, improving documentation, or sharing ideas.
📖 Read our Contributing Guide for detailed information on how to contribute.
1. Fork the repository
2. Create a feature branch

   ```bash
   git checkout -b feature/your-feature-name
   ```

3. Make your changes
   - Write clean, readable code
   - Add comments for complex logic
   - Follow existing code style
   - Update documentation if needed
4. Test thoroughly
   - Test with different document types
   - Check responsive design on mobile
   - Verify search accuracy
   - Test edge cases
5. Submit a Pull Request
   - Describe what you changed and why
   - Reference any related issues
   - Include screenshots for UI changes
   - Update CHANGELOG.md
- 🐛 Bug Fixes: Found a bug? Fix it!
- ✨ New Features: Implement features from the roadmap
- 📚 Documentation: Improve README, add tutorials, create guides
- 🎨 UI/UX: Enhance interface, add themes, improve accessibility
- 🧪 Testing: Add unit tests, integration tests, performance tests
- 🌍 Localization: Translate UI to other languages
- 📊 Examples: Create Jupyter notebooks, video tutorials, blog posts
- 🔧 Tools: Build deployment scripts, Docker configs, CI/CD pipelines
- Python: Follow PEP 8
- JavaScript: Use ES6+ features, prefer const/let over var
- HTML: Semantic HTML5, proper indentation
- CSS: BEM naming convention preferred
- Comments: Explain why, not what
- Commits: Clear, descriptive commit messages
Found a bug or have a suggestion? Open an issue with:
- Clear title describing the issue
- Detailed description with context
- Steps to reproduce (for bugs)
- Expected vs actual behavior
- Screenshots if applicable
- Environment details (OS, Python version, etc.)
Have an idea? We'd love to hear it!
- Check existing issues first to avoid duplicates
- Explain the use case and benefit
- Provide examples or mockups if possible
- Tag with "enhancement" label
Want to dive deeper into the concepts?
This project is licensed under the MIT License - see the LICENSE file for details.
- ✅ Use commercially
- ✅ Modify and adapt
- ✅ Distribute
- ✅ Use privately

The only condition is that you include the original copyright notice and license text in copies or substantial portions of the software.
This project stands on the shoulders of giants:
- Hugging Face: For the amazing Transformers library and model hub
- Facebook AI: For the FAISS vector search library
- Flask Team: For the excellent web framework
- PyPDF2: For PDF processing capabilities
- python-docx: For Word document support
- Open Source Community: For all the supporting libraries and tools
Special thanks to all contributors and users who provide feedback and help improve this project!
- Issues: GitHub Issues
- Email: inboxpraveen.17@gmail.com
- Watch this repo for updates
- Star ⭐ if you find it useful
- Fork to experiment and contribute
- Share with others who might benefit
If this project helped you learn something new or solved a problem, please consider giving it a star!
Version 2.0.0 | Made with ❤️ for learners, researchers, and developers exploring the world of semantic search.
View Changelog | Contributing Guide | Report Bug | Request Feature