An AI-powered semantic document search platform for learning, research, and real-world experimentation
This project is a hands-on learning tool and experimentation platform that demonstrates the complete journey from data ingestion to query filtering using modern vector search technology. It's designed for:
- Students & College Grads: Learn how semantic search works in practice, not just theory
- Researchers: Test and experiment with your own documents and models
- Developers: Understand the fundamentals of vector databases and embeddings
- Data Scientists: Perform A/B testing on different models and document processing techniques
Think of it as your personal laboratory for exploring how modern search engines actually understand what you're looking for, rather than just matching keywords.
- 🎨 Google-Style Interface: Clean, professional search experience with centered search box
- ⚙️ Fully Configurable: Customize models, chunking, search parameters via UI
- 📁 Document Upload: Drag & drop PDF, Word, and text files
- 🔍 Smart Search: Type-ahead search that understands context (3+ characters, 500ms debounce)
- 🧠 AI-Powered: Supports any HuggingFace sentence-transformer model (default: DistilBERT)
- ⚡ Fast Search: FAISS vector database for lightning-quick similarity search
- 📊 Enhanced Results: View document name, page number, chunk number, and source
- 🔄 Sort & Filter: Sort results by relevance or recent uploads
- 📱 Fully Responsive: Works perfectly on desktop, tablet, and mobile
- 💾 Smart Storage: Timestamp-based file organization (YYYYMMDD/filename_HHMMSS.ext; a sketch of this layout follows the list)
- 🔒 Privacy First: Everything runs locally on your machine
- 📈 Version Tracking: Built-in version display and changelog
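One of the feature bullets above mentions the timestamped storage layout; here is a minimal sketch of how such a `YYYYMMDD/filename_HHMMSS.ext` path can be built. The `build_upload_path` helper is hypothetical; the project's actual logic lives in `app.py`:

```python
from datetime import datetime
from pathlib import Path

def build_upload_path(uploads_dir: str, filename: str) -> Path:
    """Hypothetical helper mirroring the YYYYMMDD/filename_HHMMSS.ext layout."""
    now = datetime.now()
    name = Path(filename)
    folder = Path(uploads_dir) / now.strftime("%Y%m%d")   # date-based folder
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{name.stem}_{now.strftime('%H%M%S')}{name.suffix}"

print(build_upload_path("uploads", "report.pdf"))
# e.g. uploads/20250101/report_142233.pdf
```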
- See the Process: Watch how documents get chunked, embedded, and indexed
- Experiment Freely: Upload your own documents and test different queries
- Model Comparison: Swap between different embedding models with a few clicks
- Real-World Testing: Compare search results with different document types
- A/B Testing Ready: Perfect for testing different chunking strategies or models
- Transparent Architecture: Clean, readable code that's easy to understand and modify
- Configuration Playground: Tune parameters and see immediate effects on search quality
The application follows a simple, modular architecture perfect for learning and experimentation:
- Document Upload: User uploads documents (PDF, DOCX, TXT)
- Text Extraction: System extracts plain text from documents (with page tracking for PDFs)
- Configurable Chunking: Text is split into chunks (default: 500 words with 50-word overlap)
- Embedding Generation: Each chunk is converted to vectors using configured model (default: DistilBERT 768-dim)
- FAISS Indexing: Vectors are stored in FAISS index for fast similarity search
- Search Processing: User queries are vectorized and matched against indexed chunks
- Smart Results: Top results returned with metadata (page, chunk, relevance score)
- Flexible Sorting: Results can be sorted by relevance or recency
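To make the chunking step concrete, here is a minimal word-based chunker using the default 500-word window and 50-word overlap. It is a sketch of the idea, not the exact code in `document_processor.py`:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows (UI defaults: 500 words, 50 overlap).
    The overlap keeps sentences that straddle a boundary present in both chunks."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

chunks = chunk_text("word " * 1200)        # 1200 words -> windows at 0, 450, 900
print([len(c.split()) for c in chunks])    # [500, 500, 300]
```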
The system is fully configurable through the UI:
- Model Selection: Any HuggingFace sentence-transformer model
- Chunk Size: 100-2000 words (default: 500)
- Overlap: 0-500 words (default: 50)
- Result Count: 1-20 results (default: 5)
- Top K: 5-50 candidates (default: 10)
- Dimension: Match your model's output (default: 768)
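Under the defaults above (`distilbert-base-uncased`, 768 dimensions), the embedding and indexing steps of the pipeline boil down to a few lines. Mean pooling in the sketch below is an assumption (the experiments section later treats pooling as a tunable choice), and the exact `IndexFlatL2` index stands in for whatever index type the project actually builds:

```python
import faiss                      # faiss-cpu
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID, DIM = "distilbert-base-uncased", 768   # DIM must match the Dimension setting

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def embed(texts: list[str]) -> np.ndarray:
    """Mean-pool the last hidden state into one 768-dim vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state     # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)     # masked mean over tokens
    return pooled.numpy().astype("float32")           # FAISS expects float32

index = faiss.IndexFlatL2(DIM)                        # exact L2 similarity search
index.add(embed(["first chunk of text", "second chunk of text"]))
dists, ids = index.search(embed(["my query"]), 2)     # nearest chunks for the query
print(ids[0], dists[0])
```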
```
context-search-engine/
│
├── app.py                   # Flask web application & API endpoints
├── document_processor.py    # Document processing, embedding & indexing
├── config.py                # Configuration management system
├── search_engine.py         # Legacy (can be removed)
├── create_index.py          # Legacy (can be removed)
│
├── templates/
│   └── index.html           # Main UI (Google-style search interface)
│
├── static/
│   ├── css/
│   │   └── style.css        # Modern minimalist styling
│   └── images/              # UI assets & screenshots
│
├── uploads/                 # Document storage (auto-organized)
│   └── YYYYMMDD/            # Date-based folders
│       └── filename_HHMMSS.ext  # Timestamped files
│
├── faiss_index.idx          # FAISS vector index (generated)
├── index_to_chunk.pkl       # Chunk-to-text mapping (generated)
├── document_metadata.pkl    # Document metadata (generated)
├── app_config.json          # User configuration (generated)
│
├── requirements.txt         # Python dependencies
├── .gitignore               # Git ignore rules
├── CHANGELOG.md             # Version history & release notes
├── LICENSE                  # MIT License
└── README.md                # This file
```
- Python: 3.8 or higher
- pip: Python package manager
- RAM: 4GB minimum (8GB recommended for larger documents)
- Internet: First run only (downloads embedding model ~250MB)
- Disk Space: ~500MB for models + your documents
1. Clone the repository

   ```bash
   git clone https://github.com/inboxpraveen/context-search-engine.git
   cd context-search-engine
   ```

2. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

   This installs:

   - `Flask 3.1.2` - Web framework
   - `transformers 4.57.3` - HuggingFace transformers for embeddings
   - `torch` - PyTorch for model inference
   - `faiss-cpu 1.13.0` - Vector similarity search
   - `PyPDF2 3.0.1` - PDF text extraction
   - `python-docx 1.2.0` - Word document processing
   - `numpy 2.3.5` - Numerical operations
   - Other supporting libraries

3. Run the application

   ```bash
   python app.py
   ```

4. Open your browser

   Navigate to: http://localhost:5000
- You'll see a Google-style search interface
- If no documents exist, you'll see "Please upload some documents to start searching"
- Click "Upload Documents" button (top right)
- Drag & drop your files or click to browse
- Upload PDF, Word (.docx), or text (.txt) files
- Wait for indexing to complete (progress shown)
- Start searching with the main search box!
- Type at least 3 characters to trigger search
- Search happens automatically as you type (500ms delay)
- Press Enter for immediate search
- Results show:
- Rank (#1, #2, etc.)
- Document Name
- Page Number (for PDFs)
- Chunk Number
- Matched Text
- Relevance Score
- "View Source" button
Use the dropdown to sort by:
- Relevance: Best semantic matches first (default)
- Recent: Newest documents first
- Click the "Documents" stat card to view all documents
- Table shows: Sr. No, Name, Type, Pages, Upload Date, Actions
- View: See full document content
- Delete: Remove document (with confirmation)
Click "Configure" button (top right) to customize:
- Model: Enter any HuggingFace repo ID
  - Examples: `distilbert-base-uncased`, `sentence-transformers/all-MiniLM-L6-v2`
  - Must be compatible with sentence-transformers
- Chunk Size: Words per chunk (100-2000)
  - Smaller = more precise, more chunks
  - Larger = more context, fewer chunks
- Overlap: Words of overlap between chunks (0-500)
  - Prevents information loss at chunk boundaries
- Search Results: Number of results to display (1-20)
- Top K: Size of candidate pool (5-50)
  - Higher = better quality, slower search
- Dimension: Model output dimension
  - Must match your chosen model's embedding size
Note: Changing model or chunking requires rebuilding the index (automatic prompt).
- Upload a textbook chapter and search for concepts
- See how similar concepts are grouped together
- Compare semantic vs keyword matching
- Try different phrasings of the same query
- Observe how context affects results
- Experiment with synonyms and related terms
- Upload multiple documents on the same topic
- See how the system ranks relevance
- Learn about precision and recall
- Compare different chunk sizes (300 vs 500 vs 1000 words)
- Experiment with overlap percentages
- Analyze impact on retrieval quality
Test different embedding models:
- `distilbert-base-uncased` (fast, good quality)
- `sentence-transformers/all-MiniLM-L6-v2` (balanced)
- `sentence-transformers/all-mpnet-base-v2` (high quality)
- `bert-base-uncased` (standard BERT)
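A quick way to A/B models like these is to push a fixed query set through each one and compare dimensions and wall-clock time. The sketch below repeats a mean-pooling helper so it runs standalone; note that the first run of each model also includes its one-off download:

```python
import time

import torch
from transformers import AutoModel, AutoTokenizer

QUERIES = ["what is gradient descent", "risk factors for diabetes"]

def mean_embed(model_id: str, texts: list[str]):
    tok = AutoTokenizer.from_pretrained(model_id)
    mod = AutoModel.from_pretrained(model_id)
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = mod(**batch).last_hidden_state       # (B, T, D)
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

for model_id in ["distilbert-base-uncased",
                 "sentence-transformers/all-MiniLM-L6-v2",
                 "sentence-transformers/all-mpnet-base-v2"]:
    start = time.perf_counter()
    vecs = mean_embed(model_id, QUERIES)
    print(f"{model_id}: {vecs.shape[1]}-dim in {time.perf_counter() - start:.1f}s")
```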
- Upload domain-specific documents (medical, legal, technical)
- Test if general models work or if fine-tuning is needed
- Measure accuracy on domain queries
- Use as microservice in larger applications
- Test API endpoints programmatically
- Integrate with existing document workflows
- Test with different document volumes (10, 100, 1000+ docs)
- Measure indexing time vs document size
- Optimize chunk size for your use case
- Modify the frontend for specific use cases
- Add custom features (filters, tags, etc.)
- Create domain-specific variants
1. Model Comparison
   - Test: DistilBERT vs BERT vs Sentence-BERT
   - Measure: Query response time, relevance scores
   - Goal: Find the best speed/accuracy tradeoff

2. Chunk Size Impact
   - Test: 300 vs 500 vs 1000 word chunks
   - Measure: Result precision, context coverage
   - Goal: Optimal chunk size for your documents

3. Overlap Effects
   - Test: 0% vs 10% vs 20% overlap
   - Measure: Information continuity, duplicate results
   - Goal: Best overlap to prevent context loss

4. Pooling Methods
   - Modify the code to test: Mean vs Max vs CLS pooling (see the sketch below)
   - Measure: Semantic understanding quality
   - Goal: Best pooling for your document type
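For the pooling experiment, the change amounts to swapping one function. Here is a sketch of the three strategies over a transformer's `last_hidden_state`; wiring it into `document_processor.py` is left as the exercise:

```python
import torch

def pool(hidden: torch.Tensor, mask: torch.Tensor, method: str = "mean") -> torch.Tensor:
    """hidden: (B, T, D) last_hidden_state; mask: (B, T) attention mask."""
    m = mask.unsqueeze(-1).float()                    # (B, T, 1)
    if method == "mean":                              # average over real tokens
        return (hidden * m).sum(1) / m.sum(1)
    if method == "max":                               # per-dimension max, padding ignored
        return hidden.masked_fill(m == 0, float("-inf")).max(1).values
    if method == "cls":                               # first ([CLS]) token only
        return hidden[:, 0]
    raise ValueError(f"unknown pooling method: {method}")
```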
To use a custom model:
- Click "Configure" → Enter HuggingFace repo ID
- Set correct dimension for your model
- Save configuration
- Rebuild index when prompted
Popular Models:
- `sentence-transformers/all-MiniLM-L6-v2` (384-dim, fast)
- `sentence-transformers/all-mpnet-base-v2` (768-dim, accurate)
- `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` (384-dim, multilingual)
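To avoid a dimension mismatch, you can read the embedding width straight from a model's config before entering it in the UI. A small sketch:

```python
from transformers import AutoConfig

for repo_id in ["sentence-transformers/all-MiniLM-L6-v2",
                "sentence-transformers/all-mpnet-base-v2",
                "distilbert-base-uncased"]:
    cfg = AutoConfig.from_pretrained(repo_id)
    print(f"{repo_id}: set Dimension = {cfg.hidden_size}")   # e.g. 384, 768, 768
```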
The application exposes these endpoints:
- `GET /` - Main UI
- `POST /search` - Search documents
- `POST /upload` - Upload files
- `GET /documents` - List all documents
- `GET /documents/<id>` - Get document content
- `DELETE /documents/<id>` - Delete document
- `GET /config` - Get configuration
- `POST /config` - Update configuration
- `POST /rebuild-index` - Rebuild index
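The request and response shapes aren't documented in this README, so the field names below (`files`, `query`, `results`) are assumptions to verify against `app.py`. A sketch using the third-party `requests` package:

```python
import requests

BASE = "http://localhost:5000"

# Upload a document (multipart field name "files" is an assumption; check app.py).
with open("notes.pdf", "rb") as f:
    resp = requests.post(f"{BASE}/upload", files={"files": f})
print(resp.status_code)

# Search (JSON key "query" and response key "results" are assumptions as well).
resp = requests.post(f"{BASE}/search", json={"query": "vector similarity"})
for hit in resp.json().get("results", []):
    print(hit)
```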
Settings are stored in `app_config.json`:

```json
{
  "model_repo_id": "distilbert-base-uncased",
  "chunk_size": 500,
  "chunk_overlap": 50,
  "num_search_results": 5,
  "top_k": 10,
  "dimension": 768
}
```

You can edit this file directly or use the UI.
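If you script against this file, a defensive pattern is to merge the saved values over the defaults so missing keys don't break anything. The `load_config` helper below is hypothetical; the project's real logic lives in `config.py`:

```python
import json

DEFAULTS = {
    "model_repo_id": "distilbert-base-uncased",
    "chunk_size": 500,
    "chunk_overlap": 50,
    "num_search_results": 5,
    "top_k": 10,
    "dimension": 768,
}

def load_config(path: str = "app_config.json") -> dict:
    """Hypothetical loader: saved values win, defaults fill the gaps."""
    try:
        with open(path) as f:
            return {**DEFAULTS, **json.load(f)}
    except FileNotFoundError:
        return dict(DEFAULTS)

print(load_config())
```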
- Multi-User Support: User authentication and personal document spaces
- Document Folders: Organize documents into categories
- Search History: Track and revisit previous searches
- Export Results: Download search results as CSV/JSON
- Query Analytics: Track popular queries and patterns
- Batch Upload: Upload entire folders at once
- Advanced Filters: Filter by date range, document type, custom tags
- Text Highlighting: Highlight matched text within results
- Query Suggestions: Autocomplete and related queries
- Document Preview: Quick preview without full view
- RESTful API: Full API with authentication
- Docker Support: Container images and deployment guides
- Cloud Templates: Deploy to AWS, GCP, Azure
- Analytics Dashboard: Visualize search patterns and document stats
- Multilingual Support: UI and search in multiple languages
- OCR Integration: Extract text from scanned PDFs and images
- GPU Support: FAISS GPU for faster indexing and search
- Incremental Indexing: Add documents without full rebuild
- Advanced Chunking: Smart chunking based on document structure
- Query Expansion: Automatic query reformulation
- Result Caching: Cache frequent queries
- Hybrid Search: Combine semantic and keyword search
See CHANGELOG.md for detailed version history.
We welcome contributions from everyone! Whether you're fixing a bug, adding a feature, improving documentation, or sharing ideas.
📖 Read our Contributing Guide for detailed information on how to contribute.
1. Fork the repository
2. Create a feature branch

   ```bash
   git checkout -b feature/your-feature-name
   ```

3. Make your changes
   - Write clean, readable code
   - Add comments for complex logic
   - Follow existing code style
   - Update documentation if needed
4. Test thoroughly
   - Test with different document types
   - Check responsive design on mobile
   - Verify search accuracy
   - Test edge cases
5. Submit a Pull Request
   - Describe what you changed and why
   - Reference any related issues
   - Include screenshots for UI changes
   - Update CHANGELOG.md
- 🐛 Bug Fixes: Found a bug? Fix it!
- ✨ New Features: Implement features from the roadmap
- 📚 Documentation: Improve README, add tutorials, create guides
- 🎨 UI/UX: Enhance interface, add themes, improve accessibility
- 🧪 Testing: Add unit tests, integration tests, performance tests
- 🌍 Localization: Translate UI to other languages
- 📊 Examples: Create Jupyter notebooks, video tutorials, blog posts
- 🔧 Tools: Build deployment scripts, Docker configs, CI/CD pipelines
- Python: Follow PEP 8
- JavaScript: Use ES6+ features, prefer const/let over var
- HTML: Semantic HTML5, proper indentation
- CSS: BEM naming convention preferred
- Comments: Explain why, not what
- Commits: Clear, descriptive commit messages
Found a bug or have a suggestion? Open an issue with:
- Clear title describing the issue
- Detailed description with context
- Steps to reproduce (for bugs)
- Expected vs actual behavior
- Screenshots if applicable
- Environment details (OS, Python version, etc.)
Have an idea? We'd love to hear it!
- Check existing issues first to avoid duplicates
- Explain the use case and benefit
- Provide examples or mockups if possible
- Tag with "enhancement" label
Want to dive deeper into the concepts?
This project is licensed under the MIT License - see the LICENSE file for details.
- ✅ Use commercially
- ✅ Modify and adapt
- ✅ Distribute
- ✅ Use privately

The only condition is that you include the original copyright notice and license text in copies or substantial portions of the software.
This project stands on the shoulders of giants:
- Hugging Face: For the amazing Transformers library and model hub
- Facebook AI: For the FAISS vector search library
- Flask Team: For the excellent web framework
- PyPDF2: For PDF processing capabilities
- python-docx: For Word document support
- Open Source Community: For all the supporting libraries and tools
Special thanks to all contributors and users who provide feedback and help improve this project!
- Issues: GitHub Issues
- Email: inboxpraveen.17@gmail.com
- Watch this repo for updates
- Star ⭐ if you find it useful
- Fork to experiment and contribute
- Share with others who might benefit
If this project helped you learn something new or solved a problem, please consider giving it a star!
Version 2.0.0 | Made with ❤️ for learners, researchers, and developers exploring the world of semantic search.
View Changelog | Contributing Guide | Report Bug | Request Feature