akshayjava/forceps

FORCEPS — Forensic Optimized Retrieval & Clustering of Evidence via Perceptual Search

Overview

FORCEPS is a forensic image similarity and natural-language search tool designed for massive-scale, high-speed, and private on-site investigations. It is built on a scalable, distributed architecture to handle terabyte-sized datasets.

Key Features

  • Distributed Architecture: Uses a Redis queue to distribute work to multiple parallel workers for extreme scalability.
  • Case Management: Organizes analysis into distinct cases, with all output stored in case-specific directories.
  • Forensic Hashing: Employs a pluggable framework to compute multiple cryptographic (SHA-256) and perceptual (pHash, aHash, dHash) hashes for each piece of evidence.
  • Hybrid Search: Combines semantic vector search (FAISS) with traditional keyword search (Whoosh) for highly accurate and relevant results.
  • Automatic Clustering: Leverages machine learning to automatically group search results into visual themes, helping investigators quickly identify patterns.
  • Optimized Models: Supports GPU-accelerated ONNX models for high-speed embedding computation.
  • Comprehensive Reporting: Generates detailed PDF and CSV reports for bookmarked evidence, including all hashes, metadata, and user-provided notes.
  • All processing is local: No data leaves your environment.
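The difference between cryptographic and perceptual hashes in the Forensic Hashing feature can be illustrated with a minimal sketch. This is not the FORCEPS hashing framework: the function names are hypothetical, and the "images" are small grayscale grids rather than decoded files, so only the standard library is needed. It assumes the common definitions of aHash (pixel above the mean) and dHash (pixel brighter than its right-hand neighbour).

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Cryptographic hash: changes completely if any byte changes."""
    return hashlib.sha256(data).hexdigest()

def average_hash(pixels):
    """aHash: one bit per pixel, set when the pixel is above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def difference_hash(pixels):
    """dHash: one bit per horizontally adjacent pair, set when left > right."""
    return [1 if row[i] > row[i + 1] else 0
            for row in pixels for i in range(len(row) - 1)]

def hamming(a, b):
    """Number of differing bits; small distances suggest near-duplicates."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical 4x4 grayscale thumbnail (values 0-255).
original = [[10, 20, 200, 210],
            [15, 25, 205, 215],
            [12, 22, 198, 208],
            [11, 21, 202, 212]]
# Same picture, uniformly brightened by 5: the bytes differ, the structure does not.
brighter = [[p + 5 for p in row] for row in original]

def to_bytes(pixels):
    return bytes(p for row in pixels for p in row)

same_sha = sha256_hex(to_bytes(original)) == sha256_hex(to_bytes(brighter))
ahash_distance = hamming(average_hash(original), average_hash(brighter))
dhash_distance = hamming(difference_hash(original), difference_hash(brighter))
```

Here the SHA-256 digests differ while both perceptual distances are zero, which is why an exact-match hash and a similarity hash are typically recorded side by side for each piece of evidence.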

User Interface Mock-up

+----------------------------------------------------------------------------------------------------------------------+
| [FORCEPS] Forensic Optimized Retrieval & Clustering of Evidence via Perceptual Search                                |
+----------------------------------------------------------------------------------------------------------------------+
| [ Sidebar ]                                        | [ Search & Results ]  [ Reporting ]                               |
|                                                    |-------------------------------------------------------------------+
| [FORCEPS Indexing Controls]                        | [Search]                                                          |
|  Folder to index: [ /path/to/images      ]         |  Natural language query: [ a dog near a red car          ]         |
|  [ Start Indexing Job ]                            |  [ Run Search ]                                                   |
|                                                    |                                                                   |
| [Backend Controls]                                 | [Top results]                                                     |
|  Redis Host: [ localhost ]                         |  +----------------+  +----------------+  +----------------+       |
|  Redis Port: [ 6379      ]                         |  | [Image]        |  | [Image]        |  | [Image]        |       |
|                                                    |  | img_001.jpg    |  | img_002.jpg    |  | ...            |       |
| [Load Case for Searching]                          |  | [Show Details]v|  | [Show Details]v|  | [Show Details]v|       |
|  Cases Directory: [ output_index ]                 |  | [Manage Bkmk]v |  | [Manage Bkmk]v |  | [Manage Bkmk]v |       |
|  Select Case: [ case-001                v]         |  +----------------+  +----------------+  +----------------+       |
|  [ Load Selected Case ]                            |                                                                   |
|                                                    | ---                                                               |
|                                                    | [Cluster Analysis]                                                |
|                                                    |  Number of Clusters: [ 2 ----o--------------- 50 ] (10)           |
|                                                    |  [ Cluster Displayed Results ]                                    |
+----------------------------------------------------------------------------------------------------------------------+
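The Cluster Analysis control in the mock-up takes a number of clusters and groups the displayed results. This README does not say which algorithm is used, so the following is only a sketch assuming plain k-means over embedding vectors. The function name and the toy 2-D "embeddings" are hypothetical.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; centroids start at the first k points for determinism."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy stand-ins for image embeddings: two visually distinct groups.
embeddings = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.2]]
labels = kmeans(embeddings, k=2)
```

Each label is the cluster index for the embedding at the same position, so results can be grouped by label for display.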

How It Works: Architecture

The system is composed of several key components that work together:

  1. Redis: A message broker that manages the queue of images to be processed.
  2. Enqueuer (enqueue_jobs.py): A script that scans a directory, computes forensic hashes, and populates the Redis job queue.
  3. Worker (engine.py): The core processing engine. You can run many workers in parallel. Each worker computes embeddings for images and pushes the results to a results queue.
  4. Index Builder (build_index.py): A script that consumes results, aggregates all embeddings, and builds the final search indexes (FAISS for vectors, Whoosh for text).
  5. User Interface (main.py): A Streamlit application for controlling the backend and analyzing results.
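The flow between the enqueuer, workers, and index builder can be sketched in a few lines. This is an in-memory illustration only: two `deque` objects stand in for the Redis job and results queues, and every name here (`job_queue`, `fake_embed`, and so on) is hypothetical rather than taken from the FORCEPS code.

```python
from collections import deque

# Stand-ins for two Redis lists; a real deployment would push and pop over the network.
job_queue = deque()
results_queue = deque()

def enqueue(paths):
    """Enqueuer: push one job per image path."""
    for p in paths:
        job_queue.append({"path": p})

def fake_embed(path):
    """Placeholder for the embedding model: a tiny deterministic vector."""
    return [len(path), sum(map(ord, path)) % 97]

def worker(name):
    """Worker: pop jobs until the queue is empty and push results."""
    while job_queue:
        job = job_queue.popleft()
        results_queue.append({"path": job["path"],
                              "vector": fake_embed(job["path"]),
                              "worker": name})

def build_index():
    """Index builder: drain the results queue into parallel path and vector lists."""
    paths, vectors = [], []
    while results_queue:
        result = results_queue.popleft()
        paths.append(result["path"])
        vectors.append(result["vector"])
    return paths, vectors

enqueue(["a.jpg", "b.png", "c.jpg"])
worker("w1")
paths, vectors = build_index()
```

Because workers only touch the two queues, adding more of them does not require any change to the enqueuer or the index builder.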

Installation & Setup

Option 1: Running with Docker Compose (Recommended for Production)

The easiest way to run the entire FORCEPS stack is with docker-compose.

1. Prerequisites

  • Install Docker and Docker Compose.
  • NVIDIA GPU Users: Ensure you have the NVIDIA Container Toolkit installed.
  • Prepare your data: Create host directories for your images (e.g., ./images) and for the output (e.g., ./output_index). The docker-compose.yml file maps these to the correct paths inside the containers.
  • (Optional) Convert models to ONNX: For maximum performance, run python app/convert_models.py --output_dir models/onnx.

2. Launch the Application

From the root of the project directory, run:

docker-compose up --build

To run multiple workers for faster processing, use the --scale flag:

docker-compose up --build --scale worker=4

Option 2: Command Line Setup (for Development/Testing)

1. Prerequisites

  • Python 3.9+
  • Redis Server installed and running
  • Git for version control
  • Required libraries (see requirements.txt)

2. Clone the Repository

git clone https://github.com/akshayjava/forceps.git
cd forceps

3. Create a Virtual Environment

python -m venv venv_forceps
source venv_forceps/bin/activate  # On Windows: venv_forceps\Scripts\activate

4. Install Dependencies

pip install -r requirements.txt

5. Setup Redis

Ensure Redis is running on localhost:6379 (default):

# On macOS with Homebrew
brew install redis
brew services start redis

# On Ubuntu/Debian
sudo apt install redis-server
sudo systemctl start redis-server

# Verify Redis is running
redis-cli ping  # Should return PONG
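If `redis-cli` is not available, the same check can be done from Python with only the standard library. This is a sketch, not part of FORCEPS: `redis_ping` is a hypothetical helper that sends Redis's inline `PING` command over a plain socket and assumes an unauthenticated server. A server that requires a password would answer with an error and the function would return `False`.

```python
import socket

def redis_ping(host="localhost", port=6379, timeout=2.0):
    """Return True if a server at host:port answers PING with +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"PING\r\n")  # inline command form accepted by Redis
            return sock.recv(64).startswith(b"+PONG")
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False

if __name__ == "__main__":
    print("Redis reachable:", redis_ping())
```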

Command-Line End-to-End Workflow

This section describes the recommended workflow for indexing and querying images entirely from the command line on a single machine. This is the simplest and fastest way to get started with FORCEPS.

Step 1: Indexing Images (run_cli.py)

The run_cli.py script is a standalone tool for quickly creating a searchable index from a directory of images. It computes image embeddings, builds a FAISS index, and saves metadata.

Usage:

# Ensure you are in the project's root directory
# Make sure required packages are installed:
# pip install torch torchvision transformers faiss-cpu pillow tqdm opencv-python

PYTHONPATH=. python3 run_cli.py \
  --image_dir "/path/to/your/images" \
  --output_dir "index_output" \
  --device auto \
  --batch_size 16

Optional: Generate Text Captions

If you have ollama installed with the llava model, you can automatically generate descriptive captions for each image. This is required for text-based search.

# Add the --captions flag to the command above
PYTHONPATH=. python3 run_cli.py \
  --image_dir "/path/to/your/images" \
  --output_dir "index_output" \
  --captions

This command will create the index_output directory (or the name you specified) containing the following files:

  • image_index.faiss: The FAISS vector index for similarity search.
  • image_paths.pkl: A list of the indexed image paths.
  • metadata.pkl: Index metadata.
  • exif.json: EXIF data extracted from images, useful for filtering.
  • captions.tsv (if --captions is used): A tab-separated file of image paths and their text descriptions.
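A quick way to sanity-check an index directory is to load the path list and the captions and see which images have no caption. The sketch below assumes that `image_paths.pkl` holds a plain Python list and that `captions.tsv` has two columns (path, caption) with no header row. The README does not spell out either layout, and the helper names are hypothetical.

```python
import csv
import os
import pickle

def load_index_listing(outdir):
    """Return (paths, captions) from an index directory.

    paths    -> list stored in image_paths.pkl
    captions -> {image_path: caption} from captions.tsv, or {} if absent
    """
    with open(os.path.join(outdir, "image_paths.pkl"), "rb") as f:
        paths = pickle.load(f)  # only unpickle files you produced yourself

    captions = {}
    tsv_path = os.path.join(outdir, "captions.tsv")
    if os.path.isfile(tsv_path):  # written only when --captions was used
        with open(tsv_path, newline="", encoding="utf-8") as f:
            # QUOTE_NONE keeps quote characters inside captions untouched.
            for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
                if len(row) >= 2:
                    captions[row[0]] = row[1]
    return paths, captions

def missing_captions(paths, captions):
    """Indexed paths that have no caption row."""
    return [p for p in paths if p not in captions]
```

For example, `paths, captions = load_index_listing("index_output")` followed by `missing_captions(paths, captions)` lists images that a text search would never find.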

For performance tuning options (e.g., using a GPU), see the Optimizing the Command-Line Indexer section in OPTIMIZATION_GUIDE.md.

Step 2: Querying by Image

Once the index is built, you can find the top 10 most visually similar images to a query image using the following command.

Usage: Replace index_output with your output directory and /path/to/query.jpg with the path to your query image.

python3 - "/path/to/query.jpg" <<'PY'
import os, pickle, sys
import faiss, numpy as np, torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel
# --- Configuration ---
outdir = "index_output"  # <-- Set your output directory here
query_img = sys.argv[1]  # <-- Query image path, passed after "python3 -"
model_name = "google/vit-base-patch16-224-in21k"
# --- Script ---
if not os.path.isdir(outdir) or not os.path.isfile(query_img):
    print(f"Error: Ensure output directory '{outdir}' exists and query image '{query_img}' is valid.")
    sys.exit(1)
# Load the indexed paths and the FAISS index
with open(os.path.join(outdir, "image_paths.pkl"), "rb") as f:
    paths = pickle.load(f)
index = faiss.read_index(os.path.join(outdir, "image_index.faiss"))
# Embed the query image using the ViT [CLS] token
proc = ViTImageProcessor.from_pretrained(model_name)
model = ViTModel.from_pretrained(model_name)
model.eval()
im = Image.open(query_img).convert("RGB")
inp = proc(images=im, return_tensors="pt")
with torch.no_grad():
    feat = model(**inp).last_hidden_state[:, 0].cpu().float().numpy()
# L2-normalize the query vector before searching
feat /= np.linalg.norm(feat, axis=1, keepdims=True) + 1e-8
D, I = index.search(feat.astype("float32"), 10)
print(f"Top 10 most similar images to '{query_img}':")
for r, (d, idx) in enumerate(zip(D[0], I[0]), 1):
    if idx == -1:  # FAISS returns -1 when fewer than 10 results exist
        continue
    print(f"{r:02d}\t{paths[idx]}")
PY
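The script above L2-normalizes the query embedding before searching. For unit-length vectors the squared L2 distance equals 2 minus twice the inner product, so an inner-product index and an L2 index rank neighbours identically. The README does not state which metric `image_index.faiss` uses. The sketch below is a brute-force stand-in for what the search computes, assuming stored vectors are normalized the same way. The helper names and toy vectors are hypothetical.

```python
import math

def l2_normalize(v, eps=1e-8):
    """Scale v to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

def top_k(query, vectors, k=10):
    """Brute-force nearest neighbours by inner product of normalized vectors.

    Returns (score, index) pairs, best first.
    """
    q = l2_normalize(query)
    scored = [(sum(a * b for a, b in zip(q, l2_normalize(v))), i)
              for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy 2-D "embeddings": index 0 and 2 point roughly the same way as the query.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
ranked = [i for _, i in top_k([1.0, 0.02], vectors, k=3)]
```

FAISS does the same ranking far more efficiently, but the toy version makes it easy to see why normalization matters: without it, long vectors would dominate the inner product regardless of direction.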

Step 3: Querying by Text

If you generated captions during indexing (Step 1), you can perform a text-based search using grep. This command searches the captions.tsv file for a specific phrase and returns the paths of the matching images.

Usage: Replace index_output with your output directory and "your phrase here" with your search query.

grep -i "your phrase here" index_output/captions.tsv | cut -f1 | head -n 10
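On systems without `grep` and `cut`, a rough Python equivalent is short. This sketch assumes the same layout the command relies on (image path in the first tab-separated column). `search_captions` is a hypothetical helper. Like the `grep` command, it matches anywhere on the line, including the path column. Unlike `grep`, it treats the phrase as literal text rather than a regular expression.

```python
def search_captions(tsv_path, phrase, limit=10):
    """Case-insensitive substring match per line; returns first-column paths."""
    needle = phrase.lower()
    hits = []
    with open(tsv_path, encoding="utf-8") as f:
        for line in f:
            if needle in line.lower():
                hits.append(line.rstrip("\n").split("\t")[0])
                if len(hits) == limit:  # same cap as `head -n 10`
                    break
    return hits
```

For example, `search_captions("index_output/captions.tsv", "red car")` returns up to ten matching image paths in file order.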

Using the Web Application (for Distributed Processing)

This section describes the original, more advanced workflow that uses a web interface to manage a distributed processing backend. This is suitable for processing massive datasets across multiple machines.

1. Configure Your Case

Edit the configuration file (app/config.yaml or app/config_optimized.yaml) to specify your input directory, case name, and Redis connection details.

2. Start the Backend Infrastructure

This workflow requires a Redis server to be running.

# Start Redis (if not already running)
redis-server

3. Start the Web UI and Workers

You will need multiple terminals.

Terminal 1: Start the Web Application

# Ensure PYTHONPATH includes the project root
PYTHONPATH=. streamlit run app/main.py

This launches the UI at http://localhost:8501.

Terminal 2 (and others): Start Background Workers

For optimal performance, start multiple worker processes to process images in parallel.

PYTHONPATH=. python app/optimized_worker.py --config app/config_optimized.yaml

4. Enqueue Images for Processing

Use the enqueue_jobs.py script to scan a directory and add jobs to the Redis queue for the workers to process.

PYTHONPATH=. python app/enqueue_jobs.py --config app/config_optimized.yaml --input_dir /path/to/evidence
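The directory scan performed before enqueuing can be pictured with a small sketch. It is not the code in `enqueue_jobs.py`: the extension set, the function names, and the idea of grouping paths into fixed-size batches are all assumptions made for illustration.

```python
import os

# Assumed set of image extensions; the real script may accept others.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".tif", ".tiff", ".webp"}

def find_images(root):
    """Yield image paths under root in a stable, sorted order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # makes the traversal order deterministic
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in IMAGE_EXTS:
                yield os.path.join(dirpath, name)

def batched(items, size):
    """Group an iterable into lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Each batch produced by `batched(find_images("/path/to/evidence"), 16)` could then become one job on the queue. A stable scan order also makes repeated runs over the same evidence directory easier to compare.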

5. Monitor and Analyze in the UI

Use the web interface to:

  • Monitor the progress of the indexing job.
  • Load the completed case for analysis.
  • Search, filter, cluster, and bookmark results.
  • Generate reports.

Performance Optimization Tips

  • Use the optimized configuration: The app/config_optimized.yaml file contains settings optimized for high-performance processing.
  • Run multiple workers: Start 4-8 worker processes for optimal performance on a typical workstation.
  • Adjust batch sizes: For machines with limited memory, decrease batch_size in the config. For powerful machines, increase it.
  • Enable GPU acceleration: Set use_gpu: true in the config file when running on a CUDA-capable GPU.
  • Tune Redis: For very large datasets, consider adjusting Redis configuration for higher memory limits.
  • Disk I/O: For best performance, place input images and output directory on fast SSDs.
