forked from t-redactyl/ocr-llm-agent
Quick Start Tutorial
Udit Asopa edited this page Oct 16, 2025
Get up and running with Vision Text Extractor in under 5 minutes! This tutorial assumes you've completed the Installation Guide.
cd vision-text-extractor
pixi run test-setup
You should see: All dependencies imported successfully!
# Use the built-in sample image
pixi run demo-ocr-huggingface
Expected output:
Processing image: images/chocolate_cake_recipe.png
Using prompt: Please transcribe the provided image.
Model provider: huggingface
Model: HuggingFaceTB/SmolVLM-Instruct
Loading SmolVLM vision model...
Text extraction completed!
==================================================
Chocolate Cake Recipe
Ingredients:
- 2 cups all-purpose flour
- 2 cups sugar
- 3/4 cup cocoa powder
...
==================================================
Congratulations! You just extracted text from an image using AI!
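If the output is later redirected to a file, the status lines and separators shown above may end up in it as well. Below is a minimal sketch for keeping only the transcribed text. It assumes the separators are lines made up solely of `=` characters and that everything is written to standard output; neither is confirmed here. `extract_body` is a hypothetical helper name, not part of the project.

```bash
# Hypothetical helper: print only the lines between the first pair of
# separator lines. Assumes a separator is a line consisting solely of "=".
extract_body() {
    awk '
        /^=+$/ { seen++; next }   # count separators, never print them
        seen == 1 { print }       # lines inside the first separator pair
    '
}

# Possible usage (assumes main.py prints the banner to standard output):
# python main.py image.jpg | extract_body > text.txt
```

A transcription that itself contains a line made up only of `=` characters would be cut short at that line, so treat this as a convenience filter and not a robust parser.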
# Basic extraction
python main.py path/to/your/image.jpg
# With custom prompt
python main.py receipt.jpg --prompt "Extract the total amount and date"
# Different file types
python main.py document.pdf
python main.py screenshot.png

# Extract from web image
python main.py "https://example.com/menu.jpg"
# With custom prompt
python main.py "https://example.com/receipt.jpg" \
--prompt "List all items and prices"

# Setup (one-time)
pixi run setup-ollama
# Use Ollama
python main.py image.jpg --provider ollama --model llava:7b

# Requires API key in .env file
python main.py image.jpg --provider openai --model gpt-4o

# Extract contract details
python main.py contract.pdf \
--prompt "Extract party names, dates, and key terms"
# Process invoice
python main.py invoice.jpg \
--prompt "Extract invoice number, total amount, and due date"

# Get recipe ingredients
python main.py recipe-photo.jpg \
--prompt "List all ingredients with quantities"
# Extract menu prices
python main.py menu.png \
--prompt "Extract menu items and their prices"

# Process receipt
python main.py receipt.jpg \
--prompt "Extract store name, items, prices, and total"
# Bank statement
python main.py statement.pdf \
--prompt "List all transactions with dates and amounts"

# Quick demos (no arguments needed)
pixi run demo-ocr-huggingface
pixi run demo-ocr-ollama
pixi run demo-ocr-openai
# Flexible tasks (your image as argument)
pixi run ocr_llm "my-image.jpg"
pixi run ocr_ollama "my-document.pdf"

# Process multiple files
for img in *.jpg; do
python main.py "$img" --prompt "Extract key information"
done

# Redirect output to file
python main.py document.jpg > extracted_text.txt
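The batch loop and the output redirection can be combined so that each image gets its own text file. The sketch below is one way to do that, not the project's own tooling: `output_name` and `extract_all` are hypothetical helper names, and the prompt is simply the one from the loop above.

```bash
# Hypothetical helper: map an input path to its output file name,
# e.g. scans/receipt.jpg -> scans/receipt.txt
output_name() {
    printf '%s.txt\n' "${1%.*}"
}

# Hypothetical helper: run the extractor over every .jpg in the current
# directory and write each result next to its image.
extract_all() {
    for img in *.jpg; do
        # With no matching files the pattern stays literal, so skip it.
        [ -e "$img" ] || continue
        out=$(output_name "$img")
        if python main.py "$img" --prompt "Extract key information" > "$out"; then
            echo "wrote $out"
        else
            # Do not leave an empty or partial file behind on failure.
            rm -f "$out"
            echo "failed: $img" >&2
        fi
    done
}
```

Calling `extract_all` in a directory of photos would then produce `photo1.txt`, `photo2.txt`, and so on, with failures reported on standard error.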
# Or use with timestamp
python main.py receipt.jpg > "receipt_$(date +%Y%m%d_%H%M%S).txt"

# Extract only phone numbers
python main.py business-card.jpg \
--prompt "Extract only phone numbers from this business card"
# Get nutritional information
python main.py nutrition-label.jpg \
--prompt "Extract calories, protein, carbs, and fat content"
# Focus on dates and amounts
python main.py invoice.pdf \
--prompt "Extract all dates and monetary amounts"

# Request JSON format
python main.py receipt.jpg \
--prompt "Extract receipt data as JSON with fields: store, date, items, total"
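A prompt can ask for JSON, but nothing guarantees the model returns it cleanly, so it may be worth checking the result before passing it on. The sketch below uses Python's standard `json.tool` module for that check. `is_valid_json` is a hypothetical helper name, and `python3` is assumed to be on the `PATH`.

```bash
# Hypothetical helper: succeed only if standard input is valid JSON.
# Uses the json.tool module from the Python standard library.
is_valid_json() {
    python3 -m json.tool > /dev/null 2>&1
}

# Possible usage (assumes main.py writes only the model's answer to the file):
# python main.py receipt.jpg --prompt "Extract receipt data as JSON ..." > receipt.json
# is_valid_json < receipt.json || echo "not valid JSON, inspect receipt.json" >&2
```

If the output includes status lines or surrounding prose, the check fails, which is the cue to inspect the file by hand or tighten the prompt.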
# Table format
python main.py price-list.jpg \
--prompt "Extract as a table with columns: item, description, price"

- SmolVLM (Hugging Face): Best for privacy, decent accuracy
- LLaVA (Ollama): Good alternative, different strengths
- GPT-4o (OpenAI): Highest accuracy, but costs money
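The trade-offs above can be turned into a small lookup so the provider flags do not need retyping. This is a sketch only: `provider_flags` and its priority names are hypothetical, the model names are the ones used elsewhere in this tutorial, and passing `--provider huggingface` explicitly is an assumption based on the provider name shown in the demo output.

```bash
# Hypothetical helper: print the CLI flags for a given priority.
provider_flags() {
    case "$1" in
        privacy)     echo "--provider huggingface --model HuggingFaceTB/SmolVLM-Instruct" ;;
        alternative) echo "--provider ollama --model llava:7b" ;;
        accuracy)    echo "--provider openai --model gpt-4o" ;;
        *)           echo "unknown priority: $1" >&2; return 1 ;;
    esac
}

# Possible usage; the substitution is left unquoted on purpose so the
# flags are split into separate arguments:
# python main.py image.jpg $(provider_flags accuracy)
```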
# For handwriting
python main.py handwritten.jpg \
--prompt "Carefully transcribe this handwritten text"
# For low-quality images
python main.py blurry-scan.jpg \
--prompt "Extract text even if image quality is poor"
# For multilingual content
python main.py multilingual.jpg \
--prompt "Extract text and identify the languages used"

# Re-download SmolVLM if corrupted
rm -rf ~/.cache/huggingface/hub/models--HuggingFaceTB--SmolVLM-Instruct
pixi run setup-smolvlm

# Use Ollama instead of SmolVLM for lower memory usage
pixi run setup-ollama
python main.py image.jpg --provider ollama

# Check if OpenAI key is set
pixi run check-env
# Reset environment file
pixi run setup-env

Now that you're comfortable with the basics:
- Deep Dive: Read Basic Usage for more details
- Specific Tutorials: Try Document Processing
- Advanced Features: Explore Advanced Features
- Configuration: Learn about Configuration options
You now know how to:
- Extract text from any image or document
- Use different AI providers
- Customize prompts for specific needs
- Handle common use cases
Happy text extracting!
Need help? Check Troubleshooting or ask in GitHub Issues.