Quick Start Tutorial

Udit Asopa edited this page Oct 16, 2025 · 1 revision

Get up and running with Vision Text Extractor in under 5 minutes! This tutorial assumes you've completed the Installation Guide.

🎯 Your First Text Extraction

Step 1: Verify Installation

cd vision-text-extractor
pixi run test-setup

You should see: 🎉 All dependencies imported successfully!

Step 2: Try the Demo

# Use the built-in sample image
pixi run demo-ocr-huggingface

Expected Output:

🔍 Processing image: images/chocolate_cake_recipe.png
💬 Using prompt: Please transcribe the provided image.
🤖 Model provider: huggingface
🧠 Model: HuggingFaceTB/SmolVLM-Instruct
🔄 Loading SmolVLM vision model...

🎉 Text extraction completed!
==================================================
Chocolate Cake Recipe

Ingredients:
- 2 cups all-purpose flour
- 2 cups sugar
- 3/4 cup cocoa powder
...
==================================================

🎉 Congratulations! You just extracted text from an image using AI!

📸 Try Your Own Images

Local Image Files

# Basic extraction
python main.py path/to/your/image.jpg

# With custom prompt
python main.py receipt.jpg --prompt "Extract the total amount and date"

# Different file types
python main.py document.pdf
python main.py screenshot.png

Online Images (URLs)

# Extract from web image
python main.py "https://example.com/menu.jpg"

# With custom prompt
python main.py "https://example.com/receipt.jpg" \
  --prompt "List all items and prices"

🤖 Try Different AI Models

Ollama LLaVA (Alternative Local Model)

# Setup (one-time)
pixi run setup-ollama

# Use Ollama
python main.py image.jpg --provider ollama --model llava:7b

OpenAI GPT-4o (Highest Accuracy)

# Requires API key in .env file
python main.py image.jpg --provider openai --model gpt-4o
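
To see how the providers differ on your own material, it can help to run one image through each of them and compare the results. The sketch below assumes the `--provider` and `--model` flags behave as in the commands above. `compare_models` and the `EXTRACT_CMD` override are hypothetical names introduced only for this example, and the OpenAI run still needs a valid API key.

```shell
# EXTRACT_CMD is a hypothetical override hook for this sketch; by default it
# runs the same command used throughout this tutorial.
EXTRACT_CMD="${EXTRACT_CMD:-python main.py}"

compare_models() {
  # $1: image path, $2: directory for the per-provider output files
  img="$1"; out_dir="$2"
  mkdir -p "$out_dir" || return 1
  # provider:model pairs taken from the commands in this section
  for pair in "ollama:llava:7b" "openai:gpt-4o"; do
    provider="${pair%%:*}"   # text before the first colon
    model="${pair#*:}"       # text after the first colon
    $EXTRACT_CMD "$img" --provider "$provider" --model "$model" \
      > "$out_dir/$provider.txt" || echo "failed: $provider" >&2
  done
}

# Usage (hypothetical paths):
#   compare_models image.jpg results
#   diff results/ollama.txt results/openai.txt
```

Writing each provider's output to its own file keeps the runs separate, so `diff` can show where the transcriptions disagree.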

🎯 Common Use Cases

📄 Business Documents

# Extract contract details
python main.py contract.pdf \
  --prompt "Extract party names, dates, and key terms"

# Process invoice
python main.py invoice.jpg \
  --prompt "Extract invoice number, total amount, and due date"

🍽️ Food & Recipes

# Get recipe ingredients
python main.py recipe-photo.jpg \
  --prompt "List all ingredients with quantities"

# Extract menu prices
python main.py menu.png \
  --prompt "Extract menu items and their prices"

💰 Financial Documents

# Process receipt
python main.py receipt.jpg \
  --prompt "Extract store name, items, prices, and total"

# Bank statement
python main.py statement.pdf \
  --prompt "List all transactions with dates and amounts"

🚀 Power User Tips

Pixi Task Shortcuts

# Quick demos (no arguments needed)
pixi run demo-ocr-huggingface
pixi run demo-ocr-ollama
pixi run demo-ocr-openai

# Flexible tasks (your image as argument)
pixi run ocr_llm "my-image.jpg"
pixi run ocr_ollama "my-document.pdf"

Batch Processing Multiple Images

# Process multiple files
for img in *.jpg; do
  python main.py "$img" --prompt "Extract key information"
done
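
The loop above prints every result to the terminal and stops matching at `.jpg`. A slightly more defensive variant might write one text file per image, cover several extensions, and report failures. This is only a sketch: `batch_extract` and the `EXTRACT_CMD` override are hypothetical names, and it assumes `main.py` exits with a non-zero status when extraction fails.

```shell
# EXTRACT_CMD is a hypothetical override hook for this sketch; by default it
# runs the same command used throughout this tutorial.
EXTRACT_CMD="${EXTRACT_CMD:-python main.py}"

batch_extract() {
  # $1: input directory, $2: output directory, $3: prompt
  in_dir="$1"; out_dir="$2"; prompt="$3"
  mkdir -p "$out_dir" || return 1
  failed=0
  for img in "$in_dir"/*.jpg "$in_dir"/*.jpeg "$in_dir"/*.png; do
    [ -e "$img" ] || continue            # skip patterns that matched nothing
    name="$(basename "$img")"
    # Keep the full file name so a.jpg and a.png do not overwrite each other
    if $EXTRACT_CMD "$img" --prompt "$prompt" > "$out_dir/$name.txt"; then
      echo "ok: $img"
    else
      echo "failed: $img" >&2
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]                    # non-zero status if anything failed
}

# Usage (hypothetical directories):
#   batch_extract scans extracted "Extract key information"
```

A failed run can leave an empty or partial `.txt` file behind, so check the messages on standard error before trusting the output directory.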

Save Output to File

# Redirect output to file
python main.py document.jpg > extracted_text.txt

# Or use with timestamp
python main.py receipt.jpg > "receipt_$(date +%Y%m%d_%H%M%S).txt"
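
Redirecting with `>` captures everything the tool writes to standard output. If the status lines and `=====` separators from the demo output also land in your file (an assumption; they may go to standard error instead), a small filter can keep only the text between the separators. `extract_body` is a hypothetical helper name for this sketch.

```shell
# Print only the lines between the first pair of separator lines.
# Assumes a separator is a line made up solely of "=" characters, as in the
# sample demo output; adjust the pattern if your output differs.
extract_body() {
  awk '
    /^=+$/    { seen++; next }   # count separator lines, never print them
    seen == 1 { print }          # print only what sits between the 1st and 2nd
  '
}

# Usage (hypothetical file names):
#   python main.py receipt.jpg | extract_body > receipt.txt
```

The filter stops at the second separator, so extracted text that itself contains a line of only `=` characters would be cut short.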

🔧 Customizing Your Prompts

Specific Information Extraction

# Extract only phone numbers
python main.py business-card.jpg \
  --prompt "Extract only phone numbers from this business card"

# Get nutritional information
python main.py nutrition-label.jpg \
  --prompt "Extract calories, protein, carbs, and fat content"

# Focus on dates and amounts
python main.py invoice.pdf \
  --prompt "Extract all dates and monetary amounts"

Structured Output

# Request JSON format
python main.py receipt.jpg \
  --prompt "Extract receipt data as JSON with fields: store, date, items, total"

# Table format
python main.py price-list.jpg \
  --prompt "Extract as a table with columns: item, description, price"
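
A vision model is not guaranteed to return well-formed JSON even when asked, and it may wrap the JSON in extra text. Before handing the result to another program, it may be worth validating it. The sketch below relies on Python's standard `json.tool` module. `is_valid_json` is a hypothetical helper name, and `python3` is assumed to be on your PATH.

```shell
# Succeeds only if standard input is a single well-formed JSON document.
is_valid_json() {
  python3 -m json.tool > /dev/null 2>&1
}

# Usage (hypothetical file name):
#   is_valid_json < receipt.json && echo "valid JSON" || echo "not valid JSON"
```

If validation fails, inspect the raw output first: the model may have added a sentence before or after the JSON that needs trimming.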

⚡ Performance Tips

Choose the Right Model

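  • SmolVLM (Hugging Face): Runs locally, so images stay on your machine; decent accuracy
  • LLaVA (Ollama): Also runs locally; a useful second opinion when SmolVLM struggles with an image
  • GPT-4o (OpenAI): Highest accuracy, but requires an API key and is billed per use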

Optimize for Your Use Case

# For handwriting
python main.py handwritten.jpg \
  --prompt "Carefully transcribe this handwritten text"

# For low-quality images
python main.py blurry-scan.jpg \
  --prompt "Extract text even if image quality is poor"

# For multilingual content
python main.py multilingual.jpg \
  --prompt "Extract text and identify the languages used"

🚨 Common Issues & Quick Fixes

Model Loading Issues

# Re-download SmolVLM if corrupted
rm -rf ~/.cache/huggingface/hub/models--HuggingFaceTB--SmolVLM-Instruct
pixi run setup-smolvlm

Memory Issues

# Use Ollama instead of SmolVLM for lower memory usage
pixi run setup-ollama
python main.py image.jpg --provider ollama

API Key Issues

# Check if OpenAI key is set
pixi run check-env

# Reset environment file
pixi run setup-env
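
This page does not show what `pixi run check-env` inspects, so a manual check can be a useful fallback. The sketch below tests whether a `.env` file defines a non-empty value for a key without printing the value. `has_env_key` is a hypothetical helper, and `OPENAI_API_KEY` is the variable name the official OpenAI client libraries read; that this project uses the same name is an assumption.

```shell
# Succeed if FILE defines KEY with a non-empty value.
# The value is never printed, so this is safe to run in a shared terminal.
has_env_key() {
  file="$1"; key="$2"
  [ -f "$file" ] || return 1
  # Accept optional leading spaces, an optional "export", and optional quotes;
  # reject commented-out lines and empty values.
  grep -Eq "^[[:space:]]*(export[[:space:]]+)?${key}=[\"']?[^\"'[:space:]#]" "$file"
}

# Usage:
#   has_env_key .env OPENAI_API_KEY && echo "key found" || echo "key missing"
```

This only confirms that a value is present. It cannot tell whether the key is valid or has expired.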

➡️ What's Next?

Now that you're comfortable with the basics:

  1. 📖 Deep Dive: Read Basic Usage for more details
  2. 🎯 Specific Tutorials: Try Document Processing
  3. ⚙️ Advanced Features: Explore Advanced Features
  4. 🔧 Configuration: Learn about Configuration options

🎉 You're Ready!

You now know how to:

  • ✅ Extract text from any image or document
  • ✅ Use different AI providers
  • ✅ Customize prompts for specific needs
  • ✅ Handle common use cases

Happy text extracting! 🚀


Need help? Check Troubleshooting or ask in GitHub Issues.
