StudySetCreator is a Python tool that generates study sets (flashcards) from PDF files using OpenAI's language models. It processes the content of a PDF file, extracts text and images, and uses the OpenAI API to create question-answer pairs suitable for studying or revision purposes. The generated study set is saved as a CSV file, ready to be imported into flashcard applications or used directly.
- PDF Processing: Extracts text and images from PDF files.
- OpenAI Integration: Utilizes OpenAI's GPT models to generate study cards from the extracted content.
- Batch Processing: Supports processing in chunks to handle large PDF files efficiently.
- Resume Capability: Can resume processing from where it left off in case of interruptions.
- Language Support: Generates study sets in the specified language.
- Customization: Allows customization of various parameters like model selection, output file name, chunk size, etc.
- Python: Version 3.7 or higher.
- OpenAI API Key: Required to access OpenAI's language models.
-
Clone the Repository
git clone https://github.com/jaylann/StudySetCreator.git cd StudySetCreator -
Create a Virtual Environment (optional but recommended)
python3 -m venv venv source venv/bin/activate # On Windows use `venv\Scripts\activate`
-
Install Dependencies
pip install -r requirements.txt
The application requires an OpenAI API key to function. This key should be stored in a .env file in the project's root directory.
-
Create the
.envFileThere is a
.env.templatefile provided in the project. Copy this template to create your.envfile:cp .env.template .env
-
Edit the
.envFileOpen the
.envfile in a text editor and add your OpenAI API key:OPENAI_API_KEY=your-openai-api-key-here
Replace
your-openai-api-key-herewith your actual OpenAI API key.
Run the main.py script with the required arguments to generate a study set from a PDF file.
python main.py [options]--model: OpenAI model to use (default:gpt-4o-mini).--output: Output CSV file name (default:study_set.csv).--input: Input PDF file to process (required).--in_dir: Input directory containing PDF files to process.--out_dir: Output directory to save the study sets.--chunk_size: Number of pages to process at once (default:10).--use_batch: Use OpenAI Batch API for processing.--text_only: Extract text only, ignore images.--language: Language for the study set (default:english).--no_resume: Whether to resume processing from the last checkpoint. WARNING: If set and a progress file exists, it will be overwritten.
-
Basic Usage
Generate a study set from
document.pdfusing the default settings.python main.py --input document.pdf --output document.csv
-
Specify Output File and Model
Generate a study set from
lecture_notes.pdf, using thegpt-4omodel, and save the output toflashcards.csv.python main.py --model gpt-4o --output flashcards.csv --input lecture_notes.pdf
-
Process Only Text Content
Generate a study set ignoring images in the PDF.
python main.py --text_only --input textbook.pdf --output textbook.csv
-
Use Batch Processing
Use OpenAI's Batch API to process the PDF (suitable for large PDFs. Reduces cost by ~50% but may take longer).
python main.py --use_batch --input large_document.pdf --output large_document.csv
-
Specify Language
Generate a study set in Spanish.
python main.py --language spanish --input notas_de_clase.pdf --output notas_de_clase.csv
The system prompt used by the OpenAI API can be customized to change how study cards are generated.
-
Prompt File:
./storage/prompt.txtEdit this file to modify the prompt. The placeholder
[LANGUAGE]in the prompt will be replaced with the language specified via the--languageargument.
The JSON schema defines the expected structure of the API responses.
-
Schema File:
./storage/schema.jsonEdit this file to change the schema if you need the responses in a different format.
The application uses logging to provide information about its operation.
- Log Output: The application outputs logs to the console. You can modify the logging configuration in
src/utils/logging.pyif you need to change log levels or output formats.
- Resume Processing: If the processing is interrupted, the application can resume from where it left off using the progress saved in
progress.json. - Progress File: The file
progress.jsonis used to keep track of progress. It can be deleted to start processing from the beginning. - Batch Processing Errors: If a batch job fails or is still in progress, an error message will be logged.
All required Python packages are listed in requirements.txt. Install them using:
pip install -r requirements.txt- API Usage: Be mindful of your OpenAI API usage and billing.
- Supported Models: Ensure that the model you specify (e.g.,
gpt-4) is available to your OpenAI account.
Contributions are welcome! Please open an issue or submit a pull request for any bugs or feature requests.
This project is licensed under the MIT License.
Read more about this on my blog
Made with ❤️ by Justin Lanfermann