Description: This script extracts specific metadata from a collection of PDF files (such as master's theses) using the OpenAI API, and stores that metadata in structured formats (CSV and XLSX) for further use or analysis.
Authors: Maurice Vanderfeesten, in collaboration with ChatGPT-4
Ensure you have the following installed on your system:
- Clone the Repository:
  git clone https://github.com/ubvu/gpt_metadata_generator.git
  cd gpt_metadata_generator
- Set Up a Virtual Environment (recommended):
  python3 -m venv env
  source env/bin/activate # On Windows use: env\Scripts\activate
- Install Required Packages:
  pip install -r requirements.txt
- Rename config-example.yaml to config.yaml:
  mv config-example.yaml config.yaml # On Windows use: rename config-example.yaml config.yaml
- Open config.yaml in your preferred text editor and update the fields to match your setup and preferences. For example:
  openai_api_key: Your OpenAI API key.
  input_directory: Path to the directory containing the input PDF files.
  results_directory: Path where you'd like to save the results.
  max_tokens: Maximum number of tokens for the OpenAI API response.
  MAX_TOKENS_FOR_CONTENT: Maximum number of tokens per segment (chunk) of PDF text, so that a single request does not exceed the model's token limit.
  model: The model to use. The default is "gpt-3.5-turbo"; include the quotes.
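To illustrate what a setting like MAX_TOKENS_FOR_CONTENT controls, here is a minimal sketch of splitting extracted PDF text into segments that stay under a token budget. It is not taken from main.py: the function name chunk_text is hypothetical, and tokens are approximated by whitespace-separated words, whereas the actual script may count tokens with a real tokenizer.

```python
def chunk_text(text, max_tokens_for_content):
    """Split text into segments of at most max_tokens_for_content tokens.

    Assumption: one whitespace-separated word counts as one token.
    This is only a rough stand-in for a model tokenizer.
    """
    if max_tokens_for_content <= 0:
        raise ValueError("max_tokens_for_content must be positive")

    words = text.split()
    chunks = []
    # Walk through the words in steps of the budget size.
    for start in range(0, len(words), max_tokens_for_content):
        segment = words[start:start + max_tokens_for_content]
        chunks.append(" ".join(segment))
    return chunks


if __name__ == "__main__":
    sample = "This thesis studies metadata extraction from PDF files"
    for index, chunk in enumerate(chunk_text(sample, 3)):
        print(index, chunk)
```

A smaller budget produces more segments and therefore more API requests. A larger budget risks exceeding the model's limit once the prompt and the max_tokens reserved for the response are added.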
Once everything is set up and configured, run the main script:
python main.py
This will process the PDF files according to the configuration and save the results in the specified directory.
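As a rough picture of the final step, the sketch below writes a list of metadata records to a CSV file in the results directory using only the Python standard library. The function name save_metadata_csv, the file name metadata.csv, and the record fields are illustrative assumptions, not the script's actual output schema. The XLSX output would additionally need a third-party library (for example openpyxl) and is not shown.

```python
import csv
from pathlib import Path


def save_metadata_csv(records, results_directory, filename="metadata.csv"):
    """Write a list of metadata dicts to a CSV file and return its path.

    Field names are hypothetical; they are collected from the records
    themselves, in first-seen order.
    """
    out_dir = Path(results_directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename

    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    with out_path.open("w", newline="", encoding="utf-8") as handle:
        # restval fills in an empty string for fields a record lacks.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)
    return out_path


if __name__ == "__main__":
    example = [
        {"title": "Example thesis A", "author": "Author One"},
        {"title": "Example thesis B", "year": "2023"},
    ]
    print(save_metadata_csv(example, "results"))
```

Collecting the column names from the records keeps the sketch independent of any fixed schema, so records with missing fields still produce a well-formed table.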