This project is implemented within the framework of LlamaIndex, using multiple custom components to achieve a fully localized multimodal document-QA system without relying on any APIs or remote resources.
Below are the main tools/models used:
-
PDF parser: SciPDF Parser
-
RAG Framework: LlamaIndex
-
Vector DataBase: Qdrant
-
Vison-LLM: A gguf quantized 7B version of LLaVA
-
LLM Inference Framework: llama.cpp & llama-cpp-python
-
Embedding Models:
WSL2 on Windows 10, Ubuntu 22.04
CUDA Toolkit Version 12.3
SciPDF Parser
Follow the steps from https://github.com/titipata/scipdf_parser/blob/master/README.md.
Install this branch to avoid exceeding the request time limit when dealing with large PDF files.
pip install https://github.com/Virgil-L/scipdf_parser.git
To save memory and computing cost, this project uses a lightweight image of GROBID, run serve_grobid_light.sh
to get it.
LlamaIndex
pip install -q llama-index llama-index-embeddings-huggingface
Qdrant
pip install qdrant-client
llama-cpp-python
# Linux and Mac
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
Download the sample pdf data
wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/paper_PDF/llama2.pdf"
wget --user-agent "Mozilla" "https://arxiv.org/pdf/2310.03744.pdf" -O "data/paper_PDF/llava-vl.pdf"
wget --user-agent "Mozilla" "https://arxiv.org/pdf/1706.03762.pdf" -O "data/paper_PDF/attention.pdf"
wget --user-agent "Mozilla" "https://arxiv.org/pdf/1810.04805.pdf" -O "data/paper_PDF/bert.pdf"
Parse pdf files to extract text and image elemnts
python parse_pdf.py \
--pdf_folder "./data/paper_PDF" \
--image_resolution 300 \
--max_timeout 120
Create Qdrant collections, build and ingest text and image nodes
python build_qdrant_collections.py \
--pdf_folder "./data/paper_PDF" \
--storage_path "./qdrant_db" \
--text_embedding_model "./embedding_models/bge-small-en-v1.5" \
--sparse_text_embedding_model "./embedding_models/efficient-splade-VI-BT-large-doc" \
--image_embedding_model "./embedding_models/clip-vit-base-patch32" \
--chunk_size 384 \
--chunk_overlap 32