Document Question-Answering with Local RAG in Android

A simple Android app that lets the user add a PDF/DOCX document and ask natural-language questions, with answers generated by an LLM.

app_demo.mp4

(The PDF used in the demo can be found in the resources directory)

YT Video

Goals

  • Demonstrate the collective use of an on-device vector database, embeddings model and a custom text-splitter to build a retrieval-augmented generation (RAG) based pipeline for simple document question-answering
  • Use modern Android development practices and recommended architecture guidelines
  • Explore and suggest better tools/alternatives for building a fully offline, on-device RAG pipeline for Android with minimum compute and storage requirements

Feature             On-Device  Remote
Sentence Embedding  Yes        No
Text Splitter       Yes        No
Vector Database     Yes        No
LLM                 No         Yes

Setup

  1. Clone the main branch:
$> git clone --depth=1 https://github.com/shubham0204/Android-Document-QA
  2. Get an API key from Google AI Studio to use the Gemini API. Copy the key and paste it in local.properties, present in the root directory of the project:
geminiKey="AIza[API_KEY_HERE]"

Perform a Gradle sync, and run the application.

Tools

  1. Apache POI and iTextPDF for parsing DOCX and PDF documents
  2. ObjectBox for on-device vector-store and NoSQL database
  3. Sentence Embeddings (all-MiniLM-L6-V2) for generating on-device text/sentence embeddings
  4. Gemini Android SDK as a hosted large-language model (Uses Gemini-1.5-Flash)

Working

The basic workflow of the app is as follows:

  1. When the user selects a PDF/DOCX document (the only ones which can be imported for now), the text is parsed with the libraries mentioned in (1) of Tools. See PDFReader.kt and DOCXReader.kt for reference.
  2. Chunks or overlapping sub-sequences are produced from the text, given the size of sequence (chunkSize) and the extent of overlap between two sequences (chunkOverlap). See WhiteSpaceSplitter.kt for reference.
  3. Each chunk is encoded into a fixed-size vector i.e. a text embedding. The embeddings are inserted in the vector database, with each chunk/embedding having a distinct chunkId. See SentenceEmbeddingProvider.kt for reference.
  4. When the user submits a query, we find the top-K most similar chunks from the database by comparing their embeddings.
  5. The chunks corresponding to the nearest embeddings are injected into a pre-built prompt along with the query, which is provided to the LLM. The LLM generates a well-formed natural language answer to the user's query. See GeminiRemoteAPI.kt for reference.
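
Steps 2 and 4 can be sketched roughly as below. This is a minimal illustration and not the code from WhiteSpaceSplitter.kt or the ObjectBox query used in the app: the names splitIntoChunks, StoredChunk, cosineSimilarity and topKChunks are hypothetical, chunkSize and chunkOverlap are assumed to be counted in whitespace-separated words, and the linear scan over all embeddings only stands in for the vector database's nearest-neighbour search.

```kotlin
import kotlin.math.sqrt

// Hypothetical splitter: emits windows of `chunkSize` words, where consecutive
// windows share `chunkOverlap` words. Counting in words is an assumption.
fun splitIntoChunks(text: String, chunkSize: Int, chunkOverlap: Int): List<String> {
    require(chunkSize > 0) { "chunkSize must be positive" }
    require(chunkOverlap in 0 until chunkSize) { "chunkOverlap must be in [0, chunkSize)" }
    val words = text.trim().split(Regex("\\s+")).filter { it.isNotEmpty() }
    val step = chunkSize - chunkOverlap
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < words.size) {
        val end = minOf(start + chunkSize, words.size)
        chunks.add(words.subList(start, end).joinToString(" "))
        if (end == words.size) break // the last window reached the end of the text
        start += step
    }
    return chunks
}

// Hypothetical record pairing a chunk with its embedding and a distinct chunkId.
class StoredChunk(val chunkId: Long, val text: String, val embedding: FloatArray)

// Cosine similarity between two vectors of the same dimension.
// Returns 0 if either vector has zero length.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "vectors must have the same dimension" }
    var dot = 0f
    var normA = 0f
    var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    val denominator = sqrt(normA) * sqrt(normB)
    return if (denominator == 0f) 0f else dot / denominator
}

// Brute-force top-K: scores every stored chunk against the query embedding
// and keeps the K most similar ones, most similar first.
fun topKChunks(queryEmbedding: FloatArray, store: List<StoredChunk>, k: Int): List<StoredChunk> =
    store.sortedByDescending { cosineSimilarity(queryEmbedding, it.embedding) }.take(k)
```

For example, splitIntoChunks("a b c d e", chunkSize = 3, chunkOverlap = 1) yields the chunks "a b c" and "c d e". A larger overlap repeats more text across chunks, which increases the number of embeddings to store but lowers the chance of a sentence being cut in half at a chunk boundary.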

The prompt is as follows:

You are an intelligent search engine. You will be provided with some retrieved context, as well as the users query.
Your job is to understand the request, and answer based on the retrieved context.
Strictly Use ONLY the following pieces of context to answer the question at the end.
Provide only the answer as a response

Here is the retrieved context:
    $CONTEXT

Here is the users query:
    $QUERY
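
Filling this template could look like the sketch below. It is an assumption about how the substitution might be done, not the code from GeminiRemoteAPI.kt: buildPrompt and placeholderPattern are hypothetical names, and joining the retrieved chunks with newlines is an illustrative choice. Both placeholders are replaced in a single pass, so a query that happens to contain the text $CONTEXT is inserted as-is and not expanded again.

```kotlin
// Matches the two placeholders used in the prompt template: $CONTEXT and $QUERY.
val placeholderPattern = Regex("\\$(CONTEXT|QUERY)")

// Hypothetical helper: joins the retrieved chunks with newlines and substitutes
// them, together with the user's query, into the template in one pass.
fun buildPrompt(template: String, retrievedChunks: List<String>, query: String): String {
    val context = retrievedChunks.joinToString(separator = "\n")
    return placeholderPattern.replace(template) { match ->
        // The lambda's return value is inserted literally, without further expansion.
        if (match.groupValues[1] == "CONTEXT") context else query
    }
}
```

The resulting string is what would be sent to the LLM as the full prompt.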

Discussion

Why not use on-device LLMs instead of Gemini's cloud SDK?

Using an on-device LLM is possible in Android, but at the cost of a large app size (>1GB) and high compute requirements. Google's Edge AI SDK has some options where models like Gemma, MS Phi-2 and Falcon can be used completely on-device and accessed via Mediapipe's Android/iOS/Web APIs. See the official documentation for Mediapipe LLM Inference; it also includes instructions for LoRA fine-tuning.

Moreover, the Android-specific section of the same docs notes:

During development, you can use adb to push the model to your test device for a simpler workflow. For deployment, host the model on a server and download it at runtime. The model is too large to be bundled in an APK.

Integrating the Mediapipe LLM Inference API is easy. As I did not have a sufficiently capable Android device, I went ahead with the cloud API, but it would be great to have an on-device option. Gemini Nano, currently available on a limited set of devices, is another on-device solution.

Other tools for using LLMs on Android:

  1. mlc (Also see Llama3 on Android)
  2. llama.cpp for Android

(Solved) Better alternatives for the Universal Sentence Encoder (embedding model)

Problem:

The app previously used the Universal Sentence Encoder model from Google, as it was the only available way to generate text/sentence embeddings on an Android device with a built-in API and tokenizer. It generates an embedding of size 100.

After checking the retrieved context (similar chunks) for a few questions, I noticed that the embedding model did not capture the context of a sentence well. I couldn't find a metric on HuggingFace's MTEB to validate this point. Models from sentence-transformers were particularly good at understanding context, but integrating them in an Android app was an open problem.

Solution:

The all-MiniLM-L6-V2 model from sentence-transformers has been ported to Android with the help of ONNX/onnxruntime and the Rust implementation of huggingface/tokenizers. See the app's assets folder to find the ONNX model and tokenizer.json.

See the main repository shubham0204/Sentence-Embeddings-Android for more details.

Contributions and Open Problems

Feel free to raise an issue or open a PR. The following can be improved in the app:

  1. Add on-device LLM capabilities
  2. Build a new text-splitter, taking inspiration from Langchain or LlamaIndex
  3. Port a good, small-in-size embedding model from HuggingFace to work in the app