Skip to content

Latest commit

 

History

History
193 lines (132 loc) · 23.1 KB

README.md

File metadata and controls

193 lines (132 loc) · 23.1 KB

VeritasTrial

Team Members: Yao Xiao, Bowen Xu, Tong Xiao

This project was the final project for the Harvard AC215 (Fall 2024) course.

Project license app api blog video
Repository black ruff prettier eslint ts
Workflow build test deploy-app deploy-pipeline deploy-chromadb

Warning

This project is no longer live online because it it not cheap to host the project on Google Cloud in the long term. If one is interested in creating a similar project, see the Recreating from Scratch section.

Table of Contents

Subdirectory READMEs:

Introduction

Clinical trial data includes structured information collected during research studies designed to evaluate the safety, efficacy, and outcomes of medical interventions, treatments, or devices on human participants. In recent years, this type of data has expanded rapidly, especially in large repositories like ClinicalTrials.gov, creating immense opportunities to advance healthcare research and improve patient outcomes. However, the large volume and complexity of such data create challenges for researchers, clinicians, and patients in finding and understanding trials relevant to their specific needs.

One key limitation lies in the search functionality of platforms like ClinicalTrials.gov. The current search is based on fuzzy string matching, which struggles to deliver accurate results when user queries are not precise or involve long, complex sentences. Additionally, even when users locate a specific trial, understanding its details can be challenging due to the technical language and dense structure of the information.

Solution: To overcome these challenges, we aim to develop an AI-powered application that improves the information retrieval process for clinical trials. By leveraging state-of-the-art embedding models, our system will retrieve the most relevant trials from ClinicalTrials.gov based on user queries, even if those queries are less structured or precise. After retrieval, an intuitive conversational AI chatbot will enable users to explore specific details of a trial, such as endpoints, results, and eligibility criteria. Our app can benefit a lot of different user groups. For clinical trial researchers, they can find trials to their needs more accurately and interpret the result more efficiently. For patients, it might help them identify target recruiting clinical trials that they can participate in. This interactive approach streamlines access to critical information, empowering users to make more informed decisions in clinical research and patient care.

image

Project organization

├── .github                  > GitHub workflows
│  ├── workflows
│  └── dependabot.yaml
├── app
│  ├── backend/              > VeritasTrial backend
│  ├── frontend/             > VeritasTrial frontend
│  ├── docker-compose.yaml   > VeritasTrial app compose
│  └── Makefile
├── deploy
│  ├── app/                  > App deployment
│  ├── chromadb/             > ChromaDB deployment
│  ├── pipeline/             > Pipeline deployment
│  ├── inventory.yaml        > Ansible inventory
│  └── ...
├── misc/                    > Miscellaneous
├── secrets/                 > Secrets (private)
├── src
│  ├── data-pipeline/        > Data pipeline
│  ├── embedding-model/      > Embedding model
│  ├── construct-qa/         > QA construction (legacy)
│  └── finetune-model/       > Model finetuning (legacy)
├── .gitignore
├── LICENSE.md
└── README.md

Data Pipeline

Raw data collection: We begin by collecting clinical trial data from ClinicalTrials.gov using their API. This data is preprocessed to retain only the necessary columns, focusing on clinical trials that are completed and have results available. These filtered trials are stored as JSONL files in Google Cloud Storage (GCS) buckets for easy access and scalability in downstream task. For more details, see /src/data-pipeline/.

Training data curation (embedding model): To finetune our embedding model, we curate a triplet dataset specifically designed to improve its ability to match brief trial titles to corresponding summary-level information. The triplet structure is as follows:

Query Positive Negative
Title Summary Summaries from 5 other random trials

This dataset is tailored for contrastive learning, enabling the embedding model to distinguish between relevant and irrelevant matches effectively. By learning from these structured triplets, the model is better equipped to embed clinical trial titles and summaries into a shared semantic space for improved retrieval accuracy. This step is not included in our pipeline but manually executed in Google Colab; see /misc/Finetune-BGE.ipynb for more details.

Training data curation (LLM): To finetune the LLM, we use an existing PubMedQA dataset and a self-curated ClinicalTrialQA dataset. For the former, we utilize the PubMedQA dataset available on Hugging Face, which contains 211K biomedical question-answer pairs derived from PubMed abstracts. This dataset provides a strong foundation for fine-tuning Gemini on domain-specific QA tasks. For the latter, to further specialize the chatbot for clinical trials in our dataset, we generate additional QA pairs directly from trial documents. Using Gemini 1.5 Flash on Vertex AI, we prompt the model to create relevant question-answer pairs based on the context of individual trial documents. This augmented dataset ensures that Gemini is well-equipped to handle nuanced questions about clinical trials, such as eligibility criteria, study results, or endpoints. For more details, see /src/construct-qa (legacy).

Vector database: After fine-tuning the embedding model, we embed the summary text for each clinical trial into a high-dimensional vector space. These embeddings, along with relevant metadata (e.g., study phases, conditions, eligibility criteria, etc), are stored in a ChromaDB vector database. The database lives in a VM instance in the GCP compute engine, exposing its service. See /deploy about deploying ChromaDB on GCP. For more details about the pipeline, see /src/embedding-model.

We have validated the quality of our embeddings in the vector database by generating N random samples, with one of them being the correct sample, and see if the model can accurately retrieve the correct one. Both the AUROC score (area under the receiver operating characteristic curve) and the MRR score (mean reciprocal rank) are above 0.99, meaning high retrieval accuracy.

Model Training & Optimization

Our application design involves two model training processes: finetuning the embedding model for trial retrieval and finetuning the LLM for trial interpretation.

Fintuning the embedding model: We use BGE-small-en-v1.5 as the base embedding model and finetuned it with the sentence-transformers package with the triplet dataset. We adopt a contrastive learning approach, finetuning the embedding model with the triplet loss function. This step is not included in our pipeline but manually executed in Google Colab; see /misc/Finetune-BGE.ipynb for more details.

image

Finetuning the LLM: We finetuned the Gemini 1.5 Flash model (gemini-1.5-flash-002) with 29,800 messages (15,764,259 tokens) and 3 epochs. The learing rate muliplier is 0.1 and the adapter size is 4. No sample is too long to be truncated. The training metrics during the supervised finetuning progress are as follows:

image image image

The validation metrics (on a validation set of size 256) during the supervised finetuning progress are as follows:

image image image

Frontend Interface

See /app for details about the application.

We implemented a frontend prototype application using React and TypeScript to streamline clinical trial retrieval. The interface features a range of filters, including options for eligible sex, study type, study phases, patient types, age range, and result dates, allowing users to customize their search. Users can input their query in a text box, and then the system will retrieve relevant clinical trials that are displayed with the trial title, a clickable link to the trial's endpoint, and a chat icon that enables users to initiate a detailed conversation about the trial. Users can also adjust the number of results to display using the Top K option in the bottom-right corner, with choices of 1, 3, 5, 10, 20, or 30. Additionally, the top-right corner provides quick access to the project's GitHub repository and a toggle for switching between light and dark modes, ensuring a user-friendly experience.

image

After selecting a specific trial from the retrieval results, users can proceed to ask detailed questions about the trial, such as its summary, outcome measures, or sponsor information. This functionality is supported through the side chat panel. If the chat icon for a trial is clicked for the first time, a new chat is created in the panel. However, if the trial has already been accessed before, the interface automatically redirects the user to the existing chat for that trial, ensuring continuity and convenience. To return to the retrieval interface, users can simply click the "Trial Retrieval" button at the top of the side panel.

image

There are many more UI/UX designs to enhance user experience that we will not cover here. Some of them include: responsive design for mobile devices, copy button for queries and responses, automatic scroll-into-view for sidebars and chat panels, GitHub-flavored markdown support, etc.

Backend Service

See /app for details about the application.

We implemented the backend for VeritasTrial using FastAPI to manage RESTful APIs that facilitate seamless communication with the frontend. The backend handles clinical trial retrieval, filtering, and conversational interactions. Below are the implemented API endpoints:

  • /heartbeat: A GET endpoint that checks server health by returning the current timestamp in nanoseconds.
  • /retrieve: A GET endpoint that retrieves clinical trials based on user queries, specified filters (e.g., study type, age range, date range), and the desired number of results (Top K).
  • /meta/{item_id}: A GET endpoint that retrieves metadata for a specific trial using its unique ID.
  • /chat/{model}/{item_id}: A POST endpoint that enables interaction with a generative AI model about a specific trial. Users can ask questions (e.g., trial outcomes, sponsors), and the system provides context-aware answers. Chat sessions are automatically created and destroyed on demand.

Deployment

For deployment instructions and details, see /deploy. Here we provide a brief overview.

We use Ansible playbooks to automate the deployment of the VeritasTrial application on Google Kubernetes Engine (GKE). We create or update a Kubernetes cluster with the specified configuration, including node pools and machine types. Then, we deploy the frontend and backend services as Kubernetes Deployments, exposing them via Kubernetes Services. To enable external access and SSL termination, we set up an Nginx Ingress controller. The Ingress routes incoming traffic to the appropriate service based on URL paths. Additionally, we manage secrets for SSL certificates and service account credentials. This automated deployment process ensures consistency, reduces manual effort, and facilitates efficient scaling of the VeritasTrial application.

The deployment of ChromaDB uses Terraform, as suggested in ChromaDB docs. It will deploy a VM instance that runs ChromaDB service. This is separate from the Kubernetes cluster of the frontend and the backend. This is because the vector database is stateful and recreating it is expensive. This isolated design allows us to disaggregate the pipeline workflow (that updates data in the database, executed less frequently) from the app workflow (that accesses data in the database, executed more frequently).

All deployment steps can be triggered by GitHub Actions workflows.

Future Steps

Taking our goals and objectives into consideration, we aim to expand our project to reach a larger audience and provide greater utility for diverse user groups. Some additional work we might consider includes:

  • Multilingual Support: Expand the application to support multiple languages beyond English, enabling users to retrieve and understand clinical trial data in their preferred language.
  • Integration with Other Databases: Extend the system to integrate with additional clinical trial databases or medical resources, such as WHO ICTRP or PubMed, to provide users with a more comprehensive dataset.
  • Real-Time Updates: Implement real-time updates for clinical trial information to ensure users have access to the most current data, including ongoing trial statuses and newly published results.
  • Enhanced Conversational Capabilities: Improve the chatbot’s capabilities to handle more complex and contextual queries, such as comparing multiple trials or answering follow-up questions about a specific trial.
  • Data Visualization: Add interactive data visualization tools to help users better understand clinical trial results and other relevant information.

References

  • Chen J, Xiao S, Zhang P, et al. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation[J]. arXiv preprint arXiv:2402.03216, 2024. https://arxiv.org/abs/2402.03216
  • Jin Q, Dhingra B, Liu Z, et al. Pubmedqa: A dataset for biomedical research question answering[J]. arXiv preprint arXiv:1909.06146, 2019. https://arxiv.org/abs/1909.06146
  • Gao T, Yao X, Chen D. Simcse: Simple contrastive learning of sentence embeddings[J]. arXiv preprint arXiv:2104.08821, 2021. https://arxiv.org/abs/2104.08821

Recreating from Scratch

In this section we will describe how to recreate a similar from project from scratch. Note that this is not tested to work, but you are welcome to open an issue in the issue tracker if the instructions do not directly work, so that we can refine them gradually.

  • Clone this repository. Make sure to have GitHub Actions available.
  • Create a project on Google Cloud Platform. Visit your dashboard, where you can see your project ID. Replace all occurrences in the codebase of veritastrial with your project ID (case-sensitive). Also pick the region and zone for your project. Replace all occurrences of us-central1-a with your zone, and us-central1 with your region. You may want to exclude the README.md file in this process.
  • Go to the APIs & Services dashboard and enable the following APIs: Cloud Monitoring API, Compute Engine API, Cloud Logging API, Vertex AI API, Kubernetes Engine API, Artifact Registry API, Cloud Resource Manager API, Cloud Run Admin API, Network Connectivity API, Notebooks API. Note that this list may not be complete, and you may enable other APIs when needed.
  • Go to IAM & Admin> Service Accounts and create two service accounts. Let your-project-name be a project name which you can choose at random. The first service account should be named your-project-name-service and granted the following accesses: Storage Admin, Vertex AI Administrator. The second service account should be named your-project-name-deployment and granted the following accesses: Artifact Registry Administrator, Compute Admin, Compute OS Admin, Kubernetes Engine Admin, Service Account User, Storage Admin, Vertex AI Administrator. Click into your service acccounts after creation, go to "Keys", click "Add key" then "Create new key" and download as JSON file. Name the downloaded JSON files your-project-name-service.json and your-project-name-deployment.json, respectively. Put them under the /secrets/ directory - they will be automatically git ignored. Then change all occurrences in the codebase of veritas-trial with your-project-name (case sensitive).
  • Go to Cloud Storage > Buckets and create a bucket named your-project-name. Inside the bucket create the following folders: data-pipeline, embedding-model. Make sure the region and zone is correct.
  • Go to Artifact Registry and create a repository named docker. Make sure that the region and zone is correct.
  • On your forked repository in GitHub, go to "Actions", choose "Deploy ChromaDB", and click "Run workflow" with both checkboxes unchecked (which is the default). This should take a long time. After it succeeds, click into the workflow run, click into the job deploy-chromadb, and search in the logs Nginx ingress IP address (supose it says 1.2.3.4), then your deployment should be ready at http://1.2.3.4.sslip.io/ in a few minutes. If you want https (so as to activate the clipboard API in the app), see Obtaining SSL certificate from ZeroSSL. The backend service will be ready at http://1.2.3.4.sslip.io/api/.
  • A little bit more details about the previous step: the "Deploy ChromaDB" workflow actually deploys a ChromaDB service via GCP Compute Engine. Then it (1) deploys the pipeline, i.e. uploads some Docker images to repository you created in the artifact registry, prepares the data in the created buckets, and adds vector embeddings into the deployed ChromaDB database, (2) deploys the app, i.e., deploys some Docker images to the repository you created in the artifact registry, and deploys a Kubernetes cluster in GKE hosting the frontend and the backend. Normally "Deploy ChromaDB" should be run only once. In the future if you make changes to the pipeline (/src/) you should run the "Deploy Pipeline" workflow, and if you make changes to the app (/app/) you should run the "Deploy App" workflow. See /deploy for more details in deployment.
  • After the deployment a pull request will be created and it will be auto-merged (unless CI fails). That will update /deploy/.docker-tag-app, /deploy/.docker-tag-pipeline, and /deploy/chromadb/.instance-ip. /deploy/chromadb/.instance-ip contains the IP address where your deployed ChromaDB service is accessible.

For details of local development, check out the subdirectory READMEs. Again, feel free to open an issue in our issue tracker if you see anything wrong in the instructions, or if you have questions. Happy coding!