Real-time feature pipeline that:
- Ingests financial news from Alpaca
- Cleans & transforms the news documents into embeddings in real-time using Bytewax
- Stores the embeddings into the Qdrant Vector DB
The streaming pipeline is automatically deployed on an AWS EC2 machine using a CI/CD pipeline built in GitHub actions.
The best way to ingest real-time knowledge into an LLM without retraining the LLM too often is by using RAG.
To implement RAG at inference time, you need a vector DB always synced with the latest available data.
The role of this streaming pipeline is to listen 24/7 to available financial news from Alpaca, process the news in real-time using Bytewax, and store the news in the Qdrant Vector DB to make the information available for RAG.
Main dependencies you have to install yourself:
- Python 3.10
- Poetry 1.5.1
- GNU Make 4.3
- AWS CLI 2.11.22
Installing all the other dependencies is as easy as running:
make install
When developing run:
make install_dev
Prepare credentials:
cp .env.example .env
--> and complete the .env
file with your external services credentials.
Check out the Setup External Services section to see how to create API keys for them.
deploy the streaming pipeline to AWS [optional]
You can deploy the streaming pipeline to AWS in 2 ways:
- Running Make commands: install & configure your AWS CLI 2.11.22 as explained in the Setup External Services section
- Using the GitHub Actions CI/CD pipeline: only create an account and generate a pair of credentials as explained in the Setup External Services section
Run the streaming pipeline in real-time
and production
modes:
make run_real_time
To populate the vector DB, you can ingest historical data from the latest 8 days by running the streaming pipeline in batch
and production
modes:
make run_batch
For debugging & testing, run the streaming pipeline in real-time
and development
modes:
make run_real_time_dev
For debugging & testing, run the streaming pipeline in batch
and development
modes:
make run_batch_dev
To query the Qdrant vector DB, run the following:
make search PARAMS='--query_string "Should I invest in Tesla?"'
You can replace the --query_string
with any question.
First, build the Docker image:
make build
Then, run the streaming pipeline in real-time
mode inside the Docker image, as follows:
make run_real_time_docker
First, ensure you successfully configured your AWS CLI credentials.
After, run the following to deploy the streaming pipeline to a t2.small
AWS EC2 machine:
make deploy_aws
NOTE: You can log in to the AWS console, go to the EC2s section, and you can see your machine running.
To check the state of the deployment, run:
make info_aws
To destroy the EC2 machine, run:
make undeploy_aws
Here, you only have to ensure you generate your AWS credentials.
Afterward, you must fork the repository and add all the credentials within your .env
file into the forked repository secrets section.
Go to your forked repository -> Settings -> Secrets and variables -> Actions -> New repository secret
.
Now, add all the secrets as in the image below.
Now, to automatically deploy the streaming pipeline to AWS using the GitHub Action's CI/CD pipeline, follow the next steps: Actions Tab -> Continuous Deployment (CD) | Streaming Pipeline action (on the left) -> Press "Run workflow"
.
To automatically destroy all the AWS components created earlier, you have to call another GitHub Actions workflow as follows: Actions Tab -> Destroy AWS Infrastructure -> Press "Run workflow"
In both scenarios, to see if your EC2 initialized correctly, connect to your EC2 machine and run:
cat /var/log/cloud-init-output.log
Also, to see that the streaming pipeline Docker container is running, run the following:
docker ps
You should see the streaming_pipeline
docker container listed.
Note: You have to wait for ~5 minutes until everything is initialized fully.
Check the code for linting issues:
make lint_check
Fix the code for linting issues (note that some issues can't automatically be fixed, so you might need to solve them manually):
make lint_fix
Check the code for formatting issues:
make format_check
Fix the code for formatting issues:
make format_fix