RAG Chatbot on docs.md #160
Open: Chrisyhjiang wants to merge 377 commits into main from RAG
b3e4760
fast api renamed
raphaeltm 8d51d64
Merge pull request #56 from DefangLabs/rename-fastapi
raphaeltm e6d5d0e
updated the prod command
Chrisyhjiang 1cebf5b
Merge branch 'angular' of https://github.com/DefangLabs/samples into …
Chrisyhjiang fdc79ed
Merge branch 'main' into sample-repo-automation
raphaeltm 6f2dbb8
testrun
raphaeltm f9e5efd
run another test
raphaeltm 611b081
change token
raphaeltm c9589b3
update subtree push
raphaeltm b6ee796
update subtree push
raphaeltm ed187da
took out unecessary Dockerfile.dev
Chrisyhjiang a22a85f
took out the log files
Chrisyhjiang 7aa059b
update subtree push
raphaeltm 8cd87ae
update subtree push
raphaeltm 9fb82de
update subtree push
raphaeltm ce95916
update subtree push
raphaeltm 14baf8c
fastapi-postgres
aRorschach fb18edd
split
raphaeltm 153810f
trying something...
raphaeltm f7d1afb
trying something...
raphaeltm 3ae686d
trying something...
raphaeltm d79ecef
success plz
raphaeltm 6740878
success plz
raphaeltm 225fd3c
this should be the last time...
raphaeltm 2784d7e
this should be the last time...
raphaeltm 53bb7df
this should be the last time...
raphaeltm 8ddf730
Merge pull request #39 from DefangLabs/angular
raphaeltm 0e6ebba
brought in main
Chrisyhjiang 91b9cc9
Merge pull request #45 from DefangLabs/sailsjs-postgres
Chrisyhjiang 1b117cd
remove .gitignore
aRorschach 8aac276
please, please, work.
raphaeltm 4470e0a
Merge remote-tracking branch 'origin/main' into sample-repo-automation
raphaeltm d5a18c8
testing
raphaeltm 5db73e2
test multiple samples
raphaeltm 4fb0395
checkout current branch afterward
raphaeltm 3a2f0a6
checkout gh ref
raphaeltm 950f3a4
current branch set
raphaeltm 5c78918
next template
raphaeltm 8865c65
update actions
raphaeltm b0074e7
Merge pull request #38 from DefangLabs/sample-repo-automation
raphaeltm c24ae93
publish sample template on main
raphaeltm 69bd57b
Merge pull request #58 from DefangLabs/sample-repo-automation
raphaeltm 630cacd
default to main
raphaeltm e22c088
Merge pull request #59 from DefangLabs/sample-repo-automation
raphaeltm 39db0bd
sample next template
raphaeltm 69b39ea
Merge pull request #60 from DefangLabs/sample-repo-automation
raphaeltm 427d8ce
sample next template
raphaeltm 4057d85
sample next template
raphaeltm 3cdf9cd
Merge pull request #61 from DefangLabs/sample-repo-automation
raphaeltm 5da2baf
sample next template
raphaeltm c364cd2
Merge branch 'main' of github.com:DefangLabs/samples
raphaeltm b65a425
update next
raphaeltm 79a1afd
update nextjs readme
raphaeltm 0a931b0
restructure
aRorschach 0b5f871
Merge branch 'main' into nodejs-react-postgres
raphaeltm 0d4f608
Update compose.yaml
raphaeltm 09da858
Update .env
raphaeltm 974e7a4
changes to get dev working
raphaeltm 2e8b6c7
set .env to localhost:3010
raphaeltm ad94983
fix for template manager
raphaeltm 8e34799
Merge pull request #32 from DefangLabs/nodejs-react-postgres
raphaeltm 2a6bbd7
functional elysia sample
raphaeltm 994a0d3
Merge pull request #63 from DefangLabs/25-elysia-sample
raphaeltm 7da9397
fixed readme no mentions of Hasura
Chrisyhjiang 2faa23b
fixed grammar
Chrisyhjiang 2eef7ed
reorg and working
raphaeltm 1f099d5
Update README.md
raphaeltm eab08eb
Merge pull request #102 from DefangLabs/101-fix-sails-example-readme
raphaeltm 54f0f09
Merge pull request #57 from DefangLabs/fastapi-postgres
raphaeltm 50c838c
fix readme to include local dev
raphaeltm c0ba32d
Merge pull request #105 from DefangLabs/fastapi-postgres
raphaeltm f616191
working version of the langchain sample need to write README
Chrisyhjiang af5d2a7
changed compose and README
Chrisyhjiang c69ce5a
changed readme
Chrisyhjiang e085adf
reorganized files
Chrisyhjiang e61d067
minor readme tweak
Chrisyhjiang 74a76a7
golang http refractoring
Chrisyhjiang 994daf2
finished refractoring for golang-http-form
Chrisyhjiang 8a1d438
changed compose yaml
Chrisyhjiang 9a57524
fixed golang mongodb
Chrisyhjiang f2e78a7
finished golang-openai
Chrisyhjiang df801fd
readme change
Chrisyhjiang d5ae4ab
fixed golang rest api
Chrisyhjiang d91eb85
fixed slackbot and s3 for go
Chrisyhjiang 556e479
fixed golang-s3
Chrisyhjiang 25a63f4
nextjs blog
Chrisyhjiang b9dc435
nextjs boilerplate fixed
Chrisyhjiang 986cc2f
nextjs documentation
Chrisyhjiang cd84d01
nextjs documentation README change
Chrisyhjiang 7b0705f
nodejs-chatroom fixed
Chrisyhjiang b4da94f
nodejs express refractored
Chrisyhjiang 14e3682
check in reorganizations
Chrisyhjiang 875bf7e
nodejs rest-api
Chrisyhjiang 042de37
refractored nodejs-as3
Chrisyhjiang 20798f7
flask form refractored
Chrisyhjiang 2da8d0c
working locally
raphaeltm 18b8dfa
much better bullmq sample
raphaeltm 6e9f93d
refractored implicit flask
Chrisyhjiang e4a306d
python-minimal refractored
Chrisyhjiang 6b1bb14
python openai refractored
Chrisyhjiang 2a6e142
refractored python rest api
Chrisyhjiang 535ead5
refractored python s3
Chrisyhjiang b9bb83a
django sample
Chrisyhjiang c0ed46b
refractor django postgres
Chrisyhjiang ac51f78
sample request to test
raphaeltm 8b13b87
remove healthcheck response
raphaeltm 3e0f4c7
remove healthcheck response
raphaeltm 61dcea0
show replicas 2
raphaeltm f40228e
remove version:
lionello 2665566
Merge pull request #111 from DefangLabs/lio-version
lionello 9e450f6
new refractoring done to README
Chrisyhjiang 9bb74a6
update readme and add project name
raphaeltm 89a2ec9
changed back into app waiting for Lio and Edward to add in projects l…
Chrisyhjiang a0ae044
Merge remote-tracking branch 'origin/main' into 107-reorg-older-samples
Chrisyhjiang 5f30e21
new changes to README
Chrisyhjiang f1a5e04
took out Defang in README
Chrisyhjiang f13023e
update compose files to remove name and version
raphaeltm d611c37
amend template-manager for new DefangSamples org
lionello 882a652
update template manager to point to DefangSamples
raphaeltm 7863e86
Merge branch 'update-sample-deploy' into 110-bullmq-bull-board-redis-…
raphaeltm 85f60ae
minor reame tweak
raphaeltm cdeaa39
renamed service 1
Chrisyhjiang d0fd403
Merge branch 'main' into lio-samples-org
raphaeltm c0b1820
Merge pull request #116 from DefangLabs/lio-samples-org
raphaeltm 2265a9d
add gh actions to all samples
raphaeltm 519ee4d
Merge branch '107-reorg-older-samples' of github.com:DefangLabs/sampl…
raphaeltm 829c334
reorg rails and svelte-mysql
raphaeltm e91c92e
make titles consistent
raphaeltm 1a99a58
remove compose project names
raphaeltm 5126f3f
make readmes consistent
raphaeltm ff065f1
Merge pull request #108 from DefangLabs/107-reorg-older-samples
raphaeltm bab43ed
add Dockerfile to huginn readme languages
raphaeltm 2671543
add GPU to python implicit
raphaeltm 25b2fab
pulumi remix postgres updated
raphaeltm 8ac8d1b
remove references to service1
raphaeltm 0d91b78
Merge pull request #121 from DefangLabs/113-warning-in-nodejs-http-sa…
raphaeltm 50fbeff
C# .NET Sample
Chrisyhjiang 229e880
C# sample with .NET
Chrisyhjiang 89a042e
rename redis service
lionello 2655ea2
support private repos in deploy action
lionello bfb8631
Merge pull request #127 from DefangLabs/lio-deploy-private-repo
lionello ca70e90
gpu tags
lionello 6600fbb
Merge pull request #128 from DefangLabs/lio-tags
raphaeltm 6ea7525
Merge branch 'main' into 110-bullmq-bull-board-redis-sample
raphaeltm ff945c3
add contents read
raphaeltm 138bd39
Merge branch '110-bullmq-bull-board-redis-sample' of github.com:Defan…
raphaeltm 56fa611
Merge pull request #114 from DefangLabs/110-bullmq-bull-board-redis-s…
raphaeltm 9e3e724
update compose
raphaeltm ea8ef95
Merge pull request #131 from DefangLabs/110-bullmq-bull-board-redis-s…
raphaeltm ba92152
fix compose errors
lionello 21ae962
Merge branch 'main' of https://github.com/DefangLabs/samples
Chrisyhjiang c683c00
validate config files
lionello d38dd7b
Merge pull request #133 from DefangLabs/lio-warnings
raphaeltm 5cad6dd
featherjs application
Chrisyhjiang bf82761
Merge branch 'main' of https://github.com/DefangLabs/samples
Chrisyhjiang 18bd62a
Update README.md
Prakash-Sundaresan 94d10f8
added empty debug field for testing if the parsing logic for env logi…
Chrisyhjiang f6b3723
removed config stuff
Chrisyhjiang 0c24328
tookout debug env var
Chrisyhjiang 23657ae
added lines in between README.md
Chrisyhjiang f2ea403
fix healthcheck for flask
lionello 9b330eb
Merge pull request #141 from DefangLabs/lio-flask-healthcheck
edwardrf 0f89b12
finished checking the FeatherJS sample
Chrisyhjiang e1f6653
flask app with langchain fixed
Chrisyhjiang f5d3db5
fixed the context
Chrisyhjiang 2403f51
changed README content
Chrisyhjiang 382ca83
Merge pull request #140 from DefangLabs/sailsjs-postgres
raphaeltm 0187990
Merge pull request #139 from DefangLabs/Prakash-Sundaresan-readme-1
raphaeltm f1a121f
fixed README
Chrisyhjiang 9a3d219
fixed README
Chrisyhjiang 5186a01
changes to README
Chrisyhjiang 1e540f2
fixed README
Chrisyhjiang 1f91c22
README changes
Chrisyhjiang 4b201c1
changes to README
Chrisyhjiang 1d0e5c5
Merge branch 'main' of https://github.com/DefangLabs/samples
Chrisyhjiang bc79aa9
Merge pull request #109 from DefangLabs/16-langchain-sample
raphaeltm e6f523e
Update README.md
raphaeltm 7ce2d9c
Merge pull request #136 from DefangLabs/73-feathersjs-sample
raphaeltm 4926a03
Merge branch 'main' of https://github.com/DefangLabs/samples
Chrisyhjiang 6c76146
Update README.md
raphaeltm 79f938d
Merge pull request #123 from Chrisyhjiang/c#
raphaeltm 0216c3c
Merge branch 'main' of https://github.com/DefangLabs/samples
Chrisyhjiang 706b917
updates to README to normalize capitalization and other display incon…
Chrisyhjiang b085996
made nodejs one word to address the parsing issue
Chrisyhjiang c3a9bd2
fixed inconsistencies
Chrisyhjiang f643e1c
Http -> http
Chrisyhjiang 2d88c82
langchain => LangChain
Chrisyhjiang 5f77e2a
normalization
Chrisyhjiang 211cf06
more normalization
Chrisyhjiang 8b14b03
more normalization
Chrisyhjiang 5747e30
fixed more README
Chrisyhjiang 885d68a
adsf
Chrisyhjiang 5ca0c15
finished revision
Chrisyhjiang 7a41a6b
fixed
Chrisyhjiang 134a42d
Merge pull request #149 from Chrisyhjiang/normalization
raphaeltm daaf69d
rm noco
raphaeltm 5f56a13
rm Node.js from languages in bullmq
raphaeltm 60089a4
compose validation temporarily disabled
raphaeltm 71d244f
trial
Chrisyhjiang 8a85685
fixed READMEs to have tags include languages and revert language capi…
Chrisyhjiang 4c6163b
fixed wrong READMEs
Chrisyhjiang 8b477fc
Merge pull request #153 from DefangLabs/typescript-addendum
raphaeltm 1bef62a
add space after languages
raphaeltm 68ebb8a
Merge remote-tracking branch 'origin' into RAG
Chrisyhjiang f45175c
current version
Chrisyhjiang 56d5077
current version
Chrisyhjiang 0de5976
RAG system working
Chrisyhjiang 6610f11
project info
Chrisyhjiang 9824194
working version of RAG but no query processing
Chrisyhjiang 72d76f8
add reservations placeholder
lionello 3431152
removed env var
Chrisyhjiang a77b10b
working version of processing query
Chrisyhjiang 712acfd
more accurate version of query processing
Chrisyhjiang 074239a
working version of RAG chatbot
Chrisyhjiang ec547e0
resolve merge conflict accepted main branch
Chrisyhjiang 320e11e
added README
Chrisyhjiang 77e4dff
changed readme
Chrisyhjiang 53d4cc4
changed README
Chrisyhjiang 9fcf7c2
parsed the docs website using the sitemap XML
Chrisyhjiang 7811970
commit save openai migrate
Chrisyhjiang 683ecde
before openai migrate
Chrisyhjiang 632b642
updated requirements
Chrisyhjiang d78b778
used model 4
Chrisyhjiang 7903e7e
Merge branch 'main' of https://github.com/DefangLabs/samples into RAG
Chrisyhjiang 6ccaeb9
README edits
Chrisyhjiang f3d8d39
added some changes to make sure the context is correct
Chrisyhjiang 2022717
added a LRU cache for common queries
Chrisyhjiang aa6b16d
added caching
Chrisyhjiang e2b9fad
improved UI
Chrisyhjiang 39017f5
normalize query for caching and improved UI to match the main Defang …
Chrisyhjiang d73fc30
README updates
Chrisyhjiang 72ef6c2
Merge branch 'main' of https://github.com/DefangLabs/samples into RAG
Chrisyhjiang f7fe8d8
added md display as well as increased max tokens to 1024 to finish th…
Chrisyhjiang 9e6c726
fixed incorrect output
Chrisyhjiang 0729785
enhanced knowledge base to parse directly from the docs repo with the…
Chrisyhjiang f16fc46
get knowledgebase implementation
Chrisyhjiang 5eda5a2
removed debugging
Chrisyhjiang e5ce451
reparsed knowledge base and added in fuzzy matching logic with the ab…
Chrisyhjiang 53229ba
parsing logic
Chrisyhjiang 592012b
updated the knowledge base and rag system
Chrisyhjiang ab26514
updated retrieval logic
Chrisyhjiang 3e2be81
testing
Chrisyhjiang 9b27e51
added context limit
Chrisyhjiang 19b5ac6
changes to parsing and knowledge base
Chrisyhjiang 0bd647c
fixed some stuff
Chrisyhjiang f5abd01
Merge branch 'main' of https://github.com/DefangLabs/samples into RAG
Chrisyhjiang ea69b4c
fixed typo and better UI (no typing when question being processed)
Chrisyhjiang 2fe8de2
added deploy.yaml
Chrisyhjiang 2af9f08
add openai config in deploy action for 1-click
raphaeltm a36cdc9
make it prod ready
Chrisyhjiang
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
FROM mcr.microsoft.com/devcontainers/python:3.12-bookworm |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
{ | ||
"build": { | ||
"dockerfile": "Dockerfile", | ||
"context": ".." | ||
}, | ||
"features": { | ||
"ghcr.io/defanglabs/devcontainer-feature/defang-cli:1.0.4": {}, | ||
"ghcr.io/devcontainers/features/docker-in-docker:2": {} | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
myenv |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
name: Deploy | ||
|
||
on: | ||
push: | ||
branches: | ||
- main | ||
|
||
jobs: | ||
deploy: | ||
runs-on: ubuntu-latest | ||
permissions: | ||
contents: read | ||
id-token: write | ||
|
||
steps: | ||
- name: Checkout Repo | ||
uses: actions/checkout@v4 | ||
|
||
- name: Deploy | ||
uses: DefangLabs/[email protected] | ||
with: | ||
config-env-vars: OPENAI_API_KEY | ||
env: | ||
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
myenv/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
# Scikit RAG + OpenAI | ||
|
||
This sample demonstrates how to deploy a Flask-based Retrieval-Augmented Generation (RAG) chatbot using OpenAI's GPT model. The chatbot retrieves relevant documents from a knowledge base using scikit-learn and Sentence Transformers, then generates a response with the GPT model. An LRU cache holds up to 128 queries. | |
|
||
## Prerequisites | ||
|
||
1. Download [Defang CLI](https://github.com/DefangLabs/defang) | ||
2. (Optional) If you are using [Defang BYOC](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html), make sure you have authenticated with your AWS account. | |
3. (Optional - for local development) [Docker CLI](https://docs.docker.com/engine/install/) | ||
|
||
## Deploying | ||
|
||
1. Open the terminal and type `defang login` | ||
2. Type `defang compose up` in the CLI. | ||
3. Your app will be running within a few minutes. | ||
|
||
## Local Development | ||
|
||
1. Clone the repository. | ||
2. Create a `.env` file in the root directory and set your OpenAI API key in it, or export `OPENAI_API_KEY` from your `.zshrc` or `.bashrc` file. | |
3. Run `docker compose up --build` to spin up a Docker container for this RAG chatbot. | |
|
||
## Configuration | ||
|
||
- The knowledge base is built by parsing the sitemap located at "https://docs.defang.io/sitemap.xml". | |
- The file `scrape_sitemap.py` splits every page listed in the sitemap into paragraphs and writes them to `knowledge_base.json` for RAG retrieval. | |
- To build your own knowledge base, either use another sitemap or write your own parser that outputs `knowledge_base.json`. | |
- A least recently used (LRU) cache is also in place, as can be seen in `rag_system.py`. It caches common queries for faster response times. Feel free to adjust it as needed. | |
|
||
--- | ||
|
||
Title: Scikit RAG + OpenAI | ||
|
||
Short Description: A short hello world application demonstrating how to deploy a Flask-based Retrieval-Augmented Generation (RAG) chatbot using OpenAI's GPT model onto Defang. | ||
|
||
Tags: Flask, Scikit, Python, RAG, OpenAI, GPT, Machine Learning | ||
|
||
Languages: python |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
# Use an official Python runtime as a parent image | ||
FROM python:3.9-slim | ||
|
||
# Set the working directory in the container | ||
WORKDIR /app | ||
|
||
# Copy the requirements file first to leverage Docker's cache | ||
COPY requirements.txt /app/ | ||
|
||
# Install any needed packages specified in requirements.txt | ||
RUN pip install --no-cache-dir -r requirements.txt | ||
|
||
RUN pip install gunicorn | ||
|
||
# Install additional packages | ||
RUN pip install sentence-transformers openai | ||
|
||
# Copy the current directory contents into the container at /app | ||
COPY . /app | ||
|
||
# Make port 5000 available to the world outside this container | ||
EXPOSE 5000 | ||
|
||
# Define environment variable | ||
ENV FLASK_APP=app.py | ||
|
||
# Run app.py when the container launches | ||
CMD ["flask", "run", "--host=0.0.0.0"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
from flask import Flask, request, jsonify, render_template | ||
from rag_system import rag_system | ||
|
||
app = Flask(__name__) | ||
|
||
@app.route('/', methods=['GET', 'POST']) | ||
def index(): | ||
if request.method == 'POST': | ||
query = request.form.get('query') | ||
if not query: | ||
return render_template('index.html', query=None, response="No query provided") | ||
|
||
try: | ||
response = rag_system.answer_query(query) | ||
return render_template('index.html', query=query, response=response) | ||
except Exception as e: | ||
print(f"Error in / endpoint: {e}") | |
return render_template('index.html', query=query, response="Internal Server Error") | ||
return render_template('index.html', query=None, response=None) | ||
|
||
@app.route('/ask', methods=['POST']) | ||
def ask(): | ||
data = request.get_json(silent=True) or {} | |
query = data.get('query') | ||
if not query: | ||
return jsonify({"error": "No query provided"}), 400 | ||
|
||
try: | ||
response = rag_system.answer_query(query) | ||
return jsonify({"response": response}) | ||
except Exception as e: | ||
print(f"Error in /ask endpoint: {e}") | ||
return jsonify({"error": "Internal Server Error"}), 500 | ||
|
||
if __name__ == '__main__': | ||
app.run(host='0.0.0.0', port=5000) |
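For reviewers who want to exercise the JSON endpoint, here is a small client sketch using only the standard library. It assumes the app is reachable at a base URL such as `http://localhost:5000` (the port exposed in the Dockerfile). The helper names `build_ask_request` and `ask` are hypothetical and not part of this PR.

```python
import json
import urllib.request


def build_ask_request(base_url: str, query: str) -> urllib.request.Request:
    """Build a POST request for /ask with the JSON body the route expects."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/ask",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, query: str, timeout: float = 30.0) -> str:
    """Send the query and return the "response" field of the JSON reply.

    urlopen raises urllib.error.HTTPError for the 400 and 500 replies
    that the /ask route can return.
    """
    request = build_ask_request(base_url, query)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]
```

With the container running locally, `ask("http://localhost:5000", "What is Defang?")` would return the chatbot's answer as a string.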
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,128 @@ | ||
import re | ||
import json | ||
import os | ||
|
||
# Function to reset knowledge_base.json | ||
def reset_knowledge_base(): | ||
with open('knowledge_base.json', 'w', encoding='utf-8') as output_file: | |
json.dump([], output_file) | ||
|
||
def parse_markdown_file_to_json(file_path): | ||
try: | ||
# Load existing content if the file exists | ||
with open('knowledge_base.json', 'r', encoding='utf-8') as existing_file: | |
json_output = json.load(existing_file) | ||
current_id = len(json_output) + 1 # Start ID from the next available number | ||
except (FileNotFoundError, json.JSONDecodeError): | ||
# If the file doesn't exist or is empty, start fresh | ||
json_output = [] | ||
current_id = 1 | ||
|
||
with open(file_path, 'r', encoding='utf-8') as file: | ||
lines = file.readlines() | ||
|
||
# Skip the first 5 lines | ||
markdown_content = "".join(lines[5:]) | ||
|
||
# First pass: Determine headers for 'about' section | ||
sections = [] | ||
current_section = {"about": [], "text": []} | ||
has_main_header = False | ||
|
||
for line in markdown_content.split('\n'): | ||
header_match = re.match(r'^(#{1,6}|\*\*+)\s+(.*)', line) # Match `#`, `##`, ..., `######` and `**` | ||
if header_match: | ||
header_level = len(header_match.group(1).strip()) | ||
header_text = header_match.group(2).strip() | ||
|
||
if header_level == 1 or header_match.group(1).startswith('**'): # Treat `**` as a main header | ||
if current_section["about"] or current_section["text"]: | ||
sections.append(current_section) | ||
current_section = {"about": [header_text], "text": []} | ||
has_main_header = True | ||
else: | ||
if has_main_header: | ||
current_section["about"].append(header_text) | ||
else: | ||
if header_level == 2: | ||
if current_section["about"] or current_section["text"]: | ||
sections.append(current_section) | ||
current_section = {"about": [header_text], "text": []} | ||
else: | ||
current_section["about"].append(header_text) | ||
else: | ||
current_section["text"].append(line.strip()) | ||
|
||
if current_section["about"] or current_section["text"]: | ||
sections.append(current_section) | ||
|
||
# Second pass: Combine text while ignoring headers and discard entries with empty 'about' or 'text' | ||
for section in sections: | ||
about = ", ".join(section["about"]) | ||
text = " ".join(line for line in section["text"] if line) | ||
|
||
if about and text: # Only insert if both 'about' and 'text' are not empty | ||
json_output.append({ | ||
"id": current_id, | ||
"about": about, | ||
"text": text | ||
}) | ||
current_id += 1 | ||
|
||
# Write the augmented JSON output to knowledge_base.json | ||
with open('knowledge_base.json', 'w', encoding='utf-8') as output_file: | ||
json.dump(json_output, output_file, indent=2, ensure_ascii=False) | ||
|
||
def parse_cli_markdown(file_path): | ||
try: | ||
# Load existing content if the file exists | ||
with open('knowledge_base.json', 'r', encoding='utf-8') as existing_file: | |
json_output = json.load(existing_file) | ||
current_id = len(json_output) + 1 # Start ID from the next available number | ||
except (FileNotFoundError, json.JSONDecodeError): | ||
# If the file doesn't exist or is empty, start fresh | ||
json_output = [] | ||
current_id = 1 | ||
|
||
with open(file_path, 'r', encoding='utf-8') as file: | ||
lines = file.readlines() | ||
|
||
if len(lines) < 5: | ||
print(f"File {file_path} does not have enough lines to parse.") | ||
return | ||
|
||
# Extract 'about' from the 5th line (index 4) | ||
about = lines[4].strip() | ||
|
||
# Combine all remaining lines after the first 5 lines into 'text' | ||
text_lines = lines[5:] | ||
text = "".join(text_lines).strip() | ||
|
||
# Only append if both 'about' and 'text' are not empty | ||
if about and text: | ||
json_output.append({ | ||
"id": current_id, | ||
"about": about, | ||
"text": text | ||
}) | ||
current_id += 1 | ||
|
||
# Write the augmented JSON output to knowledge_base.json | ||
with open('knowledge_base.json', 'w', encoding='utf-8') as output_file: | ||
json.dump(json_output, output_file, indent=2, ensure_ascii=False) | ||
|
||
def recursive_parse_directory(root_dir): | ||
for dirpath, dirnames, filenames in os.walk(root_dir): | ||
for filename in filenames: | ||
file_path = os.path.join(dirpath, filename) | ||
if filename.lower().endswith('.md'): | ||
if 'cli' in dirpath.lower() or 'cli' in filename.lower(): | ||
parse_cli_markdown(file_path) | ||
else: | ||
parse_markdown_file_to_json(file_path) | ||
|
||
# Example usage: | ||
if __name__ == "__main__": | ||
reset_knowledge_base() # Reset knowledge_base.json to empty at the start | ||
recursive_parse_directory('/Users/chris/Desktop/tmp') # Parse the entire directory | ||
print("Parsing completed successfully.") |
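To make the two-pass grouping in `parse_markdown_file_to_json` easier to review, here is a simplified in-memory sketch of the same idea. It is not the script above: it skips file I/O, ignores the `**` header form and the level-2 fallback used when no main header exists, and the name `split_sections` is hypothetical.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)")


def split_sections(markdown: str) -> list[dict]:
    """Group body text under the latest level-1 header and its sub-headers."""
    sections = []
    current = {"about": [], "text": []}

    # First pass: a level-1 header starts a new section; deeper headers
    # are appended to the current section's "about" list.
    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            level = len(match.group(1))
            title = match.group(2).strip()
            if level == 1:
                if current["about"] or current["text"]:
                    sections.append(current)
                current = {"about": [title], "text": []}
            else:
                current["about"].append(title)
        elif line.strip():
            current["text"].append(line.strip())

    if current["about"] or current["text"]:
        sections.append(current)

    # Second pass: join the pieces and drop sections missing either field.
    entries = []
    for section in sections:
        about = ", ".join(section["about"])
        text = " ".join(section["text"])
        if about and text:
            entries.append({"id": len(entries) + 1, "about": about, "text": text})
    return entries
```

For example, a document with a `# Deploying` header, a `## Logs` sub-header, and a trailing header with no body yields a single entry whose `about` is `"Deploying, Logs"`; the empty section is discarded.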
Is this accurate, @Chrisyhjiang? I thought you were using markdown files. Or is that just in the actual ask.defang.io implementation?
That looks out of date. The docs chatbot has the final README; I think it's because I never ended up editing and moving everything from the docs chatbot over to the real samples repo.