Medium articles not loading to Qdrant DBase ? #47

Open
ArthurSrz opened this issue Dec 2, 2024 · 3 comments

Comments

@ArthurSrz

First, thanks for this wonderful course. I never thought I could learn how to create my twin!

A question: after I run the ingestion and feature pipelines, the Qdrant database seems to contain only some repositories, and none of the Medium articles. Any idea where this could come from?

My data/links.txt file:

https://medium.com/decodingml/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin-2cc6bb01141f
https://medium.com/datactivist/5-jours-pour-initier-la-d%C3%A9marche-open-data-de-19-collectivit%C3%A9s-1b635a0b1645
https://medium.com/datactivist/pourquoi-ouvrir-les-donn%C3%A9es-des-territoires-de-montagne-2c69f0a35a8d
https://medium.com/datactivist/5-jours-pour-initier-la-d%C3%A9marche-open-data-de-19-collectivit%C3%A9s-1b635a0b1645
https://medium.com/datactivist/opendatacanvas-6de1f1fb49aa
https://medium.com/@SrzArthur/linstinct-de-connaissance-pour-une-approche-nieztsch%C3%A9enne-de-la-recherche-3c31a089809e
https://medium.com/@SrzArthur/using-data-clustering-to-get-new-ideas-to-improve-french-ski-resort-customer-experience-2494311e379e
https://medium.com/@SrzArthur/the-world-in-2035-with-open-data-the-3-most-likely-scenarios-e331554ebf09
https://github.com/ArthurSrz/athletes-paris2024
https://github.com/ArthurSrz/obsidian_to_knowledge_graph
https://github.com/ArthurSrz/love_and_quit
https://dataflow.hypotheses.org/1241
https://dataflow.hypotheses.org/718
https://dataflow.hypotheses.org/1189

The commands I used:

(llm-twin-course-py3.11) arthursarazin@MacBook-Pro-2 building-llm-twin % make local-test-medium              

curl -X POST "http://localhost:9010/2015-03-31/functions/function/invocations" \
                -d '{"user": "Paul Iusztin", "link": "https://medium.com/decodingml/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin-2cc6bb01141f"}'
{"statusCode": 200, "body": "Link processed successfully"}%  

My Qdrant database:
Screenshot 2024-12-02 at 11 55 26

Any idea where this might come from? Thanks.

@iusztinpaul
Member

Hello @ArthurSrz,

Are your Medium articles freely available?

@ArthurSrz
Author

ArthurSrz commented Jan 9, 2025

Hello @iusztinpaul, yes they are. I tried with one of your Medium articles and it is not loaded into the vector DB ("make local-ingest-data"). Only GitHub repositories are. I also ran "make local-ingest-data" with a single repo in my data/links.txt file, and it was vectorized.

Any idea where the problem comes from?

@ArthurSrz
Author

@iusztinpaul, I think I found the bug. In src/data_crawling/main.py, the dispatcher does not point to the Medium crawler. But I might be wrong...

Screenshot 2025-01-10 at 12 51 12
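
To illustrate the kind of bug described above, here is a minimal sketch of a link dispatcher that picks a crawler by domain. It is not the code from src/data_crawling/main.py: the class name CrawlerDispatcher, the register and get_crawler methods, and the crawl_medium / crawl_github functions are all hypothetical stand-ins. The point is only that a domain with no registered crawler never reaches any crawler, so those links cannot end up in the vector DB.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical stand-ins for the real crawlers (assumption: the course's
# actual crawler classes and their interfaces may look different).
def crawl_medium(link: str) -> str:
    return f"medium:{link}"

def crawl_github(link: str) -> str:
    return f"github:{link}"

class CrawlerDispatcher:
    """Returns the first registered crawler whose domain pattern matches a link."""

    def __init__(self) -> None:
        self._crawlers: List[Tuple[re.Pattern, Callable[[str], str]]] = []

    def register(self, domain: str, crawler: Callable[[str], str]) -> None:
        # Match http(s)://[www.]<domain> followed by "/" or the end of the link.
        pattern = re.compile(rf"https?://(www\.)?{re.escape(domain)}(/|$)")
        self._crawlers.append((pattern, crawler))

    def get_crawler(self, link: str) -> Callable[[str], str]:
        for pattern, crawler in self._crawlers:
            if pattern.match(link):
                return crawler
        # Fail loudly so a missing registration is visible.
        raise ValueError(f"No crawler registered for link: {link}")

dispatcher = CrawlerDispatcher()
dispatcher.register("medium.com", crawl_medium)  # omit this line and Medium links fail
dispatcher.register("github.com", crawl_github)

for link in (
    "https://medium.com/datactivist/opendatacanvas-6de1f1fb49aa",
    "https://github.com/ArthurSrz/love_and_quit",
):
    print(dispatcher.get_crawler(link)(link))
```

In this sketch, get_crawler raises an error for an unmatched link instead of skipping it silently. That is a design choice of the example, not a statement about how the course code behaves. With the registrations above, a link such as https://dataflow.hypotheses.org/1241 would raise as well, since no crawler is registered for that domain.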


2 participants