Ingest Folder #15

pdchristian · 2024-08-04T13:23:25Z

Hello,

I would like to ingest all my local text files (PDF, DOCX, and TXT). To do that, I replaced the loader with DirectoryLoader, as shown below. This mostly works, but only the last document is ingested (I have 4 PDFs for testing).

local_path = "../data"

# Local PDF file uploads
if local_path:
    loader = DirectoryLoader(local_path, glob='**/[!.]*', use_multithreading=True, show_progress=True)
    data = loader.load()

# Inspect the first loaded document
data[0]

Output:
100%|██████████| 4/4 [00:31<00:00, 7.93s/it]

# Add to vector database
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text", show_progress=True),
    collection_name="local-rag"
)

Output OllamaEmbeddings:
OllamaEmbeddings: 100%|██████████| 143/143 [00:11<00:00, 12.73it/s]
The number of chunks should be much higher.

It would be great if my local office documents could be ingested.

tonykipkemboi · 2024-08-09T18:01:14Z

@pdchristian, thanks for the question. How are you chunking the documents you're loading? You need to chunk all of them, pass the chunks to the embedding model to create vector embeddings, and then load those into the vector store.
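As a rough illustration of the chunking step described above, here is a minimal, library-free sketch that splits every loaded document, not just the first one. The `Doc` class, `split_text`, `chunk_all`, and the default sizes are hypothetical names and values chosen for this example. In LangChain itself the equivalent step would typically be a text splitter such as `RecursiveCharacterTextSplitter`, called with `split_documents(data)` on the full list.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    # Stop early enough that the final chunk is not a pure repeat of the overlap.
    starts = range(0, max(len(text) - overlap, 1), step)
    return [text[i:i + chunk_size] for i in starts if text[i:i + chunk_size]]


def chunk_all(docs: list[Doc], chunk_size: int = 1000, overlap: int = 100) -> list[Doc]:
    """Chunk every document in the list, copying its metadata onto each chunk."""
    chunks = []
    for doc in docs:  # iterate over all documents, not only docs[0]
        for piece in split_text(doc.page_content, chunk_size, overlap):
            chunks.append(Doc(piece, dict(doc.metadata)))
    return chunks
```

Checking `{c.metadata.get("source") for c in chunks}` before embedding is a quick way to confirm that every file contributed chunks.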

pdchristian · 2024-08-13T14:58:23Z

@tonykipkemboi,
thanks for your response.
I think the code I updated to load the documents is buggy. Only the first document takes time to load. For the other three, the progress bar jumps quickly from 1 to 4.

Aufzeichnung.2024-08-13.165752.mp4

Is there a problem with the DirectoryLoader, or with how I am using it?
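One way to narrow this down is to check which files the glob pattern actually matches and how many loaded documents came from each file. The sketch below uses only the standard library. `list_matching_files` and `count_by_source` are hypothetical helper names, and it assumes the loader matches files roughly the way `pathlib` does and that each loaded document records its origin under `metadata["source"]`.

```python
from collections import Counter
from pathlib import Path


def list_matching_files(root: str, pattern: str = "**/[!.]*") -> list[Path]:
    """Return the files under `root` that match the glob pattern, sorted."""
    # The pattern also matches directories, so keep regular files only.
    return sorted(p for p in Path(root).glob(pattern) if p.is_file())


def count_by_source(sources: list[str]) -> Counter:
    """Count how many loaded documents came from each source path."""
    return Counter(sources)


# Example usage (assumes `data` is the list returned by loader.load()):
# for path in list_matching_files("../data"):
#     print(path)
# print(count_by_source([doc.metadata["source"] for doc in data]))
```

If all four PDFs appear in the first listing but only one appears in the per-source counts, the problem is in loading rather than in the glob pattern.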

tonykipkemboi · 2024-08-15T18:22:23Z

@pdchristian, thanks for the video. I'll recreate the issue and report back to you.

tonykipkemboi closed this as completed Aug 11, 2024

tonykipkemboi reopened this Aug 15, 2024
