How to combine Kor with a Chroma Embedding Database? #181
Replies: 2 comments
-
Hi @SHogenboom! Apologies for the delayed response -- I'm on vacation until the end of July. :) The best approach will depend on the data, the extraction task, and other constraints such as quality, latency, cost to develop and maintain, cost to run, etc. The alternative to the brute-force approach is to run extraction against a subset of segments (aka "text chunks") representing those most likely to contain relevant information. For an embeddings-based approach to work, one needs a meaningful way to embed content and queries so that document segments containing the information to be extracted are likely to be returned. I'd try experimenting with querying the vectorstore with:
For example, say you're looking for information about the cost of video game consoles in different countries. For (1), I'd query the vector store with sample sentences like "Xbox costs $200 in the US" or "PS4 costs 500 euro in France", and hope that it retrieves all the relevant text passages that may discuss costs of gaming consoles.

Trade offs

This process trades off quality for speed/cost, and how well it works will depend on your data.

Other approaches

If the source documents look more or less the same, time can be invested up front to train a model or code up a set of heuristics to extract the relevant information from the documents. This approach is less general, but may end up being faster/cheaper at run time (at the expense of the time invested up front to code the solution). For example, if extracting content from PDFs of standard forms, it may be sufficient for a particular problem to extract all the text based on X,Y coordinates -- say, all the text in the region (0, 30, 20, 30) -- or maybe all the text that's between the second and third headers, etc.
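The retrieval idea above can be sketched without any particular vectorstore: embed each chunk and each example query, score chunks by cosine similarity to the queries, and keep only the top-k chunks to feed to the extraction step. This is a minimal illustration, assuming the embeddings have already been computed; the toy 2-dimensional vectors below stand in for real embedding-model output.

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_chunks(chunk_vectors, query_vectors, k=2):
    # Score each chunk by its best similarity to any example query,
    # then return the indices of the k best-scoring chunks.
    scores = []
    for i, cv in enumerate(chunk_vectors):
        best = max(cosine(cv, qv) for qv in query_vectors)
        scores.append((best, i))
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]

# Toy vectors standing in for embeddings of document chunks.
chunks = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]
# Embedding of an example sentence like "Xbox costs $200 in the US".
queries = [[1.0, 0.0]]

print(top_k_chunks(chunks, queries, k=2))  # -> [0, 2]
```

Only the selected chunks would then be passed to the extraction chain, which is where the cost savings over the brute-force approach come from.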
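The coordinate-based heuristic mentioned above could look like the following sketch, assuming the PDF text has already been extracted as (text, x, y) tuples by a layout-aware parser; the sample items and region bounds are purely illustrative.

```python
def text_in_region(items, x0, y0, x1, y1):
    # Keep only text whose (x, y) origin falls inside the bounding box.
    return [t for t, x, y in items if x0 <= x <= x1 and y0 <= y <= y1]

# Hypothetical output of a layout-aware PDF parser: (text, x, y).
items = [
    ("Invoice #42", 10, 5),
    ("Total: $200", 15, 35),
    ("Footer", 10, 95),
]

print(text_in_region(items, 0, 30, 20, 40))  # -> ['Total: $200']
```

For standard forms where the field of interest always sits in the same spot, a filter like this avoids calling an LLM at all.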
-
@eyurtsev, not a problem at all -- enjoy the holidays! :) I'm not sure I understand all of your suggestions, but I think I created a combination of options 1 and 2. My flow at the moment is:
Not ideal as of yet; in particular, I still need to optimize the balance between the instructions and the similarity search over the vector embeddings. Anyway, for now I'll not use
-
Hiya,

It's my first time working with `langchain`, so I apologise if some assumptions are wrong! I'm trying to extract structured data from a collection of PDFs. When looking to query multiple sources, most docs/tutorials use an approach where the PDF is parsed, stored as embeddings in a `Chroma` database, and then queried sparsely (e.g., with the 'stuff' method in the chain). This approach is successful and rather cheap.

However, the `Chroma`-query route doesn't really allow for the type of structured input that the `kor` library does. I have not been able to find it in the docs, nor online in other tutorials. Your only information on parsing multiple documents with `kor` mentions the 'brute force' approach with its potentially associated costs. This is something I could reduce by using the embeddings/`Chroma` approach.

The question thus is how I can use the `kor` way of extracting structured information together with the sort of 'preprocessing' made possible by vector embeddings.

Any help is appreciated! Please ELI5 though, because both `langchain` and `python` are a bit of a mystery to me atm ;)