How to combine Kor with a Chroma Embedding Database? #181
Replies: 2 comments
-
Hi @SHogenboom! Apologies for the delayed response -- I'm on vacation until the end of July. :) The best approach will depend on the data, the extraction task, and other constraints such as quality, latency, cost to develop and maintain, cost to run, etc. The alternative to the brute-force approach is to run extraction against a subset of segments (aka "text chunks") representing those most likely to contain relevant information. For an embeddings-based approach to work, one needs a meaningful way to embed content and queries so that document segments containing the information to be extracted are likely to be returned. I'd try experimenting with querying the vectorstore with:
For example, say you're looking for information about the cost of video game consoles in different countries. For (1), I'd query the vector store with sample sentences like "Xbox costs $200 in the US" or "PS4 costs 500 euro in France", and hope that it retrieves all the relevant text passages that may discuss costs of gaming consoles.

Trade offs

This process trades off quality for speed/cost, and how well it works will depend on your data.

Other approaches

If the source documents look more or less the same, time can be invested up front to train a model or code up a set of heuristics to extract the relevant information from the documents. This approach is less general, but may end up being faster/cheaper at run time (at the expense of the time invested up front to code the solution). For example, if extracting content from PDFs of standard forms, it may be sufficient for a particular problem to extract all the text based on X,Y coordinates -- say, all the text in the region (0, 30, 20, 30) -- or maybe all the text that's between the second and third headers, etc.
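The retrieval idea above can be sketched without any particular vectorstore: embed each chunk and each example query, score chunks by cosine similarity to the queries, and keep only the top-k chunks to feed to the extraction step. This is a minimal illustration, assuming the embeddings have already been computed; the toy 2-dimensional vectors below stand in for real embedding-model output.

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_chunks(chunk_vectors, query_vectors, k=2):
    # Score each chunk by its best similarity to any example query,
    # then return the indices of the k best-scoring chunks.
    scores = []
    for i, cv in enumerate(chunk_vectors):
        best = max(cosine(cv, qv) for qv in query_vectors)
        scores.append((best, i))
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]

# Toy vectors standing in for embeddings of document chunks.
chunks = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]
# Embedding of an example sentence like "Xbox costs $200 in the US".
queries = [[1.0, 0.0]]

print(top_k_chunks(chunks, queries, k=2))  # -> [0, 2]
```

Only the selected chunks would then be passed to the extraction chain, which is where the cost savings over the brute-force approach come from.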
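The coordinate-based heuristic mentioned above could look like the following sketch, assuming the PDF text has already been extracted as (text, x, y) tuples by a layout-aware parser; the sample items and region bounds are purely illustrative.

```python
def text_in_region(items, x0, y0, x1, y1):
    # Keep only text whose (x, y) origin falls inside the bounding box.
    return [t for t, x, y in items if x0 <= x <= x1 and y0 <= y <= y1]

# Hypothetical output of a layout-aware PDF parser: (text, x, y).
items = [
    ("Invoice #42", 10, 5),
    ("Total: $200", 15, 35),
    ("Footer", 10, 95),
]

print(text_in_region(items, 0, 30, 20, 40))  # -> ['Total: $200']
```

For standard forms where the field of interest always sits in the same spot, a filter like this avoids calling an LLM at all.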
-
@eyurtsev, not a problem at all -- enjoy the holidays! :) I'm not sure I understand all of your suggestions, but I think I created a combination of options 1 and 2. My flow at the moment is:
Not ideal as of yet; in particular, I still need to optimize the balance between the instructions and the similarity search over the vector embeddings. Anyway, for now I'll not use
-
Hiya,

It's my first time working with `langchain`, so I apologise if some assumptions are wrong! I'm trying to extract structured data from a collection of PDFs. When looking to query multiple sources, most docs/tutorials use an approach where the PDF is parsed, stored as embeddings in a `Chroma` database, and then queried sparsely (e.g., with the 'stuff' method in the chain). This approach is successful and rather cheap.

However, the `Chroma`-query route doesn't really allow for the type of structured input that the `kor` library does. I have not been able to find it in the docs, nor online in other tutorials. Your only information on parsing multiple documents with `kor` mentions the 'brute force' approach with its potentially associated costs. This is something I could reduce by using the embeddings/`Chroma` approach.

The question thus is how I can use the `kor` way of extracting structured information together with the sort of 'preprocessing' made possible by vector embeddings.

Any help is appreciated! Please ELI5 though, because both `langchain` and `python` are a bit of a mystery to me atm ;)