Bulk bring your own data #683

Open
vicentegarciadiez opened this issue Dec 2, 2023 · 5 comments
Labels
enhancement (New feature or request), memory (Issues or questions related to memory), performance, question (Further information is requested), scale (Issues related to support for higher scale solutions)

Comments

@vicentegarciadiez

Hi team, is there any way to bring my own data into a chat in bulk?

I mean, I want to load lots of PDF files into a chat to ask questions about them, but there are many limitations, such as 10 files at a time or size limits.

Thanks in advance.

@crickman crickman self-assigned this Dec 2, 2023
@crickman crickman added question Further information is requested memory Issues or questions related to memory labels Dec 2, 2023
@crickman
Contributor

crickman commented Dec 2, 2023

Hi @vicentegarciadiez, great question. There is a tool for importing outside of the application, but we've recently discovered it doesn't function outside of the dev-environment:

https://github.com/microsoft/chat-copilot/tree/main/tools/importdocument

The kernel-memory repo has all the machinery in place for you to take matters into your own hands. All you need to do is post documents to the same queue and blob store that your chat-copilot is configured for (see the KernelMemory section of appsettings.json for webapi).

The most concise expression of what this might resemble can be viewed at https://github.com/microsoft/kernel-memory/blob/main/service/Service/Program.cs (although you could run yours as a console application).

  1. Create an IKernelMemory instance (note: this won't need any handlers).
  2. Call memory.ImportDocumentAsync() for each document you want to import. (This uploads the document to the blob store and creates a queue message; the chat-copilot memorypipeline does the actual processing.)

@crickman crickman added enhancement New feature or request performance scale Issues related to support for higher scale solutions labels Dec 2, 2023
@vicentegarciadiez
Author

Thanks @crickman for your answer! I have one question: in your example, will mydocument.docx be available to all chats or only to a selected chat?

Best regards.

@crickman
Contributor

crickman commented Dec 4, 2023

Right...good point...I erroneously omitted those details.

This would be a more complete expression (with some of the values expanded as literals):

        string documentId = ...; // The id of the Cosmos DB `ChatMemorySource` entity
        string fileName = ...;
        Stream fileContent = ...;

        var uploadRequest =
            new DocumentUploadRequest
            {
                DocumentId = documentId,
                Files = new List<DocumentUploadRequest.UploadedFile> { new(fileName, fileContent) },
                Index = "chatmemory",
                Steps = new List<string>() { "extract", "partition", "gen_embeddings", "save_embeddings" },
            };

        // The all-zero GUID marks a global document. Replace it with a chat id to associate the document with a single chat.
        uploadRequest.Tags.Add("chatid", "00000000-0000-0000-0000-000000000000");
        uploadRequest.Tags.Add("memory", "DocumentMemory");

        await memoryClient.ImportDocumentAsync(uploadRequest, cancellationToken);

The related code in chat-copilot is:
https://github.com/microsoft/chat-copilot/blob/main/webapi/Extensions/ISemanticMemoryClientExtensions.cs

The code for accessing CosmosDB data is:
https://github.com/microsoft/chat-copilot/tree/main/webapi/Storage
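
For the bulk case, the request above could be wrapped in a loop over a folder of files. The following is only a minimal sketch under stated assumptions, not code from chat-copilot or kernel-memory: `BulkImport`, `ImportOne`, `BuildTags`, and `ImportFolderAsync` are hypothetical names, and the `importOne` callback is where you would build the `DocumentUploadRequest` and call `ImportDocumentAsync` as shown above (including resolving the `documentId`). The sketch only covers file enumeration and the `chatid` / `memory` tags.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class BulkImport
{
    // Hypothetical callback: expected to wrap the DocumentUploadRequest +
    // ImportDocumentAsync call from the snippet above for a single file.
    public delegate Task ImportOne(
        string fileName,
        Stream fileContent,
        IReadOnlyDictionary<string, string> tags,
        CancellationToken cancellationToken);

    // The all-zero id marks a global document (taken from the snippet above).
    public const string GlobalChatId = "00000000-0000-0000-0000-000000000000";

    // Builds the two tags used in the snippet above.
    public static Dictionary<string, string> BuildTags(string chatId)
    {
        return new Dictionary<string, string>
        {
            ["chatid"] = string.IsNullOrWhiteSpace(chatId) ? GlobalChatId : chatId,
            ["memory"] = "DocumentMemory",
        };
    }

    // Imports every *.pdf file directly inside `folder`, one at a time,
    // and returns the number of files handed to the callback.
    public static async Task<int> ImportFolderAsync(
        string folder,
        ImportOne importOne,
        string chatId = GlobalChatId,
        CancellationToken cancellationToken = default)
    {
        Dictionary<string, string> tags = BuildTags(chatId);
        int imported = 0;

        foreach (string path in Directory.EnumerateFiles(folder, "*.pdf"))
        {
            cancellationToken.ThrowIfCancellationRequested();

            // The stream is disposed once the callback has completed.
            await using FileStream fileContent = File.OpenRead(path);
            await importOne(Path.GetFileName(path), fileContent, tags, cancellationToken);
            imported++;
        }

        return imported;
    }
}
```

Files are processed sequentially here to keep the sketch simple. Whether parallel uploads are worthwhile depends on your queue and blob store configuration, which this sketch does not address.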

@vicentegarciadiez
Author

Thanks @crickman. Do you know how images inside a document are indexed? I mean, are those images processed with OCR?

Thanks in advance.

@crickman
Contributor

I do not believe images are processed using OCR for docx, pptx, xlsx, or pdf.

I have sometimes used external tools to convert documents with complex structure to text and then uploaded the text result. Azure Form Recognizer also has some options for more complex document parsing.

@crickman crickman removed their assignment Feb 28, 2024