Bulk bring your own data #683

vicentegarciadiez · 2023-12-02T16:08:57Z

Hi team, is there any way to bring my own data into a chat in bulk?

I mean, I want to load lots of PDF files into a chat to ask questions about them, but there are many limitations, such as 10 files per upload and size limits.

Thanks in advance.

crickman · 2023-12-02T19:17:38Z

Hi @vicentegarciadiez, great question. There is a tool for importing outside of the application, but we've recently discovered it doesn't function outside of the dev-environment:

https://github.com/microsoft/chat-copilot/tree/main/tools/importdocument

The kernel-memory repo has all the machinery in place for you to take matters into your own hands. All you need to do is post documents to the same queue and blob store that your chat-copilot is configured for (see the KernelMemory section of appsettings.json for webapi).

The most concise expression of what this might resemble can be viewed at https://github.com/microsoft/kernel-memory/blob/main/service/Service/Program.cs (although you could run yours as a console application).

Create an IKernelMemory instance (note: this won't need any handlers).

Call memory.ImportDocumentAsync() for each document you want to import. (This will upload the document to the blob store and create a queue message; the chat-copilot memorypipeline will do the actual processing.)
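
A minimal sketch of the "call it for each document" step, written as a small console helper. `BulkImporter`, `ImportFolderAsync`, and the `importOne` delegate are hypothetical names introduced for illustration; `importOne` is a stand-in for whatever `memory.ImportDocumentAsync()` call fits your installed kernel-memory version, so the helper itself only relies on the .NET base library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class BulkImporter
{
    // Calls importOne once per matching file under `folder` (including subfolders)
    // and returns the paths that failed, so they can be retried later.
    // importOne is a placeholder for the real import call (an assumption, see lead-in).
    public static async Task<IReadOnlyList<string>> ImportFolderAsync(
        string folder,
        Func<string, CancellationToken, Task> importOne,
        string searchPattern = "*.pdf",
        CancellationToken cancellationToken = default)
    {
        var failed = new List<string>();

        foreach (string path in Directory.EnumerateFiles(folder, searchPattern, SearchOption.AllDirectories))
        {
            cancellationToken.ThrowIfCancellationRequested();
            try
            {
                await importOne(path, cancellationToken);
            }
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                // Keep going so that one bad file does not abort the whole batch.
                Console.Error.WriteLine($"Failed to import {path}: {ex.Message}");
                failed.Add(path);
            }
        }

        return failed;
    }
}
```

Importing sequentially keeps the load on the queue and blob store predictable; since the memorypipeline does the heavy processing asynchronously, the loop itself should mostly be bound by upload time.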

vicentegarciadiez · 2023-12-03T08:32:33Z

Thanks @crickman for your answer! But I have a question: in your example, will mydocument.docx be available to all chats or only to a selected chat?

Best regards.

crickman · 2023-12-04T22:33:19Z

Right...good point...I've erroneously omitted those details.

This would be a more complete expression (with some of the values expanded as literals):

        string documentId = ...; // The id of the Cosmos DB `ChatMemorySource` entity
        string fileName = ...;
        Stream fileContent = ...;

        var uploadRequest =
            new DocumentUploadRequest
            {
                DocumentId = documentId,
                Files = new List<DocumentUploadRequest.UploadedFile> { new(fileName, fileContent) },
                Index = "chatmemory",
                Steps = new List<string>() { "extract", "partition", "gen_embeddings", "save_embeddings" },
            };

        uploadRequest.Tags.Add("chatid", "00000000-0000-0000-0000-000000000000"); // Global document.  Replace with chat-id to associate with a single chat
        uploadRequest.Tags.Add("memory", "DocumentMemory");

        await memoryClient.ImportDocumentAsync(uploadRequest, cancellationToken);
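
To make the global versus single-chat choice explicit, the request construction above could be wrapped in a small factory. This is only a sketch: `ChatUploadRequests` and `Create` are hypothetical names, and it assumes `DocumentUploadRequest` lives in the `Microsoft.KernelMemory` namespace of the kernel-memory package. The index, steps, and tag values are copied from the snippet above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.KernelMemory; // assumption: namespace of DocumentUploadRequest

public static class ChatUploadRequests
{
    // The all-zero GUID marks a document as global, as in the snippet above.
    private static readonly string GlobalChatId = Guid.Empty.ToString();

    // Builds an upload request scoped to one chat, or to all chats when chatId is null.
    // documentId must still be the id of the Cosmos DB ChatMemorySource entity.
    public static DocumentUploadRequest Create(
        string documentId, string fileName, Stream fileContent, Guid? chatId = null)
    {
        var request = new DocumentUploadRequest
        {
            DocumentId = documentId,
            Files = new List<DocumentUploadRequest.UploadedFile> { new(fileName, fileContent) },
            Index = "chatmemory",
            Steps = new List<string> { "extract", "partition", "gen_embeddings", "save_embeddings" },
        };

        request.Tags.Add("chatid", chatId?.ToString() ?? GlobalChatId);
        request.Tags.Add("memory", "DocumentMemory");
        return request;
    }
}
```

Passing no `chatId` yields the global tag value `00000000-0000-0000-0000-000000000000`; passing a chat's GUID associates the document with that chat only.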

The related code in chat-copilot is:
https://github.com/microsoft/chat-copilot/blob/main/webapi/Extensions/ISemanticMemoryClientExtensions.cs

The code for accessing CosmosDB data is:
https://github.com/microsoft/chat-copilot/tree/main/webapi/Storage

vicentegarciadiez · 2023-12-12T09:19:27Z

Thanks @crickman. Do you know how the images inside a document are indexed? I mean, are those images processed with OCR?

Thanks in advance.

crickman · 2023-12-13T04:14:04Z

I do not believe images are processed using OCR for docx, pptx, xlsx, or pdf.

I have sometimes used external tools to convert documents with complex structure to text and then uploaded the text result. Azure Form Recognizer also has some options for more complex document parsing.
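
As a rough illustration of the "convert, then upload the text" workaround: once an external tool has produced plain text, it can be wrapped as a `.txt` file name and stream, which are the two values the upload request earlier in this thread expects as `fileName` and `fileContent`. `ExtractedText` and `AsTextFile` are hypothetical names, and the extraction step itself is out of scope here.

```csharp
using System.IO;
using System.Text;

public static class ExtractedText
{
    // Wraps already-extracted text as a UTF-8 ".txt" upload, named after the original document.
    public static (string FileName, Stream Content) AsTextFile(string originalFileName, string extractedText)
    {
        string fileName = Path.ChangeExtension(Path.GetFileName(originalFileName), ".txt");
        Stream content = new MemoryStream(Encoding.UTF8.GetBytes(extractedText));
        return (fileName, content);
    }
}
```

Keeping the original base name makes it easier to trace an indexed text file back to the document it came from.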

crickman self-assigned this Dec 2, 2023

crickman added the question (Further information is requested) and memory (Issues or questions related to memory) labels Dec 2, 2023

crickman added this to Apps & Services Semantic Kernel Dec 2, 2023

crickman added the enhancement (New feature or request), performance, and scale (Issues related to support for higher scale solutions) labels Dec 2, 2023

crickman removed their assignment Feb 28, 2024
