[Issue]: 'Incremental index' command does not generate lancedb files #1560

hope12122 · 2024-12-27T02:38:13Z

Do you need to file an issue?

I have searched the existing issues and this bug is not already filed.
My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the issue

The `graphrag index` command generates LanceDB vector files, but the incremental index command does not produce any vector files. How can the graphrag tool be used to generate LanceDB files from the output of the incremental index command? If LanceDB files are not generated, how can local and global searches be run against the existing output of the incremental index command?

Steps to reproduce

Run the `graphrag index` command: LanceDB vector files are generated. Then run the incremental index command: no vector files are produced.
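To make "no vector files are produced" concrete, here is a minimal sketch for checking the disk. It assumes LanceDB's local layout, where each table is stored as a `<name>.lance` directory under the configured `db_uri`. The helper name `find_lancedb_tables` is made up for illustration. The first path comes from `db_uri` in the config in this issue; the second is only a hypothetical location under the `storage` directory, not something GraphRAG is known to use.

```python
from pathlib import Path


def find_lancedb_tables(db_uri: str) -> list[str]:
    """Return the names of LanceDB tables found directly under db_uri.

    Assumption: a local LanceDB database keeps each table in a
    directory named "<table>.lance".
    """
    root = Path(db_uri)
    if not root.is_dir():
        return []
    return sorted(
        p.stem for p in root.iterdir() if p.is_dir() and p.suffix == ".lance"
    )


if __name__ == "__main__":
    # "output/lancedb" is the db_uri from the config in this issue.
    # "update_hali-11-20/lancedb" is a hypothetical location to rule out.
    for candidate in ("output/lancedb", "update_hali-11-20/lancedb"):
        tables = find_lancedb_tables(candidate)
        print(f"{candidate}: {tables if tables else 'no LanceDB tables found'}")
```

Running this from the project root after each command shows whether any tables exist and in which directory.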

GraphRAG Config Used

### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

encoding_model: cl100k_base # this needs to be matched to your model!

llm:
  api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
  type: openai_chat # or azure_openai_chat
  model: gpt-4o-mini
  model_supports_json: true # recommended if this is available for your model.
  # audience: "https://cognitiveservices.azure.com/.default"
  # api_base: https://.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>

parallelization:
  stagger: 0.3
  # num_threads: 50

async_mode: threaded # or asyncio

embeddings:
  async_mode: threaded # or asyncio
  vector_store:
    type: lancedb
    db_uri: 'output/lancedb'
    container_name: default
    overwrite: true
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    # api_base: https://.openai.azure.com
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>

### Input settings ###

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # or blob
  base_dir: "cache"

reporting:
  type: file # or console, blob
  base_dir: "logs"

storage:
  type: file # or blob
  base_dir: "update_hali-11-20"

## only turn this on if running `graphrag index` with custom settings
## we normally use `graphrag update` with the defaults
update_index_storage:
  #type: file # or blob
  #base_dir: "update_output_21-30"

### Workflow settings ###

skip_workflows: []

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  enabled: false
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  embeddings: false
  transient: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  prompt: "prompts/drift_search_system_prompt.txt"

Logs and screenshots

Additional Information

GraphRAG Version: 1.0.0
Operating System: Ubuntu
Python Version: 3.12
Related Issues:

hope12122 added the triage label (default label assignment; indicates the new issue needs review by a maintainer) on Dec 27, 2024
