[Bug]: PyArrow Capacity Limit #1032

Open
3 tasks done
nievespg1 opened this issue Aug 26, 2024 · 0 comments
Assignees: nievespg1
Labels: bug (Something isn't working), triage (Default label assignment, indicates new issue needs reviewed by a maintainer)

Comments

@nievespg1 (Contributor)

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

At the end of the "create base extracted entities" workflow, graphrag stores a large graphml file as a string within a pandas dataframe. Later on, it tries to save that dataframe as a parquet file, which triggers the following exception:

  • pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2612524437

This is a limitation of PyArrow, which by default cannot handle strings larger than 2 GB in a single array. The PyArrow sources recommend using pyarrow.large_string, but further research needs to be done.
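
For illustration only (not part of the original report), a minimal sketch of the limit: converting a single value larger than 2 GB with the default string type is expected to raise the same ArrowCapacityError, while pyarrow.large_string, which uses 64-bit offsets, accepts it.

import pyarrow as pa

big = "x" * (2**31)  # one string slightly over 2 GiB (building it needs several GB of RAM)

# Default string type uses 32-bit offsets, so this is expected to raise
# pyarrow.lib.ArrowCapacityError:
# pa.array([big])

# large_string uses 64-bit offsets and handles the same value:
arr = pa.array([big], type=pa.large_string())
print(arr.type)  # large_string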

Steps to reproduce

Option 1:

Create an index using a dataset that contains 160 million tokens.

Option 2:

Create a dataframe with one row that contains a string larger than 2GB. Below is a script to reproduce the problem:

import string

import numpy as np
import pandas as pd
from tqdm import tqdm

# Target ~3 GiB of ASCII data, then compute how many times the 26-character
# seed string must be doubled to reach it.
rows = int(3 * 2**30 / len(string.ascii_lowercase.encode('utf-8')))
rows = int(np.ceil(np.log2(rows)))  # number of doublings (exponential growth)

row_data = string.ascii_lowercase
for _ in tqdm(range(rows)):
    row_data += row_data

df = pd.DataFrame(
    {'A': [row_data]}
)

# Compute the size of the DataFrame in bytes
df_memory_usage = df.memory_usage(deep=True).sum()
df_memory_usage_mb = df_memory_usage / 2**20 # Convert bytes to megabytes

print(f'Total memory usage of the DataFrame is: {df_memory_usage_mb:.2f} MB')
print(f"df.dtypes: {df.dtypes}")

display(df.head())  # display() is available in notebooks; use print(df.head()) in a plain script

# This line will raise a PyArrow capacity error
df.to_parquet("max_capacity.parquet")
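
As a possible workaround for the script above (a sketch, not the fix applied in graphrag): pass an explicit pyarrow schema that types column 'A' as large_string so the pandas-to-Arrow conversion uses 64-bit offsets, and write the table with pyarrow.parquet directly.

import pyarrow as pa
import pyarrow.parquet as pq

# Convert column 'A' with 64-bit string offsets; this is the step that raises
# ArrowCapacityError when df.to_parquet() uses the default 32-bit string type.
schema = pa.schema([("A", pa.large_string())])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)

# Write with pyarrow directly. Note: single values of several GB may still hit
# separate Parquet page-size limits depending on the pyarrow version.
pq.write_table(table, "max_capacity_large_string.parquet")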

Expected Behavior

If you follow option one, you should see the error message below at the end of the "create base extracted entities" workflow:

File "pyarrow/array.pxi", line 339, in pyarrow.lib.array
File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2612645177
16:37:55,764 graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None

If you execute option 2, you will see the following error:

ArrowCapacityError                        Traceback (most recent call last)
Cell In[5], line 24
     21 display(df.head())
     23 # This line will raise a PyArrow capacity error
---> 24 df.to_parquet("max_capacity.parquet")

File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pandas/util/_decorators.py:333, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    327 if len(args) > num_allow_args:
    328     warnings.warn(
    329         msg.format(arguments=_format_argument_list(allow_args)),
    330         FutureWarning,
    331         stacklevel=find_stack_level(),
    332     )
--> 333 return func(*args, **kwargs)

File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pandas/core/frame.py:3113, in DataFrame.to_parquet(self, path, engine, compression, index, partition_cols, storage_options, **kwargs)
   3032 """
   3033 Write a DataFrame to the binary parquet format.
   3034 
   (...)
   3109 >>> content = f.read()
   3110 """
   3111 from pandas.io.parquet import to_parquet
-> 3113 return to_parquet(
   3114     self,
   3115     path,
   3116     engine,
   3117     compression=compression,
   3118     index=index,
   3119     partition_cols=partition_cols,
   3120     storage_options=storage_options,
   3121     **kwargs,
   3122 )

File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pandas/io/parquet.py:480, in to_parquet(df, path, engine, compression, index, storage_options, partition_cols, filesystem, **kwargs)
    476 impl = get_engine(engine)
    478 path_or_buf: FilePath | WriteBuffer[bytes] = io.BytesIO() if path is None else path
--> 480 impl.write(
    481     df,
    482     path_or_buf,
    483     compression=compression,
    484     index=index,
    485     partition_cols=partition_cols,
    486     storage_options=storage_options,
    487     filesystem=filesystem,
    488     **kwargs,
    489 )
    491 if path is None:
    492     assert isinstance(path_or_buf, io.BytesIO)

File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pandas/io/parquet.py:190, in PyArrowImpl.write(self, df, path, compression, index, storage_options, partition_cols, filesystem, **kwargs)
    187 if index is not None:
    188     from_pandas_kwargs["preserve_index"] = index
--> 190 table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
    192 if df.attrs:
    193     df_metadata = {"PANDAS_ATTRS": json.dumps(df.attrs)}

File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/table.pxi:3874, in pyarrow.lib.Table.from_pandas()

File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/pandas_compat.py:611, in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    606     return (isinstance(arr, np.ndarray) and
    607             arr.flags.contiguous and
    608             issubclass(arr.dtype.type, np.integer))
    610 if nthreads == 1:
--> 611     arrays = [convert_column(c, f)
    612               for c, f in zip(columns_to_convert, convert_fields)]
    613 else:
    614     arrays = []

File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/pandas_compat.py:611, in <listcomp>(.0)
    606     return (isinstance(arr, np.ndarray) and
    607             arr.flags.contiguous and
    608             issubclass(arr.dtype.type, np.integer))
    610 if nthreads == 1:
--> 611     arrays = [convert_column(c, f)
    612               for c, f in zip(columns_to_convert, convert_fields)]
    613 else:
    614     arrays = []

File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/pandas_compat.py:592, in dataframe_to_arrays.<locals>.convert_column(col, field)
    589     type_ = field.type
    591 try:
--> 592     result = pa.array(col, type=type_, from_pandas=True, safe=safe)
    593 except (pa.ArrowInvalid,
    594         pa.ArrowNotImplementedError,
    595         pa.ArrowTypeError) as e:
    596     e.args += ("Conversion failed for column {!s} with type {!s}"
    597                .format(col.name, col.dtype),)

File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/array.pxi:339, in pyarrow.lib.array()

File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/array.pxi:85, in pyarrow.lib._ndarray_to_array()

File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 3489660928

GraphRAG Config Used

# Define anchors to be reused
openai_api_key: &openai_api_key ${OPENAI_API_KEY}

#######################
# pipeline parameters # 
#######################

# data inputs
input:
  type: file
  file_type: text
  file_pattern: .*\.txt$
  base_dir: ./data

# tokenizer model name
encoding_model: &encoding_name o200k_base # gpt-4o
# encoding_model: &encoding_name cl100k_base # gpt-4-turbo

# text chunking
chunks:
  size: &chunk_size 800 # 800 tokens (about 3200 characters)
  overlap: &chunk_overlap 100 # 100 tokens (about 400 characters)
  strategy:
      type: tokens
      chunk_size: *chunk_size
      chunk_overlap: *chunk_overlap
      encoding_name: *encoding_name

# chat llm inputs
llm: &chat_llm
  api_key: *openai_api_key
  type: openai_chat
  model: gpt-4o-mini
  max_tokens: 4096
  request_timeout: 180 # 3 minutes should make sure we can handle busy AOAI instances
  api_version: "2024-02-01"
  # deployment_name: gpt-4o-mini
  model_supports_json: true
  tokens_per_minute: 150000000
  requests_per_minute: 30000
  max_retries: 20
  max_retry_wait: 10
  sleep_on_rate_limit_recommendation: true
  concurrent_requests: 50

parallelization: &parallelization
  stagger: 0.25
  num_threads: 100

async_mode: &async_mode asyncio
# async_mode: &async_mode threaded

entity_extraction:
  llm: *chat_llm
  parallelization: *parallelization
  async_mode: *async_mode
  prompt: ./prompts/entity_extraction.txt
  max_gleanings: 1

summarize_descriptions:
  llm: *chat_llm
  parallelization: *parallelization
  async_mode: *async_mode
prompt: ./prompts/summarize_descriptions.txt
  max_length: 500

community_reports:
  llm: *chat_llm
  parallelization: *parallelization
  async_mode: *async_mode
  prompt: ./prompts/community_report.txt
  max_length: &max_report_length 2000
  max_input_length: 8000

# embeddings llm inputs
embeddings:
  llm:
    api_key: *openai_api_key
    type: openai_embedding
    model: text-embedding-ada-002
    request_timeout: 180 # 3 minutes should make sure we can handle busy AOAI instances
    api_version: "2024-02-01"
    # deployment_name: text-embedding-ada-002
    model_supports_json: false
    tokens_per_minute: 10000000
    requests_per_minute: 10000
    max_retries: 20
    max_retry_wait: 10
    sleep_on_rate_limit_recommendation: true
    concurrent_requests: 50
  parallelization: *parallelization
  async_mode: *async_mode
  batch_size: 16
  batch_max_tokens: 8191
  vector_store: 
      type: lancedb
      overwrite: true
      db_uri: ./index/storage/lancedb
      query_collection_name: entity_description_embeddings
  
cache:
  type: file
  base_dir: ./index/cache

storage:
  type: file
  base_dir: ./index/storage

reporting:
  type: file
  base_dir: ./index/reporting

snapshots:
  graphml: true
  raw_entities: true
  top_level_nodes: true

#####################################
# orchestration (query) definitions # 
#####################################
local_search:
  text_unit_prop: 0.5
  community_prop: 0.1
  conversation_history_max_turns: 5
  top_k_entities: 10
  top_k_relationships: 10
  temperature: 0.0
  top_p: 1.0
  n: 1
  max_tokens: 12000
  llm_max_tokens: 2000

global_search:
  temperature: 0.0
  top_p: 1.0
  n: 1
  max_tokens: 12000
  data_max_tokens: 12000
  map_max_tokens: 1000
  reduce_max_tokens: 2000
  concurrency: 50

Logs and screenshots

No response

Additional Information

  • GraphRAG Version: 0.3.0
  • Operating System: 22.04.1-Ubuntu
  • Python Version: 3.11.5
  • Related Issues: N/A
nievespg1 added the bug and triage labels on Aug 26, 2024
nievespg1 self-assigned this on Aug 26, 2024