Do you need to file an issue?
I have searched the existing issues and this bug is not already filed.
My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.
Describe the bug
At the end of the "create base extracted entities" workflow, graphrag stores a large graphml file as a string within a pandas DataFrame. Later on, it tries to save the DataFrame as a parquet file, which triggers the following exception:
pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2612524437
This is a limitation of pyarrow, which cannot handle strings larger than 2 GB by default. The references below recommend using pyarrow.large_string, but further research needs to be done; a sketch of that workaround follows the reference list.
References:
pandas - pyarrow.lib.ArrowCapacityError when creating string - Stack Overflow
BUG: new string dtype fails with >2 GB of data in a single column · Issue #56259 · pandas-dev/pandas (github.com)
pyarrow.large_string — Apache Arrow v12.0.1
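One possible direction, based on the pyarrow.large_string reference above, is to convert the DataFrame to an Arrow table with an explicit schema before writing parquet. The helper below is only a sketch of that idea, not graphrag code: write_parquet_with_large_strings is a hypothetical name, and it assumes the oversized graphml string sits in an ordinary object/string column.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def write_parquet_with_large_strings(df: pd.DataFrame, path: str) -> None:
    # Infer the schema pyarrow would normally use for this DataFrame...
    schema = pa.Schema.from_pandas(df, preserve_index=False)
    # ...then swap 32-bit-offset string fields for 64-bit-offset large_string,
    # which is not capped at 2**31 - 1 bytes per array.
    for i, field in enumerate(schema):
        if pa.types.is_string(field.type):
            schema = schema.set(i, field.with_type(pa.large_string()))
    table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
    pq.write_table(table, path)

Recent pandas versions also appear to forward a schema= keyword from DataFrame.to_parquet to pyarrow's Table.from_pandas, so the same schema could likely be passed there instead of converting manually; whether graphrag should adopt either approach is part of the further research mentioned above.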
Steps to reproduce
Option 1:
Create an index using a dataset that contains 160 million tokens.
Option 2:
Create a DataFrame with one row that contains a string larger than 2 GB. Below is a script to reproduce the problem:
import string

import numpy as np
import pandas as pd
from IPython.display import display  # the script was run in a notebook
from tqdm import tqdm

# Target ~3 GB of ASCII data, expressed as a number of string doublings
rows = int(3 * 2**30 / len(string.ascii_lowercase.encode('utf-8')))
rows = int(np.ceil(np.log2(rows)))  # exponential growth

row_data = string.ascii_lowercase
for _ in tqdm(range(rows)):
    row_data += row_data

df = pd.DataFrame(
    {'A': [row_data]}
)

# Compute the size of the DataFrame in bytes
df_memory_usage = df.memory_usage(deep=True).sum()
df_memory_usage_mb = df_memory_usage / 2**20  # Convert bytes to megabytes
print(f'Total memory usage of the DataFrame is: {df_memory_usage_mb:.2f} MB')
print(f"df.dtypes: {df.dtypes}")
display(df.head())

# This line will raise a PyArrow capacity error
df.to_parquet("max_capacity.parquet")
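For what it's worth, the failure can also be checked directly against pyarrow using the df built above. The second half uses the explicit large_string schema sketched after the references; that is the assumed workaround, not what graphrag does today.

import pyarrow as pa

# Reuses `df` from the repro script above.
try:
    # Same conversion pandas performs inside DataFrame.to_parquet
    pa.Table.from_pandas(df, preserve_index=False)
except pa.ArrowCapacityError as exc:
    print(f"default string type fails: {exc}")

# With 64-bit offsets (large_string) the same column converts without error.
schema = pa.schema([pa.field('A', pa.large_string())])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
print(table.schema)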
Expected Behavior
If you follow option 1, you should see the error message below at the end of the "create base extracted entities" workflow:
File "pyarrow/array.pxi", line 339, in pyarrow.lib.array File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_statuspyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 261264517716:37:55,764
graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None
If you execute option 2, you will see the following error:
ArrowCapacityError Traceback (most recent call last)
Cell In[5], line 24
21 display(df.head())
23 # This line will raise a PyArrow capacity error
---> 24 df.to_parquet("max_capacity.parquet")
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pandas/util/_decorators.py:333, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
327 if len(args) > num_allow_args:
328 warnings.warn(
329 msg.format(arguments=_format_argument_list(allow_args)),
330 FutureWarning,
331 stacklevel=find_stack_level(),
332 )
--> 333 return func(*args, **kwargs)
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pandas/core/frame.py:3113, in DataFrame.to_parquet(self, path, engine, compression, index, partition_cols, storage_options, **kwargs)
3032 \"\"\"
3033 Write a DataFrame to the binary parquet format.
3034
(...)
3109 >>> content = f.read()
3110 \"\"\"
3111 from pandas.io.parquet import to_parquet
-> 3113 return to_parquet(
3114 self,
3115 path,
3116 engine,
3117 compression=compression,
3118 index=index,
3119 partition_cols=partition_cols,
3120 storage_options=storage_options,
3121 **kwargs,
3122 )
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pandas/io/parquet.py:480, in to_parquet(df, path, engine, compression, index, storage_options, partition_cols, filesystem, **kwargs)
476 impl = get_engine(engine)
478 path_or_buf: FilePath | WriteBuffer[bytes] = io.BytesIO() if path is None else path
--> 480 impl.write(
481 df,
482 path_or_buf,
483 compression=compression,
484 index=index,
485 partition_cols=partition_cols,
486 storage_options=storage_options,
487 filesystem=filesystem,
488 **kwargs,
489 )
491 if path is None:
492 assert isinstance(path_or_buf, io.BytesIO)
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pandas/io/parquet.py:190, in PyArrowImpl.write(self, df, path, compression, index, storage_options, partition_cols, filesystem, **kwargs)
187 if index is not None:
188 from_pandas_kwargs["preserve_index"] = index
--> 190 table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
192 if df.attrs:
193 df_metadata = {"PANDAS_ATTRS": json.dumps(df.attrs)}
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/table.pxi:3874, in pyarrow.lib.Table.from_pandas()
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/pandas_compat.py:611, in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
606 return (isinstance(arr, np.ndarray) and
607 arr.flags.contiguous and
608 issubclass(arr.dtype.type, np.integer))
610 if nthreads == 1:
--> 611 arrays = [convert_column(c, f)
612 for c, f in zip(columns_to_convert, convert_fields)]
613 else:
614 arrays = []
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/pandas_compat.py:611, in <listcomp>(.0)
606 return (isinstance(arr, np.ndarray) and
607 arr.flags.contiguous and
608 issubclass(arr.dtype.type, np.integer))
610 if nthreads == 1:
--> 611 arrays = [convert_column(c, f)
612 for c, f in zip(columns_to_convert, convert_fields)]
613 else:
614 arrays = []
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/pandas_compat.py:592, in dataframe_to_arrays.<locals>.convert_column(col, field)
589 type_ = field.type
591 try:
--> 592 result = pa.array(col, type=type_, from_pandas=True, safe=safe)
593 except (pa.ArrowInvalid,
594 pa.ArrowNotImplementedError,
595 pa.ArrowTypeError) as e:
596 e.args += ("Conversion failed for column {!s} with type {!s}"
597 .format(col.name, col.dtype),)
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/array.pxi:339, in pyarrow.lib.array()
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/array.pxi:85, in pyarrow.lib._ndarray_to_array()
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 3489660928
GraphRAG Config Used
Logs and screenshots
No response
Additional Information