Error loading large parquet file #300
Ohh, I am not sure why this happens. My pandas version is …
Or, can you give a LLaVA-RLHF dataset download link that is consistent with your OneDrive file? I just created a new image parquet using the same format.
Can you try with LLAVAR or LRV? I uploaded them as well.
LLaVA-RLHF seems correct on my side. I will update it, but the previous version should be correct. It's weird.
Thanks a lot.
These two parquet files could be opened correctly.
I am sure that the LLaVA-RLHF parquet shared on OneDrive is damaged, while all the rest could be loaded correctly.
I am uploading an updated version.
Thanks a lot.
You need to use dask:
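(The original snippet was not preserved in this thread; below is a minimal sketch of the dask approach, assuming `dask.dataframe` and the `LA_RLHF.parquet` file name from the report.)

```python
# Minimal sketch of the dask approach (an assumption; the original snippet
# was lost). dask reads the parquet file in partitions, so no single Arrow
# array has to hold the whole ~22 GB base64 image column at once.
import dask.dataframe as dd

df = dd.read_parquet("LA_RLHF.parquet")

# Materializing still needs enough RAM for the full data, but the read
# itself avoids pyarrow's ~2 GiB single-array capacity limit.
pandas_df = df.compute()
```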
which solved this problem.
You can check the current code to see if it helps; we changed it to load iteratively: Otter/pipeline/mimicit_utils/mimicit_dataset.py, lines 222 to 229 at 3a74688.
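For reference, here is a rough sketch of what iterative loading can look like with pyarrow row groups; this is an assumption about the approach, not the exact code at those lines:

```python
# Rough sketch of iterative loading via pyarrow row groups (an assumption,
# not the exact code from mimicit_dataset.py).
import pandas as pd
import pyarrow.parquet as pq

def load_parquet_iteratively(path: str) -> pd.DataFrame:
    pf = pq.ParquetFile(path)
    # Read one row group at a time so no single Arrow array has to hold
    # more than ~2 GiB of base64 image bytes.
    chunks = [
        pf.read_row_group(i).to_pandas()
        for i in range(pf.num_row_groups)
    ]
    return pd.concat(chunks, ignore_index=True)
```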
Previously, on both my 2×A100 and 8×A100 instances, I could directly load the file.
As mentioned here, I failed to load the LA_RLHF.parquet file (about 22 GB), which was downloaded from the shared OneDrive.
Error msg: pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2368257792.
Is there a special way or Python package required to load this large (base64 image) parquet file?
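For context: Arrow's default binary arrays use 32-bit offsets, so a single array caps out near 2**31 bytes (the 2147483646 in the message), while the image column here needs 2368257792 bytes. A hedged diagnostic sketch to inspect the file's layout without loading it (file name taken from the report):

```python
# Inspect parquet metadata without loading the data, to see whether the
# file was written as one huge row group (which forces pyarrow to build a
# single over-capacity binary array).
import pyarrow.parquet as pq

meta = pq.ParquetFile("LA_RLHF.parquet").metadata
print(f"rows: {meta.num_rows}, row groups: {meta.num_row_groups}")
for i in range(meta.num_row_groups):
    print(f"row group {i}: {meta.row_group(i).total_byte_size / 1e9:.2f} GB")
```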