[WIP] Add eq-to-pos delete job session draft #356

Draft
wants to merge 3 commits into
base: main
from


Zyiqin-Miranda
Member

Opening this as a draft to get overall, high-level feedback.

@Zyiqin-Miranda Zyiqin-Miranda force-pushed the equality-to-position-job-session branch from 1c30d06 to 3d5149d Compare November 6, 2024 20:03
@Zyiqin-Miranda
Member Author

The first version of the converter is working, with a test that verifies correctness.
For easier review, here is an overview of what the converter currently does:

  1. Fetch all equality deletes, data files, and previous position deletes in a single for loop, grouped with the partition value as the key.
  2. Each bucket's files carry a file sequence number (similar to the storage layer's stream_position), and ONLY the data files relevant to the equality delete files are fetched.
    "Relevant" follows the Iceberg spec, specifically:
    An equality delete file must be applied to a data file when all of the following are true:
    - The data file's data sequence number is strictly less than the delete's data sequence number
    - The data file's partition (both spec id and partition values) is equal [4] to the delete file's partition or the delete file's partition spec is unpartitioned
  3. The convert remote function uses the Daft native reader to read only the hash values of the merge key columns (primary key columns), appends file_path and row_index, and uses the zero-copy pyarrow is_in and filter functions to find the positions to delete.
  4. Upload the new position delete files to S3 and commit an overwrite snapshot (replace is not supported yet).
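
The applicability rule quoted in step 2 can be sketched as a small predicate. This is only an illustration under assumptions, not the converter's actual code: `FileEntry`, its fields, `equality_delete_applies`, and `relevant_data_files` are hypothetical names, and the sequence numbers and partition values are assumed to have already been read from the table metadata.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical per-file metadata; not the type used in this PR."""
    path: str
    sequence_number: int              # data sequence number of the file
    spec_id: int                      # partition spec id
    partition: Tuple = ()             # partition values
    unpartitioned_spec: bool = False  # True if the file's spec is unpartitioned


def equality_delete_applies(data_file: FileEntry, delete_file: FileEntry) -> bool:
    """Return True if the equality delete must be applied to the data file."""
    # Condition 1: the data file must be strictly older than the delete.
    if data_file.sequence_number >= delete_file.sequence_number:
        return False
    # Condition 2: same partition (spec id and values), or the delete
    # file's partition spec is unpartitioned.
    same_partition = (
        data_file.spec_id == delete_file.spec_id
        and data_file.partition == delete_file.partition
    )
    return same_partition or delete_file.unpartitioned_spec


def relevant_data_files(
    data_files: List[FileEntry], delete_files: List[FileEntry]
) -> List[FileEntry]:
    """Keep only data files that at least one equality delete applies to."""
    return [
        d for d in data_files
        if any(equality_delete_applies(d, e) for e in delete_files)
    ]
```

Data files that no equality delete applies to never need to be read, which is what keeps the per-bucket fetch small.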
