
Supporting writing schema metadata when writing Parquet in parallel #13866

Open · wants to merge 16 commits into main from 11770/parquet-sink-metadata

Conversation

@wiedld (Contributor) commented Dec 21, 2024

Which issue does this PR close?

Closes #11770

Rationale for this change

The ArrowWriter with its default ArrowWriterOptions will encode the arrow schema into the parquet kv_metadata, unless explicitly skipped. Skipping is done via ArrowWriterOptions::with_skip_arrow_metadata.

In datafusion's ParquetSink, we can write in either single-threaded or parallelized mode. In single-threaded mode, we use the default ArrowWriterOptions and the arrow schema is inserted into the file kv_metadata. However, when performing parallelized writes we do not use the ArrowWriter and instead rely upon the SerializedFileWriter. As a result, we are missing the arrow schema metadata in the parquet files (see the issue ticket).
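For reference, the flag on the single-threaded path looks roughly like this (a minimal sketch against the public parquet APIs; the in-memory buffer and one-column schema are placeholder inputs, not code from this PR):

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::arrow_writer::{ArrowWriter, ArrowWriterOptions};

fn main() -> parquet::errors::Result<()> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
    )
    .expect("batch matches schema");

    // Default ArrowWriterOptions embed the base64-encoded arrow schema in the
    // file's kv_metadata; with_skip_arrow_metadata(true) opts out.
    let options = ArrowWriterOptions::new().with_skip_arrow_metadata(true);
    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new_with_options(&mut buf, schema, options)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```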

ArrowWriterOptions vs WriterProperties

The SerializedFileWriter, along with other associated writers, relies upon the WriterProperties. The WriterProperties differs from the ArrowWriterOptions only in the skip_arrow_metadata and schema_root fields:

```rust
pub struct ArrowWriterOptions {
    properties: WriterProperties,
    skip_arrow_metadata: bool,
    schema_root: Option<String>,
}
```

The skip_arrow_metadata config is only used to decide whether the schema should be added to the WriterProperties.kv_metadata.

Proposed Solution

Since we have WriterProperties, not ArrowWriterOptions, I focused on solutions which construct the proper WriterProperties.kv_metadata (with or without the arrow schema).

Our established pattern is to take the TableParquetOptions configuration and derive the WriterProperties from it. Therefore, I updated those conversion methods to consider arrow schema insertion.
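Concretely, the conversion has to plant the encoded schema into the builder's kv_metadata under the same key the ArrowWriter uses. A minimal, hedged sketch (not the PR's exact code; `encoded_schema` is assumed to be the base64/IPC encoding discussed later in this thread):

```rust
use parquet::arrow::ARROW_SCHEMA_META_KEY;
use parquet::file::properties::{WriterProperties, WriterPropertiesBuilder};
use parquet::format::KeyValue;

// Hedged sketch: the parallelized path hands WriterProperties to the
// SerializedFileWriter, so the encoded arrow schema must travel inside
// WriterProperties.kv_metadata rather than via ArrowWriterOptions.
fn builder_with_arrow_schema(encoded_schema: String) -> WriterPropertiesBuilder {
    WriterProperties::builder().set_key_value_metadata(Some(vec![KeyValue::new(
        ARROW_SCHEMA_META_KEY.to_string(),
        encoded_schema,
    )]))
}
```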

What changes are included in this PR?

  • add a new configuration, ParquetOptions.skip_arrow_metadata
  • have ParquetSink single-threaded writes, which use the ArrowWriter, respect this configuration
  • have ParquetSink parallelized writes, which use the WriterProperties, respect this configuration
  • update the TableParquetOptions => WriterProperties conversion methods

Are these changes tested?

Yes.

Are there any user-facing changes?

We have new APIs:

  • the ParquetOptions.skip_arrow_metadata configuration (see the sketch below)
  • ParquetWriterOptions::try_from(&TableParquetOptions) is deprecated, replaced with methods which explicitly handle the arrow schema
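From the user's side, the new knob would be set roughly like this (a hedged sketch; `global` is the ParquetOptions member of TableParquetOptions, and `skip_arrow_metadata` is the field added by this PR):

```rust
use datafusion_common::config::TableParquetOptions;

fn main() {
    // Opt out of embedding the encoded arrow schema in written files.
    let mut options = TableParquetOptions::default();
    options.global.skip_arrow_metadata = true;
    // ...hand `options` to the ParquetSink / writer-properties conversion...
    let _ = options;
}
```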

@github-actions github-actions bot added core Core DataFusion crate common Related to common crate proto Related to proto crate labels Dec 21, 2024
@wiedld wiedld force-pushed the 11770/parquet-sink-metadata branch from 93278ee to 1a9da6f Compare December 21, 2024 01:10
Comment on lines 165 to 170
```rust
/// Encodes the Arrow schema into the IPC format, and base64 encodes it
///
/// TODO: make arrow schema encoding available in a public API.
/// Refer to the currently private `add_encoded_arrow_schema_to_metadata` and `encode_arrow_schema` (to be made public).
/// <https://github.com/apache/arrow-rs/blob/2908a80d9ca3e3fb0414e35b67856f1fb761304c/parquet/src/arrow/schema/mod.rs#L172-L221>
fn encode_arrow_schema(schema: &Arc<Schema>) -> String {
```
@wiedld (author):
If we are in agreement on need, I'll go make the arrow-rs PR.
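Until such an arrow-rs API exists, the duplicated logic looks roughly like this (a hedged sketch based on the linked arrow-rs source; the legacy-IPC length framing is paraphrased from that file, not from this PR):

```rust
use std::sync::Arc;

use arrow_ipc::writer::{IpcDataGenerator, IpcWriteOptions};
use arrow_schema::Schema;
use base64::prelude::*;

// Serialize the schema as an Arrow IPC message, prepend the legacy
// continuation marker and little-endian length, then base64-encode the
// result for storage in the parquet kv_metadata.
fn encode_arrow_schema(schema: &Arc<Schema>) -> String {
    let options = IpcWriteOptions::default();
    let data_gen = IpcDataGenerator::default();
    let mut serialized = data_gen.schema_to_bytes(schema, &options);

    let schema_len = serialized.ipc_message.len();
    let mut framed = Vec::with_capacity(schema_len + 8);
    framed.extend_from_slice(&[255u8, 255, 255, 255]); // continuation marker
    framed.extend_from_slice(&(schema_len as u32).to_le_bytes());
    framed.append(&mut serialized.ipc_message);

    BASE64_STANDARD.encode(&framed)
}
```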


Comment on lines 2347 to +2353
```rust
async fn parquet_sink_write() -> Result<()> {
    let parquet_sink = create_written_parquet_sink("file:///").await?;

    // assert written
    let mut written = parquet_sink.written();
    let written = written.drain();
    assert_eq!(
        written.len(),
        1,
        "expected a single parquet file to be written, instead found {}",
        written.len()
    );
    // assert written to proper path
    let (path, file_metadata) = get_written(parquet_sink)?;
    let path_parts = path.parts().collect::<Vec<_>>();
    assert_eq!(path_parts.len(), 1, "should not have path prefix");
```
@wiedld (author):

I refactored the tests first into a series of helpers; it may be easier to review in this commit: 09004c5

```rust
}

#[tokio::test]
async fn parquet_sink_parallel_write() -> Result<()> {
```
@wiedld (author):
New test added.

```rust
}

#[tokio::test]
async fn parquet_sink_write_insert_schema_into_metadata() -> Result<()> {
```
@wiedld (author):
New test added.

@wiedld wiedld force-pushed the 11770/parquet-sink-metadata branch from 1a9da6f to 0b960d9 Compare December 21, 2024 01:53
@wiedld wiedld changed the title ParquetSink should be aware of arrow schema encoding (configurable) in the file metadata. WIP: ParquetSink should be aware of arrow schema encoding (configurable) in the file metadata. Dec 21, 2024
@wiedld wiedld changed the title WIP: ParquetSink should be aware of arrow schema encoding (configurable) in the file metadata. WIP: ParquetSink should be aware of arrow schema encoding for the file metadata. Dec 21, 2024
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 26, 2024
```diff
@@ -61,7 +61,7 @@ logical_plan
 physical_plan
 01)CoalesceBatchesExec: target_batch_size=8192
 02)--FilterExec: column1@0 != 42
-03)----ParquetExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..88], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:88..176], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:176..264], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:264..351]]}, projection=[column1], predicate=column1@0 != 42, pruning_predicate=column1_null_count@2 != column1_row_count@3 AND (column1_min@0 != 42 OR 42 != column1_max@1), required_guarantees=[column1 not in (42)]
+03)----ParquetExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..137], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:137..274], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:274..411], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:411..547]]}, projection=[column1], predicate=column1@0 != 42, pruning_predicate=column1_null_count@2 != column1_row_count@3 AND (column1_min@0 != 42 OR 42 != column1_max@1), required_guarantees=[column1 not in (42)]
```
@wiedld (author) commented Dec 26, 2024:

In this file, the partition offsets have shifted because the kv metadata is slightly larger (from the now-default addition of the encoded arrow schema).

@wiedld wiedld changed the title WIP: ParquetSink should be aware of arrow schema encoding for the file metadata. ParquetSink should be aware of arrow schema encoding for the file metadata. Dec 26, 2024
@wiedld wiedld marked this pull request as ready for review December 26, 2024 18:31
@alamb (Contributor) left a comment:

Thank you @wiedld -- I think this is very close.

  1. Can you please try and avoid into_writer_properties_builder_with_arrow_schema -- it would be nicer to avoid having that and into_writer_properties_builder if possible
  2. Could you please file a ticket / PR upstream in Arrow to support the metadata encoding?

Otherwise I think this PR is ready to go

FYI @kylebarron

```rust
///
/// The returned [`WriterPropertiesBuilder`] includes customizations applicable per column,
/// as well as the arrow schema encoded into the kv_meta at [`ARROW_SCHEMA_META_KEY`].
pub fn into_writer_properties_builder_with_arrow_schema(
```
@alamb (Contributor):

I think making the setting of the arrow schema a separate function might be more consistent with the rest of this API: https://docs.rs/datafusion/latest/datafusion/config/struct.ParquetOptions.html

So for example I would expect that ParquetOptions::into_writer_properties_builder would always set the arrow metadata to be consistent with how WriterProperties works 🤔

If someone doesn't want the arrow metadata, I would expect an option to disable it, like https://docs.rs/parquet/53.3.0/parquet/arrow/arrow_writer/struct.ArrowWriterOptions.html#method.with_skip_arrow_metadata

So that might mean that TableParquetOptions has a field like skip_arrow_metadata

And then depending on the value of that field, into_writer_properties_builder would set/not set the metadata appropriately

That would also avoid having to change the conversion API
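A hedged sketch of the shape being suggested here (names follow the discussion, not necessarily the merged code; `encode_arrow_schema` is the helper sketched earlier in this thread):

```rust
use std::sync::Arc;

use arrow_schema::Schema;
use parquet::arrow::ARROW_SCHEMA_META_KEY;
use parquet::file::properties::{WriterProperties, WriterPropertiesBuilder};
use parquet::format::KeyValue;

// One conversion entry point; the flag decides whether the encoded arrow
// schema lands in kv_metadata, mirroring ArrowWriterOptions::with_skip_arrow_metadata.
fn into_writer_properties_builder(
    skip_arrow_metadata: bool,
    schema: &Arc<Schema>,
) -> WriterPropertiesBuilder {
    let mut builder = WriterProperties::builder();
    // ...apply the remaining ParquetOptions settings here...
    if !skip_arrow_metadata {
        builder = builder.set_key_value_metadata(Some(vec![KeyValue::new(
            ARROW_SCHEMA_META_KEY.to_string(),
            encode_arrow_schema(schema),
        )]));
    }
    builder
}
```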

@wiedld (author) commented Dec 27, 2024:

> So for example I would expect that ParquetOptions::into_writer_properties_builder would always set the arrow metadata to be consistent with how WriterProperties works 🤔

The ParquetOptions does not have kv_metadata; that is held within the parent structure TableParquetOptions. That is why the existing TryFrom code constructs the writer properties only from the TableParquetOptions, not the ParquetOptions.

I like the idea of making the metadata APIs on the TableParquetOptions, although I'll keep the config on the ParquetOptions. I'll push up the change shortly; the relevant structure is sketched below.
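For context, a simplified sketch of the structures involved (the field layout is an assumption based on datafusion-common at the time, trimmed to the relevant fields; not code from this PR):

```rust
use std::collections::HashMap;

// Assumed, simplified layout: the new flag lives on the per-writer
// ParquetOptions, while kv_metadata lives on the parent TableParquetOptions.
pub struct ParquetOptions {
    /// New in this PR: skip embedding the encoded arrow schema
    pub skip_arrow_metadata: bool,
    // ...many other writer settings elided...
}

pub struct TableParquetOptions {
    /// Global (per-writer) options, including the new flag
    pub global: ParquetOptions,
    /// User-provided key/value metadata written to the file footer
    pub key_value_metadata: HashMap<String, Option<String>>,
    // ...column-specific options elided...
}
```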

@wiedld (author):

Here is the change: f2f9b00

I also made another change to prevent the user errors I was worried about: c5ad794

datafusion/core/src/datasource/file_format/parquet.rs (outdated review thread; resolved)
```rust
    Ok((path, file_metadata))
}

fn assert_file_metadata(file_metadata: FileMetaData, expected_kv: Vec<KeyValue>) {
```
@alamb (Contributor):

If you passed in a ref here it might make the code above simpler (less clone):

Suggested change:

```diff
-fn assert_file_metadata(file_metadata: FileMetaData, expected_kv: Vec<KeyValue>) {
+fn assert_file_metadata(file_metadata: &FileMetaData, expected_kv: &[KeyValue]) {
```

@wiedld (author):

The second arg (expected_kv) as a ref does indeed avoid the cloning. Thank you.

The first arg does not have any cloning currently (see tests); if we took a ref there, I would need to clone the file_metadata.key_value_metadata in order to sort and compare it to the expected_kv.
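A hedged sketch of the resulting helper, illustrating why taking FileMetaData by value avoids the clone (types are from the parquet crate):

```rust
use parquet::format::{FileMetaData, KeyValue};

// Taking `file_metadata` by value lets us take ownership of its
// key_value_metadata and sort it in place; `expected_kv` is borrowed
// because it is only read for comparison.
fn assert_file_metadata(file_metadata: FileMetaData, expected_kv: &[KeyValue]) {
    let mut kv = file_metadata
        .key_value_metadata
        .expect("expected kv metadata to be written");
    kv.sort_by(|a, b| a.key.cmp(&b.key));
    assert_eq!(kv, expected_kv);
}
```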

@alamb alamb changed the title ParquetSink should be aware of arrow schema encoding for the file metadata. Supporting writing schema metadata when writing Parquet in parallel Dec 26, 2024
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 27, 2024
Successfully merging this pull request may close the following issue: Arrow schema is missing from the parquet metadata, for files written by ParquetSink (#11770).