[Data Liberation] WP_WXR_Reader #1972

Merged
merged 27 commits into from
Nov 2, 2024
Conversation

adamziel
Collaborator

@adamziel adamziel commented Oct 31, 2024

Description

This PR introduces the WP_WXR_Reader class for parsing WordPress eXtended RSS (WXR) files, along with supporting improvements to the XML processing infrastructure.

Note: WP_WXR_Reader is just a reader. It won't actually import the data into WordPress – that part is coming soon.

Part of #1894

Motivation

There is no WordPress importer that checks all of these boxes:

  • Supports 100GB+ WXR files without running out of memory
  • Can pause and resume along the way
  • Can resume even after a fatal error
  • Can run without libxml and mbstring
  • Is really fast

WP_WXR_Reader is a step in that direction. It cannot pause and resume yet, but the next few PRs will add that feature.

Implementation

WP_WXR_Reader uses the WP_XML_Processor to find XML tags representing meaningful WordPress entities. The reader knows the WXR schema and only looks for relevant elements. For example, it knows that posts are stored in rss > channel > item and comments are stored in rss > channel > item > wp:comment.

The $wxr->next_entity() method stream-parses the next entity from the WXR document and exposes it to the API consumer via $wxr->get_entity_type() and $wxr->get_entity_data(). Each call to $wxr->next_entity() remembers where parsing stopped and resumes from that point.

$fp = fopen('my-wxr-file.xml', 'r');

$wxr_reader = WP_WXR_Reader::from_stream();
while(true) {
    if($wxr_reader->next_entity()) {
        switch ( $wxr_reader->get_entity_type() ) {
            case 'post':
                // ... process post ...
                break;

            case 'comment':
                // ... process comment ...
                break;

            case 'site_option':
                // ... process site option ...
                break;

            // ... process other entity types ...
        }
        continue;
    }

    // Next entity not found – we ran out of data to process.
    // Let's feed another chunk of bytes to the reader.

    if(feof($fp)) {
        break;
    }

    $chunk = fread($fp, 8192);
    if(false === $chunk) {
        $wxr_reader->input_finished();
        continue;
    }
    $wxr_reader->append_bytes($chunk);
}

Similarly to WP_XML_Processor, the WP_WXR_Reader enters a paused state when it doesn't have enough XML bytes to parse the entire entity.

The next_entity() -> fread -> break usage pattern may seem a bit tedious. This is expected. Even if the WXR parsing part of the WP_WXR_Reader offers a high-level API, working with byte streams requires reasoning on a much lower level. The StreamChain class shipped in this repository will make the API consumption easier with its transformation-oriented API for chaining data processors.

Supported WordPress entities

  • posts – sourced from <item> tags
  • comments – sourced from <wp:comment> tags
  • comment meta – sourced from <wp:commentmeta> tags
  • users – sourced from <wp:author> tags
  • post meta – sourced from <wp:postmeta> tags
  • terms – sourced from <wp:term> tags
  • tags – sourced from <wp:tag> tags
  • categories – sourced from <wp:category> tags
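
The tag-to-entity list above can be sketched as a simple lookup table. This is a hypothetical illustration of the mapping, not the actual internal structure of WP_WXR_Reader:

```php
<?php
// Hypothetical sketch of the tag → entity mapping listed above.
// The real WP_WXR_Reader may organize this differently.
$wxr_entity_tags = array(
	'item'           => 'post',
	'wp:comment'     => 'comment',
	'wp:commentmeta' => 'comment_meta',
	'wp:author'      => 'user',
	'wp:postmeta'    => 'post_meta',
	'wp:term'        => 'term',
	'wp:tag'         => 'tag',
	'wp:category'    => 'category',
);

function wxr_entity_type_for_tag( array $map, string $tag ): ?string {
	return $map[ $tag ] ?? null;
}

echo wxr_entity_type_for_tag( $wxr_entity_tags, 'wp:comment' ), "\n"; // comment
```

Unknown tags map to null, which matches the "ignore unrecognized elements" behavior described in the Caveats section below.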

Caveats

Extensibility

WP_WXR_Reader ignores any XML elements it doesn't recognize. The WXR format is extensible, so the reader may eventually support registering custom handlers for unknown tags.

Nested entities intertwined with data

WP_WXR_Reader flushes the current entity whenever another entity starts. The upside is simplicity and a tiny memory footprint. The downside is that it's possible to craft a WXR document where some information would be lost. For example:

<rss>
	<channel>
		<item>
		  <title>Page with comments</title>
		  <link>http://wpthemetestdata.wordpress.com/about/page-with-comments/</link>
		  <wp:postmeta>
		    <wp:meta_key>_wp_page_template</wp:meta_key>
		    <wp:meta_value><![CDATA[default]]></wp:meta_value>
		  </wp:postmeta>
		  <wp:post_id>146</wp:post_id>
		</item>
	</channel>
</rss>

WP_WXR_Reader would accumulate post data until the <wp:postmeta> tag. Then it would emit a post entity and accumulate the meta information until the </wp:postmeta> closer. Then it would advance to <wp:post_id> and ignore it.

This has not been a problem in any of the .wxr files I've seen. Still, it is an important limitation to note. There may be a .wxr generator out there that intertwines post fields with post meta and comments. If this ever comes up, we could:

  • Emit the post entity first, then all the nested entities, and then emit a special post_update entity.
  • Do multiple passes over the WXR file – one for each level of nesting, e.g. 1. Insert posts, 2. Insert Comments, 3. Insert comment meta

Buffering all the post meta and comments seems like a bad idea – there might be gigabytes of data.
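
For the example document above, the first option could produce an emission order like this. This is a hypothetical sketch – post_update is a proposed entity type, not one the reader currently emits:

```php
<?php
// Hypothetical sketch: the order of entities a reader could emit for
// the example <item> above under the proposed strategy. 'post_update'
// is the invented entity type from the first bullet point – it carries
// the post fields that appeared after a nested entity.
$emitted = array(
	array( 'type' => 'post',        'data' => array( 'title' => 'Page with comments' ) ),
	array( 'type' => 'post_meta',   'data' => array( 'meta_key' => '_wp_page_template', 'meta_value' => 'default' ) ),
	array( 'type' => 'post_update', 'data' => array( 'post_id' => 146 ) ),
);

foreach ( $emitted as $entity ) {
	echo $entity['type'], "\n";
}
```

The importer would then apply post_update entities as a second write to the already-inserted post, keeping the reader's memory footprint flat.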

Future Plans

The next phase will add pause/resume functionality to handle timeout scenarios:

  • Save the parser state after each entity, or every n entities for speed. Also record n so the reader can quickly rewind after resuming.
  • Resume parsing from saved state.
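
A minimal sketch of the save/restore shape, assuming the parser state reduces to a byte offset into the input. The class and method names here are hypothetical; the real WP_WXR_Reader state will be richer, but the idea is the same:

```php
<?php
// Hypothetical sketch of pause/resume: a trivially resumable scanner
// whose entire state is a byte offset. Names are illustrative only.
class ResumableScanner {
	private string $buffer = '';
	private int $offset   = 0;

	public function append_bytes( string $bytes ): void {
		$this->buffer .= $bytes;
	}

	public function next_line(): ?string {
		$newline = strpos( $this->buffer, "\n", $this->offset );
		if ( false === $newline ) {
			return null; // Paused: need more bytes.
		}
		$line         = substr( $this->buffer, $this->offset, $newline - $this->offset );
		$this->offset = $newline + 1;
		return $line;
	}

	public function save_state(): array {
		return array( 'offset' => $this->offset );
	}

	public function restore_state( array $state ): void {
		$this->offset = $state['offset'];
	}
}

$scanner = new ResumableScanner();
$scanner->append_bytes( "first\nsecond\n" );
$scanner->next_line();                       // "first"
$state = $scanner->save_state();             // e.g. persisted before a timeout

$resumed = new ResumableScanner();
$resumed->append_bytes( "first\nsecond\n" ); // re-feed the same stream
$resumed->restore_state( $state );
echo $resumed->next_line(), "\n";            // "second"
```

Saving only every n entities amortizes the serialization cost; on resume, the reader rewinds to the last checkpoint and skips forward n entities at most.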

Testing Instructions

Read the tests and ponder whether they make sense. Confirm the PHPUnit test suite passed on CI. The test suite includes coverage for various WXR formats and streaming behaviors.

Drafts a `WP_WXR_Processor` class that can extract structured
information from XML streams. This is an early version. The goal is to
make it streamable and resumable.
@adamziel
Collaborator Author

adamziel commented Oct 31, 2024

@brandonpayton the API emerging from this work surprises me – the WXR file is a data source that emits data objects like a "site title", "site URL", "a post", or "a post comment". Blueprint handlers are functions that accept data objects and imprint them in a WordPress instance. I wonder what other connections we might draw here.

@adamziel
Collaborator Author

Oooh, what if we treated all these data sources as streams of objects?

  • WXR files
  • Markdown files
  • HTML files
  • Github repos
  • Site assembler
  • Other WordPress sites
  • ...who knows what else?

In such a scenario the WP_WXR_Processor would be the only WXR-specific class. All the other data sources would emit the same kind of information that could be processed by a single, generic pipeline that frontloads the assets, rewrites site URLs, etc.

@brandonpayton
Member

@adamziel, these are interesting ideas! Will sleep on them.

This might help connect ideas you've shared elsewhere when we've discussed a sort of WordPress concept language. It's not my desire to reinvent the wheel with yet another language or format, but it seemed like the problem space around site recipes might call for it. It depends on whether we want a human-writable thing or just something that can relay various entities.

@bgrgicak
Collaborator

bgrgicak commented Nov 1, 2024

Oooh, what if we treated all these data sources as streams of objects?

I'm not sure I understand one part.

All of these data sources can have similar data (content) and they need to go through a similar processor to update the content and load assets.

But what would be the expected destination for this data? Would it all end up as posts?

*
* @param int $offset
* @return int
*/
Collaborator Author

@dmsnell @sirreal you may like this change

@adamziel adamziel marked this pull request as ready for review November 2, 2024 13:34
@adamziel adamziel changed the title [Data Liberation] WP_WXR_Processor [Data Liberation] WP_WXR_Reader Nov 2, 2024
@adamziel
Collaborator Author

adamziel commented Nov 2, 2024

I'm not sure I understand one part.

All of these data sources can have similar data (content) and they need to go through a similar processor to update the content and load assets.

But what would be the expected destination for this data? Would it all end up as posts?

WXR supports site options, posts, comments, users, metadata, and a few more data types – so that's what it would end up as. Raw markdown might be just posts, but we could support post meta and site options via frontmatter. WordPress -> WordPress would support every possible data type.

@adamziel
Collaborator Author

adamziel commented Nov 2, 2024

I'll go ahead and merge to keep moving forward. The code isn't used anywhere yet.

@adamziel adamziel merged commit 2b1f0b6 into trunk Nov 2, 2024
7 checks passed
@adamziel adamziel deleted the wp-wxr-processor branch November 2, 2024 14:23
Comment on lines +2 to +8
/**
* UTF-8 decoding pipeline by Dennis Snell (@dmsnell), originally
* proposed in https://github.com/WordPress/wordpress-develop/pull/6883.
*
* It enables parsing XML documents with incomplete UTF-8 byte sequences
* without crashing or depending on the mbstring extension.
*/
Member

❤️

@adamziel adamziel restored the wp-wxr-processor branch November 4, 2024 01:43
return 0;
}
$name_byte_length = 0;
while(true) {
@JanJakes JanJakes Nov 7, 2024

@adamziel @dmsnell @sirreal
As I'm looking into a similar problem right now, I thought I'd dump here an idea that I used in my case.

In the loop, we could 1) check for a sequence of ASCII < 128 characters, 2) check if the next character can be multibyte, and 3) only then call utf8_codepoint_at. In my scenario, it gives some performance gains. If we expect most attribute names to be ASCII < 128, then this could bring significant performance improvements.

(This is a simplification as I didn't consider the $test_as_first_character handling.)

while (true) {
	// First, let's try to parse an ASCII sequence.
	$name_byte_length += strspn(
		$this->xml,
		'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:_',
		$offset + $name_byte_length
	);

	// Check if the following byte can be part of a multibyte character.
	$byte = $this->xml[ $offset + $name_byte_length ] ?? null;
	if ( null === $byte || ord( $byte ) < 128 ) {
		break;
	}

	// Check the \x{0080}-\x{ffff} Unicode character range.
	$codepoint = utf8_codepoint_at( $this->xml, $offset + $name_byte_length, $bytes_parsed );
	if (
		null === $codepoint ||
		! $this->is_valid_name_codepoint( $codepoint, $name_byte_length === 0 )
	) {
		break;
	}
	$name_byte_length += $bytes_parsed;
}
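
The fast path above hinges on strspn() consuming a run of known-safe ASCII name bytes in a single call before any per-codepoint work happens. A self-contained demonstration of that building block, with a simplified character mask (the real XML Name grammar allows more, including many multibyte codepoints that the utf8_codepoint_at slow path would handle):

```php
<?php
// Standalone demonstration of the strspn() ASCII fast path suggested
// above. The mask is deliberately simplified: letters, digits, ':',
// '_', '-', '.' – a subset of valid XML name characters.
function ascii_name_prefix_length( string $xml, int $offset ): int {
	return strspn(
		$xml,
		'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:_-.',
		$offset
	);
}

$xml = 'wp:post_id>146</wp:post_id>';
echo ascii_name_prefix_length( $xml, 0 ), "\n"; // 10 – stops at '>'
```

Because strspn() is a single C-level scan, documents whose names are pure ASCII never pay the per-codepoint decoding cost at all.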
