[Data Liberation] WP_WXR_Reader #1972

Merged
merged 27 commits into from
Nov 2, 2024
Conversation

adamziel
Collaborator

@adamziel adamziel commented Oct 31, 2024

Description

This PR introduces the WP_WXR_Reader class for parsing WordPress eXtended RSS (WXR) files, along with supporting improvements to the XML processing infrastructure.

Note: WP_WXR_Reader is just a reader. It won't actually import the data into WordPress – that part is coming soon.

Part of #1894

Motivation

There is no WordPress importer that checks all of these boxes:

  • Supports 100GB+ WXR files without running out of memory
  • Can pause and resume along the way
  • Can resume even after a fatal error
  • Can run without libxml and mbstring
  • Is really fast

WP_WXR_Reader is a step in that direction. It cannot pause and resume yet, but the next few PRs will add that feature.

Implementation

WP_WXR_Reader uses the WP_XML_Processor to find XML tags representing meaningful WordPress entities. The reader knows the WXR schema and only looks for relevant elements. For example, it knows that posts are stored in rss > channel > item and comments are stored in rss > channel > item > wp:comment.

The $wxr->next_entity() method stream-parses the next entity from the WXR document and exposes it to the API consumer via $wxr->get_entity_type() and $wxr->get_entity_data(). Each call to $wxr->next_entity() remembers where parsing stopped and resumes from that point.

$fp = fopen('my-wxr-file.xml', 'r');

$wxr_reader = WP_WXR_Reader::from_stream();
while(true) {
    if($wxr_reader->next_entity()) {
        switch ( $wxr_reader->get_entity_type() ) {
            case 'post':
                // ... process post ...
                break;

            case 'comment':
                // ... process comment ...
                break;

            case 'site_option':
                // ... process site option ...
                break;

            // ... process other entity types ...
        }
        continue;
    }

    // Next entity not found – we ran out of data to process.
    // Let's feed another chunk of bytes to the reader.

    if(feof($fp)) {
        break;
    }

    $chunk = fread($fp, 8192);
    if(false === $chunk) {
        $wxr_reader->input_finished();
        continue;
    }
    $wxr_reader->append_bytes($chunk);
}

Similarly to WP_XML_Processor, the WP_WXR_Reader enters a paused state when it doesn't have enough XML bytes to parse the entire entity.

The next_entity() -> fread -> break usage pattern may seem a bit tedious. This is expected. Even if the WXR parsing part of the WP_WXR_Reader offers a high-level API, working with byte streams requires reasoning on a much lower level. The StreamChain class shipped in this repository will make the API consumption easier with its transformation-oriented API for chaining data processors.

Supported WordPress entities

  • posts – sourced from <item> tags
  • comments – sourced from <wp:comment> tags
  • comment meta – sourced from <wp:commentmeta> tags
  • users – sourced from <wp:author> tags
  • post meta – sourced from <wp:postmeta> tags
  • terms – sourced from <wp:term> tags
  • tags – sourced from <wp:tag> tags
  • categories – sourced from <wp:category> tags
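
The tag-to-entity list above can be sketched as a simple lookup table. This is a hypothetical illustration of the mapping, not the actual internal structure of WP_WXR_Reader:

```php
<?php
// Hypothetical sketch of the tag → entity mapping listed above.
// The real WP_WXR_Reader may organize this differently.
$wxr_entity_tags = array(
	'item'           => 'post',
	'wp:comment'     => 'comment',
	'wp:commentmeta' => 'comment_meta',
	'wp:author'      => 'user',
	'wp:postmeta'    => 'post_meta',
	'wp:term'        => 'term',
	'wp:tag'         => 'tag',
	'wp:category'    => 'category',
);

function wxr_entity_type_for_tag( array $map, string $tag ): ?string {
	return $map[ $tag ] ?? null;
}

echo wxr_entity_type_for_tag( $wxr_entity_tags, 'wp:comment' ), "\n"; // comment
```

Unknown tags map to null, which matches the "ignore unrecognized elements" behavior described in the Caveats section below.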

Caveats

Extensibility

WP_WXR_Reader ignores any XML elements it doesn't recognize. The WXR format is extensible, so the reader may eventually support registering custom handlers for unknown tags.

Nested entities intertwined with data

WP_WXR_Reader flushes the current entity whenever another entity starts. The upside is simplicity and a tiny memory footprint. The downside is that it's possible to craft a WXR document where some information would be lost. For example:

<rss>
	<channel>
		<item>
		  <title>Page with comments</title>
		  <link>http://wpthemetestdata.wordpress.com/about/page-with-comments/</link>
		  <wp:postmeta>
		    <wp:meta_key>_wp_page_template</wp:meta_key>
		    <wp:meta_value><![CDATA[default]]></wp:meta_value>
		  </wp:postmeta>
		  <wp:post_id>146</wp:post_id>
		</item>
	</channel>
</rss>

WP_WXR_Reader would accumulate post data until the <wp:postmeta> tag. Then it would emit a post entity and accumulate the meta information until the </wp:postmeta> closer. Then it would advance to <wp:post_id> and ignore it.

This has not been a problem in any of the .wxr files I've seen. Still, it is an important limitation to note. There may be a .wxr generator out there that intertwines post fields with post meta and comments. If this ever comes up, we could:

  • Emit the post entity first, then all the nested entities, and then emit a special post_update entity.
  • Do multiple passes over the WXR file – one for each level of nesting, e.g. 1. Insert posts, 2. Insert Comments, 3. Insert comment meta

Buffering all the post meta and comments seems like a bad idea – there might be gigabytes of data.
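
For the example document above, the first option could produce an emission order like this. This is a hypothetical sketch – post_update is a proposed entity type, not one the reader currently emits:

```php
<?php
// Hypothetical sketch: the order of entities a reader could emit for
// the example <item> above under the proposed strategy. 'post_update'
// is the invented entity type from the first bullet point – it carries
// the post fields that appeared after a nested entity.
$emitted = array(
	array( 'type' => 'post',        'data' => array( 'title' => 'Page with comments' ) ),
	array( 'type' => 'post_meta',   'data' => array( 'meta_key' => '_wp_page_template', 'meta_value' => 'default' ) ),
	array( 'type' => 'post_update', 'data' => array( 'post_id' => 146 ) ),
);

foreach ( $emitted as $entity ) {
	echo $entity['type'], "\n";
}
```

The importer would then apply post_update entities as a second write to the already-inserted post, keeping the reader's memory footprint flat.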

Future Plans

The next phase will add pause/resume functionality to handle timeout scenarios:

  • Save the parser state after each entity, or every n entities for speed. Also record n so the reader can quickly rewind after resuming.
  • Resume parsing from saved state.
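
A minimal sketch of the save/restore shape, assuming the parser state reduces to a byte offset into the input. The class and method names here are hypothetical; the real WP_WXR_Reader state will be richer, but the idea is the same:

```php
<?php
// Hypothetical sketch of pause/resume: a trivially resumable scanner
// whose entire state is a byte offset. Names are illustrative only.
class ResumableScanner {
	private string $buffer = '';
	private int $offset   = 0;

	public function append_bytes( string $bytes ): void {
		$this->buffer .= $bytes;
	}

	public function next_line(): ?string {
		$newline = strpos( $this->buffer, "\n", $this->offset );
		if ( false === $newline ) {
			return null; // Paused: need more bytes.
		}
		$line         = substr( $this->buffer, $this->offset, $newline - $this->offset );
		$this->offset = $newline + 1;
		return $line;
	}

	public function save_state(): array {
		return array( 'offset' => $this->offset );
	}

	public function restore_state( array $state ): void {
		$this->offset = $state['offset'];
	}
}

$scanner = new ResumableScanner();
$scanner->append_bytes( "first\nsecond\n" );
$scanner->next_line();                       // "first"
$state = $scanner->save_state();             // e.g. persisted before a timeout

$resumed = new ResumableScanner();
$resumed->append_bytes( "first\nsecond\n" ); // re-feed the same stream
$resumed->restore_state( $state );
echo $resumed->next_line(), "\n";            // "second"
```

Saving only every n entities amortizes the serialization cost; on resume, the reader rewinds to the last checkpoint and skips forward n entities at most.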

Testing Instructions

Read the tests and ponder whether they make sense. Confirm the PHPUnit test suite passed on CI. The test suite includes coverage for various WXR formats and streaming behaviors.

Drafts a `WP_WXR_Processor` class that can extract structured
information from XML streams. This is an early version. The goal is to
make it streamable and resumable.
@adamziel
Collaborator Author

adamziel commented Oct 31, 2024

@brandonpayton the API emerging from this work surprises me – the WXR file is a data source that emits data objects like a "site title", "site URL", "a post", or "a post comment". Blueprint handlers are functions that accept data objects and imprint them in a WordPress instance. I wonder what other connections we might draw here.

@adamziel
Collaborator Author

Oooh, what if we treated all these data sources as streams of objects?

  • WXR files
  • Markdown files
  • HTML files
  • Github repos
  • Site assembler
  • Other WordPress sites
  • ...who knows what else?

In such a scenario the WP_WXR_Processor would be the only WXR-specific class. All the other data sources would emit the same kind of information that could be processed by a single, generic pipeline that frontloads the assets, rewrites site URLs, etc.

@brandonpayton
Member

@adamziel, these are interesting ideas! Will sleep on them.

This might help connect ideas you've shared elsewhere when we've discussed a sort of WordPress concept language. It's not my desire to reinvent the wheel with yet another language or format, but it seemed like the problem space around site recipes might call for it. It depends on whether we want a human-writable thing or just something that can relay various entities.

@bgrgicak
Collaborator

bgrgicak commented Nov 1, 2024

Oooh, what if we treated all these data sources as streams of objects?

I'm not sure I understand one part.

All of these data sources can have similar data (content) and they need to go through a similar processor to update the content and load assets.

But what would be the expected destination for this data? Would it all end up as posts?

*
* @param int $offset
* @return int
*/
Collaborator Author

@dmsnell @sirreal you may like this change

@adamziel adamziel marked this pull request as ready for review November 2, 2024 13:34
@adamziel adamziel changed the title [Data Liberation] WP_WXR_Processor [Data Liberation] WP_WXR_Reader Nov 2, 2024
@adamziel
Collaborator Author

adamziel commented Nov 2, 2024

I'm not sure I understand one part.

All of these data sources can have similar data (content) and they need to go through a similar processor to update the content and load assets.

But what would be the expected destination for this data? Would it all end up as posts?

WXR supports site options, posts, comments, users, metadata, and a few more data types – so that's what it would end up as. Raw markdown might be just posts, but we could support post meta and site options via frontmatter. WordPress -> WordPress would support every possible data type.

@adamziel
Collaborator Author

adamziel commented Nov 2, 2024

I'll go ahead and merge to keep moving forward. The code isn't used anywhere yet.

@adamziel adamziel merged commit 2b1f0b6 into trunk Nov 2, 2024
7 checks passed
@adamziel adamziel deleted the wp-wxr-processor branch November 2, 2024 14:23
Comment on lines +2 to +8
/**
* UTF-8 decoding pipeline by Dennis Snell (@dmsnell), originally
* proposed in https://github.com/WordPress/wordpress-develop/pull/6883.
*
* It enables parsing XML documents with incomplete UTF-8 byte sequences
* without crashing or depending on the mbstring extension.
*/
Member

❤️

@adamziel adamziel restored the wp-wxr-processor branch November 4, 2024 01:43
return 0;
}
$name_byte_length = 0;
while(true) {
@JanJakes JanJakes Nov 7, 2024

@adamziel @dmsnell @sirreal
As I'm looking into a similar problem right now, I thought I'd dump here an idea that I used in my case.

In the loop, we could 1) check for a sequence of ASCII < 128 characters, 2) check if the next character can be multibyte, and 3) only then call utf8_codepoint_at. In my scenario, it gives some performance gains. If we expect most attribute names to be ASCII < 128, then this could bring significant performance improvements.

(This is a simplification as I didn't consider the $test_as_first_character handling.)

while (true) {
	// First, let's try to parse an ASCII sequence.
	$name_byte_length += strspn(
		$this->xml,
		'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:_',
		$offset + $name_byte_length
	);

	// Check if the following byte can be part of a multibyte character.
	$byte = $this->xml[ $offset + $name_byte_length ] ?? null;
	if ( null === $byte || ord( $byte ) < 128 ) {
		break;
	}

	// Check the \x{0080}-\x{ffff} Unicode character range.
	$codepoint = utf8_codepoint_at( $this->xml, $offset + $name_byte_length, $bytes_parsed );
	if (
		null === $codepoint ||
		! $this->is_valid_name_codepoint( $codepoint, $name_byte_length === 0 )
	) {
		break;
	}
	$name_byte_length += $bytes_parsed;
}
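
The fast path above hinges on strspn() consuming a run of known-safe ASCII name bytes in a single call before any per-codepoint work happens. A self-contained demonstration of that building block, with a simplified character mask (the real XML Name grammar allows more, including many multibyte codepoints that the utf8_codepoint_at slow path would handle):

```php
<?php
// Standalone demonstration of the strspn() ASCII fast path suggested
// above. The mask is deliberately simplified: letters, digits, ':',
// '_', '-', '.' – a subset of valid XML name characters.
function ascii_name_prefix_length( string $xml, int $offset ): int {
	return strspn(
		$xml,
		'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:_-.',
		$offset
	);
}

$xml = 'wp:post_id>146</wp:post_id>';
echo ascii_name_prefix_length( $xml, 0 ), "\n"; // 10 – stops at '>'
```

Because strspn() is a single C-level scan, documents whose names are pure ASCII never pay the per-codepoint decoding cost at all.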
