[Blueprints] Support Data Liberation importer in the importWxr step #2058

Merged
adamziel merged 38 commits into trunk from import-wxr-via-data-liberation
Dec 11, 2024

@adamziel adamziel commented Dec 5, 2024

Description

Adds the Data Liberation WXR importer as an option in the importWxr step. The new importer is turned on by including the "importer": "data-liberation" option:

{
  "steps": [
    {
      "step": "importWxr",
      "file": {
        "resource": "url",
        "url": "https://raw.githubusercontent.com/wpaccessibility/a11y-theme-unit-test/master/a11y-theme-unit-test-data.xml"
      },
      "importer": "data-liberation"
    }
  ]
}

When the importer option is missing or set to "default", the step behaves exactly as before and keeps using the https://github.com/humanmade/WordPress-Importer importer.
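The fallback described above can be sketched in TypeScript. This is a minimal illustration, not the code from this PR: the type names `WxrImporterName` and `ImportWxrStepSketch`, the `resolveImporter` helper, and the example URL are all hypothetical.

```typescript
// Hypothetical types for illustration; not Playground's actual declarations.
type WxrImporterName = 'default' | 'data-liberation';

interface ImportWxrStepSketch {
  step: 'importWxr';
  file: { resource: 'url'; url: string };
  importer?: WxrImporterName;
}

// A missing option falls back to "default", so existing Blueprints
// keep using the previous importer without any changes.
function resolveImporter(step: ImportWxrStepSketch): WxrImporterName {
  return step.importer ?? 'default';
}

const legacyStep: ImportWxrStepSketch = {
  step: 'importWxr',
  file: { resource: 'url', url: 'https://example.com/export.xml' },
};
const optInStep: ImportWxrStepSketch = {
  ...legacyStep,
  importer: 'data-liberation',
};

console.log(resolveImporter(legacyStep)); // "default"
console.log(resolveImporter(optInStep)); // "data-liberation"
```

Keeping the option optional is what makes the change opt-in: only Blueprints that name the new importer explicitly are affected.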

The new importer:

  • Rewrites links in the imported content
  • Downloads assets through Playground's CORS proxy
  • Parallelizes the downloads
  • Communicates progress

This PR is a part of #1894

Implementation details

The importWxr step now fetches and includes the data-liberation-core.phar file. The phar file is built with Box and contains the importer library and its dependencies, which together amount to a subset of the Data Liberation library, a subset of the Blueprints library, and a few vendor libraries.

This, unfortunately, means that any changes in the PHP files require rebuilding the .phar file. Here's how you can do it:

nx build:phar playground-data-liberation

You can also build the entire Data Liberation package as a WordPress plugin complete with a wp-admin page:

nx build:plugin playground-data-liberation

Both commands write the built files to packages/playground/data-liberation/dist.

The progress updates are a first-class feature of the new importer. The updated importer step receives them in real-time via a post_message_to_js() call running after every import step. Then, it passes them on to the progress bar UI.
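To illustrate the JavaScript side of that flow, here is a rough sketch of a handler that parses a progress message and forwards it to a progress bar. The message shape (`type`, `stage`, `completed`, `total`), the `ProgressSink` interface, and the `relayProgress` function are assumptions made for this example; the actual payload sent by post_message_to_js() and the actual progress bar API may look different.

```typescript
// Assumed message shape; the real payload may differ.
interface ImportProgressMessage {
  type: 'import-progress';
  stage: string; // hypothetical stage label, e.g. "download_assets"
  completed: number;
  total: number;
}

// Hypothetical stand-in for the progress bar UI.
interface ProgressSink {
  setCaption(caption: string): void;
  setPercent(percent: number): void;
}

// Parses one raw message and forwards it to the sink.
// Returns false when the message is not a well-formed progress update.
function relayProgress(rawMessage: string, sink: ProgressSink): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawMessage);
  } catch {
    return false; // Not JSON, so not a progress update.
  }
  if (typeof parsed !== 'object' || parsed === null) {
    return false;
  }
  const msg = parsed as Partial<ImportProgressMessage>;
  if (
    msg.type !== 'import-progress' ||
    typeof msg.stage !== 'string' ||
    typeof msg.completed !== 'number' ||
    typeof msg.total !== 'number'
  ) {
    return false;
  }
  // Guard against division by zero and clamp to 100%.
  const percent =
    msg.total > 0
      ? Math.min(100, Math.round((msg.completed / msg.total) * 100))
      : 0;
  sink.setCaption(`Importing: ${msg.stage}`);
  sink.setPercent(percent);
  return true;
}
```

Returning a boolean lets the caller ignore unrelated messages that travel over the same channel.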

Other changes

  • TLS traffic now goes through the CORS proxy. Since the new importer uses AsyncHTTP\Client, which deals with raw sockets, Playground's TLS-based network bridge runs the outbound traffic through a CORS proxy. Technically, TCPOverFetchWebsocket gets the corsProxy URL passed to the playground.boot() call.
  • A few composer dependencies were forked, downgraded to PHP 7.2 using Rector, and bundled with this PR to keep the Data Liberation importer working.
  • CORS proxy – skip rate-limiting when an env variable is present, filter out the HTTP/2 header, don't pass duplicate Access-Control-* headers, add Accept to Access-Control-Allow-Headers. FYI @brandonpayton
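
The header handling in the last item can be pictured with a small TypeScript sketch. It is illustrative only: the function name `buildProxyResponseHeaderLines`, the reading of "the HTTP/2 header" as the upstream status line, and the exact values the proxy sets are assumptions, not the proxy's actual code.

```typescript
// Illustrative sketch; names and header values are assumptions.
function buildProxyResponseHeaderLines(upstreamLines: string[]): string[] {
  const forwarded = upstreamLines.filter((line) => {
    const lower = line.trim().toLowerCase();
    // Assumed meaning of "the HTTP/2 header": a status line such as
    // "HTTP/2 200", which is not a real name/value header.
    if (lower.indexOf('http/') === 0) return false;
    // Drop upstream Access-Control-* headers so the client never
    // receives them twice once the proxy adds its own.
    if (lower.indexOf('access-control-') === 0) return false;
    return lower.length > 0;
  });
  // The proxy's own CORS headers, with Accept among the allowed headers.
  return forwarded.concat([
    'Access-Control-Allow-Origin: *',
    'Access-Control-Allow-Headers: Accept, Content-Type',
  ]);
}
```

Filtering before appending is the point of the sketch: each Access-Control-* header ends up in the response exactly once, regardless of what the upstream server sent.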

Remaining work

  • PHP 7.2 compatibility. Done by forking and Rector-downgrading dependencies that were incompatible with PHP 7.2.
  • Report the importer's progress on the overall Blueprint progress bar
  • Enqueue the data liberation plugin files for downloading at the blueprint compilation stage
  • Don't eagerly rewrite attachment URLs in WP_Stream_Importer. Exposing this information to the API consumer requires an explicit decision. Do we rewrite it? Or do we ignore it?
  • Fix the TLS errors at the intersection of Playground network transport and the async HTTP client library
  • Separate the markdown importer and its dependencies (md parser, frontmatter parser, Symfony libraries) from the core plugin
  • Ship the importer and its tree-shaken deps (URL parser) as a minified zip/phar

Follow-up work

  • Reconsider the WP_Import_Session API – do we need so many verbosely named methods? Can we achieve the same outcomes with fewer methods?
  • Investigate why there's a significant delay before media downloads start on PHP 7.2 – 7.4. It's likely a PHP.wasm issue.

Testing instructions

  • Default importer – Open this link and confirm it does what the current importWxr step does, that is, it stays at "Importing content" for a moment, fails to fetch media files (CORS issues in network tools), but inserts posts and pages.
  • Data Liberation – Open this link, confirm the import progress is visible and that the content and media indeed get imported:

[Screenshot: CleanShot 2024-12-08 at 14 54 49@2x]

Related issues

@adamziel adamziel changed the title Blueprints: Use the Data Liberation plumbing in the importWxr step Blueprints: Use the Data Liberation plugin to import WXR files in the importWxr step Dec 5, 2024
@adamziel adamziel changed the title Blueprints: Use the Data Liberation plugin to import WXR files in the importWxr step [Blueprints] Switch importWxr to Data Liberation WXR Importer Dec 5, 2024
@adamziel adamziel force-pushed the import-wxr-via-data-liberation branch from 43255d9 to a5f77d1 Compare December 7, 2024 19:45
@adamziel adamziel changed the title [Blueprints] Switch importWxr to Data Liberation WXR Importer [Blueprints] Support Data Liberation importer in the importWxr step Dec 8, 2024
@adamziel adamziel marked this pull request as ready for review December 8, 2024 14:20
@adamziel adamziel commented

The remaining failures are tests that are either flaky or disabled on trunk. This one seems ready; I'm quite excited! With this PR, we can start solving all the import and export problems Playground has been affected by since day 1, e.g. rewriting the URLs in imported content, slow asset downloads, mapping author IDs, etc!

@adamziel adamziel merged commit 2191e22 into trunk Dec 11, 2024
8 of 10 checks passed
@adamziel adamziel deleted the import-wxr-via-data-liberation branch December 11, 2024 11:32
adamziel added a commit that referenced this pull request Dec 11, 2024
… offline mode assets (#2072)

Removes data-liberation-core.phar from the list of assets preloaded to
support offline mode. It was implicitly added to that list in #2058.

The preloading triggers the following error:

```
Failed to execute 'addAll' on 'Cache': Cache.addAll(): duplicate requests
```

The error stops the initialization of the offline mode cache on the
first page load. It still works on subsequent page loads. Still, this
behavior breaks an E2E test in trunk.

Playground doesn't need to preload that asset. It is only required for a
specific flavor of the importWxr step and there's already an established
pattern of not preloading plugins, e.g. the SQLite database integration
plugin is not preloaded.

 ## Testing instructions

Confirm the Chromium E2E tests pass
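
For context on the error quoted above: `Cache.addAll()` rejects when the same request appears twice in the list it is given. The commit fixes this by removing the asset from the preload list. The sketch below shows a different, purely illustrative safeguard, deduplicating a URL list before it would be handed to `addAll()`. The `uniqueUrls` helper and the asset paths are hypothetical.

```typescript
// Hypothetical helper: removes exact duplicate URLs, keeping the first
// occurrence and the original order.
function uniqueUrls(urls: string[]): string[] {
  return urls.filter((url, index) => urls.indexOf(url) === index);
}

// Hypothetical asset list with one entry listed twice.
const assetsToPreload = [
  '/assets/php.wasm',
  '/assets/data-liberation-core.phar',
  '/assets/data-liberation-core.phar',
];

// In a service worker, the result could then be passed to cache.addAll().
const safeList = uniqueUrls(assetsToPreload);
```

This only catches byte-identical strings. Two different spellings of the same URL would still need to be normalized before comparison.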