Skip to content

Add two-phase remote site import#2939

Draft
adamziel wants to merge 69 commits intotrunkfrom
adamziel/stream-import-flow
Draft

Add two-phase remote site import#2939
adamziel wants to merge 69 commits intotrunkfrom
adamziel/stream-import-flow

Conversation

@adamziel
Copy link
Copy Markdown

Related issues

How AI was used in this PR

  • AI helped port the original draft import command, wire the new importer child/runtime integration, and draft focused test coverage. I reviewed and validated the resulting code paths, build output, and test results in this workspace.

Proposed Changes

  • add a new studio site import command that stores raw imports in ~/.studio/imports/<key> and makes --name optional by inferring the local site name from the remote host
  • import essential files first, flatten the runtime into ~/Studio/<site>, apply the remote database to SQLite, generate a Playground runtime blueprint, create/start the Studio site, and then continue the skipped-file phase with progress output
  • persist imported-site metadata so later studio site start uses the generated runtime blueprint and studio site delete --files removes both the visible site directory and the technical import directory

Testing Instructions

  • npx eslint --fix apps/cli/commands/site/import.ts apps/cli/importer-child.ts apps/cli/lib/import/migration-client.ts apps/cli/commands/site/start.ts apps/cli/commands/site/delete.ts apps/cli/commands/site/tests/import.test.ts apps/cli/commands/site/tests/start.test.ts apps/cli/commands/site/tests/delete.test.ts apps/cli/index.ts apps/cli/vite.config.ts tools/common/lib/site-events.ts tools/common/logger-actions.ts
  • npm run cli:build
  • npm test -- apps/cli/commands/site/tests/import.test.ts apps/cli/commands/site/tests/start.test.ts apps/cli/commands/site/tests/delete.test.ts
  • npm run typecheck currently fails on a pre-existing repo issue in apps/cli/lib/archive.ts:18 (followSymlinks is not part of ArchiverOptions)

Pre-merge Checklist

  • Have you checked for TypeScript, React or other console errors?

@adamziel adamziel force-pushed the adamziel/stream-import-flow branch from f58d38f to f8006c2 Compare March 27, 2026 18:11
@wojtekn
Copy link
Copy Markdown
Contributor

wojtekn commented Mar 30, 2026

@adamziel currently Studio uses "import" terminology for importing from a backup file or .sql file, and sync/push/pull for remote site integrations. Also, we are moving the current Import features from UI to the studio import command.

Could you please use another namespace for the new command, e.g., studio stream-sync, studio sync stream, or similar?

@adamziel
Copy link
Copy Markdown
Author

Let me loop in @Poliuk – would you be able to map all these distinct features and find a useful naming strategy for them?

@adamziel adamziel force-pushed the adamziel/stream-import-flow branch from 4e7bc64 to 3694f85 Compare March 31, 2026 18:59
The import repair tests use spawnSync('sqlite3', ...) to create test
databases. CI environments without sqlite3 (e.g. Windows agents) fail
these tests. Skip them conditionally with it.skipIf rather than failing
the entire test suite.
@adamziel adamziel force-pushed the adamziel/stream-import-flow branch from 2052d3d to 943e346 Compare April 7, 2026 13:25
adamziel added 4 commits April 7, 2026 15:48
…rt-flow

# Conflicts:
#	apps/cli/package.json
#	package-lock.json
…am-import-flow

Add timeout to sqlite3 detection
- migration-client test: skip the PHP-dependent test when native PHP is
  unavailable instead of trying to mock spawnSync (the vi.mock for
  node:child_process didn't intercept reliably)
- runtime-start-options test: use path.join for paths that the source
  code constructs via path.join, so the mock matches on Windows where
  path.join produces backslashes
@adamziel adamziel force-pushed the adamziel/stream-import-flow branch from b89abd4 to 989fcc9 Compare April 7, 2026 15:01
adamziel added 23 commits April 7, 2026 20:47
The previous regex-based parser ignored the byte-length prefix in PHP's
serialize format and used JavaScript's String.length (UTF-16 code units)
instead of the UTF-8 byte count when re-serializing. This silently
corrupts the active_plugins option when any plugin path contains
multi-byte characters, causing PHP's unserialize() to fail and
effectively deactivating every plugin on the imported site.

Replace the hand-rolled parsePhpSerializedStringArray and
serializePhpStringArray with a PHP WASM invocation that uses native
serialize()/unserialize(). The two per-slug calls are batched into a
single PHP WASM boot for efficiency.
parseRuntimePhpConstants used a regex to approximate PHP's define()
syntax — it would silently miss constants with double-quoted values,
concatenated expressions, or any construct beyond simple literals.

Replace it with a PHP WASM invocation that includes the file in a fresh
environment and reads back the user-defined constants via
get_defined_constants(). This lets PHP's own parser handle the syntax.
downloadedFiles and downloadedBytes were overwritten directly from each
progress record, so they dropped back to near-zero whenever the importer
process restarted (exit code 2). bytesReceived already had cumulative
logic but the other counters did not.

Apply the same restart-detection pattern to downloadedFiles and
downloadedBytes, and clamp totalFiles/totalBytes to never decrease.
Two problems prevented ?studio-auto-login from working on imported sites:

1. Studio MU-plugins (including the auto-login endpoint) were only
   mounted for sites without useExactMountLayout. Imported sites use
   exact mount layouts, so the endpoint didn't exist at all. Move the
   MU-plugin mounts outside the layout conditional so they're always
   present.

2. The auto-login MU-plugin looked up studio_admin_username (absent in
   imported databases) then fell back to 'admin' (also typically absent).
   Add a final fallback that finds the first administrator account via
   get_users(), which matches whatever admin the remote site had.
When the progress message wrapped across multiple terminal lines and
then shrank on the next update, ora's line-clearing logic over-cleared
and erased previous output above the spinner. Truncating the text to
a single line prevents wrapping entirely.
When only totalFiles was known (before the first downloaded_files
record), the display showed "2838 files". Once a progress record added
downloadedFiles, it switched to "50/2838 files". The user saw the
leading number jump from 2838 to 50 on each format change.

Always use "X/Y files" once either count is known, defaulting the
downloaded count to 0, so the leading number only ever increases.
When resuming an import with a stored secret, the previous flow always
loaded all WP.com sites (slow, noisy "Loaded 414 sites" message) and
sometimes rotated the secret eagerly. Now:

- If a URL is provided and a stored secret exists, try it immediately
  without fetching the site list or rotating.
- Only load sites and rotate the secret on demand if the preflight
  fails with the stored secret.
- Replace the separate "Connecting to remote site" spinner with
  "Initiating the migration" and remove the chatty site-list and
  secret-rotation success messages.
The file count was only appearing in the message segment (after bytes)
instead of the structured files segment (before bytes), which meant the
cumulative tracking never applied and the number danced with each record.

Three fixes:

- readNumber now parses string values — PHP json_encode may emit numeric
  fields as strings depending on the serialization path.
- Broader field name coverage: files_downloaded, files, bytes_downloaded,
  bytes are now recognized alongside the existing variants.
- The free-text message is suppressed once structured file counts are
  available, avoiding a duplicate "[2838 files]" alongside "0/2838 files".
The streaming-site-migration importer emits files_imported (per-file
running count), files_completed (per-request total), files_indexed
(index size in lifecycle events), and bytes_processed. None of these
matched the field names the progress tracker was looking for
(downloaded_files, total_files, etc.), so the structured X/Y files
display was never populated and the user saw a dancing number from the
raw message text instead.

Also handle string-typed numbers from PHP json_encode, extract
files_indexed from lifecycle events that return early, and suppress
the free-text message once structured file counts are available.
The previous URLs appended ?studio-auto-login as a query parameter to
the site URL or wp-admin path. The auto-login MU-plugin intercepts
requests to the /studio-auto-login path, not a query parameter on
arbitrary pages. Use the correct /studio-auto-login endpoint and pass
the target page via redirect_to, matching the pattern used everywhere
else in the codebase.
files_completed (per-HTTP-request count from completion chunks) overlaps
with files_imported (per-invocation running count from file_progress and
heartbeat records). Mixing both into the same cumulative tracker caused
the restart detector to misfire whenever a completion chunk reported a
lower value than the latest file_progress record, inflating the
displayed count far beyond the actual total (e.g. 17137/3495).

Use only files_imported for the downloaded count. Same treatment for
bytes_processed which is a per-request subset that conflicts with
bytes_received. Also prioritize files_indexed over total_files since
that is the field the importer actually emits.
Both db-sync and db-apply were running without progress callbacks,
showing only a static spinner. The importer already emits structured
progress for both: bytes_received during the SQL dump download, and
statements_executed/statements_total during db-apply.

Wire up onProgress callbacks and progressLabel for both commands.
Map bytes_read (db-apply's file offset) to downloadedBytes so the
byte progress shows during statement application. Suppress the raw
message text when structured statement counts are already displayed.
A bare "2959 files" without a total is noise — it tells the user nothing
about how much work remains. Only show the files segment once totalFiles
is known, always in X/Y format.

The total was also drifting upward as the importer discovered symlink
targets and expanded the index. Lock totalFiles and totalBytes to the
first value reported so the denominator stays stable throughout the
progress display. Remove the premature files_indexed extraction from
lifecycle events since those fire before the fetch phase when the index
is still incomplete.
The importer's files_imported is a per-invocation running total that
resets on exit-code-2 restarts. The importer re-scans already-handled
files on resume, so accumulating across restarts counted the same files
multiple times — producing counts like 12097/3482.

Use the raw per-invocation value directly. It briefly resets to 0 on
restart but is always accurate within an invocation. Keep accumulation
only for bytes_received (HTTP-level counter that genuinely resets per
request without re-scanning).
The importer now emits files_done (cumulative entries processed from
the download list, stable across restarts) and files_total (total
entries in the download list, fixed once the diff phase completes).

Prefer these over files_imported/files_indexed which were per-invocation
counters that reset on restart and could exceed the index size. Remove
the totalFiles lock since files_total from the importer is already
stable by construction.
files_done already includes files_imported (it's download_list_done +
files_imported). Emitting both was redundant and forced the consumer to
choose between them. Now the importer emits only files_done in progress
and heartbeat records, and Studio only reads files_done.
files_done from the importer drops on exit-code-2 restarts because
files_imported resets to 0 before the batch offset advances. Use
Math.max against the previous snapshot so the displayed count never
goes backward. Same treatment for downloadedBytes which can also
drop when the importer restarts a batch.
Two bugs:

1. Heartbeats during indexing emitted files_done:0 without files_total.
   Studio set downloadedFiles=0, which tripped the message suppression
   (hasStructuredCounts was true), but totalFiles was undefined so no
   structured file segment rendered either. Total information blackout.
   Fix: only suppress the message when totalFiles is actually set.

2. The progress label said "Downloading essential files" during the
   index and diff phases. Now the importer's phase name overrides the
   generic label when it describes a non-download activity (e.g.
   "Indexing remote files", "Preparing download list").
Debug records are internal importer diagnostics, not user-facing
progress. Stop surfacing them as the spinner message.
When the remaining-files download exited with code 2 (partial progress),
the state became status=partial, stage=fetch-skipped. On the next run,
prepareSkippedEarlierState overwrote this to status=complete, and the
importer was invoked with --filter=skipped-earlier. But the importer's
validation only accepts this flag from a complete state — it didn't
handle resuming a partial skipped-earlier run.

Now detect when the state already indicates an in-progress skipped-
earlier run and skip both the state overwrite and the --filter flag.
The importer resumes from the state file's stage field directly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants