In mid-May 2026, the first nightly deployment shipped to production at WooCommerce.com. Within hours, our pipeline caught a real performance regression that we reported upstream before it reached any public release.
WooCommerce.com is itself a production WooCommerce store: it sells extensions, processes real payments, and serves real merchant traffic. Every weekday, a pipeline pulls the latest nightly build, deploys it to staging, runs tests, and produces a ready-to-merge pull request.
It’s modeled on how WordPress.org runs on WordPress trunk: by living closer to the current development branch, you catch problems earlier and contribute fixes back upstream. We’re doing the same for the Woo ecosystem.
What we built
A daily pipeline that does, roughly, this:

AI handles the purple steps. Humans still own the merge decision: they review the generated output and decide what ships. We want to reduce human involvement over time as our guardrails mature. Automatic merges would run on weekdays only, never weekends, and only once post-deploy telemetry (slow-query anomaly detection, error-rate comparisons, latency percentiles) is reliable enough to act on automatically.
Most of those steps are standard engineering: cron, CI, deploys. The interesting parts are three jobs that used to need a human and now belong to AI:
- Patch replay. We run a slightly patched version of WC Core on WooCommerce.com. The patches are typically small fixes we’re contributing back to WC Core itself (or to Action Scheduler), applied here only until they land upstream and ship in a release. An LLM reads our patch manifest, decides per-patch whether each one is still needed, already upstreamed, or needs adjustment, and commits the result.
- Advisory code review of the WC Core diff. A multi-thousand-line core diff was never going to get a meaningful human review from our side. An LLM does a focused pass on the integration points where Core meets our codebase, and posts a verdict on each generated PR.
- Exploratory testing on staging. After each upgrade deploys to staging, an agent browses key flows and posts findings as PR comments.
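To make the patch replay step concrete, here is a minimal sketch of the three per-patch outcomes described above. The manifest format is not described in this post, so `PatchEntry`, its fields, `Verdict`, and `plan_replay` are all hypothetical names, and the verdicts are passed in as plain data where the LLM's decisions would be.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    STILL_NEEDED = "still_needed"          # re-apply the patch unchanged
    UPSTREAMED = "upstreamed"              # fix is already in the nightly; drop the patch
    NEEDS_ADJUSTMENT = "needs_adjustment"  # patch must be reworked against the new code


@dataclass
class PatchEntry:
    name: str          # hypothetical identifier for the patch
    reason: str        # annotation explaining why the patch exists
    upstream_ref: str  # hypothetical pointer to the upstream PR or issue


def plan_replay(manifest, verdicts):
    """Group manifest entries by the verdict assigned to each patch.

    `verdicts` maps patch name -> Verdict. A patch without a verdict is
    treated as NEEDS_ADJUSTMENT so it gets looked at rather than silently dropped.
    """
    plan = {verdict: [] for verdict in Verdict}
    for entry in manifest:
        verdict = verdicts.get(entry.name, Verdict.NEEDS_ADJUSTMENT)
        plan[verdict].append(entry.name)
    return plan


# Placeholder manifest: names, reasons, and references are illustrative only.
manifest = [
    PatchEntry("patch-a", "fix awaiting an upstream release", "upstream PR (placeholder)"),
    PatchEntry("patch-b", "fix not yet merged upstream", "upstream PR (placeholder)"),
    PatchEntry("patch-c", "fix touching refactored code", "upstream issue (placeholder)"),
]
verdicts = {"patch-a": Verdict.UPSTREAMED, "patch-b": Verdict.STILL_NEEDED}
plan = plan_replay(manifest, verdicts)
```

Defaulting an unclassified patch to the "needs adjustment" bucket is a design choice in this sketch, not something the pipeline is known to do; it keeps a missing verdict from quietly removing a fix.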
What it caught in the first two weeks
Week 1: the first run flagged a broken integration test, traceable to a specific upstream change. The upstream team had already pushed a revert by the time we noticed. On a daily cadence, we’d have caught it sooner.
Week 2: a few hours after the first auto-generated upgrade shipped to production, we saw real impact: slow database queries on order metadata, with measurable downstream effects (slower checkout, slower renewal lookups, a webhook backlog).
The cause was a recent upstream PR that simplified a database index on the orders meta table. The change was reasonable in isolation, as the index it removed had grown large (for context: on WooCommerce.com, the index itself takes ~15GB). But it broke a query pattern that several extensions in our stack relied on. Not WC Core itself: the extensions built on top of it. At production scale, the slowdown was noticeable.
We restored a similar index on our side as a same-day mitigation, then reported the regression upstream with detailed slow-query data. Upstream has since landed an updated index, planned for WC 10.8.
What matters here is the timing. We found it before the change could ship in a public WC Core release.
The AI angle
AI in this project absorbed work that was expensive enough that we could only do it at the regular release cadence: code review on multi-thousand-line core diffs, exploratory testing after every upgrade, and patch replay. With AI handling those steps, we can run them continuously, on every nightly.
What the AI gates get wrong
The gates are useful, not flawless. A few honest examples from the first weeks:
- The advisory code reviewer’s first verdicts weren’t useful. Too generic, not focused on real risk. We rewrote the prompt and added diff pre-narrowing before it produced operationally useful feedback. We expect to iterate on it several more times as we see what it catches and misses.
- The exploratory agent on staging is v1. It produces structured findings as PR comments, but its bounded test charters need tuning as we see what it surfaces on actual upgrade diffs. We treat its output as signal, not gate.
- Patch replay worked on the first run, but only because the patch manifest is tight. The LLM’s effectiveness depends on us keeping the manifest well-annotated.
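The diff pre-narrowing mentioned in the first example can be pictured as a filter that runs before the reviewer sees anything. This post does not describe how the narrowing is implemented, so the following is only a sketch of one plausible approach: split a git diff into per-file sections and keep the ones under watched paths. The function name `narrow_diff` and the paths in the sample are hypothetical.

```python
def narrow_diff(diff_text, watched_prefixes):
    """Keep only the per-file sections of a git diff that fall under watched paths."""
    kept, current, keep_current = [], [], False
    for line in diff_text.splitlines(keepends=True):
        if line.startswith("diff --git "):
            # A new file section starts: flush the previous one if it was watched.
            if keep_current:
                kept.extend(current)
            current = []
            # Header looks like: diff --git a/<path> b/<path>
            # (simplification: assumes the path itself does not contain " b/")
            path = line.split(" b/", 1)[-1].strip()
            keep_current = path.startswith(tuple(watched_prefixes))
        current.append(line)
    if keep_current:
        kept.extend(current)
    return "".join(kept)


# Illustrative two-file diff; the paths are placeholders.
sample = (
    "diff --git a/src/Checkout/Totals.php b/src/Checkout/Totals.php\n"
    "--- a/src/Checkout/Totals.php\n"
    "+++ b/src/Checkout/Totals.php\n"
    "@@ -1 +1 @@\n"
    "-old\n"
    "+new\n"
    "diff --git a/docs/readme.md b/docs/readme.md\n"
    "--- a/docs/readme.md\n"
    "+++ b/docs/readme.md\n"
    "@@ -1 +1 @@\n"
    "-before\n"
    "+after\n"
)
narrowed = narrow_diff(sample, ["src/Checkout/"])
```

Here `narrowed` contains only the `src/Checkout/Totals.php` section, so a reviewer prompt built from it spends its attention on the watched integration point.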
What this means for you
If you build WooCommerce extensions: WooCommerce.com running on the nightly build makes WC Core releases more stable before they reach your extensions. The most useful thing you can do is test your extensions against WC Core beta releases. If you’ve developed defensive patterns for Core regressions, we’d value hearing them.
If you contribute to WooCommerce Core: your PRs hit a real downstream consumer within a day of merging to trunk. If something breaks our integration, you’ll hear about it within hours or days, not after release.
What’s next
More automation, fewer manual steps. The next steps are making post-deploy telemetry a core part of the upgrade (slow-query anomaly detection, error-rate comparisons, latency percentiles) and routing failure signals back to the right people automatically.
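As a rough illustration of two of those checks, the sketch below compares a latency percentile and an error rate between a pre-deploy baseline window and a post-deploy window. Everything here is an assumption made for the example: the function names, the nearest-rank percentile method, and the thresholds (`max_ratio=1.5`, `max_increase=0.01`) are not taken from the actual pipeline.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list, with p in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_regressed(before_ms, after_ms, p=95, max_ratio=1.5):
    """True when the post-deploy percentile exceeds max_ratio times the baseline."""
    return percentile(after_ms, p) > max_ratio * percentile(before_ms, p)


def error_rate_regressed(before, after, max_increase=0.01):
    """Compare error rates; each argument is (error_count, request_count)."""
    rate_before = before[0] / before[1]
    rate_after = after[0] / after[1]
    return rate_after - rate_before > max_increase


# Made-up samples: two slow requests out of twenty are enough to move p95.
baseline_ms = [100] * 20
post_deploy_ms = [100] * 18 + [400, 400]

latency_flag = latency_regressed(baseline_ms, post_deploy_ms)
error_flag = error_rate_regressed((5, 1000), (30, 1000))
```

Fixed thresholds like these are the simplest possible rule. A pipeline that acts on the result automatically would likely need to account for normal day-to-day variation before trusting it.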
Co-reviewed and frequently paired with Cem Ünalan. Thanks to the rest of the WooCommerce.com engineering team for support along the way.