feat(playwright): add max-parallel and customizable timeouts, add docker logs on Grafana startup failure#711

Merged
xnyo merged 10 commits into
mainfrom
l2d2/feat-playwright-concurrency
May 22, 2026

@L2D2Grafana
Contributor

@L2D2Grafana L2D2Grafana commented May 6, 2026

Description ✨

Attempt to fix Playwright Docker containers failing to start up by allowing users to cap matrix concurrency (max-parallel) or set a longer grafana-startup-timeout.

Drilldown apps, and Logs Drilldown in particular, are seeing Playwright Docker containers fail to start up within 60s. Seven matrix jobs run concurrently because we are obligated to support Grafana 11.6, and I believe this is causing the failure.
Screenshot 2026-05-06 at 10 06 53 AM

Summary 📝

  • Added configurable Playwright grafana-startup-timeout by introducing playwright-grafana-startup-timeout in cd.yml and ci.yml, then forwarding it as grafana-startup-timeout into playwright.yml.
  • Added configurable Playwright matrix concurrency by introducing playwright-max-parallel in cd.yml and ci.yml, then forwarding it as max-parallel into playwright.yml.
  • Updated playwright.yml to apply strategy.max-parallel: ${{ inputs.max-parallel }} with a default of 256, preserving existing behavior unless explicitly overridden.
  • Added a failure-only diagnostics step after wait-for-grafana to print docker compose ps -a and recent docker compose logs, making startup/concurrency flakes actionable in CI logs.

Test 🧪

@L2D2Grafana L2D2Grafana self-assigned this May 6, 2026
@L2D2Grafana L2D2Grafana requested review from a team as code owners May 6, 2026 17:13
@grafana-plugins-platform-bot grafana-plugins-platform-bot Bot moved this from 📬 Triage to 🔬 In review in Grafana Catalog Team May 6, 2026
@L2D2Grafana
Contributor Author

Oh wow now it's failing with only 3 in the matrix

Member

@xnyo xnyo left a comment

Great addition! However, I noticed the latest version of wait-for-grafana (v1.0.3) changed the way timeouts are handled and added a new input startupTimeout that's specific for the Grafana startup, separate from timeout: grafana/plugin-actions#213

Maybe we can take this opportunity to:

  • Bump wait-for-grafana to v1.0.3
  • Bind the new playwright-grafana-startup-timeout to the new startupTimeout input
  • Also introduce playwright-grafana-timeout, which binds to the existing timeout input (defaulting to 60)

WDYT?

@xnyo
Member

xnyo commented May 7, 2026

Oh wow now it's failing with only 3 in the matrix

@L2D2Grafana Hmm, yeah, this is unusual 🤔. I don't think it's related to the number of jobs in the matrix, since each job should run in its own VM. Maybe something is preventing Grafana from starting at all. Can you try pinning the workflow to this branch (@l2d2/feat-playwright-concurrency) and see whether the new step you added in this PR surfaces anything in the logs?

Note: If logs-drilldown requires some secrets in Vault, the workflow will fail (non-main branches and non-release tags do not have access to Vault for security reasons). If this is the case, I can help test it in one of the testing repos for plugin-ci-workflows, which can access Vault from any branch (including PRs).

@L2D2Grafana
Contributor Author

Having debug logs for the Playwright Docker container startup allowed me to dig deeper into the issue: Grafana is crashing. Grafana 13.x added the grafana-apiserver advisor check-type bootstrap, which now competes with the provisioning subsystem at exactly the worst time. https://github.com/grafana/logs-drilldown/actions/runs/25501547107/job/74836196412?pr=1883

logger=provisioning level=error msg="Failed to provision data sources"
  error="Datasource provisioning error: database is locked (5) (SQLITE_BUSY)"
Error: ✗ invalid service state: Failed, expected: Running, failure:
  starting module provisioning: ... database is locked (5) (SQLITE_BUSY)

🤖 So Grafana isn't slow — it's crashing during startup. That's why we see 60 s of 000 (TCP refused): the process is exiting before binding :3000, and wait-for-grafana polls a port that never opens.
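
To illustrate why a crashed process looks like a plain timeout to a poller, here is a minimal Python sketch of a TCP readiness loop. It is not the wait-for-grafana implementation; the function names `wait_for_port` and `closed_local_port` are hypothetical, and the sketch assumes a simple connect-and-retry strategy.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float, interval_s: float = 0.05) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # Succeeds only if some process is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Connection refused or timed out: nothing is listening yet,
            # or the process exited before it could bind the port.
            time.sleep(interval_s)
    return False


def closed_local_port() -> int:
    """Return a localhost port that was just released, so nothing listens on it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```

Calling `wait_for_port("127.0.0.1", closed_local_port(), timeout_s=0.3)` spends the whole budget on refused connections and then gives up. From the poller's side, a process that crashed before binding its port looks the same as one that is merely slow, which is why the container logs are needed to tell them apart.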

What's actually happening
Two subsystems are writing to the same SQLite file (grafana.db) at the exact same time during boot:

  • Legacy provisioning (logger=provisioning.datasources) is inserting the 5 provisioned datasources from provisioning/datasources/default.yaml: gdev-testdata, gdev-loki, gdev-tempo, gdev-prometheus, grafanacloud-dev-logs. Each one writes a row to data_source plus a row to secrets.kvstore.
  • Grafana 13's new apiserver bootstrap (logger=grafana-apiserver + app=advisor.app runner=advisor.checktyperegisterer) is concurrently doing POST /apis/advisor.grafana.app/v0alpha1/.../checktypes for 7 check types (datasource, plugin, ssosetting, config, instance, license, authentication...). Each POST writes a row to resource via resource-server.

SQLite is single-writer. When both fight for the file lock, one gets SQLITE_BUSY (5) and retries.
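
The single-writer behaviour can be reproduced outside Grafana with Python's standard `sqlite3` module. This is a minimal sketch, not Grafana's code: the function name `second_writer_error` is hypothetical, and the one-column `data_source` table is a stand-in for illustration only.

```python
import os
import sqlite3
import tempfile


def second_writer_error() -> str:
    """Return the error a second writer sees while another connection holds the write lock."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    # isolation_level=None: transactions are controlled explicitly below.
    # timeout=0: fail immediately instead of waiting for the lock.
    first = sqlite3.connect(path, timeout=0, isolation_level=None)
    second = sqlite3.connect(path, timeout=0, isolation_level=None)
    try:
        first.execute("CREATE TABLE data_source (name TEXT)")
        first.execute("BEGIN IMMEDIATE")  # take the write lock and keep it
        first.execute("INSERT INTO data_source VALUES ('gdev-loki')")
        try:
            # A second connection tries to write to the same file.
            second.execute("INSERT INTO data_source VALUES ('other')")
        except sqlite3.OperationalError as exc:
            return str(exc)
        return "no error"
    finally:
        first.close()
        second.close()
```

With no waiting allowed, the second connection fails with the same "database is locked" message that appears in the provisioning log above.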

Why this only bites on contended runners
On a fast/quiet runner, each SQLite write completes in single-digit ms, so the writes from both subsystems naturally interleave inside the lock. On a contended runner with high IO steal, individual writes take long enough that retries exhaust before the queue clears.
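
The effect of runner speed on the retry budget can be sketched the same way. This is an assumption-laden illustration rather than Grafana's retry logic: `write_while_locked` is a hypothetical helper, `hold_s` stands in for how long a slow write keeps the lock, and `budget_s` uses the `sqlite3` connection `timeout` as the retry budget.

```python
import os
import sqlite3
import tempfile
import threading


def write_while_locked(hold_s: float, budget_s: float) -> bool:
    """Hold the write lock for hold_s seconds and report whether a second
    writer with a busy timeout of budget_s seconds manages to insert."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    holder = sqlite3.connect(path, isolation_level=None, check_same_thread=False)
    holder.execute("CREATE TABLE resource (name TEXT)")
    holder.execute("BEGIN IMMEDIATE")  # first writer takes the lock
    holder.execute("INSERT INTO resource VALUES ('first')")
    # Release the lock after hold_s seconds, simulating a slow write finishing.
    release = threading.Timer(hold_s, holder.execute, args=("COMMIT",))
    release.start()
    waiter = sqlite3.connect(path, timeout=budget_s, isolation_level=None)
    try:
        # SQLite keeps retrying for up to budget_s seconds before giving up.
        waiter.execute("INSERT INTO resource VALUES ('second')")
        return True
    except sqlite3.OperationalError:
        return False
    finally:
        release.join()
        waiter.close()
        holder.close()
```

`write_while_locked(0.1, 5.0)` succeeds because the lock clears well within the budget, while `write_while_locked(1.0, 0.1)` fails because the retries run out first. The second case corresponds to the contended-runner scenario described above.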

@L2D2Grafana
Contributor Author

L2D2Grafana commented May 7, 2026

This issue might already be fixed in Grafana 13.0.2 by grafana/grafana#123034. A workaround seems to be disabling the Advisor app with GF_FEATURE_TOGGLES_grafanaAdvisor: 'false' (grafana/logs-drilldown#1886). It could still be nice to merge this for the next poor soul whose Docker image doesn't come online.

@L2D2Grafana L2D2Grafana requested a review from a team as a code owner May 11, 2026 16:25
Member

@xnyo xnyo left a comment

@L2D2Grafana Great catch on the investigation! I think this is a great addition for the customizable timeouts and failure logs, so I also agree this is worth merging imo 👍 . Before approving, I suggest only renaming the PR (it's used for the changelog) to be a bit more descriptive, something like:

feat(playwright): add max-parallel and customizable timeouts, add docker logs on Grafana startup failure

@L2D2Grafana L2D2Grafana changed the title from "feat(playwright): add max-parallel and docker logs" to "feat(playwright): add max-parallel and customizable timeouts, add docker logs on Grafana startup failure" May 12, 2026
@L2D2Grafana
Contributor Author

@L2D2Grafana Great catch on the investigation! I think this is a great addition for the customizable timeouts and failure logs, so I also agree this is worth merging imo 👍 . Before approving, I suggest only renaming the PR (it's used for the changelog) to be a bit more descriptive, something like:

feat(playwright): add max-parallel and customizable timeouts, add docker logs on Grafana startup failure

Updated, ty!

@L2D2Grafana
Contributor Author

L2D2Grafana commented May 21, 2026

FYI this is affecting other teams https://raintank-corp.slack.com/archives/C08QSAXQBCZ/p1778865730276499 and #721

Member

@xnyo xnyo left a comment

LGTM, thank you!

@xnyo xnyo merged commit a879e03 into main May 22, 2026
14 checks passed
@github-project-automation github-project-automation Bot moved this from 🔬 In review to 🚀 Shipped in Grafana Catalog Team May 22, 2026
@xnyo xnyo deleted the l2d2/feat-playwright-concurrency branch May 22, 2026 10:00