Skip to content

Conversation

@alexguo-db
Copy link
Contributor

  • Currently the Databricks driver E2E tests do not run on PRs, leading to breaking changes going undetected
  • This PR adds E2E tests to the CI, which trigger when there are changes to any Databricks ADBC driver-related code
  • This PR adds a dummy test, since not all E2E tests are passing currently, once they are we can test all of them in CI
  • The workflow uses workload identity federation, so no secrets need to be stored

Before merging:

  • Since the tests are E2E, we would like maintainers to approve workflows instead of automatically triggering them
  • I would like to request the creation of an environment protection rule for the databricks-e2e environment, so that approval is required
  • I would like to request Databricks users to be added to the list of approvers, if possible

@lidavidm
Copy link
Member

lidavidm commented Jul 8, 2025

We don't have access to things like that. We could ask INFRA, but I'm not sure if they would add non-committers to the approval list or not.

@alexguo-db alexguo-db force-pushed the alex-guo_data/alexguo-add-databricks-e2e-ci branch from e43ea36 to 6785dc1 Compare July 8, 2025 16:25
@alexguo-db alexguo-db force-pushed the alex-guo_data/alexguo-add-databricks-e2e-ci branch from 1ee8037 to 5901ee0 Compare July 8, 2025 16:43
@alexguo-db alexguo-db force-pushed the alex-guo_data/alexguo-add-databricks-e2e-ci branch from fc78074 to a66afa3 Compare July 8, 2025 17:06
@alexguo-db alexguo-db force-pushed the alex-guo_data/alexguo-add-databricks-e2e-ci branch from 9a0b830 to 01997af Compare July 8, 2025 17:57
@alexguo-db alexguo-db marked this pull request as ready for review July 28, 2025 18:44
@alexguo-db alexguo-db requested a review from lidavidm as a code owner July 28, 2025 18:44
@alexguo-db alexguo-db requested review from kou and zeroshade as code owners July 28, 2025 18:44
@github-actions github-actions bot added this to the ADBC Libraries 20 milestone Jul 28, 2025
Comment on lines 36 to 43
environments:
databricks-e2e:
wait_timer: 0
required_reviewers:
- alexguo-db
- jadewang-db
deployment_branch_policy:
protected_branches: true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-d "subject_token_type=urn:ietf:params:oauth:token-type:jwt" \
-d "scope=sql")
DATABRICKS_TOKEN=$(echo "$OAUTH_RESPONSE" | jq -r '.access_token')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to use ::add-mask:: in some areas here to ensure that the access tokens don't show up in the logs

@zeroshade
Copy link
Member

This seems fine to me in general, but appears to be non-functional as the ACTIONS_ID_TOKEN_URI isn't populated and is an invalid URL, causing the failed deployments

@alexguo-db
Copy link
Contributor Author

alexguo-db commented Aug 8, 2025

This seems fine to me in general, but appears to be non-functional as the ACTIONS_ID_TOKEN_URI isn't populated and is an invalid URL, causing the failed deployments

Yeah, apparently this is not populated on forks which is why I changed the target from pull_request to pull_request_target. So, this doesn't execute on this PR but any future PRs should execute the main branch's version of this workflow
actions/attest-build-provenance#99

@alexguo-db alexguo-db requested a review from zeroshade August 14, 2025 17:38
Copy link
Member

@zeroshade zeroshade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good to me.

@kou or @lidavidm Any issues you can see?

with:
dotnet-version: ${{ matrix.dotnet }}
- name: Checkout ADBC
uses: actions/checkout@v4
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uses: actions/checkout@v4
uses: actions/checkout@v5

Comment on lines +76 to +81
ref: ${{ github.event.pull_request.head.sha || github.sha }}
fetch-depth: 0
submodules: recursive
- name: Build
shell: bash
run: ci/scripts/csharp_build.sh $(pwd)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that we may run ci/scripts/csharp_build.sh in forked repository with the pull_request_target context (that has write access to apache/arrow-adbc)?

I think that it's not acceptable based on the ASF GitHub Actions policy: https://infra.apache.org/github-actions-policy.html

Triggers

You MUST NOT use pull_request_target as a trigger on ANY action that exports ANY confidential credentials or tokens such as GITHUB_TOKEN or NPM_TOKEN.

Can we run this on fork not apache/arrow-adbc by removing branches: [main] from on.push?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I suppose the point is that the manual approval for the environment protects against this. But Infra may not have intended for environments to be used this way.

Copy link
Contributor Author

@alexguo-db alexguo-db Aug 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe @zeroshade checked with ASF Infra about this use case. Can you confirm with ASF Infra that we can bypass the triggers policy if we have an environment with required reviewers? Otherwise, I don't see how other Apache repos can implement this click-approval pattern

Can we run this on fork not apache/arrow-adbc by removing branches: [main] from on.push?

@kou Sorry, I'm not following why excluding it from being run on the upstream repo would improve the security

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we run this job on apache/arrow-adbc, evil pull requests can get GITHUB_TOKEN that has id-token: write permission for apache/arrow-adbc. They may abuse it.

If we run this job on fork repository, evil developers can get only GITHUB_TOKEN for their fork repositories. It's not a problem because they already have permissions that these GITHUB_TOKEN have. They can't get additional permissions for apache/arrow-adbc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Within our own repo we found out that our GITHUB_TOKEN would only be valid for 5 minutes thus the exchanged Databricks token is also 5 mins long and any test case passing that limit will fail because of invalidation.
We will probably need to supply a Databricks PAT token instead, which will bypass the GITHUB_TOKEN entirely, would that address the concern here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou That won't work for us, the Databricks service principal is set up to only allow OIDC token exchange when the Github OIDC token originates from the main repo.

If we run this job on fork repository, evil developers can get only GITHUB_TOKEN for their fork repositories.

If we allowed the service principal to accept any Github OIDC token then anybody can create a fork and run malicious queries against the Databricks workspace.

The alternative is to use a Databricks personal access token secret but that runs into the same problem where we can only store it only on the main repo (so only people with main branch permissions can copy the branch to main and run the E2E tests)

How is this click approval pattern implemented elsewhere?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will probably need to supply a Databricks PAT token instead, which will bypass the GITHUB_TOKEN entirely, would that address the concern here?

Can we avoid using pull_request_target by this? If so, it addresses the concern.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we allowed the service principal to accept any Github OIDC token then anybody can create a fork and run malicious queries against the Databricks workspace.

Can we accept only trusted fork repositories not any fork repositories in Databricks side?

How is this click approval pattern implemented elsewhere?

I haven't seen the implementation in apache/* repositories...

apache/airflow-publish may have an implementation of it...?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, can Databricks provide local test tool like MinIO for AWS S3, Azurite fhttps://github.com/Azure/Azurite or Azure Storage and Storage Testbench https://github.com/googleapis/storage-testbench for Google Cloud Storage?

.asf.yaml Outdated
Comment on lines 42 to 45
required_reviewers:
- alexguo-db
- jadewang-db
- eric-wang-1990
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that this syntax is wrong:

See also: https://github.com/apache/infrastructure-asfyaml?tab=readme-ov-file#repository-deployment-environments

Suggested change
required_reviewers:
- alexguo-db
- jadewang-db
- eric-wang-1990
required_reviewers:
- id: alexguo-db
type: User
- id: jadewang-db
type: User
- id: eric-wang-1990
type: User

Comment on lines 22 to 26
- krlmlr
- nbenn
- alexguo-db
- jadewang-db
- eric-wang-1990
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you keep this list in alphabetical order?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants