Skip to content

[Enhancement] - Automate GCS bucket, datastore, and GCP app creation#349

Open
leekahung wants to merge 21 commits into
mainfrom
issue-347/automate-gcs-bucket-creation-uploads
Open

[Enhancement] - Automate GCS bucket, datastore, and GCP app creation#349
leekahung wants to merge 21 commits into
mainfrom
issue-347/automate-gcs-bucket-creation-uploads

Conversation

@leekahung
Copy link
Copy Markdown
Contributor

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update
  • Infrastructure
  • Maintenance

Description

This PR is a follow-up to #345, where bucket, datastore, and app creation can be made via automation scripts from the backend. The current flow is for file-based storage and deployment. Only users with GCP access can run this automated flow in CLI.

The creation process has been tested, see bucket for tenantfirstaid-05122026-test in GCS and the datastore for tenantfirstaid-test-fresh in AI Applications on GCP, and the app for tenantfirstaid-test-fresh in AI Applications on GCP.

Related Tickets & Documents

QA Instructions, Screenshots, Recordings

Please replace this line with instructions on how to test your changes, a note on the devices and browsers this has been tested on, as well as any relevant images for UI changes.

To test, you can run the following from the backend directory:

  1. make generate-metadata GCS_BUCKET_NAME=<name-of-your-new-bucket>
  2. make upload-to-gcs GCS_BUCKET_NAME=<name-of-your-new-bucket>
  3. make create-datastore-gcs GCS_BUCKET_NAME=<name-of-your-new-bucket> DATASTORE_NAME=<name-of-your-new-datastore>
  4. make create-app-gcs DATASTORE_ID=<name-of-your-new-datastore> APP_NAME=<name-of-your-new-app>

What each step does:

  1. Creates the new metadata.jsonl reference (PR Include annotations for existing statutes and ASCII enforcement #345)
  2. Creates a new bucket in GCS and uploads metadata.jsonl and the documents to the new bucket (this PR)
  3. Creates a new datastore to and connects your bucket to the datastore (this PR)
  4. Creates a new app on GCP for the datastore to connect to (this PR)

Added/updated tests?

  • Yes
  • No, and this is why: please replace this line with details on why tests have not been included
  • I need help with writing tests

Documentation

  • If this PR changes the system architecture, Architecture.md has been updated

[optional] Are there any post deployment tasks we need to perform?

leekahung added 3 commits May 11, 2026 20:21
…o its own module; Include new unit tests for functions
…tion; Include new unit tests to cover creation scripts; Update doucmentation new automated flow for bucket to connected app on GCP
@leekahung leekahung requested a review from yangm2 May 13, 2026 02:18
@leekahung leekahung added backend Bot implementation and other backend concerns infrastructure Pull requests related to infrastructure and underlying workflows labels May 13, 2026
@leekahung
Copy link
Copy Markdown
Contributor Author

@claude pr-review

@leekahung leekahung marked this pull request as ready for review May 13, 2026 02:19
@claude

This comment was marked as outdated.

@leekahung
Copy link
Copy Markdown
Contributor Author

@claude pr-reivew

@claude

This comment was marked as outdated.

@leekahung leekahung marked this pull request as draft May 13, 2026 02:49
@leekahung leekahung force-pushed the issue-347/automate-gcs-bucket-creation-uploads branch from e2ced7b to e64e9a8 Compare May 17, 2026 23:18
@leekahung leekahung marked this pull request as ready for review May 17, 2026 23:19
@leekahung
Copy link
Copy Markdown
Contributor Author

@claude pr-review

@claude

This comment was marked as outdated.

@leekahung
Copy link
Copy Markdown
Contributor Author

@claude pr-review

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 17, 2026

Claude finished @leekahung's task in 3m 23s —— View job


PR Review

  • Read CLAUDE.md and architecture docs
  • Review changed files (Makefile, scripts, tests)
  • Analyze code quality, security, performance
  • Check test coverage and quality
  • Review documentation updates
  • Post final review

Re-reviewing after the latest three pushes (96d4894 Simplifying comments, 07e7d82 Address PR review feedback and renamed DATASTORE_NAME to DATASTORE_ID, add rollback timeout, update docs, 60cf670 Additional fixes). All four items flagged last round are addressed — calling those out first, then a small set of new things.

Resolved since prior review

  • DATASTORE_NAME vs DATASTORE_ID (Format the output #1) — Normalized to DATASTORE_ID everywhere (backend/Makefile:74-86), --datastore-name--datastore-id in create_datastore_gcs.py:186, and APP_NAMEAPP_ID / --app-id for parallel consistency in create_app_gcs.py:78. Help text + console hint now read coherently end-to-end ("Your datastore ID is: ... Next: ... DATASTORE_ID=..." → "create-app-gcs ... APP_ID=...").
  • Rollback timeout (System prompt tweak: It shouldn't think that it's a lawyer THIS much lol. #2)ROLLBACK_TIMEOUT_SECONDS = 120 defined at the module level (create_datastore_gcs.py:29) and threaded through delete_datastore as operation.result(timeout=timeout) (create_datastore_gcs.py:173). The existing cleanup-error test (test_import_failure_with_cleanup_error_reports_manual_step) covers the surrounding code path.
  • Search app vs datastore relationship (Frontend design tweaks #3)Architecture.md:185 now explicitly states "The app is for browsing/previewing the datastore in the GCP console; the backend's LangChain VertexAISearchRetriever queries the datastore directly and does not depend on this app." Same content also added to corpus-dataset.md:55-58. This is the right place for it.
  • Leading-hyphen help text (File reorg and show feedback #4)shared.py:11-15 error message now says "must begin with a letter or digit", and both flag helps (create_datastore_gcs.py:189, create_app_gcs.py:81) match. Self-explanatory failure now.
  • enforce_ascii double-read (carry-over Idea: A Vector database #5)enforce_ascii.py:97-102 now reads once with path.read_text(encoding="utf-8") and catches UnicodeDecodeError from that single read. Resolved.
  • FULL reconciliation docstring (carry-over)import_documents now has a docstring (create_datastore_gcs.py:120-127) explicitly stating "Callers must guarantee the datastore is freshly created (no pre-existing documents)" — exactly the invariant note the prior review asked for.

New / remaining findings

Lower impact / nits

  1. _NAME_PATTERN permits trailing hyphensbackend/scripts/shared.py:6

    ^[a-z0-9][a-z0-9-]{0,62}$ accepts names like my-ds- (trailing hyphen). Discovery Engine documents the format as starting with a letter/digit but is silent on trailing hyphens in some places — empirically, the server rejects trailing-hyphen IDs with InvalidArgument. A tighter regex would catch it locally:

    _NAME_PATTERN = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$")

    Nice-to-have. Tests in test_shared.py:17 already cover the leading-hyphen case; a "trailing-" parameter would round out the constraint coverage. Fix this →

  2. test_calls_delete_and_waits doesn't assert the timeout kwargbackend/tests/test_create_datastore_gcs.py:181

    operation.result.assert_called_once()

    Since the whole point of the new code is to pass timeout=ROLLBACK_TIMEOUT_SECONDS, locking that in with assert_called_once_with(timeout=ROLLBACK_TIMEOUT_SECONDS) prevents a future refactor from silently dropping it. One-line tightening. Fix this →

  3. metadata_blob content type application/jsonlbackend/scripts/upload_to_gcs.py:134 (carry-over)

    Canonical media type for JSON Lines is application/x-ndjson. Functionally a no-op since Vertex AI Search reads by URI; pure pedantry, leaving for completeness.

  4. No client-side --bucket validationbackend/scripts/upload_to_gcs.py:143-147

    GCS bucket names have their own rules (3-63 chars, lowercase, dots/hyphens/digits/letters, no consecutive dots, no leading dash, etc.). The existing validate_resource_name is almost a subset (it lacks dots and the length floor), so reusing it isn't quite right. Letting GCS surface the error is a defensible choice — the messages are clear — but worth a one-line note in --bucket help text linking to GCS naming rules for self-service debugging.

  5. bucket.location casing is unspecified at runtimebackend/scripts/create_datastore_gcs.py:59

    The defensive (bucket.location or "").upper() is correct, and the parametrized test covers both upper-case forms (US, US-CENTRAL1, EU-WEST1, etc.). If GCS ever returns mixed-case strings, the upper-case before-comparison handles it. Good defensive code; flagging only because no test exercises lower-case bucket.location returns — could add ("us-east1", "us") to the compatible parametrize as a regression guard. Truly optional.

Documentation

  1. .claude/skills/corpus-dataset.md has uncommitted local changes — appears as M in git status and the local working tree reverts the IAM/permissions block to the pre-Format the output #1-fix state. Just noting — if this is leftover scratch from a CI checkout it's a non-issue; if it's intentional, those uncommitted changes would undo the cross-project note that the previous review explicitly praised. Worth a quick git status confirmation before merge.

What I liked (this round)

  • The DATASTORE_NAMEDATASTORE_ID + APP_NAMEAPP_ID rename is a small change touching many places, and it's done cleanly: error messages, Makefile help, --flag names, console output, and Architecture.md all read consistently as "ID" throughout. The footgun the prior review flagged is gone.
  • ROLLBACK_TIMEOUT_SECONDS = 120 as a module-level constant with the rationale in a comment (create_datastore_gcs.py:26-29) is exactly the right level of explanation — a future reader knows why the timeout exists, not just that it does.
  • The new import_documents docstring (create_datastore_gcs.py:121-127) documents the fresh-datastore invariant the FULL reconciliation mode depends on. This is the kind of caller-contract note that prevents subtle bugs years down the line.
  • The Architecture.md step 5 "for browsing/previewing... does not depend on this app" sentence is the right granularity — short, future-proof, and resolves the "why am I creating this?" question without overcommitting.
  • enforce_ascii.py is now a clean single-read path. Smaller blast radius for the I/O bug.

Summary

Two prior-round items (#1 naming, #2 rollback timeout) are fully resolved; the remaining items are pure polish. Nothing here blocks merge — the strongest candidates for one more pass are #1 (tighten the regex to reject trailing hyphens) and #2 (assert timeout= in the delete test). The uncommitted corpus-dataset.md change (#6) is worth a quick look before merging.

I did not run make lint / make typecheck / make test myself (sandbox restrictions in this CI environment), so please confirm those pass locally before merging.

• Branch: issue-347/automate-gcs-bucket-creation-uploads

Copy link
Copy Markdown
Contributor

@yangm2 yangm2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this is so much more reproducible and convenient than clicking through the cloud console interface!

Claude says these are experiments that we want to enable in order to improve RAG retrieval for the ORS/Portland/Eugene docs.

Impact Experiment Script Arg change
Highest Section-boundary chunking generate_metadata_jsonl.py --granularity [file|section] — switches from one entry per .txt file to one entry per 90.XXX section; no Vertex AI API change needed
High Cross-reference expansion generate_metadata_jsonl.py --expand-refs — resolves inline ORS X.XXX citations and appends target section text to each entry's struct_data at generation time
Medium Chunk size / overlap create_datastore_gcs.py --chunk-size INT and --chunk-overlap INT — passed into DocumentProcessingConfig; pairs with --granularity file since section-level entries are already small
Medium Annotations datastore separation generate_metadata_jsonl.py + create_datastore_gcs.py --exclude-annotations flag on metadata generation, run twice to produce two metadata.jsonl files; --datastore-id used twice to create two datastores; --datastore-id on create_app_gcs.py made repeatable (action="append") to attach both to one app

(@lee-coates this is what you've been advocating for, right?)

Additionally, my guess is that we'll end up with multiple RAG tools with unique settings for each, so reproducible deployment of the RAG should be driven by a revision-controlled configuration (e.g. TOML) file.

Speaking of, please update Deployment.md



class TestUploadFiles:
def test_uploads_each_file_and_metadata(self, tmp_path: Path):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is noisy ...

tests/test_upload_to_gcs.py::TestUploadFiles::test_uploads_each_file_and_metadata   uploaded a.txt (/private/var/folders/57/9t9kbd551dgd9dp_mz_kjxsc0000gn/T/-redacted-/pytest-0/test_uploads_each_file_and_met0/a.txt)
  uploaded b.txt (/private/var/folders/57/9t9kbd551dgd9dp_mz_kjxsc0000gn/T/-redacted-/pytest-0/test_uploads_each_file_and_met0/b.txt)
  uploaded metadata.jsonl (/private/var/folders/57/9t9kbd551dgd9dp_mz_kjxsc0000gn/T/-redacted-/pytest-0/test_uploads_each_file_and_met0/metadata.jsonl)
PASSED



class TestMain:
def test_dry_run_does_not_call_storage(self, tmp_path: Path):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test is noisy


client_cls.assert_not_called()

def test_happy_path_creates_bucket_and_uploads(self, tmp_path: Path):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test is noisy

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a M:1 relationship between buckets and datastore that is not exposed this current flow. Also a M:1 relationship between datastore(s) and crawled websites to the app.

This flow should probably also utilize the scripts/vertex_ai_*.py scripts as part of the debug/diagnosis/experimentation.

Comment thread backend/Makefile
Comment on lines +67 to +70
ifeq ($(strip $(GCS_BUCKET_NAME)),)
$(error GCS_BUCKET_NAME is required: make upload-to-gcs GCS_BUCKET_NAME=my-bucket)
endif
$(PYTHON) run python -m scripts.upload_to_gcs --bucket $(GCS_BUCKET_NAME) $(if $(LOCATION),--location $(LOCATION)) $(UPLOAD_OPTIONS)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would need a foreach to support multiple buckets

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think we can do multi-bucket support in a follow-up

Comment on lines +180 to +184
parser.add_argument(
"--bucket",
required=True,
help="GCS bucket holding metadata.jsonl and the document objects.",
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could have 1 or more buckets in the datastore

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can certainly think about multi-bucket support in a follow up PR, we could get the pipeline for single bucket set up here first

Comment thread backend/scripts/create_datastore_gcs.py Outdated
load_gcp_credentials,
)

DEFAULT_LOCATION = "us"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DRY ... should this just go in constants.py or get read from a corpus-cfg.TOML

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moving it to constants.py

Comment thread backend/scripts/create_datastore_gcs.py Outdated
Comment thread backend/scripts/create_app_gcs.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the case of multiple jurisdictions in a single datastore, I think those should probably be in separate buckets and have separate JSONL files. I think that would simplify the filtering in this script but also streamline the ingestion of different doc structures. This would be more flexible for adding additional jurisdictions in the future.

@leekahung
Copy link
Copy Markdown
Contributor Author

Thanks @yangm2 for the feedback! I'll have a go at the feedback and revise it

leekahung added 7 commits May 27, 2026 22:19
Replace stale create_vector_store.py references with the
upload-to-gcs/create-datastore-gcs flow and document the
vertex_ai_search debugging scripts.
Drop the generate-metadata entry point and call
scripts.generate_metadata_jsonl directly from the Makefile.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend Bot implementation and other backend concerns infrastructure Pull requests related to infrastructure and underlying workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Automate GCS bucket creation and split metadata generator from ASCII encoder

2 participants