[Enhancement] - Automate GCS bucket, datastore, and GCP app creation by leekahung · Pull Request #349 · codeforpdx/tenantfirstaid

leekahung · 2026-05-13T02:18:34Z

What type of PR is this? (check all applicable)

Description

This PR is a follow-up to #345, where bucket, datastore, and app creation can be made via automation scripts from the backend. The current flow is for file-based storage and deployment. Only users with GCP access can run this automated flow in CLI.

The creation process has been tested, see bucket for tenantfirstaid-05122026-test in GCS and the datastore for tenantfirstaid-test-fresh in AI Applications on GCP, and the app for tenantfirstaid-test-fresh in AI Applications on GCP.

Related Tickets & Documents

QA Instructions, Screenshots, Recordings

Please replace this line with instructions on how to test your changes, a note on the devices and browsers this has been tested on, as well as any relevant images for UI changes.

To test, you can run the following from the backend directory:

make generate-metadata GCS_BUCKET_NAME=<name-of-your-new-bucket>
make upload-to-gcs GCS_BUCKET_NAME=<name-of-your-new-bucket>
make create-datastore-gcs GCS_BUCKET_NAME=<name-of-your-new-bucket> DATASTORE_NAME=<name-of-your-new-datastore>
make create-app-gcs DATASTORE_ID=<name-of-your-new-datastore> APP_NAME=<name-of-your-new-app>

What each step does:

Creates the new metadata.jsonl reference (PR Include annotations for existing statutes and ASCII enforcement #345)
Creates a new bucket in GCS and uploads metadata.jsonl and the documents to the new bucket (this PR)
Creates a new datastore to and connects your bucket to the datastore (this PR)
Creates a new app on GCP for the datastore to connect to (this PR)

Added/updated tests?

Yes
No, and this is why: please replace this line with details on why tests have not been included
I need help with writing tests

Documentation

If this PR changes the system architecture, Architecture.md has been updated

[optional] Are there any post deployment tasks we need to perform?

…o its own module; Include new unit tests for functions

…tion; Include new unit tests to cover creation scripts; Update doucmentation new automated flow for bucket to connected app on GCP

leekahung · 2026-05-13T02:18:57Z

@claude pr-review

leekahung · 2026-05-13T02:40:25Z

@claude pr-reivew

…umentation

… import reconciliation to FULL; adjust unit tests

leekahung · 2026-05-17T23:19:33Z

@claude pr-review

…, add rollack timout, update docs

leekahung · 2026-05-17T23:48:06Z

@claude pr-review

claude · 2026-05-17T23:48:25Z

Claude finished @leekahung's task in 3m 23s —— View job

PR Review

Read CLAUDE.md and architecture docs
Review changed files (Makefile, scripts, tests)
Analyze code quality, security, performance
Check test coverage and quality
Review documentation updates
Post final review

Re-reviewing after the latest three pushes (96d4894 Simplifying comments, 07e7d82 Address PR review feedback and renamed DATASTORE_NAME to DATASTORE_ID, add rollback timeout, update docs, 60cf670 Additional fixes). All four items flagged last round are addressed — calling those out first, then a small set of new things.

Resolved since prior review

DATASTORE_NAME vs DATASTORE_ID (Format the output #1) — Normalized to DATASTORE_ID everywhere (backend/Makefile:74-86), --datastore-name → --datastore-id in create_datastore_gcs.py:186, and APP_NAME → APP_ID / --app-id for parallel consistency in create_app_gcs.py:78. Help text + console hint now read coherently end-to-end ("Your datastore ID is: ... Next: ... DATASTORE_ID=..." → "create-app-gcs ... APP_ID=...").
Rollback timeout (System prompt tweak: It shouldn't think that it's a lawyer THIS much lol. #2) — ROLLBACK_TIMEOUT_SECONDS = 120 defined at the module level (create_datastore_gcs.py:29) and threaded through delete_datastore as operation.result(timeout=timeout) (create_datastore_gcs.py:173). The existing cleanup-error test (test_import_failure_with_cleanup_error_reports_manual_step) covers the surrounding code path.
Search app vs datastore relationship (Frontend design tweaks #3) — Architecture.md:185 now explicitly states "The app is for browsing/previewing the datastore in the GCP console; the backend's LangChain VertexAISearchRetriever queries the datastore directly and does not depend on this app." Same content also added to corpus-dataset.md:55-58. This is the right place for it.
Leading-hyphen help text (File reorg and show feedback #4) — shared.py:11-15 error message now says "must begin with a letter or digit", and both flag helps (create_datastore_gcs.py:189, create_app_gcs.py:81) match. Self-explanatory failure now.
enforce_ascii double-read (carry-over Idea: A Vector database #5) — enforce_ascii.py:97-102 now reads once with path.read_text(encoding="utf-8") and catches UnicodeDecodeError from that single read. Resolved.
FULL reconciliation docstring (carry-over) — import_documents now has a docstring (create_datastore_gcs.py:120-127) explicitly stating "Callers must guarantee the datastore is freshly created (no pre-existing documents)" — exactly the invariant note the prior review asked for.

New / remaining findings

Lower impact / nits

_NAME_PATTERN permits trailing hyphens — backend/scripts/shared.py:6

^[a-z0-9][a-z0-9-]{0,62}$ accepts names like my-ds- (trailing hyphen). Discovery Engine documents the format as starting with a letter/digit but is silent on trailing hyphens in some places — empirically, the server rejects trailing-hyphen IDs with InvalidArgument. A tighter regex would catch it locally:
```
_NAME_PATTERN = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$")
```
Nice-to-have. Tests in test_shared.py:17 already cover the leading-hyphen case; a "trailing-" parameter would round out the constraint coverage. Fix this →
test_calls_delete_and_waits doesn't assert the timeout kwarg — backend/tests/test_create_datastore_gcs.py:181
```
operation.result.assert_called_once()
```
Since the whole point of the new code is to pass timeout=ROLLBACK_TIMEOUT_SECONDS, locking that in with assert_called_once_with(timeout=ROLLBACK_TIMEOUT_SECONDS) prevents a future refactor from silently dropping it. One-line tightening. Fix this →
metadata_blob content type application/jsonl — backend/scripts/upload_to_gcs.py:134 (carry-over)

Canonical media type for JSON Lines is application/x-ndjson. Functionally a no-op since Vertex AI Search reads by URI; pure pedantry, leaving for completeness.
No client-side --bucket validation — backend/scripts/upload_to_gcs.py:143-147

GCS bucket names have their own rules (3-63 chars, lowercase, dots/hyphens/digits/letters, no consecutive dots, no leading dash, etc.). The existing validate_resource_name is almost a subset (it lacks dots and the length floor), so reusing it isn't quite right. Letting GCS surface the error is a defensible choice — the messages are clear — but worth a one-line note in --bucket help text linking to GCS naming rules for self-service debugging.
bucket.location casing is unspecified at runtime — backend/scripts/create_datastore_gcs.py:59

The defensive (bucket.location or "").upper() is correct, and the parametrized test covers both upper-case forms (US, US-CENTRAL1, EU-WEST1, etc.). If GCS ever returns mixed-case strings, the upper-case before-comparison handles it. Good defensive code; flagging only because no test exercises lower-case bucket.location returns — could add ("us-east1", "us") to the compatible parametrize as a regression guard. Truly optional.

Documentation

.claude/skills/corpus-dataset.md has uncommitted local changes — appears as M in git status and the local working tree reverts the IAM/permissions block to the pre-Format the output #1-fix state. Just noting — if this is leftover scratch from a CI checkout it's a non-issue; if it's intentional, those uncommitted changes would undo the cross-project note that the previous review explicitly praised. Worth a quick git status confirmation before merge.

What I liked (this round)

The DATASTORE_NAME → DATASTORE_ID + APP_NAME → APP_ID rename is a small change touching many places, and it's done cleanly: error messages, Makefile help, --flag names, console output, and Architecture.md all read consistently as "ID" throughout. The footgun the prior review flagged is gone.
ROLLBACK_TIMEOUT_SECONDS = 120 as a module-level constant with the rationale in a comment (create_datastore_gcs.py:26-29) is exactly the right level of explanation — a future reader knows why the timeout exists, not just that it does.
The new import_documents docstring (create_datastore_gcs.py:121-127) documents the fresh-datastore invariant the FULL reconciliation mode depends on. This is the kind of caller-contract note that prevents subtle bugs years down the line.
The Architecture.md step 5 "for browsing/previewing... does not depend on this app" sentence is the right granularity — short, future-proof, and resolves the "why am I creating this?" question without overcommitting.
enforce_ascii.py is now a clean single-read path. Smaller blast radius for the I/O bug.

Summary

Two prior-round items (#1 naming, #2 rollback timeout) are fully resolved; the remaining items are pure polish. Nothing here blocks merge — the strongest candidates for one more pass are #1 (tighten the regex to reject trailing hyphens) and #2 (assert timeout= in the delete test). The uncommitted corpus-dataset.md change (#6) is worth a quick look before merging.

I did not run make lint / make typecheck / make test myself (sandbox restrictions in this CI environment), so please confirm those pass locally before merging.

• Branch: issue-347/automate-gcs-bucket-creation-uploads
•

yangm2

Thanks this is so much more reproducible and convenient than clicking through the cloud console interface!

Claude says these are experiments that we want to enable in order to improve RAG retrieval for the ORS/Portland/Eugene docs.

Impact	Experiment	Script	Arg change
Highest	Section-boundary chunking	generate_metadata_jsonl.py	--granularity [file\|section] — switches from one entry per .txt file to one entry per 90.XXX section; no Vertex AI API change needed
High	Cross-reference expansion	generate_metadata_jsonl.py	--expand-refs — resolves inline ORS X.XXX citations and appends target section text to each entry's struct_data at generation time
Medium	Chunk size / overlap	create_datastore_gcs.py	--chunk-size INT and --chunk-overlap INT — passed into DocumentProcessingConfig; pairs with --granularity file since section-level entries are already small
Medium	Annotations datastore separation	generate_metadata_jsonl.py + create_datastore_gcs.py	--exclude-annotations flag on metadata generation, run twice to produce two metadata.jsonl files; --datastore-id used twice to create two datastores; --datastore-id on create_app_gcs.py made repeatable (action="append") to attach both to one app

(@lee-coates this is what you've been advocating for, right?)

Additionally, my guess is that we'll end up with multiple RAG tools with unique settings for each, so reproducible deployment of the RAG should be driven by a revision-controlled configuration (e.g. TOML) file.

Speaking of, please update Deployment.md

yangm2 · 2026-05-27T18:54:43Z

+
+
+class TestUploadFiles:
+    def test_uploads_each_file_and_metadata(self, tmp_path: Path):


This test is noisy ...

tests/test_upload_to_gcs.py::TestUploadFiles::test_uploads_each_file_and_metadata uploaded a.txt (/private/var/folders/57/9t9kbd551dgd9dp_mz_kjxsc0000gn/T/-redacted-/pytest-0/test_uploads_each_file_and_met0/a.txt) uploaded b.txt (/private/var/folders/57/9t9kbd551dgd9dp_mz_kjxsc0000gn/T/-redacted-/pytest-0/test_uploads_each_file_and_met0/b.txt) uploaded metadata.jsonl (/private/var/folders/57/9t9kbd551dgd9dp_mz_kjxsc0000gn/T/-redacted-/pytest-0/test_uploads_each_file_and_met0/metadata.jsonl) PASSED

yangm2 · 2026-05-27T18:55:12Z

+
+
+class TestMain:
+    def test_dry_run_does_not_call_storage(self, tmp_path: Path):


this test is noisy

yangm2 · 2026-05-27T18:55:21Z

+
+        client_cls.assert_not_called()
+
+    def test_happy_path_creates_bucket_and_uploads(self, tmp_path: Path):


this test is noisy

yangm2 · 2026-05-27T19:38:46Z

There's a M:1 relationship between buckets and datastore that is not exposed this current flow. Also a M:1 relationship between datastore(s) and crawled websites to the app.

This flow should probably also utilize the scripts/vertex_ai_*.py scripts as part of the debug/diagnosis/experimentation.

yangm2 · 2026-05-27T19:43:36Z

+ifeq ($(strip $(GCS_BUCKET_NAME)),)
+	$(error GCS_BUCKET_NAME is required: make upload-to-gcs GCS_BUCKET_NAME=my-bucket)
+endif
+	$(PYTHON) run python -m scripts.upload_to_gcs --bucket $(GCS_BUCKET_NAME) $(if $(LOCATION),--location $(LOCATION)) $(UPLOAD_OPTIONS)


would need a foreach to support multiple buckets

Think we can do multi-bucket support in a follow-up

yangm2 · 2026-05-27T19:53:33Z

+    parser.add_argument(
+        "--bucket",
+        required=True,
+        help="GCS bucket holding metadata.jsonl and the document objects.",
+    )


Could have 1 or more buckets in the datastore

we can certainly think about multi-bucket support in a follow up PR, we could get the pipeline for single bucket set up here first

yangm2 · 2026-05-27T19:55:35Z

+    load_gcp_credentials,
+)
+
+DEFAULT_LOCATION = "us"


DRY ... should this just go in constants.py or get read from a corpus-cfg.TOML

moving it to constants.py

yangm2 · 2026-05-27T20:08:36Z

For the case of multiple jurisdictions in a single datastore, I think those should probably be in separate buckets and have separate JSONL files. I think that would simplify the filtering in this script but also streamline the ingestion of different doc structures. This would be more flexible for adding additional jurisdictions in the future.

leekahung · 2026-05-28T00:32:46Z

Thanks @yangm2 for the feedback! I'll have a go at the feedback and revise it

Replace stale create_vector_store.py references with the upload-to-gcs/create-datastore-gcs flow and document the vertex_ai_search debugging scripts.

Drop the generate-metadata entry point and call scripts.generate_metadata_jsonl directly from the Makefile.

…constant

leekahung added 3 commits May 11, 2026 20:21

Automate GCS bucket creation and uploads; Split ASCII enforcement int…

221e7b5

…o its own module; Include new unit tests for functions

Add Makefile targets and scripts for Vertex AI datastore and app crea…

fbd1f04

…tion; Include new unit tests to cover creation scripts; Update doucmentation new automated flow for bucket to connected app on GCP

Update architecture.md

fab02ae

leekahung requested a review from yangm2 May 13, 2026 02:18

leekahung added backend Bot implementation and other backend concerns infrastructure Pull requests related to infrastructure and underlying workflows labels May 13, 2026

leekahung marked this pull request as ready for review May 13, 2026 02:19

This comment was marked as outdated.

Sign in to view

Adjust script logic; update unit tests

ad3c9bb

This comment was marked as outdated.

Sign in to view

leekahung marked this pull request as draft May 13, 2026 02:49

leekahung added 4 commits May 13, 2026 01:54

Address feedback; create shared utils for gcs/gcp scripts; update doc…

1c78c8a

…umentation

Fixed formatting

6380822

Address review feedback; move collection_path to shared utils; switch…

2f74170

… import reconciliation to FULL; adjust unit tests

Resolve merge conflict

e64e9a8

leekahung force-pushed the issue-347/automate-gcs-bucket-creation-uploads branch from e2ced7b to e64e9a8 Compare May 17, 2026 23:18

leekahung marked this pull request as ready for review May 17, 2026 23:19

This comment was marked as outdated.

Sign in to view

leekahung added 3 commits May 17, 2026 16:23

Simplifying comments

96d4894

Address PR review feedback and renamed DATASTORE_NAME to DATASTORE_ID…

07e7d82

…, add rollack timout, update docs

Additional fixes

60cf670

leekahung added 3 commits May 17, 2026 17:30

Tighten resource name validation and lock in rollback timeout contract

9c3e54b

adjust script comment

f664aab

Fixed formatting

4d74f02

yangm2 requested changes May 27, 2026

View reviewed changes

yangm2 mentioned this pull request May 27, 2026

Enable Oregon Law Help RAG tool for family law queries #351

Open

11 tasks

leekahung added 7 commits May 27, 2026 22:19

Update corpus docs to reference GCS ingestion pipeline

72269a4

Replace stale create_vector_store.py references with the upload-to-gcs/create-datastore-gcs flow and document the vertex_ai_search debugging scripts.

Invoke generate-metadata via module instead of console script

ec0f010

Drop the generate-metadata entry point and call scripts.generate_metadata_jsonl directly from the Makefile.

Warn against deleting production app/datastore on name collision

d3019ab

Suppress stdout in upload_to_gcs tests

7666c6f

removing -s flag from pytest

936e37e

Centralize Vertex AI Search location default; rename bucket-location …

6cf8f85

…constant

Dedupe GCS script test mocks into shared helpers

f9ccd04



		class TestUploadFiles:
		def test_uploads_each_file_and_metadata(self, tmp_path: Path):



		class TestMain:
		def test_dry_run_does_not_call_storage(self, tmp_path: Path):


		client_cls.assert_not_called()

		def test_happy_path_creates_bucket_and_uploads(self, tmp_path: Path):

Conversation

leekahung commented May 13, 2026

What type of PR is this? (check all applicable)

Description

Related Tickets & Documents

QA Instructions, Screenshots, Recordings

Added/updated tests?

Documentation

[optional] Are there any post deployment tasks we need to perform?

Uh oh!

leekahung commented May 13, 2026

Uh oh!

This comment was marked as outdated.

leekahung commented May 13, 2026

Uh oh!

This comment was marked as outdated.

leekahung commented May 17, 2026

Uh oh!

This comment was marked as outdated.

leekahung commented May 17, 2026

Uh oh!

claude Bot commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review

Resolved since prior review

New / remaining findings

Lower impact / nits

Documentation

What I liked (this round)

Summary

Uh oh!

yangm2 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

leekahung commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude Bot commented May 17, 2026 •

edited

Loading

yangm2 left a comment •

edited

Loading