[Enhancement] - Automate GCS bucket, datastore, and GCP app creation#349
[Enhancement] - Automate GCS bucket, datastore, and GCP app creation#349leekahung wants to merge 21 commits into
Conversation
…o its own module; Include new unit tests for functions
…tion; Include new unit tests to cover creation scripts; Update doucmentation new automated flow for bucket to connected app on GCP
|
@claude pr-review |
This comment was marked as outdated.
This comment was marked as outdated.
|
@claude pr-reivew |
This comment was marked as outdated.
This comment was marked as outdated.
… import reconciliation to FULL; adjust unit tests
e2ced7b to
e64e9a8
Compare
|
@claude pr-review |
This comment was marked as outdated.
This comment was marked as outdated.
…, add rollack timout, update docs
|
@claude pr-review |
|
Claude finished @leekahung's task in 3m 23s —— View job PR Review
Re-reviewing after the latest three pushes ( Resolved since prior review
New / remaining findingsLower impact / nits
Documentation
What I liked (this round)
SummaryTwo prior-round items ( I did not run • Branch: |
There was a problem hiding this comment.
Thanks this is so much more reproducible and convenient than clicking through the cloud console interface!
Claude says these are experiments that we want to enable in order to improve RAG retrieval for the ORS/Portland/Eugene docs.
| Impact | Experiment | Script | Arg change |
|---|---|---|---|
| Highest | Section-boundary chunking | generate_metadata_jsonl.py | --granularity [file|section] — switches from one entry per .txt file to one entry per 90.XXX section; no Vertex AI API change needed |
| High | Cross-reference expansion | generate_metadata_jsonl.py | --expand-refs — resolves inline ORS X.XXX citations and appends target section text to each entry's struct_data at generation time |
| Medium | Chunk size / overlap | create_datastore_gcs.py | --chunk-size INT and --chunk-overlap INT — passed into DocumentProcessingConfig; pairs with --granularity file since section-level entries are already small |
| Medium | Annotations datastore separation | generate_metadata_jsonl.py + create_datastore_gcs.py | --exclude-annotations flag on metadata generation, run twice to produce two metadata.jsonl files; --datastore-id used twice to create two datastores; --datastore-id on create_app_gcs.py made repeatable (action="append") to attach both to one app |
(@lee-coates this is what you've been advocating for, right?)
Additionally, my guess is that we'll end up with multiple RAG tools with unique settings for each, so reproducible deployment of the RAG should be driven by a revision-controlled configuration (e.g. TOML) file.
Speaking of, please update Deployment.md
|
|
||
|
|
||
| class TestUploadFiles: | ||
| def test_uploads_each_file_and_metadata(self, tmp_path: Path): |
There was a problem hiding this comment.
This test is noisy ...
tests/test_upload_to_gcs.py::TestUploadFiles::test_uploads_each_file_and_metadata uploaded a.txt (/private/var/folders/57/9t9kbd551dgd9dp_mz_kjxsc0000gn/T/-redacted-/pytest-0/test_uploads_each_file_and_met0/a.txt)
uploaded b.txt (/private/var/folders/57/9t9kbd551dgd9dp_mz_kjxsc0000gn/T/-redacted-/pytest-0/test_uploads_each_file_and_met0/b.txt)
uploaded metadata.jsonl (/private/var/folders/57/9t9kbd551dgd9dp_mz_kjxsc0000gn/T/-redacted-/pytest-0/test_uploads_each_file_and_met0/metadata.jsonl)
PASSED|
|
||
|
|
||
| class TestMain: | ||
| def test_dry_run_does_not_call_storage(self, tmp_path: Path): |
|
|
||
| client_cls.assert_not_called() | ||
|
|
||
| def test_happy_path_creates_bucket_and_uploads(self, tmp_path: Path): |
There was a problem hiding this comment.
There's a M:1 relationship between buckets and datastore that is not exposed this current flow. Also a M:1 relationship between datastore(s) and crawled websites to the app.
This flow should probably also utilize the scripts/vertex_ai_*.py scripts as part of the debug/diagnosis/experimentation.
| ifeq ($(strip $(GCS_BUCKET_NAME)),) | ||
| $(error GCS_BUCKET_NAME is required: make upload-to-gcs GCS_BUCKET_NAME=my-bucket) | ||
| endif | ||
| $(PYTHON) run python -m scripts.upload_to_gcs --bucket $(GCS_BUCKET_NAME) $(if $(LOCATION),--location $(LOCATION)) $(UPLOAD_OPTIONS) |
There was a problem hiding this comment.
would need a foreach to support multiple buckets
There was a problem hiding this comment.
Think we can do multi-bucket support in a follow-up
| parser.add_argument( | ||
| "--bucket", | ||
| required=True, | ||
| help="GCS bucket holding metadata.jsonl and the document objects.", | ||
| ) |
There was a problem hiding this comment.
Could have 1 or more buckets in the datastore
There was a problem hiding this comment.
we can certainly think about multi-bucket support in a follow up PR, we could get the pipeline for single bucket set up here first
| load_gcp_credentials, | ||
| ) | ||
|
|
||
| DEFAULT_LOCATION = "us" |
There was a problem hiding this comment.
DRY ... should this just go in constants.py or get read from a corpus-cfg.TOML
There was a problem hiding this comment.
moving it to constants.py
There was a problem hiding this comment.
For the case of multiple jurisdictions in a single datastore, I think those should probably be in separate buckets and have separate JSONL files. I think that would simplify the filtering in this script but also streamline the ingestion of different doc structures. This would be more flexible for adding additional jurisdictions in the future.
|
Thanks @yangm2 for the feedback! I'll have a go at the feedback and revise it |
Replace stale create_vector_store.py references with the upload-to-gcs/create-datastore-gcs flow and document the vertex_ai_search debugging scripts.
Drop the generate-metadata entry point and call scripts.generate_metadata_jsonl directly from the Makefile.
What type of PR is this? (check all applicable)
Description
This PR is a follow-up to #345, where bucket, datastore, and app creation can be made via automation scripts from the backend. The current flow is for file-based storage and deployment. Only users with GCP access can run this automated flow in CLI.
The creation process has been tested, see bucket for
tenantfirstaid-05122026-testin GCS and the datastore fortenantfirstaid-test-freshin AI Applications on GCP, and the app fortenantfirstaid-test-freshin AI Applications on GCP.Related Tickets & Documents
QA Instructions, Screenshots, Recordings
Please replace this line with instructions on how to test your changes, a note on the devices and browsers this has been tested on, as well as any relevant images for UI changes.
To test, you can run the following from the
backenddirectory:make generate-metadata GCS_BUCKET_NAME=<name-of-your-new-bucket>make upload-to-gcs GCS_BUCKET_NAME=<name-of-your-new-bucket>make create-datastore-gcs GCS_BUCKET_NAME=<name-of-your-new-bucket> DATASTORE_NAME=<name-of-your-new-datastore>make create-app-gcs DATASTORE_ID=<name-of-your-new-datastore> APP_NAME=<name-of-your-new-app>What each step does:
Added/updated tests?
Documentation
Architecture.mdhas been updated[optional] Are there any post deployment tasks we need to perform?