USE 306 - Incremental table read error handling #182
Merged
Purpose and background context
Primary git commit:
Why these changes are being introduced:
There are a couple of primary ways in which a read method can fail based on the table requested:

1. The requested table should always exist (e.g. `records` or `current_records`), so its absence indicates a problem with the dataset metadata.
2. The requested table may legitimately have no data yet (e.g. `embeddings` for a source that has not been embedded).

For the first, we want to raise an error immediately. For the second, there is a bit more nuance depending on the table requested.
How this addresses that need:
For `TIMDEXDataset.read_*` methods, we operate from the assumption that `records` and `current_records` should always be available, and so we raise an error indicating a metadata rebuild is required. But for `TIMDEXEmbeddings.read_*` methods, and potentially other data sources as added, we may legitimately not have data yet. As such, we'll want to log warnings and suggest a refresh, but just return an empty set.
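A minimal sketch of this split behavior (the method, helper, and exception names below are illustrative, not the actual TDA implementation):

```python
import logging

logger = logging.getLogger(__name__)


class DatasetMetadataError(Exception):
    """Missing core table: the dataset metadata must be (re)built."""


class TIMDEXDataset:
    def __init__(self, tables: set[str]):
        self.tables = tables  # stand-in for the real DuckDB catalog

    def read_records(self) -> list[dict]:
        # records / current_records should always exist, so a missing
        # table is an error, not an empty result
        if "records" not in self.tables:
            raise DatasetMetadataError(
                "Table 'records' not found; a metadata rebuild is required."
            )
        return []  # real implementation would query DuckDB here


class TIMDEXEmbeddings:
    def __init__(self, tables: set[str]):
        self.tables = tables

    def read_embeddings(self) -> list[dict]:
        # embeddings may legitimately not exist yet for a source:
        # warn, suggest a refresh, and return an empty set
        if "embeddings" not in self.tables:
            logger.warning(
                "Table 'embeddings' not found; returning an empty result. "
                "Refresh embeddings for this source if data is expected."
            )
            return []
        return []
```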
Additional commentary:

This is an incremental improvement to a problem that runs a little deeper.
The improvement right now is more explicit errors and logging. For example, TIM can now attempt to read embeddings for a source during the CLI command `reindex-source` without error, receiving just an empty set if they don't yet exist. We also raise meaningful errors for a brand new dataset, where, after the first records write, we need to build the metadata a first (and only) time.

Both of these edge cases point to a slightly deeper need: the creation of DuckDB schemas and tables explicitly, from clearly visible code in TDA, that does not require actual data to exist yet. If we do this, then read methods will naturally just return nothing if nothing exists, because the schemas / tables will always exist. This also serves the function of self-documenting what schemas/tables make up the dataset.
Currently, we rely on creating schemas/tables from pre-existing parquet data. This approach has served us well, but the dataset structure and patterns have matured enough that we could probably define these tables explicitly in code, solving these edge cases and better documenting the dataset structure.
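A rough sketch of that direction, using DuckDB DDL; the schema, table, and column names here are illustrative, not the actual TDA layout:

```python
import duckdb

conn = duckdb.connect()  # in-memory database for demonstration

# Define the schema and table up front, independent of any parquet data.
conn.execute("CREATE SCHEMA IF NOT EXISTS embeddings")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS embeddings.embeddings (
        timdex_record_id VARCHAR,
        embedding FLOAT[],
        run_timestamp TIMESTAMP
    )
    """
)

# Because the table is guaranteed to exist, reading a brand new dataset
# returns zero rows instead of raising a missing-table error.
rows = conn.execute("SELECT * FROM embeddings.embeddings").fetchall()
assert rows == []
```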
How can a reviewer manually see the effects of these changes?
1- Start ipython shell
2- Prepare a brand new dataset:
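Something along these lines (the import path and constructor signature are assumptions based on this PR's context, not a verified TDA reference):

```python
# assumed import path and constructor; adjust to the actual TDA API
from timdex_dataset_api import TIMDEXDataset

# point at an empty location so no parquet data or metadata exists yet
dataset = TIMDEXDataset(location="/tmp/brand-new-dataset")
```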
3- Records error handling:
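Reading records before the first metadata build should raise immediately; a sketch, with illustrative method names:

```python
# with no metadata built yet, a records read should raise, not return empty
try:
    records = list(dataset.read_dicts_iter())
except Exception as exc:
    print(exc)  # expect a message indicating a metadata rebuild is required
```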
4- Embeddings error handling:
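Reading embeddings should log a warning and return an empty set rather than raise; the access path shown is an assumption:

```python
# embeddings may legitimately not exist yet: expect a logged warning
# suggesting a refresh, and an empty result instead of an error
embeddings = list(dataset.embeddings.read_dicts_iter())
assert embeddings == []
```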
Includes new or updated dependencies?
NO
Changes expectations for external applications?
YES:
The `reindex-source` CLI command will no longer encounter an error if embeddings don't yet exist.

What are the relevant tickets?
Code review