
Feature/i18n react i18next#261

Open
zhb-ai wants to merge 56 commits into dev from feature/i18n-react-i18next

Conversation

@zhb-ai
Collaborator

@zhb-ai zhb-ai commented Mar 22, 2026

No description provided.

… translation files and internationalization component integration

Add Chinese and English translation files covering the main functional modules of the application. Integrate react-i18next to implement internationalization, and modify components to support language switching. Main changes include:

- Add locales directory containing en/zh translation files
- Configure i18n initialization and language detection
- Modify components such as ChatDialog and DataThreadCards to use translations
- Add i18next-related dependencies to package.json
… support

- Add default prompt texts to Chinese and English translation files
- Modify chart recommendation box to use internationalized texts
- Update report view to use internationalized texts
…ish switching

Add a language switch button component to the application top bar, allowing users to switch the interface language between Chinese and English. Implemented using MUI's ToggleButtonGroup with appropriate style adjustments.
… and column names

Add a complete test framework structure, including unit tests, integration tests, and contract tests
Add test cases for Chinese table and column name handling, covering name processing logic at different levels
Add tests marked as known issues so fixes can be verified in the future
Add support for Excel file parsing, including adding xlrd dependency and test cases
Fix Chinese table name handling issues in various scenarios, remove markers for known issues
Add integration tests to verify the complete flow of Excel upload and Chinese table name processing
Update test fixtures and documentation
…me handling

Add a new file parsing API endpoint for handling legacy Excel files that cannot be directly parsed by the client. Meanwhile, unify and improve table name processing logic to support Unicode characters and fix known issues. Remove annotations marked as known issues in tests since the related features have been fixed.

- Add /api/tables/parse-file endpoint to process .xls files
- Unify table name processing logic across multiple modules, supporting Unicode characters such as Chinese
- Fix prefix handling when table names start with numbers
- Update frontend upload component to use the new parsing API
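The table-name rules above (preserve Unicode characters, prefix names that start with a digit) can be sketched roughly as follows. `sanitize_table_name` here is a hypothetical illustration, not the project's actual function, which is spread across several modules:

```python
import re

def sanitize_table_name(name: str) -> str:
    # Keep Unicode word characters (letters, digits, underscore, CJK);
    # collapse everything else into single underscores.
    cleaned = re.sub(r"[^\w]+", "_", name).strip("_")
    if not cleaned:
        cleaned = "table"
    # SQL identifiers must not start with a digit: add a prefix
    # instead of dropping the leading characters.
    if cleaned[0].isdigit():
        cleaned = "t_" + cleaned
    return cleaned
```

In Python 3, `\w` already matches Chinese and other Unicode word characters, so names like `销售数据` pass through unchanged.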
Add complete support for table metadata after XLS file upload, including:

- Automatically retrieve column information from dataframe when schema information is missing
- Automatically calculate row count from dataframe when row count information is missing
- Save complete column type information when creating tables
- Add integration tests to verify table list functionality after XLS upload
…e file handling logic

- Add drag-and-drop upload related states and event handling
- Refactor file handling logic into shared functions for use by both drag-and-drop and file selection
- Add visual feedback effects during drag-and-drop
…yles

Display different icons based on message type, and adjust button styles to reflect message severity. Remove unused style imports.
Increase right padding from 12px to 25px for better visual balance
Add translation content for field tooltip texts and encoding channel labels, including both Chinese and English versions
Implement tooltip functionality in field cards and encoding cards to display field sources and calculation descriptions
Add internationalization translation guidelines documentation explaining translation rules and considerations
…lated components

Uniformly adjust width values in EncodingShelfThread, VisualizationView, EncodingShelfCard, and EncodingBox components to improve layout and user experience
…ll agents

Add AVAILABLE_LANGUAGES configuration option and language switcher
Add language_instruction parameter to all agent constructors
Implement agent_language.py to build multi-language prompt fragments
Pass current UI language to backend via Accept-Language header
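A minimal sketch of what such a prompt-fragment builder in `agent_language.py` might look like (the function name, wording, and supported codes are assumptions, not the actual implementation):

```python
def build_language_instruction(lang):
    # Hypothetical builder for a multi-language prompt fragment.
    display_names = {"en": "English", "zh": "Chinese (Simplified)"}
    if not lang or lang not in display_names:
        return ""  # unknown or missing language: add no instruction
    return (
        f"Respond in {display_names[lang]}. "
        "Keep code, SQL identifiers, and column names unchanged."
    )
```

Returning an empty string for unrecognized codes keeps the default prompt untouched when the `Accept-Language` header is absent or unsupported.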
Remove unused field tooltip text and add detailed tooltip text for encoding channels. Update related components to use the new tooltip system and remove the old tooltips.
Uniformly increase width values across multiple components, including elements in EncodingShelfThread, VisualizationView, EncodingShelfCard, and EncodingBox, to optimize interface layout and user experience.
Update the TRANSLATION_GUIDE.md document with the following changes:

1. Rename the section "Tooltip Strategy for Non-Translatable Keywords" to "Tooltip Strategy for Encoding Channel Labels"
2. Simplify tooltip implementation instructions by removing fieldTooltip-related descriptions
3. Update example code to demonstrate the implementation of channel label tooltips
4. Adjust JSON file structure description by removing fieldTooltip-related entries
- Add new chart type translations to chart.json for both Chinese and English
- Add chart category tooltip text
- Implement chart name and category tooltip functionality in the EncodingShelfCard component
Enhance the translation guide documentation with detailed explanations for tooltip localization strategies for non-translatable UI labels:

1. Add a Core Principles section explaining why and how to use tooltips
2. Restructure encoding channel label explanations into subsections with implementation details
3. Add localization solutions for chart type names
4. Clarify applicable scenarios and limitations of the tooltip strategy
- Redesign session menu layout using button styles instead of plain text
- Add "Local File" category with export/import options to the menu
- Replace exit button icon with restart icon
- Add divider line to the top toolbar
- Update Chinese and English translation files by adding the "localFile" field
refactor(view components): Optimize column width calculation sampling and extract config slider component

- Change from random sampling to deterministic sampling for stable column widths
- Extract reusable config slider component to reduce code duplication
- Disable formula generation button when prompts are empty
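The deterministic-sampling idea above can be sketched as follows. This is a language-agnostic illustration in Python (the actual component is TypeScript); the key point is that the same input always yields the same sample, so computed column widths do not jitter between renders:

```python
def sample_rows(rows, k):
    # Deterministic stride sampling: pick k rows at evenly spaced
    # positions instead of random.sample, so repeated calls on the
    # same data return the same rows.
    if len(rows) <= k:
        return list(rows)
    step = len(rows) / k
    return [rows[int(i * step)] for i in range(k)]
```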
…scenario handling

Call the onError callback when data formulation fails, no results are returned, token mismatch occurs, or all candidates fail.
- Fix card column count calculation by using a more accurate formula for how many columns fit
- Remove unnecessary right margin styles
- Rename PANEL_PADDING to PANE_PADDING consistently
…ation

Improve scroll logic to smoothly follow content expansion during collapse animation, using requestAnimationFrame for smooth scrolling effects. Also adjust overflow styles to prevent horizontal scrollbar flickering.
- Unify row number column width to 56px and optimize style display
- Remove special handling logic for virtual tables
- A known issue remains: the row number column width still cannot be fixed.
Optimize visual details of the report creation interface, including:

- Adjust element spacing and padding
- Unify font sizes and colors
- Improve style consistency for buttons and labels
- Add Vitest testing framework configuration
- Add tests for data transformation, Redux selectors, and Excel parsing
- Update README with test directory structure and how to run tests
…mponents

- Add safe rendering logic to ensure object values (such as Date instances from Excel) are converted to strings before rendering
- Fix date and rich text value conversion in Excel file processing
- Add unit tests to verify safe rendering patterns
…erve non-ASCII characters

- Fix the issue where default json.dumps escapes non-ASCII characters, ensuring Chinese and other characters remain unchanged during serialization
- Add test cases to verify character preservation behavior in various scenarios
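The `ensure_ascii` behavior being fixed here is easy to demonstrate in isolation:

```python
import json

payload = {"table": "销售数据"}

# The default escapes non-ASCII characters into \uXXXX sequences...
escaped = json.dumps(payload)

# ...while ensure_ascii=False keeps the characters readable in the output.
preserved = json.dumps(payload, ensure_ascii=False)
```

Both forms deserialize to the same object; the fix only changes how the serialized text reads, which matters for prompts and stored session files.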
…ata functions

- Update error checking to handle various error statuses more robustly.
- Improve logging to provide clearer information on repair attempts and final statuses.
- Add exception handling during follow-up calls to prevent crashes and log errors appropriately.
- Ensure that error messages are sanitized before logging to maintain security.
- Add tests to ensure DuckDB prompts include non-ASCII identifier quoting rules
- Add tests to verify file manager table name handling logic
- Add integration tests to verify data repair loop logic
- Test error messages using sanitize_model_error processing
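The `sanitize_model_error` helper referenced above is not shown in this PR excerpt; a minimal sketch of that kind of redaction might look like the following (the regex and behavior are assumptions for illustration):

```python
import re

def sanitize_model_error(message):
    # Hypothetical sketch: redact anything that looks like an API key
    # before the error text is logged or returned to the client.
    return re.sub(r"sk-[A-Za-z0-9_-]{8,}", "[REDACTED]", message)
```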
- Add server-side model registry, supporting global model configuration via environment variables
- Frontend distinguishes between server-managed models and user-defined models, optimizing model selection interface
- Add model connectivity test API, supporting parallel status checks for multiple models
- Remove automatic testing logic, switch to on-demand manual testing
- Update i18n multilingual support, improve model management related text
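A sketch of an env-driven registry like the one described above might look as follows. The variable naming scheme (`DF_MODEL_*`) and the `provider:model` value format are assumptions for illustration, not the project's actual convention:

```python
import os

def load_global_models(env=None):
    # Hypothetical sketch: discover server-managed models from
    # environment variables, exposing only non-sensitive fields.
    env = os.environ if env is None else env
    models = []
    for key, value in sorted(env.items()):
        if not key.startswith("DF_MODEL_"):
            continue
        provider, _, model = value.partition(":")
        # API keys stay server-side; only id/provider/model are listed.
        models.append({"id": key, "provider": provider, "model": model})
    return models
```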
…y, and registry

Add three test files covering:

1. /list-global-models API endpoint returns correct model list without leaking sensitive information
2. Security features of global models including credential parsing and error message sanitization
3. ModelRegistry's model discovery functionality and security to ensure API keys are not leaked
Add detailed documentation about data directories, including directory structure and resolution order.
…l for improved navigation

- Introduced TopNavButton component for better navigation handling in the AppBar.
- Refactored AppFC to AppShell, integrating location-based logic for page selection.
- Enhanced AppBar with dynamic button rendering based on the current route.
- Improved layout and styling for a more cohesive user experience.
@zhb-ai
Collaborator Author

zhb-ai commented Mar 22, 2026

@microsoft-github-policy-service agree

…diagnostics

- Added `model_info` parameter to `DataRecAgent` and `DataTransformationAgent` for better model context handling.
- Updated `derive_data` and `refine_data` functions to pass model information to agents.
- Improved error handling and diagnostics reporting in agent responses, including detailed diagnostics in the frontend.
- Enhanced JSON spec parsing and output variable assignment checks to ensure correct variable usage in generated code.
- Adjusted `DataRecAgent` and `DataTransformationAgent` to insert language instructions before the execution environment marker, improving context relevance.
- Enhanced prompt construction to reduce recency-bias interference on chart-type selection by ensuring language instructions are positioned effectively.
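The "insert before the marker" step above can be sketched with a small helper. The name and the marker string are hypothetical; the point is that the instruction is placed just before a known prompt section rather than appended at the end, where recency bias could skew chart-type selection:

```python
def insert_instruction_before_marker(prompt, instruction, marker):
    # Hypothetical helper: splice the language instruction in just
    # before the execution-environment marker if it exists, otherwise
    # fall back to appending it.
    if marker in prompt:
        return prompt.replace(marker, instruction + "\n\n" + marker, 1)
    return prompt + "\n\n" + instruction
```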
…ion reset

- Refactored ResetDialog to change exit functionality to reset, updating state management and button actions accordingly.
- Updated i18n strings in English and Chinese to reflect the new reset terminology and warnings.
- Increased default formulate timeout from 30 to 60 seconds in ConfigDialog for improved user experience.
Contributor

Copilot AI left a comment


Pull request overview

This PR introduces React i18n support (react-i18next) and expands internationalization across the frontend, while also adding a substantial set of backend/frontend tests and improving Unicode handling and model configuration (including server-managed “global models”).

Changes:

  • Add i18next + react-i18next setup, locale resources (en/zh), and replace many hard-coded UI strings with t(...) calls.
  • Add global model registry + API support (server-managed models) and related backend tests; improve agent prompt language control and diagnostics payloads.
  • Add Vitest-based frontend unit tests and broaden Python test coverage around Unicode table-name sanitization, JSON serialization, and upload/parse flows (including legacy .xls parsing).

Reviewed changes

Copilot reviewed 116 out of 121 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
vitest.config.ts Adds Vitest configuration for frontend unit tests (jsdom + setup file).
tests/frontend/unit/views/safeCellRender.test.tsx Tests safe rendering of object/boolean cell values in React tables/grids.
tests/frontend/unit/views/checkIsLikelyTextOnlyModel.test.ts Tests heuristic for text-only model detection.
tests/frontend/unit/data/resolveExcelCellValue.test.ts Tests Excel cell value normalization for richText/hyperlink/formula/error.
tests/frontend/unit/data/coerceDate.test.ts Tests date coercion behavior for Date/null/strings/timestamps.
tests/frontend/unit/app/dfSelectors.test.ts Tests selector logic for active model selection.
tests/frontend/setup.ts Adds jest-dom matchers for Vitest.
tests/frontend/README.md Documents frontend test layout and Vitest commands.
tests/conftest.py Ensures py-src is importable for Python tests.
tests/backend/unit/test_workspace_fresh_names.py Tests workspace fresh-name generation with Unicode names.
tests/backend/unit/test_unicode_table_name_sanitization.py Tests Unicode preservation across multiple sanitizers.
tests/backend/unit/test_parquet_utils_table_names.py Tests parquet sanitizer for safety, casing, and Unicode.
tests/backend/unit/test_model_registry.py Tests env-var model registry loading + credential isolation.
tests/backend/unit/test_list_global_models_api.py Tests /api/agent/list-global-models response shape and secret redaction.
tests/backend/unit/test_json_chinese_serialization.py Regression tests for json.dumps(..., ensure_ascii=False) patterns.
tests/backend/unit/test_global_model_security.py Tests global model credential resolution + error sanitization properties.
tests/backend/unit/test_file_manager_table_names.py Tests file-manager table name sanitization with Unicode and edge cases.
tests/backend/unit/test_external_data_loader_table_names.py Tests external loader table name sanitization rules/limits.
tests/backend/unit/test_duckdb_notes_prompt.py Ensures DuckDB notes mention non-ASCII identifier quoting rule.
tests/backend/unit/test_client_image_strip.py Tests client retry logic and stripping image_url blocks for text-only models.
tests/backend/unit/test_agent_utils_sql_table_names.py Tests SQL sanitizer and DuckDB view creation with Unicode names.
tests/backend/unit/README.md Documents backend unit test conventions.
tests/backend/integration/test_parse_file_endpoint.py Integration tests for server-side parse-file endpoint.
tests/backend/integration/test_excel_fixture_parsing.py Ensures .xls fixture can be parsed by pandas.
tests/backend/integration/test_create_table_xls_upload.py End-to-end .xls upload flow tests (workspace + list-tables).
tests/backend/integration/README.md Documents backend integration testing scope.
tests/backend/fixtures/README.md Documents fixture directory usage.
tests/backend/contract/test_table_name_contracts.py Contract tests for route-level sanitization guarantees with Unicode.
tests/backend/contract/README.md Documents contract test intent for boundary stability.
tests/backend/README.md Documents backend test layering and recommended expansion order.
tests/README.md Documents overall test tree organization and commands.
src/views/SelectableDataGrid.tsx Adds i18n strings, fixed table layout/colgroup, safer cell rendering, UI tweaks.
src/views/ReactTable.tsx Prevents rendering object values directly by stringifying objects.
src/views/MultiTablePreview.tsx i18n for empty/preview/remove labels and rows×cols display.
src/views/MessageSnackbar.tsx Adds diagnostics viewer, i18n, and message-button severity indicator.
src/views/EncodingShelfThread.tsx Adjusts encoding shelf width.
src/views/EncodingBox.tsx i18n for channel labels/tips and several UI strings; minor refactors.
src/views/DataView.tsx Makes row sampling deterministic and standardizes row-id column sizing.
src/views/DataThreadCards.tsx Adds t(...) to table-card tooltips/aria labels for i18n.
src/views/DataLoadingThread.tsx i18n throughout, adds text-only model heuristic + safer preview formatting.
src/views/DataFormulator.tsx Uses unified model selector, i18n for landing/footer, changes model auto-select behavior.
src/views/DBTableManager.tsx i18n for DB manager UI strings and status messages.
src/views/ChatThreadView.tsx i18n for labels, accessibility improvements for collapsible rows.
src/views/ChatDialog.tsx i18n for dialog labels and buttons.
src/views/ChartifactDialog.tsx i18n for report title/footer text via i18n.t(...).
src/views/ChartRecBox.tsx i18n placeholders/tooltips and minor prompt helper refactor.
src/views/AgentRulesDialog.tsx i18n for rule dialog labels and buttons.
src/views/About.tsx i18n for feature descriptions and accessibility labels.
src/scss/DataView.scss Adjusts styling for row-id header cell sizing.
src/index.tsx Initializes i18n on app startup.
src/i18n/locales/zh/upload.json Adds Chinese strings for upload flow.
src/i18n/locales/zh/navigation.json Adds Chinese strings for navigation.
src/i18n/locales/zh/model.json Adds Chinese strings for model UI.
src/i18n/locales/zh/messages.json Adds Chinese strings for messages UI.
src/i18n/locales/zh/index.ts Aggregates zh locale modules.
src/i18n/locales/zh/encoding.json Adds Chinese strings for encoding UI.
src/i18n/locales/zh/chart.json Adds Chinese strings for chart UI.
src/i18n/locales/index.ts Exports en and zh locale bundles.
src/i18n/locales/en/upload.json Adds English strings for upload flow.
src/i18n/locales/en/navigation.json Adds English strings for navigation.
src/i18n/locales/en/model.json Adds English strings for model UI.
src/i18n/locales/en/messages.json Adds English strings for messages UI.
src/i18n/locales/en/index.ts Aggregates en locale modules.
src/i18n/locales/en/encoding.json Adds English strings for encoding UI.
src/i18n/locales/en/chart.json Adds English strings for chart UI.
src/i18n/index.ts Initializes i18next + language detection and registers resources.
src/data/utils.ts Adds resolveExcelCellValue() and uses it when reading Excel via ExcelJS.
src/data/types.ts Updates date coercion to convert Date objects to ISO strings.
src/app/utils.tsx Adds URLs, passes Accept-Language header, and exposes getAgentLanguage().
src/app/useFormulateData.ts Adds onError callbacks and attaches diagnostics payload to failure messages.
src/app/store.ts Blacklists globalModels from persistence to avoid stale server-managed models.
src/app/dfSlice.tsx Adds global models support, new thunks, selector changes, and status handling.
requirements.txt Adds xlrd dependency for legacy .xls parsing.
pytest.ini Defines test paths/markers and standardizes pytest discovery.
pyproject.toml Adds xlrd runtime dep and pytest dev dependency.
py-src/data_formulator/tables_routes.py Adds parse-file endpoint; improves metadata filling and list-tables fallback behavior.
py-src/data_formulator/sandbox/not_a_sandbox.py Improves error diagnostics for missing/incorrect output DataFrame variable.
py-src/data_formulator/sandbox/local_sandbox.py Adds diagnostics (DataFrame variable names) for sandbox execution results/errors.
py-src/data_formulator/sandbox/docker_sandbox.py Improves error message with DataFrame variable diagnostics.
py-src/data_formulator/model_registry.py Adds env-based global model registry with safe public listing.
py-src/data_formulator/datalake/workspace.py Ensures session JSON uses ensure_ascii=False for Unicode readability.
py-src/data_formulator/datalake/parquet_utils.py Updates table-name sanitization to preserve Unicode while staying safe.
py-src/data_formulator/datalake/file_manager.py Updates table-name sanitization to preserve Unicode while staying safe.
py-src/data_formulator/datalake/azure_blob_workspace.py Ensures session JSON uses ensure_ascii=False for Unicode readability.
py-src/data_formulator/data_loader/external_data_loader.py Improves sanitizer to preserve Unicode and normalize separators safely.
py-src/data_formulator/app.py Adds AVAILABLE_LANGUAGES to app config; refactors logging config.
py-src/data_formulator/agents/data_agent.py Adds language instruction injection into the system prompt.
py-src/data_formulator/agents/client_utils.py Adds image-block stripping + retry logic and a ping() helper.
py-src/data_formulator/agents/agent_utils_sql.py Updates SQL table-name sanitization for Unicode and safer prefixes.
py-src/data_formulator/agents/agent_utils.py Adds lenient JSON parsing and output-variable heuristic patch helper.
py-src/data_formulator/agents/agent_sort_data.py Uses ensure_ascii=False when serializing inputs for LLM prompts.
py-src/data_formulator/agents/agent_report_gen.py Adds language instruction injection for report generation prompts.
py-src/data_formulator/agents/agent_language.py Adds shared language-instruction builder for agent prompts.
py-src/data_formulator/agents/agent_interactive_explore.py Adds language instruction injection into system prompt.
py-src/data_formulator/agents/agent_data_load.py Adds language instruction injection into system prompt.
py-src/data_formulator/agents/agent_data_clean_stream.py Adds language instruction injection and uses ensure_ascii=False in stream output.
py-src/data_formulator/agents/agent_code_explanation.py Adds language instruction injection into system prompt.
py-src/data_formulator/agents/agent_chart_insight.py Adds language instruction injection into system prompt.
package.json Adds i18n deps and Vitest/testing-library tooling + test scripts.
.gitignore Ignores *.egg-info/ and tmp-docs/.
.env.template Documents LOG_LEVEL, DATA_FORMULATOR_HOME, and AVAILABLE_LANGUAGES.
Comments suppressed due to low confidence (1)

src/views/MessageSnackbar.tsx:133

  • This effect uses displayedMessageIdx and dispatch but the dependency array only includes messages. That can lead to stale reads of displayedMessageIdx (and will trip react-hooks/exhaustive-deps). Include displayedMessageIdx and dispatch in the dependency list (or refactor to derive the next index from messages.length if that’s the intended trigger).


Comment on lines +75 to +85
log_level_str = os.getenv("LOG_LEVEL", "INFO").strip().upper()
app_log_level = getattr(logging, log_level_str, logging.INFO)

logging.basicConfig(
    level=logging.ERROR,
    level=logging.WARNING,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[logging.StreamHandler(sys.stdout)]
)

# Enable INFO for agent modules so timing logs are visible
logging.getLogger('data_formulator.agents').setLevel(logging.INFO)

# Suppress verbose logging from third-party libraries
logging.getLogger('data_formulator').setLevel(app_log_level)

Copilot AI Mar 22, 2026

configure_logging() reads LOG_LEVEL, but logging.basicConfig(level=logging.WARNING, ...) sets the root logger to WARNING, so INFO/DEBUG logs from data_formulator will still be filtered out even if LOG_LEVEL=INFO/DEBUG. Set the root/handler level to app_log_level (or configure a dedicated handler for data_formulator) so the env var actually controls emitted log verbosity.

Collaborator Author

Thank you for the review, but this issue does not apply to the current code. The snippet referenced in your comment (with duplicate level=logging.ERROR / level=logging.WARNING parameters and a separate data_formulator.agents logger) appears to be from an intermediate revision and is not present in the final version.

In the current implementation, logging.basicConfig(level=logging.WARNING, ...) sets the root logger level to WARNING, while logging.getLogger('data_formulator').setLevel(app_log_level) explicitly sets our application logger to the level specified by the LOG_LEVEL env var. This works correctly because Python's log propagation (callHandlers) only checks handler levels when walking up the logger hierarchy — it does not re-check parent logger levels. Since the root StreamHandler defaults to NOTSET, all records that pass the originating logger's effective level check will be emitted. The root WARNING level only serves to suppress noisy INFO/DEBUG output from third-party libraries that don't have an explicit level override, which is the intended behavior.

No changes needed here.
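The propagation argument in this reply can be checked with a small standalone snippet (logger names here are illustrative, not the application's):

```python
import io
import logging

stream = io.StringIO()
root = logging.getLogger()
root.setLevel(logging.WARNING)            # root at WARNING, as basicConfig sets it
handler = logging.StreamHandler(stream)   # handler level defaults to NOTSET (0)
root.addHandler(handler)

app = logging.getLogger("data_formulator_demo")
app.setLevel(logging.INFO)                # explicit app-logger override

# Passes the app logger's level check; propagation to the root handler
# only checks the handler's level, not the root logger's level.
app.info("app info gets through")

# No explicit level here, so the effective level is root's WARNING.
logging.getLogger("third_party_demo").info("third-party info is filtered")
```

The first message reaches the handler and the second does not, matching the behavior described above.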

tests/README.md Outdated
Comment on lines +9 to +12
- `tests/frontend`
- frontend-related contract tests driven by `pytest`
- focused on request/response boundary behavior
- not intended for React component-level unit tests yet
Copilot AI Mar 22, 2026

This README says tests/frontend are pytest-driven contract tests and recommends running them via pytest tests/frontend, but this PR introduces Vitest + React Testing Library tests under tests/frontend/** and adds npm test/vitest scripts. Update this documentation to reflect the new frontend test runner and commands (and clarify what, if anything, remains pytest-driven under tests/frontend).

Collaborator Author

The root tests/README.md was outdated and contradicted the actual test setup:

It described tests/frontend/ as "pytest-driven contract tests," but all files there are .test.ts / .test.tsx running under Vitest + @testing-library/react.
It stated "not intended for React component-level unit tests yet," but safeCellRender.test.tsx is exactly that.
The suggested command pytest tests/frontend would find zero tests since there are no Python files in that directory.
Meanwhile, tests/frontend/README.md already correctly documented the Vitest setup — so the two READMEs directly contradicted each other.

Fixed by updating the root tests/README.md to accurately reflect the frontend test stack (Vitest), the correct commands (npm test / npm run test:watch), and a cross-reference to tests/frontend/README.md for layout details.
comment ID:
07fe32f

Comment on lines +83 to +114
    def _strip_image_blocks(self, content):
        """Remove image_url blocks from multimodal content arrays."""
        if isinstance(content, list):
            sanitized = []
            for item in content:
                if isinstance(item, dict):
                    if item.get("type") == "image_url":
                        continue
                    sanitized.append(item)
                else:
                    sanitized.append(item)
            return sanitized
        return content

    def _strip_images_from_messages(self, messages):
        """Create a copy of messages with image_url blocks removed."""
        sanitized_messages = []
        for msg in messages:
            if isinstance(msg, dict):
                new_msg = dict(msg)
                if "content" in new_msg:
                    new_msg["content"] = self._strip_image_blocks(new_msg["content"])
                sanitized_messages.append(new_msg)
            else:
                sanitized_messages.append(msg)
        return sanitized_messages

    def _is_image_deserialize_error(self, error_text: str) -> bool:
        """Detect provider errors caused by image blocks on text-only models."""
        lowered = error_text.lower()
        return ("image_url" in lowered and "expected `text`" in lowered) or "unknown variant `image_url`" in lowered

Copilot AI Mar 22, 2026

Client defines _strip_image_blocks, _strip_images_from_messages, and _is_image_deserialize_error twice (the second set starting here duplicates the earlier implementation). This duplication makes future changes error-prone (it’s easy to edit one copy and forget the other). Remove the duplicate method definitions and keep a single authoritative implementation.

Suggested change
(delete the duplicated definitions of `_strip_image_blocks`, `_strip_images_from_messages`, and `_is_image_deserialize_error`; the block is a verbatim copy of the implementation shown above)

Collaborator Author

Good catch — the three methods (_strip_image_blocks, _strip_images_from_messages, _is_image_deserialize_error) were indeed duplicated with identical implementations. This was a merge artifact. Removed the duplicate definitions, keeping a single authoritative copy. Fixed. fixed in comment ID 6c0da55

Comment on lines 203 to +221
@@ -208,7 +214,11 @@ def list_tables():
except Exception:
row_count = 0
if row_count is None:
row_count = 0
try:
df = workspace.read_data_as_df(table_name)
row_count = len(df)
except Exception:
row_count = 0
Copilot AI Mar 22, 2026

list_tables() falls back to workspace.read_data_as_df(table_name) to infer columns and row_count when metadata/schema is missing. This can load entire tables into memory (and it may happen twice: once for columns and once for row_count), which is expensive for large parquet tables and can degrade API latency. Prefer using parquet schema/row-count metadata only (or a cheap head/sample) and avoid full reads inside the listing endpoint.

Collaborator Author

Actually, this issue is more complex than described and requires a comprehensive solution. This will be included in the upcoming priority fix plan.

Comment on lines +533 to +534
logger.error(f"Error parsing file: {str(e)}")
return jsonify({"status": "error", "message": str(e)}), 400
Copilot AI Mar 22, 2026

parse_file() returns str(e) directly to the client on exceptions. This can leak internal details and results in inconsistent error messages vs other routes that sanitize DB/IO errors. Return a stable, user-safe message (and keep the full exception only in logs), ideally reusing the same sanitization pattern used elsewhere in this module.

Suggested change
-        logger.error(f"Error parsing file: {str(e)}")
-        return jsonify({"status": "error", "message": str(e)}), 400
+        # Log full exception details for diagnostics, but return a sanitized message to the client.
+        logger.error("Error parsing file", exc_info=True)
+        safe_msg, _status_code = sanitize_db_error_message(e)
+        return jsonify({"status": "error", "message": safe_msg}), 400

Collaborator Author

I looked into this carefully and here's my assessment:

The security risk here is minimal. The exceptions in parse_file() come from pandas parsing in-memory streams (pd.ExcelFile(file.stream), pd.read_csv(file.stream)), not from database or filesystem operations. The resulting error messages describe data format issues (e.g. "Excel file format cannot be determined", "Error tokenizing data") and don't leak server internals like file paths or credentials. Returning these messages to the client is intentional — they tell the user what's wrong with their uploaded file.

Using sanitize_db_error_message here would actually be worse:

- It's semantically designed for DB errors: none of its patterns match pandas parsing errors.
- The fallback (f"An unexpected error occurred: {error_msg}") still exposes the full str(e), so it doesn't actually sanitize anything.
- It would return status 500 instead of 400, which is incorrect, since a malformed upload is a client error.
What I did adopt: Changed the logger call to use exc_info=True for full stack traces in logs, which is genuinely better for diagnostics than just logging str(e).

fix: improve file parsing error logging in comment ID 9b73ce4

Use exc_info=True to record complete exception stack trace for easier troubleshooting
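The adopted pattern can be sketched framework-free as follows; `parse_upload` and its toy CSV handling are hypothetical stand-ins for the actual Flask route and pandas parsing:

```python
import logging

logger = logging.getLogger("file_parse")

def parse_upload(data: bytes):
    """Hypothetical stand-in for parse_file(): parse an uploaded
    file and return a (body, status) pair."""
    try:
        text = data.decode("utf-8")
        if not text.strip():
            # a data-format problem: safe to describe to the client
            raise ValueError("Empty file: no data to parse")
        rows = [line.split(",") for line in text.splitlines()]
        return {"status": "ok", "rows": len(rows)}, 200
    except Exception as e:
        # exc_info=True records the complete stack trace in the logs;
        # the message returned describes the bad upload, so 400
        # (client error) is the appropriate status.
        logger.error("Error parsing file", exc_info=True)
        return {"status": "error", "message": str(e)}, 400
```

The key point is the split of audiences: full diagnostics stay in the server log, while the client sees only the parser's data-format message.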

…mponent

- Replaced direct state selectors for selectedModelId and models with a single selector for activeModel.
- Updated report generation logic to utilize activeModel directly, improving clarity and reducing redundancy.
- Enhanced unit tests for dfSelectors to cover new globalModels handling and fallback scenarios.
…eld naming

Fix chart ID replacement logic in report view to ensure correct matching and usage of cached images. Optimize internationalization field naming by changing "count" to more explicit "totalRows". Also improve chart processing workflow by adding sequential ID mapping to ensure correct chart references during report generation.
Clean up no longer needed _strip_image_blocks and _strip_images_from_messages
methods as these features are no longer in use
Use exc_info=True to record complete exception stack trace for easier troubleshooting
…ure and commands

Update README.md in the test directory, including:
- More detailed test type descriptions
- Updated test commands
- Changed frontend testing tool from pytest to Vitest
- Added frontend test directory layout explanation
…s and JSON

Automatically request completion of missing parts when model only generates
JSON or code. Integrate this feature in DataRecAgent and DataTransformAgent.
Add logging and performance statistics.
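A minimal sketch of the completion check described above (the helper name and fence-based heuristic are assumptions, not the PR's actual implementation): the agent expects the model to return both a fenced JSON block and a fenced code block, and asks for a follow-up completion when either is missing.

```python
# Fences are built dynamically so this example contains no literal
# triple-backtick sequences of its own.
FENCE = "`" * 3
JSON_FENCE = FENCE + "json"
CODE_FENCE = FENCE + "python"

def missing_parts(response: str) -> list[str]:
    """Return which expected blocks are absent from a model response,
    so the agent knows what to request in the follow-up turn."""
    missing = []
    if JSON_FENCE not in response:
        missing.append("json")
    if CODE_FENCE not in response:
        missing.append("code")
    return missing
```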
xlrd is no longer maintained and has limited support for xlsx files.
Add openpyxl as an alternative to provide better Excel file support.
Add a toggle option in the configuration dialog to control whether AI
automatically generates data insights when creating new charts. Include
relevant i18n text and state management logic.
Add support for Unicode filenames while maintaining backward compatibility
with legacy formats. Introduce path traversal checks to prevent security
vulnerabilities, throwing exceptions when illegal paths are detected.
Fallback to legacy safe filename format if Unicode-named file does not exist.
…place existing implementation

Add safe_data_filename function for handling Unicode filenames, replacing the original secure_filename implementation

Update filename processing logic in multiple files to ensure support for Chinese and other Unicode characters

Also update Dockerfile to set LANG=C.UTF-8 environment variable to support Unicode filenames
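The idea behind safe_data_filename can be sketched like this (a simplified version; the real function's signature and rules may differ): preserve Unicode characters, strip directory components, and raise on path traversal.

```python
import re
import unicodedata
from pathlib import PurePosixPath

def safe_data_filename(filename: str) -> str:
    """Sketch of a Unicode-preserving replacement for secure_filename:
    keep Chinese and other Unicode characters, drop directory
    components, and reject path traversal attempts."""
    name = unicodedata.normalize("NFC", filename).replace("\\", "/")
    if name.startswith("/") or ".." in PurePosixPath(name).parts:
        raise ValueError(f"Illegal path in filename: {filename!r}")
    name = name.split("/")[-1]                       # drop directory components
    # replace characters unsafe on common filesystems, trim stray dots/spaces
    name = re.sub(r'[<>:"|?*\x00-\x1f]', "_", name).strip(". ")
    if not name:
        raise ValueError("Filename is empty after sanitization")
    return name
```

Unlike Werkzeug's secure_filename, which strips non-ASCII characters entirely, this keeps `数据表.xlsx` intact while still refusing `../etc/passwd`.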
…concurrent scenarios

Implement atomic metadata read-write operations to prevent lost update issues caused by concurrent modifications

- Add update_metadata function to provide atomic read-modify-write operations
- Add _atomic_update_metadata method for local and Azure Blob workspaces
- Refactor file upload logic to use atomic operations to ensure table name uniqueness
- Apply asynchronous processing to frontend table loading functions
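The local-workspace variant of the atomic read-modify-write can be sketched as follows (an assumed shape, POSIX-only via fcntl; the Azure Blob variant would rely on leases or ETag preconditions instead):

```python
import fcntl
import json
import os

def update_metadata(path: str, mutate):
    """Hold an exclusive lock across the whole read-modify-write cycle
    so two concurrent uploads cannot overwrite each other's changes."""
    # 'a+' creates the file if missing without truncating existing data
    with open(path, "a+", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # blocks until we own the file
        f.seek(0)
        raw = f.read()
        meta = mutate(json.loads(raw) if raw else {})
        f.seek(0)
        f.truncate()                    # rewrite in place, same inode
        json.dump(meta, f, ensure_ascii=False)
        f.flush()
        os.fsync(f.fileno())
        fcntl.flock(f, fcntl.LOCK_UN)
    return meta
```

A caller passes a pure function over the metadata, e.g. `update_metadata(p, lambda m: {**m, "tables": m.get("tables", []) + ["t1"]})`, which is what makes table-name uniqueness checks safe under concurrency.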
…BLAS

Preload heavy libraries like numpy and pandas before installing audit hooks to prevent BLAS-related libraries from being intercepted

Allow already loaded ctypes modules to be re-imported to support BLAS access for scipy/sklearn
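The ordering constraint can be illustrated with a small sketch (stdlib stand-ins replace numpy/pandas so the example stays runnable; the real hook presumably denies rather than records):

```python
import importlib
import sys

# Preload heavy native libraries BEFORE installing the audit hook, so
# their internal loading (e.g. BLAS via ctypes) is never intercepted.
PRELOADED = ("ctypes", "json")
for name in PRELOADED:
    importlib.import_module(name)

denied = []

def audit_hook(event, args):
    """Sketch of the policy: flag fresh ctypes imports, but allow
    modules loaded during the preload phase to be imported again."""
    if event == "import" and args[0] == "ctypes" and args[0] not in PRELOADED:
        denied.append(args[0])

sys.addaudithook(audit_hook)   # note: audit hooks cannot be uninstalled
```

Because ctypes is on the preload list, scipy/sklearn can re-import it later for BLAS access without tripping the hook.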
Implement batch table deletion by source filename for cleaning up old data when re-uploading files. Also optimize frontend table state management logic, add local table deletion method, and support skipping duplicate checks when replacing source files.
Collaborator

@Chenglong-MS Chenglong-MS left a comment


I think the main update to consider is to define an agentConfig / agentOptions to help manage shared configs for different agents; and diagnostic info can be wrapped separately to avoid duplication.

Collaborator


The diag component seems to have some duplicated logic in both agent_data_load and agent_data_rec. It might be good to use a separate class to manage diagnostic messages.

```python
    return separator.join(table_summaries)


def ensure_output_variable_in_code(code: str, output_variable: str) -> tuple[str, bool, str]:
```
Collaborator


I think a better solution is to:

  1. in the prompt, hint it to put the output variable on the last line, and
  2. ask the agent to also put the output variable in the output JSON object

So output-variable resolution prioritizes 2 if it is available, and otherwise falls back to the last-line variable (the typical notebook structure).

If neither exists, we just consider the execution failed and ask the agent to repair with the error message "output variable not specified."
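The proposed resolution order could be sketched like this (a hypothetical helper; the JSON field name `output_variable` is an assumption):

```python
import ast

def resolve_output_variable(code: str, json_out: dict) -> str:
    """Prefer the output variable the agent declared in its JSON object,
    otherwise fall back to the variable on the code's last line,
    otherwise fail so the agent is asked to repair."""
    if json_out.get("output_variable"):
        return json_out["output_variable"]
    tree = ast.parse(code)
    last = tree.body[-1] if tree.body else None
    if isinstance(last, ast.Assign) and isinstance(last.targets[0], ast.Name):
        return last.targets[0].id      # e.g. `result = df.groupby(...)`
    if isinstance(last, ast.Expr) and isinstance(last.value, ast.Name):
        return last.value.id           # bare `result` on the last line, notebook-style
    raise ValueError("output variable not specified.")
```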

Collaborator


Now that language instructions and repair attempts are becoming a little heavyweight to define for every agent, I'm wondering if we should just define an "agentConfig" class to pass this information to the agent, including:

  1. coding rules
  2. exploration rules
  3. max repair attempts
  4. max iterations
  5. language instructions
  6. ...

and we can extend properties later if needed.

Collaborator


I'm wondering, besides the manual cleaning here, is there any benefit to using libraries like https://github.com/kayak/pypika to deal with sanitization?

Collaborator


ChartGallery is temporary for backend visualization library testing and will be updated later... looks good for now.

…t updates, and file replacement scenarios

Add multiple test files to cover the following regression scenarios:

1. Resolve Chinese filename truncation issue
2. Prevent data loss caused by concurrent updates
3. Ensure proper cleanup of old tables during file replacement
4. Verify overwrite logic when uploading files with the same name
…K encoding support

Implement cross-platform file encoding handling, including:

1. Add readFileText function on frontend to handle UTF-8 and GBK encodings
…ion logic

- Add trusted encoding detection set, optimize GBK-first strategy
- Add integration tests to verify Chinese CSV file processing
- Improve encoding detection fallback chain, finally fallback to latin-1
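One possible shape for the fallback chain (a simplified sketch; the PR describes a GBK-first strategy guided by a trusted detector, which this omits):

```python
def read_text_with_fallback(data: bytes) -> tuple[str, str]:
    """Try UTF-8, then GBK for Chinese CSVs, and finally latin-1,
    which always succeeds because it maps every possible byte."""
    for enc in ("utf-8", "gbk"):
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return data.decode("latin-1"), "latin-1"
```

Trying UTF-8 before GBK is deliberate in this sketch: malformed UTF-8 almost always fails fast, whereas many UTF-8 byte sequences also happen to be valid GBK, so the reverse order would silently misdecode UTF-8 input.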