
Feature/i18n react i18next#261

Open
zhb-ai wants to merge 56 commits into dev from feature/i18n-react-i18next

Conversation

@zhb-ai
Collaborator

@zhb-ai zhb-ai commented Mar 22, 2026

No description provided.

… translation files and internationalization component integration

Add Chinese and English translation files covering the main functional modules of the application. Integrate react-i18next to implement internationalization, and modify components to support language switching. Main changes include:

- Add locales directory containing en/zh translation files
- Configure i18n initialization and language detection
- Modify components such as ChatDialog and DataThreadCards to use translations
- Add i18next-related dependencies to package.json
… support

- Add default prompt texts to Chinese and English translation files
- Modify chart recommendation box to use internationalized texts
- Update report view to use internationalized texts
…ish switching

Add a language switch button component to the application top bar, allowing users to switch the interface language between Chinese and English. Implemented using MUI's ToggleButtonGroup with appropriate style adjustments.
… and column names

Add a complete test framework structure, including unit tests, integration tests, and contract tests
Add test cases for Chinese table and column name handling, covering name processing logic at different levels
Add tests marked as known issues so fixes can be verified in the future
Add support for Excel file parsing, including adding xlrd dependency and test cases
Fix Chinese table name handling issues in various scenarios, remove markers for known issues
Add integration tests to verify the complete flow of Excel upload and Chinese table name processing
Update test fixtures and documentation
…me handling

Add a new file parsing API endpoint for handling legacy Excel files that cannot be directly parsed by the client. Meanwhile, unify and improve table name processing logic to support Unicode characters and fix known issues. Remove annotations marked as known issues in tests since the related features have been fixed.

- Add /api/tables/parse-file endpoint to process .xls files
- Unify table name processing logic across multiple modules, supporting Unicode characters such as Chinese
- Fix prefix handling when table names start with numbers
- Update frontend upload component to use the new parsing API
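The table-name rules above (preserve Unicode characters, prefix names that start with a digit) can be sketched roughly as follows. `sanitize_table_name` here is a hypothetical illustration, not the project's actual function, which is spread across several modules:

```python
import re

def sanitize_table_name(name: str) -> str:
    # Keep Unicode word characters (letters, digits, underscore, CJK);
    # collapse everything else into single underscores.
    cleaned = re.sub(r"[^\w]+", "_", name).strip("_")
    if not cleaned:
        cleaned = "table"
    # SQL identifiers must not start with a digit: add a prefix
    # instead of dropping the leading characters.
    if cleaned[0].isdigit():
        cleaned = "t_" + cleaned
    return cleaned
```

In Python 3, `\w` already matches Chinese and other Unicode word characters, so names like `销售数据` pass through unchanged.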
Add complete support for table metadata after XLS file upload, including:

- Automatically retrieve column information from dataframe when schema information is missing
- Automatically calculate row count from dataframe when row count information is missing
- Save complete column type information when creating tables
- Add integration tests to verify table list functionality after XLS upload
…e file handling logic

- Add drag-and-drop upload related states and event handling
- Refactor file handling logic into shared functions for use by both drag-and-drop and file selection
- Add visual feedback effects during drag-and-drop
…yles

Display different icons based on message type, and adjust button styles to reflect message severity. Remove unused style imports.
Increase right padding from 12px to 25px for better visual balance
Add translation content for field tooltip texts and encoding channel labels, including both Chinese and English versions
Implement tooltip functionality in field cards and encoding cards to display field sources and calculation descriptions
Add internationalization translation guidelines documentation explaining translation rules and considerations
…lated components

Uniformly adjust width values in EncodingShelfThread, VisualizationView, EncodingShelfCard, and EncodingBox components to improve layout and user experience
…ll agents

Add AVAILABLE_LANGUAGES configuration option and language switcher
Add language_instruction parameter to all agent constructors
Implement agent_language.py to build multi-language prompt fragments
Pass current UI language to backend via Accept-Language header
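A minimal sketch of what such a prompt-fragment builder in `agent_language.py` might look like (the function name, wording, and supported codes are assumptions, not the actual implementation):

```python
def build_language_instruction(lang):
    # Hypothetical builder for a multi-language prompt fragment.
    display_names = {"en": "English", "zh": "Chinese (Simplified)"}
    if not lang or lang not in display_names:
        return ""  # unknown or missing language: add no instruction
    return (
        f"Respond in {display_names[lang]}. "
        "Keep code, SQL identifiers, and column names unchanged."
    )
```

Returning an empty string for unrecognized codes keeps the default prompt untouched when the `Accept-Language` header is absent or unsupported.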
Remove unused field tooltip text and add detailed tooltip text for encoding channels. Update related components to use the new tooltip system and remove the old tooltips.
Uniformly increase width values across multiple components, including elements in EncodingShelfThread, VisualizationView, EncodingShelfCard, and EncodingBox, to optimize interface layout and user experience.
Update the TRANSLATION_GUIDE.md document with the following changes:

1. Rename the section "Tooltip Strategy for Non-Translatable Keywords" to "Tooltip Strategy for Encoding Channel Labels"
2. Simplify tooltip implementation instructions by removing fieldTooltip-related descriptions
3. Update example code to demonstrate the implementation of channel label tooltips
4. Adjust JSON file structure description by removing fieldTooltip-related entries
- Add new chart type translations to chart.json for both Chinese and English
- Add chart category tooltip text
- Implement chart name and category tooltip functionality in the EncodingShelfCard component
Enhance the translation guide documentation with detailed explanations for tooltip localization strategies for non-translatable UI labels:

1. Add a Core Principles section explaining why and how to use tooltips
2. Restructure encoding channel label explanations into subsections with implementation details
3. Add localization solutions for chart type names
4. Clarify applicable scenarios and limitations of the tooltip strategy
- Redesign session menu layout using button styles instead of plain text
- Add "Local File" category with export/import options to the menu
- Replace exit button icon with restart icon
- Add divider line to the top toolbar
- Update Chinese and English translation files by adding the "localFile" field
refactor(view components): Optimize column width calculation sampling and extract config slider component

- Change from random sampling to deterministic sampling for stable column widths
- Extract reusable config slider component to reduce code duplication
- Disable formula generation button when prompts are empty
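The deterministic-sampling idea above can be sketched as follows. This is a language-agnostic illustration in Python (the actual component is TypeScript); the key point is that the same input always yields the same sample, so computed column widths do not jitter between renders:

```python
def sample_rows(rows, k):
    # Deterministic stride sampling: pick k rows at evenly spaced
    # positions instead of random.sample, so repeated calls on the
    # same data return the same rows.
    if len(rows) <= k:
        return list(rows)
    step = len(rows) / k
    return [rows[int(i * step)] for i in range(k)]
```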
…scenario handling

Call the onError callback when data formulation fails, no results are returned, token mismatch occurs, or all candidates fail.
- Fix card column count calculation by using a more accurate formula for how many columns fit
- Remove unnecessary right margin styles
- Rename PANEL_PADDING to PANE_PADDING consistently
…ation

Improve scroll logic to smoothly follow content expansion during collapse animation, using requestAnimationFrame for smooth scrolling effects. Also adjust overflow styles to prevent horizontal scrollbar flickering.
- Unify row number column width to 56px and optimize style display
- Remove special handling logic for virtual tables
- A known issue remains: the row number column width still cannot be fixed.
Optimize visual details of the report creation interface, including:

- Adjust element spacing and padding
- Unify font sizes and colors
- Improve style consistency for buttons and labels
- Add Vitest testing framework configuration
- Add tests for data transformation, Redux selectors, and Excel parsing
- Update README with test directory structure and how to run tests
…mponents

- Add safe rendering logic to ensure object values (such as Date instances from Excel) are converted to strings before rendering
- Fix date and rich text value conversion in Excel file processing
- Add unit tests to verify safe rendering patterns
…erve non-ASCII characters

- Fix the issue where default json.dumps escapes non-ASCII characters, ensuring Chinese and other characters remain unchanged during serialization
- Add test cases to verify character preservation behavior in various scenarios
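The `ensure_ascii` behavior being fixed here is easy to demonstrate in isolation:

```python
import json

payload = {"table": "销售数据"}

# The default escapes non-ASCII characters into \uXXXX sequences...
escaped = json.dumps(payload)

# ...while ensure_ascii=False keeps the characters readable in the output.
preserved = json.dumps(payload, ensure_ascii=False)
```

Both forms deserialize to the same object; the fix only changes how the serialized text reads, which matters for prompts and stored session files.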
…ata functions

- Update error checking to handle various error statuses more robustly.
- Improve logging to provide clearer information on repair attempts and final statuses.
- Add exception handling during follow-up calls to prevent crashes and log errors appropriately.
- Ensure that error messages are sanitized before logging to maintain security.
- Add tests to ensure DuckDB prompts include non-ASCII identifier quoting rules
- Add tests to verify file manager table name handling logic
- Add integration tests to verify data repair loop logic
- Test error messages using sanitize_model_error processing
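The `sanitize_model_error` helper referenced above is not shown in this PR excerpt; a minimal sketch of that kind of redaction might look like the following (the regex and behavior are assumptions for illustration):

```python
import re

def sanitize_model_error(message):
    # Hypothetical sketch: redact anything that looks like an API key
    # before the error text is logged or returned to the client.
    return re.sub(r"sk-[A-Za-z0-9_-]{8,}", "[REDACTED]", message)
```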
- Add server-side model registry, supporting global model configuration via environment variables
- Frontend distinguishes between server-managed models and user-defined models, optimizing model selection interface
- Add model connectivity test API, supporting parallel status checks for multiple models
- Remove automatic testing logic, switch to on-demand manual testing
- Update i18n multilingual support, improve model management related text
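A sketch of an env-driven registry like the one described above might look as follows. The variable naming scheme (`DF_MODEL_*`) and the `provider:model` value format are assumptions for illustration, not the project's actual convention:

```python
import os

def load_global_models(env=None):
    # Hypothetical sketch: discover server-managed models from
    # environment variables, exposing only non-sensitive fields.
    env = os.environ if env is None else env
    models = []
    for key, value in sorted(env.items()):
        if not key.startswith("DF_MODEL_"):
            continue
        provider, _, model = value.partition(":")
        # API keys stay server-side; only id/provider/model are listed.
        models.append({"id": key, "provider": provider, "model": model})
    return models
```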
…y, and registry

Add three test files covering:

1. /list-global-models API endpoint returns correct model list without leaking sensitive information
2. Security features of global models including credential parsing and error message sanitization
3. ModelRegistry's model discovery functionality and security to ensure API keys are not leaked
Add detailed documentation about data directories, including directory structure and resolution order.
…l for improved navigation

- Introduced TopNavButton component for better navigation handling in the AppBar.
- Refactored AppFC to AppShell, integrating location-based logic for page selection.
- Enhanced AppBar with dynamic button rendering based on the current route.
- Improved layout and styling for a more cohesive user experience.
@zhb-ai
Collaborator Author

zhb-ai commented Mar 22, 2026

@microsoft-github-policy-service agree

…diagnostics

- Added `model_info` parameter to `DataRecAgent` and `DataTransformationAgent` for better model context handling.
- Updated `derive_data` and `refine_data` functions to pass model information to agents.
- Improved error handling and diagnostics reporting in agent responses, including detailed diagnostics in the frontend.
- Enhanced JSON spec parsing and output variable assignment checks to ensure correct variable usage in generated code.
- Adjusted `DataRecAgent` and `DataTransformationAgent` to insert language instructions before the execution environment marker, improving context relevance.
- Enhanced prompt construction to reduce recency-bias interference on chart-type selection by ensuring language instructions are positioned effectively.
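The "insert before the marker" step above can be sketched with a small helper. The name and the marker string are hypothetical; the point is that the instruction is placed just before a known prompt section rather than appended at the end, where recency bias could skew chart-type selection:

```python
def insert_instruction_before_marker(prompt, instruction, marker):
    # Hypothetical helper: splice the language instruction in just
    # before the execution-environment marker if it exists, otherwise
    # fall back to appending it.
    if marker in prompt:
        return prompt.replace(marker, instruction + "\n\n" + marker, 1)
    return prompt + "\n\n" + instruction
```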
…ion reset

- Refactored ResetDialog to change exit functionality to reset, updating state management and button actions accordingly.
- Updated i18n strings in English and Chinese to reflect the new reset terminology and warnings.
- Increased default formulate timeout from 30 to 60 seconds in ConfigDialog for improved user experience.
Contributor

Copilot AI left a comment


Pull request overview

This PR introduces React i18n support (react-i18next) and expands internationalization across the frontend, while also adding a substantial set of backend/frontend tests and improving Unicode handling and model configuration (including server-managed “global models”).

Changes:

  • Add i18next + react-i18next setup, locale resources (en/zh), and replace many hard-coded UI strings with t(...) calls.
  • Add global model registry + API support (server-managed models) and related backend tests; improve agent prompt language control and diagnostics payloads.
  • Add Vitest-based frontend unit tests and broaden Python test coverage around Unicode table-name sanitization, JSON serialization, and upload/parse flows (including legacy .xls parsing).

Reviewed changes

Copilot reviewed 116 out of 121 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
vitest.config.ts Adds Vitest configuration for frontend unit tests (jsdom + setup file).
tests/frontend/unit/views/safeCellRender.test.tsx Tests safe rendering of object/boolean cell values in React tables/grids.
tests/frontend/unit/views/checkIsLikelyTextOnlyModel.test.ts Tests heuristic for text-only model detection.
tests/frontend/unit/data/resolveExcelCellValue.test.ts Tests Excel cell value normalization for richText/hyperlink/formula/error.
tests/frontend/unit/data/coerceDate.test.ts Tests date coercion behavior for Date/null/strings/timestamps.
tests/frontend/unit/app/dfSelectors.test.ts Tests selector logic for active model selection.
tests/frontend/setup.ts Adds jest-dom matchers for Vitest.
tests/frontend/README.md Documents frontend test layout and Vitest commands.
tests/conftest.py Ensures py-src is importable for Python tests.
tests/backend/unit/test_workspace_fresh_names.py Tests workspace fresh-name generation with Unicode names.
tests/backend/unit/test_unicode_table_name_sanitization.py Tests Unicode preservation across multiple sanitizers.
tests/backend/unit/test_parquet_utils_table_names.py Tests parquet sanitizer for safety, casing, and Unicode.
tests/backend/unit/test_model_registry.py Tests env-var model registry loading + credential isolation.
tests/backend/unit/test_list_global_models_api.py Tests /api/agent/list-global-models response shape and secret redaction.
tests/backend/unit/test_json_chinese_serialization.py Regression tests for json.dumps(..., ensure_ascii=False) patterns.
tests/backend/unit/test_global_model_security.py Tests global model credential resolution + error sanitization properties.
tests/backend/unit/test_file_manager_table_names.py Tests file-manager table name sanitization with Unicode and edge cases.
tests/backend/unit/test_external_data_loader_table_names.py Tests external loader table name sanitization rules/limits.
tests/backend/unit/test_duckdb_notes_prompt.py Ensures DuckDB notes mention non-ASCII identifier quoting rule.
tests/backend/unit/test_client_image_strip.py Tests client retry logic and stripping image_url blocks for text-only models.
tests/backend/unit/test_agent_utils_sql_table_names.py Tests SQL sanitizer and DuckDB view creation with Unicode names.
tests/backend/unit/README.md Documents backend unit test conventions.
tests/backend/integration/test_parse_file_endpoint.py Integration tests for server-side parse-file endpoint.
tests/backend/integration/test_excel_fixture_parsing.py Ensures .xls fixture can be parsed by pandas.
tests/backend/integration/test_create_table_xls_upload.py End-to-end .xls upload flow tests (workspace + list-tables).
tests/backend/integration/README.md Documents backend integration testing scope.
tests/backend/fixtures/README.md Documents fixture directory usage.
tests/backend/contract/test_table_name_contracts.py Contract tests for route-level sanitization guarantees with Unicode.
tests/backend/contract/README.md Documents contract test intent for boundary stability.
tests/backend/README.md Documents backend test layering and recommended expansion order.
tests/README.md Documents overall test tree organization and commands.
src/views/SelectableDataGrid.tsx Adds i18n strings, fixed table layout/colgroup, safer cell rendering, UI tweaks.
src/views/ReactTable.tsx Prevents rendering object values directly by stringifying objects.
src/views/MultiTablePreview.tsx i18n for empty/preview/remove labels and rows×cols display.
src/views/MessageSnackbar.tsx Adds diagnostics viewer, i18n, and message-button severity indicator.
src/views/EncodingShelfThread.tsx Adjusts encoding shelf width.
src/views/EncodingBox.tsx i18n for channel labels/tips and several UI strings; minor refactors.
src/views/DataView.tsx Makes row sampling deterministic and standardizes row-id column sizing.
src/views/DataThreadCards.tsx Adds t(...) to table-card tooltips/aria labels for i18n.
src/views/DataLoadingThread.tsx i18n throughout, adds text-only model heuristic + safer preview formatting.
src/views/DataFormulator.tsx Uses unified model selector, i18n for landing/footer, changes model auto-select behavior.
src/views/DBTableManager.tsx i18n for DB manager UI strings and status messages.
src/views/ChatThreadView.tsx i18n for labels, accessibility improvements for collapsible rows.
src/views/ChatDialog.tsx i18n for dialog labels and buttons.
src/views/ChartifactDialog.tsx i18n for report title/footer text via i18n.t(...).
src/views/ChartRecBox.tsx i18n placeholders/tooltips and minor prompt helper refactor.
src/views/AgentRulesDialog.tsx i18n for rule dialog labels and buttons.
src/views/About.tsx i18n for feature descriptions and accessibility labels.
src/scss/DataView.scss Adjusts styling for row-id header cell sizing.
src/index.tsx Initializes i18n on app startup.
src/i18n/locales/zh/upload.json Adds Chinese strings for upload flow.
src/i18n/locales/zh/navigation.json Adds Chinese strings for navigation.
src/i18n/locales/zh/model.json Adds Chinese strings for model UI.
src/i18n/locales/zh/messages.json Adds Chinese strings for messages UI.
src/i18n/locales/zh/index.ts Aggregates zh locale modules.
src/i18n/locales/zh/encoding.json Adds Chinese strings for encoding UI.
src/i18n/locales/zh/chart.json Adds Chinese strings for chart UI.
src/i18n/locales/index.ts Exports en and zh locale bundles.
src/i18n/locales/en/upload.json Adds English strings for upload flow.
src/i18n/locales/en/navigation.json Adds English strings for navigation.
src/i18n/locales/en/model.json Adds English strings for model UI.
src/i18n/locales/en/messages.json Adds English strings for messages UI.
src/i18n/locales/en/index.ts Aggregates en locale modules.
src/i18n/locales/en/encoding.json Adds English strings for encoding UI.
src/i18n/locales/en/chart.json Adds English strings for chart UI.
src/i18n/index.ts Initializes i18next + language detection and registers resources.
src/data/utils.ts Adds resolveExcelCellValue() and uses it when reading Excel via ExcelJS.
src/data/types.ts Updates date coercion to convert Date objects to ISO strings.
src/app/utils.tsx Adds URLs, passes Accept-Language header, and exposes getAgentLanguage().
src/app/useFormulateData.ts Adds onError callbacks and attaches diagnostics payload to failure messages.
src/app/store.ts Blacklists globalModels from persistence to avoid stale server-managed models.
src/app/dfSlice.tsx Adds global models support, new thunks, selector changes, and status handling.
requirements.txt Adds xlrd dependency for legacy .xls parsing.
pytest.ini Defines test paths/markers and standardizes pytest discovery.
pyproject.toml Adds xlrd runtime dep and pytest dev dependency.
py-src/data_formulator/tables_routes.py Adds parse-file endpoint; improves metadata filling and list-tables fallback behavior.
py-src/data_formulator/sandbox/not_a_sandbox.py Improves error diagnostics for missing/incorrect output DataFrame variable.
py-src/data_formulator/sandbox/local_sandbox.py Adds diagnostics (DataFrame variable names) for sandbox execution results/errors.
py-src/data_formulator/sandbox/docker_sandbox.py Improves error message with DataFrame variable diagnostics.
py-src/data_formulator/model_registry.py Adds env-based global model registry with safe public listing.
py-src/data_formulator/datalake/workspace.py Ensures session JSON uses ensure_ascii=False for Unicode readability.
py-src/data_formulator/datalake/parquet_utils.py Updates table-name sanitization to preserve Unicode while staying safe.
py-src/data_formulator/datalake/file_manager.py Updates table-name sanitization to preserve Unicode while staying safe.
py-src/data_formulator/datalake/azure_blob_workspace.py Ensures session JSON uses ensure_ascii=False for Unicode readability.
py-src/data_formulator/data_loader/external_data_loader.py Improves sanitizer to preserve Unicode and normalize separators safely.
py-src/data_formulator/app.py Adds AVAILABLE_LANGUAGES to app config; refactors logging config.
py-src/data_formulator/agents/data_agent.py Adds language instruction injection into the system prompt.
py-src/data_formulator/agents/client_utils.py Adds image-block stripping + retry logic and a ping() helper.
py-src/data_formulator/agents/agent_utils_sql.py Updates SQL table-name sanitization for Unicode and safer prefixes.
py-src/data_formulator/agents/agent_utils.py Adds lenient JSON parsing and output-variable heuristic patch helper.
py-src/data_formulator/agents/agent_sort_data.py Uses ensure_ascii=False when serializing inputs for LLM prompts.
py-src/data_formulator/agents/agent_report_gen.py Adds language instruction injection for report generation prompts.
py-src/data_formulator/agents/agent_language.py Adds shared language-instruction builder for agent prompts.
py-src/data_formulator/agents/agent_interactive_explore.py Adds language instruction injection into system prompt.
py-src/data_formulator/agents/agent_data_load.py Adds language instruction injection into system prompt.
py-src/data_formulator/agents/agent_data_clean_stream.py Adds language instruction injection and uses ensure_ascii=False in stream output.
py-src/data_formulator/agents/agent_code_explanation.py Adds language instruction injection into system prompt.
py-src/data_formulator/agents/agent_chart_insight.py Adds language instruction injection into system prompt.
package.json Adds i18n deps and Vitest/testing-library tooling + test scripts.
.gitignore Ignores *.egg-info/ and tmp-docs/.
.env.template Documents LOG_LEVEL, DATA_FORMULATOR_HOME, and AVAILABLE_LANGUAGES.
Comments suppressed due to low confidence (1)

src/views/MessageSnackbar.tsx:133

  • This effect uses displayedMessageIdx and dispatch but the dependency array only includes messages. That can lead to stale reads of displayedMessageIdx (and will trip react-hooks/exhaustive-deps). Include displayedMessageIdx and dispatch in the dependency list (or refactor to derive the next index from messages.length if that’s the intended trigger).


Comment on lines +75 to +85
log_level_str = os.getenv("LOG_LEVEL", "INFO").strip().upper()
app_log_level = getattr(logging, log_level_str, logging.INFO)

logging.basicConfig(
    level=logging.ERROR,
    level=logging.WARNING,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[logging.StreamHandler(sys.stdout)]
)

# Enable INFO for agent modules so timing logs are visible
logging.getLogger('data_formulator.agents').setLevel(logging.INFO)

# Suppress verbose logging from third-party libraries
logging.getLogger('data_formulator').setLevel(app_log_level)

Copilot AI Mar 22, 2026

configure_logging() reads LOG_LEVEL, but logging.basicConfig(level=logging.WARNING, ...) sets the root logger to WARNING, so INFO/DEBUG logs from data_formulator will still be filtered out even if LOG_LEVEL=INFO/DEBUG. Set the root/handler level to app_log_level (or configure a dedicated handler for data_formulator) so the env var actually controls emitted log verbosity.

Collaborator Author

Thank you for the review, but this issue does not apply to the current code. The snippet referenced in your comment (with duplicate level=logging.ERROR / level=logging.WARNING parameters and a separate data_formulator.agents logger) appears to be from an intermediate revision and is not present in the final version.

In the current implementation, logging.basicConfig(level=logging.WARNING, ...) sets the root logger level to WARNING, while logging.getLogger('data_formulator').setLevel(app_log_level) explicitly sets our application logger to the level specified by the LOG_LEVEL env var. This works correctly because Python's log propagation (callHandlers) only checks handler levels when walking up the logger hierarchy — it does not re-check parent logger levels. Since the root StreamHandler defaults to NOTSET, all records that pass the originating logger's effective level check will be emitted. The root WARNING level only serves to suppress noisy INFO/DEBUG output from third-party libraries that don't have an explicit level override, which is the intended behavior.

No changes needed here.
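The propagation argument in this reply can be checked with a small standalone snippet (logger names here are illustrative, not the application's):

```python
import io
import logging

stream = io.StringIO()
root = logging.getLogger()
root.setLevel(logging.WARNING)            # root at WARNING, as basicConfig sets it
handler = logging.StreamHandler(stream)   # handler level defaults to NOTSET (0)
root.addHandler(handler)

app = logging.getLogger("data_formulator_demo")
app.setLevel(logging.INFO)                # explicit app-logger override

# Passes the app logger's level check; propagation to the root handler
# only checks the handler's level, not the root logger's level.
app.info("app info gets through")

# No explicit level here, so the effective level is root's WARNING.
logging.getLogger("third_party_demo").info("third-party info is filtered")
```

The first message reaches the handler and the second does not, matching the behavior described above.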

tests/README.md Outdated
Comment on lines +9 to +12
- `tests/frontend`
- frontend-related contract tests driven by `pytest`
- focused on request/response boundary behavior
- not intended for React component-level unit tests yet
Copilot AI Mar 22, 2026

This README says tests/frontend are pytest-driven contract tests and recommends running them via pytest tests/frontend, but this PR introduces Vitest + React Testing Library tests under tests/frontend/** and adds npm test/vitest scripts. Update this documentation to reflect the new frontend test runner and commands (and clarify what, if anything, remains pytest-driven under tests/frontend).

Collaborator Author

The root tests/README.md was outdated and contradicted the actual test setup:

It described tests/frontend/ as "pytest-driven contract tests," but all files there are .test.ts / .test.tsx running under Vitest + @testing-library/react.
It stated "not intended for React component-level unit tests yet," but safeCellRender.test.tsx is exactly that.
The suggested command pytest tests/frontend would find zero tests since there are no Python files in that directory.
Meanwhile, tests/frontend/README.md already correctly documented the Vitest setup — so the two READMEs directly contradicted each other.

Fixed by updating the root tests/README.md to accurately reflect the frontend test stack (Vitest), the correct commands (npm test / npm run test:watch), and a cross-reference to tests/frontend/README.md for layout details.
comment ID:
07fe32f

Comment on lines +83 to +114
    def _strip_image_blocks(self, content):
        """Remove image_url blocks from multimodal content arrays."""
        if isinstance(content, list):
            sanitized = []
            for item in content:
                if isinstance(item, dict):
                    if item.get("type") == "image_url":
                        continue
                    sanitized.append(item)
                else:
                    sanitized.append(item)
            return sanitized
        return content

    def _strip_images_from_messages(self, messages):
        """Create a copy of messages with image_url blocks removed."""
        sanitized_messages = []
        for msg in messages:
            if isinstance(msg, dict):
                new_msg = dict(msg)
                if "content" in new_msg:
                    new_msg["content"] = self._strip_image_blocks(new_msg["content"])
                sanitized_messages.append(new_msg)
            else:
                sanitized_messages.append(msg)
        return sanitized_messages

    def _is_image_deserialize_error(self, error_text: str) -> bool:
        """Detect provider errors caused by image blocks on text-only models."""
        lowered = error_text.lower()
        return ("image_url" in lowered and "expected `text`" in lowered) or "unknown variant `image_url`" in lowered

Copilot AI Mar 22, 2026

Client defines _strip_image_blocks, _strip_images_from_messages, and _is_image_deserialize_error twice (the second set starting here duplicates the earlier implementation). This duplication makes future changes error-prone (it’s easy to edit one copy and forget the other). Remove the duplicate method definitions and keep a single authoritative implementation.

Suggested change
(delete the duplicated definitions of `_strip_image_blocks`, `_strip_images_from_messages`, and `_is_image_deserialize_error`; the block is a verbatim copy of the implementation shown above)

Collaborator Author

Good catch — the three methods (_strip_image_blocks, _strip_images_from_messages, _is_image_deserialize_error) were indeed duplicated with identical implementations. This was a merge artifact. Removed the duplicate definitions, keeping a single authoritative copy. Fixed. fixed in comment ID 6c0da55

Comment on lines 203 to +221
@@ -208,7 +214,11 @@ def list_tables():
except Exception:
row_count = 0
if row_count is None:
row_count = 0
try:
df = workspace.read_data_as_df(table_name)
row_count = len(df)
except Exception:
row_count = 0
Copilot AI Mar 22, 2026

list_tables() falls back to workspace.read_data_as_df(table_name) to infer columns and row_count when metadata/schema is missing. This can load entire tables into memory (and it may happen twice: once for columns and once for row_count), which is expensive for large parquet tables and can degrade API latency. Prefer using parquet schema/row-count metadata only (or a cheap head/sample) and avoid full reads inside the listing endpoint.

Collaborator Author

Actually, this issue is more complex than described and requires a comprehensive solution. This will be included in the upcoming priority fix plan.

Comment on lines +533 to +534
logger.error(f"Error parsing file: {str(e)}")
return jsonify({"status": "error", "message": str(e)}), 400
Copilot AI Mar 22, 2026

parse_file() returns str(e) directly to the client on exceptions. This can leak internal details and results in inconsistent error messages vs other routes that sanitize DB/IO errors. Return a stable, user-safe message (and keep the full exception only in logs), ideally reusing the same sanitization pattern used elsewhere in this module.

Suggested change
-        logger.error(f"Error parsing file: {str(e)}")
-        return jsonify({"status": "error", "message": str(e)}), 400
+        # Log full exception details for diagnostics, but return a sanitized message to the client.
+        logger.error("Error parsing file", exc_info=True)
+        safe_msg, _status_code = sanitize_db_error_message(e)
+        return jsonify({"status": "error", "message": safe_msg}), 400

Collaborator Author

I looked into this carefully and here's my assessment:

The security risk here is minimal. The exceptions in parse_file() come from pandas parsing in-memory streams (pd.ExcelFile(file.stream), pd.read_csv(file.stream)), not from database or filesystem operations. The resulting error messages describe data format issues (e.g. "Excel file format cannot be determined", "Error tokenizing data") and don't leak server internals like file paths or credentials. Returning these messages to the client is intentional — they tell the user what's wrong with their uploaded file.

Using sanitize_db_error_message here would actually be worse:

- It's semantically designed for DB errors: none of its patterns match pandas parsing errors.
- The fallback (f"An unexpected error occurred: {error_msg}") still exposes the full str(e), so it doesn't actually sanitize anything.
- It would return status 500 instead of 400, which is incorrect, since a malformed upload is a client error.
What I did adopt: Changed the logger call to use exc_info=True for full stack traces in logs, which is genuinely better for diagnostics than just logging str(e).

fix: improve file parsing error logging in comment ID 9b73ce4

Use exc_info=True to record complete exception stack trace for easier troubleshooting
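The adopted pattern can be sketched framework-free as follows; `parse_upload` and its toy CSV handling are hypothetical stand-ins for the actual Flask route and pandas parsing:

```python
import logging

logger = logging.getLogger("file_parse")

def parse_upload(data: bytes):
    """Hypothetical stand-in for parse_file(): parse an uploaded
    file and return a (body, status) pair."""
    try:
        text = data.decode("utf-8")
        if not text.strip():
            # a data-format problem: safe to describe to the client
            raise ValueError("Empty file: no data to parse")
        rows = [line.split(",") for line in text.splitlines()]
        return {"status": "ok", "rows": len(rows)}, 200
    except Exception as e:
        # exc_info=True records the complete stack trace in the logs;
        # the message returned describes the bad upload, so 400
        # (client error) is the appropriate status.
        logger.error("Error parsing file", exc_info=True)
        return {"status": "error", "message": str(e)}, 400
```

The key point is the split of audiences: full diagnostics stay in the server log, while the client sees only the parser's data-format message.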

…mponent

- Replaced direct state selectors for selectedModelId and models with a single selector for activeModel.
- Updated report generation logic to utilize activeModel directly, improving clarity and reducing redundancy.
- Enhanced unit tests for dfSelectors to cover new globalModels handling and fallback scenarios.
…eld naming

Fix chart ID replacement logic in report view to ensure correct matching and usage of cached images. Optimize internationalization field naming by changing "count" to more explicit "totalRows". Also improve chart processing workflow by adding sequential ID mapping to ensure correct chart references during report generation.
Clean up no longer needed _strip_image_blocks and _strip_images_from_messages
methods as these features are no longer in use
Use exc_info=True to record complete exception stack trace for easier troubleshooting
…ure and commands

Update README.md in the test directory, including:
- More detailed test type descriptions
- Updated test commands
- Changed frontend testing tool from pytest to Vitest
- Added frontend test directory layout explanation
…s and JSON

Automatically request completion of missing parts when model only generates
JSON or code. Integrate this feature in DataRecAgent and DataTransformAgent.
Add logging and performance statistics.
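A minimal sketch of the completion check described above (the helper name and fence-based heuristic are assumptions, not the PR's actual implementation): the agent expects the model to return both a fenced JSON block and a fenced code block, and asks for a follow-up completion when either is missing.

```python
# Fences are built dynamically so this example contains no literal
# triple-backtick sequences of its own.
FENCE = "`" * 3
JSON_FENCE = FENCE + "json"
CODE_FENCE = FENCE + "python"

def missing_parts(response: str) -> list[str]:
    """Return which expected blocks are absent from a model response,
    so the agent knows what to request in the follow-up turn."""
    missing = []
    if JSON_FENCE not in response:
        missing.append("json")
    if CODE_FENCE not in response:
        missing.append("code")
    return missing
```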
xlrd is no longer maintained and has limited support for xlsx files.
Add openpyxl as an alternative to provide better Excel file support.
Add a toggle option in the configuration dialog to control whether AI
automatically generates data insights when creating new charts. Include
relevant i18n text and state management logic.
Add support for Unicode filenames while maintaining backward compatibility
with legacy formats. Introduce path traversal checks to prevent security
vulnerabilities, throwing exceptions when illegal paths are detected.
Fallback to legacy safe filename format if Unicode-named file does not exist.
…place existing implementation

Add safe_data_filename function for handling Unicode filenames, replacing the original secure_filename implementation

Update filename processing logic in multiple files to ensure support for Chinese and other Unicode characters

Also update Dockerfile to set LANG=C.UTF-8 environment variable to support Unicode filenames
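The idea behind safe_data_filename can be sketched like this (a simplified version; the real function's signature and rules may differ): preserve Unicode characters, strip directory components, and raise on path traversal.

```python
import re
import unicodedata
from pathlib import PurePosixPath

def safe_data_filename(filename: str) -> str:
    """Sketch of a Unicode-preserving replacement for secure_filename:
    keep Chinese and other Unicode characters, drop directory
    components, and reject path traversal attempts."""
    name = unicodedata.normalize("NFC", filename).replace("\\", "/")
    if name.startswith("/") or ".." in PurePosixPath(name).parts:
        raise ValueError(f"Illegal path in filename: {filename!r}")
    name = name.split("/")[-1]                       # drop directory components
    # replace characters unsafe on common filesystems, trim stray dots/spaces
    name = re.sub(r'[<>:"|?*\x00-\x1f]', "_", name).strip(". ")
    if not name:
        raise ValueError("Filename is empty after sanitization")
    return name
```

Unlike Werkzeug's secure_filename, which strips non-ASCII characters entirely, this keeps `数据表.xlsx` intact while still refusing `../etc/passwd`.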
…concurrent scenarios

Implement atomic metadata read-write operations to prevent lost update issues caused by concurrent modifications

- Add update_metadata function to provide atomic read-modify-write operations
- Add _atomic_update_metadata method for local and Azure Blob workspaces
- Refactor file upload logic to use atomic operations to ensure table name uniqueness
- Apply asynchronous processing to frontend table loading functions
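The local-workspace variant of the atomic read-modify-write can be sketched as follows (an assumed shape, POSIX-only via fcntl; the Azure Blob variant would rely on leases or ETag preconditions instead):

```python
import fcntl
import json
import os

def update_metadata(path: str, mutate):
    """Hold an exclusive lock across the whole read-modify-write cycle
    so two concurrent uploads cannot overwrite each other's changes."""
    # 'a+' creates the file if missing without truncating existing data
    with open(path, "a+", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # blocks until we own the file
        f.seek(0)
        raw = f.read()
        meta = mutate(json.loads(raw) if raw else {})
        f.seek(0)
        f.truncate()                    # rewrite in place, same inode
        json.dump(meta, f, ensure_ascii=False)
        f.flush()
        os.fsync(f.fileno())
        fcntl.flock(f, fcntl.LOCK_UN)
    return meta
```

A caller passes a pure function over the metadata, e.g. `update_metadata(p, lambda m: {**m, "tables": m.get("tables", []) + ["t1"]})`, which is what makes table-name uniqueness checks safe under concurrency.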
…BLAS

Preload heavy libraries like numpy and pandas before installing audit hooks to prevent BLAS-related libraries from being intercepted

Allow already loaded ctypes modules to be re-imported to support BLAS access for scipy/sklearn
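The ordering constraint can be illustrated with a small sketch (stdlib stand-ins replace numpy/pandas so the example stays runnable; the real hook presumably denies rather than records):

```python
import importlib
import sys

# Preload heavy native libraries BEFORE installing the audit hook, so
# their internal loading (e.g. BLAS via ctypes) is never intercepted.
PRELOADED = ("ctypes", "json")
for name in PRELOADED:
    importlib.import_module(name)

denied = []

def audit_hook(event, args):
    """Sketch of the policy: flag fresh ctypes imports, but allow
    modules loaded during the preload phase to be imported again."""
    if event == "import" and args[0] == "ctypes" and args[0] not in PRELOADED:
        denied.append(args[0])

sys.addaudithook(audit_hook)   # note: audit hooks cannot be uninstalled
```

Because ctypes is on the preload list, scipy/sklearn can re-import it later for BLAS access without tripping the hook.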
Implement batch table deletion by source filename for cleaning up old data when re-uploading files. Also optimize frontend table state management logic, add local table deletion method, and support skipping duplicate checks when replacing source files.
Collaborator

@Chenglong-MS Chenglong-MS left a comment


I think the main update to consider is to define an agentConfig / agentOptions to help manage shared configs for different agents; and diagnostic info can be wrapped separately to avoid duplication.

Collaborator


The diag component seems to have some duplicated logic in both agent_data_load and agent_data_rec. It might be good to use a separate class to manage diagnostic messages.

```python
    return separator.join(table_summaries)


def ensure_output_variable_in_code(code: str, output_variable: str) -> tuple[str, bool, str]:
```
Collaborator


I think a better solution is to:

  1. in the prompt, hint it to put the output variable on the last line, and
  2. ask the agent to also put the output variable in the output JSON object

So output-variable resolution prioritizes 2 if it is available, and otherwise falls back to the last-line variable (the typical notebook structure).

If neither exists, we just consider the execution failed and ask the agent to repair with the error message "output variable not specified."
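The proposed resolution order could be sketched like this (a hypothetical helper; the JSON field name `output_variable` is an assumption):

```python
import ast

def resolve_output_variable(code: str, json_out: dict) -> str:
    """Prefer the output variable the agent declared in its JSON object,
    otherwise fall back to the variable on the code's last line,
    otherwise fail so the agent is asked to repair."""
    if json_out.get("output_variable"):
        return json_out["output_variable"]
    tree = ast.parse(code)
    last = tree.body[-1] if tree.body else None
    if isinstance(last, ast.Assign) and isinstance(last.targets[0], ast.Name):
        return last.targets[0].id      # e.g. `result = df.groupby(...)`
    if isinstance(last, ast.Expr) and isinstance(last.value, ast.Name):
        return last.value.id           # bare `result` on the last line, notebook-style
    raise ValueError("output variable not specified.")
```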

Collaborator


Now that language instructions and repair attempts are becoming a little heavyweight to define for every agent, I'm wondering if we should just define an "agentConfig" class to pass this information to the agent, including:

  1. coding rules
  2. exploration rules
  3. max repair attempts
  4. max iterations
  5. language instructions
  6. ...

and we can extend properties later if needed.

Collaborator


I'm wondering, besides the manual cleaning here, is there any benefit to using libraries like https://github.com/kayak/pypika to deal with sanitization?

Collaborator


ChartGallery is temporary for backend visualization library testing and will be updated later... looks good for now.

…t updates, and file replacement scenarios

Add multiple test files to cover the following regression scenarios:

1. Resolve Chinese filename truncation issue
2. Prevent data loss caused by concurrent updates
3. Ensure proper cleanup of old tables during file replacement
4. Verify overwrite logic when uploading files with the same name
…K encoding support

Implement cross-platform file encoding handling, including:

1. Add readFileText function on frontend to handle UTF-8 and GBK encodings
…ion logic

- Add trusted encoding detection set, optimize GBK-first strategy
- Add integration tests to verify Chinese CSV file processing
- Improve encoding detection fallback chain, finally fallback to latin-1
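One possible shape for the fallback chain (a simplified sketch; the PR describes a GBK-first strategy guided by a trusted detector, which this omits):

```python
def read_text_with_fallback(data: bytes) -> tuple[str, str]:
    """Try UTF-8, then GBK for Chinese CSVs, and finally latin-1,
    which always succeeds because it maps every possible byte."""
    for enc in ("utf-8", "gbk"):
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return data.decode("latin-1"), "latin-1"
```

Trying UTF-8 before GBK is deliberate in this sketch: malformed UTF-8 almost always fails fast, whereas many UTF-8 byte sequences also happen to be valid GBK, so the reverse order would silently misdecode UTF-8 input.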