Skip to content

1655 dataset builder from define xml#1677

Open
DmitryMK wants to merge 8 commits intomainfrom
1655-dataset-builder-from-define-xml
Open

1655 dataset builder from define xml#1677
DmitryMK wants to merge 8 commits intomainfrom
1655-dataset-builder-from-define-xml

Conversation

@DmitryMK
Copy link
Copy Markdown
Collaborator

Addressing #1655

Updated 2 builders to keep define.xml information:

  • VariablesMetadataWithDefineAndLibraryDatasetBuilder
  • VariablesMetadataWithDefineDatasetBuilder

I have changed VariablesMetadataWithDefineDatasetBuilder to be consistent with the VariablesMetadataWithDefineAndLibraryDatasetBuilder and added a test for this builder:

  • Added Filling NAs with blanks
  • Added variable_has_empty_values and variable_is_empty, as logically we might want to run CG0015 without Library metadata, as define.xml has a role attribute and if variable is not in the library/following some standard which is not in the library, we can try to rely on the role values from define.xml instead of the library.

@gerrycampion @SFJohnson24 I have 2 questions:

  • In VariablesMetadataWithDefineDatasetBuilder I'm populating variable_has_empty_values and variable_is_empty which was not the case before and it might take a significant time on large data and slow down the engine.
  • When the information is coming from define.xml only, I'm keeping variable_name blank. In the report we can show define_variable_name for such cases. We can fill variable_name with define_variable_name, but I believe in this case we would need to add a new attribute like variable_is_present to identify the cases when a variable is present in define.xml only.
  • I had to made also some changes in test_rules_engine. I have updated variable name from TEST to TEST2 and added a mock for get_dataset in the test test_validate_variable_metadata_against_define_xml. The mock is needed because of the update in VariablesMetadataWithDefineDatasetBuilder, but I'm not sure if TEST was intentional variable name before at this line.

@SFJohnson24
Copy link
Copy Markdown
Collaborator

@DmitryMK there is a merge conflict--can you resolve it so I can review.

Copy link
Copy Markdown
Collaborator

@RamilCDISC RamilCDISC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please resolve the conflict on the branch too?

right_on="define_variable_name",
how="left",
how="outer",
).fillna("")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we make the Nan's a empty string "" we lose information if the columns was missing or it is just empty. I think better would be to handle this in a different way. Please let me know if I am not understanding the real intent here.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Ramil, I did it for consistency with metadata+define+library dataset builder, as it had fillna("") in it.

What is the difference between missing and empty column? In my understanding it either comes from dataset(metadata) or define. So the missing values logically appear only when it is present in define only. I might be wrong here.

@DmitryMK
Copy link
Copy Markdown
Collaborator Author

DmitryMK commented May 4, 2026

@SFJohnson24 @RamilCDISC I have merged the main branch and updated the tests accordingly.
I have one of the checks failing. It does not looks like it is caused by this PR. Do you know what it can be connected with.

@gerrycampion
Copy link
Copy Markdown
Collaborator

@SFJohnson24 @RamilCDISC I have merged the main branch and updated the tests accordingly. I have one of the checks failing. It does not looks like it is caused by this PR. Do you know what it can be connected with.

@DmitryMK this issue is resolved now

Copy link
Copy Markdown
Collaborator

@RamilCDISC RamilCDISC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For manual validation I ran the rule in the connected ticket description against negative dataset in the Sharepoint. The builder had an error. The trace is below.

python core.py validate -s sdtmig -v 3.4 -lr rule.yml -dp "/Users/ramilabbasov/Library/CloudStorage/OneDrive-CDISC/Documents - CORE Rules/unitTesting/SDTMIG/CG0015/negative/01/data/unit-test-coreid-CG0015-negative1.xlsx" -dxp "/Users/ramilabbasov/Library/CloudStorage/OneDrive-CDISC/Documents - CORE Rules/unitTesting/SDTMIG/CG0015/negative/01/data/define_CG0015-negative.xml"
Traceback (most recent call last):
  File "/Users/ramilabbasov/Desktop/ISSUES/venv/lib/python3.12/site-packages/pandas/io/parsers/base_parser.py", line 838, in _cast_types
    values = astype_array(values, cast_type, copy=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ramilabbasov/Desktop/ISSUES/venv/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 183, in astype_array
    values = _astype_nansafe(values, dtype, copy=copy)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ramilabbasov/Desktop/ISSUES/venv/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 134, in _astype_nansafe
    return arr.astype(dtype, copy=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: could not convert string to float: '2021-04-27'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/ramilabbasov/Desktop/ISSUES/ISS-1655/cdisc-rules-engine/core.py", line 1094, in <module>
    cli()
  File "/Users/ramilabbasov/Desktop/ISSUES/venv/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ramilabbasov/Desktop/ISSUES/venv/lib/python3.12/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/ramilabbasov/Desktop/ISSUES/venv/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ramilabbasov/Desktop/ISSUES/venv/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ramilabbasov/Desktop/ISSUES/venv/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ramilabbasov/Desktop/ISSUES/venv/lib/python3.12/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ramilabbasov/Desktop/ISSUES/ISS-1655/cdisc-rules-engine/core.py", line 669, in validate
    run_validation(
  File "/Users/ramilabbasov/Desktop/ISSUES/ISS-1655/cdisc-rules-engine/scripts/run_validation.py", line 180, in run_validation
    ).get_data_service(args.dataset_paths)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ramilabbasov/Desktop/ISSUES/ISS-1655/cdisc-rules-engine/cdisc_rules_engine/services/data_services/data_service_factory.py", line 89, in get_data_service
    return self.get_service(
           ^^^^^^^^^^^^^^^^^
  File "/Users/ramilabbasov/Desktop/ISSUES/ISS-1655/cdisc-rules-engine/cdisc_rules_engine/services/data_services/data_service_factory.py", line 147, in get_service
    return self._registered_services_map.get(service_name).get_instance(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ramilabbasov/Desktop/ISSUES/ISS-1655/cdisc-rules-engine/cdisc_rules_engine/services/data_services/excel_data_service.py", line 51, in get_instance
    service = cls(
              ^^^^
  File "/Users/ramilabbasov/Desktop/ISSUES/ISS-1655/cdisc-rules-engine/cdisc_rules_engine/services/data_services/excel_data_service.py", line 39, in __init__
    super(ExcelDataService, self).__init__(
  File "/Users/ramilabbasov/Desktop/ISSUES/ISS-1655/cdisc-rules-engine/cdisc_rules_engine/services/data_services/base_data_service.py", line 120, in __init__
    self._initialize_datasets_metadata(**kwargs)
  File "/Users/ramilabbasov/Desktop/ISSUES/ISS-1655/cdisc-rules-engine/cdisc_rules_engine/services/data_services/excel_data_service.py", line 178, in _initialize_datasets_metadata
    dataset = self.__get_dataset(dataset_filename)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ramilabbasov/Desktop/ISSUES/ISS-1655/cdisc-rules-engine/cdisc_rules_engine/services/data_services/excel_data_service.py", line 96, in __get_dataset
    dataframe = pd.read_excel(
                ^^^^^^^^^^^^^^
  File "/Users/ramilabbasov/Desktop/ISSUES/venv/lib/python3.12/site-packages/pandas/io/excel/_base.py", line 517, in read_excel
    data = io.parse(
           ^^^^^^^^^
  File "/Users/ramilabbasov/Desktop/ISSUES/venv/lib/python3.12/site-packages/pandas/io/excel/_base.py", line 1629, in parse
    return self._reader.parse(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/ramilabbasov/Desktop/ISSUES/venv/lib/python3.12/site-packages/pandas/io/excel/_base.py", line 931, in parse
    raise err
  File "/Users/ramilabbasov/Desktop/ISSUES/venv/lib/python3.12/site-packages/pandas/io/excel/_base.py", line 918, in parse
    output[asheetname] = parser.read(nrows=nrows)
                         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ramilabbasov/Desktop/ISSUES/venv/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1748, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ramilabbasov/Desktop/ISSUES/venv/lib/python3.12/site-packages/pandas/io/parsers/python_parser.py", line 289, in read
    conv_data = self._convert_data(data)
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ramilabbasov/Desktop/ISSUES/venv/lib/python3.12/site-packages/pandas/io/parsers/python_parser.py", line 360, in _convert_data
    return self._convert_to_ndarrays(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ramilabbasov/Desktop/ISSUES/venv/lib/python3.12/site-packages/pandas/io/parsers/base_parser.py", line 593, in _convert_to_ndarrays
    cvals = self._cast_types(cvals, cast_type, c)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ramilabbasov/Desktop/ISSUES/venv/lib/python3.12/site-packages/pandas/io/parsers/base_parser.py", line 840, in _cast_types
    raise ValueError(
ValueError: Unable to convert column SVENDTC to type float64 (sheet: sv.xpt)
((venv) ) ramilabbasov@Ramils-MacBook-Pro cdisc-rules-engine % 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow Rule Types that include define-xml to start the dataset builder from define-xml (and not only from dataset).

4 participants