Closed
Commits
48 commits
30fd2cf
Draft new ThermoML-like data entry type
mattwthompson Feb 2, 2026
88e5986
Port most of Evaluator's ThermoML parsing
mattwthompson Feb 2, 2026
dd1d5cd
Port more code
mattwthompson Feb 3, 2026
060a93d
Port more code
mattwthompson Feb 3, 2026
aad037a
Progress towards passing tests
mattwthompson Feb 3, 2026
efcb193
Closer to passing tests
mattwthompson Feb 3, 2026
cfcd177
More directly port over more Evaluator code
mattwthompson Feb 4, 2026
1eefb8a
Try using new configs?
mattwthompson Feb 4, 2026
d12e8c8
Merge remote-tracking branch 'upstream/main' into data-entry-types
mattwthompson Feb 12, 2026
fcd4c7b
Refactor ThermoML parsing
mattwthompson Feb 12, 2026
59227d0
Half of single-property parsing tests are passing
mattwthompson Feb 12, 2026
5b4c1b3
Fix dielectric test
mattwthompson Feb 12, 2026
1af1ae7
Fix most properties
mattwthompson Feb 16, 2026
ff616c4
Fix test(s), add vapor pressure entry
mattwthompson Feb 16, 2026
df8c6f3
Remove debug code / make pylint happy
mattwthompson Feb 16, 2026
5cb30f2
RDKit is required
mattwthompson Feb 18, 2026
c2c991b
Remove some dead code
mattwthompson Feb 18, 2026
d799d16
Copy Evaluator license
mattwthompson Feb 18, 2026
6e74864
Remove some dead plugin and serialization code
mattwthompson Feb 18, 2026
1496f28
Drop `ThermodynamicState`, `UNDEFINED`
mattwthompson Feb 19, 2026
9783036
Remove most object-oriented complexity
mattwthompson Feb 19, 2026
21874e4
Update dimsim/datasets/thermoml/thermoml.py
mattwthompson Feb 25, 2026
9d40b8a
Merge remote-tracking branch 'upstream/main' into data-entry-types
mattwthompson Feb 25, 2026
54266aa
Use better class-mapping strategy
mattwthompson Feb 25, 2026
59a6029
Revert "Remove some dead plugin and serialization code"
mattwthompson Feb 25, 2026
c3915ac
No tags
mattwthompson Feb 25, 2026
c417b33
Remove some un-used code
mattwthompson Feb 25, 2026
cf26732
Start adding data schema
mattwthompson Feb 27, 2026
ee7e926
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 27, 2026
6d23b88
Lint
mattwthompson Mar 3, 2026
cf82a0a
Remove re-export
mattwthompson Mar 3, 2026
c0d1f35
Drop unused type
mattwthompson Mar 3, 2026
ad6461c
Add file
mattwthompson Mar 3, 2026
c71c7f3
Remove duplicate entry definition
mattwthompson Mar 3, 2026
0939c1e
Fix property phase definition
mattwthompson Mar 3, 2026
74fe411
Move phase into new file for import safety
mattwthompson Mar 3, 2026
4c66618
Fix bad source handling
mattwthompson Mar 3, 2026
e85bf69
Merge remote-tracking branch 'upstream/main' into data-entry-types
mattwthompson Mar 3, 2026
e774424
Add back "tags" in entries
mattwthompson Mar 3, 2026
7fc69bb
Move properties around
mattwthompson Mar 3, 2026
8df3285
Clean up processing of entry types
mattwthompson Mar 3, 2026
6371280
Do not use subclasses for entry types
mattwthompson Mar 3, 2026
5589fa7
Drop properties, fix type-checking
mattwthompson Mar 4, 2026
578e710
Update Pandas export
mattwthompson Mar 4, 2026
3a847c1
Update Pandas export, promote id
mattwthompson Mar 4, 2026
5f25c6b
Fix type-checking
mattwthompson Mar 4, 2026
49de48e
Test basic Pandas roundtrip
mattwthompson Mar 4, 2026
1e4c0cf
Update dataframe import
mattwthompson Mar 4, 2026
2 changes: 1 addition & 1 deletion .github/workflows/gh-ci.yaml
@@ -33,7 +33,7 @@ jobs:
  matrix:
  os: [macOS-latest, ubuntu-latest]
  python-version: ["3.11", "3.12"]
- include-rdkit: [false, true]
+ include-rdkit: [true]
  include-openeye: [false, true]
  exclude:
  - include-rdkit: false
23 changes: 23 additions & 0 deletions LICENSE-3RD-PARTY
@@ -0,0 +1,23 @@
=============================== OpenFF Evaluator ===============================

MIT License

Copyright (c) 2019 Open Force Field Consortium

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
2 changes: 2 additions & 0 deletions README.md
@@ -70,6 +70,8 @@ Contributions are welcome! Please feel free to submit a Pull Request.

  This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

+ This project derives from other projects. See [LICENSE-3RD-PARTY](LICENSE-3RD-PARTY) for details.
+
  ## Authors

  - Lily Wang
6 changes: 3 additions & 3 deletions devtools/conda-envs/dev.yaml
@@ -8,6 +8,7 @@ dependencies:
  # Base dependencies
  - pydantic >=2.0
  - pyyaml
+ - pandas =2

  # Core dependencies
  - openff-toolkit >=0.18
@@ -47,6 +48,5 @@ dependencies:
  # Typing
  - mypy
  - types-PyYAML
-
- - pip:
- - git+https://github.com/openforcefield/openff-sphinx-theme.git@main
+ - types-requests
+ - types-python-dateutil
3 changes: 3 additions & 0 deletions devtools/conda-envs/test.yaml
@@ -9,6 +9,7 @@ dependencies:
  # Base dependencies
  - pydantic >=2.0
  - pyyaml
+ - pandas =2

  # Core dependencies
  - openff-toolkit
@@ -47,3 +48,5 @@ dependencies:
  # Typing
  - mypy
  - types-PyYAML
+ - types-requests
+ - types-python-dateutil
167 changes: 108 additions & 59 deletions dimsim/_tests/datasets/test_thermoml.py
@@ -1,38 +1,33 @@
import numpy as np
import uuid

import numpy
import pytest
from openff.toolkit import Molecule
from openff.units import Unit

from dimsim._tests.utils import get_test_data_path
from dimsim.datasets.thermoml import (
thermoml_dataset_from_doi,
thermoml_dataset_from_xml,
)
from dimsim.datasets.thermoml.thermoml import ThermoMLDataSet


@pytest.mark.skip(reason="Not implemented yet")
@pytest.mark.parametrize(
"filename, expected",
[
(
"single_density.xml",
{
"type": "density",
"x_a": 1.0,
"x_b": None,
"x": [1.0],
"temperature": 293.15,
"pressure": 1.0,
"value": 0.96488,
"std": 0.05,
"std": 0.00005,
"units": "g/mL",
"source": "",
},
),
(
"single_dhmix.xml",
{
"type": "dhmix",
"x_a": 0.219,
"x_b": 0.781,
"x": [0.219, 0.781],
"temperature": 298.15,
"pressure": 0.997,
"value": 0.03021,
@@ -44,9 +39,7 @@
(
"single_dhvap.xml",
{
"type": "dhvap",
"x_a": 1.0,
"x_b": None,
"x": [1.0],
"temperature": 298.15,
"pressure": None,
"value": 10.51625,
@@ -58,10 +51,8 @@
(
"single_dielectric.xml",
{
"type": "dielectric_constant",
"x_a": 1.0,
"x_b": None,
"temperature": 298.15,
"x": [1.0],
"temperature": 293.15,
"pressure": 0.997,
"value": 11.76,
"std": 0.02,
@@ -71,38 +62,54 @@
),
],
)
def test_load_property_types(filename: str, expected: dict):
"""Test loading a single data type from a ThermoML XML file"""
dataset = thermoml_dataset_from_xml(get_test_data_path(f"thermoml/{filename}"))
assert len(dataset) == 1
class TestThermoMLDataset:
"""Class set up only to make convenient re-use of parametrized test cases."""

def test_load_property_types(self, filename: str, expected: dict):
"""Test loading a single data type from a ThermoML XML file"""
dataset = ThermoMLDataSet.from_xml(open(get_test_data_path(f"thermoml/{filename}")).read())
assert len(dataset) == 1

entry = next(iter(dataset))
assert entry["x"] == expected["x"]

entry = dataset[0]
assert entry["type"] == expected["type"]
assert entry["x_a"] == expected["x_a"]
assert len(entry["x"]) == len(entry["smiles"])
assert len(entry["x"]) == len(expected["x"])

# assert mapped smiles
Molecule.from_mapped_smiles(entry["smiles_a"])
if expected["x_b"] is not None:
assert entry["x_b"] == expected["x_b"]
Molecule.from_mapped_smiles(entry["smiles_b"])
else:
assert entry["smiles_b"] is None
assert entry["x_b"] is None
for found_x, expected_x, found_smiles in zip(
entry["x"],
expected["x"],
entry["smiles"],
):
assert found_x == expected_x

assert entry["temperature"] == expected["temperature"]
if expected["pressure"] is not None:
assert np.isclose(entry["pressure"], expected["pressure"], atol=1e-3)
else:
assert entry["pressure"] is None
# just make sure it's valid SMILES
Molecule.from_smiles(found_smiles)

assert np.isclose(entry["value"], expected["value"], atol=1e-5)
assert np.isclose(entry["std"], expected["std"], atol=1e-5)
assert entry["units"] == expected["units"]
# Evaluator uses non-mapped SMILES, pseudocode here used mapped
# Molecule.from_mapped_smiles(found_smiles)

assert entry["source"] == expected["source"]
assert entry["temperature"] == expected["temperature"]
if expected["pressure"] is not None:
assert numpy.isclose(entry["pressure"], expected["pressure"], atol=1e-3)
else:
assert entry["pressure"] is None

assert numpy.isclose(entry["value"], expected["value"], atol=1e-5)
assert numpy.isclose(entry["std"], expected["std"], atol=1e-5)
assert Unit(entry["units"]) == Unit(expected["units"])

@pytest.mark.skip(reason="Not implemented yet")
assert entry["source"] == expected["source"]

def test_pandas_roundtrip(self, filename, expected):
dataset = ThermoMLDataSet.from_xml(open(get_test_data_path(f"thermoml/{filename}")).read())

roundtripped = ThermoMLDataSet.from_pandas(dataset.to_pandas())

assert len(dataset) == len(roundtripped)


@pytest.mark.skip(reason="Implement next")
def test_load_single_osmotic():
"""
Test loading a single osmotic coefficient data point from a ThermoML XML file.
@@ -111,31 +118,73 @@ def test_load_single_osmotic():
but is included here to ensure that ions are dealt with correctly.

"""
dataset = thermoml_dataset_from_xml(get_test_data_path("thermoml/single_osmotic.xml"))
dataset = ThermoMLDataSet.from_xml(open(get_test_data_path("thermoml/single_osmotic.xml")).read())
assert len(dataset) == 1

entry = dataset[0]
assert entry["type"] == "osmotic_coefficient"
entry = next(iter(dataset))

assert "." in entry["smiles_a"]
Molecule.from_mapped_smiles(entry["smiles_a"])
assert entry["x_a"] == 0.00086
assert "." in entry["smiles"]
Molecule.from_mapped_smiles(entry["smiles"])
assert entry["x"] == 0.00086

Molecule.from_mapped_smiles(entry["smiles_b"])
assert entry["x_b"] == 0.99914
Molecule.from_mapped_smiles(entry["smiles"])
assert entry["x"] == 0.99914

assert np.isclose(entry["temperature"], 298.15, atol=1e-3)
assert numpy.isclose(entry["temperature"], 298.15, atol=1e-3)
assert entry["pressure"] is None
assert np.isclose(entry["value"], 0.7389, atol=1e-5)
assert np.isclose(entry["std"], 0.00655, atol=1e-5)
assert numpy.isclose(entry["value"], 0.7389, atol=1e-5)
assert numpy.isclose(entry["std"], 0.00655, atol=1e-5)
assert entry["units"] == "dimensionless"
assert entry["source"] == "10.1016/j.fluid.2006.09.025"
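
Since entries now store `smiles` and `x` as parallel lists, the per-component assertions in this skipped test could be written against the lists rather than against scalar fields. The following is a minimal sketch of such a check, not code from this PR: `check_components` is a hypothetical helper, the SMILES strings are placeholders rather than the contents of `single_osmotic.xml`, and it assumes the mole fractions of an entry are complete (sum to one), which may not hold for every ThermoML record.

```python
import math


def check_components(entry: dict, tol: float = 1e-6) -> None:
    """Raise ValueError if the per-component lists of an entry are inconsistent.

    Hypothetical helper for illustration; not part of dimsim.
    """
    smiles, fractions = entry["smiles"], entry["x"]
    if len(smiles) != len(fractions):
        raise ValueError("expected one mole fraction per component")
    if not all(0.0 <= value <= 1.0 for value in fractions):
        raise ValueError("mole fractions must lie in [0, 1]")
    # Assumption: every component of the mixture is listed in the entry.
    if not math.isclose(sum(fractions), 1.0, abs_tol=tol):
        raise ValueError("mole fractions do not sum to one")


# Placeholder SMILES; only the mole fractions are taken from the test above.
entry = {"smiles": ["[Na+].[Cl-]", "O"], "x": [0.00086, 0.99914]}
check_components(entry)

# An ionic component is written as several fragments joined by "."
ionic = [value for value in entry["smiles"] if "." in value]
```

Checking each list element keeps the test independent of component ordering conventions, which the diff does not specify.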


@pytest.mark.skip(reason="Not implemented yet")
def test_load_from_doi():
"""Test loading a ThermoML dataset from a DOI"""
dataset = thermoml_dataset_from_doi("10.1016/j.fluid.2014.12.023")
assert len(dataset) == 9
dataset = ThermoMLDataSet.from_doi("10.1016/j.fluid.2014.12.023")
assert len(dataset) == 186
for entry in dataset:
assert entry["source"] == "10.1016/j.fluid.2014.12.023"


def test_to_pandas():
"""A test to ensure that data sets are convertible to pandas objects."""

thermoml_dataset = ThermoMLDataSet()

density_entry = {
"id": str(uuid.uuid4()).replace("-", ""),
"tag": "density",
"x": [1.0],
"smiles": ["[C:1]([O:5][C:3]([C:2]([O:4][H:13])([H:9])[H:10])([H:11])[H:12])([H:6])([H:7])[H:8]"],
"temperature": 293.15,
"pressure": 1.0,
"value": 0.96488,
"std": 0.00005,
"units": "g/mL",
"source": "",
}

thermoml_dataset.add_properties(density_entry)

dataframe = thermoml_dataset.to_pandas()

required_columns = [
"Id",
"tag",
"Temperature (K)",
"Pressure (kPa)",
"N Components",
"Component 1",
"Mole Fraction 1",
"Value",
"Uncertainty",
"Source",
]

assert all(x in dataframe for x in required_columns)

assert dataframe is not None
assert dataframe.shape == (1, 10)

data_set_without_na = dataframe.dropna(axis=1, how="all")
assert data_set_without_na.shape == (1, 9)
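
The column names required by `test_to_pandas` suggest that the list-valued `smiles` and `x` fields are spread over numbered columns. The sketch below shows one way such a flattening and its inverse might look using plain dictionaries. It is an illustration under stated assumptions, not the PR's `to_pandas`/`from_pandas` implementation: `entry_to_row` and `row_to_entry` are hypothetical names, no unit conversion is performed, and because the required columns include no units column, the units are passed back in separately.

```python
def entry_to_row(entry: dict) -> dict:
    """Flatten one entry into a wide row keyed by the column names in the test."""
    row = {
        "Id": entry["id"],
        "tag": entry["tag"],
        "Temperature (K)": entry["temperature"],
        "Pressure (kPa)": entry["pressure"],
        "N Components": len(entry["smiles"]),
    }
    # One pair of numbered columns per component, starting at 1.
    for index, (smiles, fraction) in enumerate(zip(entry["smiles"], entry["x"]), start=1):
        row[f"Component {index}"] = smiles
        row[f"Mole Fraction {index}"] = fraction
    row["Value"] = entry["value"]
    row["Uncertainty"] = entry["std"]
    row["Source"] = entry["source"]
    return row


def row_to_entry(row: dict, units: str) -> dict:
    """Rebuild an entry from a wide row; units are supplied by the caller."""
    indices = range(1, row["N Components"] + 1)
    return {
        "id": row["Id"],
        "tag": row["tag"],
        "smiles": [row[f"Component {index}"] for index in indices],
        "x": [row[f"Mole Fraction {index}"] for index in indices],
        "temperature": row["Temperature (K)"],
        "pressure": row["Pressure (kPa)"],
        "value": row["Value"],
        "std": row["Uncertainty"],
        "units": units,
        "source": row["Source"],
    }
```

A round-trip test built on helpers like these could compare whole entries rather than only `len(dataset)`, as `test_pandas_roundtrip` currently does.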
Empty file added dimsim/configs/__init__.py
Empty file.
53 changes: 53 additions & 0 deletions dimsim/configs/targets/thermo.py
@@ -0,0 +1,53 @@
import typing

import datasets
import pyarrow

from dimsim.molecule import map_smiles

DATA_SCHEMA = pyarrow.schema(
[
("id", pyarrow.string()),
("tag", pyarrow.string()),
("smiles", pyarrow.list_(pyarrow.string())),
("x", pyarrow.list_(pyarrow.float64())),
("temperature", pyarrow.float64()),
("pressure", pyarrow.float64()),
("value", pyarrow.float64()),
("std", pyarrow.float64()),
("units", pyarrow.string()),
("source", pyarrow.string()),
]
)


class DataEntry(typing.TypedDict):
id: str

tag: str # was previously EntryTag

smiles: list[str]

x: list[float]

temperature: float

pressure: float

value: float

std: float | None

units: str

source: str


def create_dataset(rows: typing.Iterable[DataEntry]) -> datasets.Dataset:
for row in rows:
row["smiles"] = [map_smiles(value) for value in row["smiles"]]

# TODO: validate rows
table = pyarrow.Table.from_pylist([*rows], schema=DATA_SCHEMA)
Collaborator

Will this work? `phases` is defined as a list of strings, but the defaults are actually given as an `IntFlag` enum. Could you please add a test?


return datasets.Dataset(datasets.table.InMemoryTable(table))
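
One thing a test for `create_dataset` might cover: the function walks `rows` in the `for` loop and again in `[*rows]`. If a caller passes a one-shot iterator such as a generator, the second pass sees nothing, so the table would be built from an empty list. The sketch below shows one way to guard against that by materializing the rows first. It is not the PR's code: `prepare_rows` is a hypothetical helper, and `map_smiles` here is a stand-in for `dimsim.molecule.map_smiles` that only tags the string.

```python
import typing


def map_smiles(smiles: str) -> str:
    """Stand-in for dimsim.molecule.map_smiles; it only wraps the input."""
    return f"mapped({smiles})"


def prepare_rows(rows: typing.Iterable[dict]) -> list[dict]:
    """Materialize rows once, then map SMILES on shallow copies."""
    # Copy each row so the caller's dictionaries are left untouched.
    materialized = [dict(row) for row in rows]
    for row in materialized:
        row["smiles"] = [map_smiles(value) for value in row["smiles"]]
    return materialized


# A generator can only be consumed once; prepare_rows handles it safely.
generator = ({"smiles": [value]} for value in ["CCO", "O"])
prepared = prepare_rows(generator)
```

The resulting list could then be handed to `pyarrow.Table.from_pylist` as in the function above.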