Skip to content

Conversation

@rwest
Copy link
Member

@rwest rwest commented Feb 20, 2020

Motivation or Problem

If you ran Molecule.is_isomorphic with an incorrect initial map, it would assume nothing was wrong with the map, and if it found no non-isomorphic things in the rest of the molecule, would conclude the molecules were isomorphic.

This was causing some incorrect failing database tests.

Description of Changes

If given an initial mapping, check it's valid.

Testing

Made a unit test that fails, then fixed it and it passes.

@codecov
Copy link

codecov bot commented Feb 20, 2020

Codecov Report

Merging #1894 into master will increase coverage by 0.03%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1894      +/-   ##
==========================================
+ Coverage   47.10%   47.14%   +0.03%     
==========================================
  Files          88       88              
  Lines       23189    23189              
  Branches     6033     6033              
==========================================
+ Hits        10923    10932       +9     
+ Misses      11114    11106       -8     
+ Partials     1152     1151       -1     
Impacted Files Coverage Δ
arkane/kinetics.py 12.24% <0.00%> (ø)
rmgpy/molecule/draw.py 52.74% <0.00%> (+0.77%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cc9bb3b...bcd515c. Read the comment docs.

@mjohnson541
Copy link
Contributor

I think @mliu49 and I talked about doing this before and decided not to. I think at the time the argument was that if one cares about the correctness of the initial map (assuming the map is unique) one can simply use the generate_initial_map option which will give the right result. However inside RMG we already know the map and know it is correct the vast majority of times so in all of those situations it doesn't make sense to run this check. I'm unsure how much speed difference this would make, but given how often we use is_isomorphic it could be significant.

@rwest
Copy link
Member Author

rwest commented Feb 20, 2020

Calling with generate_initial_map doesn't enforce all labeled atoms are in both molecules.
For example in databaseTest.py line 569 kinetics_check_adjlists_nonidentical creates a mapping and then passes it to is_isomorphic.
If you replace this code with just calling is_isomorphic with generate_initial_mapping=True instead of passing in the initial mapping, then a bunch of families fail the tests, because things with and without labeled atoms are marked as isomorphic.

There are other ways to rewrite this particular unit test to work around the problem instead of the solution currently in this PR, but then we still have the fact that a function called is_isomorphic sometimes returns True for two molecules that are clearly not isomorphic. I would argue that to accept this it should be 1) definite speed benefit, and 2) very well documented.

@mliu49
Copy link
Contributor

mliu49 commented Feb 20, 2020

I do remember having that discussion, but I don't remember much of the details.

In one of my many attempts to refactor these comparisons, I wrote an entirely new method with multiple arguments to handle all comparison types. The code is still here, and it has an option to verify the initial mapping, whether it was provided or generated. I ended up abandoning that branch because it seemed like a big function with a ton of options was less efficient than having multiple individual ones.

I think it would be worthwhile to add this option to is_isomorphic, though it might be a good idea to check whether it does impact performance.

rwest added a commit to mazeau/RMG-Py that referenced this pull request Feb 20, 2020
Previously, if there was an non-isomorphism present in the 
labeled atoms, then the mapping would be invalid, but this 
is not (currently) checked by the is_isomorphic method,
so two non-isomorphic molecules would be found isomorphic,
and would fail the test.

Now we ensure that the mapping is valid, before passing it 
to is_isomorphic. This should work whatever the outcome of 
the discussion around PR ReactionMechanismGenerator#1894
@rwest
Copy link
Member Author

rwest commented Feb 20, 2020

So mazeau@8c78e91 now works around this problem so #1884 is no longer dependent on this PR.

@alongd
Copy link
Member

alongd commented Jun 10, 2020

looking good! please rebase

@mjohnson541
Copy link
Contributor

Are we certain we want to make this check non-optional?

@rwest
Copy link
Member Author

rwest commented Aug 18, 2020

We could add a assume_initial_map_is_correct=False argument?
If the method is called with assume_initial_map_is_correct=True then it skips the check?

When would we then use the feature? I suppose if we have just algorithmically generated the initial map, but I'm not sure where that's the case.

@mjohnson541
Copy link
Contributor

I'm not sure I understand...aren't all initial maps in an RMG job algorithmically generated?

@rwest
Copy link
Member Author

rwest commented Aug 18, 2020

If there are labelled atoms then the labels are used to generate the initial mapping. If those atoms are NOT actually isomorphic, we never check them. See the example in the unit test that is added in this PR.
It shows butane adsorbed to the surface through the first or the second C atom. Clearly different graphs, but were identified as isomorphic

CH2-CH2-CH2-CH3
|
X 

CH3-CH-CH2-CH3
    |
    X 

This was causing tests in the database to incorrectly fail.

@mjohnson541
Copy link
Contributor

Oh sorry, this is just Molecule.is_isomorphic, not all of them.

@rwest
Copy link
Member Author

rwest commented Aug 19, 2020

Am I right that the only remaining concern is that checking isomorphism properly might slow things down a bit?

The continuous-integration/rmg-tests show these timings:

Observables Test Case: Aromatics Comparison
Execution time, Benchmark:
Execution time (DD:HH:MM:SS): 00:00:00:56
Execution time, Tested:
Execution time (DD:HH:MM:SS): 00:00:00:56

Observables Test Case: NC Comparison
Execution time, Benchmark:
Execution time (DD:HH:MM:SS): 00:00:01:31
Execution time, Tested:
Execution time (DD:HH:MM:SS): 00:00:01:32

Observables Test Case: SO2 Comparison
Execution time, Benchmark:
Execution time (DD:HH:MM:SS): 00:00:01:00
Execution time, Tested:
Execution time (DD:HH:MM:SS): 00:00:01:00

Observables Test Case: EG1 Comparison
Execution time, Benchmark:
Execution time (DD:HH:MM:SS): 00:00:00:57
Execution time, Tested:
Execution time (DD:HH:MM:SS): 00:00:00:56

All these jobs are only ~1 minute, so not a convincing argument, but I don't see evidence that we should worry. One is 1 second faster, and one is 1 second slower, and the other two are the same.

Do we have (or should we have?) a more rigorous or accurate away to benchmark speed changes in pull requests?

Weirdly, looking carefully at these logs, there appears to be some non-identical thermo in the rmg/nitrogen test

Non-identical thermo!
1119original:	O1[C]=N1
1120tested:	O1[C]=N1
1121Hf(300K)  |S(300K)   |Cp(300K)  |Cp(400K)  |Cp(500K)  |Cp(600K)  |Cp(800K)  |Cp(1000K) |Cp(1500K) 
1122    129.73|     51.79|      9.14|      9.59|      9.99|     10.34|     10.90|     11.38|     12.40
1123    154.90|     56.55|      9.79|      9.15|      8.58|      8.33|      9.02|     10.26|     10.96
1124thermo: Thermo group additivity estimation: missing(O2s-CdN3d) + group(N3s-CsHH) + group(Cds-CdsCsCs) + ring(Cyclopropene) + radical(Cds_P)
1125thermo: Thermo group additivity estimation: missing(O2s-CdN3d) + group(N3s-CsHH) + group(Cds-CdsCsCs) + ring(oxirene) + radical(Cds_P)

But the same oxirene and Cyclopropene interchange was noticed on many different PRs so I think is an underlying non-deterministic behavior that should be addressed separately, in issue #2010.

@mjohnson541
Copy link
Contributor

I think that's sufficient for this case, this one doesn't get as much use as its is_subgraph_isomorphic counterpart.

It would be nice for us to have some speed benchmarks...as you've noted though RMG-tests examples aren't necessarily in the regime we're worried about. I think we'd need to add an example using the restart feature to start at a higher species load. To make it test properly on faster/slower processors I think we'd need to run an RMG independent speed test at the beginning that we could scale the RMG runtime to.

rwest added 2 commits August 19, 2020 22:13
The problem this identifies is that if you give it two different molecules
but the only difference is in a set of atoms that you have already
passed in as a map (eg. because of their labels) then they are thought
to be isomorphic. I.e the initial mapping you pass is not checked.
This fixes the recently added failing isomorphism check.
If you pass in a mapping, it is checked for validity.

Note that this is not done at the Graph level or VF2, only at 
the Molecule level. I.e. this bug probably still exists for Graph
and maybe Group objects.
@mjohnson541 mjohnson541 merged commit ea8b784 into master Aug 20, 2020
@mjohnson541 mjohnson541 deleted the isomorphismfix branch August 20, 2020 12:05
@rwest
Copy link
Member Author

rwest commented Aug 20, 2020

Thanks

mjohnson541 pushed a commit that referenced this pull request Apr 26, 2021
Previously, if there was an non-isomorphism present in the 
labeled atoms, then the mapping would be invalid, but this 
is not (currently) checked by the is_isomorphic method,
so two non-isomorphic molecules would be found isomorphic,
and would fail the test.

Now we ensure that the mapping is valid, before passing it 
to is_isomorphic. This should work whatever the outcome of 
the discussion around PR #1894
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants