1699: get_parent_model_column_order calculation made only for unique rdomains and merged by#1711
Conversation
SFJohnson24
left a comment
There was a problem hiding this comment.
I am unfortunately not getting the hang up described in the issue locally-- that said, the changes you made dont appear to impact the performance. The original was O(n) for the number of supp-- rows since _get_parent_variable_names_list was called once per unique RDOMAIN value via set(rdomain_values) loop. Your refactor is also O(n) since the setdefault call inside the generator still iterates over all n rows of rdomain_values and the expensive _get_parent_variable_names_list call is still only made once per unique RDOMAIN value, just lazily rather than upfront. I may be off here if you were able to reproduce the hangup? I still believe the bottleneck is likely the is_contained_by as it is O(n*m) presently -- number of rows (64k in this case) x the length of the comparator list. It may not be the iloc but that adds extra overhead of an already expensive loop.
- I think reducing n by using a distinct
- we could also use a set on comparison_data before the loop so each is_in check is O(1) instead of O(m)
What do you think?
|
The setdefault approach inside a generator does not actually cache the results of _get_parent_variable_names_list. In Python, function arguments are evaluated eagerly — meaning the second argument to setdefault is fully evaluated before setdefault is even called. So _get_parent_variable_names_list gets called once per row (n times total), not once per unique RDOMAIN value. d = {}
def expensive(x):
print(f"called with {x}")
return x * 10
for v in ["A", "A", "A"]:
d.setdefault(v, expensive(v))
# Output:
# called with A
# called with A
# called with A ← 3 times, not 1So the original setdefault version is O(n rows), while the explicit set() loop version: for rdomain in set(rdomain_values):
rdomain_names_list[rdomain] = self._get_parent_variable_names_list(...)is O(k unique RDOMAIN values), which is a real improvement, not just a cosmetic refactor. |
There was a problem hiding this comment.
This PR correctly preserves get_parent_model_column_order while optimizing it for performance, going from O(n) I/O calls down to O(u) by deduplication with set(). Neither dev was able to reproduce the hangup using the pilot data for both the full SDTMIG and just the one problematic rule. Should the hangup persist in v0.16 we will look to is_contain_by and work to optimize there
No description provided.