
Conversation

@ziotom78
Member

@ziotom78 ziotom78 commented Mar 14, 2025

This commit attempts to fix issue #364 by revisiting the way MPI processes are used in several places in the framework. It adds a new test that checks that a simple E2E pipeline can run even if one of the MPI processes has no TOD samples in it.

  • Add a failing test (using MPI)
  • Add a method _GridCommClass.is_this_process_in_grid()
  • Fix Observations._set_attributes_from_list_of_dict()
  • Use Observation.no_mueller_hwp()
  • Use Observation.no_mueller_hwp()
  • Be sure that each observation has a name
  • Fix spelling error in comment
  • Remove unused whitespace
  • Fix bugs in the way the map-makers use the MPI grid
  • Run MPI tests with 1, 2, 3 processes

@ziotom78
Member Author

@anand-avinash, I put together some fixes that might solve the problems you described in #364. Before merging this PR, could you please check that this is indeed the case?

@anand-avinash
Contributor

anand-avinash commented Mar 17, 2025

Hi @ziotom78! Thank you for applying the fixes. The problem is not completely resolved yet. I am now getting two new (but related) errors:

  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/playground/lbsim_set_n_blocks/lbsim_mwe.py", line 80, in <module>
    sim.prepare_pointings()
  File "/mnt/Data/Projects/uniroma2/coding/dev_litebirdsim/litebird_sim/litebird_sim/simulations.py", line 252, in profile_wrapper
    result = function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_litebirdsim/litebird_sim/litebird_sim/simulations.py", line 1318, in prepare_pointings
    prepare_pointings(
  File "/mnt/Data/Projects/uniroma2/coding/dev_litebirdsim/litebird_sim/litebird_sim/pointings_in_obs.py", line 52, in prepare_pointings
    if cur_obs.mueller_hwp[idet] is None:
       ~~~~~~~~~~~~~~~~~~~^^^^^^
IndexError: list index out of range
Traceback (most recent call last):
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/playground/lbsim_set_n_blocks/lbsim_mwe.py", line 80, in <module>
    sim.prepare_pointings()
  File "/mnt/Data/Projects/uniroma2/coding/dev_litebirdsim/litebird_sim/litebird_sim/simulations.py", line 252, in profile_wrapper
    result = function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_litebirdsim/litebird_sim/litebird_sim/simulations.py", line 1318, in prepare_pointings
    prepare_pointings(
  File "/mnt/Data/Projects/uniroma2/coding/dev_litebirdsim/litebird_sim/litebird_sim/pointings_in_obs.py", line 51, in prepare_pointings
    for idet in cur_obs.det_idx:
                ^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not iterable

The assignment of mueller_hwp is still the same as you observed earlier:

#1: sim.observations[0].mueller_hwp=[None]
#2: sim.observations[0].mueller_hwp=[None]
#3: sim.observations[0].mueller_hwp=None

I am using the same script as this for testing.

@anand-avinash
Contributor

I found that the issue is limited primarily to the following section:

if (
    MPI_COMM_GRID.COMM_OBS_GRID == MPI_COMM_GRID.COMM_NULL
):  # The process does not own any detector (and TOD)
    null_det = DetectorInfo()
    attribute = getattr(null_det, name, None)
    value = 0 if isinstance(attribute, numbers.Number) else None
    setattr(self, name, value)
    return

This section applies only to the MPI processes that do not contain any detector. On such processes, we first acquire the default value of the attribute (name) from the DetectorInfo class. If this value is number-like, we assign 0 to the attribute; if it is something else, or if it doesn't exist, we assign None. Once that is done, the method returns immediately. As a result, in the example we see sim.observations[0].mueller_hwp is None on rank 3 (instead of [None]). This produces the first error.

Also, since det_idx is not an attribute of the DetectorInfo class, we end up assigning None to it. This produces the second error.
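The decision logic above can be sketched in isolation. This is a minimal, self-contained illustration: the stand-in DetectorInfo and its attribute defaults below are assumptions chosen only to mimic the behavior described, not the real litebird_sim class.

```python
import numbers

# Hypothetical stand-in for litebird_sim's DetectorInfo, used only to
# illustrate how the attribute defaults drive the value assigned on
# detector-less ranks.
class DetectorInfo:
    def __init__(self):
        self.sampling_rate_hz = 0.0  # number-like default
        self.mueller_hwp = None      # non-numeric default
        # det_idx is deliberately *not* an attribute

def null_process_value(name):
    # Same logic as the section above: numbers become 0, everything else None
    null_det = DetectorInfo()
    attribute = getattr(null_det, name, None)
    return 0 if isinstance(attribute, numbers.Number) else None

print(null_process_value("sampling_rate_hz"))  # 0
print(null_process_value("mueller_hwp"))       # None (not [None]) -> 1st error
print(null_process_value("det_idx"))           # None -> not iterable -> 2nd error
```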

If I modify this section as follows, it resolves the two errors completely:

if (
    MPI_COMM_GRID.COMM_OBS_GRID == MPI_COMM_GRID.COMM_NULL
):  # The process does not own any detector (and TOD)
    null_det = DetectorInfo()
    attribute = getattr(null_det, name, None)
    value = [0] if isinstance(attribute, numbers.Number) else [None]
    if name == "det_idx":
        value = [0]
    setattr(self, name, value)
    return

After this, I get two errors; I will report them one by one:

1st error

This is coming from rank 1, which in fact contains a detector(!):

Traceback (most recent call last):
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/playground/lbsim_set_n_blocks/lbsim_mwe.py", line 81, in <module>
    sim.prepare_pointings()
  File "/mnt/Data/Projects/uniroma2/coding/dev_litebirdsim/litebird_sim/litebird_sim/simulations.py", line 252, in profile_wrapper
    result = function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_litebirdsim/litebird_sim/litebird_sim/simulations.py", line 1318, in prepare_pointings
    prepare_pointings(
  File "/mnt/Data/Projects/uniroma2/coding/dev_litebirdsim/litebird_sim/litebird_sim/pointings_in_obs.py", line 57, in prepare_pointings
    if cur_obs.mueller_hwp[idet] is None:
       ~~~~~~~~~~~~~~~~~~~^^^^^^
IndexError: list index out of range

In the Observation class, we assign det_idx using self.setattr_det_global("det_idx", np.arange(self._n_detectors_global), root). Since det_idx refers to the global detector index, on rank 1 we have cur_obs.det_idx equal to [1], which leads to the error we see. It can be resolved by replacing line 51 of pointings_in_obs.py with for idet in range(cur_obs.n_detectors):.
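The global-vs-local index mismatch can be reproduced with plain Python lists. This is an illustrative sketch, not the real Observation API; the per-rank values are taken from the example above (rank 1 owns the second of two detectors).

```python
# Rank 1's view in the example: it owns global detector 1, so det_idx
# holds a *global* index, while mueller_hwp has one *local* entry.
det_idx_rank1 = [1]        # global detector indices owned by rank 1
mueller_hwp_rank1 = [None] # local per-detector list of length 1

# Buggy loop: global index 1 into a length-1 local list -> IndexError
try:
    for idet in det_idx_rank1:
        _ = mueller_hwp_rank1[idet]
except IndexError:
    print("IndexError, as in the traceback above")

# Proposed fix: iterate over the *local* detector count instead
n_detectors_local = len(mueller_hwp_rank1)
for idet in range(n_detectors_local):
    _ = mueller_hwp_rank1[idet]  # safe: 0 <= idet < local length
```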

2nd error

This error is again due to the section I referred to before. Since the default value of quat in the DetectorInfo class is not number-like, we assign None to it on the processes that do not contain a detector. This could be solved similarly to what I proposed above for det_idx, but it opens another Pandora's box of errors, and I don't have any good intuition for them.

Traceback (most recent call last):
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/playground/lbsim_set_n_blocks/lbsim_mwe.py", line 119, in <module>
    lbs.scan_map_in_observations(
  File "/mnt/Data/Projects/uniroma2/coding/dev_litebirdsim/litebird_sim/litebird_sim/scan_map.py", line 332, in scan_map_in_observations
    hwp_angle = cur_obs.get_pointings()[1]
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_litebirdsim/litebird_sim/litebird_sim/observations.py", line 892, in get_pointings
    return self.get_pointings(
           ^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_litebirdsim/litebird_sim/litebird_sim/observations.py", line 941, in get_pointings
    _ = self.get_pointings(
        ^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_litebirdsim/litebird_sim/litebird_sim/observations.py", line 876, in get_pointings
    return self.pointing_provider.get_pointings(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_litebirdsim/litebird_sim/litebird_sim/pointings.py", line 118, in get_pointings
    full_quaternions = (self.bore2ecliptic_quats * detector_quat).slerp(
                        ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
  File "/mnt/Data/Projects/uniroma2/coding/dev_litebirdsim/litebird_sim/litebird_sim/scanning.py", line 479, in __mul__
    if (self.start_time is not None) and (other.start_time is not None):
                                          ^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'start_time'

@ziotom78
Member Author

Are you sure you are still using the script mentioned in that comment? The issue

  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/playground/lbsim_set_n_blocks/lbsim_mwe.py", line 80, in <module>
    sim.prepare_pointings()
  File "/mnt/Data/Projects/uniroma2/coding/dev_litebirdsim/litebird_sim/litebird_sim/simulations.py", line 252, in profile_wrapper
    result = function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_litebirdsim/litebird_sim/litebird_sim/simulations.py", line 1318, in prepare_pointings
    prepare_pointings(
  File "/mnt/Data/Projects/uniroma2/coding/dev_litebirdsim/litebird_sim/litebird_sim/pointings_in_obs.py", line 52, in prepare_pointings
    if cur_obs.mueller_hwp[idet] is None:
       ~~~~~~~~~~~~~~~~~~~^^^^^^
IndexError: list index out of range

triggers line 52 in pointings_in_obs.py (last line of the stack trace), which is only called when hwp is not None. However, I see no calls to Simulation.set_hwp() in your test script. Maybe you updated it after that post?

Could you please modify the test test_empty_observations() in test/test_mpi.py so that it triggers the same errors you are seeing?

Many thanks!

@anand-avinash
Contributor

Ah, you are right! I was running the script after uncommenting the set_hwp part. With that commented out, there is no error from the script. However, it throws some errors if I call the destriper.

 base_path=base_path,
 start_time=0.0,
-duration_s=mission_time_days * 24 * 60.0,
+duration_s=mission_time_days * 24 * 60 * 60.0,
Member Author


Is this needed? It makes the test far slower…

Contributor


There was no need, I agree with you. I just wanted to make the conversion to seconds correct.

ziotom78 and others added 7 commits March 19, 2025 19:16
This should work even if some MPI processes have no detector
As stated at the beginning of the file, we do not use PyTest for MPI
processes, as we want to be sure that all processes share the same
temporary directory, and this is critical for I/O tests.
The reason is that if some MPI process crashes, the barrier fails if
not every MPI process is in the grid.
We use a StringIO stream to write things in memory, so we can test
print() statements without polluting the terminal
@ziotom78
Member Author

ziotom78 commented Apr 8, 2025

Tests now fail because the function add_convolved_sky() in litebird_sim/beam_convolution.py contains these lines:

if mueller_hwp is not None:
    assert tod.shape[0] == mueller_hwp.shape[0]

where mueller_hwp = [None, None]. (The culprit is test_beam_convolution() in test/test_beam_convolution.py, which creates two detectors without Mueller HWP, hence the list of two elements for mueller_hwp.)

I am unsure why this error is happening or how we can fix it. I see a few possibilities:

  • We force mueller_hwp to become None every time it is a list of None types. (Hacky, as we would need to do this with an explicit escape hatch.)
  • We modify add_convolved_sky() to handle a list of None values. (What does happen if the list contains both None values and actual Mueller matrices? Do we want to support this case?)
  • Other solutions?
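The first option could look like the following hypothetical helper. This is not part of litebird_sim; it is shown only to make the trade-off concrete, and it deliberately leaves mixed lists untouched, keeping the open question above explicit rather than hiding it.

```python
# Hypothetical helper (an assumption, not litebird_sim's API): collapse a
# mueller_hwp list that contains only None entries into a plain None, so
# checks like `if mueller_hwp is not None:` behave as intended.
def normalize_mueller_hwp(mueller_hwp):
    if mueller_hwp is not None and all(m is None for m in mueller_hwp):
        return None
    return mueller_hwp

print(normalize_mueller_hwp([None, None]))  # None
print(normalize_mueller_hwp(None))          # None
# A mixed list (some None, some actual Mueller matrices) is returned as-is.
```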

@anand-avinash
Contributor

I am thinking about a different resolution for #364. The problem arises simply because we try to use more MPI processes than number_of_time_blocks × number_of_det_blocks. It is a genuine concern, because we cannot easily predict the value of number_of_det_blocks when we create detector blocks based on the detector attributes. Instead of solving the resulting problems on a case-by-case basis, how about providing a script (similar to plot_fp.py and install_imo.py) that simply tells users exactly how many MPI processes are needed to run a script with a given configuration for data distribution? This way, users can determine the number of MPI processes needed for their simulations without running them anywhere.

@ziotom78
Member Author

I am thinking about a different resolution for #364. The problem arises simply because we try to use more MPI processes than number_of_time_blocks × number_of_det_blocks. It is a genuine concern, because we cannot easily predict the value of number_of_det_blocks when we create detector blocks based on the detector attributes. Instead of solving the resulting problems on a case-by-case basis, how about providing a script (similar to plot_fp.py and install_imo.py) that simply tells users exactly how many MPI processes are needed to run a script with a given configuration for data distribution? This way, users can determine the number of MPI processes needed for their simulations without running them anywhere.

I like this idea! The assumption that every MPI process had at least one observation is so pervasive that fixing it is taking too much time. A script that helps the user find the right number of processes is easier to understand, does not complicate the codebase, and makes better use of computer resources.
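A minimal sketch of what such a helper script could compute, assuming the required process count is simply the number of cells in the (time block × detector block) grid. The function names and interface are assumptions for illustration, not litebird_sim's actual API.

```python
# Hedged sketch of the proposed helper: given a data-distribution
# configuration, report how many MPI processes a run would need.
def required_mpi_processes(n_time_blocks: int, n_det_blocks: int) -> int:
    """One MPI process per cell of the (time block x detector block) grid."""
    return n_time_blocks * n_det_blocks

def check_mpi_size(mpi_size: int, n_time_blocks: int, n_det_blocks: int) -> None:
    """Raise if some processes would own no TOD (the situation behind #364)."""
    needed = required_mpi_processes(n_time_blocks, n_det_blocks)
    if mpi_size > needed:
        raise ValueError(
            f"{mpi_size} MPI processes requested, but the grid has only "
            f"{needed} cells; {mpi_size - needed} process(es) would own no TOD"
        )

print(required_mpi_processes(3, 2))  # 6
check_mpi_size(6, 3, 2)              # fine: every process owns a grid cell
```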

anand-avinash added a commit that referenced this pull request Jun 9, 2025
@anand-avinash anand-avinash mentioned this pull request Jun 9, 2025