MPI error messages on multi-node runs are interleaved, making logs unreadable #4017

@MelReyCG

Description

When GEOS runs on 48+ MPI ranks across multiple cluster nodes (not reproduced on P4), long log messages (on both stdout and the YAML error file) are severely interleaved and unreadable: messages from different ranks cut each other off mid-sentence, which is particularly visible in long diagnostic output containing tables and stack traces.

The root cause is that each rank writes independently to the same stdout pipe or shared file. POSIX only guarantees that a single write() is atomic up to PIPE_BUF bytes (4096 on Linux); beyond that, the kernel may interleave data from concurrent writers. The error messages produced by GEOS (attribute tables, stack traces, context info) routinely exceed this threshold.

Proposed solutions, both requiring the log to be lightly structured so that every message starts with its rank and a timestamp:

  • Option 1 — One independent file per rank (immediate, robust)
    Redirect error output to error_rank_<N>.log / error_rank_<N>.yaml using the existing geos::InitializeLogger() and ErrorLogger::setOutputFilename(). Files can be merged or filtered by timestamp after the run. Zero coordination required.

  • Option 2 — Per-rank files with periodic aggregation to rank 0 (structured)
    Each rank writes its own error file. At defined synchronization points (MPI barriers in normal execution flow), surviving ranks forward their file content to rank 0, which produces a unified report. On fatal error, rank 0 performs a final read of all reachable rank files before terminating.
    This option preserves a single consolidated view without introducing any MPI coordination in the critical error path itself.

Example of the issue:

***** Exception
***** LOCATION: /data/home/xxxxxxxxxxxxx/GEOS/Developments/GEOS/src/coreComponents/dataRepository/Group.cpp:252
***** Error cause: processedAttributes.count( attributeName ) == 0
***** Rank 0
***** Message from compositionalMultiphaseFVMSolver (GNL_Flow_Thermal_CO2Injection.xml, l.8):
XML Node at '/Problem/Solvers/CompositionalMultiphaseFVM' contains unused attribute 'maxAbsolutePresChange'.
Valid attributes are:

----------------------------------------------------------------------------------------------------------------------------------------------------
|                    name                     |  Requirement  |                                    Description                                     |
|---------------------------------------------|---------------|------------------------------------------------------------------------------------|
|  cflFactor                                  |   OPTIONAL    |  Factor to apply to the `CFL condition                                             |
|                                             |               |  <http://en.wikipedia.org/wiki/Courant-Friedrichs-Lewy_condition>`_ when           |
|                                             |               |  calculating the maximum allowable time step. Values should be in the interval     |
|                                             |               |  (0,1]                                                                             |
|  discretization                             |   REQUIRED    |  Name of discretization object (defined in the :ref:`NumericalMethodsManager`) to  |
|                                             |               |  use for this solver. For instance, if this is a Finite Element Solver, the name   |
|                                             |               |  of a :ref:`FiniteElement` should be specified. If this is a Finite Volume         |
|                                             |               |  Method, the name of a :ref:`FiniteVolume` discretization should be specified.     |
|  targetRegions                              |   REQUIRED    |  Allowable regions that the solver may be applied to. Note that this does not      |
|                                             |               |  indicate that the solver will be applied to these regions, only that allocation   |
|                                             |               |  will occur such that the solver may be applied to these regions. The decision     |
|                                             |               |  about what regions this solver will be applied to rests in the EventManager.      |
|  initialDt                                  |   OPTIONAL    |  Initial time-step value required by the solver to the event manager.              |
|  writeLinearSystem                          |   OPTIONAL    |  Write matrix, rhs, solution to screen ( = 1) or file ( = 2).                      |
|  allowNonConvergedLinearSolverSolution      |   OPTIONAL    |  Cut time step if linear solution fail without going until max nonlinear           |
|                                             |               |  iterations.                                                                       |
|  usePhysicsScaling                          |   OPTIONAL    |  Enable physics-based scaling of the linear system. Default: true.                 |
|  writeStatistics                            |   OPTIONAL    |  When set to `iteration`, output iterations information to a csv                   |
|                                             |               |  When set to `convergence`, output convergence information to a csv                |
|                                             |               |  When set to `all` output both convergence & iteration information to a csv.       |
|  logLevel                                   |   OPTIONAL    |  Sets the level of information to write in the standard output (the console        |
          -- at this point the message is cut off mid-line by output from another rank --

Metadata

Labels
type: bug (Something isn't working); type: new (A new issue has been created and requires attention)
