Possible message drop when receiving large number of messages #3860

@knelli2

Hello, I am one of the SpECTRE developers. One of our executables built with Charm++ would hang, without explanation, after running for a long time. After a lot of debugging, we believe we traced the issue to a message that was definitely sent from one chare but never received by a different chare. Since the test case where this happened was fairly complicated, I've developed a minimal example that doesn't depend on SpECTRE and that I believe also reproduces the issue. Attached is a tarball containing the source code needed to run the minimal example.

MessageDrop.tar.gz

Some things to note about the issue:

  • In the SpECTRE test case, this happened after a single-element array chare had received a large number of messages, roughly 2^32 / 10 ($\approx$ 4.3e8).
  • The issue persisted even after checkpoint/restart (meaning this likely isn't caused by running out of memory).
  • Even though a message seems to have been dropped and no more messages are being sent, quiescence detection does not trigger.
  • In the minimal example, I construct two single-element array chares: the Sender and the Receiver. The Sender sends a total of 2^32 messages ($\approx$ 4.3e9) in batches of 1e8 messages at a time (to avoid running out of memory). The Receiver receives the 1e8 messages, increments a counter, then tells the Sender to send another batch of messages. After around 2.2e9 messages have been sent, the Receiver no longer prints that it is receiving messages, and the executable hangs.
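
For readers who don't want to unpack the tarball, the batching handshake described in the last bullet can be sketched in plain C++. This is not the attached Charm++ code: it uses no Charm++ runtime, the names `RunStats`, `batches_needed`, and `simulate` are hypothetical, and a synchronous loop stands in for the asynchronous entry-method calls between the two chares. Counting one increment per received message is also an assumption about how the Receiver's counter works.

```cpp
#include <cstdint>

// Totals collected over one simulated run (hypothetical type, not in the tarball).
struct RunStats {
  std::uint64_t sent = 0;      // messages the Sender has emitted
  std::uint64_t received = 0;  // messages the Receiver has counted
  std::uint64_t batches = 0;   // batches sent (one per Receiver request)
};

// Number of batches needed to send `total` messages, `batch_size` at a time.
// Returns 0 for a zero batch size to avoid dividing by zero.
constexpr std::uint64_t batches_needed(std::uint64_t total,
                                       std::uint64_t batch_size) {
  return batch_size == 0 ? 0 : (total + batch_size - 1) / batch_size;
}

// Synchronous stand-in for the Sender/Receiver handshake: the Sender emits
// one batch, the Receiver counts every message in it, and only then is the
// next batch started. 64-bit counters are used so the model itself cannot
// wrap at 2^32.
RunStats simulate(std::uint64_t total, std::uint64_t batch_size) {
  RunStats stats;
  if (batch_size == 0) {
    return stats;  // nothing can be sent with an empty batch
  }
  while (stats.sent < total) {
    const std::uint64_t remaining = total - stats.sent;
    const std::uint64_t n = remaining < batch_size ? remaining : batch_size;
    for (std::uint64_t i = 0; i < n; ++i) {
      ++stats.sent;      // Sender: one message out
      ++stats.received;  // Receiver: one message counted
    }
    ++stats.batches;     // Receiver would now ask for the next batch
  }
  return stats;
}
```

With the numbers from the minimal example, `batches_needed(1ULL << 32, 100000000ULL)` evaluates to 43, so a complete run should print 43 batch notifications from the Receiver. In this model `sent` and `received` are always equal after each batch, which is the invariant that appears to break in the real run.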

Also, this was built and configured for mpi-linux-x86_64-smp, with Intel MPI as the backend.

Any help in understanding what is happening would be much appreciated.
