
Conversation

@dheeraj-vanamala

Description

This PR fixes issue #4517, where the OTLP gRPC exporter fails to reconnect to the collector after the collector restarts, leaving exports failing with StatusCode.UNAVAILABLE.

Changes:

  • Detected StatusCode.UNAVAILABLE in the export loop.
  • Added logic to close the existing channel and re-initialize it before retrying (a sketch of the approach follows this list).
  • Added a regression test, test_unavailable_reconnects, to verify the reconnection behavior.
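
The core of the change, in minimal sketch form (not the exact diff): when an export RPC fails with StatusCode.UNAVAILABLE, the stale channel is closed and a fresh channel and stub are created before the request is retried once. The _stub_class, _endpoint, and _timeout attributes below are illustrative stand-ins for the corresponding state held by OTLPExporterMixin.

    import logging

    import grpc

    logger = logging.getLogger(__name__)

    # Method-shaped sketch; `self` is the exporter instance.
    def _export_with_reconnect(self, request):
        """Retry once after re-creating the channel on UNAVAILABLE."""
        try:
            return self._client.Export(request, timeout=self._timeout)
        except grpc.RpcError as error:
            if error.code() != grpc.StatusCode.UNAVAILABLE:
                raise
            logger.warning(
                "Collector unavailable; re-initializing gRPC channel before retrying"
            )
            # Closing the stale channel discards its stuck poller state.
            self._channel.close()
            self._channel = grpc.insecure_channel(self._endpoint)
            # _stub_class stands in for the generated gRPC stub
            # (e.g. TraceServiceStub) held by the exporter.
            self._client = self._stub_class(self._channel)
            return self._client.Export(request, timeout=self._timeout)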

Fixes #4517

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

I added a new regression test case test_unavailable_reconnects in exporter/opentelemetry-exporter-otlp-proto-grpc/tests/test_otlp_exporter_mixin.py.

  • test_unavailable_reconnects: Verifies that the exporter closes and re-initializes the gRPC channel when the server returns StatusCode.UNAVAILABLE. A self-contained sketch of its shape follows below.
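
For reference, the following is a self-contained sketch of the test's shape, not the code from the PR: the real test exercises OTLPExporterMixin, while this version uses a small stand-in exporter and mocked channel/client factories so it runs on its own.

    from unittest import TestCase
    from unittest.mock import MagicMock

    import grpc


    class _FakeRpcError(grpc.RpcError):
        """RpcError carrying a fixed status code, as the gRPC runtime would."""

        def __init__(self, code):
            super().__init__()
            self._code = code

        def code(self):
            return self._code


    class _ReconnectingExporter:
        """Stand-in for the exporter: retries once on UNAVAILABLE after
        re-creating the channel."""

        def __init__(self, channel_factory, client_factory):
            self._channel_factory = channel_factory
            self._client_factory = client_factory
            self._channel = channel_factory()
            self._client = client_factory(self._channel)

        def export(self, request):
            try:
                return self._client.Export(request)
            except grpc.RpcError as error:
                if error.code() != grpc.StatusCode.UNAVAILABLE:
                    raise
                # Drop the stale channel and retry on a fresh one.
                self._channel.close()
                self._channel = self._channel_factory()
                self._client = self._client_factory(self._channel)
                return self._client.Export(request)


    class TestUnavailableReconnects(TestCase):
        def test_unavailable_reconnects(self):
            # First client fails with UNAVAILABLE; the one created after the
            # reconnect succeeds.
            stale_client = MagicMock()
            stale_client.Export.side_effect = _FakeRpcError(
                grpc.StatusCode.UNAVAILABLE
            )
            fresh_client = MagicMock()
            fresh_client.Export.return_value = "ok"

            first_channel, second_channel = MagicMock(), MagicMock()
            channels = [first_channel, second_channel]
            clients = [stale_client, fresh_client]

            exporter = _ReconnectingExporter(
                channel_factory=lambda: channels.pop(0),
                client_factory=lambda channel: clients.pop(0),
            )

            result = exporter.export(request=MagicMock())

            self.assertEqual(result, "ok")
            # The stale channel must be closed and a new one created.
            first_channel.close.assert_called_once()
            self.assertEqual(len(channels), 0)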

Does This PR Require a Contrib Repo Change?

  • No.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@dheeraj-vanamala dheeraj-vanamala requested a review from a team as a code owner November 30, 2025 15:26
@linux-foundation-easycla

linux-foundation-easycla bot commented Nov 30, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: dheeraj-vanamala / name: Dheeraj Vanamala (436ecc9)

@dheeraj-vanamala dheeraj-vanamala force-pushed the issue-4517/grpc-reconnection branch from c670f77 to b7620d0 on November 30, 2025 16:00
@dheeraj-vanamala dheeraj-vanamala force-pushed the issue-4517/grpc-reconnection branch from b7620d0 to 436ecc9 on November 30, 2025 16:13
@dheeraj-vanamala
Author

dheeraj-vanamala commented Nov 30, 2025

I understand this issue is related to the upstream gRPC bug (grpc/grpc#38290).

I've analyzed that issue in depth, and the root cause appears to be a regression in the gRPC 'backup poller' (introduced in grpcio>=1.68.0), which fails to recover connections when the primary EventEngine is disabled (a common configuration in Python for fork safety).

While upstream fixes are being explored (e.g., grpc/grpc#38480), the issue has persisted for months, leaving exporters stuck in an UNAVAILABLE state indefinitely after collector restarts.

This PR implements a robust mitigation: detecting the persistent UNAVAILABLE state and forcing a channel re-initialization. This effectively resets the underlying poller state, allowing the exporter to recover immediately without requiring a full application restart. This approach provides stability for users while the complex upstream fix is finalized.
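
For anyone who wants to reproduce the failure mode, here is a minimal sketch (assuming an OTLP collector listening on localhost:4317; restart the collector while the loop runs). Without this PR, on affected grpcio versions the exports keep failing with UNAVAILABLE after the restart; with the channel re-initialization they recover.

    import time

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

    # Point the exporter at a local collector (insecure, default OTLP gRPC port).
    provider = TracerProvider()
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
    )
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer(__name__)

    # Emit spans in a loop; restart the collector mid-run and watch the export logs.
    while True:
        with tracer.start_as_current_span("heartbeat"):
            pass
        time.sleep(5)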
