@towseef41 towseef41 commented Nov 30, 2025

Description

Add configurable retries with exponential backoff and jitter to the Prometheus Remote Write exporter, so that transient 429/408/5xx responses and connection errors or timeouts no longer drop metrics silently. The README is updated to document the new retry knobs.

Fixes #3985

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • tox -e py311-test-exporter-prometheus-remote-write
  • tox -e lint-exporter-prometheus-remote-write

Does This PR Require a Core Repo Change?

  • Yes. - Link to PR:
  • No.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@towseef41 towseef41 requested a review from a team as a code owner November 30, 2025 11:23
linux-foundation-easycla bot commented Nov 30, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: towseef41 / name: Towseef Altaf (1121cb6)

Contributor

herin049 commented Dec 1, 2025

Can we not just use the builtin urllib3 retry functionality (see https://urllib3.readthedocs.io/en/stable/reference/urllib3.util.html#urllib3.util.Retry and https://requests.readthedocs.io/en/latest/user/advanced/#example-automatic-retries)

I find this approach preferable to rolling our own backoff-retry loop.

@towseef41
Author

Thanks for the suggestion. I’m good to move to a requests.Session + HTTPAdapter using urllib3.Retry (POST allowed), mapping our existing knobs, and I’ll add a tiny Retry subclass only to keep jitter/backoff cap. I’ll drop the manual loop and update tests.

For context, I initially considered a custom loop to keep full control over jitter/backoff cap and explicit logging and avoid relying on adapter/session setup, but I agree urllib3.Retry is battle-tested and clearer.

Contributor

herin049 commented Dec 1, 2025

Sounds good, we can get the opinion of other members as I might be in the minority here.

On a related note, I'm not sure sub-classing urllib3.Retry is necessary if you just want to bound the backoff delay; you can simply set backoff_jitter and backoff_max to something reasonable.
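
To illustrate what a backoff factor, a cap, and a jitter term do together, here is a generic sketch of capped exponential backoff with additive jitter. It is not code from this PR and not urllib3's implementation, whose calculation may differ in details. The function name `backoff_delay`, its `rng` parameter, and the default values are assumptions chosen for illustration; only the parameter names `backoff_factor`, `backoff_max`, and `backoff_jitter` mirror the options discussed above.

```python
import random


def backoff_delay(attempt, backoff_factor=0.5, backoff_max=30.0,
                  backoff_jitter=0.0, rng=random.random):
    """Return the delay in seconds before retry number `attempt` (1-based).

    Hypothetical helper: the delay doubles on each attempt, gets a random
    extra of up to `backoff_jitter` seconds, and is capped at `backoff_max`.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    delay = backoff_factor * (2 ** (attempt - 1))
    if backoff_jitter:
        # rng() returns a float in [0, 1), so the jitter is in [0, backoff_jitter).
        delay += rng() * backoff_jitter
    return max(0.0, min(backoff_max, delay))


# Without jitter the schedule is deterministic and flattens out at the cap.
schedule = [backoff_delay(n) for n in range(1, 9)]
```

With these illustrative defaults the delay doubles from 0.5 s until it reaches the 30 s cap, so a long outage cannot push the wait time up without bound.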

@towseef41
Author

Makes sense. I’ll switch to urllib3.Retry and avoid subclassing if possible: use a requests.Session + HTTPAdapter with Retry(total=..., backoff_factor=..., backoff_max=..., status_forcelist=..., allowed_methods={"POST"}). The requests-bundled urllib3 we have doesn’t expose backoff_jitter, so if we want jitter I’ll add the smallest possible override; otherwise I’ll stick to base Retry with a sensible backoff_max.
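
A minimal sketch of that wiring, assuming `requests` and urllib3 1.26 or newer are installed. This is not the code in this PR: the helper name `build_retrying_session`, its defaults, and the status list are placeholders chosen for illustration.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_retrying_session(max_retries=3, backoff_factor=0.5,
                           status_forcelist=(408, 429, 500, 502, 503, 504)):
    """Return a requests.Session that retries POSTs on transient failures.

    Hypothetical helper; the defaults are illustrative, not the exporter's.
    """
    retry = Retry(
        total=max_retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
        # POST is not retried by default, so it has to be allowed explicitly.
        allowed_methods={"POST"},
        # urllib3 2.x also accepts backoff_max=... and backoff_jitter=... here.
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Mount the adapter for both schemes so every request goes through Retry.
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


session = build_retrying_session()
```

A call such as `session.post(url, data=payload)` would then be retried by the adapter according to the `Retry` policy, with no manual loop in the exporter.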

Contributor

xrmx commented Dec 1, 2025

The OTLP exporters implement retries manually but, AFAICS, don't expose any tunables (e.g. _export in exporter/opentelemetry-exporter-otlp-proto-grpc/src/opentelemetry/exporter/otlp/proto/grpc/exporter.py from the opentelemetry-python repo). If you don't expose this mechanism, I think it's fine to reuse the HTTP library's code for that.

@towseef41
Author

@xrmx
Thanks for the pointers. I’m planning to reuse the HTTP stack’s retries instead of a custom loop: requests.Session + HTTPAdapter with urllib3.Retry (POST allowed, small status_forcelist). I kept a few knobs exposed (max retries, backoff factor/cap, status list) so users can tune if needed, but otherwise it follows the built-in behavior. If you’d rather keep it non-tunable (closer to OTLP) and just lean on Retry defaults, I can pare that back.

…entelemetry/exporter/prometheus_remote_write/__init__.py

Co-authored-by: Lukas Hering <40302054+herin049@users.noreply.github.com>
@towseef41 towseef41 requested a review from herin049 December 2, 2025 03:02
@xrmx xrmx moved this to Ready for review in @xrmx's Python PR digest Dec 5, 2025
Successfully merging this pull request may close these issues.

Add configurable retries with backoff to Prometheus Remote Write exporter
