
Conversation

dlqqq (Contributor) commented Dec 4, 2025

Description

  • Closes #27 (Excessive token use causes rate limit exceptions on Anthropic).
  • Enables ephemeral prompt caching by passing the required arguments to litellm.acompletion(); see the sketch after this list.
  • Decreases token usage by 85–95% on a small tool-calling invocation with ~10 messages in the chat history.
  • Makes the ChatLiteLLM provider significantly more type-safe; the _astream() method now annotates the type of every object it creates and uses.
  • Implements the LiteLLM <=> LangChain metrics integration required to surface cache metrics in LangSmith.
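
For context, here is a minimal sketch of what "passing the required arguments" can look like. It assumes LiteLLM's Anthropic-style prompt caching convention, where a `cache_control` marker of type `"ephemeral"` is attached to a large, stable content block (e.g. the system prompt). The model ID, messages, and `stream_options` usage below are placeholders and assumptions, not the exact code in this PR.

```python
import asyncio

import litellm


async def main() -> None:
    messages = [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are Jupyternaut, a helpful assistant in JupyterLab.",
                    # Mark the large, stable prefix as cacheable (ephemeral cache).
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Summarize the current notebook."},
    ]
    response = await litellm.acompletion(
        model="anthropic/claude-3-5-haiku-latest",  # placeholder model ID
        messages=messages,
        stream=True,
        # Assumption: request usage blocks in the stream so cache hit/write
        # token counts can be read back and forwarded to LangSmith.
        stream_options={"include_usage": True},
    )
    async for chunk in response:
        print(chunk)


asyncio.run(main())
```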

Demo

[Screenshot 2025-12-03 at 6:17:09 PM]

(low-resolution video because of GitHub's 10 MB file upload limit)

[Video: Screen.Recording.2025-12-03.at.5.53.59.PM-2.mov]

Minor "breaking" changes to the ChatLiteLLM provider

  • I have removed the _stream() method implementation to avoid code duplication. This can be easily re-implemented (without duplication) if needed in the future; the code comment there details how.

  • I needed to change the API of the _create_usage_metadata() helper function to provide the cache metrics in LangSmith and to improve its type safety. This means that every other "invocation" method except astream() (e.g. generate()) is likely broken, since each of them eventually calls this function. This should not have any impact on Jupyternaut, since we always call astream() anyway. A sketch of the cache-aware usage metadata follows below this list.
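
For context on the second bullet, here is a hedged sketch of how cache metrics can be surfaced through langchain-core's `UsageMetadata`, which is what LangSmith reads when rendering token and cache usage. The helper name and signature are illustrative assumptions, not the actual `_create_usage_metadata()` API in this PR.

```python
from langchain_core.messages.ai import UsageMetadata


def build_usage_metadata(
    prompt_tokens: int,
    completion_tokens: int,
    cache_read_tokens: int = 0,
    cache_creation_tokens: int = 0,
) -> UsageMetadata:
    """Map provider token counts onto LangChain's usage metadata shape.

    `input_token_details` carries the prompt-cache counters that LangSmith
    displays as cache reads/writes.
    """
    return UsageMetadata(
        input_tokens=prompt_tokens,
        output_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
        input_token_details={
            "cache_read": cache_read_tokens,
            "cache_creation": cache_creation_tokens,
        },
    )
```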

dlqqq added the enhancement label Dec 4, 2025
dlqqq changed the title from "Enable ephemeral prompt caching with LangServe metrics" to "Enable ephemeral prompt caching with LangSmith metrics" Dec 4, 2025
dlqqq (Contributor, Author) commented Dec 4, 2025

Correction: LangSmith is being used to provide the dashboard, not LangServe.

dlqqq requested a review from 3coins on December 4, 2025 at 19:48
dlqqq force-pushed the impl-prompt-caching branch from c3cd673 to 353850c on December 5, 2025 at 18:49
dlqqq (Contributor, Author) commented Dec 5, 2025

Rebased to include #29.

3coins (Contributor) commented Dec 5, 2025

@dlqqq
Thanks for submitting this change; prompt caching should be a huge improvement. I see the following errors while using Bedrock and Haiku 4.5. Here is the model ID I used: bedrock/global.anthropic.claude-haiku-4-5-20251001-v1:0

 litellm.exceptions.MidStreamFallbackError: litellm.ServiceUnavailableError: litellm.MidStreamFallbackError: litellm.BadRequestError: BedrockException - serviceUnavailableException {"message":"Bedrock is unable to process your request."} Original exception: BadRequestError: litellm.BadRequestError: BedrockException - serviceUnavailableException {"message":"Bedrock is unable to process your request."}
    During task with name 'model' and id '71a7d956-8d16-5ffa-e0d6-127818e2a03c'

I don't see this issue when I switch to main or when using Anthropic directly.

3coins (Contributor) commented Dec 5, 2025

It seems like this might be related to the newly added prompt caching args, which are either not being passed correctly or are missing some other config for Bedrock. Once I removed the prompt caching block, things seem to work.

3coins (Contributor) commented Dec 5, 2025

@dlqqq
Bedrock Converse with model ID bedrock/converse/global.anthropic.claude-haiku-4-5-20251001-v1:0 seems to work without any errors, so this is an issue with the invoke API only.

dlqqq (Contributor, Author) commented Dec 5, 2025

Thanks for catching this. Would it be sufficient to disable this feature if the model ID starts with bedrock/, but does not start with bedrock/converse/?
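
A minimal sketch of such a guard (hypothetical helper name; not part of the PR as submitted):

```python
def prompt_caching_supported(model_id: str) -> bool:
    # Disable prompt caching for Bedrock "invoke" model IDs (which failed with
    # the caching args in the testing above), while still allowing the Bedrock
    # Converse API and every other provider.
    if model_id.startswith("bedrock/") and not model_id.startswith("bedrock/converse/"):
        return False
    return True
```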

3coins (Contributor) left a review comment

Looks good!

dlqqq merged commit 1404b38 into jupyter-ai-contrib:main Dec 6, 2025
5 checks passed

Labels: enhancement (New feature or request)

Projects: None yet

Linked issues (may be closed by merging this pull request): Excessive token use causes rate limit exceptions on Anthropic (#27)

2 participants