Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue: traceable decorator errors crash application code #1306

Open
nikkogg opened this issue Dec 9, 2024 · 6 comments
Open

Issue: traceable decorator errors crash application code #1306

nikkogg opened this issue Dec 9, 2024 · 6 comments

Comments

@nikkogg
Copy link

nikkogg commented Dec 9, 2024

Issue you'd like to raise.

During the recent outage, we have noticed a rather weird behaviour:

we are using traceable decorator in our Python code, like so:

  @traceable(project_name=LANGSMITH_PROJECT_NAME, tags=[ENVIRONMENT_NAME])
  def search(
        self,
        query: str,
        user_id: str,
        community_id: str,
        company_id: str,
        start_date: str,
        end_date: str,
        available_communities: List[str],
    ) -> dict:

and during recent outage we have noticed, that error raised in the decorator, crashes the application code. I believe this should not be the case, as tracing should not impact availability of our services.

In our case error raised was 403 Forbidden returned by LangSmith API.

Thank you

Suggestion:

I think although important tracing is not imperative to the functionality of the application and error raised in tracing decorator should not take out the entire/service/endpoint/feature. Maybe a warning log, or a notification through email/langsmith UI would be a better option than crashing?

@hinthornw
Copy link
Collaborator

hinthornw commented Dec 9, 2024

@nikkogg thanks for raising! 100% agree that observability shouldn't impact your app execution. Could you confirm a few things with me?

  • What version of langsmith SDK you are using?
  • do you have any more details on how you're using langsmith in your application?
  • Are you making any calls directly with the Langsmith client? (like client.create_run or client.create_feedback, etc)

All the API calls in @traceable are done within a try/catch blog and in a background thread, which would still emit warning logs but not crash the main thread.

When I run any app I can create with no internet or with incorrect credentials and get a 403, the app runs as expected but warning logs are emitted, as expected.

Or perhaps @billpku, @viones1995,@jlcstephens, or @ukyen8, it seems from your reactions that you also are running into this? My apologies if so! Do you have more information?

@billpku
Copy link

billpku commented Dec 9, 2024

@hinthornw Thank you for your quick response. Let me provide details about our setup and solution:
Current Setup

  • LangSmith SDK Version: langsmith==0.1.140 (maintained for LangChain compatibility)
  • We use LangSmith SDK to track LangChain agent calls (handling multiple tools)

Initial Implementation

from langsmith import traceable

@traceable(
    project_name=LANGSMITH_PROJECT_NAME,
    tags=[ENVIRONMENT_NAME]
)
def search(self, query: str) -> dict:

Issue Identified
During local debugging, we discovered that when the LangSmith API URL is unreachable:
The connection attempts continue many times, causing service timeouts

Solution Implementation
We created a safer version of the traceable decorator that:

  • Prevents service blocking
  • Handles connection errors gracefully
  • Tests connectivity before tracing
from langsmith import Client, traceable
from langsmith.utils import Retry
from functools import wraps

SEARCH_LANGSMITH_CLIENT = Client(
    retry_config=Retry(total=1),  
)

def safe_traceable(project_name, tags, client):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                # Test LangSmith connectivity
                info = client.request_with_retries("GET", "/info", stop_after_attempt=1)
                logger.info(f"[SearchAgentInvoker] Langsmith info: {info}")

                return traceable(project_name=project_name, tags=tags, client=client)(func)(
                    *args, **kwargs
                )
            except Exception as e:
                logger.warning(
                    f"Langsmith tracing failed, continuing without tracing. Error: {str(e)}",
                    exc_info=True,
                )
                # Fallback to original function
                return func(*args, **kwargs)
        return wrapper
    return decorator

@safe_traceable(
    project_name=LANGSMITH_PROJECT_NAME,
    tags=[ENVIRONMENT_NAME],
    client=SEARCH_LANGSMITH_CLIENT
)
def search(self, query: str) -> dict:

Feel free to share any suggestions to help us ensure our LangSmith logging implementation is robust and efficient, with minimal impact on service performance and latency, thanks

@hinthornw
Copy link
Collaborator

@billpku what service is timing out in your case?

I wouldn't recommend the proposed wrapper, since that adds an unchached, blocking IO request in the main thread.

Tracing already occurs in a background thread, meaning that while it will finish up ingestion before the program exits, it shouldn't be adding noticeable latency or errors to your program's execution.

I'd also recommend updating your langsmith SDK version since we've made a number of serialization and speed improvements in the intermediating releases.

@billpku
Copy link

billpku commented Dec 10, 2024

@hinthornw Thank you for your feedback. Let me clarify our specific issue:

Connection Timeout Issue

We encountered blocking behavior when the LangSmith API URL is unreachable. Here's how we reproduced the issue using a mock API URL:

"level":"WARNING","location":"_post_batch_ingest_runs:1440","message":"Failed to batch ingest runs: langsmith.utils.LangSmithConnectionError: Connection error caused failure to POST https://api.smith.langchain1.com/runs/batch in LangSmith API. Please confirm your LANGCHAIN_ENDPOINT. ConnectionError(MaxRetryError('HTTPSConnectionPool(host=\'api.smith.langchain1.com\', port=443): Max retries exceeded with url: /runs/batch (Caused by NameResolutionError(\"<urllib3.connection.HTTPSConnection object at 0x15e247650>: Failed to resolve \'api.smith.langchain1.com\' ([Errno 8] nodename nor servname provided, or not known)\"))'))\nContent-Length: 1118\nAPI Key: lsv2_********************************************c5\npatch: trace=a4b7407d-c1a0-43a2-a952-xxxxx,id=a4b7407d-c1a0-43a2-a952-b9d47xxxx","timestamp":"2024-12-10 11:44:40,077+0000","service":"service_undefined","name":"langsmith.client"} 

Our Findings

  1. This behavior occurs in langsmith==0.1.140
  2. Interestingly, when testing with an incorrect LANGCHAIN_API_KEY, the trace doesn't block the function
  3. Currently, there is a issue for using langsmith==0.2.1 with langchain,here

We plan to upgrade to langsmith==0.2.1 if langchain support, then update new finding

@billpku
Copy link

billpku commented Dec 11, 2024

@hinthornw

We've made progress after updating our dependencies:

  1. Updated langchain-core to version 0.3.24 (adds compatibility with latest LangSmith, detail)
  2. Updated langsmith to version 0.2.1

After testing two scenarios:

  • Unreachable API URL
  • Invalid API key

Results: The traceable decorator no longer blocks function execution in either case.

Highly recommend upgrading to langsmith==0.2.1.
Thank you for your guidance!

@hinthornw
Copy link
Collaborator

Thank you for your patience!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants