Issue: traceable decorator errors crash application code #1306

nikkogg · 2024-12-09T09:30:21Z

Issue you'd like to raise.

During the recent outage, we have noticed a rather weird behaviour:

we are using traceable decorator in our Python code, like so:

  @traceable(project_name=LANGSMITH_PROJECT_NAME, tags=[ENVIRONMENT_NAME])
  def search(
        self,
        query: str,
        user_id: str,
        community_id: str,
        company_id: str,
        start_date: str,
        end_date: str,
        available_communities: List[str],
    ) -> dict:

and during recent outage we have noticed, that error raised in the decorator, crashes the application code. I believe this should not be the case, as tracing should not impact availability of our services.

In our case error raised was 403 Forbidden returned by LangSmith API.

Thank you

Suggestion:

I think although important tracing is not imperative to the functionality of the application and error raised in tracing decorator should not take out the entire/service/endpoint/feature. Maybe a warning log, or a notification through email/langsmith UI would be a better option than crashing?

The text was updated successfully, but these errors were encountered:

hinthornw · 2024-12-09T15:33:04Z

@nikkogg thanks for raising! 100% agree that observability shouldn't impact your app execution. Could you confirm a few things with me?

What version of langsmith SDK you are using?
do you have any more details on how you're using langsmith in your application?
Are you making any calls directly with the Langsmith client? (like client.create_run or client.create_feedback, etc)

All the API calls in @traceable are done within a try/catch blog and in a background thread, which would still emit warning logs but not crash the main thread.

When I run any app I can create with no internet or with incorrect credentials and get a 403, the app runs as expected but warning logs are emitted, as expected.

Or perhaps @billpku, @viones1995,@jlcstephens, or @ukyen8, it seems from your reactions that you also are running into this? My apologies if so! Do you have more information?

billpku · 2024-12-09T16:44:52Z

@hinthornw Thank you for your quick response. Let me provide details about our setup and solution:
Current Setup

LangSmith SDK Version: langsmith==0.1.140 (maintained for LangChain compatibility)
We use LangSmith SDK to track LangChain agent calls (handling multiple tools)

Initial Implementation

from langsmith import traceable

@traceable(
    project_name=LANGSMITH_PROJECT_NAME,
    tags=[ENVIRONMENT_NAME]
)
def search(self, query: str) -> dict:

Issue Identified
During local debugging, we discovered that when the LangSmith API URL is unreachable:
The connection attempts continue many times, causing service timeouts

Solution Implementation
We created a safer version of the traceable decorator that:

Prevents service blocking
Handles connection errors gracefully
Tests connectivity before tracing

from langsmith import Client, traceable
from langsmith.utils import Retry
from functools import wraps

SEARCH_LANGSMITH_CLIENT = Client(
    retry_config=Retry(total=1),  
)

def safe_traceable(project_name, tags, client):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                # Test LangSmith connectivity
                info = client.request_with_retries("GET", "/info", stop_after_attempt=1)
                logger.info(f"[SearchAgentInvoker] Langsmith info: {info}")

                return traceable(project_name=project_name, tags=tags, client=client)(func)(
                    *args, **kwargs
                )
            except Exception as e:
                logger.warning(
                    f"Langsmith tracing failed, continuing without tracing. Error: {str(e)}",
                    exc_info=True,
                )
                # Fallback to original function
                return func(*args, **kwargs)
        return wrapper
    return decorator

@safe_traceable(
    project_name=LANGSMITH_PROJECT_NAME,
    tags=[ENVIRONMENT_NAME],
    client=SEARCH_LANGSMITH_CLIENT
)
def search(self, query: str) -> dict:

Feel free to share any suggestions to help us ensure our LangSmith logging implementation is robust and efficient, with minimal impact on service performance and latency, thanks

hinthornw · 2024-12-09T18:47:49Z

@billpku what service is timing out in your case?

I wouldn't recommend the proposed wrapper, since that adds an unchached, blocking IO request in the main thread.

Tracing already occurs in a background thread, meaning that while it will finish up ingestion before the program exits, it shouldn't be adding noticeable latency or errors to your program's execution.

I'd also recommend updating your langsmith SDK version since we've made a number of serialization and speed improvements in the intermediating releases.

billpku · 2024-12-10T11:52:34Z

@hinthornw Thank you for your feedback. Let me clarify our specific issue:

Connection Timeout Issue

We encountered blocking behavior when the LangSmith API URL is unreachable. Here's how we reproduced the issue using a mock API URL:

"level":"WARNING","location":"_post_batch_ingest_runs:1440","message":"Failed to batch ingest runs: langsmith.utils.LangSmithConnectionError: Connection error caused failure to POST https://api.smith.langchain1.com/runs/batch in LangSmith API. Please confirm your LANGCHAIN_ENDPOINT. ConnectionError(MaxRetryError('HTTPSConnectionPool(host=\'api.smith.langchain1.com\', port=443): Max retries exceeded with url: /runs/batch (Caused by NameResolutionError(\"<urllib3.connection.HTTPSConnection object at 0x15e247650>: Failed to resolve \'api.smith.langchain1.com\' ([Errno 8] nodename nor servname provided, or not known)\"))'))\nContent-Length: 1118\nAPI Key: lsv2_********************************************c5\npatch: trace=a4b7407d-c1a0-43a2-a952-xxxxx,id=a4b7407d-c1a0-43a2-a952-b9d47xxxx","timestamp":"2024-12-10 11:44:40,077+0000","service":"service_undefined","name":"langsmith.client"}

Our Findings

This behavior occurs in langsmith==0.1.140
Interestingly, when testing with an incorrect LANGCHAIN_API_KEY, the trace doesn't block the function
Currently, there is a issue for using langsmith==0.2.1 with langchain,here

We plan to upgrade to langsmith==0.2.1 if langchain support, then update new finding

billpku · 2024-12-11T13:21:46Z

@hinthornw

We've made progress after updating our dependencies:

Updated langchain-core to version 0.3.24 (adds compatibility with latest LangSmith, detail)
Updated langsmith to version 0.2.1

After testing two scenarios:

Unreachable API URL
Invalid API key

Results: The traceable decorator no longer blocks function execution in either case.

Highly recommend upgrading to langsmith==0.2.1.
Thank you for your guidance!

hinthornw · 2024-12-14T17:40:44Z

Thank you for your patience!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue: traceable decorator errors crash application code #1306

Issue: traceable decorator errors crash application code #1306

nikkogg commented Dec 9, 2024 •

edited

Loading

hinthornw commented Dec 9, 2024 •

edited

Loading

billpku commented Dec 9, 2024

hinthornw commented Dec 9, 2024

billpku commented Dec 10, 2024 •

edited

Loading

billpku commented Dec 11, 2024

hinthornw commented Dec 14, 2024

Issue: traceable decorator errors crash application code #1306

Issue: traceable decorator errors crash application code #1306

Comments

nikkogg commented Dec 9, 2024 • edited Loading

Issue you'd like to raise.

Suggestion:

hinthornw commented Dec 9, 2024 • edited Loading

billpku commented Dec 9, 2024

hinthornw commented Dec 9, 2024

billpku commented Dec 10, 2024 • edited Loading

Connection Timeout Issue

Our Findings

billpku commented Dec 11, 2024

hinthornw commented Dec 14, 2024

nikkogg commented Dec 9, 2024 •

edited

Loading

hinthornw commented Dec 9, 2024 •

edited

Loading

billpku commented Dec 10, 2024 •

edited

Loading