Random 500 Error returned by Traefik #1926
Comments
Thanks for your interest in Traefik! Could you try with the latest version (1.3.5)? |
@ldez I am testing version 1.3.5 in preprod. I will update this issue with my results in a day. |
@ldez I tested with 1.3.5 and it broke all the WebSocket connections, which is related to the change in #1930, so I can't say for sure whether version 1.3.5 fixed my originally reported issue. I tried the patch mentioned in #1930 and the WebSocket issue is fixed, so I am going to apply the patch to the rest of my preprod stack and test. Do you have a timeline for when #1930 will be merged and released? |
Could you provide more logs? |
@ldez Are you looking for any specific events, so that I can filter for them? Below are some of the log events for which we received a 500 status code.
|
@karthimohan that's weird 🤔
[web.statistics]
RecentErrors = 10
# To enable Traefik to export internal metrics to Prometheus
[web.metrics.prometheus]
Buckets=[0.1,0.3,1.2,5.0]
[retry] |
@emilevauge I'm not sure how relevant removing those is, but I just made the suggested change in my preprod and am going to monitor it for a day and update here. |
I am seeing the
I ran a performance test with
Here are the results:
Notice the 99 timeouts; these are all
If I increase the connections it gets much worse, sometimes to the point where all requests fail. Are there settings that need to be tweaked for heavier load? I'm not sure whether this could be related to #1849. |
One reason why this can apparently happen is due to disappearing clients (see also golang/go#20071). Is there a chance that Traefik is running out of capacity or backends cannot keep up with the request rate causing clients to eventually lose patience and prematurely terminate the request to Traefik? |
I have the same problem, and it's really weird. Most of the backends behind Traefik are OK, but one of them always returns 500 with
I tried v1.5.0-rc3 and 1.4 stable, with the same results. |
We're experiencing the same issue: requests where the client disappears before a response could be sent are logged as errors. The problem is that it's really easy to reproduce. You can just load a site in your browser that takes 1s+ to load and hit reload multiple times before the page is fully rendered to "reissue" the request. The client has now disappeared for the first requests, and Traefik logs a 500 for each of them. I think it's debatable whether this should be reported as a status 500 error. IMHO it's not an error. But even if we classify it as an error, I don't think it should be reported as an "Internal Server Error". Our alerting system (based on Grafana + Prometheus) goes nuts if users are impatient and hit reload multiple times. |
Hey, we're experiencing the same issue. Any suggestions? |
I suggest starting Traefik with the following env var: |
I think the issue lies in the vulcand/oxy library. The default error handler does this:

    func (e *StdHandler) ServeHTTP(w http.ResponseWriter, req *http.Request, err error) {
        statusCode := http.StatusInternalServerError
        if e, ok := err.(net.Error); ok {
            if e.Timeout() {
                statusCode = http.StatusGatewayTimeout
            } else {
                statusCode = http.StatusBadGateway
            }
        } else if err == io.EOF {
            statusCode = http.StatusBadGateway
        }
        w.WriteHeader(statusCode)
        w.Write([]byte(http.StatusText(statusCode)))
        log.Debugf("'%d %s' caused by: %v", statusCode, http.StatusText(statusCode), err)
    }

So for a context-canceled error (context.Canceled), which is neither a net.Error nor io.EOF, it falls through to the default and writes a 500 status code, which is then caught by the metrics middleware in Traefik and reported as a 500. The question is: should this be fixed in Traefik by using a custom ErrorHandler, or in oxy? It drives our monitoring crazy, so I would like to help with a PR; I just want to know the preference for where and how to fix it. |
I want to confirm @mrnugget's findings. I'm able to reproduce this error when running siege and cancelling it midway. An example:

    echo "GET https://example.com" | vegeta attack -duration=600s -rate=10 | tee results.bin | vegeta report -reporter=json | jq

I'm able to reproduce the issue if I CTRL+C midway through the stress test (the same effect as constantly refreshing a page in the browser before the client has received a response). It doesn't seem to affect our users in any way; it feels more like a "false alarm" caused by how Traefik logs these errors. |
Submitted vulcand/oxy#155 with the fix to oxy, and confirmed that it no longer reports those false alarms using |
Closed by #3777. |
Report a bug
What version of Traefik are you using (traefik version)?
v1.3.0
What did you do?
We are using Traefik as a load balancer with the Marathon provider in production.
Below is our setup
Edge Proxy - Haproxy
Internal proxy Layer - Traefik
We use Haproxy as our edge proxy due to complex rewrites/redirects, and its backends are configured with the Traefik servers.
Also all our internal service to service communication happens via Traefik.
Here is our traffic flow
API Call --> Haproxy --> Traefik Servers --> Docker containers
Initially, when we introduced Traefik in prod, we only switched over the inter-service communication that used to go through internal Haproxy load balancers, and we did not see any issues.
So we decided to route the edge proxy traffic via Traefik instead of sending it directly to the app containers. We applied this change to only 10% of our traffic, to observe it before switching all traffic.
So out of 3 edge Haproxy servers, only one sends traffic via Traefik and the other 2 send directly to the app containers.
What did you expect to see?
No Change in Behavior
What did you see instead?
After we made this change, we started observing random 500 errors returned by the Haproxy server that sends traffic via the Traefik server. But we don't see that issue on the other proxy servers for similar API calls. We also don't see those failed requests reaching the backend app containers, as there are no events in their logs during this time.
Log Event from Haproxy
Log Event from Traefik
What is your environment & configuration (arguments, toml, provider, platform, ...)?
configuration
We use the Traefik health check, and below are the Marathon labels for one of the services where we see the failure
We are not able to identify a pattern. We were running the Traefik servers on c4.xlarge and changed the instance type to c4.2xlarge to check whether there was a network bandwidth or resource issue. We would appreciate your help in troubleshooting this issue.