This repository has been archived by the owner on Mar 5, 2024. It is now read-only.

IMDSv2 Requests Causing Context Cancelled #398

Open
dmizelle opened this issue Apr 29, 2020 · 26 comments

@dmizelle

dmizelle commented Apr 29, 2020

When a downstream deployment that uses a recent version of aws-java-sdk attempts to request credentials, it ends up using IMDSv2 (a PUT request to /latest/api/token), but it seems that the kiam-agent reports a 502 back to the client.

Eventually, the client falls back to requesting credentials the old (IMDSv1) way, but this adds about a second of extra latency for our applications.
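
For reference, the IMDSv2 handshake the SDK attempts is a two-step exchange: a PUT to /latest/api/token to obtain a session token, then ordinary metadata GETs with that token in a header. A minimal Go sketch of those two calls (illustrative only, not the SDK's actual client code):

package main

import (
    "fmt"
    "io"
    "net/http"
    "strings"
    "time"
)

func main() {
    client := &http.Client{Timeout: 2 * time.Second}

    // Step 1: PUT /latest/api/token with the desired token TTL.
    req, _ := http.NewRequest(http.MethodPut, "http://169.254.169.254/latest/api/token", nil)
    req.Header.Set("X-aws-ec2-metadata-token-ttl-seconds", "21600")
    resp, err := client.Do(req)
    if err != nil {
        panic(err)
    }
    raw, _ := io.ReadAll(resp.Body)
    resp.Body.Close()
    token := strings.TrimSpace(string(raw))

    // Step 2: use the token on a normal metadata request.
    req, _ = http.NewRequest(http.MethodGet, "http://169.254.169.254/latest/meta-data/iam/security-credentials/", nil)
    req.Header.Set("X-aws-ec2-metadata-token", token)
    resp, err = client.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    body, _ := io.ReadAll(resp.Body)
    fmt.Println(resp.Status, string(body))
}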

Looking through the logs on the kiam-agent side, I see the following:

2020/04/29 14:58:34 http: proxy error: context canceled
{"addr":"10.99.99.99:46952","duration":1001,"headers":{},"level":"info","method":"PUT","msg":"processed request","path":"/latest/api/token","status":502,"time":"2020-04-29T14:58:34Z"}

From the kiam-server side, I do see that it ends up logging that it has credentials:

{"level":"info","msg":"found role","pod.iam.role":"my-role-name-here","pod.ip":"10.99.99.99","time":"2020-04-29T14:58:35Z"}

We see this in our test environment as well, so I'm going to put KIAM into debug log level and enable gRPC logging.

@Joseph-Irving
Contributor

What version of Kiam are you using? Support for IMDSv2 was added in 3.6-rc1.

@dmizelle
Author

dmizelle commented Apr 29, 2020

👋 We are currently running 3.6-rc1. Sorry, forgot to include that 🙄, it's been a hell of a week 💤 😴

EDIT: Something else I noticed is that only a fraction of these requests actually show a 502 back to the client; most complete with a 200 and no message about "context canceled".

@Joseph-Irving
Contributor

The kiam agent just proxies any requests to the /api/token path straight to the actual instance metadata API; it doesn't forward them to the kiam server.
So this error indicates that the kiam agent went to the metadata API and received a bad response. I'm not entirely sure how to dig deeper into this; does the metadata API itself have any metrics we could query to see whether it responded with errors?
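
To make the failure mode concrete: the token path is essentially a plain reverse proxy in front of 169.254.169.254, and Go's httputil.ReverseProxy logs "http: proxy error: ..." and answers 502 whenever the round trip fails, including when the calling client cancels its own request. A minimal sketch (illustrative only, not the actual kiam agent code; the listen port is just an example):

package main

import (
    "log"
    "net/http"
    "net/http/httputil"
    "net/url"
)

func main() {
    // Forward everything to the real instance metadata API.
    target, err := url.Parse("http://169.254.169.254")
    if err != nil {
        log.Fatal(err)
    }
    proxy := httputil.NewSingleHostReverseProxy(target)

    // If the downstream client hangs up (its context is cancelled) before the
    // metadata API answers, ReverseProxy logs "http: proxy error: context canceled"
    // and writes a 502 back, which is what shows up in the agent logs above.
    log.Fatal(http.ListenAndServe(":8181", proxy))
}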

cc @rbvigilante

@rbvigilante
Contributor

This is interesting. We've been running 3.6-rc1 as well; I tested PUT /latest/api/token and it worked, but I'm not sure whether any of our teams have started using IMDSv2 in the meantime (so I haven't observed it handling IMDSv2 with a "production" volume of requests).

As I understand it, Kiam is proxying more or less 100% transparently now (not even setting X-Forwarded-For). I'm not sure what would cause the context to be cancelled - this issue suggests the only reason the ReverseProxy should be doing that is if the underlying connection to the client goes away.

@project0

Just hit the same issue with the latest official 3.6 release:

2020/07/29 10:34:02 http: proxy error: context canceled
{"addr":"10.112.48.16:51780","duration":1001,"headers":{},"level":"info","method":"PUT","msg":"processed request","path":"/latest/api/token","status":502,"time":"2020-07-29T10:34:02Z"}
2020/07/29 10:34:02 http: proxy error: context canceled
{"addr":"10.112.51.213:50932","duration":1001,"headers":{},"level":"info","method":"PUT","msg":"processed request","path":"/latest/api/token","status":502,"time":"2020-07-29T10:34:02Z"}
2020/07/29 10:34:03 http: proxy error: context canceled
{"addr":"10.112.44.69:52234","duration":1001,"headers":{},"level":"info","method":"PUT","msg":"processed request","path":"/latest/api/token","status":502,"time":"2020-07-29T10:34:03Z"}
2020/07/29 10:34:03 http: proxy error: context canceled
{"addr":"10.112.51.213:51920","duration":1000,"headers":{},"level":"info","method":"PUT","msg":"processed request","path":"/latest/api/token","status":502,"time":"2020-07-29T10:34:03Z"}
2020/07/29 10:34:04 http: proxy error: context canceled

@rbvigilante
Contributor

We're running the official 3.6 release too now; I'll see if I can reproduce this as soon as I get a chance (probably on Friday)

@kwizzn

kwizzn commented Jul 30, 2020

The duration in @project0's logs is consistently ~1s. Could this be some timeout issue, @rbvigilante?

@Joseph-Irving
Contributor

The proxy just uses the default transport in Go:
https://golang.org/pkg/net/http/#RoundTripper

var DefaultTransport RoundTripper = &Transport{
    Proxy: ProxyFromEnvironment,
    DialContext: (&net.Dialer{
        Timeout:   30 * time.Second,
        KeepAlive: 30 * time.Second,
        DualStack: true,
    }).DialContext,
    ForceAttemptHTTP2:     true,
    MaxIdleConns:          100,
    IdleConnTimeout:       90 * time.Second,
    TLSHandshakeTimeout:   10 * time.Second,
    ExpectContinueTimeout: 1 * time.Second,
}

The only default timeout that's 1 second is ExpectContinueTimeout:

    // ExpectContinueTimeout, if non-zero, specifies the amount of
    // time to wait for a server's first response headers after fully
    // writing the request headers if the request has an
    // "Expect: 100-continue" header. Zero means no timeout and
    // causes the body to be sent immediately, without
    // waiting for the server to approve.
    // This time does not include the time to send the request header.

So it might be worth investigating that; perhaps the AWS token API is a bit slow to respond?
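
If someone wants to rule that out, one experiment (a sketch only, there is no existing kiam flag for this, and the function names below are hypothetical) would be to give the proxy its own transport with ExpectContinueTimeout disabled and see whether the ~1s 502s go away. Note the token PUT carries no body, so an Expect: 100-continue header probably never gets sent in the first place, but the test is cheap:

package main

import (
    "net"
    "net/http"
    "net/http/httputil"
    "time"
)

// newProxyTransport mirrors the DefaultTransport settings but disables the 1s
// ExpectContinueTimeout, so we can tell whether that timeout is what cuts the
// token PUTs off at ~1001ms.
func newProxyTransport() *http.Transport {
    return &http.Transport{
        Proxy: http.ProxyFromEnvironment,
        DialContext: (&net.Dialer{
            Timeout:   30 * time.Second,
            KeepAlive: 30 * time.Second,
        }).DialContext,
        MaxIdleConns:          100,
        IdleConnTimeout:       90 * time.Second,
        TLSHandshakeTimeout:   10 * time.Second,
        ExpectContinueTimeout: 0, // zero: send the body immediately, never wait for 100-continue
    }
}

// useTransport would be wired in wherever the agent builds its ReverseProxy.
func useTransport(p *httputil.ReverseProxy) {
    p.Transport = newProxyTransport()
}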

@project0

project0 commented Jul 30, 2020

There seem to be similar problems within the SDKs: aws/aws-sdk-go#2972.
It contains an interesting link to an AWS blog post about the new IMDSv2:

With IMDSv2, setting the TTL value to “1” means that requests from the EC2 instance itself will work because they’re returned to the caller (on the instance) before the subtraction occurs. But if the EC2 instance has been misconfigured as an open router, layer 3 firewall, VPN, tunnel, or NAT device, the response containing the token will have its TTL reduced to zero before leaving the instance, and the packet containing the response will be discarded on its way out of the instance, preventing transport to the attacker. The information simply won’t make it further than the EC2 instance itself, which means that an attacker won’t get the response back with the token, and with it the ability to access instance metadata, even if they’ve been successful at getting past all other defenses.

I don't think it's the issue here (as we proxy through an SSL-protected gRPC connection), but it's good to know ;-)

@rbvigilante
Contributor

I'm still definitely able to get tokens out of the metadata API through Kiam 3.6. I'm still not sure if we have anything requesting IMDSv2 with any great volume internally; I'll try to find out.

@project0 or @dmizelle - would it be possible for one of you to try running something like curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600" from within a pod?

@project0

project0 commented Aug 3, 2020

@rbvigilante I just tested it within a container, but I was not able to reproduce the 502 that appears randomly in the logs:

while sleep 0.1; do curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600" -o /dev/null -w "%{http_code} - %{time_total}s" -s ; echo; done

Requests are usually super fast; in some rare cases I got >200ms or 1s.

Just to be clear, it seems to work in general. I just cannot find the correlation for why we get those timeouts in the logs. It looks like it only fails in really rare cases.

@rbvigilante
Contributor

Right - so the problem as you see it is intermittent. I wonder whether this is being caused by Kiam, or by the metadata API itself (or some combination of the metadata API being occasionally slow to respond and Kiam's proxy cutting the request off in that scenario).

It seems weird that the metadata API could get that slow, though. Hmm.

@ekristen

IMDSv2 introduces some breaking security features, including a max hop count that can be modified in the instance metadata options. By default it is now 1, so if Kiam is running in a container, IMDSv2 calls will FAIL; clients seem to fall back to IMDSv1 after a long period of time, but otherwise requests just hang, so this could be the reason the context is cancelled.
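
For anyone who would rather raise the limit than rely on fallback, the hop limit is per instance and can be changed via the modify-instance-metadata-options API. A sketch using aws-sdk-go (the instance ID is a placeholder):

package main

import (
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
    svc := ec2.New(session.Must(session.NewSession()))

    // Allow one extra network hop so containerised callers can still receive
    // the IMDSv2 token response. "i-0123456789abcdef0" is a placeholder.
    _, err := svc.ModifyInstanceMetadataOptions(&ec2.ModifyInstanceMetadataOptionsInput{
        InstanceId:              aws.String("i-0123456789abcdef0"),
        HttpPutResponseHopLimit: aws.Int64(2),
    })
    if err != nil {
        log.Fatal(err)
    }
}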

@rbvigilante
Contributor

Yeah, I thought the max hops thing might be a problem, but it seems like IMDSv2 works through Kiam most of the time, which seems to preclude that being the issue we're seeing here. I can confirm that I'm able to get a token from IMDSv2, then use that token to pull information out of the metadata service.

@gillg

gillg commented Aug 27, 2020

IMDSv2 introduces some breaking security features, including a max hop count that can be modified in the instance metadata options. By default it is now 1, so if Kiam is running in a container, IMDSv2 calls will FAIL; clients seem to fall back to IMDSv1 after a long period of time, but otherwise requests just hang, so this could be the reason the context is cancelled.

Tell me if I'm wrong, but if the server runs with "hostNetwork" it should be only one hop away, so it should get through with a TTL of 1?
It then stores the token and fetches credentials the same way as with IMDSv1 (plus the token in the headers).

@ekristen

That should work.

@rbvigilante
Contributor

Can confirm our Kiam servers run on the host network.

@gillg

gillg commented Aug 28, 2020

I made a simple test and I can confirm IMDSv2 works with hostNetwork:

kubectl run --rm test -n kube-system --image=python --overrides='{"kind":"Pod", "apiVersion":"v1", "spec": {"hostNetwork": true}}' -ti -- bash

TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"` \
> && curl -H "X-aws-ec2-metadata-token: $TOKEN" -v http://169.254.169.254/latest/meta-data/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    56  100    56    0     0   6222      0 --:--:-- --:--:-- --:--:--  6222
* Expire in 0 ms for 6 (transfer 0x55daef95af50)
*   Trying 169.254.169.254...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x55daef95af50)
* Connected to 169.254.169.254 (169.254.169.254) port 80 (#0)
> GET /latest/meta-data/ HTTP/1.1
> Host: 169.254.169.254
> User-Agent: curl/7.64.0
> Accept: */*
> X-aws-ec2-metadata-token: AQAEAOEtzQG3wYxxxxxxxxxxxxxxxxxxxMuD-RNftZ6Hpt1NSFlg==
> 
< HTTP/1.1 200 OK
< X-Aws-Ec2-Metadata-Token-Ttl-Seconds: 21600
< Content-Type: text/plain
< Accept-Ranges: none
< Last-Modified: Fri, 28 Aug 2020 07:45:14 GMT
< Content-Length: 290
< Date: Fri, 28 Aug 2020 07:51:19 GMT
< Server: EC2ws
< Connection: close
< 
ami-id
ami-launch-index
ami-manifest-path
block-device-mapping/
events/
hostname
iam/
identity-credentials/
instance-action
instance-id
instance-life-cycle
instance-type
local-hostname
local-ipv4
mac
metrics/
network/
placement/
profile
public-keys/
reservation-id
security-groups

In a non-"hostNetwork" pod, curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600" works too, even though kiam-agent captures the traffic. But I'm blocked on non-whitelisted URLs.

So I don't see any problems...?
Could you try my tests?

@gillg

gillg commented Oct 9, 2020

Hi,

I tried some things yesterday to improve security on my nodes.
I forced IMDSv2 only and a hop limit (TTL) of 1.
My previous test still works, so a container with hostNetwork can fetch metadata and a token.
But when the kiam server starts, it hits an issue:
{"level":"fatal","msg":"error creating listener: error detecting arn prefix: aws metadata api not available","time":"2020-10-08T20:48:33Z"}

It seems related to the native AWS SDK "ec2metadata" usage here: https://github.com/uswitch/kiam/blob/master/pkg/aws/sts/resolver_detect_arn.go#L46
So I don't really understand what the problem is.
I will try to tcpdump it today, but if you have any clues... I'll take them! :)
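
If it helps to narrow it down, the startup check can be reproduced roughly with the same ec2metadata package the resolver uses (a sketch, not kiam's actual code; whether it completes the IMDSv2 token handshake depends on the aws-sdk-go version vendored by kiam):

package main

import (
    "fmt"

    "github.com/aws/aws-sdk-go/aws/ec2metadata"
    "github.com/aws/aws-sdk-go/aws/session"
)

func main() {
    sess := session.Must(session.NewSession())
    md := ec2metadata.New(sess)

    // With HttpTokens=required on the instance, an SDK build that cannot do
    // the IMDSv2 handshake will report the metadata API as unavailable.
    fmt.Println("metadata available:", md.Available())

    if doc, err := md.GetInstanceIdentityDocument(); err == nil {
        fmt.Println("account:", doc.AccountID, "region:", doc.Region)
    } else {
        fmt.Println("identity document error:", err)
    }
}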

@gillg

gillg commented Oct 13, 2020

After some tests, it's strange... changing the max hops limit doesn't work any better!
But setting HttpTokens back to optional makes the server functional.
It's as if "ec2metadata" doesn't implement IMDSv2.

2020-10-13T13:41:33.221316411Z {"level":"info","msg":"starting server","time":"2020-10-13T13:41:33Z"}
2020-10-13T13:41:33.221345825Z {"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2020-10-13T13:41:33Z"}
2020-10-13T13:41:33.221639121Z {"level":"info","msg":"detecting arn prefix","time":"2020-10-13T13:41:33Z"}
2020-10-13T13:41:33.222549738Z {"level":"fatal","msg":"error creating listener: error detecting arn prefix: aws metadata api not available","time":"2020-10-13T13:41:33Z"}
2020-10-13T13:43:01.423584172Z {"level":"info","msg":"starting server","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.423628080Z {"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2020-10-13T13:43:01Z"}
##### HttpTokens becomes optional here
2020-10-13T13:43:01.423665965Z {"level":"info","msg":"detecting arn prefix","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.431974673Z {"level":"info","msg":"using detected prefix: arn:aws:iam::xxxxxxxxxxx:role/","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.432681058Z {"level":"info","msg":"will serve on 0.0.0.0:443","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.432695151Z {"level":"info","msg":"starting credential manager process 0","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.432700325Z {"level":"info","msg":"starting credential manager process 1","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.432713849Z {"level":"info","msg":"starting credential manager process 2","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.432717605Z {"level":"info","msg":"starting credential manager process 3","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.432769774Z {"level":"info","msg":"starting credential manager process 4","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.432775919Z {"level":"info","msg":"starting credential manager process 5","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.432779938Z {"level":"info","msg":"starting credential manager process 6","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.432783482Z {"level":"info","msg":"starting credential manager process 7","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.432787334Z {"level":"info","msg":"started cache controller","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.533074250Z {"level":"info","msg":"started namespace cache controller","time":"2020-10-13T13:43:01Z"}

@rehevkor5

rehevkor5 commented Jan 23, 2021

I'm also seeing 502 status responses for /latest/api/token when using Kiam 3.6, but the only reason I noticed it was because my app is experiencing high latency approximately every 55 minutes. This lasts for 2-5 minutes and then resolves again. Since my Kiam session duration is set to 60 minutes, the credential refresh happens about every 55 minutes. And the log messages in Kiam coincide with the latency problem as well. So Kiam seems to be the cause of this problem. My kiam-server is running with hostNetwork: true, so is it true that the hop limit is not relevant?

So the kiam agent just proxies any requests to the /api/token to the actual instance metadata api, it doesn't forward them to the kiam server.

Why is that? I thought, based on the description of the design, that Kiam is supposed to handle all the metadata service requests so that it can provide fast cached responses?

Does anyone have ideas of things I can try, to resolve the high latency issue? Or can I provide you with any additional info?

Edit: Manually setting the hop limit on the EKS nodes to 2 didn't make any difference.

@rbvigilante
Contributor

I'm also seeing 502 status responses for /latest/api/token when using Kiam 3.6, but the only reason I noticed it was because my app is experiencing high latency approximately every 55 minutes. This lasts for 2-5 minutes and then resolves again. Since my Kiam session duration is set to 60 minutes, the credential refresh happens about every 55 minutes.

We noticed a similar-sounding problem with a Java-based application run by one of our teams, where traffic to Kiam spiked really hard for a fifteen-minute period before the application's token expired, causing everything to slow down. This happened because the AWS SDK for Java is coded to request a new token fifteen minutes before the old one expires, but Kiam only invalidated its internal cache five minutes before. We fixed this by adding --session-refresh=15m to our Kiam command-line.

@rehevkor5

@rbvigilante I guess I'm unclear why this would suddenly begin to be a problem in Kiam 3.6, whereas 3.5 was working fine. Has Kiam's behavior for session refresh changed in 3.6?

I guess the behavior of the AWS SDK in that situation is to request new credentials from Kiam; Kiam gives back the cached credentials, but the SDK sees that the "new" credentials expire in <15m and therefore requests new credentials again? And this continues until Kiam generates new credentials? I'll try adjusting my --session-refresh and see if that helps. Thanks for the suggestion.

@kartik-moolya

So has setting the refresh interval fixed your issue, @rehevkor5? Or are there any other workarounds?

@rehevkor5

Yes, it did.

@kartik-moolya

Could you help me please? I see the same error that says context canceled.
I'm using the kiam Helm chart (version: kiam-5.9.0) and below are the configurations.

kiam-server:

    Command:
      /kiam
      server
    Args:
      --json-log
      --level=debug
      --bind=0.0.0.0:443
      --cert=/etc/kiam/tls/cert
      --key=/etc/kiam/tls/key
      --ca=/etc/kiam/tls/ca
      --role-base-arn-autodetect
      --session-duration=15m
      --sync=1m
      --prometheus-listen-addr=0.0.0.0:9620
      --prometheus-sync-interval=5s
      --session-refresh=5m

kiam-agent:

    Command:
      /kiam
      agent
    Args:
      --iptables
      --no-iptables-remove
      --host-interface=!eth0
      --json-log
      --level=debug
      --port=8181
      --cert=/etc/kiam/tls/cert
      --key=/etc/kiam/tls/key
      --ca=/etc/kiam/tls/ca
      --server-address=kiam-server:443
      --prometheus-listen-addr=0.0.0.0:9620
      --prometheus-sync-interval=5s
      --whitelist-route-regexp=^/latest/.*
      --gateway-timeout-creation=1s

I get intermittent errors saying context canceled in the agent logs,
and the Java application with AWS SDK 2.9 says software.amazon.awssdk.core.exception.SdkClientException: Unable to load credentials from service endpoint.

I tried setting session-refresh to 15m, but then the application threw a credential-expired error, after which I reduced it back to 5m.
