This repository has been archived by the owner on Mar 5, 2024. It is now read-only.

IMDSv2 Requests Causing Context Cancelled #398

Open
dmizelle opened this issue Apr 29, 2020 · 26 comments

@dmizelle

dmizelle commented Apr 29, 2020

When a downstream deployment that uses a recent version of aws-java-sdk attempts to request credentials, it ends up using IMDSv2 (a PUT request to /latest/api/token), but it seems that the kiam-agent reports a 502 back to the client.

Eventually, the client falls back to requesting credentials the old (IMDSv1) way, but this adds about a second of extra latency for our applications.
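
For reference, the IMDSv2 handshake the SDK attempts is a two-step exchange: a PUT to /latest/api/token to obtain a session token, then ordinary metadata GETs with that token in a header. A minimal Go sketch of those two calls (illustrative only, not the SDK's actual client code):

package main

import (
    "fmt"
    "io"
    "net/http"
    "strings"
    "time"
)

func main() {
    client := &http.Client{Timeout: 2 * time.Second}

    // Step 1: PUT /latest/api/token with the desired token TTL.
    req, _ := http.NewRequest(http.MethodPut, "http://169.254.169.254/latest/api/token", nil)
    req.Header.Set("X-aws-ec2-metadata-token-ttl-seconds", "21600")
    resp, err := client.Do(req)
    if err != nil {
        panic(err)
    }
    raw, _ := io.ReadAll(resp.Body)
    resp.Body.Close()
    token := strings.TrimSpace(string(raw))

    // Step 2: use the token on a normal metadata request.
    req, _ = http.NewRequest(http.MethodGet, "http://169.254.169.254/latest/meta-data/iam/security-credentials/", nil)
    req.Header.Set("X-aws-ec2-metadata-token", token)
    resp, err = client.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    body, _ := io.ReadAll(resp.Body)
    fmt.Println(resp.Status, string(body))
}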

Looking through the logs on the kiam-agent side, I see the following:

2020/04/29 14:58:34 http: proxy error: context canceled
{"addr":"10.99.99.99:46952","duration":1001,"headers":{},"level":"info","method":"PUT","msg":"processed request","path":"/latest/api/token","status":502,"time":"2020-04-29T14:58:34Z"}

From the kiam-server side, I do see that it ends up logging that it has credentials:

{"level":"info","msg":"found role","pod.iam.role":"my-role-name-here","pod.ip":"10.99.99.99","time":"2020-04-29T14:58:35Z"}

We see this in our test environment as well, so I'm going to put KIAM into debug log level and enable gRPC logging.

@Joseph-Irving
Contributor

What version of Kiam are you using? Support for IMDSv2 was added in 3.6-rc1.

@dmizelle
Author

dmizelle commented Apr 29, 2020

👋 We are currently running 3.6-rc1. Sorry, forgot to include that 🙄, it's been a hell of a week 💤 😴

EDIT: Something else I noticed is that only a fraction of these requests actually show a 502 back to the client; most complete with a 200 and no message about "context canceled".

@Joseph-Irving
Contributor

The kiam agent just proxies any requests to the /api/token path straight to the actual instance metadata API; it doesn't forward them to the kiam server.
So this error indicates that the kiam agent went to the metadata API and received a bad response. I'm not entirely sure how to dig deeper into this; does the metadata API itself have any metrics we could query to see whether it responded with errors?
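
To make the failure mode concrete: the token path is essentially a plain reverse proxy in front of 169.254.169.254, and Go's httputil.ReverseProxy logs "http: proxy error: ..." and answers 502 whenever the round trip fails, including when the calling client cancels its own request. A minimal sketch (illustrative only, not the actual kiam agent code; the listen port is just an example):

package main

import (
    "log"
    "net/http"
    "net/http/httputil"
    "net/url"
)

func main() {
    // Forward everything to the real instance metadata API.
    target, err := url.Parse("http://169.254.169.254")
    if err != nil {
        log.Fatal(err)
    }
    proxy := httputil.NewSingleHostReverseProxy(target)

    // If the downstream client hangs up (its context is cancelled) before the
    // metadata API answers, ReverseProxy logs "http: proxy error: context canceled"
    // and writes a 502 back, which is what shows up in the agent logs above.
    log.Fatal(http.ListenAndServe(":8181", proxy))
}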

cc @rbvigilante

@rbvigilante
Contributor

This is interesting. We've been running 3.6-rc1 as well; I tested PUT /latest/api/token and it worked, but I'm not sure whether any of our teams have started using IMDSv2 in the meantime (so I haven't observed it handling IMDSv2 with a "production" volume of requests).

As I understand it, Kiam is proxying more or less 100% transparently now (not even setting X-Forwarded-For). I'm not sure what would cause the context to be cancelled - this issue suggests the only reason the ReverseProxy should be doing that is if the underlying connection to the client goes away.

@project0

Just hit the same issue with the latest official 3.6 release:

2020/07/29 10:34:02 http: proxy error: context canceled
{"addr":"10.112.48.16:51780","duration":1001,"headers":{},"level":"info","method":"PUT","msg":"processed request","path":"/latest/api/token","status":502,"time":"2020-07-29T10:34:02Z"}
2020/07/29 10:34:02 http: proxy error: context canceled
{"addr":"10.112.51.213:50932","duration":1001,"headers":{},"level":"info","method":"PUT","msg":"processed request","path":"/latest/api/token","status":502,"time":"2020-07-29T10:34:02Z"}
2020/07/29 10:34:03 http: proxy error: context canceled
{"addr":"10.112.44.69:52234","duration":1001,"headers":{},"level":"info","method":"PUT","msg":"processed request","path":"/latest/api/token","status":502,"time":"2020-07-29T10:34:03Z"}
2020/07/29 10:34:03 http: proxy error: context canceled
{"addr":"10.112.51.213:51920","duration":1000,"headers":{},"level":"info","method":"PUT","msg":"processed request","path":"/latest/api/token","status":502,"time":"2020-07-29T10:34:03Z"}
2020/07/29 10:34:04 http: proxy error: context canceled

@rbvigilante
Contributor

We're running the official 3.6 release too now; I'll see if I can reproduce this as soon as I get a chance (probably on Friday)

@kwizzn

kwizzn commented Jul 30, 2020

The duration in @project0's logs is consistently ~1s. Could this be some timeout issue, @rbvigilante?

@Joseph-Irving
Contributor

The proxy just uses the default transport in Go:
https://golang.org/pkg/net/http/#RoundTripper

var DefaultTransport RoundTripper = &Transport{
    Proxy: ProxyFromEnvironment,
    DialContext: (&net.Dialer{
        Timeout:   30 * time.Second,
        KeepAlive: 30 * time.Second,
        DualStack: true,
    }).DialContext,
    ForceAttemptHTTP2:     true,
    MaxIdleConns:          100,
    IdleConnTimeout:       90 * time.Second,
    TLSHandshakeTimeout:   10 * time.Second,
    ExpectContinueTimeout: 1 * time.Second,
}

The only default timeout that's 1 second is ExpectContinueTimeout:

    // ExpectContinueTimeout, if non-zero, specifies the amount of
    // time to wait for a server's first response headers after fully
    // writing the request headers if the request has an
    // "Expect: 100-continue" header. Zero means no timeout and
    // causes the body to be sent immediately, without
    // waiting for the server to approve.
    // This time does not include the time to send the request header.

So it might be worth investigating that; perhaps the AWS token API is a bit slow to respond?
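
If someone wants to rule that out, one experiment (a sketch only, there is no existing kiam flag for this, and the function names below are hypothetical) would be to give the proxy its own transport with ExpectContinueTimeout disabled and see whether the ~1s 502s go away. Note the token PUT carries no body, so an Expect: 100-continue header probably never gets sent in the first place, but the test is cheap:

package main

import (
    "net"
    "net/http"
    "net/http/httputil"
    "time"
)

// newProxyTransport mirrors the DefaultTransport settings but disables the 1s
// ExpectContinueTimeout, so we can tell whether that timeout is what cuts the
// token PUTs off at ~1001ms.
func newProxyTransport() *http.Transport {
    return &http.Transport{
        Proxy: http.ProxyFromEnvironment,
        DialContext: (&net.Dialer{
            Timeout:   30 * time.Second,
            KeepAlive: 30 * time.Second,
        }).DialContext,
        MaxIdleConns:          100,
        IdleConnTimeout:       90 * time.Second,
        TLSHandshakeTimeout:   10 * time.Second,
        ExpectContinueTimeout: 0, // zero: send the body immediately, never wait for 100-continue
    }
}

// useTransport would be wired in wherever the agent builds its ReverseProxy.
func useTransport(p *httputil.ReverseProxy) {
    p.Transport = newProxyTransport()
}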

@project0

project0 commented Jul 30, 2020

There seem to be similar problems within the SDKs: aws/aws-sdk-go#2972.
It contains an interesting link to an AWS blog post about the new IMDSv2:

With IMDSv2, setting the TTL value to “1” means that requests from the EC2 instance itself will work because they’re returned to the caller (on the instance) before the subtraction occurs. But if the EC2 instance has been misconfigured as an open router, layer 3 firewall, VPN, tunnel, or NAT device, the response containing the token will have its TTL reduced to zero before leaving the instance, and the packet containing the response will be discarded on its way out of the instance, preventing transport to the attacker. The information simply won’t make it further than the EC2 instance itself, which means that an attacker won’t get the response back with the token, and with it the ability to access instance metadata, even if they’ve been successful at getting past all other defenses.

I don't think it's the issue here (as we proxy through an SSL-protected gRPC connection), but it's good to know ;-)

@rbvigilante
Contributor

I'm still definitely able to get tokens out of the metadata API through Kiam 3.6. I'm still not sure if we have anything requesting IMDSv2 with any great volume internally; I'll try to find out.

@project0 or @dmizelle - would it be possible for one of you to try running something like curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600" from within a pod?

@project0

project0 commented Aug 3, 2020

@rbvigilante I just tested it within a container, but I was not able to reproduce the 502 that appears randomly in the logs:

while sleep 0.1; do curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600" -o /dev/null -w "%{http_code} - %{time_total}s" -s ; echo; done

Requests are usually super fast; in some rare cases I got >200ms or 1s.

Just to be clear, it seems to work in general. I just cannot find the correlation for why we get those timeouts in the logs. It looks like it only fails in really rare cases.

@rbvigilante
Contributor

Right - so the problem as you see it is intermittent. I wonder whether this is being caused by Kiam, or by the metadata API itself (or some combination of the metadata API being occasionally slow to respond and Kiam's proxy cutting the request off in that scenario).

It seems weird that the metadata API could get that slow, though. Hmm.

@ekristen

IMDSv2 introduces some breaking security features, including a max hop count that can be modified in the instance metadata options. By default it is now 1, so if Kiam is running in a container, IMDSv2 calls will FAIL; clients seem to fall back to IMDSv1 after a long period of time, but otherwise requests just hang, so this could be the reason the context is cancelled.
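
For anyone who would rather raise the limit than rely on fallback, the hop limit is per instance and can be changed via the modify-instance-metadata-options API. A sketch using aws-sdk-go (the instance ID is a placeholder):

package main

import (
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
    svc := ec2.New(session.Must(session.NewSession()))

    // Allow one extra network hop so containerised callers can still receive
    // the IMDSv2 token response. "i-0123456789abcdef0" is a placeholder.
    _, err := svc.ModifyInstanceMetadataOptions(&ec2.ModifyInstanceMetadataOptionsInput{
        InstanceId:              aws.String("i-0123456789abcdef0"),
        HttpPutResponseHopLimit: aws.Int64(2),
    })
    if err != nil {
        log.Fatal(err)
    }
}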

@rbvigilante
Contributor

Yeah, I thought the max hops thing might be a problem, but it seems like IMDSv2 works through Kiam most of the time, which seems to preclude that being the issue we're seeing here. I can confirm that I'm able to get a token from IMDSv2, then use that token to pull information out of the metadata service.

@gillg

gillg commented Aug 27, 2020

IMDSv2 introduces some breaking security features, including a max hop count that can be modified in the instance metadata options. By default it is now 1, so if Kiam is running in a container, IMDSv2 calls will FAIL; clients seem to fall back to IMDSv1 after a long period of time, but otherwise requests just hang, so this could be the reason the context is cancelled.

Tell me if I'm wrong, but if the server runs with "hostNetwork" it should be only one hop away, so it should get through with a TTL of 1?
It then stores the token and fetches credentials the same way as with IMDSv1 (plus the token in the headers).

@ekristen

That should work.

@rbvigilante
Contributor

Can confirm our Kiam servers run on the host network.

@gillg

gillg commented Aug 28, 2020

I made a simple test and I can confirm IMDSv2 works with hostNetwork:

kubectl run --rm test -n kube-system --image=python --overrides='{"kind":"Pod", "apiVersion":"v1", "spec": {"hostNetwork": true}}' -ti -- bash

TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"` \
> && curl -H "X-aws-ec2-metadata-token: $TOKEN" -v http://169.254.169.254/latest/meta-data/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    56  100    56    0     0   6222      0 --:--:-- --:--:-- --:--:--  6222
* Expire in 0 ms for 6 (transfer 0x55daef95af50)
*   Trying 169.254.169.254...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x55daef95af50)
* Connected to 169.254.169.254 (169.254.169.254) port 80 (#0)
> GET /latest/meta-data/ HTTP/1.1
> Host: 169.254.169.254
> User-Agent: curl/7.64.0
> Accept: */*
> X-aws-ec2-metadata-token: AQAEAOEtzQG3wYxxxxxxxxxxxxxxxxxxxMuD-RNftZ6Hpt1NSFlg==
> 
< HTTP/1.1 200 OK
< X-Aws-Ec2-Metadata-Token-Ttl-Seconds: 21600
< Content-Type: text/plain
< Accept-Ranges: none
< Last-Modified: Fri, 28 Aug 2020 07:45:14 GMT
< Content-Length: 290
< Date: Fri, 28 Aug 2020 07:51:19 GMT
< Server: EC2ws
< Connection: close
< 
ami-id
ami-launch-index
ami-manifest-path
block-device-mapping/
events/
hostname
iam/
identity-credentials/
instance-action
instance-id
instance-life-cycle
instance-type
local-hostname
local-ipv4
mac
metrics/
network/
placement/
profile
public-keys/
reservation-id
security-groups

In a non-"hostNetwork" pod, curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600" works too, even though kiam-agent captures the traffic. But I'm blocked on non-whitelisted URLs.

So I don't see any problems...?
Could you try my tests?

@gillg

gillg commented Oct 9, 2020

Hi,

I tried some things yesterday to improve security on my nodes.
I forced IMDSv2 only and a hop limit (TTL) of 1.
My previous test still works, so a container with hostNetwork can fetch metadata and a token.
But when the kiam server starts, it hits an issue:
{"level":"fatal","msg":"error creating listener: error detecting arn prefix: aws metadata api not available","time":"2020-10-08T20:48:33Z"}

It seems related to the native AWS SDK "ec2metadata" usage here: https://github.com/uswitch/kiam/blob/master/pkg/aws/sts/resolver_detect_arn.go#L46
So I don't really understand what the problem is.
I will try to tcpdump it today, but if you have any clues... I'll take them! :)
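
If it helps to narrow it down, the startup check can be reproduced roughly with the same ec2metadata package the resolver uses (a sketch, not kiam's actual code; whether it completes the IMDSv2 token handshake depends on the aws-sdk-go version vendored by kiam):

package main

import (
    "fmt"

    "github.com/aws/aws-sdk-go/aws/ec2metadata"
    "github.com/aws/aws-sdk-go/aws/session"
)

func main() {
    sess := session.Must(session.NewSession())
    md := ec2metadata.New(sess)

    // With HttpTokens=required on the instance, an SDK build that cannot do
    // the IMDSv2 handshake will report the metadata API as unavailable.
    fmt.Println("metadata available:", md.Available())

    if doc, err := md.GetInstanceIdentityDocument(); err == nil {
        fmt.Println("account:", doc.AccountID, "region:", doc.Region)
    } else {
        fmt.Println("identity document error:", err)
    }
}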

@gillg

gillg commented Oct 13, 2020

After some tests, it's strange... changing the max hops limit doesn't work any better!
But setting HttpTokens back to optional makes the server functional.
It's as if "ec2metadata" doesn't implement IMDSv2.

2020-10-13T13:41:33.221316411Z {"level":"info","msg":"starting server","time":"2020-10-13T13:41:33Z"}
2020-10-13T13:41:33.221345825Z {"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2020-10-13T13:41:33Z"}
2020-10-13T13:41:33.221639121Z {"level":"info","msg":"detecting arn prefix","time":"2020-10-13T13:41:33Z"}
2020-10-13T13:41:33.222549738Z {"level":"fatal","msg":"error creating listener: error detecting arn prefix: aws metadata api not available","time":"2020-10-13T13:41:33Z"}
2020-10-13T13:43:01.423584172Z {"level":"info","msg":"starting server","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.423628080Z {"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2020-10-13T13:43:01Z"}
##### HttpTokens becomes optional here
2020-10-13T13:43:01.423665965Z {"level":"info","msg":"detecting arn prefix","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.431974673Z {"level":"info","msg":"using detected prefix: arn:aws:iam::xxxxxxxxxxx:role/","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.432681058Z {"level":"info","msg":"will serve on 0.0.0.0:443","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.432695151Z {"level":"info","msg":"starting credential manager process 0","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.432700325Z {"level":"info","msg":"starting credential manager process 1","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.432713849Z {"level":"info","msg":"starting credential manager process 2","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.432717605Z {"level":"info","msg":"starting credential manager process 3","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.432769774Z {"level":"info","msg":"starting credential manager process 4","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.432775919Z {"level":"info","msg":"starting credential manager process 5","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.432779938Z {"level":"info","msg":"starting credential manager process 6","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.432783482Z {"level":"info","msg":"starting credential manager process 7","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.432787334Z {"level":"info","msg":"started cache controller","time":"2020-10-13T13:43:01Z"}
2020-10-13T13:43:01.533074250Z {"level":"info","msg":"started namespace cache controller","time":"2020-10-13T13:43:01Z"}

@rehevkor5

rehevkor5 commented Jan 23, 2021

I'm also seeing 502 status responses for /latest/api/token when using Kiam 3.6, but the only reason I noticed it was because my app is experiencing high latency approximately every 55 minutes. This lasts for 2-5 minutes and then resolves again. Since my Kiam session duration is set to 60 minutes, the credential refresh happens about every 55 minutes. And the log messages in Kiam coincide with the latency problem as well. So Kiam seems to be the cause of this problem. My kiam-server is running with hostNetwork: true, so is it true that the hop limit is not relevant?

So the kiam agent just proxies any requests to the /api/token to the actual instance metadata api, it doesn't forward them to the kiam server.

Why is that? I thought, based on the description of the design, that Kiam is supposed to handle all the metadata service requests so that it can provide fast cached responses?

Does anyone have ideas of things I can try, to resolve the high latency issue? Or can I provide you with any additional info?

Edit: Manually setting the hop limit on the EKS nodes to 2 didn't make any difference.

@rbvigilante
Contributor

I'm also seeing 502 status responses for /latest/api/token when using Kiam 3.6, but the only reason I noticed it was because my app is experiencing high latency approximately every 55 minutes. This lasts for 2-5 minutes and then resolves again. Since my Kiam session duration is set to 60 minutes, the credential refresh happens about every 55 minutes.

We noticed a similar-sounding problem with a Java-based application run by one of our teams, where traffic to Kiam spiked really hard for a fifteen-minute period before the application's token expired, causing everything to slow down. This happened because the AWS SDK for Java is coded to request a new token fifteen minutes before the old one expires, but Kiam only invalidated its internal cache five minutes before. We fixed this by adding --session-refresh=15m to our Kiam command-line.

@rehevkor5

@rbvigilante I guess I'm unclear why this would suddenly begin to be a problem in Kiam 3.6, whereas 3.5 was working fine. Has Kiam's behavior for session refresh changed in 3.6?

I guess the behavior of the AWS SDK in that situation is to request new credentials from Kiam; Kiam gives back the cached credentials, but the SDK sees that the "new" credentials expire in <15m and therefore requests new credentials again? And this continues until Kiam generates new credentials? I'll try adjusting my --session-refresh and see if that helps. Thanks for the suggestion.

@kartik-moolya

So has setting the refresh interval fixed your issue, @rehevkor5? Or are there any other workarounds?

@rehevkor5

Yes, it did.

@kartik-moolya

Could you help me please? I see the same error that says context canceled.
I'm using the kiam Helm chart (version: kiam-5.9.0) and below are the configurations.

kiam-server:

    Command:
      /kiam
      server
    Args:
      --json-log
      --level=debug
      --bind=0.0.0.0:443
      --cert=/etc/kiam/tls/cert
      --key=/etc/kiam/tls/key
      --ca=/etc/kiam/tls/ca
      --role-base-arn-autodetect
      --session-duration=15m
      --sync=1m
      --prometheus-listen-addr=0.0.0.0:9620
      --prometheus-sync-interval=5s
      --session-refresh=5m

kiam-agent:

    Command:
      /kiam
      agent
    Args:
      --iptables
      --no-iptables-remove
      --host-interface=!eth0
      --json-log
      --level=debug
      --port=8181
      --cert=/etc/kiam/tls/cert
      --key=/etc/kiam/tls/key
      --ca=/etc/kiam/tls/ca
      --server-address=kiam-server:443
      --prometheus-listen-addr=0.0.0.0:9620
      --prometheus-sync-interval=5s
      --whitelist-route-regexp=^/latest/.*
      --gateway-timeout-creation=1s

I get intermittent errors saying context canceled in the agent logs,
and the Java application with AWS SDK 2.9 says software.amazon.awssdk.core.exception.SdkClientException: Unable to load credentials from service endpoint.

I tried setting session-refresh to 15m, but then the application threw a credential-expired error, after which I reduced it back to 5m.
