Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check storage class #98

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft

Check storage class #98

wants to merge 2 commits into from

Conversation

colearendt
Copy link
Contributor

@colearendt colearendt commented Dec 1, 2021

Close #90

The motivation: if a user deploys our chart directly onto an EKS cluster setting homeStorage.create=true, they will get a not functioning server, a hard to debug situation (have to see that PVC provisioning failed and why), and a hard to fix situation (have to basically delete everything and start over once you have a storage class that supports ReadWriteMany or have set accessModes properly - both of which are immutable b/c PVCs cannot be changed once created).

It would be much nicer to give the user an error at install time when they go to use this storage class.

Unfortunately, Kubernetes does not "store" whether a given storage class supports a given access mode. This is documented here, but not able to be "looked up" in the Kbuernetes API

As a result, I do a name lookup for ones that we know cause trouble (i.e. gp2), as well as provisioners that we know cause trouble (i.e. kubernetes.io/aws-ebs).

It's possible that these storage classes change over time, and with different renamings / etc., it is impossible that we catch everything. However, I have seen enough users get tripped up by this to think that it would do us well to catch as many as we can.

Thoughts?

Examples:

# success
helm upgrade --install rsw-4 ./charts/rstudio-workbench --set homeStorage.storageClassName=gp2 --set homeStorage.create=true --set 'homeStorage.accessModes[0]'=ReadWriteOnce
Release "rsw-4" has been upgraded. Happy Helming!


# fail b/c specified gp2
helm upgrade --install rsw-4 ./charts/rstudio-workbench --set homeStorage.storageClassName=gp2 --set homeStorage.create=true
Error: UPGRADE FAILED: template: rstudio-workbench/templates/NOTES.txt:44:3: executing "rstudio-workbench/templates/NOTES.txt" at <include "rstudio-library.check-storage-class" (dict "accessModes" .Values.homeStorage.accessModes "storageClassName" .Values.homeStorage.storageClassName "key" "homeStorage" "check" true)>: error calling include: template: rstudio-workbench/templates/_helpers.tpl:439:12: executing "rstudio-library.check-storage-class" at <fail (printf "\n\nThe storage class '%s'%s is known to not support 'ReadWriteMany' PersistentVolumeClaims. This is important for proper disk provisioning. \nFix by setting `%s.accessModes` appropriately, changing `%s.storageClassName` or disabling this check with `%s.checkAccessModes: false`\n\n" $storageClassName $isDefault $key $key $key)>: error calling fail: HELM_ERR_START

The storage class 'gp2' is known to not support 'ReadWriteMany' PersistentVolumeClaims. This is important for proper disk provisioning. 
Fix by setting `homeStorage.accessModes` appropriately, changing `homeStorage.storageClassName` or disabling this check with `homeStorage.checkAccessModes: false`

HELM_ERR_END

# fail b/c default is gp2
✗ helm upgrade --install rsw-4 ./charts/rstudio-workbench  --set homeStorage.create=true
Error: UPGRADE FAILED: template: rstudio-workbench/templates/NOTES.txt:42:3: executing "rstudio-workbench/templates/NOTES.txt" at <include "rstudio-library.check-storage-class" .>: error calling include: template: rstudio-workbench/templates/_helpers.tpl:425:10: executing "rstudio-library.check-storage-class" at <fail (printf "\n\nThe storage class '%s'%s is known to not support ReadWriteMany PersistentVolumeClaims. \nFix by setting `accessModes` appropriately, changing `storageClassName` or disabling this check with `checkAccessModes: false\n\n`" $storageClassName $isDefault)>: error calling fail: HELM_ERR_START

The storage class 'gp2' (the default on this cluster) is known to not support 'ReadWriteMany' PersistentVolumeClaims. 
Fix by setting `homeStorage.accessModes` appropriately, changing `homeStorage.storageClassName` or disabling this check with `homeStorage.checkAccessModes: false`

HELM_ERR_END


EDIT: Plan is to refactor into the rstudio-library chart, although only rolling out for rstudio-workbench for now, so that we can test a bit before integrating with Connect and RSPM.

(i.e. errors if `ReadWriteMany` is asked for, but using a known bad class like `gp2`)
@colearendt colearendt requested a review from jonyoder December 1, 2021 19:53
@atheriel
Copy link
Contributor

atheriel commented Dec 1, 2021

It's hard for me to give a credible line-by-line review of the YAML changes, but I'll make the following comments:

  • A static list of bad storage classes is not very elegant but it's fairly maintainable, in my view. Kubernetes providers don't introduce/change storage classes that often, and we could remove this workaround in favour of a more robust solution in the future without affecting users.

  • Consider adding AzureDisk here, it's the default storage class on Azure Kubernetes Service but does not support RWX -- a fact that bit me personally when I used to run AKS clusters.

  • Edit: We could also review what other Helm charts are doing about this problem, possibly with a few careful GitHub searches.

As an aside: this is a really painful example of Helm's limitations. It's not really a programming language, so when you want do stuff like this it ends up being really opaque.

Copy link
Contributor

@jonyoder jonyoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whew! I can tell I've been away from Helm templates for a while. This looks good to me, although I really have very little knowledge in this arena.

@atheriel
Copy link
Contributor

atheriel commented Dec 1, 2021

As a follow up: I looked through the (deprecated, but still useful) helm/stable repo to see if I could find patterns for solving this. What I found is that many prominent charts -- wordpress, airflow, minio, sentry -- have exactly this requirement/issue but seem to solve it by having a large warning in the README telling users to make sure they use a RWX provisioner.

This makes me think there isn't an obvious, accepted solution in the Helm community for this problem, so that's even more of an argument for this hack.

@colearendt
Copy link
Contributor Author

colearendt commented Dec 1, 2021

Thanks for the awesome feedback, y'all! @atheriel do you know off-hand what the exact / default metadata.name of an AzureDisk is? Or what the provisioner is? I wanted to do AzureDisk too but wasn't sure what it was called and didn't want to spin up an AKS cluster to find out 🙈 (i.e. AWSElasticBlockStore is called gp2 😑 )

Also TOTALLY agree about helm's limitations 😢 It wants to be a programming language. I feel like I always just want a map function... and end up with a weird crazy range instead 🤣

EDIT: also - nice find on the helm chart examples!! We have such a warning and have noticed that nobody reads it 🤣

@colearendt colearendt marked this pull request as draft December 1, 2021 21:03
@atheriel
Copy link
Contributor

atheriel commented Dec 2, 2021

From the linked docs in my comment it appears the provisioner is kubernetes.io/azure-disk, and from my recollection the metadata.name is azuredisk. But I'm hazy on both. Microsoft has really good AKS docs, I'm sure we can find a tutorial with the details instead of spinning up a cluster.

@colearendt colearendt mentioned this pull request Dec 7, 2021
@d-m
Copy link
Contributor

d-m commented Dec 18, 2021

Could this check be done with a pre-install, pre-upgrade Helm hook that uses kubectl to attempt to create a test pvc? The prometheus-operator project gets creative with a helm hook that uses kubectl to clean up CRDs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Is there a way to catch storage class access modes at deployment time?
4 participants