Metrics collector fails to create watcher #2434
Thanks for creating this issue @gigabyte132!
/remove-label lifecycle/needs-triage
I can't reproduce this with https://github.com/kubeflow/katib/blob/master/examples/v1beta1/nas/enas-cpu.yaml. The experiment completed successfully in my environment.
@gigabyte132 For your reference, my setup environment is:
Kubernetes version:
Katib controller version:
Katib Python SDK version:
Maybe you can upgrade the version of Katib in your environment.
One other thing I was curious about is whether @gigabyte132 saw this error only for the enas-cpu example or for other experiments as well.
@tariq-hasan for me, any type of experiment that uses the StdOut metrics collector throws this error.
Unfortunately, the same thing happens. I used the following program to see what error is logged when it is run in the experiment container:

```go
package main

import (
	"fmt"

	"github.com/fsnotify/fsnotify"
)

func main() {
	_, err := fsnotify.NewWatcher()
	if err != nil {
		fmt.Printf("failed to create Watcher %v", err)
	}
}
```

As with the metrics collector container, it failed. The error message was:
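For anyone debugging the same failure: a common reason for fsnotify.NewWatcher to fail on Linux is hitting the kernel's inotify limits (this is an assumption about the cause, not something confirmed in this issue). Below is a minimal sketch that prints the relevant sysctls from inside the same container; if max_user_instances is exhausted on the node, inotify_init returns EMFILE, which surfaces as a "too many open files" error.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// fsnotify uses inotify on Linux; these sysctls bound how many inotify
	// instances and watches a user may create. If max_user_instances is
	// exhausted, inotify_init fails with EMFILE ("too many open files"),
	// which is what fsnotify.NewWatcher would then report.
	for _, path := range []string{
		"/proc/sys/fs/inotify/max_user_instances",
		"/proc/sys/fs/inotify/max_user_watches",
	} {
		data, err := os.ReadFile(path)
		if err != nil {
			fmt.Printf("%s: %v\n", path, err)
			continue
		}
		fmt.Printf("%s = %s\n", path, strings.TrimSpace(string(data)))
	}
}
```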
What happened?
I tried running the example enas-cpu experiment with a StdOut collector, and the experiment fails to run due to an error in the metrics-collector container. There is a related issue to this: #1769. Since then Katib has migrated from the hpcloud library to nxadm for tailing, but it seems like I ran into the exact same issue regardless. This is with the following version of the metrics collector image, https://hub.docker.com/layers/kubeflowkatib/file-metrics-collector/v1beta1-867c40a/images/sha256-3ab68e0932dd6c2028592dd7a7443ba4970e54f91ab145d6d35828112780eb0a?context=explore, as the change to nxadm wasn't included in the 0.17 release. I have tried both the 0.16 and 0.17 images as well, but the result was the same. I haven't had time to debug this more in depth (e.g. building my own image with extra logs).
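For context only (not part of the original report): the nxadm/tail library mentioned above exposes a small API for following a file, and it only needs an fsnotify watcher when it is not in polling mode. Below is a minimal, hypothetical sketch of that usage; the file path is a placeholder, and Poll: true is shown purely as an illustration of the polling fallback, not as a confirmed fix for this issue.

```go
package main

import (
	"fmt"
	"log"

	"github.com/nxadm/tail"
)

func main() {
	// Follow a log file roughly the way a metrics collector would.
	// "/tmp/metrics.log" is a placeholder path, not Katib's configuration.
	// Poll: true makes the library poll the file instead of creating an
	// inotify watcher (shown here only as a hypothetical workaround).
	t, err := tail.TailFile("/tmp/metrics.log", tail.Config{
		Follow: true,
		ReOpen: true,
		Poll:   true,
	})
	if err != nil {
		log.Fatalf("failed to tail file: %v", err)
	}
	for line := range t.Lines {
		fmt.Println(line.Text)
	}
}
```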
What did you expect to happen?
The metrics collector should work normally. I have tried using the File metrics collector and things seem to be fine, although I haven't managed to get a Katib Experiment of any kind working with the StdOut one.
Environment
Kubernetes version:
Katib controller version:
$ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"
kubeflow/kubeflowkatib/katib-controller:v0.16.0
Katib Python SDK version:
Impacted by this bug?
Give it a 👍. We prioritize the issues with the most 👍.