
Metrics collector fails to create watcher #2434

Open
gigabyte132 opened this issue Sep 25, 2024 · 5 comments

Comments

@gigabyte132

gigabyte132 commented Sep 25, 2024

What happened?

I tried running the example enas-cpu experiment with a StdOut collector, and the experiment fails to run due to an error in the metrics-collector container:

2024/09/24 13:19:08 FATAL -- failed to create Watcher
goroutine 18 [running]:
runtime/debug.Stack()
        /usr/local/go/src/runtime/debug/stack.go:26 +0x5e
github.com/nxadm/tail/util.Fatal({0xe14a9b?, 0xc000282000?}, {0x0, 0x0, 0x0})
        /go/pkg/mod/github.com/nxadm/tail@<version>/util/util.go:23 +0x8b
github.com/nxadm/tail/watch.(*InotifyTracker).run(0xc0002b6000)
        /go/pkg/mod/github.com/nxadm/tail@<version>/watch/inotify_tracker.go:220 +0x68
created by github.com/nxadm/tail/watch.init.func1 in goroutine 17
        /go/pkg/mod/github.com/nxadm/tail@<version>/watch/inotify_tracker.go:55 +0x14e

There is a related issue to this, #1769. Since then, Katib has migrated from the hpcloud library to nxadm for tailing, but I ran into the exact same issue regardless. Because the change to nxadm wasn't included in the 0.17 release, I used this version of the metrics collector image: https://hub.docker.com/layers/kubeflowkatib/file-metrics-collector/v1beta1-867c40a/images/sha256-3ab68e0932dd6c2028592dd7a7443ba4970e54f91ab145d6d35828112780eb0a?context=explore. I have also tried the 0.16 and 0.17 images, but the result was the same. I haven't had time to debug this in more depth yet (e.g. building my own image with extra logs).

What did you expect to happen?

The metrics collector should work normally. I have tried the File metrics collector and things seem to be fine, but I haven't managed to get any kind of Katib Experiment working with the StdOut one.

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.29.6
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.2

Katib controller version:

$ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"
kubeflow/kubeflowkatib/katib-controller:v0.16.0

Katib Python SDK version:

$ pip show kubeflow-katib
Name: kubeflow-katib
Version: 0.17.0

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

@andreyvelich
Member

Thanks for creating this issue @gigabyte132!
@tariq-hasan @Electronic-Waste Can you please help us explore this issue?

/remove-label lifecycle/needs-triage
/area backend

@Electronic-Waste
Member

I can't reproduce this with https://github.com/kubeflow/katib/blob/master/examples/v1beta1/nas/enas-cpu.yaml

The experiment completed successfully in my environment:

$ kubectl get experiment -n kubeflow 
NAME       TYPE        STATUS   AGE
enas-cpu   Succeeded   True     27m

@gigabyte132 For your reference, my environment setup is:

Kubernetes version:

$ kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.1

Katib controller version:

$ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"
docker.io/kubeflowkatib/katib-controller:latest

Katib Python SDK version:

$ pip show kubeflow-katib
Name: kubeflow-katib
Version: 0.17.0

Maybe you can upgrade katib-controller to the latest version and try again?
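For example, one way to do that (a sketch assuming the default Katib manifests, where both the Deployment and its container are named katib-controller):

$ kubectl -n kubeflow set image deployment/katib-controller \
    katib-controller=docker.io/kubeflowkatib/katib-controller:latest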

cc @andreyvelich @tariq-hasan

@tariq-hasan
Contributor

One other thing I'm curious about: did @gigabyte132 see this error only for the enas-cpu experiment, or also for other experiments such as darts-cpu and file-metrics-collector?

@gigabyte132
Author

gigabyte132 commented Oct 1, 2024

@tariq-hasan For me, any type of experiment that uses the file-metrics-collector fails with this error.

@hahahannes

Unfortunately, neither nxadm nor hpcloud logs the underlying error when creating the fsnotify.Watcher fails.
I used a custom Go image to run

package main

import (
	"fmt"

	"github.com/fsnotify/fsnotify"
)

func main() {
	// Create an inotify-backed watcher, just like the tail library does,
	// and print the underlying error instead of swallowing it.
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		fmt.Printf("failed to create Watcher: %v\n", err)
		return
	}
	defer watcher.Close()
}

to see what error is logged when this is run in the experiment container. Just like the metrics container, it failed; the error message was "failed to create Watcher: too many open files". This is really strange, as in my case there were fewer open files than the limit (checked with lsof and ulimit -n).
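For anyone debugging the same symptom: fsnotify.NewWatcher creates an inotify instance under the hood, and inotify has its own per-user kernel limits that are separate from the file-descriptor limit that ulimit -n reports. When the per-user instance limit is exhausted, inotify_init fails with EMFILE, which Go renders as exactly this "too many open files" message even though few file descriptors are open. Since the limit is per real UID, containers on the same node that run as the same UID typically share it. A quick check on the node (assuming a Linux host with procfs; the value 512 below is illustrative, not a recommendation):

$ cat /proc/sys/fs/inotify/max_user_instances   # one instance per fsnotify.Watcher
$ cat /proc/sys/fs/inotify/max_user_watches     # watched paths across all instances
$ sudo sysctl fs.inotify.max_user_instances=512 # raise the limit if it is exhausted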
