TypeError: unary_unary() got an unexpected keyword argument '_registered_method' #2427

Open
Electronic-Waste opened this issue Sep 10, 2024 · 10 comments
Labels
good first issue Good for newcomers help wanted Extra attention is needed kind/bug

Comments

@Electronic-Waste
Member

What happened?

When I run the following script:

import kubeflow.katib as katib

def train_mnist_model(parameters):
    import tensorflow as tf
    import kubeflow.katib as katib
    import numpy as np
    import logging

    logging.basicConfig(
        format="%(asctime)s %(levelname)-8s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%SZ",
        level=logging.INFO,
    )
    logging.info("--------------------------------------------------------------------------------------")
    logging.info(f"Input Parameters: {parameters}")
    logging.info("--------------------------------------------------------------------------------------\n\n")


    # Get HyperParameters from the input params dict.
    lr = float(parameters["lr"])
    num_epoch = int(parameters["num_epoch"])

    # Set dist parameters and strategy.
    is_dist = parameters["is_dist"]
    num_workers = parameters["num_workers"]
    batch_size_per_worker = 64
    batch_size_global = batch_size_per_worker * num_workers
    strategy = tf.distribute.MultiWorkerMirroredStrategy(
        communication_options=tf.distribute.experimental.CommunicationOptions(
            implementation=tf.distribute.experimental.CollectiveCommunication.RING
        )
    )

    # Callback class for logging training.
    # Katib parses metrics in this format: <metric-name>=<metric-value>.
    class CustomCallback(tf.keras.callbacks.Callback):
        def on_epoch_end(self, epoch, logs=None):
            katib.report_metrics({
                "accuracy": logs["accuracy"],
                "loss": logs["loss"],
            })
            

    # Prepare MNIST Dataset.
    def mnist_dataset(batch_size):
        (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
        x_train = x_train / np.float32(255)
        y_train = y_train.astype(np.int64)
        train_dataset = (
            tf.data.Dataset.from_tensor_slices((x_train, y_train))
            .shuffle(60000)
            .repeat()
            .batch(batch_size)
        )
        return train_dataset

    # Build and compile CNN Model.
    def build_and_compile_cnn_model():
        model = tf.keras.Sequential(
            [
                tf.keras.layers.InputLayer(input_shape=(28, 28)),
                tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
                tf.keras.layers.Conv2D(32, 3, activation="relu"),
                tf.keras.layers.Flatten(),
                tf.keras.layers.Dense(128, activation="relu"),
                tf.keras.layers.Dense(10),
            ]
        )
        model.compile(
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
            metrics=["accuracy"],
        )
        return model
    
    # Download Dataset.
    dataset = mnist_dataset(batch_size_global)

    # For dist strategy we should build model under scope().
    if is_dist:
        logging.info("Running Distributed Training")
        logging.info("--------------------------------------------------------------------------------------\n\n")
        with strategy.scope():
            model = build_and_compile_cnn_model()
    else:
        logging.info("Running Single Worker Training")
        logging.info("--------------------------------------------------------------------------------------\n\n")
        model = build_and_compile_cnn_model()
    
    # Start Training.
    model.fit(
        dataset,
        epochs=num_epoch,
        steps_per_epoch=70,
        callbacks=[CustomCallback()],
        verbose=0,
    )

# Set parameters with their distribution for HyperParameter Tuning with Katib.
parameters = {
    "lr": katib.search.double(min=0.1, max=0.2),
    "num_epoch": katib.search.int(min=10, max=15),
    "is_dist": False,
    "num_workers": 1
}

# Start the Katib Experiment.
katib_client = katib.KatibClient(namespace="kubeflow")
katib_client.tune(
    name="tune-mnist",
    objective=train_mnist_model, # Objective function.
    base_image="electronicwaste/tensorflow:git", # tensorflow/tensorflow:2.13.0 + git
    parameters=parameters, # HyperParameters to tune.
    algorithm_name="cmaes", # Algorithm to use.
    objective_metric_name="accuracy", # Katib is going to optimize "accuracy".
    additional_metric_names=["loss"], # Katib is going to collect these metrics in addition to the objective metric.
    max_trial_count=12, # Trial Threshold.
    parallel_trial_count=2,
    packages_to_install=["git+https://github.com/kubeflow/katib.git@master#subdirectory=sdk/python/v1beta1"],
    metrics_collector_config={"kind": "Push"},
)

The following error occurred:

Traceback (most recent call last):
  File "/tmp/tmp.fGitfCta5x/ephemeral_objective.py", line 97, in <module>
    train_mnist_model({'lr': '0.16377224201308005', 'num_epoch': '13', 'is_dist': False, 'num_workers': 1})
  File "/tmp/tmp.fGitfCta5x/ephemeral_objective.py", line 89, in train_mnist_model
    model.fit(
  File "/usr/local/lib/python3.8/dist-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/tmp/tmp.fGitfCta5x/ephemeral_objective.py", line 36, in on_epoch_end
    katib.report_metrics({
  File "/usr/local/lib/python3.8/dist-packages/kubeflow/katib/api/report_metrics.py", line 61, in report_metrics
    client = katib_api_pb2_grpc.DBManagerStub(channel)
  File "/usr/local/lib/python3.8/dist-packages/kubeflow/katib/katib_api_pb2_grpc.py", line 19, in __init__
    self.ReportObservationLog = channel.unary_unary(
TypeError: unary_unary() got an unexpected keyword argument '_registered_method'
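
The last frame shows the generated stub calling channel.unary_unary(...) with a `_registered_method` keyword that the installed channel does not accept. Below is a minimal sketch of that mechanism. `OldChannel`, `NewChannel`, `GeneratedStub`, and the method path are hypothetical stand-ins for illustration; they are not grpc code and do not reproduce its internals.

```python
# Stand-ins only: these classes are hypothetical and only mimic the shape of
# the failure, i.e. newer generated code calling into an older runtime whose
# method signature lacks the new keyword argument.

class OldChannel:
    """Mimics a runtime whose unary_unary() predates the new keyword."""

    def unary_unary(self, method, request_serializer=None, response_deserializer=None):
        return lambda request: None


class NewChannel:
    """Mimics a runtime whose unary_unary() accepts the new keyword."""

    def unary_unary(self, method, request_serializer=None,
                    response_deserializer=None, _registered_method=False):
        return lambda request: None


class GeneratedStub:
    """Mimics generated stub code that always passes the new keyword."""

    def __init__(self, channel):
        self.ReportObservationLog = channel.unary_unary(
            "/hypothetical.Service/ReportObservationLog",  # placeholder path
            _registered_method=True,
        )


def try_stub(channel):
    """Return 'ok' if the stub can be built, else the TypeError message."""
    try:
        GeneratedStub(channel)
        return "ok"
    except TypeError as err:
        return str(err)


print(try_stub(NewChannel()))
print(try_stub(OldChannel()))
```

If this is what is happening here, the stub shipped with the SDK and the grpcio runtime in the training image are out of step, and either side could be adjusted to match the other.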

What did you expect to happen?

Run without error.

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.1

Katib controller version:

$ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"
docker.io/kubeflowkatib/katib-controller:lates

Katib Python SDK version:

$ pip show kubeflow-katib
Name: kubeflow-katib
Version: 0.17.0
Summary: Katib Python SDK for APIVersion v1beta1
Home-page: https://github.com/kubeflow/katib/tree/master/sdk/python/v1beta1
Author: Kubeflow Authors
Author-email: [email protected]
License: Apache License Version 2.0
Location: /home/ws/miniconda3/envs/katib/lib/python3.10/site-packages
Requires: certifi, grpcio, kubernetes, protobuf, setuptools, six, urllib3
Required-by: 

Python Packages Version in the Training Container:

$ pip list
Package                      Version
---------------------------- --------------------
absl-py                      1.4.0
astunparse                   1.6.3
cachetools                   5.3.1
certifi                      2019.11.28
chardet                      3.0.4
dbus-python                  1.2.16
flatbuffers                  23.5.26
gast                         0.4.0
google-auth                  2.21.0
google-auth-oauthlib         1.0.0
google-pasta                 0.2.0
grpcio                       1.56.0
h5py                         3.9.0
idna                         2.8
importlib-metadata           6.7.0
keras                        2.13.1
kubeflow-katib               0.17.0
kubernetes                   30.1.0
libclang                     16.0.0
Markdown                     3.4.3
MarkupSafe                   2.1.3
numpy                        1.24.3
oauthlib                     3.2.2
opt-einsum                   3.3.0
packaging                    23.1
pip                          23.1.2
protobuf                     4.23.3
pyasn1                       0.5.0
pyasn1-modules               0.3.0
PyGObject                    3.36.0
python-apt                   2.0.1+ubuntu0.20.4.1
python-dateutil              2.9.0.post0
PyYAML                       6.0.2
requests                     2.22.0
requests-oauthlib            1.3.1
requests-unixsocket          0.2.0
rsa                          4.9
setuptools                   68.0.0
six                          1.14.0
tensorboard                  2.13.0
tensorboard-data-server      0.7.1
tensorflow-cpu               2.13.0
tensorflow-estimator         2.13.0
tensorflow-io-gcs-filesystem 0.32.0
termcolor                    2.3.0
typing_extensions            4.5.0
urllib3                      1.25.8
websocket-client             1.8.0
Werkzeug                     2.3.6
wheel                        0.40.0
wrapt                        1.15.0
zipp                         3.15.0

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

@Electronic-Waste
Member Author

Electronic-Waste commented Sep 10, 2024

FYR, I found a similar issue describing this error: open-telemetry/opentelemetry-python-contrib#2483

It may be related to the grpcio version.
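
One way to test that hypothesis inside a training container is to compare the installed grpcio version against a threshold before calling katib.report_metrics. The sketch below is only a starting point: the threshold `ASSUMED_MIN_GRPCIO` is an assumption taken from the two environments in this thread (1.56.0 fails, 1.64.1 works), not a confirmed minimum, and the helper names are made up for illustration.

```python
from importlib import metadata

# Assumption: inferred from this thread only (grpcio 1.56.0 fails, 1.64.1 works).
# The true minimum version may be lower and is not confirmed here.
ASSUMED_MIN_GRPCIO = (1, 64, 0)


def parse_version(text):
    """Turn '1.56.0' into (1, 56, 0); trailing suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def grpcio_is_new_enough(installed, minimum=ASSUMED_MIN_GRPCIO):
    """Compare a version string against the assumed minimum."""
    return parse_version(installed) >= minimum


def installed_grpcio_version():
    """Return the installed grpcio version string, or None if it is absent."""
    try:
        return metadata.version("grpcio")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_grpcio_version()
    if version is None:
        print("grpcio is not installed")
    else:
        print(version, "ok" if grpcio_is_new_enough(version) else "older than assumed minimum")
```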

@Electronic-Waste
Member Author

Electronic-Waste commented Sep 10, 2024

However, the script runs without error when I use tensorflow/tensorflow:2.17.0 as the base image in my Dockerfile to build a new training image:

FROM tensorflow/tensorflow:2.17.0

RUN apt-get -y update && \
    apt-get -y install git

But it fails when I use tensorflow/tensorflow:2.13.0, which is the base image we provide for users.

I think we should investigate this to ensure that Push MC works correctly. WDYT👀 @kubeflow/wg-automl-leads

@Electronic-Waste
Member Author

In tensorflow/tensorflow:2.17.0, the Python package versions are:

# pip list
Package                      Version
---------------------------- -------------
absl-py                      2.1.0
astunparse                   1.6.3
blinker                      1.4
cachetools                   5.5.0
certifi                      2024.7.4
charset-normalizer           3.3.2
cryptography                 3.4.8
dbus-python                  1.2.18
distro                       1.7.0
flatbuffers                  24.3.25
gast                         0.6.0
google-auth                  2.34.0
google-pasta                 0.2.0
grpcio                       1.64.1
h5py                         3.11.0
httplib2                     0.20.2
idna                         3.7
importlib-metadata           4.6.4
jeepney                      0.7.1
keras                        3.4.1
keyring                      23.5.0
kubeflow-katib               0.17.0
kubernetes                   30.1.0
launchpadlib                 1.10.16
lazr.restfulclient           0.14.4
lazr.uri                     1.0.6
libclang                     18.1.1
Markdown                     3.6
markdown-it-py               3.0.0
MarkupSafe                   2.1.5
mdurl                        0.1.2
ml-dtypes                    0.4.0
more-itertools               8.10.0
namex                        0.0.8
numpy                        1.26.4
oauthlib                     3.2.2
opt-einsum                   3.3.0
optree                       0.12.1
packaging                    24.1
pip                          24.1.2
protobuf                     4.25.3
pyasn1                       0.6.1
pyasn1_modules               0.4.1
Pygments                     2.18.0
PyGObject                    3.42.1
PyJWT                        2.3.0
pyparsing                    2.4.7
python-apt                   2.4.0+ubuntu3
python-dateutil              2.9.0.post0
PyYAML                       6.0.2
requests                     2.32.3
requests-oauthlib            2.0.0
rich                         13.7.1
rsa                          4.9
SecretStorage                3.3.1
setuptools                   70.3.0
six                          1.16.0
tensorboard                  2.17.0
tensorboard-data-server      0.7.2
tensorflow-cpu               2.17.0
tensorflow-io-gcs-filesystem 0.37.1
termcolor                    2.4.0
typing_extensions            4.12.2
urllib3                      2.2.2
wadllib                      1.3.6
websocket-client             1.8.0
Werkzeug                     3.0.3
wheel                        0.43.0
wrapt                        1.16.0
zipp                         1.0.0
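
To narrow down which packages differ between the two images, the two `pip list` outputs can be diffed programmatically. The sketch below uses a few rows copied from the listings in this thread; the helper names `parse_pip_list` and `changed_packages` are made up for illustration, and the choice of packages to compare is an assumption about what is relevant to the gRPC stub.

```python
def parse_pip_list(text):
    """Parse `pip list` output into {package: version}, skipping header rows."""
    versions = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        name, version = fields
        # Skip the "Package Version" header and the dashed separator row.
        if name == "Package" or set(name) == {"-"}:
            continue
        versions[name.lower()] = version
    return versions


def changed_packages(old, new, names):
    """Return {name: (old_version, new_version)} for names whose versions differ."""
    return {n: (old.get(n), new.get(n)) for n in names if old.get(n) != new.get(n)}


# Rows copied from the two listings in this thread.
TF_2_13 = """\
Package                      Version
---------------------------- --------------------
grpcio                       1.56.0
kubeflow-katib               0.17.0
protobuf                     4.23.3
"""

TF_2_17 = """\
Package                      Version
---------------------------- -------------
grpcio                       1.64.1
kubeflow-katib               0.17.0
protobuf                     4.25.3
"""

diff = changed_packages(
    parse_pip_list(TF_2_13),
    parse_pip_list(TF_2_17),
    ["grpcio", "protobuf", "kubeflow-katib"],
)
print(diff)
```

With these rows, kubeflow-katib is identical in both images, while grpcio and protobuf differ, which is consistent with the grpcio hypothesis above but does not by itself prove it.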

@Electronic-Waste
Member Author

/good-first-issue
/remove-label lifecycle/needs-triage


@Electronic-Waste:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue
/remove-label lifecycle/needs-triage

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@Electronic-Waste
Member Author

/remove-lifecycle stale

@Harshal292004

Hello, I am new to open-source contributions and would like some clarification regarding this issue. Could you please help me understand:

  1. What exactly does this issue require?
  2. How can I set up the local environment to reproduce and fix the error?

Thank you for your guidance!

@YTGhost

YTGhost commented Dec 25, 2024

Maybe I can take this on. Could you assign it to me?

@Electronic-Waste
Member Author

Reproducing this issue requires building katib-controller from source, since push-based metrics collection has not been released yet.

For details on push-based metrics collection, see https://www.kubeflow.org/docs/components/katib/user-guides/metrics-collector/#push-based-metrics-collector
