[Data] Ray runtime_env works with ray core but doesn't work with ray.data when using working_dir #49356

Open
pradipneupane opened this issue Dec 19, 2024 · 2 comments
Labels
bug Something that is supposed to be working; but isn't data Ray Data-related issues triage Needs triage (eg: priority, bug/not-bug, and owning component)

pradipneupane commented Dec 19, 2024

What happened + What you expected to happen

The same code works with Ray Core when I set runtime_env to my working directory during ray.init(), but it doesn't work with ray.data.

Versions / Dependencies

2.36.1

Reproduction script

I am using the ray.data module. I have many Ray objects, and I'm creating a Ray dataset from them:
dataset = ray.data.from_pandas_refs(ray_objs)

After that, I'm using the dataset like this:

for bs in dataset.iter_batches(prefetch_batches=0, batch_size=1000, batch_format="pandas", drop_last=False):
    pass

While iterating, Ray Data tries to update the metrics for the dataset, and I keep getting this error:

ModuleNotFoundError: No module named 'MY_WORKING_DIR'
(Actor pid=xxx) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::_StatsActor.update_metrics()

At least one of the input arguments for this task could not be computed:
(Actor pid=xxx) ray.exceptions.RaySystemError: System error: No module named 'MY_WORKING_DIR'

I set working_dir in the Ray runtime_env during ray.init() like this:

ray.init(address=MY_CLUSTER_ADDRESS, ignore_reinit_error=True,
         runtime_env={"working_dir": "."})

With Ray Core I don't get any "No module named" error for my project folder, but with ray.data it fails and I keep getting the exception above.

Is there a workaround to disable only the ray.data metrics?
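
For readers trying to interpret the traceback: one plausible reading (an assumption, not confirmed anywhere in this thread) is that an argument sent to the actor was pickled by reference to a module that the receiving process cannot import. The sketch below reproduces that general failure mode with only the standard library. It does not use Ray, and the module name `my_project_mod` and the helper `reproduce_missing_module` are hypothetical.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path


def reproduce_missing_module(module_name: str = "my_project_mod") -> str:
    """Pickle an object from a module, hide the module, then try to unpickle it."""
    with tempfile.TemporaryDirectory() as tmp:
        # Create a throwaway module that stands in for code in the working dir.
        Path(tmp, f"{module_name}.py").write_text("class Record:\n    pass\n")
        sys.path.insert(0, tmp)
        try:
            module = importlib.import_module(module_name)
            # pickle stores the class by reference: module name + class name.
            payload = pickle.dumps(module.Record())
        finally:
            # Simulate a process that does not have the working dir available.
            sys.path.remove(tmp)
            sys.modules.pop(module_name, None)

    try:
        pickle.loads(payload)
    except ModuleNotFoundError as exc:
        return str(exc)
    return "loaded without error"


print(reproduce_missing_module())
```

If the same mechanism is at play here, the question becomes why the process running `_StatsActor` does not see the uploaded working_dir, rather than how to silence the metrics.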

Issue Severity

High: It blocks me from completing my task.

@pradipneupane pradipneupane added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Dec 19, 2024
@pradipneupane pradipneupane changed the title [<Ray component: Data>] [Data] Dec 19, 2024
@pradipneupane pradipneupane changed the title [Data] [Data] Ray runtime_env works with ray core but doesn't work with ray.data when using working_dir Dec 19, 2024
@jcotant1 jcotant1 added the data Ray Data-related issues label Dec 19, 2024
@Jay-ju
Contributor

Jay-ju commented Dec 23, 2024

I ran a sample and no problems occurred.

import ray
from typing import Callable


class Op12(Callable):
    # Callable class applied to every record by Dataset.map().
    def __call__(self, record) -> dict:
        return {"value": "op1"*1000000}


ray.init(ignore_reinit_error=True, runtime_env={
         "working_dir": './', "excludes": ["CC*", "part*", "PMC*"]})
ray.data.from_items(range(1, 10000)).map(Op12,  concurrency=10).take_all()

@pradipneupane
Author

@Jay-ju You are using a local Ray instance; you need to test this against a Ray cluster address.
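
A minimal sketch of what such a cluster test might look like, combining the `ray.init()` arguments and the `iter_batches()` loop from the original report. The helper names `cluster_init_kwargs` and `run_repro` are hypothetical, the DataFrame contents are made up for illustration, and there is no guarantee that this small dataset triggers the reported error.

```python
def cluster_init_kwargs(address: str, working_dir: str = ".") -> dict:
    """Build ray.init() keyword arguments mirroring the original report."""
    return {
        "address": address,
        "ignore_reinit_error": True,
        "runtime_env": {"working_dir": working_dir},
    }


def run_repro(address: str) -> int:
    """Connect to a cluster and iterate a small dataset, as in the report."""
    # Imported lazily so the helper above can be used without Ray installed.
    import pandas as pd
    import ray

    ray.init(**cluster_init_kwargs(address))

    # Illustrative data only: three small DataFrames placed in the object store.
    refs = [ray.put(pd.DataFrame({"value": range(10)})) for _ in range(3)]
    dataset = ray.data.from_pandas_refs(refs)

    rows = 0
    for batch in dataset.iter_batches(prefetch_batches=0, batch_size=1000,
                                      batch_format="pandas", drop_last=False):
        rows += len(batch)
    return rows


# Usage (requires a reachable cluster; the address is a placeholder):
# run_repro(MY_CLUSTER_ADDRESS)
```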

3 participants