Repeated Running Scripts During Perplexity Task Execution on Windows #2581

Open · zhuyuhua-v opened this issue Dec 19, 2024 · 0 comments
Hi team,

I encountered an issue while testing the lambada_openai task with lm_eval on Windows. The script repeatedly re-executes the model loading step, eventually exhausting system memory and crashing the program. After some preliminary debugging, I traced the repeated model loading to the multiprocessing pool in the following section of the code:

def bootstrap_stderr(f, xs, iters):
    import multiprocessing as mp

    pool = mp.Pool(mp.cpu_count())
    # this gives a biased estimate of the stderr (i.e w/ the mean, it gives something
    # equivalent to stderr calculated without Bessel's correction in the stddev.
    # Unfortunately, I haven't been able to figure out what the right correction is
    # to make the bootstrap unbiased - i considered multiplying by sqrt(n/(n-1)) but
    # that would be ad-hoc and I can't prove that that would actually be an unbiased estimator)
    # Thankfully, shouldn't matter because our samples are pretty big usually anyways
    res = []
    chunk_size = min(1000, iters)

    from tqdm import tqdm

    print("bootstrapping for stddev:", f.__name__)
    for bootstrap in tqdm(
        pool.imap(
            _bootstrap_internal(f, chunk_size),
            [(i, xs) for i in range(iters // chunk_size)],
        ),
        total=iters // chunk_size,
    ):
        # sample w replacement
        res.extend(bootstrap)

    pool.close()
    return sample_stddev(res)
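
My current understanding of the cause (please correct me if I am wrong): on Windows, multiprocessing defaults to the "spawn" start method, so the workers created by mp.Pool re-import the __main__ module from scratch, and any module-level work in the calling script (such as loading the model) is re-executed once per worker unless it sits behind the standard main-module guard. A minimal standalone sketch of that pattern, unrelated to lm_eval (the square/main names are just for illustration):

import multiprocessing as mp

def square(x):
    return x * x

def main():
    # Spawned workers re-import this module; without the guard below,
    # each worker would re-run the module-level code as well.
    with mp.Pool(2) as pool:
        print(pool.map(square, range(4)))

if __name__ == "__main__":  # required on Windows so workers don't re-run main()
    main()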

Interestingly, if I modify this logic to run in a single process, the issue no longer occurs:

def bootstrap_stderr(f, xs, iters):
    res = []
    chunk_size = min(1000, iters)
    from tqdm import tqdm
    print("bootstrapping for stddev:", f.__name__)
    for i in tqdm(range(iters // chunk_size)):
        bootstrap = _bootstrap_internal(f, chunk_size)((i, xs))
        # sample w replacement
        res.extend(bootstrap)
    return sample_stddev(res)
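
(This sidesteps the problem because no worker processes are spawned, so the main module is never re-imported; the cost is that the bootstrap no longer runs in parallel.)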

Could you please assist in confirming and addressing this problem?

For easier reproduction, I have also created a self-contained unit test that simulates this behavior; when run on Windows, the script restarts repeatedly:

import math
import random

import lm_eval.api.metrics as metrics  # not referenced directly below
# @register_aggregation("mean")
def mean(arr):
    return sum(arr) / len(arr)

# Certain metrics must be calculated across all documents in a benchmark.
# We use them as aggregation metrics, paired with no-op passthrough metric fns.
# @register_aggregation("perplexity")
def perplexity(items):
    return math.exp(-mean(items))


def sample_stddev(arr):
    mu = mean(arr)
    return math.sqrt(sum([(x - mu) ** 2 for x in arr]) / (len(arr) - 1))

class _bootstrap_internal:
    def __init__(self, f, n) -> None:
        self.f = f
        self.n = n

    def __call__(self, v):
        i, xs = v
        rnd = random.Random()
        rnd.seed(i)
        res = []
        for _ in range(self.n):
            res.append(self.f(rnd.choices(xs, k=len(xs))))
        return res

def bootstrap_stderr(f, xs, iters):
    import multiprocessing as mp

    pool = mp.Pool(mp.cpu_count())
    # this gives a biased estimate of the stderr (i.e w/ the mean, it gives something
    # equivalent to stderr calculated without Bessel's correction in the stddev.
    # Unfortunately, I haven't been able to figure out what the right correction is
    # to make the bootstrap unbiased - i considered multiplying by sqrt(n/(n-1)) but
    # that would be ad-hoc and I can't prove that that would actually be an unbiased estimator)
    # Thankfully, shouldn't matter because our samples are pretty big usually anyways
    res = []
    chunk_size = min(1000, iters)

    from tqdm import tqdm

    print("bootstrapping for stddev:", f.__name__)
    for bootstrap in tqdm(
        pool.imap(
            _bootstrap_internal(f, chunk_size),
            [(i, xs) for i in range(iters // chunk_size)],
        ),
        total=iters // chunk_size,
    ):
        # sample w replacement
        res.extend(bootstrap)

    pool.close()
    return sample_stddev(res)


# Module-level call: on Windows, each spawned worker re-imports this module
# and re-executes these lines, which is what produces the repeated restarts.
iters = 100000
xs = [-2.2734971046447754, -0.8279670476913452]
bootstrap_stderr(perplexity, xs, iters=iters)
print("done")

Thank you for your assistance.
