Hi team,

I encountered an issue while testing the `lambada_openai` task with lm_eval on Windows. The script repeatedly re-executes the model loading step, eventually exhausting system memory and crashing the program. After some preliminary debugging, I found that the repeated model loading is triggered by the following section of lm-evaluation-harness/lm_eval/api/metrics.py (lines 459 to 485 at commit 8558b8d):
```python
# this gives a biased estimate of the stderr (i.e w/ the mean, it gives something
# equivalent to stderr calculated without Bessel's correction in the stddev.
# Unfortunately, I haven't been able to figure out what the right correction is
# to make the bootstrap unbiased - i considered multiplying by sqrt(n/(n-1)) but
# that would be ad-hoc and I can't prove that that would actually be an unbiased estimator)
# Thankfully, shouldn't matter because our samples are pretty big usually anyways
res = []
chunk_size = min(1000, iters)
from tqdm import tqdm

print("bootstrapping for stddev:", f.__name__)
for bootstrap in tqdm(
    pool.imap(
        _bootstrap_internal(f, chunk_size),
        [(i, xs) for i in range(iters // chunk_size)],
    ),
    total=iters // chunk_size,
):
    # sample w replacement
    res.extend(bootstrap)
pool.close()
return sample_stddev(res)
```
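For context on why this happens: on Windows, `multiprocessing` uses the `spawn` start method, so every worker starts a fresh interpreter and re-imports the main module. Any top-level code in the entry script (here, the model loading) therefore runs once per worker unless it sits under a main guard. A minimal standalone sketch of the guard (illustrative only, not harness code):

```python
import multiprocessing as mp

def work(x):
    return x * x

if __name__ == "__main__":
    # Only the parent process executes this block; spawned workers
    # re-import the module but never enter the guarded section.
    with mp.Pool(2) as pool:
        print(pool.map(work, range(4)))
```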
Interestingly, if I modify this logic to run in a single process, the issue does not recur.
```python
def bootstrap_stderr(f, xs, iters):
    res = []
    chunk_size = min(1000, iters)
    from tqdm import tqdm

    print("bootstrapping for stddev:", f.__name__)
    for i in tqdm(range(iters // chunk_size)):
        bootstrap = _bootstrap_internal(f, chunk_size)((i, xs))
        # sample w replacement
        res.extend(bootstrap)
    return sample_stddev(res)
```
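Since the serial loop reuses the same per-chunk seeding (`rnd.seed(i)`), it should produce the same bootstrap samples as the pooled version. If the parallel path is worth keeping on other platforms, one option would be to dispatch on the platform; a sketch of that idea (the wrapper and the two function names are my suggestion, not existing harness code):

```python
import sys

def bootstrap_stderr_dispatch(f, xs, iters):
    # Hypothetical wrapper: Windows has no fork(), so spawn-based pools
    # re-import the main module and re-run its top-level code. Take the
    # serial loop there and keep the pooled path elsewhere.
    if sys.platform == "win32":
        return bootstrap_stderr_serial(f, xs, iters)  # the serial loop above
    return bootstrap_stderr_pooled(f, xs, iters)      # the mp.Pool version
```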
Could you please assist in confirming and addressing this problem?
For easier reproduction, I have also created a standalone script that simulates this behavior on Windows, where the process restarts repeatedly:
```python
import random
import lm_eval.api.metrics as metrics
import math

# @register_aggregation("mean")
def mean(arr):
    return sum(arr) / len(arr)

# Certain metrics must be calculated across all documents in a benchmark.
# We use them as aggregation metrics, paired with no-op passthrough metric fns.
# @register_aggregation("perplexity")
def perplexity(items):
    return math.exp(-mean(items))

def sample_stddev(arr):
    mu = mean(arr)
    return math.sqrt(sum([(x - mu) ** 2 for x in arr]) / (len(arr) - 1))

class _bootstrap_internal:
    # A picklable callable (a class rather than a closure) so that
    # pool.imap can ship it to worker processes.
    def __init__(self, f, n) -> None:
        self.f = f
        self.n = n

    def __call__(self, v):
        i, xs = v
        rnd = random.Random()
        rnd.seed(i)
        res = []
        for _ in range(self.n):
            res.append(self.f(rnd.choices(xs, k=len(xs))))
        return res

def bootstrap_stderr(f, xs, iters):
    import multiprocessing as mp

    pool = mp.Pool(mp.cpu_count())
    # this gives a biased estimate of the stderr (i.e w/ the mean, it gives something
    # equivalent to stderr calculated without Bessel's correction in the stddev.
    # Unfortunately, I haven't been able to figure out what the right correction is
    # to make the bootstrap unbiased - i considered multiplying by sqrt(n/(n-1)) but
    # that would be ad-hoc and I can't prove that that would actually be an unbiased estimator)
    # Thankfully, shouldn't matter because our samples are pretty big usually anyways
    res = []
    chunk_size = min(1000, iters)
    from tqdm import tqdm

    print("bootstrapping for stddev:", f.__name__)
    for bootstrap in tqdm(
        pool.imap(
            _bootstrap_internal(f, chunk_size),
            [(i, xs) for i in range(iters // chunk_size)],
        ),
        total=iters // chunk_size,
    ):
        # sample w replacement
        res.extend(bootstrap)
    pool.close()
    return sample_stddev(res)

# NOTE: deliberately no `if __name__ == "__main__":` guard here -- on
# Windows each spawned worker re-imports this module, re-runs these
# lines, and creates its own pool, so the script restarts repeatedly.
iters = 100000
xs = [-2.2734971046447754, -0.8279670476913452]
bootstrap_stderr(perplexity, xs, iters=iters)
print("done")
```
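For comparison, moving the driver lines under a main guard stops the respawn on Windows (same definitions as above; only the entry point changes):

```python
if __name__ == "__main__":
    iters = 100000
    xs = [-2.2734971046447754, -0.8279670476913452]
    bootstrap_stderr(perplexity, xs, iters=iters)
    print("done")
```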
Thank you for your assistance.