Repeated Running Scripts During Perplexity Task Execution on Windows #2581

Open · zhuyuhua-v opened this issue Dec 19, 2024 · 0 comments
Hi team,

I encountered an issue while testing the lambada_openai task with lm_eval on Windows. The script repeatedly re-executes the model loading step, eventually exhausting system memory and crashing the program. After some preliminary debugging, I traced the repeated model loading to the multiprocessing pool in the following section of the code:

def bootstrap_stderr(f, xs, iters):
    import multiprocessing as mp

    pool = mp.Pool(mp.cpu_count())
    # this gives a biased estimate of the stderr (i.e w/ the mean, it gives something
    # equivalent to stderr calculated without Bessel's correction in the stddev.
    # Unfortunately, I haven't been able to figure out what the right correction is
    # to make the bootstrap unbiased - i considered multiplying by sqrt(n/(n-1)) but
    # that would be ad-hoc and I can't prove that that would actually be an unbiased estimator)
    # Thankfully, shouldn't matter because our samples are pretty big usually anyways
    res = []
    chunk_size = min(1000, iters)

    from tqdm import tqdm

    print("bootstrapping for stddev:", f.__name__)
    for bootstrap in tqdm(
        pool.imap(
            _bootstrap_internal(f, chunk_size),
            [(i, xs) for i in range(iters // chunk_size)],
        ),
        total=iters // chunk_size,
    ):
        # sample w replacement
        res.extend(bootstrap)

    pool.close()
    return sample_stddev(res)
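
My current understanding of the cause (please correct me if I am wrong): on Windows, multiprocessing defaults to the "spawn" start method, so the workers created by mp.Pool re-import the __main__ module from scratch, and any module-level work in the calling script (such as loading the model) is re-executed once per worker unless it sits behind the standard main-module guard. A minimal standalone sketch of that pattern, unrelated to lm_eval (the square/main names are just for illustration):

import multiprocessing as mp

def square(x):
    return x * x

def main():
    # Spawned workers re-import this module; without the guard below,
    # each worker would re-run the module-level code as well.
    with mp.Pool(2) as pool:
        print(pool.map(square, range(4)))

if __name__ == "__main__":  # required on Windows so workers don't re-run main()
    main()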

Interestingly, if I modify this logic to run in a single process, the issue no longer occurs:

def bootstrap_stderr(f, xs, iters):
    res = []
    chunk_size = min(1000, iters)
    from tqdm import tqdm
    print("bootstrapping for stddev:", f.__name__)
    for i in tqdm(range(iters // chunk_size)):
        bootstrap = _bootstrap_internal(f, chunk_size)((i, xs))
        # sample w replacement
        res.extend(bootstrap)
    return sample_stddev(res)
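
(This sidesteps the problem because no worker processes are spawned, so the main module is never re-imported; the cost is that the bootstrap no longer runs in parallel.)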

Could you please assist in confirming and addressing this problem?

For easier reproduction, I have also created a self-contained unit test that simulates this behavior; when run on Windows, the script restarts repeatedly:

import math
import random

import lm_eval.api.metrics as metrics  # not referenced directly below
# @register_aggregation("mean")
def mean(arr):
    return sum(arr) / len(arr)

# Certain metrics must be calculated across all documents in a benchmark.
# We use them as aggregation metrics, paired with no-op passthrough metric fns.
# @register_aggregation("perplexity")
def perplexity(items):
    return math.exp(-mean(items))


def sample_stddev(arr):
    mu = mean(arr)
    return math.sqrt(sum([(x - mu) ** 2 for x in arr]) / (len(arr) - 1))

class _bootstrap_internal:
    def __init__(self, f, n) -> None:
        self.f = f
        self.n = n

    def __call__(self, v):
        i, xs = v
        rnd = random.Random()
        rnd.seed(i)
        res = []
        for _ in range(self.n):
            res.append(self.f(rnd.choices(xs, k=len(xs))))
        return res

def bootstrap_stderr(f, xs, iters):
    import multiprocessing as mp

    pool = mp.Pool(mp.cpu_count())
    # this gives a biased estimate of the stderr (i.e w/ the mean, it gives something
    # equivalent to stderr calculated without Bessel's correction in the stddev.
    # Unfortunately, I haven't been able to figure out what the right correction is
    # to make the bootstrap unbiased - i considered multiplying by sqrt(n/(n-1)) but
    # that would be ad-hoc and I can't prove that that would actually be an unbiased estimator)
    # Thankfully, shouldn't matter because our samples are pretty big usually anyways
    res = []
    chunk_size = min(1000, iters)

    from tqdm import tqdm

    print("bootstrapping for stddev:", f.__name__)
    for bootstrap in tqdm(
        pool.imap(
            _bootstrap_internal(f, chunk_size),
            [(i, xs) for i in range(iters // chunk_size)],
        ),
        total=iters // chunk_size,
    ):
        # sample w replacement
        res.extend(bootstrap)

    pool.close()
    return sample_stddev(res)


# Module-level call: on Windows, each spawned worker re-imports this module
# and re-executes these lines, which is what produces the repeated restarts.
iters = 100000
xs = [-2.2734971046447754, -0.8279670476913452]
bootstrap_stderr(perplexity, xs, iters=iters)
print("done")

Thank you for your assistance.
