
[utils] Rewrite slurm.pl from scratch #4300

Closed
wants to merge 1 commit

Conversation


@kkm000 kkm000 commented Oct 17, 2020

The new version invokes sbatch, passing the batch file on stdin, and waits for the script to complete not by polling flag files, but by passing the --wait switch to sbatch (which still polls, but more efficiently, over an already established RPC channel with the active controller). In this mode, sbatch prints the new job ID immediately upon submission, and then later exits with return code 0 on success, or prints an error message and exits with a non-zero code.
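
For illustration, here is roughly what this submission pattern looks like from the shell; the trivial job script, output file name, and final echo are made up for this sketch and are not part of slurm.pl itself:

# Hypothetical stand-alone example: submit a tiny job on stdin and wait for it.
printf '#!/bin/bash\nhostname\n' |
  sbatch --wait --output=trivial.log
# sbatch prints "Submitted batch job <id>" right away, blocks until the job
# finishes, and (per the sbatch man page) mirrors the job's exit code.
echo "job exit code: $?"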

Slurm's sbatch command has a hardcoded polling schedule for waiting. Specifically, it first checks whether the job has completed after 2s, then increases the wait interval by 2s on every poll cycle until it maxes out at 10s (i.e., the intervals are 2, 4, 6, 8, 10, 10, ... seconds). Because of this, even very short Kaldi batches often incur an extra 5s wait delay on average. The following patch reduces the poll interval to a constant 1s:

https://github.com/burrmill/burrmill/blob/v0.5-beta.2/lib/build/slurm/sbatch.19.patch

It has been tested to apply cleanly up to Slurm v20.1, and is unlikely to break in the future, given that it is a 2-line change. In any case, please open an issue in the https://github.com/burrmill/burrmill repository if your Slurm source does not patch cleanly.

You do not need administrative access to the cluster; you can just build your own version of sbatch and place it on the PATH. Internally it uses only Slurm RPC calls and does not require many dependencies. The Slurm RPC library .so must already be available (use ldd $(which sbatch) to locate it); build against it with headers from the Slurm repository.
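
A rough outline of such a private build follows; the tag placeholder, install prefix, sysconfdir, and patch strip level are assumptions to adjust for your cluster, and you may need the usual build prerequisites (e.g. munge development headers):

# Sanity check: find the libslurm the stock sbatch links against.
ldd "$(which sbatch)" | grep -i libslurm

# Build a patched sbatch from the Slurm sources; pick a tag matching your
# cluster's version ('sinfo -V' reports it).
git clone https://github.com/SchedMD/slurm.git && cd slurm
git checkout <tag-matching-your-cluster-version>
patch -p1 < /path/to/sbatch.19.patch      # strip level may differ
./configure --prefix="$HOME/.local" --sysconfdir=/etc/slurm   # where slurm.conf lives
make && make install
export PATH="$HOME/.local/bin:$PATH"      # your sbatch now shadows the system one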

Whether to discuss this with the cluster IT admin is, of course, your decision. Your arguments are that (a) Kaldi jobs are sometimes extremely short, on the order of 10 seconds, while in the HPC world it's not uncommon to submit multinode MPI jobs that run for days, and (b) the old flag-file polling put a heavier load on the system at any poll rate: Slurm RPCs are incomparably more efficient than placing flag files on NFS or another shared FS.

@kkm000 kkm000 requested a review from jtrmal October 17, 2020 17:01

kkm000 commented Oct 17, 2020

@yenda, as you are going through the cluster setup, I suggest you give this one a try. I've used the following config file:

command sbatch --no-kill --hint=compute_bound --export=PATH,TIME_STYLE

option debug=0
option debug=1
option debug=2 -v

# This is not very helpful, as the Burrmill setup does not schedule memory.
option mem=0
option mem=* --mem-per-cpu=$0

option num_threads=1
option num_threads=* --cpus-per-task=$0

# For memory-gobbling tasks, like large ARPA models or HCLG composition.
option whole_nodes=* --exclusive --nodes=$0

option gpu=0
# Hack: --cpus-per-task should be 2*$0, but slurm.pl cannot do arithmetic.
# All our nodes have 1 GPU each anyway.
option gpu=1 --partition=gpu --cpus-per-task=2 --gres=cuda:p100:1

I defined a Gres resource named 'cuda', with subtypes 'p100', 'v100', 'T4', etc., in a single partition named, uninventively, 'gpu'. The CPU nodes were shaped with 10 vCPUs (i.e., 5 jobs at a time) and 12.5 GB of RAM -- pretty small, but enough unless used for composing a large HCLG. This is what --whole_nodes is for: give a whole node to such tasks. C2 machines have since been added, but they could not be custom-configured last time I checked; take the smallest that can handle the HCLG-style stuff, either 4 or 8 vCPUs. They do not offer much benefit on FST tasks, but are real monsters on matrix ops. etc/cluster should contain a good template that only requires commenting/uncommenting lines.
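
For reference, a config like the one above is typically plugged into a Kaldi recipe through cmd.sh; the paths and extra options below are only an illustrative guess, not part of this PR:

# cmd.sh -- hypothetical example of pointing a recipe at the new slurm.pl
export train_cmd="utils/slurm.pl --config conf/slurm.conf"
export decode_cmd="utils/slurm.pl --config conf/slurm.conf --mem 4G"
export cuda_cmd="utils/slurm.pl --config conf/slurm.conf --gpu 1"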


kkm000 commented Oct 17, 2020

@yenda, and I'm setting this to WIP until you run at least one test through Slurm without any smoke.

@kkm000 kkm000 marked this pull request as draft October 17, 2020 17:18
The new file calls sbatch, passing the batch file on stdin, and
waits for the script to complete not by polling, but by passing
the --wait switch to sbatch.

Slurm has a hardcoded polling schedule for waiting. Specifically,
it checks whether the job has completed after 2s, then increases
the wait interval by 2s until it maxes out at 10s. Because of
this, even very short Kaldi batches often incur an extra 5s wait
delay on average. The following patch reduces the poll interval
to a constant 1s:

https://github.com/burrmill/burrmill/blob/v0.5-beta.2/lib/build/slurm/sbatch.19.patch

It has been tested to apply cleanly up to Slurm v20.1, and is
unlikely to break in the future. Please open a ticket in the
https://github.com/burrmill/burrmill repository if your Slurm
source does not patch cleanly.

You do not need administrative access to the cluster; you can
just build your own version of sbatch and place it on the PATH.
Internally it uses only Slurm RPC calls, and does not require
many dependencies.

kkm000 commented Oct 30, 2020

@danpovey, @jtrmal, I'm going to push this into the main repo onto a new branch, 'kkm/new-slurm.pl'; otherwise it gets tricky to test. This way, only one repo needs to be accessed in the GCP build pipeline, and I can just cherry-pick this commit in the build script on top of whichever version of Kaldi is pinned as current in Burrmill.

It's far from best practice, but I solemnly swear on (Mohri, Pereira and Riley, 2008) that I won't forget to remove the branch once this change is merged.


kkm000 commented Oct 30, 2020

Superseded by #4314, same code, different source branch.

@kkm000 kkm000 closed this Oct 30, 2020