
[utils] Rewrite slurm.pl from scratch #4300

Closed
wants to merge 1 commit

Conversation


@kkm000 kkm000 commented Oct 17, 2020

The new version invokes sbatch, passing the batch file on stdin, and waits for the script to complete not by polling flag files, but by passing the --wait switch to sbatch (which still polls, but more efficiently, over an already established RPC channel with the active controller). In this mode, sbatch prints the new job ID immediately upon submission, and then later exits with return code 0 on success, or prints an error message and exits with a non-zero code.
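
For illustration, here is roughly what this submission pattern looks like from the shell; the trivial job script, output file name, and final echo are made up for this sketch and are not part of slurm.pl itself:

# Hypothetical stand-alone example: submit a tiny job on stdin and wait for it.
printf '#!/bin/bash\nhostname\n' |
  sbatch --wait --output=trivial.log
# sbatch prints "Submitted batch job <id>" right away, blocks until the job
# finishes, and (per the sbatch man page) mirrors the job's exit code.
echo "job exit code: $?"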

Slurm's sbatch command has a hardcoded polling schedule for waiting. Specifically, it first checks whether the job has completed after 2s, then increases the wait interval by 2s on every poll cycle until it maxes out at 10s (i.e., the intervals are 2, 4, 6, 8, 10, 10, ... seconds). Because of this, even very short Kaldi batches often incur an extra 5s wait delay on average. The following patch reduces the poll interval to a constant 1s:

https://github.com/burrmill/burrmill/blob/v0.5-beta.2/lib/build/slurm/sbatch.19.patch

It has been tested to apply cleanly up to Slurm v20.1, and is unlikely to break in the future, given that it is a 2-line change. In any case, please open an issue in the https://github.com/burrmill/burrmill repository if your Slurm source does not patch cleanly.

You do not need administrative access to the cluster; you can just build your own version of sbatch and place it on the PATH. Internally it uses only Slurm RPC calls and does not require many dependencies. The Slurm RPC library .so must already be available (use ldd $(which sbatch) to locate it); build against it with headers from the Slurm repository.
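
A rough outline of such a private build follows; the tag placeholder, install prefix, sysconfdir, and patch strip level are assumptions to adjust for your cluster, and you may need the usual build prerequisites (e.g. munge development headers):

# Sanity check: find the libslurm the stock sbatch links against.
ldd "$(which sbatch)" | grep -i libslurm

# Build a patched sbatch from the Slurm sources; pick a tag matching your
# cluster's version ('sinfo -V' reports it).
git clone https://github.com/SchedMD/slurm.git && cd slurm
git checkout <tag-matching-your-cluster-version>
patch -p1 < /path/to/sbatch.19.patch      # strip level may differ
./configure --prefix="$HOME/.local" --sysconfdir=/etc/slurm   # where slurm.conf lives
make && make install
export PATH="$HOME/.local/bin:$PATH"      # your sbatch now shadows the system one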

Whether to discuss this with the cluster IT admin is, of course, your decision. Your arguments are that (a) Kaldi jobs are sometimes extremely short, on the order of 10 seconds, while in the HPC world it's not uncommon to submit multinode MPI jobs that run for days, and (b) the old flag-file polling put a heavier load on the system at any poll rate: Slurm RPCs are incomparably more efficient than placing flag files on NFS or another shared FS.

@kkm000 kkm000 requested a review from jtrmal October 17, 2020 17:01

kkm000 commented Oct 17, 2020

@yenda, as you are going through the cluster setup, I suggest you give this one a try. I've used the following config file:

command sbatch --no-kill --hint=compute_bound --export=PATH,TIME_STYLE

option debug=0
option debug=1
option debug=2 -v

# This is not very helpful, as the Burrmill setup does not schedule memory.
option mem=0
option mem=* --mem-per-cpu=$0

option num_threads=1
option num_threads=* --cpus-per-task=$0

# For memory-gobbling tasks, like large ARPA models or HCLG composition.
option whole_nodes=* --exclusive --nodes=$0

option gpu=0
# Hack: --cpus-per-task should be 2*$0, but slurm.pl cannot do arithmetic.
# All our nodes have 1 GPU each anyway.
option gpu=1 --partition=gpu --cpus-per-task=2 --gres=cuda:p100:1

I defined a Gres resource named 'cuda', with subtypes 'p100', 'v100', 'T4', etc., in a single partition named, uninventively, 'gpu'. The CPU nodes were shaped with 10 vCPUs (i.e., 5 jobs at a time) and 12.5 GB of RAM -- pretty small, but enough unless used for composing a large HCLG. This is what --whole_nodes is for: give a whole node to such tasks. C2 machines have since been added, but they could not be custom-configured last time I checked; take the smallest that can handle the HCLG-style stuff, either 4 or 8 vCPUs. They do not offer much benefit on FST tasks, but are real monsters on matrix ops. etc/cluster should contain a good template that only requires commenting/uncommenting lines.
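
For reference, a config like the one above is typically plugged into a Kaldi recipe through cmd.sh; the paths and extra options below are only an illustrative guess, not part of this PR:

# cmd.sh -- hypothetical example of pointing a recipe at the new slurm.pl
export train_cmd="utils/slurm.pl --config conf/slurm.conf"
export decode_cmd="utils/slurm.pl --config conf/slurm.conf --mem 4G"
export cuda_cmd="utils/slurm.pl --config conf/slurm.conf --gpu 1"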


kkm000 commented Oct 17, 2020

@yenda, and I'm setting this to WIP until you run at least one test through Slurm without any smoke.

@kkm000 kkm000 marked this pull request as draft October 17, 2020 17:18
The new file calls sbatch, passing the batch file on stdin, and
waits for the script to complete not by polling, but by passing
the --wait switch to sbatch.

Slurm has a hardcoded polling schedule for waiting. Specifically,
it checks whether the job has completed after 2s, then increases
the wait interval by 2s until it maxes out at 10s. Because of
this, even very short Kaldi batches often incur an extra 5s wait
delay on average. The following patch reduces the poll interval
to a constant 1s:

https://github.com/burrmill/burrmill/blob/v0.5-beta.2/lib/build/slurm/sbatch.19.patch

It has been tested to apply cleanly up to Slurm v20.1, and is
unlikely to break in the future. Please open a ticket in the
https://github.com/burrmill/burrmill repository if your Slurm
source does not patch cleanly.

You do not need administrative access to the cluster; you can
just build your own version of sbatch and place it on the PATH.
Internally it uses only Slurm RPC calls, and does not require
many dependencies.

kkm000 commented Oct 30, 2020

@danpovey, @jtrmal, I'm going to push this into the main repo onto a new branch, 'kkm/new-slurm.pl'; otherwise it gets tricky to test. This way, only one repo needs to be accessed in the GCP build pipeline, and I can just cherry-pick this commit in the build script on top of whichever version of Kaldi is pinned as current in Burrmill.

It's far from best practice, but I solemnly swear on (Mohri, Pereira and Riley, 2008) that I won't forget to remove the branch once this change is merged.


kkm000 commented Oct 30, 2020

Superseded by #4314, same code, different source branch.

@kkm000 kkm000 closed this Oct 30, 2020