Friday, October 21, 2011

Bash scripting: parallel process control with 'bash' and 'xargs'


I have been trying to run bash commands in parallel on Windows, using the bash provided by MSYS, with some control over the number of processes spawned. With this setup I do not have access to the GNU parallel command.

For example, we can always specify the number of processes for compilation to the make command using:

$ make -j4

This uses at most 4 parallel processes. After much trial and error, I finally figured out how to run multiple arbitrary commands with the same kind of control.

Let us assume we have a command file with one command per line. For example, I am building different machine learning models with WEKA to predict outcomes on various datasets, in parallel on a multi-core machine. So I have a text file, cmd.txt, generated by a script, that contains lines like:


$JBIN -Xmx3g weka.classifiers.trees.J48 -C 0.25 -M 2 -A -i -t result-1-1/filter6-weka-train.arff -T result-1-1/filter6-weka-test.arff -p 0 -d result-1-1/filter6-J48.model > result-1-1/filter6-J48-report.txt
$JBIN -Xmx3g weka.classifiers.functions.LibSVM -S 0 -K 2 -D 3 -G 0.0 -R 0.0 -N 0.5 -M 1000.0 -C 1000000.0 -E 0.0010 -P 0.1 -Z -o -i -t result-1-1/filter6-weka-train.arff -T result-1-1/filter6-weka-test.arff -p 0 -d result-1-1/filter6-SVM.model > result-1-1/filter6-SVM-report.txt
$JBIN -Xmx3g weka.classifiers.functions.LibSVM -S 0 -K 2 -D 3 -G 0.0 -R 0.0 -N 0.5 -M 1000.0 -C 1000000.0 -E 0.0010 -P 0.1 -Z -W '1 2' -o -i -t result-1-1/filter6-weka-train.arff -T result-1-1/filter6-weka-test.arff -p 0 -d result-1-1/filter6-SVM-w-1-2.model > result-1-1/filter6-SVM-w-1-2-report.txt
$JBIN -Xmx3g weka.classifiers.functions.LibSVM -S 0 -K 2 -D 3 -G 0.0 -R 0.0 -N 0.5 -M 1000.0 -C 1000000.0 -E 0.0010 -P 0.1 -Z -W '2 1' -o -i -t result-1-1/filter6-weka-train.arff -T result-1-1/filter6-weka-test.arff -p 0 -d result-1-1/filter6-SVM-w-2-1.model > result-1-1/filter6-SVM-w-2-1-report.txt
$JBIN -Xmx3g weka.classifiers.trees.J48 -C 0.25 -M 2 -A -i -t result-1-1/filter9-weka-train.arff -T result-1-1/filter9-weka-test.arff -p 0 -d result-1-1/filter9-J48.model > result-1-1/filter9-J48-report.txt
$JBIN -Xmx3g weka.classifiers.functions.LibSVM -S 0 -K 2 -D 3 -G 0.0 -R 0.0 -N 0.5 -M 1000.0 -C 1000000.0 -E 0.0010 -P 0.1 -Z -o -i -t result-1-1/filter9-weka-train.arff -T result-1-1/filter9-weka-test.arff -p 0 -d result-1-1/filter9-SVM.model > result-1-1/filter9-SVM-report.txt
$JBIN -Xmx3g weka.classifiers.functions.LibSVM -S 0 -K 2 -D 3 -G 0.0 -R 0.0 -N 0.5 -M 1000.0 -C 1000000.0 -E 0.0010 -P 0.1 -Z -W '1 2' -o -i -t result-1-1/filter9-weka-train.arff -T result-1-1/filter9-weka-test.arff -p 0 -d result-1-1/filter9-SVM-w-1-2.model > result-1-1/filter9-SVM-w-1-2-report.txt
$JBIN -Xmx3g weka.classifiers.functions.LibSVM -S 0 -K 2 -D 3 -G 0.0 -R 0.0 -N 0.5 -M 1000.0 -C 1000000.0 -E 0.0010 -P 0.1 -Z -W '2 1' -o -i -t result-1-1/filter9-weka-train.arff -T result-1-1/filter9-weka-test.arff -p 0 -d result-1-1/filter9-SVM-w-2-1.model > result-1-1/filter9-SVM-w-2-1-report.txt


where $JBIN is an environment variable that points to the java executable. Now, to run these in parallel with a limit on the number of processes, use the xargs command to split the input into lines as follows:

$ cat cmd.txt | xargs -0 -d '\n' -L 1 -I {} -P 3 bash -c "eval \"{}\""

The options used are:

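  1. -0 to treat quotes and backslashes in the input literally and expect arguments to be terminated by \0 characters
  2. -d '\n' to set newline as the delimiter between arguments, overriding \0 from the previous point; -d also treats quotes and backslashes literally, so -0 is redundant here
  3. -L 1 to read one line at a time; -I already implies this, and newer versions of GNU xargs may warn that -L and -I are mutually exclusive
  4. -I {} to set a pair of braces as the replacement string for the argument read, in this case an entire line
  5. -P 3 to limit to a maximum of 3 processes at a time
  6. bash -c "eval \"{}\"" to execute the substituted command within a new bash, which also expands $JBIN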
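
To see why the delimiter matters for lines that carry quoted arguments such as -W '1 2', here is a minimal sketch assuming GNU xargs. The sample line is a made-up fragment, not one of the real WEKA commands. It compares the default parsing with -d '\n':

```shell
#!/usr/bin/env bash
# Hypothetical sample line; the quoted '1 2' is meant to stay together.
line="-W '1 2' -o"

# Default parsing: xargs removes the quotes and splits on blanks,
# so printf receives three separate arguments and prints three lines.
default_count=$(printf '%s\n' "$line" | xargs printf '[%s]\n' | wc -l)

# With -d '\n': the whole line, quotes included, is passed as one argument.
literal_arg=$(printf '%s\n' "$line" | xargs -d '\n' printf '[%s]\n')

echo "default parsing: $default_count arguments"
echo "newline delimiter: $literal_arg"
```

Because the line reaches bash with its quotes intact, bash can parse it the same way it would at the prompt.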
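
And that is it. It works as long as each command is on a single line. I have yet to test it on commands spanning multiple lines. Note that each line is substituted inside a double-quoted string, so commands that themselves contain double quotes may not behave as intended.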
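
For anyone who wants to try the idea without WEKA, here is a small self-contained sketch. It is a variant, not the exact command above: it drops -I {} and eval and lets xargs pass each line directly as the script for bash -c, using -n 1 for one line per invocation. It assumes GNU xargs (for -d and -P); the temporary directory, file names and echo commands are placeholders.

```shell
#!/usr/bin/env bash
# Placeholder working directory and command file for the demo.
workdir=$(mktemp -d)
cmdfile="$workdir/cmd.txt"

# One command per line; quotes and redirections are written as usual.
# $workdir is expanded here, when the command file is generated.
cat > "$cmdfile" <<EOF
echo 'job 1' > "$workdir/out1.txt"
echo 'job 2' > "$workdir/out2.txt"
echo 'job 3' > "$workdir/out3.txt"
echo 'job 4' > "$workdir/out4.txt"
EOF

# Run at most 2 commands at a time. Each input line becomes the single
# argument of one "bash -c" invocation, so bash parses it like a script.
xargs -d '\n' -n 1 -P 2 bash -c < "$cmdfile"

# xargs returns only after every command has finished.
result=$(cat "$workdir"/out?.txt | sort)
echo "$result"

rm -rf "$workdir"
```

Since the line is never pasted into another quoted string, double quotes inside a command are not a problem in this form. Variables such as $JBIN still need to be exported so that the child bash can expand them.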