
I need to process >50,000 files using a third-party .exe command-line application. The application takes only one input file at a time, so I have to launch the application >50,000 times.

Each file (each job) usually takes about one second. However, sometimes the application hangs indefinitely.

I have written a Windows shell script that runs all the jobs serially and checks every second whether the current job is done. After 10 seconds it kills the job and moves on to the next. However, the whole run takes about 20 hours. I believe I could cut the total runtime substantially by running multiple jobs in parallel. The question is how?

In CMD I launch each task with `start`, but there is no simple way to recover the process ID (PID), so I cannot easily keep track of how long each instance has been running. I feel like I am trying to reinvent the umbrella. Any suggestions?
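For reference, the serial timeout-and-kill loop described above can be written directly in PowerShell, where `Start-Process -PassThru` does expose the PID. A sketch only (`xml2csv.exe` and the folder names are placeholders from my setup):

```powershell
# Serial version with a 10-second timeout per job.
# xml2csv.exe and .\SP-xml are placeholders; adjust to your layout.
Get-ChildItem '.\SP-xml' -Filter *.xml | ForEach-Object {
    # -PassThru returns a System.Diagnostics.Process object,
    # so the PID ($p.Id) and runtime are trivially available
    $p = Start-Process -FilePath '.\xml2csv.exe' `
                       -ArgumentList $_.FullName `
                       -NoNewWindow -PassThru
    # WaitForExit takes milliseconds and returns $false on timeout
    if (-not $p.WaitForExit(10000)) {
        $p.Kill()    # hung: kill it and move on to the next file
    }
}
```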

Mattia

2 Answers


PowerShell did the trick, as indicated in quadruplebucky's answer. Here is the code I used. The second-to-last line (`./xml2csv ...`) is the job itself; the rest of the script can be reused for any similar task.

# PARAMETERS
$root = 'D:\Ratings'
$folder = 'SP'

# Import Invoke-Parallel
. ".\Invoke-Parallel.ps1"

# Run in parallel: one runspace per XML file, at most 10 at a time,
# killing any runspace that exceeds 10 seconds
Get-ChildItem ".\$folder-xml" -Filter *.xml |
Invoke-Parallel -Throttle 10 -RunspaceTimeout 10 -ImportVariables `
  -ScriptBlock {
    $file = $_.BaseName
    Write-Output $file    # progress indicator
    Set-Location $root
    (./xml2csv "$folder-xml\$file.xml" "$folder-csv\$file.csv" "fields-$folder.txt" -Q) | Out-Null
  }

Some notes:

  • The Invoke-Parallel function (it behaves like a cmdlet) can be downloaded from RamblingCookieMonster's PowerShell repository on GitHub.
  • A runspace is what I would have called an "instance"; -RunspaceTimeout sets the maximum running time for each one.
  • -Throttle sets the maximum number of instances running simultaneously.
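On PowerShell 7+, the throttling part can also be done with the built-in `ForEach-Object -Parallel`, with no extra download. Note, though, that it has no equivalent of `-RunspaceTimeout`, so the kill-after-10-seconds logic has to live inside the script block. A sketch, assuming the same placeholder names as above:

```powershell
# PowerShell 7+ alternative: built-in throttled parallelism.
# The per-job timeout is handled manually, since -Parallel has
# no equivalent of Invoke-Parallel's -RunspaceTimeout.
Get-ChildItem '.\SP-xml' -Filter *.xml |
  ForEach-Object -ThrottleLimit 10 -Parallel {
    $p = Start-Process -FilePath '.\xml2csv.exe' `
                       -ArgumentList $_.FullName `
                       -NoNewWindow -PassThru
    if (-not $p.WaitForExit(10000)) { $p.Kill() }  # kill hung instance
  }
```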
Mattia

PowerShell is your friend.

https://serverfault.com/questions/626711/how-do-i-run-my-powershell-scripts-in-parallel-without-using-jobs asks something similar.

"Quick" and "robust" are of course subjective.
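The linked thread largely comes down to runspace pools, which is also what Invoke-Parallel wraps under the hood. A bare-bones sketch of the idea (the script block is a placeholder for the real per-file job):

```powershell
# Minimal runspace-pool sketch: run up to 10 script blocks at once,
# without the per-job process overhead of Start-Job
$pool = [runspacefactory]::CreateRunspacePool(1, 10)
$pool.Open()

$jobs = foreach ($i in 1..50) {
    $ps = [powershell]::Create()
    $ps.RunspacePool = $pool
    # Placeholder work item; replace with the real per-file job
    [void]$ps.AddScript({ param($n) "job $n" }).AddArgument($i)
    @{ Shell = $ps; Handle = $ps.BeginInvoke() }   # start asynchronously
}

foreach ($j in $jobs) {
    $j.Shell.EndInvoke($j.Handle)   # block until done, collect output
    $j.Shell.Dispose()
}
$pool.Close()
```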