
I am new to the Windows ecosystem. I have been tasked with writing a program that will search tens (maybe even hundreds) of thousands of files for a particular string. The string to match is a serial number consisting only of numbers and letters, and it is less than 20 characters long. Right now, my program executes the following command:

findstr /i /m /s "searchStr" "C:\Directory\To\Search\*.*"

The above command works, but it is too slow. Any file that contains a particular serial number will have it only in its first line.

Does anybody know of an efficient way to recursively search a directory for all files that contain a particular string only in the first line?

tpdietz

2 Answers

2

In PowerShell (v3.0+), maybe...

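# List the .log files whose first line contains the serial number.
Get-ChildItem -Path x:\pathto\*.log |
ForEach-Object {
    # -First 1 (alias of -TotalCount) stops reading after the first line.
    if (Get-Content -LiteralPath $_.FullName -First 1 |
        Select-String -SimpleMatch -Pattern 'serialnumber')
    {
        Write-Output $_
    }
}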

Other parameters extend this: Get-ChildItem -Recurse descends into subfolders; Get-Content -TotalCount (alias -First) controls how many lines are read from each file; and Select-String can do more complex matching, such as regular expressions (omit -SimpleMatch) or case-sensitive matches (-CaseSensitive).

2

I can suggest a few options if you don't need to use findstr, but first of all you should see whether you can restrict the search to files of a certain filetype, as that is sure to speed things up.

  1. FileLocator Lite is in my experience faster at finding files and checking their contents. Be sure to fill in both the "filename" (if applicable) and "contained text" fields, as well as the starting directory.

  2. ag -il "searchStr": ag (The Silver Searcher) is built for speed, so it should return results quickly. Restrict the search by filetype if you can, although binary files are already skipped by default. It is also available under Cygwin.

  3. find -exec awk 'BEGIN {IGNORECASE=1} NR==1 && /searchStr/ {print FILENAME": "$0}' {} \; Try this if you have Cygwin or another POSIX-like environment available, to test your idea of searching only the first line. find supplies the filenames (and ideally filters them), while awk checks the first line and prints it together with the filename. Note that IGNORECASE is specific to GNU awk (gawk).

  4. find | parallel 'perl -lane '\'' print "$ARGV: $_" if $. == 1 and /searchStr/i '\'' {}' Another way to speed things up is to put the available cores and threads to work, which is what GNU parallel is for. This example uses perl, but it does the same as the awk command in 3. above. Here's a command breakdown:

    find look for files in the current directory and its subdirectories. You can specify a different directory to look in and a file pattern or extension to filter on: find /cygdrive/c/Directory/To/Search -iname "*.txt".

    | "pipe", i.e. feed the list of results to the next command.

    parallel execute the next command in parallel.

    perl scripting language that excels at text file manipulation, can replace sed or awk.

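    -lane useful set of switches for perl one-liners: -l handles line endings, -a splits each line into fields (not needed here), -n loops over the input lines, -e runs the code that follows.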

    '\'' escaped apostrophe, needed since we already opened an apostrophe set after parallel.

    print "$ARGV: $_" print the filename ($ARGV), a colon, a space and the full line ($_).

    if only execute the previous instruction if the following condition(s) are met.

    $. == 1 line number ($.) is equal to one (1), i.e. we are looking at the first line of the file.

    and the following condition must be met, too.

    /searchStr/i the line being examined contains the text searchStr, case-insensitively.

    '\'' another escaped apostrophe marks the end of the perl instruction.

    {} this is going to be substituted by parallel with each of the filenames passed on by find.

    ' end of the parallel instruction.
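If GNU parallel is not installed, the same idea can be approximated with xargs -P, which is swapped in here for parallel and is offered only as a rough sketch. It assumes a find and xargs that support -print0, -0 and -P (GNU and BSD versions do). The function name parallel_first_line_search, its arguments, the batch size of 200 and the default of 4 workers are all hypothetical choices. It uses head -n 1 instead of perl so that only the first line of each file is inspected.

```shell
# Hypothetical helper: list files under DIR whose first line contains NEEDLE
# (literal text, case-insensitive), using several worker processes.
parallel_first_line_search() {
    dir=$1
    needle=$2
    workers=${3:-4}   # number of parallel workers; 4 is an arbitrary default
    # -print0 / -0 keep file names with spaces intact; -n 200 hands each
    # worker a batch of up to 200 files.
    find "$dir" -type f -print0 |
    xargs -0 -n 200 -P "$workers" sh -c '
        needle=$1
        shift
        for file do
            # head -n 1 stops after the first line; grep -F matches literally.
            if head -n 1 -- "$file" | grep -qiF -- "$needle"; then
                printf "%s\n" "$file"
            fi
        done
    ' sh "$needle"
}
```

A call such as parallel_first_line_search /cygdrive/c/Directory/To/Search searchStr 8 would print one matching path per line. The order of the output is not deterministic because the workers run concurrently.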

Update: Both awk and perl keep reading the whole file even when the actions apply only to the first line. The solution is to stop processing explicitly at line 2:

find -exec awk 'BEGIN {IGNORECASE=1} NR > 1 {exit} /searchStr/ {print FILENAME": "$0}' {} \;

find | parallel 'perl -lane '\'' exit if $. == 2; print "$ARGV: $_" if /searchStr/i '\'' {}'
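A possible variation on the awk approach, sketched here under a few assumptions: passing many files to each awk process with "{} +" avoids starting one process per file, and tolower() with index() gives a case-insensitive literal match without the gawk-only IGNORECASE. The function name first_line_search and its arguments are hypothetical. The nextfile statement is assumed to be supported by the installed awk (gawk and recent mawk versions support it); where it is not, the result should be the same but the whole file is still read.

```shell
# Hypothetical helper: list files under DIR whose first line contains NEEDLE
# (literal text, case-insensitive).
first_line_search() {
    dir=$1
    needle=$2
    # "{} +" hands many file names to each awk process instead of one each.
    find "$dir" -type f -exec awk -v needle="$needle" '
        BEGIN { needle = tolower(needle) }
        FNR == 1 {
            # index() is a plain substring test, so no regex escaping is needed.
            if (index(tolower($0), needle) > 0) print FILENAME
            # Skip the rest of this file (assumes awk supports nextfile).
            nextfile
        }
    ' {} +
}
```

For example, first_line_search /cygdrive/c/Directory/To/Search searchStr prints the path of every file whose first line contains the string. Because the needle is passed with -v, awk interprets backslash escapes in it, which should not matter for a purely alphanumeric serial number.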

simlev