I am trying to parse a very large log file that consists of space delimited text across about 16 fields. Unfortunately the app logs a blank line in between each legitimate one (arbitrarily doubling the lines I must process). It also causes fields to shift because it uses space as both a delineator as well as for empty fields. I couldn't get around this in LogParser. Fortunately Powershell affords me the option to reference fields from the end as well making it easier to get later fields affected by shift.
After a bit of testing with smaller sample files, I've determined that processing line by line as the file is streaming with Get-Content natively is slower than just reading the file completely using Get-Content -ReadCount 0 and then processing from memory. This part is relatively fast (<1min).
The problem comes when processing each line, even though it's in memory. It is taking hours for a 75MB file with 561178 legitimate lines of data (minus all the blank lines).
I'm not doing much in the code itself. I'm doing the following:
- Splitting line via space as delimiter
 - One of the fields is an IP address that I am reverse DNS resolving, which is obviously going to be slow. So I have wrapped this into more code to create an in-memory arraylist cache of previously resolved IPs and pulling from it when possible. The IPs are largely the same so after a few hundred lines, resolution shouldn't be an issue any longer.
 - Saving the needed array elements into my pscustomobject
 - Adding pscustomobject to arraylist to be used later.
 - During the loop I'm tracking how many lines I've processed and outputting that info in a progress bar (I know this adds extra time but not sure how much). I really want to know progress.
 
All in all, it's processing some 30-40 lines per second, but obviously this is not very fast.
Can someone offer alternative methods/objectTypes to accomplish my goals and speed this up tremendously?
Below are some samples of the log with the field shift (Note this is a Windows DNS Debug log) as well as the code below that.
10/31/2022 12:38:45 PM 2D00 PACKET  000000B25A583FE0 UDP Snd 127.0.0.1      6c94 R Q [8385 A DR NXDOMAIN] AAAA   (4)pool(3)ntp(3)org(0)
10/31/2022 12:38:45 PM 2D00 PACKET  000000B25A582050 UDP Snd 127.0.0.1    3d9d R Q [8081   DR  NOERROR] A      (4)pool(3)ntp(3)org(0)
NOTE:  the issue in this case being [8385 A DR NXDOMAIN] (4 fields) vs [8081   DR  NOERROR] (3 fields)
Other examples would be the "R Q" where sometimes it's "  Q".
$Logfile = "C:\Temp\log.txt"
[System.Collections.ArrayList]$LogEntries = @()
[System.Collections.ArrayList]$DNSCache = @()
# Initialize log iteration counter
$i = 1
# Get Log data.  Read entire log into memory and save only lines that begin with a date (ignoring blank lines)
$LogData = Get-Content $Logfile -ReadCount 0 | % {$_ | ? {$_ -match "^\d+\/"}}
$LogDataTotalLines = $LogData.Length
# Process each log entry
$LogData | ForEach-Object {
        
        $PercentComplete =  [math]::Round(($i/$LogDataTotalLines * 100))
        Write-Progress -Activity "Processing log file . . ." -Status "Processed $i of $LogDataTotalLines entries ($PercentComplete%)" -PercentComplete $PercentComplete
        
        # Split line using space, including sequential spaces, as delimiter.  
        # NOTE:  Due to how app logs events, some fields may be blank leading split yielding different number of columns.  Fortunately the fields we desire
        #          are in static positions not affected by this, except for the last 2, which can be referenced backwards with -2 and -1.
        $temp = $_ -Split '\s+'
        
        # Resolve DNS name of IP address for later use and cache into arraylist to avoid DNS lookup for same IP as we loop through log
        If ($DNSCache.IP -notcontains $temp[8]) {
            $DNSEntry = [PSCustomObject]@{
                IP = $temp[8]
                DNSName = Resolve-DNSName $temp[8] -QuickTimeout -DNSOnly -ErrorAction SilentlyContinue | Select -ExpandProperty NameHost
            }
            
            # Add DNSEntry to DNSCache collection
            $DNSCache.Add($DNSEntry) | Out-Null
            
            # Set resolved DNS name to that which came back from Resolve-DNSName cmdlet. NOTE:  value could be blank.
            $ResolvedDNSName = $DNSEntry.DNSName
        } Else {
            # DNSCache contains resolved IP already.  Find and Use it.
            $ResolvedDNSName = ($DNSCache | ? {$_.IP -eq $temp[8]}).DNSName
        }
        $LogEntry = [PSCustomObject]@{
            Datetime = $temp[0] + " " + $temp[1] + " " + $temp[2]          # Combines first 3 fields Date, Time, AM/PM
            ClientIP = $temp[8]
            ClientDNSName = $ResolvedDNSName
            QueryType = $temp[-2]       # Second to last entry of array
            QueryName = ($temp[-1] -Replace "\(\d+\)",".") -Replace "^\.",""        # Last entry of array.  Replace any "(#)" characters with period and remove first period for friendly name
        }
        
        # Add LogEntry to LogEntries collection
        $LogEntries.Add($LogEntry) | Out-Null
       $i++
    }