Instructions for Event 5, the Logfile Labyrinth:
First let’s see how my entry http://scriptinggames.org/entrylist_.php?entryid=1069 performs:
- Using the 1MB large log files from the zip file in the instructions
- Using 4GB of files located on my laptop that doesn’t have an SSD 😦
Although it wasn’t a requirement, I chose to write a function that could parse many files, including large files of 25MB, 125MB and more, as quickly as possible without a negative impact on memory.
There are many ways to read file content and IIS log files. Here’s a non-exhaustive list I found on the web:
- Using a SQL DB
- Using the New-Object -ComObject MSUtil.LogQuery from LogParser
- Using the built-in Select-String cmdlet
- Using the built-in switch statement that has a FilePath parameter
- Using the built-in Get-Content cmdlet, which reads lines one by one by default, or all lines at once with -ReadCount 0, or in batches of a specific number of lines (-ReadCount 1000 for example)
- Using the .Net methods of the class [IO.File]
- Using the .Net methods associated with the [IO.StreamReader] class
- Using the [regex] object (my favorite)
All of these methods perform differently, and some may consume a lot of RAM.
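As an illustration of one item from the list, here is a minimal sketch of the switch statement reading a file line by line. The sample lines and the temporary file are made up for the example; they are not taken from the event's log files.

```powershell
# Hypothetical sample data written to a temporary file so the sketch is self-contained
$samplePath = [IO.Path]::GetTempFileName()
Set-Content -Path $samplePath -Value @(
    '#Fields: date time c-ip',
    '2013-05-01 00:00:01 10.0.0.1',
    '2013-05-01 00:00:02 10.0.0.2'
)

# switch -File reads the file one line at a time;
# -Regex matches each line against the patterns below
$dataLines = switch -Regex -File $samplePath {
    '^#'    { continue }   # skip directive lines starting with #
    default { $_ }         # emit every other line
}

Remove-Item -Path $samplePath
@($dataLines).Count
```

Only the two data lines should come out of the switch; the directive line is skipped.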
To select the method, I’ve tested the following:
Get-ChildItem -Path .\LogFiles\W3SVC1 | % {
    $file = $_
    New-Object -TypeName psobject -Property @{
        FileName       = $file.FullName
        'FileSize(MB)' = '{0:N2}' -f ($file.Length / 1MB)
        # Get-Content, one line at a time (default)
        GCMethod       = (Measure-Command {
            Get-Content $file.FullName | Out-Null
        }).ToString()
        # Get-Content, in batches of 1000 lines
        GC1000Method   = (Measure-Command {
            Get-Content $file.FullName -ReadCount 1000 | % { $_ } | Out-Null
        }).ToString()
        # StreamReader, one line at a time
        StreamMethod   = (Measure-Command {
            $reader = New-Object System.IO.StreamReader -ArgumentList $file.FullName
            # Compare against $null so that an empty line does not end the loop early
            while ($null -ne ($line = $reader.ReadLine())) { $line }
            $reader.Close()
        }).ToString()
    }
}
Here are a few points that you may notice in my entry.
- The regular expression in the ValidatePattern attribute allows IPv4 and IPv6 addresses. It means that you can type, for example:
127.0.*
192.168.*
*1
::1
2001:*
fe80*
I didn’t use the following code because it gave strange results:
try {
    [IPAddress]::Parse($_) | Out-Null
} catch {
    throw "An invalid IP address was specified."
}
As I use the -LIKE operator to filter the resulting array of unique IP Addresses, I needed to know whether the Pattern parameter of my function was used so that I can make sure it ends with a wildcard.
if ($PSBoundParameters.ContainsKey('Pattern')) {
    if (-not($Pattern.EndsWith("*"))) {
        $Pattern = "$Pattern*"
    }
    Write-Verbose "Using pattern $Pattern to filter the output"
}
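To make the effect of the trailing wildcard concrete, here is a small sketch. The helper name Add-TrailingWildcard and the sample addresses are hypothetical and only serve the illustration; they are not part of my entry.

```powershell
# Hypothetical helper that mirrors the "append a wildcard" logic
function Add-TrailingWildcard {
    param([string]$Pattern)
    if (-not $Pattern.EndsWith('*')) { "$Pattern*" } else { $Pattern }
}

# Made-up list of unique addresses
$addresses = '192.168.1.10', '192.168.2.20', '10.0.0.1', '::1'

# '192.168.' becomes '192.168.*' before it is handed to -like
$pattern = Add-TrailingWildcard -Pattern '192.168.'
$matched = $addresses | Where-Object { $_ -like $pattern }
$matched
```

Without the trailing wildcard, -like would only return an address that is exactly equal to the pattern.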
- I’ve used the following MSDN page to find out the different file names and understand the different formats of IIS log files
- I’ve also extracted the first four lines of a W3C log file to determine which column holds the “c-ip” field. I used the -TotalCount parameter of the built-in Get-Content for this purpose.
- You may notice that I also used the new -notin operator of PowerShell version 3.0 to avoid creating a huge array in memory when reading big files containing millions of lines.
- The total length of my progress bar represents the total size of the files, so the progress is proportional to the size of each file read.
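The c-ip column lookup mentioned in the list above can be sketched as follows. This is a simplified variant and not the exact code of my entry; the header and the log line are made-up samples.

```powershell
# Made-up W3C header line and data line
$header = '#Fields: date time s-ip cs-method cs-uri-stem c-ip sc-status'
$line   = '2013-05-01 00:00:01 10.0.0.5 GET /default.htm 192.168.1.10 200'

# Drop the "#Fields:" token, then split the remaining field names on whitespace
$fields = ($header -replace '^#Fields:\s*', '') -split '\s+'

# Zero-based position of c-ip among the field names (-1 if it is missing)
$index = [array]::IndexOf($fields, 'c-ip')

# Pick the same column from the data line
$clientIP = ($line -split '\s+')[$index]
$clientIP
```

Because the position is computed from the header, the lookup keeps working when a log file was written with a different set of fields.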
Here’s my full entry 😎
#Requires -Version 3
Function Get-IPFromIISLog {
    [CmdletBinding()]
    Param(
        [Parameter()]
        [ValidatePattern("^(\d|[a-f]|:|\*|\.|\%)*$")]
        [string]$Pattern="*",

        [Parameter()]
        [ValidateScript({ Test-Path -Path $_ -PathType Container })]
        [string]$FilePath = ".\"
    )
    Begin {
        if ($PSBoundParameters.ContainsKey('Pattern')) {
            if (-not($Pattern.EndsWith("*"))) {
                $Pattern = "$Pattern*"
            }
            Write-Verbose "Using pattern $Pattern to filter the output"
        }
        Function Get-LineStream {
            [CmdLetBinding()]
            Param(
                [int32]$Index,
                [string]$Separator,
                [String]$Path
            )
            Begin {
                $arIP = @()
            }
            Process {
                try {
                    $StreamReader = New-object System.IO.StreamReader -ArgumentList (Resolve-Path $Path -ErrorAction Stop).Path
                    Write-Verbose "Reading Stream of file $Path"
                    while ( $StreamReader.Peek() -gt -1 ) {
                        $Line = $StreamReader.ReadLine()
                        if ($Line.length -eq 0 -or $Line -match "^#") {
                            continue
                        }
                        $result = ($Line -split $Separator)[$Index]
                        if ($result -notin $arIP) {
                            $arIP += $result
                        }
                    }
                    $StreamReader.Close()
                    $arIP
                } catch {
                    Write-Warning -Message "Failed to read $Path because $($_.Exception.Message)"
                }
            }
            End {}
        } # end of function
    }
    Process {
        try {
            $allFiles = Get-ChildItem -Path $FilePath -Filter *.LOG -Recurse -ErrorAction Stop
        } catch {
            Write-Warning -Message "Failed to enumerate files under $FilePath because $($_.Exception.Message)"
            break
        }
        if ($allFiles) {
            $IPCollected = @()
            $Count = 1
            $FileSizeSum = 0
            $TotalSize = (($allFiles | ForEach-Object { $_.Length }) | Measure-Object -Sum).Sum
            $allFiles | ForEach-Object {
                $File = $_
                $FileSizeSum += $File.length
                $WPHT = @{
                    Activity = "Reading file $($File.Name) of size $('{0:N2}'-f ($File.Length/1MB))MB" ;
                    Status = '{0} over {1}' -f $Count,($allFiles).Count ;
                    PercentComplete = ($FileSizeSum/$TotalSize*100) ;
                }
                Write-Progress @WPHT
                $Count++
                # Based on the file name we know the IIS Format
                Switch -Regex ($File.Name) {
                    '^u_ex.*\.log' { $IISLogFormat = 'W3C' ; break}
                    '^ex.*\.log'   { $IISLogFormat = 'W3C' ; break}
                    '^in.*\.log'   { $IISLogFormat = 'IIS' ; break}
                    '^nc.*\.log'   { $IISLogFormat = 'NCSA' ; break}
                    default        { $IISLogFormat = 'Custom' ; break}
                }
                Switch ($IISLogFormat) {
                    'W3C' {
                        Write-Verbose "Reading W3C formatted file $($File.FullName)"
                        try {
                            $First4Lines = Get-Content -Path $($File.FullName) -TotalCount 4 -ErrorAction Stop
                        } catch {
                            Write-Warning "Failed to read the content of the file $($File.Name) because $($_.Exception.Message)"
                        }
                        if ($First4Lines) {
                            $i = -1
                            $Index = ($First4Lines[-1] -split "\s" | ForEach-Object {
                                [PSObject]@{ Index=$i ; FieldName = $_}
                                $i++
                            } | Where-Object { $_.FieldName -eq "c-ip"}).Index
                        }
                        if ($Index) {
                            [array]$IPCollected += (Get-LineStream -Path $File.FullName -Separator "\s" -Index $Index)
                        } else {
                            Write-Warning "Could not find the c-ip field in the W3C log file $($File.FullName)"
                        }
                    }
                    IIS {
                        Write-Verbose "Reading IIS formatted file $($File.FullName)"
                        [array]$IPCollected += (Get-LineStream -Path $File.FullName -Index 0 -Separator ",")
                    }
                    NCSA {
                        Write-Verbose "Reading NCSA formatted file $($File.FullName)"
                        [array]$IPCollected += (Get-LineStream -Path $File.FullName -Index 0 -Separator "\s")
                    }
                    default {
                        Write-Warning "Cannot parse a custom log file $($File.FullName)"
                    }
                }
                $IPCollected = ($IPCollected | Sort -Unique)
            }
            Write-Verbose ("A total of {0} unique IP were collected" -f $($IPCollected.Count))
            $IPCollected | Where-Object { $_ -like $Pattern }
        } else {
            Write-Warning "No file with .LOG extension found in this folder and subtree"
        }
    }
    End {}
}
Really nice shot. I gave you five stars for this script because I found it extremely interesting from a technical point of view, both for the way you access big files really fast and for the way you handle all the different types of IIS logs. One question I have in this regard: you say you prefer regex methods to read log files, yet you use IO.StreamReader instead. What’s the explanation for this choice?
I also liked the progress bar based on file size. Great!
If I can suggest something: since you are using v3, why not replace $_ with $PSItem and keep your code uniform?
Also, the task asked for no sorting whatsoever, so why sort $IPCollected instead of just selecting?
Carlo
Thank you for the 5 stars and your feedback 🙂
I’ll do my best to answer your questions.
I love regular expressions because they are powerful. Have a look at my post on coloring windowsupdate.log: https://p0w3rsh3ll.wordpress.com/2012/03/20/coloring-windowsupdate-log/. I usually use regex with Get-Content -ReadCount 1 (the default value, which sends lines one by one through the pipeline). I didn’t choose this option because I knew that it consumes too much memory on big files. The other valid methods for big files are Get-Content -ReadCount 1000 and IO.StreamReader. My quick tests showed that IO.StreamReader was faster.
For Get-Content -ReadCount 1000, you should have a look at http://rkeithhill.wordpress.com/2007/06/17/optimizing-performance-of-get-content-for-large-files/ and at the entry of mjolinor http://scriptinggames.org/entrylist_.php?entryid=1074
Yes, $_ can be replaced by $PSItem.
The PowerShell team says it’s an alias: http://blogs.msdn.com/b/powershell/archive/2012/06/14/new-v3-language-features.aspx
But Keith Hill says it is not an alias: http://rkeithhill.wordpress.com/2011/10/19/windows-powershell-version-3-simplified-syntax/
$IPCollected is an array that stores results from all the subfolders. OK, I make sure that whenever I read a single log file, I get a unique list of IP addresses (-notin). But an IP address can appear in many log files and in different subfolders. The requirements said we should get unique client addresses.
Wow, I hadn’t read your post yet. Excellent work, Emin!
In my opinion it’s the best entry of Event 5.
I’ve just given you your well-deserved 5 stars! 🙂
Keep up the good work! We learn a lot from your scripts!
Thanks for the compliments, the encouragement and the 5 stars 😀
Wow, lots of French speakers on this blog, I had no idea!
Good luck to you both with the last task!