Friday Fun Get Content Words

Recently I was tracking down a bug in script for a client. The problem turned out to be a simple typo. I could have easily avoided that by using Set-StrictMode, which I do now, but that’s not what this is about. What I realized I wanted was a way to look at all the for “words” in a script. If I could look at them sorted, then typos would jump out. At least in theory.

My plan was to get the content of a text file or script, use a regular expression pattern to identify all the “words” and then get a sorted and unique list. Here’s what I came up with.

Function Get-ContentWords {

[cmdletbinding()]

Param (
[Parameter(Position=0,Mandatory=$True,
HelpMessage="Enter the filename for your text file",
ValueFromPipeline=$True)]
[string]$Path
)

Begin {
    Set-StrictMode -Version 2.0
   
    Write-Verbose "Starting $($myinvocation.mycommand)"
   
    #define a regular expression pattern to detect "words"
    [regex]$word="\b\S+\b"
}

Process {

    if ($path.gettype().Name -eq "FileInfo") {
      #$Path is a file object
      Write-Verbose "Getting content from $($Path.Fullname)"
      $content=Get-Content -Path $path.Fullname
    }
    else {
        #$Path is a string
        Write-Verbose "Getting content from $path"
        $content=get-content -Path $Path
    }

    #add a little information
    $stats=$content | Measure-Object -Word
    Write-Verbose "Found approximately $($stats.words) words"

    #write sorted unique values
    $word.Matches($content) | select Value -unique | sort Value
 }
 
End {
    Write-Verbose "Ending $($myinvocation.mycommand)"
    }
   
} #close function

The function uses Get-Content to retrieve the content (what else?!) of the specified file. At the beginning of the function I defined a regular expression object to find “words”.

#define a regular expression pattern to detect "words"
[regex]$word="\b\S+\b"

This is an intentionally broad pattern that searches for anything not a space. The \b element indicates a word boundary. Because this is a REGEX object, I can do a bit more than using a basic -match operator. Instead I’ll use the Matches() method which will return a collection of match objects. I can pipe these to Select-Object retrieving just the Value property. I also use the -Unique parameter to filter out duplicates. Finally the values are sorted.

$word.Matches($content) | select Value -unique | sort Value

The matches and filtering are NOT case-sensitive, which is fine for me. With the list I can see where I might have used write-host instead of Write-Host and go back to clean up my code. Let me show you how this works. Here’s a demo script.

#Requires -version 2.0

$comp = Read-Host "Enter a computer name"

write-host "Querying services on $comp" -fore Cyan
$svc = get-service -comp $comp

$msg = "I found {0} services on $comp" -f $svc.count
Write-Host "Results" -fore Green
Write-Host $mgs -fore Green

The script has some case inconsistencies as well as a typo. I’ve dot sourced the function in my PowerShell session. Here’s what I end up with.

For best results, you need to make sure there are spaces around commands that use the = sign. But now I can scan through the list and pick out potential problems. Sure, Set-StrictMode would help with variable typos but if I had errors in say comment based help, that wouldn’t help. Maybe you’ll find this useful in your scripting work, maybe not. But I hope you learned a few things about working with REGEX objects and unique properties.

Download Get-ContentWords and enjoy.

Post to Twitter Post to Plurk Post to Yahoo Buzz Post to Delicious Post to Digg Post to Facebook Post to FriendFeed Post to Google Buzz Post to Ping.fm Post to Reddit Post to Slashdot Post to StumbleUpon Post to Technorati

This entry was posted in Friday Fun, PowerShell v2.0, Scripting and tagged , , , , , . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>