Powershell and Parsing html code in Core

When working on a PowerShell webservice, I came across an interesting problem that I think is only going to crop up more and more.  This webservice takes a json payload, performs some simple manipulation on the data, kicks off some automation, and logs some events.  Part of the payload is html code, however.  Of course the json payload doesn’t care, but when it was coming into PowerShell it was being interpreted as a string.  That meant that when you looked at the string, you literally saw html code:

<BODY><H3>OPEN Problem 256 in environment <I>EPG</I></H3>
                               <HR>
                               <B>1 impacted infrastructure component</B>
                               <HR>
                               <BR>
                               <DIV><SPAN>Process</SPAN><BR><B><SPAN style="FONT-SIZE: 120%; COLOR: #dc172a">bosh-dns-health</SPAN></B><BR>
                               <P style="MARGIN-LEFT: 1em"><B><SPAN style="FONT-SIZE: 110%">Network problem</SPAN></B><BR>Packet retransmission rate for process bosh-dns-health on host 
                               cloud_controller/<GUID> has increased to 18 %</P></DIV>
                               <HR>
                               Root cause
                               <HR>
                               
                               <DIV><SPAN>Process</SPAN><BR><B><SPAN style="FONT-SIZE: 120%; COLOR: #dc172a">bosh-dns-health</SPAN></B><BR>
                               <P style="MARGIN-LEFT: 1em"><B><SPAN style="FONT-SIZE: 110%">Network problem</SPAN></B><BR>Packet retransmission rate for process bosh-dns-health on host 
                               cloud_controller/<GUID> has increased to 18 %</P></DIV>
                               <HR>
                               
                               <P><A href="https://redacted.somewhere.com/e/<GUID>/#problems/problemdetails;pid=-<GUID>">Open in Browser</A></P></BODY> 

I wanted to put this data in the events I was logging, but I obviously didn’t want it to look like this.  There isn’t a PowerShell cmdlet for ‘ConvertFrom-HTML’, although that would be great.  I could have tried to parse the text and remove the formatting, but that would be a pain trying to escape the right characters and account for all of the html tags.  I found several ways online to handle this – namely creating a ‘HTMLFile’ object, writing the html code into it, and then extracting out the innertext property of the object.  This works fine in some environments, but throws a weird error if you don’t have Office installed.  Here is the code:

$HTML = New-Object -Com "HTMLFile"
$html.IHTMLDocument2_write($htmlcode)

And the error it throws when Office isn’t installed:

Method invocation failed because [System.__ComObject] does not contain a method named 'IHTMLDocument2_write'.

There is also a catch here – that error will also be thrown if you are trying this in PowerShell Core/Microsoft PowerShell (not Windows PowerShell – i.e. anything under 5.x)!  Fortunately, there is a quick fix:

$HTML = New-Object -Com "HTMLFile"
try {
    $html.IHTMLDocument2_write($htmlcode)
}
catch {
    $encoded = [System.Text.Encoding]::Unicode.GetBytes($htmlcode)
    $html.write($encoded)
}
$text = ($html.all | Where-Object { $_.tagname -eq 'body' } | Select-Object -Property innerText).innertext

A simple try/catch, and it works great.  I am going to create a new repo in GitHub and make this a new function.  Hopefully you haven’t wasted too much time trying to track this down – it took me a while, and it wasn’t until I stumbled upon a similar solution on StackOverflow (https://stackoverflow.com/questions/46307976/unable-to-use-ihtmldocument2) that I was able to wrap it up.  Expect the cmdlet/function soon!

Leave a Reply