HTML Parser does not parse full content or only sees a very small portion of HTML

alabamian2 · August 11, 2021, 3:31pm

Hi everyone,
I am trying to parse a large set of HTML files that render fine when viewed in a browser. Here is the HTML file. The HTML file extension was not allowed for upload, so I changed it to txt; I hope it works.

AutomatedChecks_20210811_AmazoncomSpendlessSm.change2htmlExtension.txt (214.6 KB)

With HTML Parser, there is only this much content captured in the Document col.

<?xml version="1.0" encoding="UTF-8"?> D:\Users\myNameHere\Downloads\AutomatedChecks_20210811_AmazoncomSpendlessSm.html

When I open the HTML file in a text editor, I can see more content in the body tags. Is there a trick or setting or another node I should be using with this HTML file? I have tried different nodes and parsers and could not get any of them to see the content.

Thank you so much in advance for your help and guidance. I appreciate your time and support.

elsamuel · August 11, 2021, 7:49pm

How exactly are you accessing/handling the HTML file prior to involving the HTML Parser node?

Using the file you shared, I get this result with the HTML Parser:

Can you share your workflow?

alabamian2 · August 11, 2021, 8:08pm

Hi @elsamuel, thank you for taking a look at it. I just list the files, then Path to String, and then into HTML Parser. I tried different ways but am not seeing the content. hhhhmmmm…

ADA.knwf (16.2 KB)

I appreciate the help and time, @elsamuel. Thank you.

elsamuel · August 11, 2021, 8:23pm

Well, as far as I can tell, the problem is that you haven’t read in the HTML content at any point.

Are all of the HTML files that you want to process stored locally?

The HTML Parser node requires one of the following:

HTTP Result cells which you obtained with the “HTTP Retriever” node
Binary data cells
String cells which contain a local file: URL
String cells which contain the raw markup
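
To illustrate the difference between the last two options outside KNIME, here is a minimal Python sketch using the standard library's html.parser. It is only an analogy, not how the HTML Parser node is implemented. The TextCollector class, the visible_text helper, and the sample path and markup are all made up for the illustration. It shows that a parser handed a plain path string as if it were markup sees nothing but the path itself, which matches the symptom of the Document column holding only the file path.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: records every non-empty text node it is fed."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def visible_text(markup):
    """Parse a string as HTML and return the text nodes that were found."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return parser.chunks


# A path string is not markup: the parser only "sees" the path itself.
path_string = r"D:\Users\myNameHere\Downloads\report.html"  # hypothetical path
print(visible_text(path_string))

# Actual markup (what the file contains) yields the real content.
markup = "<html><body><h1>Report</h1><p>All checks passed.</p></body></html>"
print(visible_text(markup))
```

In other words, the string column has to hold either the file's contents or something the node recognises as a pointer to the file, not a bare path.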

You got close with using the Path to String > String to URI > HTML Parser approach, but you need to configure the HTML Parser node to use the URI column, not the Location column.
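
For comparison, this is roughly what the String to URI step does to a local path, sketched in Python with pathlib. The function name html_files_as_uris is hypothetical, and the sketch assumes the files sit directly in one folder; it is not KNIME code.

```python
from pathlib import Path, PureWindowsPath


def html_files_as_uris(folder):
    """List the .html files directly inside a folder as file: URIs."""
    # resolve() makes each path absolute, which as_uri() requires.
    return [p.resolve().as_uri() for p in sorted(Path(folder).glob("*.html"))]


# PureWindowsPath lets us convert a Windows-style path on any operating system.
windows_uri = PureWindowsPath(
    r"D:\Users\myNameHere\Downloads\AutomatedChecks_20210811_AmazoncomSpendlessSm.html"
).as_uri()
print(windows_uri)
# file:///D:/Users/myNameHere/Downloads/AutomatedChecks_20210811_AmazoncomSpendlessSm.html
```

The file: prefix is what distinguishes a location to be read from a string to be parsed as markup, which is why the column choice in the HTML Parser configuration matters.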

The Clean HTML Retriever node requires URL cells containing http or https URLs, and optionally, String cells containing HTML content. So it’s not a surprise that this didn’t work.

alabamian2 · August 11, 2021, 8:29pm

Ahhhhh, I see. I have a separate process that downloads the HTML files to a local folder. All the .html files are stored locally in a folder (or can be on some shared drive). So I have to do one of the above after List Files/Folders? Let me try.

elsamuel · August 11, 2021, 8:30pm

This part of the workflow should work if you configure the HTML Parser node to use the URI column, not the Location column.

alabamian2 · August 11, 2021, 8:33pm

AAAAAHHHHH, did that and it worked!!! Superb!! I should read up on the file handling documentation. Thank you, and now I need to parse this.
