“Execute failed: (“StackOverflowError”): null” on the HTML parser

stevelp · May 12, 2020, 3:19pm

Thanks for that explanation @qqilihq! That makes sense about what kinds of situations I should be more careful in.

I’m getting a new error, “Execute failed: (“StackOverflowError”): null”, on the HTML parser. Do you have any idea how I can fix that, or where I can find out what the different error codes mean? Also, I’m not sure about thread etiquette. Perhaps I should start a new thread for this, as it’s fairly unrelated to my original question.

I’ve isolated the issue to 4 URLs in my current list that are causing the problem. They are all PDF documents, but none of them have “pdf” in the URL. They do all have “View” or “Preview” in the URL, so I could filter by that, but I worry that would also exclude valid pages. Do you know of a more elegant way to exclude these kinds of results in the future, before they reach the HTML parser?

http://www.pilotpointlibrary.org/DocumentCenter/View/2281/2017-2018-CAFR
https://neptunebeachfl.civicclerk.com/web/UserControls/DocPreview.aspx?p=1&aoid=33
http://www.garlandtx.gov/DocumentCenter/View/5526/0724-Fiirefighter-Recruit?bidId=
http://bonnieandclydedays.org/AgendaCenter/ViewFile/Agenda/_04082019-764

ipazin · May 12, 2020, 4:06pm

Moved to a new topic, for the reason you stated yourself.
Br,
Ivan

qqilihq · May 12, 2020, 5:18pm

Hi stevelp,

I suggest the following combination for that:

1. Define a maximum download size limit in the HTTP Retriever. The download stops once the given file size has been reached. Set it to e.g. 0.5 MB (or even less, depending on your dataset).
2. Use the Content-Type HTTP header to remove non-HTML files. You can extract it with the HTTP Result Data Extractor node.
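
For anyone doing the same thing outside of KNIME, the two steps above can be sketched in plain Python. This is only a rough illustration using the standard library, not what the HTTP Retriever or HTTP Result Data Extractor nodes do internally. The function names, the set of accepted media types, and the 10-second timeout are assumptions made for the example.

```python
import io
import urllib.request

MAX_BYTES = 512 * 1024  # 0.5 MB, the limit suggested above
# Assumption: only these media types are treated as HTML.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def is_html(content_type):
    """Return True if a Content-Type header value denotes an HTML document."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in HTML_TYPES


def read_limited(stream, max_bytes=MAX_BYTES):
    """Read at most max_bytes from a file-like object, then stop."""
    return stream.read(max_bytes)


def fetch_html(url, timeout=10, max_bytes=MAX_BYTES):
    """Return the (possibly truncated) body, or None for non-HTML responses."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        if not is_html(response.headers.get("Content-Type")):
            return None  # e.g. application/pdf: skip it before parsing
        return read_limited(response, max_bytes)


if __name__ == "__main__":
    # Offline demo of the two helpers; no network access needed.
    print(is_html("application/pdf"))
    print(read_limited(io.BytesIO(b"<html>...</html>"), max_bytes=6))
```

The header check happens before any of the body is read, so a PDF served from a URL without “pdf” in it is skipped on its declared type rather than on URL patterns such as “View” or “Preview”. This only works if the server reports the type correctly.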

I have prepared an example for you which you can find on my personal NodePit Space:

https://nodepit.com/workflow/com.nodepit.space%2Fqqilihq%2Fpublic%2Fforum%2Fexecute-failed-stackoverflowerror-null-on-the-html-parser-23567.knwf

Hope this helps!
Philipp

stevelp · May 12, 2020, 7:37pm

Awesome, thanks Philipp! I also reduced the socket timeout from its default of 60 seconds to 10 seconds. I tried it on a few different pages, and it seemed to work alright at that limit.
