Text mine the first N words of the text, discard the rest.


I would like to know whether it is possible to keep and process only the first N words of a set of PDFs that I have parsed. Each text contains many different dates, and I want to identify the date on which the text was written. This date usually appears near the beginning of the document, after the headings and titles, so within the first 100 or so words. Would this be possible?




The PDF Parser node does not provide this option. However, you could use the PDF Parser to parse the complete file, then extract the text with the Document Data Extractor and use the String Manipulation node to keep only the substring of the first X characters. Note that this cuts by characters rather than by words, so pick X large enough to cover roughly the first 100 words.
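If a word-based cut is preferred over a character-based one, the same idea can be sketched in a few lines of Python, for example in a scripting step applied to the extracted text. This is only a rough illustration, not a built-in KNIME feature: the function names `first_n_words` and `first_date` and the date pattern are assumptions made for the example, and the pattern only covers two formats ("12 March 2019" and "2019-03-12"), so it would need adapting to the dates in your documents.

```python
import re

# Hypothetical pattern: matches ISO dates ("2019-03-12") or
# day-month-year dates with an English month name ("12 March 2019").
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
DATE_RE = re.compile(
    r"\b(?:\d{4}-\d{2}-\d{2}|\d{1,2} (?:" + MONTHS + r") \d{4})\b"
)


def first_n_words(text: str, n: int = 100) -> str:
    """Return the first n whitespace-separated words, joined by single spaces."""
    return " ".join(text.split()[:n])


def first_date(text: str, n: int = 100):
    """Return the first date found within the first n words, or None."""
    match = DATE_RE.search(first_n_words(text, n))
    return match.group(0) if match else None


if __name__ == "__main__":
    sample = "Annual Report Issued 12 March 2019 by the board. Revised 2021-05-01."
    print(first_n_words(sample, 3))
    print(first_date(sample, 100))
```

Splitting on whitespace treats anything between spaces as a word, which is usually good enough for limiting the search window; dates later in the document are simply never seen by the pattern.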

Cheers, Kilian
