Strings To Document error

Hi,

I’m trying to use the Strings To Document node on a CSV file and I get an error:
ERROR Strings To Document 0:24 Execute failed: String index out of range: 7

Bio.txt (326.6 KB)

The text file here contains a sample of the data.
I use these settings for the Strings To Document node:
Title: contributions_id
Text: contributions_bodyText
Sources: contributions_url
Categories: contributions_section_title
Author: contributions_author_id

Anyone know why I’m getting this error?
Thx

The text file looks a bit challenging: long lines with various special characters, text containing commas and quotation marks, and then a comma as the column separator … what could possibly go wrong … but as usual, readr saves the day (hopefully) :sunglasses:
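To illustrate why a file like this is tricky, here is a minimal Python sketch (not the readr workflow itself, and not what the KNIME reader does internally). The sample rows are made up for illustration; only the column names come from the original post. It compares a naive split on commas with a quote-aware CSV parser:

```python
import csv
import io

# Hypothetical sample data: a quoted field with a comma, and a quoted field
# with an escaped quote plus an embedded line break.
SAMPLE = (
    'contributions_id,contributions_bodyText\n'
    '1,"Hello, world"\n'
    '2,"She said ""hi""\nand left"\n'
)


def naive_rows(text):
    """Split on line breaks and commas, ignoring quoting rules."""
    return [line.split(",") for line in text.splitlines()]


def quoted_rows(text):
    """Parse with the csv module, which honours quotes and embedded newlines."""
    return list(csv.reader(io.StringIO(text, newline="")))


naive = naive_rows(SAMPLE)
parsed = quoted_rows(SAMPLE)

# The naive split yields rows of inconsistent width; the csv parser does not.
print([len(row) for row in naive])
print([len(row) for row in parsed])
```

The naive version breaks the body text apart wherever a comma or line break appears inside quotes, which is the kind of damage a quote-aware reader such as readr avoids.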

kn_example_r_readr_string_document.knwf (396.3 KB)

Hey @LouisL,

this should definitely not happen, thanks for providing the file and reporting the issue.
I will have a look at the problem and create a ticket, if necessary.

In the meantime, I hope @mlauber71’s workaround works for you.

Cheers,
Julian

I tried using @mlauber71’s file “Bio.table”, but I get the same error.

I tried to convert it into a Document following your instructions. It seems to work, although I am not a specialist in text analysis. I compiled a few links on text and sentiment analysis here.

Maybe have a look. I did it on a Mac, so there might be a text encoding issue going on. Everything should be UTF-8.
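If you want to rule out the encoding, a quick check outside KNIME could look like the sketch below. It assumes you read the raw bytes of the file yourself; the function name and the sample strings are hypothetical and only there for illustration:

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return err.start
    return None


# Hypothetical samples: the same text encoded as UTF-8 and as Latin-1.
good = "caf\u00e9".encode("utf-8")
bad = "caf\u00e9".encode("latin-1")

print(first_invalid_utf8_offset(good))
print(first_invalid_utf8_offset(bad))
```

For a real file you would pass in the result of `open(path, "rb").read()`; a return value of `None` means the whole file decodes cleanly as UTF-8.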

kn_example_r_readr_string_document.knwf (959.0 KB)

The problem seems to be with the tokenizer. I was using the Stanford NLP PTBTokenizer; after switching to the OpenNLP SimpleTokenizer, it worked.

Thanks a lot for your help!
