Strings To Document error

LouisL · March 12, 2019, 5:25pm

Hi,

I’m trying to use the Strings To Document node on a CSV file and I get an error:
ERROR Strings To Document 0:24 Execute failed: String index out of range: 7

Bio.txt (326.6 KB)

The text file here contains a sample of the data.
I use these settings for the Strings To Document node:
Title: contributions_id
Text: contributions_bodyText
Sources: contributions_url
Categories: contributions_section_title
Author: contributions_author_id

Anyone know why I’m getting this error?
Thx

mlauber71 · March 12, 2019, 9:05pm

The text file looks a bit challenging: long lines with various special characters, text containing commas and quotation marks, and then a comma as the column separator … what could possibly go wrong … but as usual, readr saves the day … (hopefully)

kn_example_r_readr_string_document.knwf (396.3 KB)
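
For readers without R at hand, the parsing difficulty described above (commas, quotation marks and line breaks inside a text column, with a comma as the separator) can be illustrated with a minimal Python sketch. This is not the attached workflow, which relies on readr; the sample rows are invented, and only the column names are taken from the original post.

```python
import csv
import io

# Hypothetical sample: column names come from the thread, values are invented.
sample = (
    'contributions_id,contributions_bodyText,contributions_url\n'
    '1,"A body with, commas and ""quoted"" words",https://example.org/a\n'
    '2,"A body spanning\ntwo lines",https://example.org/b\n'
)

def read_rows(text):
    """Parse CSV text, honouring double-quoted fields and embedded newlines."""
    # newline="" leaves line endings untouched so the csv module can handle them.
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return list(reader)

rows = read_rows(sample)
for row in rows:
    # repr() makes embedded commas, quotes and line breaks visible.
    print(row["contributions_id"], repr(row["contributions_bodyText"]))
```

A quote-aware parser keeps each body text in a single field, whereas a naive split on commas would break these rows into too many columns.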

julian.bunzel · March 13, 2019, 3:24pm

Hey @LouisL,

this should definitely not happen, thanks for providing the file and reporting the issue.
I will have a look at the problem and create a ticket, if necessary.

In the meantime, I hope @mlauber71’s workaround works for you.

Cheers,
Julian

LouisL · March 13, 2019, 3:49pm

I tried using @mlauber71’s file “Bio.table”, but I get the same error.

mlauber71 · March 13, 2019, 10:13pm

I tried to convert it into a Document following your instructions. It seems to work, although I am not a specialist in text analysis. I compiled a few links on text and sentiment analysis here.

Maybe have a look. I did it on a Mac, so there might be a text encoding issue going on. Everything should be UTF-8.

kn_example_r_readr_string_document.knwf (959.0 KB)
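
If an encoding mismatch is suspected, one quick check outside KNIME is to test whether the raw bytes of the file decode as UTF-8. The sketch below is only an illustration: the helper name `is_valid_utf8` and the sample byte strings are assumptions, not something taken from the thread or the workflow.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Invented samples: the same text with a curly apostrophe in two encodings.
good = "I’m fine".encode("utf-8")
bad = b"I\x92m fine"  # byte 0x92 is the curly apostrophe in Windows-1252

print(is_valid_utf8(good))
print(is_valid_utf8(bad))
```

For a real file, the bytes could be read with `open(path, "rb").read()` and passed to the same helper.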

LouisL · March 14, 2019, 12:55pm

The problem seems to be with the tokenizer. I was using the StanfordNLP PTBTokenizer; after switching to the OpenNLP SimpleTokenizer, it worked.

Thanks a lot for your help!
