Issue between Strings To Document and NGram Creator (v.4.2.3)

I initially set the Title column to Filepath and the Full text column to Content. Applying NGram Creator with N=6, NGram type Word, and output table "NGram bag of words"
lost a couple of documents.
When I changed the Title column to Content, all documents appeared in the output, but the Document frequency reported by NGram Creator is 2 for all documents except those that
disappeared in the original run; their Document frequency is 1.
I’d like to provide the original data, but it is confidential.

I found this explanation from @ScottF

Anyway, the fact that the title is included in the corpus needs to be addressed. I use the title as a key. Also, it looks like the title is considered a separate document (???).

Hi @izaychik63

Can you please check which documents you lost and what is special about the file path of these documents? Is there any error/warning output on the KNIME console or log file?

Maybe this post helps clarify your issue regarding the title column.

Best,
Temesgen

No errors, no messages, nothing special about the paths. I also experimented with Title as RowID and with an empty title. The result is the same: 4 documents are lost in NGram Creator. When I tried Path as both Title and Content, those 4 documents were not lost. My only guess is that part of the path is interpreted as a special character or command (\A, \M, \S), so that the path becomes shorter than 6 words. It is also confusing, as I mentioned in my previous post, that the path is counted as a separate document. There should be an option to include it as part of the document, to treat it as a separate document, or to use it only as an identification key.

It could also be that some of your documents contain only 3-6 words.

If the same string is used as Title and as Full Text, the documents are there and the 6-word context is also extracted. It is not clear why 6 words are extracted from the Title but not from the Full Text portion. I’d assume this is a bug (different processing of the Title and Full Text).

The Full Text field has content such as
Diagnostic Accuracy 86.0%
NPI 1111111111
for all lines. Is it possible that special characters (\n) and the % symbol are processed differently in the Title and in the Full Text?

I made an example here. Please take a look at why the string yields 6 parts when used as the path but not when used as the content.
Salud Example.knwf (12.9 KB)

@temesgen-dadi , did you have a chance to look at my example?

Thank you
Igor

@temesgen-dadi I added answers from the Palladian group about the N-Gram Extractor.

It may shed some light on my question: why am I getting an unexpected result from NGram Creator in the first example?
Salud Example.knwf (21.1 KB)

Hey @izaychik63,

sorry for the late response.

I had a look at your example and got the same weird behavior.
I will do some debugging and check what the problem could be.

Best regards,

Julian

Hi @izaychik63,

no sentence tokenization is done for the title. The document body text, however, is tokenized into sentences. The underlying sentence tokenizer splits the string
Diagnostic Accuracy 86.0%
NPI 1111111111
into two sentences. The NGram Creator builds n-grams per sentence rather than over the whole string, so no n-grams of size 6 can be found for this particular text. If this text is set as the title, it returns a 6-gram, because the title is treated as one sentence.
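
To make the effect concrete, here is a minimal Python sketch of sentence-scoped n-grams. It is not the KNIME or OpenNLP implementation: the `tokenize` regex is an assumption (it keeps `86.0` whole and splits `%` off as its own token, which is one way to arrive at six tokens), and the sentence split is written out by hand to match the split described above.

```python
import re

def tokenize(text):
    # Assumption: numbers stay whole ("86.0") and symbols such as "%"
    # become separate tokens. The real KNIME word tokenizer may differ.
    return re.findall(r"\d+(?:\.\d+)?|\w+|[^\w\s]", text)

def sentence_ngrams(sentences, n):
    """Build word n-grams inside each sentence, never across a boundary."""
    grams = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        # A sentence with fewer than n tokens contributes nothing.
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams

# Body text after the split described above: two short sentences.
body = ["Diagnostic Accuracy 86.0%", "NPI 1111111111"]
# Title: the same string, kept as a single sentence.
title = ["Diagnostic Accuracy 86.0% NPI 1111111111"]

body_6grams = sentence_ngrams(body, 6)
title_6grams = sentence_ngrams(title, 6)
print(body_6grams)
print(title_6grams)
```

Under these assumptions the body yields no 6-gram, because neither sentence has six tokens, while the title yields exactly one.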

Best,
Julian

Thank you, Julian. Is there a way to avoid splitting into 2 sentences? Should I delete some characters?

It depends on what you want to keep. Currently the split happens between the % and NPI. Case conversion or removing the % could do the trick. You would need to do this before using the Strings To Document node though.
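
The two clean-up options mentioned here can be sketched in Python as plain string operations. The function name `preprocess` and its flags are hypothetical; inside KNIME the same step would be done with a string manipulation node placed before Strings To Document. Whether the cleaned text is then kept as one sentence still has to be checked with the Sentence Extractor node.

```python
def preprocess(text, remove_percent=True, lowercase=False):
    """Hypothetical clean-up applied to the raw string before tokenization."""
    if remove_percent:
        # Drop the "%" that sits right before the observed split point.
        text = text.replace("%", "")
    if lowercase:
        # Case conversion, the other option mentioned above.
        text = text.lower()
    return text

raw = "Diagnostic Accuracy 86.0% NPI 1111111111"
without_percent = preprocess(raw)
lowercased = preprocess(raw, remove_percent=False, lowercase=True)
print(without_percent)
print(lowercased)
```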

Thank you, Julian. To dot the i's, I added another line to the second example. It looks the same but is processed differently. I’d really appreciate it if you could solve the puzzle of why.
Salud Example.knwf (21.2 KB)

Hi @izaychik63,

it looks fine to me, however the sentence tokenizer might be a bit of a black box (it’s OpenNLP's sentence tokenizer). When using the Sentence Extractor node, we can see that the first document consists of three sentences: the title (which itself consists of the whole string), the first part of the string (Diagnostic Accuracy 70.0%), and a second part (NPI 1111111111). This leads to a count of 1 (frequency of the 6-gram across the document), since a 6-gram can be found only in the title.

For the second document, the tokenizer behaves differently and the string is recognized as a full sentence in the document body text instead of being split. That is why we get a 2 as document frequency for that particular 6-gram.
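
The difference between the counts of 1 and 2 can be reproduced with a small Python sketch. It is only an illustration under assumptions: the `tokenize` regex is made up (it treats `%` as a separate token), the sentence lists are written by hand to mirror what the Sentence Extractor showed, and because the second document's text is not quoted in this thread, the same string is reused for it.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: "%" is split off as its own token, giving six tokens in total.
    return re.findall(r"\d+(?:\.\d+)?|\w+|[^\w\s]", text)

def ngram_counts(title, body_sentences, n=6):
    """Count n-grams over the title (one sentence) plus each body sentence."""
    counts = Counter()
    for sentence in [title, *body_sentences]:
        tokens = tokenize(sentence)
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

text = "Diagnostic Accuracy 70.0% NPI 1111111111"
six_gram = " ".join(tokenize(text))

# First document: the body was split into two sentences.
doc1 = ngram_counts(text, ["Diagnostic Accuracy 70.0%", "NPI 1111111111"])
# Second document: the body was kept as one sentence (same text reused here).
doc2 = ngram_counts(text, [text])

print(doc1[six_gram], doc2[six_gram])
```

In the first case only the title contributes the 6-gram; in the second case both the title and the unsplit body contribute it.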

In general, it looks like the problem is related more to the underlying tokenizer than to the NGram Creator.
I will do a bit of debugging to confirm that.

Best,
Julian

Thank you, Julian. This confirms my assumption that sentence processing is confused by the input data and that the result is unstable. That is why, in some cases, I failed to extract the percentage and the NPI.

There is a ticket about being able to choose other sentence tokenizers. I’ll give it a +1.

In case it sheds some light: one of the strings contains a U+00A0 (non-breaking space) character, which confused the Palladian N-Gram Extractor. To make it work, I had to delete it from the string.
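
For anyone hitting the same thing, here is a small Python sketch for locating and removing that character outside KNIME. The helper names and the sample string are hypothetical; the position of the non-breaking space in the sample is made up for illustration.

```python
NBSP = "\u00a0"  # U+00A0, non-breaking space: looks like a normal space

def find_nbsp(text):
    """Return the positions of every U+00A0 character in the text."""
    return [i for i, ch in enumerate(text) if ch == NBSP]

def remove_nbsp(text, replacement=""):
    """Delete U+00A0, or swap it for `replacement` (e.g. a regular space)."""
    return text.replace(NBSP, replacement)

# Hypothetical sample: visually identical to a string with ordinary spaces.
sample = "Diagnostic Accuracy 86.0%" + NBSP + "NPI 1111111111"

positions = find_nbsp(sample)
deleted = remove_nbsp(sample)
spaced = remove_nbsp(sample, replacement=" ")
print(positions)
print(repr(deleted))
print(repr(spaced))
```

Passing `replacement=" "` keeps the two words apart, which may be preferable to plain deletion when the character sits between tokens.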