The Issue between String to Document and NGram Creator (v.4.2.3)

I initially set the Title column to Filepath and the Full text column to Content. Running NGram Creator with N=6, NGram type Word, and Output table "NGram bag of words" lost a couple of documents.
When I changed the Title column to Content, all documents appeared in the output, but the Document frequency reported by NGram Creator is 2 for all documents except the ones that disappeared in the original run. Their Document frequency is 1.
I’d like to provide the original data, but it is confidential.

I found this explanation from @ScottF

Anyway, the fact that the title is included in the corpus needs to be addressed. I use the title as a key. Also, it looks like the title is treated as a separate document (???).
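
To make the suspicion concrete, here is a minimal Python sketch of how document frequency would behave if the title were counted as its own unit. This is only an illustration under that assumption, not KNIME's actual implementation; the whitespace tokenizer, the sample text, and the file path are all hypothetical.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return all word n-grams of a text, using plain whitespace tokenization."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_frequency(units, n):
    """Count in how many units each n-gram occurs at least once."""
    df = Counter()
    for unit in units:
        df.update(set(word_ngrams(unit, n)))
    return df


body = "alpha beta gamma delta"  # hypothetical document content

# Case 1: Title holds the same text as the body, and each is counted separately.
df_title_is_content = document_frequency([body, body], n=2)

# Case 2: Title is only a key (a hypothetical file path) and shares no n-grams.
df_title_is_key = document_frequency(["C:\\data\\doc1.txt", body], n=2)

print(df_title_is_content["alpha beta"])  # 2
print(df_title_is_key["alpha beta"])      # 1
```

Under this assumption, a frequency of 2 simply means the same text was seen once as title and once as body.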

Hi @izaychik63

Can you please check which documents you lost and what is special about the file path of these documents? Is there any error/warning output on the KNIME console or log file?

Maybe this post helps clarify your issue regarding the title column.


No errors, no messages, and nothing special about the paths. I also experimented with Title set to RowID and with an empty Title. The result is the same: 4 documents are lost in NGram Creator. When I used Path as both Title and Content, those 4 documents were not lost. My only guess is that part of the path is interpreted as a special character or command (\A, \M, \S), so that the path becomes shorter than 6 words. It is also confusing, as I mentioned in my previous post, that Path is counted as a separate document. There should be an option to include it as part of the document, to treat it as a separate document, or to use it only as an identification key.

It could also be that some of your documents contain only 3-6 words.
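
A rough Python sketch of why document length matters here: a text with fewer than N tokens cannot produce any word N-gram at all. The whitespace tokenizer and the sample strings are assumptions for illustration, not the node's actual behavior.

```python
def count_word_ngrams(text, n):
    """Number of word n-grams in a text, using plain whitespace tokenization."""
    tokens = text.split()
    # A text with k tokens has k - n + 1 n-grams, and none if k < n.
    return max(0, len(tokens) - n + 1)


short_doc = "NPI 1111111111"                        # 2 tokens
longer_doc = "one two three four five six seven"    # 7 tokens

print(count_word_ngrams(short_doc, 6))   # 0
print(count_word_ngrams(longer_doc, 6))  # 2
```

A document that yields zero 6-grams would have no rows in an n-gram output table, which would look like a lost document.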

If the document is used as both Title and Full Text, the documents are there, and 6-word content is extracted as well. It is not clear why 6 words are extracted from the Title but not from the Full Text portion. I’d assume this is a bug (the Title and the Full Text are processed differently).

The Full Text field has content such as
Diagnostic Accuracy 86.0%
NPI 1111111111
for all lines. Is it possible that special characters (\n) and the % symbol are processed differently in the Title and in the Full Text?
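
As a rough check of that idea, the Python sketch below tokenizes the two sample lines in two different ways. Neither tokenizer is claimed to be what KNIME uses; both are assumptions chosen only to show that the token count, and so whether a text reaches 6 words, depends on how punctuation such as `.` and `%` is handled.

```python
import re

sample = "Diagnostic Accuracy 86.0%\nNPI 1111111111"

# Tokenizer A (assumed): split on whitespace only, so "86.0%" stays one token.
tokens_whitespace = sample.split()

# Tokenizer B (assumed): separate every punctuation character into its own token.
tokens_punct = re.findall(r"\w+|[^\w\s]", sample)

print(tokens_whitespace)
print(tokens_punct)
print(len(tokens_whitespace), len(tokens_punct))  # 5 8
```

With N=6, the first tokenization leaves this text one token short of a single 6-gram, while the second yields three.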