Question on the naming of documents

tomskaczmarek · September 12, 2014, 10:44pm

I am using the Flat File Document Parser to read 120 documents. I did not want the first sentence of each document to be used as its name, so I added my own. I placed the name I wanted at the start of each document, followed by a "." and a newline, hoping it would be recognized as the first sentence. This worked in all but 4 of the 120 cases. As a workaround, I forced the first word of the next line to start with a capital letter, and this got me past the problem.

However, I would love to understand why this was happening, or whether there is a better way to accomplish what I was trying to do. Either way, I would still like to understand how the name is generated.

For reference, all 4 cases began with something of the form "nn Reply Brief.html.", where nn is a one- or two-digit number.

kilian.thiel · September 17, 2014, 3:43pm

Hi tomskaczmarek,

The first sentence is used as the document title. The sentence tokenizer recognizes a string followed by a "." as a sentence if the next letter after the whitespace is a capital letter. The OpenNLP tokenizer is used for sentence tokenization.
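As a rough illustration of that rule, here is a minimal Python sketch. It is not the OpenNLP implementation (OpenNLP uses a trained model that weighs more context than this), and the helper name `first_sentence` is made up for the example. It only mimics the simplified rule: a "." ends a sentence when it is followed by whitespace and then a capital letter.

```python
import re

# Simplified boundary rule (an assumption, not OpenNLP's actual logic):
# a "." followed by whitespace and then an uppercase ASCII letter.
_BOUNDARY = re.compile(r"\.\s+(?=[A-Z])")


def first_sentence(text):
    """Return the text up to and including the first detected boundary."""
    match = _BOUNDARY.search(text)
    if match is None:
        # No boundary found: the whole text is treated as one sentence.
        return text.strip()
    # Keep the period itself, drop the whitespace after it.
    return text[: match.start() + 1].strip()


capitalized = "8 Reply Brief.html.\nIn the matter of x. More text."
lowercased = "8 Reply Brief.html.\nin the matter of x. More text."

print(repr(first_sentence(capitalized)))
print(repr(first_sentence(lowercased)))
```

Under this simplified rule, the title line is split off only when the following line starts with a capital letter. With a lower-case second line, the "first sentence" runs on into the body text.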

Cheers, Kilian

tomskaczmarek · September 18, 2014, 12:23am

Thanks. This makes some sense, but there must be more to it. There must be something happening in the tokenizer. I preprocessed the text in Python to add a mixed-case title that contained exactly two periods. The Python code appended a newline ("\n") to the title before adding it to the files. For example, "8 Reply.Brief.html." was the first line in one of the files. In this case the name of the document should have been "8 Reply.Brief.html." I used the names of the documents later in the processing. This worked in almost all cases.

I had converted all the text to lower case (except in the names, as shown in the example above).
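For anyone trying the same thing, this kind of preprocessing, combined with the capitalization workaround from the first post, could look roughly like the sketch below. The function name `add_title` is hypothetical and this is not the exact code I ran. It assumes the body is already lower-cased and that capitalizing its first character is acceptable.

```python
def add_title(title, body):
    """Prepend a title line to a document body.

    The title is terminated with "." and a newline, and the first
    character of the body is upper-cased so that a tokenizer which
    expects a capital letter after the period can see a boundary.
    """
    title = title.strip()
    if not title.endswith("."):
        title += "."
    body = body.lstrip()
    # Upper-case only the first character; the rest stays untouched.
    body = body[:1].upper() + body[1:]
    return title + "\n" + body


print(add_title("8 Reply Brief.html", "in the matter of x."))
```

The trade-off is that the first word of the body no longer matches the lower-cased form used elsewhere in the text, which may matter for later term counting.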

I went back and looked at the 4 cases (out of 106) that failed. They all had a second line that began with a prepositional phrase (all of them in fact started with "in ..."). I hadn't noticed that before.
