Workaround for extracting document titles from a column with full “references” data



I would appreciate help finding a workaround for extracting document titles from a column that holds full “references” data. By full “references” I mean the following structure:


author(s), title, journal


The different references are in the same row DELIMITED by ;


author a, TITLE AAAA: AAAA, journal XXXX; author b, author c, TITLE BBBB: BBBBB, journal YYYYY.


I’ve attached a workflow in which I am able to split, group and count the references, but not to extract just the title. The desired output would be:

        Title               Count
Row1 => TITLE AAAA: AAAA    2
Row2 => TITLE BBBB: BBBBB   1


My reason for extracting the title is that in the real data even the same author can appear under different names (e.g. Silva C.E., Silva CE), and the journal name can appear either in full or abbreviated. Counting whole references would therefore be impossible in a larger dataset. Extracting just the title would make the counting possible.



I thought that a workaround could be to `find a sequence of strings between commas that is longer than 5 words or 25 letters (including spaces and colons, since this punctuation occurs in the titles)`. Since the authors' names tend to be shorter than 10 characters, they would be ignored. Another point is that the extraction should stop after the FIRST such sequence is found, because other long sequences (e.g. journal names) could appear after the title.
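
To make the idea concrete, here is a rough Java sketch of that heuristic. It is only an illustration under some assumptions: the class and method names are made up, the sample references are placeholders, and it assumes that titles themselves contain no commas (a title with a comma would be cut at that comma).

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TitleHeuristic {
    // Thresholds from the heuristic described above.
    static final int MIN_WORDS = 5;
    static final int MIN_CHARS = 25;

    /** Returns the first comma-separated field that is long enough to be a title, or null. */
    static String extractTitle(String reference) {
        for (String field : reference.split(",")) {
            String candidate = field.trim();
            int words = candidate.isEmpty() ? 0 : candidate.split("\\s+").length;
            if (words > MIN_WORDS || candidate.length() > MIN_CHARS) {
                return candidate; // stop at the FIRST long field, so journal names are ignored
            }
        }
        return null; // no field passed the thresholds
    }

    /** Splits one cell on ';' and counts how often each extracted title occurs. */
    static Map<String, Integer> countTitles(String cell) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String reference : cell.split(";")) {
            String title = extractTitle(reference);
            if (title != null) {
                counts.merge(title, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Placeholder references: same title, different author and journal spellings.
        String cell = "author a, A LONG PLACEHOLDER TITLE: PART ONE, journal XXXX; "
                + "author a., A LONG PLACEHOLDER TITLE: PART ONE, j. XXXX; "
                + "author b, author c, ANOTHER PLACEHOLDER TITLE: PART TWO, journal YYYYY.";
        System.out.println(countTitles(cell));
    }
}
```

Note that the short toy titles used earlier in this post (e.g. TITLE AAAA: AAAA) are below both thresholds, so this sketch would return nothing for them. It only works if real titles are reliably longer than author and journal fields that come before them.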


Maybe there is a better logic than that.


Many thanks in advance!



Hi Cadu,


With the JavaSnippet node and some substring operations it is possible to extract the title, beginning with "title" and ending with ",", as well as the authors. The titles can then be grouped and counted as you described.


Attached you will find an example workflow with a configured JavaSnippet node that extracts titles and authors.
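
The attached workflow is not reproduced here, but the substring idea can be sketched roughly as follows in plain Java. This is not the code from the workflow: the class and method names are hypothetical, and it assumes every title literally starts with the marker "TITLE", as in the sample row. Inside a JavaSnippet node the same logic would be applied to the input column value instead of a method argument.

```java
class SubstringTitle {
    // Assumption: titles start with this literal marker, as in the sample data.
    static final String MARKER = "TITLE";

    /** Returns the text from the marker up to the next comma, or null if there is no marker. */
    static String titleOf(String reference) {
        int start = reference.indexOf(MARKER);
        if (start < 0) {
            return null;
        }
        int end = reference.indexOf(',', start);
        if (end < 0) {
            end = reference.length(); // no comma after the title: take the rest
        }
        return reference.substring(start, end).trim();
    }

    /** Returns everything before the marker (the author list), without the trailing comma. */
    static String authorsOf(String reference) {
        int start = reference.indexOf(MARKER);
        if (start < 0) {
            return null;
        }
        String authors = reference.substring(0, start).trim();
        if (authors.endsWith(",")) {
            authors = authors.substring(0, authors.length() - 1).trim();
        }
        return authors;
    }
}
```

For the reference `author a, TITLE AAAA: AAAA, journal XXXX`, `titleOf` gives `TITLE AAAA: AAAA` and `authorsOf` gives `author a`. Real titles will not share a fixed marker, so this only fits data shaped like the sample.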


Cheers, Kilian