Hi,
I would appreciate help finding a workaround for extracting document titles from a column that holds full “references” data. By full “references” I mean the following structure:
author(s), title, journal
Different references in the same row are delimited by ;
author a, TITLE AAAA: AAAA, journal XXXX; author b, author c, TITLE BBBB: BBBBB, journal YYYYY.
I’ve attached a workflow in which I am able to split, group and count the references, but not extract just the title. The desired output would be:
Title | Count
Row1=> TITLE AAAA: AAAA | 2
Row2=> TITLE BBBB: BBBBB | 1
My reason for extracting the title is that in the real data, even the same author can appear under different names (e.g. Silva C.E., Silva CE), and the journal name can appear either in full or abbreviated. So it would be impossible to count whole references in a larger dataset. Extracting just the title would allow the counting.
I thought that a workaround could be to `find a sequence of strings between commas that is longer than 5 words or 25 letters (including spaces and colons, since this punctuation exists in the titles)`. Since the authors' names tend to be shorter than 10 letters, they would be ignored. Another point is that the extraction should stop after the FIRST such sequence is found, because other long sequences (e.g. journal names) could appear after the title.
Maybe there is a better logic than that.
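To make the idea concrete, here is a minimal Python sketch of that heuristic. It is only an illustration and not the attached workflow: the function names `extract_title` and `count_titles` and the sample rows are made up for the example, and the sample titles are deliberately longer than the AAAA/BBBB placeholders above so that they pass the 5-word / 25-letter thresholds.

```python
from collections import Counter


def extract_title(reference, min_words=5, min_chars=25):
    """Return the first comma-separated segment that has more than
    min_words words or more than min_chars characters, else None."""
    for segment in reference.split(","):
        segment = segment.strip()
        if len(segment.split()) > min_words or len(segment) > min_chars:
            return segment  # stop at the FIRST long segment
    return None


def count_titles(rows):
    """Split each row on ';', extract one title per reference, count them."""
    counts = Counter()
    for row in rows:
        for reference in row.split(";"):
            # Drop surrounding spaces and a trailing full stop, if any.
            title = extract_title(reference.strip().rstrip("."))
            if title is not None:
                counts[title] += 1
    return counts


# Hypothetical sample rows (invented for this sketch).
rows = [
    "Silva C.E., A long study of something: part one, J. Examples; "
    "Souza A., Lima B., Another fairly long title here: results, Journal of Samples.",
    "Silva CE, A long study of something: part one, Journal of Examples",
]

counts = count_titles(rows)
for title, n in counts.most_common():
    print(title, n)
```

One limitation of this rule: a title that itself contains a comma would be cut at that comma, and a short title (below both thresholds) would be skipped in favour of a long journal name, so the thresholds would need tuning on the real data.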
Many thanks in advance!
Cadu