Data cleaning | How to remove copyright statements in abstracts? | String Manipulator + Regex

Hi, I want to use abstracts for some machine learning.
But the abstracts have copyright statements at the beginning or at the end of each abstract.
I tried to get rid of these using regex and leave the remaining text as it was, but somehow I don’t get the output I would expect from the String Manipulator.
What am I doing wrong?

Below are all the steps I took.

I took some sample text (see below for reproduction), and created the following regex in regex101.com to detect the parts I want to have removed.
(^\©.*\.[A-Z])|(\©\s.*\.$)

Then I went to KNIME and used the String manipulator, along with the instructions I read earlier on this forum post, and combined the pieces of information in this statement:
regexReplace($abstract$, "(^\©.*\.[A-Z])|(\©\s.*\.$)" , "$1" )

Resulting output in “abstracts_cleaned”, with the copyright statements still there…

Using an empty string as the replacement instead chops out pieces of the texts that are entirely unexpected given the matches shown in regex101.
regexReplace($abstract$, "(^\©.*\.[A-Z])|(\©\s.*\.$)" , "" )

Resulting output in “abstracts_cleaned”, with unexpected chunks of texts removed, and copyright statements at the end still attached.

Here are the abstracts you can use to reproduce / solve the issue.

This paper investigates --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- and reduce poverty. © 2011 Kiel Institute.
We assessed intergenerational differences in --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- adopt a multigenerational approach. © The Author(s) 2012.
© Center for Southeast Asian Studies, Kyoto University.Themes of inclusion, empowerment, and --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. ---grounded in local realities.
© 2015 Elsevier Ltd.Land reform may be an effective  --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- effects by gender and inheritance systems.
© Springer International Publishing Switzerland 2017.This chapter provides the conceptual --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. ---the public sector have failed.
© 2019 The Author(s) 2019.Evidence shows that women --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- reported for individuals as well as households.
© 2015 Taylor & Francis.There is much debate within --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- precancerous cervical lesions. © 2014 Elsevier Ltd.
© Lahore School of Economics 2015.Pakistan’s economic performance --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- best way is to target poverty in Pakistan.
© The Editor(s) (if applicable) and The Author(s) 2016.Iran has experienced various social, economic, and political --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- have influenced the vulnerability rate?
© Copyright 2019 by the American Psychosomatic Society.Objective: We examined associations among  --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- socioeconomic adversity and high social resources in the cohort.

Thank you so much in advance for your insights!

Best wishes.
Maurice


Hi @Maurice,

I think the main problem you have is that when entering regex in a String Manipulation, you need to use a pair of \ characters wherever you would normally have a single \ in the regex expression, so you would have improved results with:

regexReplace($abstract$, "(^\\©.*\\.[A-Z])|(\\©\\s.*\\.$)" , "" )
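The doubling rule can be illustrated outside KNIME. The following is a minimal sketch in Python (used here only as a stand-in, not the String Manipulation node itself): in any string literal where \ is an escape character, each regex backslash has to be written twice. The variable names are made up for illustration.

```python
import re

# The regex as designed in regex101: a single backslash before "s" and ".".
regex_as_designed = r"©\s.*\.$"

# The same regex typed into a string literal that treats backslash as an
# escape character: every backslash has to be doubled.
regex_in_literal = "©\\s.*\\.$"

# Both spellings produce exactly the same pattern text.
same_pattern = regex_as_designed == regex_in_literal

text = "--- and reduce poverty. © 2011 Kiel Institute."
cleaned = re.sub(regex_in_literal, "", text)
print(same_pattern, repr(cleaned))
```

With the doubled backslashes, the trailing copyright statement in this sample is matched and removed.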

However, I didn’t get that to work for me quite right, so I tried a slightly simpler expression instead:

regexReplace($abstract$,"©[^.]*[\\.]{0,1}" ,"")

This removes anything beginning © followed by any number of characters that aren’t a period/full-stop, followed by either 0 or 1 period .
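As a rough check of what this simpler expression does, here is a small Python sketch applying the same pattern to shortened versions of three of the sample abstracts. Python's re module is an assumption standing in for KNIME's regexReplace; for a pattern this basic the two are expected to behave the same way.

```python
import re

# Same pattern as above, written as a Python raw string (single backslash).
PATTERN = r"©[^.]*[\.]{0,1}"

samples = [
    "--- and reduce poverty. © 2011 Kiel Institute.",
    "© 2015 Elsevier Ltd.Land reform may be an effective ---",
    "© The Editor(s) (if applicable) and The Author(s) 2016.Iran has experienced ---",
]

# Remove every occurrence of the pattern in each sample.
cleaned = [re.sub(PATTERN, "", s) for s in samples]
for line in cleaned:
    print(repr(line))
```

Because the match stops at the first period after ©, a copyright statement that itself contains a period would only be removed up to that period.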

I’ve uploaded a workflow to demonstrate. It contains both your, and my regex so you can have a play. If mine isn’t quite doing what you need, then please post back and either I or somebody else can assist further.

(btw for anybody struggling to enter © on Windows, one key sequence is alt-0169 using the numeric pad)

Remove copyright text.knwf (9.3 KB)


Hi @takbb,

Superb! That pair of \ characters now makes a ton of sense! It does work!

The only thing remaining is to make a regex that cleans out all the dirty data.

Your simplified regex is a seriously good starter that might just clean out 90% of the copyright statements.

Looking at the additional abstracts I found some more troublesome lines.

Highlighted:

  • Blue: the statement at the beginning of the line contains a second dot, so it actually ends with a dot followed by a non-whitespace character. (I am not a regex wizard, so I just looked for the pattern \\.[A-Z], but that also removes the capital letter at the beginning of the sentence.)
  • Yellow: the copyright statement does not end with a dot, but just with Ltd, so the pattern becomes too greedy and removes the first sentence. (So we might also look for ‘Ltd’ as an alternative ending?)
  • Pink: a copyright statement at the end contains more information after the first dot, followed by several other sentences until the end of the string.

So I think we need to add some “alternative endings” to the regex you made. Apart from “… followed by either 0 or 1 period”, we could add “… followed by ‘Ltd’”, “… followed by end of string / line”, or “… followed by a dot followed by non-whitespace”?
And perhaps they need to be in the right order…

Something like this:
(\©.*\.\b(?=\w))|(\©.*Ltd)|(\©.*Ltd\.)|(\©.*\.)|(\©\s\d{4})

But this one is too greedy when the abstracts contain numbers with a dot in between… (and research papers have a lot of them…) It seems the regex pattern looks for the last dot instead of stopping at the first one. Perhaps I need to look into that [\.]{0,1} pattern you use…
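The "last dot instead of the first one" behaviour is what a greedy .* does. A small Python sketch of the difference between the greedy .* and the lazy .*? follows; the sample sentence is made up, and Python's re module is assumed to behave like KNIME's regex engine for these basic quantifiers.

```python
import re

# Made-up sample: a leading copyright statement, then text with a decimal number.
text = "© 2015 Elsevier Ltd.Land prices rose by 1.9% in 2012."

# Greedy: ".*" grabs as much as possible, so the match runs to the LAST dot.
greedy = re.sub(r"©.*\.", "", text)

# Lazy: ".*?" grabs as little as possible, so the match stops at the FIRST dot.
lazy = re.sub(r"©.*?\.", "", text)

print(repr(greedy))
print(repr(lazy))
```

The lazy form behaves much like the [^.]* construct in the simplified pattern: both stop at the first period, which is also why neither copes with a statement that contains a period of its own (such as "B.V.").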

Do you have some suggestions?

Here are some additional abstract lines with the patterns as in the example. I also updated and uploaded your .knwf file.
Remove copyright text.knwf (12.3 KB)

This paper investigates --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- and reduce poverty. © 2011 Kiel Institute.
We assessed intergenerational differences in --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- adopt a multigenerational approach. © The Author(s) 2012.
© Center for Southeast Asian Studies, Kyoto University.Themes of inclusion, empowerment, and --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. ---grounded in local realities.
© 2015 Elsevier Ltd.Land reform may be an effective  --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- effects by gender and inheritance systems.
© Springer International Publishing Switzerland 2017.This chapter provides the conceptual --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. ---the public sector have failed.
© 2019 The Author(s) 2019.Evidence shows that women --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- reported for individuals as well as households.
© 2015 Taylor & Francis.There is much debate within --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- precancerous cervical lesions. © 2014 Elsevier Ltd.
© Lahore School of Economics 2015.Pakistan’s economic performance --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- best way is to target poverty in Pakistan.
© The Editor(s) (if applicable) and The Author(s) 2016.Iran has experienced various social, economic, and political --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- have influenced the vulnerability rate?
© Copyright 2019 by the American Psychosomatic Society.Objective: We examined associations among  --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- socioeconomic adversity and high social resources in the cohort.
© 2019 by the authors.The strategic goal of city management--- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- to create a new approach to management of city development consistent with the known facts.
© 2018 Elsevier B.V.Due to rapid economic growth and population--- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- solving China's RWR problems.
© 2015 Tine S. Prøitz.This article focuses on Nordic education--- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- analytical concepts employed within European integration studies.
© 2019, Emerald Publishing Limited.Purpose: To gain the highest performance--- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- therefore, less attention has been paid to them in the literature.
© 2016 Elsevier LtdThe purpose of this paper is to assess the status and progress of rural household energy sustainable development in China. A new composite indicator, rural energy sustainable development--- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- decrease stage from 1997 to 1998, and a rapid increase stage between 1999 and 2012.
© 2016 Informa UK Limited, trading as Taylor & Francis Group.ABSTRACT: One of the European Union’s 2014–2020 cohesion--- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- We discuss these implications in detail.
© Schuetze 2015.A single narrative about the Gorongosa Restoration Project (GRP--- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- resident populations in conservation and development schemes.
© 2019 Published under licence by IOP Publishing Ltd.The Zwicky Transient Facility (ZTF)--- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- The main goal of our program is to study the variability of Be stars in the range of ∼13.5 to ∼20.5 magnitudes.
© 2018Background: Coordinated approaches are needed to optimally control the spread of resistant organisms across facilities that share patients. Our goal was to understand social tensions--- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- Israel fell by 1.9% and 1.8%, respectively. Total EU flower imports fell by 1.4%. © The Author(s) 2010. Published by Oxford University Press, on behalf of Agricultural and Applied Economics Association. All rights reserved.
Globally as well as in the Asia-Pacific Region,--- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- These issues can only be addressed if the proposals incorporate lessons from on-the-ground experiences at a local, regional and national level. This edition first published 2013 © 2013 John Wiley & Sons, Ltd.
© The Authors, published by EDP Sciences, 2018.Today, the dairy sub-complex of the--- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- execution of the pairing algorithm. © 2012 IEEE.
Lu et al. found that health aid displaces--- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- assistance for health channelled to governments remains significantly fungible. © 2013 The Author(s). Published by Taylor & Francis.
© 2015 Institute of Economic Affairs.There is a continuing debate on --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- enhance the climate for inward investment.
© 2018 Intellect Ltd Article.Commercial communications constitute--- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- is a key revenue source for ad-free services like Netflix.
© 2019 IEEE.The goal of this special session is to --- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- Participants will leave with resources to create their own effective faculty learning committee.
© 2017 The Korean Association for Public Administration.This paper addresses the collaboration and partnership--- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed erat sapien, aliquet id porta ut, porttitor scelerisque risus. --- can facilitate collaborative governance.

Hi @maurice, this suddenly got a lot more involved as I realised that there is a fundamental problem with the approach taken so far: RegexReplace does not take into account “Capture Groups”, so it will replace everything identified by the regex string, and not just the portions contained in the ( ) portion of the regex expression. This is why you start to lose parts of the text that follow the copyright message.
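To make the whole-match behaviour concrete, here is a hedged Python sketch (Python's re.sub standing in for regexReplace; the sample sentence is shortened from the thread). It also shows two common regex ways to keep the capital letter that follows the statement: a lookahead, or a backreference in the replacement. Whether the backreference syntax carries over to the String Manipulation node unchanged is an assumption not verified here.

```python
import re

text = "© 2015 Elsevier Ltd.Land reform may be an effective tool."

# The replacement applies to the WHOLE match, so the capital letter that was
# only meant as an "end marker" is removed along with the capture group.
whole_match = re.sub(r"(©[^.]*\.)[A-Z]", "", text)

# Option 1: a lookahead checks for the capital letter without consuming it.
with_lookahead = re.sub(r"©[^.]*\.(?=[A-Z])", "", text)

# Option 2: capture the capital letter and put it back in the replacement.
with_backref = re.sub(r"©[^.]*\.([A-Z])", r"\1", text)

print(repr(whole_match))
print(repr(with_lookahead))
print(repr(with_backref))
```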

I went back to the drawing board and ended up with the attached flow.

A new table of individual regex patterns, in the order in which they are to be searched, is used in a loop with a Regex Split to extract the piece of text found by each pattern. Keeping these as individual patterns makes it easier to maintain and add new patterns, whilst also allowing us (to an extent) to find multiple © patterns within the same piece of text.

This loop causes multiple rows (one per regex pattern) to be created for each “abstract”, and so these are then brought back together by a GroupBy. After that, each of the found strings is processed by a Column Expression node in the order of the patterns that were matched, and the substrings are removed from the original string. As the substrings are processed in order, if a later substring contains an earlier one, it won’t be processed, because by the time it is reached the earlier substring has already been removed. This is intentional and means that the patterns can become progressively “looser”.


For example if we have the text

So, for example, a pattern ending “Ltd” will be processed after a pattern ending “Ltd.”, and so the substring found by the “Ltd.” pattern will be removed (including the “.”). Even though “Ltd” is also found, it will not then do anything, because “Ltd.” already caused the substring to be removed. Likewise, some patterns will identify everything from “©” to end of line, which isn’t what you’d necessarily want (unless the © message were the only thing on the line!), but in those cases it is expected that an earlier regex pattern will find a shorter string to remove first, thus neutralising the effect of the subsequent regex.
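The ordering idea described above can be sketched in a few lines of Python. This is not the KNIME workflow itself, only an approximation of its logic under stated assumptions: the three patterns are hypothetical examples (the real ones live in the workflow's pattern table), and the function name remove_copyright is made up.

```python
import re

# Hypothetical patterns, ordered from strict to loose.
PATTERNS = [
    r"©[^.]*Ltd\.",  # statement ending in "Ltd."
    r"©[^.]*Ltd",    # statement ending in "Ltd" without a period
    r"©.*$",         # loosest: everything from © to the end of the text
]

def remove_copyright(text, patterns=PATTERNS):
    # Step 1: collect the first match of each pattern against the ORIGINAL text.
    found = []
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            found.append(match.group(0))

    # Step 2: remove the found substrings in pattern order. A looser match that
    # contained an already-removed substring is no longer present and is skipped.
    cleaned = text
    for substring in found:
        if substring in cleaned:
            cleaned = cleaned.replace(substring, "", 1)
    return cleaned

print(repr(remove_copyright("© 2015 Elsevier Ltd.Land reform works.")))
print(repr(remove_copyright("© 2016 Elsevier LtdThe purpose is clear.")))
print(repr(remove_copyright("Poverty fell. © 2011 Kiel Institute.")))
```

In the first sample the "Ltd." pattern wins and the two looser matches are skipped; in the third sample only the loosest pattern matches, which is fine because the statement is the last thing in the text.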

Give it a try and if you have questions over how it works, feel free to ask. Meanwhile somebody else may have suggestions for improvements, or alternatives. One issue here that I can see is that this won’t find multiple copyright messages that match the SAME pattern within one piece of text, as Regex Split returns only the first match found. Possibly the Regex Extractor node from Palladian could be used to improve on this, but that might be a job for another day should this prove to be an issue for you.

The workflow also contains a “simplified” version that doesn’t use the loop…


… but instead concatenates the regex patterns back into a | delimited pattern, to use with Regex Split in a single hit. This works in much the same way as the loop version, and removes some of the complexity, but doesn’t catch multiple © substrings within the same piece of text. That said, the more complex “loop” version doesn’t catch multiple © substrings that match the same pattern either, as mentioned earlier.
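A minimal Python sketch of this "single hit" variant, again with hypothetical patterns and a made-up function name: the patterns are joined with | in priority order, and only the first match is removed, which reproduces the limitation mentioned above.

```python
import re

# Hypothetical patterns, strict to loose, joined into one alternation.
patterns = [r"©[^.]*Ltd\.", r"©[^.]*Ltd", r"©[^.]*\."]
combined = "|".join(f"(?:{p})" for p in patterns)

def remove_first_match(text):
    # Only the first match is handled, mirroring a split that returns one hit.
    match = re.search(combined, text)
    if match is None:
        return text
    return text[:match.start()] + text[match.end():]

text = "© 2015 Taylor & Francis.There is much debate. © 2014 Elsevier Ltd."
result = remove_first_match(text)
print(repr(result))
```

The leading statement is removed, but the second © statement at the end of the text is left in place.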

Remove copyright text - loop.knwf (54.3 KB)


Hi @takbb ,

Wow! That is some real KNIME kung fu there.
You have just created a data cleaning playground.

You are right, this problem needs to be solved from a different angle. And having those regex patterns placed in the right order, from fine to broad, is a great way to clean out the data.

I also tried the Palladian Regex Extractor node, and it works in a similar way to how you constructed the nodes, with the output for each regex pattern in a separate column. However, it also doesn’t remove multiple copyright matches in a single line. So I used your method and created a recursive loop to clean out the remaining copyright statements with the same pattern.
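The recursive-loop idea can be approximated in Python as "remove the first match, then try again until nothing matches". This is only a sketch of the logic, not the workflow's actual nodes; the pattern, the function name remove_all_matches and the max_rounds safety limit are assumptions for illustration.

```python
import re

def remove_all_matches(text, pattern, max_rounds=10):
    # Remove one match per round, as a first-match-only step would,
    # and repeat until the pattern no longer matches or the limit is hit.
    for _ in range(max_rounds):
        match = re.search(pattern, text)
        if match is None:
            break
        text = text[:match.start()] + text[match.end():]
    return text

text = "© 2015 Taylor & Francis.There is much debate. © 2014 Elsevier Ltd."
result = remove_all_matches(text, r"©[^.]*\.")
print(repr(result))
```

In plain Python, re.sub would already replace every occurrence in one call; the explicit loop is shown only because it mirrors a step that returns just the first match.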

You can download my solution here:
Remove copyright text - loop.knwf (89.7 KB)

I think we have largely solved this issue, and I hope others will be able to enjoy it as much as I do.

Thank you @takbb so much for your time, expertise and help! It has been an honour!

Warm regards,
Maurice


Hi @maurice, thank you for your kind words, and thank you also for sharing your workflow. It’s very satisfying when a workflow comes together and even better when it improves through collaboration…

