Creating the reference category column with the words in the texts in the data list

DiaAzul · January 15, 2023, 12:44pm

Thanks.

The Text Processing nodes are geared more towards analysing publication abstracts and identifying topics. Given that several of the nodes target chemical compounds and bioinformatics text, they seem designed mainly for scientists in the pharmaceutical industry who want to identify papers relevant to their area of research, though in principle they can be used across any subject. I used them during Covid to search for modelling-related papers when I was forecasting the spread of infectious diseases.

For the example that you have given you could use the String Matching node, but you would have to manually create examples for each word you wanted to correct. This is similar to the way that Microsoft Office does some of its auto-corrections. It’s tedious and limited in scope.
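As a rough illustration of why the manual approach is tedious, here is a minimal Python sketch of a hand-maintained correction table. It is not how the String Matching node works internally; the `CORRECTIONS` table and the `correct_text` function are hypothetical names made up for this example, and every misspelling has to be listed by hand.

```python
import re

# Hypothetical, hand-maintained table: known misspelling -> correction.
CORRECTIONS = {
    "catagory": "category",
    "recieve": "receive",
    "teh": "the",
}

def correct_text(text, corrections=CORRECTIONS):
    """Replace each whole word found in the table; leave all other words alone."""
    def fix(match):
        word = match.group(0)
        replacement = corrections.get(word.lower())
        if replacement is None:
            return word  # not in the table, so we cannot correct it
        # Preserve a leading capital letter from the original word.
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"[A-Za-z]+", fix, text)

print(correct_text("Teh catagory column did not recieve any values."))
```

Any misspelling that is not already in the table passes through unchanged, which is the limitation in scope mentioned above.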

The more up-to-date way to do what you want is to train a machine learning model to analyse the sentence and provide a predictive-text-like capability. You could use the Text Processing nodes to break your text into sentences or word triplets, then apply a model to suggest corrections. To train the model you would take a large corpus of text that you know to be correct, then create a training set from it by corrupting the data, retaining the original text as the training target for the model.
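The corruption step could be sketched in Python roughly as follows. This only builds the (corrupted, original) training pairs; it does not train a model. The function names, the three typo types and the `error_rate` parameter are assumptions chosen for illustration, and real typing errors would need a richer noise model.

```python
import random

def corrupt_word(word, rng):
    """Apply one random typo: swap two neighbours, drop a letter, or double a letter."""
    if len(word) < 3:
        return word  # leave very short words untouched
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["swap", "drop", "double"])
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "drop":
        return word[:i] + word[i + 1:]
    return word[:i] + word[i] + word[i:]  # duplicate the character at position i

def make_training_pairs(sentences, error_rate=0.3, seed=0):
    """Return (corrupted, original) pairs; the original is the training target."""
    rng = random.Random(seed)  # fixed seed keeps the training set reproducible
    pairs = []
    for sentence in sentences:
        noisy = " ".join(
            corrupt_word(w, rng) if rng.random() < error_rate else w
            for w in sentence.split()
        )
        pairs.append((noisy, sentence))
    return pairs

corpus = ["the supplier will provide catering services for the event"]
for noisy, clean in make_training_pairs(corpus):
    print(noisy, "->", clean)
```

Because the clean corpus is the target, the model only ever learns to undo the kinds of errors that the corruption step generates, so the noise should resemble the mistakes seen in the real data.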

This then provides a few ideas for KNIME-It! challenges ( @alinebessa ).

The first is the original requirement of this post: take a database of articles, analyse them, and append codes matching the content of the articles. In this case the articles are abstracts from supplier notices of upcoming business opportunities, but they could easily be job adverts or other commonly posted notices. For job adverts, codes could be added to identify which industry segment the job relates to and what type of role is required. For supplier notices, it would be the industry segment and possibly the products required.
The second is the text correction challenge. Manually entered data often needs cleaning up. It would be nice to have a set of tools or a workflow for correcting common typing mistakes and other errors.
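For the article-coding idea, a very simple starting point is keyword lookup: each code has a set of trigger words, and a notice receives every code whose keywords appear in its text. The sketch below is in Python, and the code names and keyword sets in `CODE_KEYWORDS` are invented for illustration; a real code list would come from the reference data.

```python
import re

# Hypothetical code list: each code maps to the keywords that signal it.
CODE_KEYWORDS = {
    "CONSTRUCTION": {"building", "roofing", "scaffolding"},
    "IT_SERVICES": {"software", "hosting", "network"},
    "CATERING": {"catering", "meals", "canteen"},
}

def assign_codes(text, code_keywords=CODE_KEYWORDS):
    """Return the sorted list of codes whose keywords occur in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(code for code, kws in code_keywords.items() if words & kws)

notice = "Supply of software and network hosting for a new building"
print(assign_codes(notice))
```

Exact keyword matching misses misspellings and synonyms, which is where the second challenge (cleaning the text first) would feed into the first.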

I am sure there are other text processing problems that could be identified across the community that would help many people with their own projects.

DiaAzul
LinkedIn | Medium | GitHub

iCFO · January 15, 2023, 1:18pm

Thanks @DiaAzul,

Accounting / Management Software manual entry memos that provide item detail are typically a crazy jumble of shorthand & lazy entry. I am not sure if they can be penetrated easily. Example of how someone might describe this post:

D.Azal-GB ref txt list v catagory builder

Right now I look for repeated shorthand entry patterns by eye, fuzzy match on key words to try to assess the challenges, then design Regex patterns that will look for matches. Perhaps I could try to build some logic approach settings into model training as a longer term project.
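A rough Python sketch of that combination, using only the standard library (`re` and `difflib`), might look like this. The `SHORTHAND` table, the `KEY_WORDS` list, the `AUTHOR_TAG` pattern and the 0.8 cutoff are all assumptions based on the single example memo above, not a tested rule set.

```python
import difflib
import re

# Hypothetical vocabulary and shorthand table built from memos seen so far.
KEY_WORDS = ["reference", "text", "list", "category", "builder"]
SHORTHAND = {"ref": "reference", "txt": "text"}

# Assumed pattern for a leading "Initial.Name-Region" tag such as "D.Azal-GB".
AUTHOR_TAG = re.compile(r"^(?P<initial>[A-Z])\.(?P<name>[A-Za-z]+)-(?P<region>[A-Z]{2})\b")

def normalise_token(token, cutoff=0.8):
    """Expand known shorthand, else fuzzy match against the key word list."""
    token = token.lower()
    if token in SHORTHAND:
        return SHORTHAND[token]
    close = difflib.get_close_matches(token, KEY_WORDS, n=1, cutoff=cutoff)
    return close[0] if close else token  # unknown tokens are left unchanged

def normalise_memo(memo):
    """Strip the author tag if present, then normalise each remaining word."""
    body = AUTHOR_TAG.sub("", memo)
    return [normalise_token(t) for t in re.findall(r"[A-Za-z]+", body)]

print(normalise_memo("D.Azal-GB ref txt list v catagory builder"))
```

The cutoff controls how aggressive the fuzzy match is: lowering it catches more misspellings but also starts mapping unrelated shorthand onto key words, so it would need checking against real memos.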

iCFO · January 15, 2023, 2:42pm

@DiaAzul

I don’t want to sidetrack the thread too much, but I would also be happy to share a few sales tactics and communication strategies that I have developed over the years and that have proven to be a strong presentation for my company. They may help you target a few leads and sell your services as an outside contractor or consultant, if you are interested. Let me know, and I will message you on LinkedIn.
