Creating a reference category column from the words in the texts in the data list

cpv_code list.xlsx (261.6 KB)

The reference code list to be used in category creation is attached as an Excel file.

I haven’t created a workflow yet, because I’m still undecided about which approach to take.

I see the issue now! Is there another database table that has parent and child relationships on those categories? You could use parent categories to split apart and then only process the categories that fall under that parent list. If there isn’t a parent child relationship in those categories, then my next question is what kind of sadist designs a category list like that!

1 Like

:grinning:
https://simap.ted.europa.eu/cpv

What is the CPV?

The CPV consists of a main vocabulary for defining the subject of a contract, and a supplementary vocabulary for adding further qualitative information. The main vocabulary is based on a tree structure comprising codes of up to 9 digits (an 8 digit code plus a check digit) associated with a wording that describes the type of supplies, works or services forming the subject of the contract.

  • The first two digits identify the divisions (XX000000-Y);
  • The first three digits identify the groups (XXX00000-Y);
  • The first four digits identify the classes (XXXX0000-Y);
  • The first five digits identify the categories (XXXXX000-Y);

Each of the last three digits gives a greater degree of precision within each category.

A ninth digit serves to verify the previous digits.

The supplementary vocabulary may be used to expand the description of the subject of a contract. The items are made up of an alphanumeric code with a corresponding wording allowing further details to be added regarding the specific nature or destination of the goods to be purchased.

The alphanumeric code is made up of:

  • a first level comprising a letter corresponding to a section;
  • a second level comprising four digits, the first three of which denote a subdivision and the last one being for verification purposes

In that case, I would definitely use a Regex to split those codes into multiple columns based on that logic in order to have a manageable parent child relationship table.
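
For illustration, here is a minimal Python sketch of that split, assuming the codes follow the XXXXXXXX-Y pattern described above (the example codes are made up):

```python
import re

# Example CPV-style codes in the 8-digits-plus-check-digit format (values are illustrative).
codes = ["03110000-5", "45230000-8"]

CPV_RE = re.compile(r"^(\d{2})(\d)(\d)(\d)(\d{3})-(\d)$")

for code in codes:
    m = CPV_RE.match(code)
    if not m:
        continue  # not a well-formed code
    d2, d3, d4, d5, _rest, _check = m.groups()
    division = d2 + "000000"              # first two digits   -> division (XX000000)
    group    = d2 + d3 + "00000"          # first three digits -> group    (XXX00000)
    klass    = d2 + d3 + d4 + "0000"      # first four digits  -> class    (XXXX0000)
    category = d2 + d3 + d4 + d5 + "000"  # first five digits  -> category (XXXXX000)
    print(code, division, group, klass, category)
```

In KNIME, the same logic could live in a Regex Split or String Manipulation node, producing one column per hierarchy level.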

Then I would split streams and process only the group-associated texts in the dictionary rule node. This might actually be one of those situations where a loop provides an efficient, dynamic approach.

1 Like

The closest workflow among the several alternatives I have tried is attached. But I couldn’t get it fully working, and I’m not sure how to come up with a better solution.

Adapting the attached workflow might work, but it doesn’t handle partial matches; in other words, it only looks for an exact word match. For the reference category I need, I think a single exactly matched word from the CPV code description will be enough to classify it.

If I loosen the matching further, it triggers another problem: one-word CPV code descriptions don’t follow this rule.

But first of all, the matching has to work; that is, I have to get the codes used for classification working.

*** Why am I doing this? Or why do I need this?
The European Union uses the CPV code classification as a standard for procurement, but some countries do not; the United States uses NAICS codes, which are different. I now need a solution that recreates the CPV code and the NAICS code in two columns, assigned by closest match against the reference code list, so that both codes exist for every record in the data. This is necessary because in some countries, or in data from some sources, there is no such coding system at all.

I hope I was able to explain.

Thanks in advance for all the support and help.
KNIME_project_category.knwf (258.5 KB)

Uploading: KNIME_category_alternative 2.knwf…
An alternative, adaptive second workflow; I’m using it in a different workflow. The attached workflow provides an arrangement related to this issue, but the large CPV code list will strain it. A better, more optimal solution needs to be produced here.

It’s not an unsolvable problem at all; I’m just looking for a better way to solve it.

@umutcankurt

You can use the Text Processing toolkit to identify keywords in your text and then assign the codes associated with those keywords to the document.

The following workflow (available on KNIME hub here) gives an idea of how the problem could be approached.

The top yellow box takes the text and converts it to a KNIME text processing document. This is a specific data type that allows the text to be tagged with additional information. The document is pre-processed to convert everything to lower case and remove diacritics. I would suggest that in a production environment you would want to further process the document to remove plurals and standardise keywords, possibly using lemmatization. Whatever you choose to do in pre-processing you will also need to do to the keywords in your dictionary, so that they match.
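
Outside KNIME, the lower-casing and diacritic removal could be sketched roughly like this (lemmatization is deliberately left out, as it needs an extra library; this is only an illustration of the idea, not the workflow itself):

```python
import unicodedata

def normalise(text: str) -> str:
    """Lower-case the text and strip diacritics so documents and keywords match."""
    text = unicodedata.normalize("NFKD", text.lower())
    # Drop the combining marks that NFKD splits off from accented characters.
    return "".join(ch for ch in text if not unicodedata.combining(ch))

print(normalise("Électricité and Façades"))  # -> "electricite and facades"
```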

The bottom yellow box splits the keyword lists (e.g. cat, dog, mouse) into individual items and then creates a list of keywords by ungrouping the set created by the cell splitter. The keywords are then converted to lower case prior to matching.
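
The same splitting step, sketched in plain Python (the codes and keyword lists below are made up):

```python
# Each dictionary row holds a code and a comma-separated keyword list.
dictionary_rows = [
    {"code": "11111111-1", "keywords": "Cat, Dog, Mouse"},
    {"code": "22222222-2", "keywords": "Tractor"},
]

# Cell-splitter-plus-ungroup equivalent: one lower-cased keyword per entry.
keyword_to_code = {}
for row in dictionary_rows:
    for keyword in row["keywords"].split(","):
        keyword_to_code[keyword.strip().lower()] = row["code"]

print(keyword_to_code)
# {'cat': '11111111-1', 'dog': '11111111-1', 'mouse': '11111111-1', 'tractor': '22222222-2'}
```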

The core of the workflow is in the green box. The text processing toolbox takes the dictionary of keywords and uses it to tag items in the documents where they occur. I have used the NE(LOCATION) tag, but any will do, as long as it can be recognised by the tag filter and doesn’t conflict with any tags already added to the document. The tag filter then removes anything from the document that hasn’t been tagged; the bag of words creator then extracts tagged words from the documents and creates a row with the originating document and word, for each word in the document. The terms are then converted to strings for further processing.

The final task is to match the extracted strings to the keywords in the dictionary to append the code, and then group the documents and codes, such that there is a table with one document per row and the relevant codes listed in a separate cell.
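
As a rough, self-contained script-level equivalent of the tag/extract/match/group steps (standing in for the KNIME nodes, with made-up data and only a lower-case pre-processing step):

```python
from collections import defaultdict

# Keyword -> code mapping, as produced by the dictionary-splitting step above (values made up).
keyword_to_code = {"tractor": "22222222-2", "cat": "11111111-1", "dog": "11111111-1"}

documents = {
    "doc1": "Supply of one tractor and spare parts",
    "doc2": "Veterinary services for a cat and a dog",
}

codes_per_document = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():        # stands in for the full pre-processing
        if word in keyword_to_code:           # the word would have been tagged by the dictionary
            codes_per_document[doc_id].add(keyword_to_code[word])

# One row per document, with all matched codes collected in a single cell.
for doc_id, codes in sorted(codes_per_document.items()):
    print(doc_id, sorted(codes))
```

Note that a plural such as "tractors" would not match the keyword "tractor" here, which is exactly the issue lemmatization (or inexact matching) is meant to address.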

Note: There are challenges remaining with this workflow. I set the dictionary tagger to ignore case and not use exact matching. This means some keywords identified in the text do not have a corresponding entry in the dictionary (usually the result of identifying a plural in the text which has no corresponding dictionary entry). There are possibly other refinements that could be made to improve the matching process, but the approach will get you most of the way to what you are trying to achieve.

DiaAzul
LinkedIn | Medium | GitHub

8 Likes

Hi; @DiaAzul
Thank you very much for this detailed and informative sharing. I will review the workflow and try to organize it.

It’s like a complete user manual. you are great :wink: :boom:

1 Like

I wrote my previous post just after reading your reply. Now I’ve looked at the workflow, and it’s awesome; it's just the solution I was looking for.

Thanks again, you saved me a lot of confusion and questions.
:+1: :knime: :medal_military: :medal_military: :medal_military:

3 Likes

@umutcankurt

Thanks for the feedback. I learn as much as you do by solving problems; I usually end up coming to the forum when I have run out of crosswords to do. The other reason I come here is that I can’t get a job doing data analytics, so I work at the local supermarket stacking shelves. It’s crazy, but at least the forum benefits.

I’ve had another look at the workflow and made a few changes. The workflow is here.

  • I’ve added a node to remove punctuation; the punctuation was causing "cars" and "tractors" to merge into one word.
  • I added the Dictionary Replacer node to replace the keywords with the codes. In typical KNIME fashion, the Dictionary Tagger allows an inexact match, but the Dictionary Replacer only has an exact-match option. If the replacer allowed a closest-match option in the same way the tagger does, we would be done. It’s one of those things with KNIME: a lovely JavaScript front end, but a lack of attention to detail. A scripted closest-match fallback is sketched below.
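
For what it's worth, such a closest-match fallback can be scripted (for example in a Python Script node). This is only a sketch of the idea using difflib, not a drop-in replacement for the Dictionary Replacer, and the dictionary values are made up:

```python
import difflib

# Keyword -> code dictionary (illustrative values only).
dictionary = {"tractor": "22222222-2", "excavator": "33333333-3"}

def closest_code(term: str, cutoff: float = 0.8):
    """Return the code of the dictionary keyword closest to `term`, or None."""
    match = difflib.get_close_matches(term.lower(), dictionary.keys(), n=1, cutoff=cutoff)
    return dictionary[match[0]] if match else None

print(closest_code("tractors"))  # the plural still resolves to the 'tractor' code
print(closest_code("digger"))    # nothing close enough -> None
```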

DiaAzul
LinkedIn | Medium | GitHub

5 Likes

I think people at companies who see your contributions, developments and solutions here will surely make you a job offer. If you keep producing useful and truly productive work, someone will definitely notice you.

Please share your LinkedIn profile link here; someone will definitely help you get to where you need to be.

Thanks again for your contribution and solution support.

3 Likes

My apologies, you already have a LinkedIn link in your post. I just noticed it and sent you an invite.
:slightly_smiling_face:

2 Likes

Hi; @DiaAzul
I’ve been working with test data and I’m sending the workflow by email. I couldn’t upload it here because the file is too big. (I worked on the first workflow you posted.) Why do you think it returns a blank code for some of the matches? I would be glad if you could look at it when you have time.

@umutcankurt

I got your workflow with data, thanks, and it works as intended. I’ve updated the workflow and reposted it with the original data/keywords on KNIME hub here. The changes made to the workflow are:

  • Improvements in the pre-processing of the text documents and keywords. I’ve added filters for punctuation, short words and a few other elements that mess up matching. I’ve also added the Stanford Lemmatizer to both text and keywords, which removes plurals (among a few other minor modifications that help matching). This improves the matching accuracy where keywords exist.
  • I’ve also added a small routine to extract the documents where a keyword was not found and then list the extracted keywords ranked by the number of documents they appear in. This may help identify keywords that you could use to classify documents (though high-frequency keywords are probably not useful for discriminating between documents).
  • I’ve also added two components (it’s the same component internally) to analyse the contents of the documents and identify common themes using Latent Dirichlet Allocation.

The reason you are getting blank codes for some of the matches is that those documents do not contain any keywords. You may get an improvement with the revised workflow, but it looks like 10-15% of documents cannot be matched. To try to get some insight into how the documents could be classified, I split the documents into those which could be coded and those which could not.

I then extracted the words from the documents; calculated how many times those words appeared across documents; removed from the list words that only appeared once or twice; and then, for each document, filtered in the eight words that appeared least frequently across all the documents. I then used these eight words to create a summary document, on which I performed topic analysis (Latent Dirichlet Allocation) to identify document themes and the most frequent keywords that characterise those themes. I then plotted this on a scatter plot to see whether there was any information that could be gleaned which might provide insight into how you could code the uncoded documents.
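
For anyone wanting to reproduce the idea outside KNIME, a compressed sketch of the rare-word filtering and topic-modelling step might look like the following. Scikit-learn's LDA stands in for the KNIME topic extractor; the corpus, thresholds and topic count are made up, and only the "eight least frequent words" rule comes from the description above:

```python
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "road works asphalt paving city",
    "road works bridge repair city",
    "road works tunnel lighting city",
    "catering services school meals city",
    "catering services hospital meals city",
    "catering services office meals city",
]

# Count in how many documents each word appears; drop words seen only once or twice.
doc_freq = Counter(word for doc in documents for word in set(doc.split()))
keep = {word for word, n in doc_freq.items() if n > 2}

# For each document, keep its eight least frequent (most distinctive) remaining words.
summaries = []
for doc in documents:
    words = sorted((w for w in set(doc.split()) if w in keep), key=lambda w: doc_freq[w])
    summaries.append(" ".join(words[:8]))

# Topic analysis (Latent Dirichlet Allocation) on the summary documents.
dtm = CountVectorizer().fit_transform(summaries)
topic_weights = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(dtm)
print(topic_weights.round(2))
```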

The output of the analysis is below (shared because others don’t have access to the data, but provides some insight into the results).

The analysis of documents that could be coded (apologies that the legend partially obscures the chart) shows the keywords that describe each theme. I don’t have much insight into the source of the data, but the topic analysis is mostly picking up locations and some keywords that don’t appear in the keyword dictionary. This indicates that further analysis would need to consider whether removing addresses/regions is appropriate. However, it may also be useful information if it relates to a particular supplier/customer and could be used to assign a code based upon non-keyword information. What is striking is that the data is well clustered, indicating that the identified topics are doing a good job of discriminating the data.

Now, for the documents that could not be coded: these show much less structure, and the topics that make up the themes are different from those of the coded data. A possible reason for the coded data forming tighter clusters is that the coded keywords also drive the tighter grouping (i.e. the same keywords from your list keep appearing in the coded documents), whereas for the uncoded data the pattern of keywords is not as strong.

At this point you have two potential choices:

  • The manual approach is to identify additional keywords that can be coded and added to your list. This is a pragmatic, though tedious, process and will only improve results if the keywords are used consistently by the originators of the documents.
  • The second, indirect approach is to consider the heuristics of the documents. In this case, rather than looking for specific keywords, consider additional information such as the addresses, telephone numbers, names, locations and trades described in the document, and build a probabilistic classifier using machine learning/AI techniques. You could train this on your coded data and then classify the uncoded data, though in reality you would want a much larger data set to improve the quality of the classification. A minimal sketch of this approach follows below.
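
Here is a minimal sketch of that second option using scikit-learn, as a stand-in for doing the same with KNIME learner nodes (the texts and code labels below are made up):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Documents that already received a code act as training data (illustrative values only).
coded_texts  = ["supply of tractors", "road construction works",
                "supply of tractor spare parts", "construction of a bridge"]
coded_labels = ["16700000", "45000000", "16700000", "45000000"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(coded_texts, coded_labels)

# Predict the most likely code (with class probabilities) for uncoded documents.
uncoded_texts = ["repair works on a road bridge"]
print(clf.predict(uncoded_texts))
print(clf.predict_proba(uncoded_texts).round(2))
```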

Hope that helps (sorry it’s long winded, but there is so much that can be done arising from your initial question).

DiaAzul
LinkedIn | Medium | GitHub

1 Like

Hi David;
Thank you again for your time and interest in this matter. It’s really great, and you deserve special appreciation for extending the workflow with more comprehensive and varied details.

I am sure that this work, with its scope and detail, will help people looking for solutions to many problems to produce alternatives or get results from the current workflow.

It is now complete, and even more comprehensive than before.

:medal_military: :tada: :tada: :trophy:

2 Likes

Very impressive as always @DiaAzul!

I will definitely be checking out this workflow in detail. You have inspired me to delve into the text processing nodes! I had not considered their application to categorization based on keywords. Would you recommend this approach when more of a fuzzy match is required on multiple words / word combos? Most of my challenges along these lines target manually entered string fields that are loaded with spelling / typing / voice-to-text conversion errors.

2 Likes

@iCFO

Thanks.

The Text Processing nodes are geared more towards analysing publication abstracts and identifying topics. Given that several of the nodes target chemical compounds and bioinformatic text, the nodes are more designed for scientists in the pharmaceutical industry to identify papers of interest to their area of research; though, in principle they can be used across any subject. I used them during Covid to search for modelling related papers when I was forecasting the spread of infectious diseases.

For the example that you have given you could use the String Matching node, but you would have to manually create examples for each word you wanted to correct. This is similar to the way that Microsoft Office does some of its auto-corrections. It’s tedious and limited in scope.

The more up-to-date way to do what you want is to train a machine learning model to analyse the sentence and provide a predictive-text-like capability. You could use the Text Processing nodes to break your text into sentences or word triplets, then apply a model to suggest corrections. To train the model you would take a large corpus of text which you know to be correct, then create a training set from it by corrupting the data, retaining the original text as the training target for the model.
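
A toy illustration of building such a training set by corrupting clean text (the corruption operations and the rate here are arbitrary choices, not a recommendation):

```python
import random
import string

def corrupt(sentence: str, rate: float = 0.15) -> str:
    """Randomly drop, duplicate or replace characters to simulate typing errors."""
    out = []
    for ch in sentence:
        r = random.random()
        if r < rate / 3:
            continue                                           # drop the character
        elif r < 2 * rate / 3:
            out.append(ch + ch)                                # duplicate the character
        elif r < rate:
            out.append(random.choice(string.ascii_lowercase))  # replace with a random letter
        else:
            out.append(ch)
    return "".join(out)

clean_corpus = ["the quick brown fox jumps over the lazy dog"]
training_pairs = [(corrupt(s), s) for s in clean_corpus]  # (corrupted input, clean target)
print(training_pairs)
```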

This then provides a few ideas for KNIME-It! challenges ( @alinebessa ).

  1. The original requirement of this post: to take a database of articles, analyse them and append codes matching the content of the articles. In this case the articles are abstracts from supplier notices of upcoming business opportunities, but they could easily be job adverts or other commonly posted notices. For job adverts, codes could be added to identify which industry segment the job relates to and what type of role is required. For supplier notices it would be the industry segment and possibly the products required.

  2. The second challenge is the correcting text challenge. It is often the case that manually entered data needs cleaning up. It would be nice to have a set of tools / workflow for correcting common typing mistakes and other errors.

I am sure there are other text processing problems that could be identified across the community that would help many people with their own projects.

DiaAzul
LinkedIn | Medium | GitHub

3 Likes

Thanks @DiaAzul,

Accounting / Management Software manual entry memos that provide item detail are typically a crazy jumble of shorthand & lazy entry. I am not sure if they can be penetrated easily. Example of how someone might describe this post:

D.Azal-GB ref txt list v catagory builder

Right now I visually look for repeated shorthand entry patterns, fuzzy match on keywords to try to assess the challenges, then design regex patterns that will look for matches. Perhaps I could build some of this logic into model-training settings as a longer-term project.
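
For what it's worth, here is a small sketch of that kind of fuzzy keyword probe against a messy entry, reusing the example above (the keywords and the similarity threshold are invented for illustration):

```python
import difflib
import re

entry = "D.Azal-GB ref txt list v catagory builder"
keywords = ["category", "reference", "text", "builder", "list"]

# Split on anything that is not a letter or digit, then look for near matches.
tokens = [t for t in re.split(r"[^a-z0-9]+", entry.lower()) if t]
for token in tokens:
    hit = difflib.get_close_matches(token, keywords, n=1, cutoff=0.7)
    if hit:
        print(f"{token!r} -> {hit[0]!r}")
```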

2 Likes

@DiaAzul

I don’t want to sidetrack the thread too much, but I would also be happy to share a few sales tactics and communication strategies that I have developed over the years and that have proven to be a strong pitch for my company. They may help you target a few leads and sell your services as an outside contractor or consultant, if you are interested. Let me know, and I will message you on LinkedIn.

2 Likes
