NLP questions

Hello, dear community!

Junior KNIME user here, but with big questions (I think).
I work in analytics, and my main goal is to extract entities and relations from unstructured Romanian text and then link them together.

For example, I have this text in English:
Maria is driving a red Mercedes and Josh is driving a black Saab.
My goal is to create nodes that:

  1. Extract the entities: Maria, Josh, Mercedes and Saab.
  2. Extract the relations between the entities: is driving.
  3. Create an Excel file with this layout: Column1 = the first entity (Maria), Column2 = the relation (is driving), Column3 = the brand of the first car (Mercedes), Column4 = Josh, Column5 = is driving, Column6 = Saab.
    I want this format because, after exporting the resulting Excel file, I will use Analyst’s Notebook to link the entities (with a line, like in Neo4j).
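To make the target layout concrete, here is a minimal Python sketch of how a list of (entity, relation, entity) triples could be flattened into the single wide row described above. The function names `triples_to_row` and `write_wide_csv` are hypothetical, and the sketch writes a CSV file (which Excel can open) rather than a native .xlsx; inside KNIME, an Excel Writer node would normally handle the export.

```python
import csv


def triples_to_row(triples):
    """Flatten [(subject, relation, object), ...] into one wide row."""
    row = []
    for subject, relation, obj in triples:
        row.extend([subject, relation, obj])
    return row


def write_wide_csv(path, triples):
    """Write a header (Column1..ColumnN) and one wide row to a CSV file."""
    row = triples_to_row(triples)
    header = [f"Column{i}" for i in range(1, len(row) + 1)]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerow(row)
    return header, row


if __name__ == "__main__":
    example = [("Maria", "is driving", "Mercedes"), ("Josh", "is driving", "Saab")]
    print(triples_to_row(example))
```

A long layout with one triple per row (three columns) may be easier to import into link-analysis tools, but that is a design choice to check against what Analyst’s Notebook expects.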

By using the SpaCy nodes like ModelSelector, Tokenizer, NER, Bag of words and some Regex nodes, I am able to extract the entities from Romanian text.

But I am far from the desired result.

So, I have some questions for this wonderful community:

Regarding the Spacy Model Selector node, I saw that it’s possible to load a local model. Do you know any websites where I can download another large model, preferably for Romanian?

Regarding the StanfordNLP Relation Extractor, I saw that it accepts only English. Can you recommend another relation extractor that also accepts Romanian? Or can I maybe train another node to do it?

Regarding offline LLMs, do you have any recommendations for the model best suited to NLP, NER and relation extraction that I can use inside KNIME?

Also, can you share another approach to the main problem of extracting entities and the relations between them?

Thanks!


Hi @vlad28, welcome to the KNIME community! :slight_smile:

Here you go:

  • Reading local models with the Spacy Model Selector node:
  1. Download the model you prefer to your local machine (“Assets” > file with the extension “.tar.gz”). You can find the model files in the spaCy GitHub repository: Romanian · Releases · explosion/spacy-models · GitHub (I’ve already filtered by “Romanian”). Model versions up to 3.5.0 should load smoothly (but try higher versions as well).

  2. Make sure you have 7-Zip installed on your PC to unpack the files:
    a) Download and install 7-Zip if you don’t have it already.
    b) Right-click the .tar.gz file you want to extract and select “7-Zip” > “Extract Here”.
    c) The contents of the archive are extracted to the same directory as the archive. The uncompressed .tar file should now be visible in the directory listing.
    d) Locate the .tar file, right-click it and select “7-Zip” > “Extract Here”.
    e) The contents of the .tar file are extracted to the same directory as the archive.

  3. Point the Spacy Model Selector node to the extracted model folder.
    In my case, for example: C:\Users\roberto.cadili\Downloads\New folder\dist\en_core_web_sm-3.5.0\en_core_web_sm\en_core_web_sm-3.5.0
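If you prefer not to install 7-Zip, the same two-step unpacking can be done with Python’s standard `tarfile` module. This is only a sketch: the function name `extract_model` is hypothetical, and it assumes the folder to point the node at is the one that contains the model’s `config.cfg`, which matches the nested path shown above.

```python
import tarfile
from pathlib import Path


def extract_model(archive_path, target_dir):
    """Unpack a spaCy model .tar.gz and return the folder holding config.cfg.

    Only use this on archives from a source you trust.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # "r:gz" handles the gzip and tar layers in a single step.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target)
    # Assumption: the loadable model folder is the one containing config.cfg.
    candidates = sorted(p.parent for p in target.rglob("config.cfg"))
    return candidates[0] if candidates else None
```

Calling `extract_model` with the downloaded archive and an output folder returns the path to paste into the Spacy Model Selector node, or `None` if no `config.cfg` was found.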

  • Unfortunately, as you correctly pointed out, the StanfordNLP Relation Extractor and the (even more powerful) StanfordNLP Open Information Extractor only support English.
    In general, extracting (meaningful) relations from text is quite a hard task, and training your own extractor can be challenging. Unfortunately, I’m not aware of an option for Romanian. My suggestion is to look into the research literature on this topic for Romanian specifically, as researchers may have developed resources that can help. Or why not try LLMs?

  • Using local LLMs for relation extraction in Romanian:
    a) That’s quite simple: download your model of choice from GPT4All. To start off, I’d suggest Falcon, as it’s relatively light compared to others, though it may perform less well.
    b) Next, point the GPT4All LLM Connector node to the model file, in my case: C:\Users\roberto.cadili\AppData\Local\nomic.ai\GPT4All\gpt4all-falcon-q4_0.gguf
    c) Write your prompt, for example in a Table Creator node, and pass it to the LLM Prompter node. You will have to experiment a bit to find the prompt and/or model that suits you best.
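As a starting point for the prompt experiments, here is a hedged Python sketch of one possible prompt and a parser for the answer. The "subject | relation | object" output format and the function names `build_prompt` and `parse_triples` are assumptions made for illustration, not something the KNIME nodes or GPT4All prescribe; local models may well ignore the format, which is why the parser skips lines that do not fit.

```python
def build_prompt(text):
    """Build a relation-extraction prompt that asks for a fixed line format."""
    return (
        "Extract all relations from the text below.\n"
        "Answer with one relation per line in the form: "
        "subject | relation | object\n"
        "Do not add any other text.\n\n"
        f"Text: {text}"
    )


def parse_triples(response):
    """Keep only the lines that look like 'subject | relation | object'."""
    triples = []
    for line in response.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append(tuple(parts))
    return triples


if __name__ == "__main__":
    print(build_prompt("John lives in Paris and works for Google"))
```

In KNIME, the prompt string would go into the Table Creator node, and the parsing step could be done downstream, for example with a Python Script node or string manipulation nodes.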

I did a couple of quick tests for Italian, and the Falcon model was not great (or maybe my prompts were not good enough, which is entirely possible! :slight_smile: ). I tried the same prompt with ChatGPT and it worked very well (the test sentence to extract relations from was: “John lives in Paris and works for Google”).

Hope it helps!

Happy KNIMEing,
Roberto


Thank you very much, dear Roberto.
Very comprehensive answer.

Regarding the spaCy model, I downloaded the latest one, 3.7.0, and followed your advice on extracting it. My path looks like this: “C:\knime_workspace\ro_corpus\ro_core_news_lg-3.7.0\ro_core_news_lg\ro_core_news_lg-3.7.0”

But when I press play, I get the error:

Execute failed: Executing the Python script failed: Traceback (most recent call last):
  File "", line 2, in
  File "C:\Program Files\KNIME\bundling\envs\se_redfield_textprocessing\lib\site-packages\spacy\__init__.py", line 54, in load
    return util.load_model(
  File "C:\Program Files\KNIME\bundling\envs\se_redfield_textprocessing\lib\site-packages\spacy\util.py", line 444, in load_model
    return load_model_from_path(Path(name), **kwargs) # type: ignore[arg-type]
  File "C:\Program Files\KNIME\bundling\envs\se_redfield_textprocessing\lib\site-packages\spacy\util.py", line 516, in load_model_from_path
    nlp = load_model_from_config(
  File "C:\Program Files\KNIME\bundling\envs\se_redfield_textprocessing\lib\site-packages\spacy\util.py", line 564, in load_model_from_config
    nlp = lang_cls.from_config(
  File "C:\Program Files\KNIME\bundling\envs\se_redfield_textprocessing\lib\site-packages\spacy\language.py", line 1749, in from_config
    resolved_nlp = registry.resolve(
  File "C:\Program Files\KNIME\bundling\envs\se_redfield_textprocessing\lib\site-packages\confection\__init__.py", line 759, in resolve
    resolved, _ = cls.make(
  File "C:\Program Files\KNIME\bundling\envs\se_redfield_textprocessing\lib\site-packages\confection\__init__.py", line 808, in _make
    filled, _, resolved = cls.fill(
  File "C:\Program Files\KNIME\bundling\envs\se_redfield_textprocessing\lib\site-packages\confection\__init__.py", line 862, in fill
    promise_schema = cls.make_promise_schema(value, resolve=resolve)
  File "C:\Program Files\KNIME\bundling\envs\se_redfield_textprocessing\lib\site-packages\confection\__init__.py", line 1054, in make_promise_schema
    func = cls.get(reg_name, func_name)
  File "C:\Program Files\KNIME\bundling\envs\se_redfield_textprocessing\lib\site-packages\spacy\util.py", line 128, in get
    raise RegistryError(Errors.E892.format(name=registry_name, available=names))
catalogue.RegistryError: [E892] Unknown function registry: 'vectors'.
Available names: architectures, augmenters, batchers, callbacks, cli, datasets, displacy_colors, factories, initializers, languages, layers, lemmatizers, loggers, lookups, losses, misc, models, ops, optimizers, readers, schedules, scorers, tokenizers

Hi @vlad28,

You’re very welcome :slight_smile:.

I guess I might have anticipated your current issue in my previous message:

You should be able to read smoothly model versions up to 3.5.0 (but try also with higher versions).

You’re now trying to use a model in version 3.7. My impression is that model versions above 3.5 are currently not supported by the Spacy Model Selector node.
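One way to check this kind of mismatch before loading is to read the model’s `meta.json`, which spaCy model packages ship next to `config.cfg` and which records the spaCy version range the model was built for. The sketch below is a hedged illustration: the function name `model_requirements` is hypothetical, and it only reports the values so you can compare them by hand with the spaCy version in the Python environment that KNIME uses.

```python
import json
from pathlib import Path


def model_requirements(model_dir):
    """Read meta.json from a spaCy model folder and report its requirements."""
    meta_path = Path(model_dir) / "meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    return {
        "model": f"{meta.get('lang')}_{meta.get('name')}",
        "model_version": meta.get("version"),
        # A version specifier string such as ">=3.7.0,<3.8.0".
        "spacy_version": meta.get("spacy_version"),
    }
```

If the reported `spacy_version` range does not include the spaCy version installed in the environment, an error like the E892 above is a plausible outcome, since the model config can refer to features the older library does not know.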

As the Spacy extension is a community extension, let’s see if the developer can confirm this. @Artem, is my guess right?

Best,
Roberto


Hi @vlad28,

While we wait for feedback from the developer, can you try letting the Spacy Model Selector node read the model from a location such as Downloads or Desktop?

Basically, like I had it in my example path:

C:\Users\roberto.cadili\Downloads\New folder\dist\en_core_web_sm-3.5.0\en_core_web_sm\en_core_web_sm-3.5.0

Currently, you are storing the model in the KNIME workspace and it could be that there are some permission conflicts that the node cannot handle.

Best,
Roberto


Thanks, @roberto_cadili .
Unfortunately, it didn’t work.
But I have some good news. Although it didn’t work for v3.7.0, I managed to get it working for v3.6.0:
Using Anaconda Prompt (I have Anaconda installed), I created a new environment with the following commands:
conda create --name knime python=3.6
conda activate knime
pip install pandas
pip install spacy
pip install typer

After that, I went to KNIME / Preferences / Redfield NLP Nodes,
checked the Conda environment option and, where it says “Name of the Spacy Conda environment”, selected my newly created environment, knime.

That way, the Spacy Model Selector node manages to read the downloaded spaCy model files, version 3.6.0.

Maybe it’s related to the Python version or to the installed spaCy version, but I don’t know. Waiting for the Python guys :slight_smile:

But these steps don’t work for version 3.7.0 of the spaCy model. Waiting for @Artem :slight_smile:


Hello @vlad28,

Thank you for your interest in our extension. As @roberto_cadili mentioned, the bundled Python environment supports versions up to 3.5.0. However, I can see you managed to create your own Python environment, so you can select it in the settings. This should also let you use spaCy 3.7.0 (I hope it is compatible with our node code). Could you please provide the stack trace of the error for the spaCy 3.7.0 model when you use your custom environment?

Regarding NER and relation extraction: you can probably use spaCy for this with NER + POS. You look for the named entities, then check the verb that connects the subject and the object. Then you can upload the result to Neo4j (by the way, we have developed a KNIME extension for it too). However, this solution would probably be less accurate than using LLMs. As Roberto showed, it works quite well with ChatGPT, so you could try to find a model that does the same task for Romanian texts.
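A rough Python sketch of the "verb that connects subject and object" idea is shown below. Note that it relies on dependency labels in addition to POS tags, so it assumes the loaded model includes a parser. The function names and the label sets (`nsubj` for subjects, `obj`/`dobj` for objects) are assumptions; check which labels your Romanian model actually emits before relying on it.

```python
def extract_triples(doc):
    """Collect (subject, verb, object) triples from a parsed spaCy Doc."""
    subject_deps = {"nsubj"}
    object_deps = {"obj", "dobj"}  # UD-style label and English-model label
    triples = []
    for token in doc:
        if token.pos_ != "VERB":
            continue
        children = list(token.children)
        subjects = [c for c in children if c.dep_ in subject_deps]
        objects = [c for c in children if c.dep_ in object_deps]
        # Pair every subject of the verb with every direct object.
        for subj in subjects:
            for obj in objects:
                triples.append((subj.text, token.text, obj.text))
    return triples


def triples_from_text(text, model_path):
    """Load a local spaCy model folder and extract triples from raw text."""
    import spacy  # imported lazily so extract_triples works without spaCy

    nlp = spacy.load(model_path)
    return extract_triples(nlp(text))
```

This only catches simple subject-verb-object clauses; passives, coordinated verbs and multi-word entities would need extra handling, which is part of why an LLM may turn out more accurate.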


Thanks for your answer, @Artem, and for all of your contributions.
Without you guys, I couldn’t have figured this out.

I read the spaCy 3.7.0 documentation, which states that “This release drops support for Python 3.6.” and “spaCy v3.7 adds support for Python 3.12”.

I created a new environment using Anaconda Prompt, but this time with Python 3.11.3 (with 3.12, I got the error “No module named imp” in KNIME).
Installed packages: pip install matplotlib numpy pandas pyarrow spacy
Then I returned to KNIME, Preferences / Redfield NLP Nodes and, where it says “Name of the Spacy Conda environment”, selected my newly created environment.

I went to the Spacy Model Selector node, set the path to the spaCy 3.7.0 model, and it works.
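For anyone repeating these steps, a small check like the following can be run inside the new environment before pointing KNIME at it. It is a sketch based only on what this thread reports: Python 3.6 is not supported by spaCy 3.7, and Python 3.12 failed in KNIME with “No module named imp”. The function name `python_ok_for_setup` and the exact version window are assumptions; only Python 3.11.3 was confirmed to work here.

```python
import sys


def python_ok_for_setup(version_info=sys.version_info):
    """Return True if the Python version falls in the window this thread suggests.

    Assumed window: at least 3.7 (spaCy 3.7 dropped 3.6) and below 3.12
    (3.12 failed in KNIME with "No module named imp").
    """
    major_minor = tuple(version_info[:2])
    return (3, 7) <= major_minor < (3, 12)


if __name__ == "__main__":
    print("Python", sys.version.split()[0], "ok:", python_ok_for_setup())
```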

Thank you for all your efforts :slight_smile:

