EPO Web Server

Hello knimers,

I am looking for a way to run Python code directly within KNIME.
The code uses a client to access the EPO web server.

https://www.epo.org/searching-for-patents/data/web-services/ops.html

I really don't know how I can access the XML files (patent files) via a KNIME node.

Is this possible? Is there a node implemented for a web service like the EPO's?

Many thanks and best regards,

Bastian

Not sure if my answer is relevant to you, but have you considered trying the EPO API to extract the texts of patents? https://data.epo.org/linked-data/documentation/api-reference.html
I think you may have luck with the GET Request node.

I haven't tried EPO yet (I will in a couple of days; this source had slipped my mind), but the USPTO dataset is easily accessible with KNIME.

@DmitryIvanov76

    import pandas as pd
    import epo_ops
    import requests

    from epo_ops.models import Docdb, Epodoc, Original


    def getpublisheddata(patentnummer, client):
        """Save the bibliographic and family data of one patent as XML files."""
        try:
            response = client.published_data(  # retrieve bibliography data
                reference_type='publication',  # publication, application, priority
                input=epo_ops.models.Epodoc(patentnummer),  # original, docdb, epodoc
                endpoint='biblio',  # optional, defaults to biblio in case of published_data
                constituents=[]  # optional, list of constituents
            )
            print(response.text)
            with open('data' + patentnummer + '.xml', 'w', encoding='utf-8') as f:
                f.write(response.text)

            response2 = client.family(
                reference_type='publication',
                input=epo_ops.models.Epodoc(patentnummer),  # original, docdb, epodoc
                endpoint='biblio',
                constituents=[]  # optional, list of constituents
            )
            print(response2.text)
            with open('family' + patentnummer + '.xml', 'w', encoding='utf-8') as g:
                g.write(response2.text)
        except Exception as exc:
            # Report the failure instead of silently skipping this patent number
            print("Request failed for " + patentnummer + ": " + str(exc))


    client = epo_ops.Client(key='XXXX', secret='XXXX')  # instantiate client with your own credentials
    df = pd.read_csv('VW_Patentdatenbank_CSV')
    for column in df[['doc-number (first realization)']]:
        columnSeriesObj = df[column]  # build the list of publication numbers
        for i in columnSeriesObj.values:
            print(i)
            patentnumberlength = len(i)
            patentoffice = i[0:2]
            # If the patent number has an application suffix at its end, write i[2:patentnumberlength-1] here!
            patentnummer = i[2:patentnumberlength]
            print(patentnummer)
            patentnummer = patentnummer.lstrip('0')  # patent number without leading zeros
            print(patentnummer)
            numeric_filter = filter(str.isdigit, patentnummer)  # drop all non-numeric characters
            numeric_string = "".join(numeric_filter)
            print(numeric_string)
            patentnummer = patentoffice + numeric_string
            print("Patentnummer: " + patentnummer)

            getpublisheddata(patentnummer, client)

    print(epo_ops.__version__)
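
The clean-up of the patent number inside the loop can be tried on its own, without credentials or a CSV file. The following is only a sketch of those steps; the function name `normalize_patent_number` and the `drop_last_char` switch are made up for illustration and are not part of epo_ops.

```python
def normalize_patent_number(raw, drop_last_char=False):
    """Return the office code plus the digits of the number (hypothetical helper)."""
    patentoffice = raw[0:2]  # first two characters, e.g. "EP" or "US"
    # Optionally cut the last character, as the comment in the loop suggests
    rest = raw[2:-1] if drop_last_char else raw[2:]
    rest = rest.lstrip('0')  # remove leading zeros
    digits = "".join(filter(str.isdigit, rest))  # keep digits only
    return patentoffice + digits


print(normalize_patent_number("EP0001234"))
print(normalize_patent_number("US7654321B", drop_last_char=True))
```

Note that without trimming, a suffix such as A1 would leave its digit in the result, which seems to be what the comment in the loop warns about.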

How can I import the modules used here (epo_ops, pandas, and requests) into the KNIME environment? I can run the code externally via PyCharm, but how can I run it within KNIME, given these modules? Which node should be used?
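
One possible starting point, as far as I understand it: KNIME's Python scripting nodes run the code with the Python environment configured in the KNIME preferences, so the packages have to be installed into that environment. The sketch below only checks which of the modules are missing when it is run with a given interpreter; `missing_modules` is a hypothetical helper name, and running it inside a Python Script node is an assumption.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Modules used by the script above
print(missing_modules(["pandas", "epo_ops", "requests"]))
```

An empty list would mean that all three modules can be imported by that interpreter.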

Yes, the GET Request fits. Is there a way to have Python code access the server via the GET Request node? Or should I think about setting up an Anaconda environment to start the code directly from KNIME? I have not worked with these kinds of options before. Do you have a procedure that I can follow?

And for USPTO, is it the same GET request approach?

You are well ahead of me in Python, so I would suggest waiting for a reply from someone more knowledgeable than me.
I made a very basic sample that extracts the first 100 patents assigned to, say, APPLE that contain the word PHONE in the full text. The USPTO API has numerous keywords that let you focus your search; I used the easiest and shortest ones.
[image]

Then you will probably face the challenge of extracting more than 100 patents that meet your search requirements.
In this case you will need a loop (something like the one below; the Math Formula in the middle is from my workflow and is there for technical reasons, so it can easily be deleted).
In the Counting Loop Start node you set the number of pages, that is, how many hundreds of patents you would like to extract, either manually or with a widget if you like.
Then the Math Formula (Variable) node multiplies the counter by 100.
In the String Manipulation node [only one is mandatory; I left both in to make it more illustrative] you convert the integer to a string, and in the second, similar node you assemble the string for the GET request: join("https://developer.uspto.gov/ibd-api/v1/application/publications?searchText=PHONE&assigneeEntityName=APPLE&start=",$$(Start_row_string)$$,"&rows=100&largeTextSearchFlag=Y").
I tend to get rid of variables where possible and add a constant column with the string for the GET Request node. Technically you can pass the variable directly to the GET Request node and it will work perfectly.
[image]
The rest is easy: you send the GET request and use JSON Path to get the columns you need.
Column Filter is just there to clean up the table.
Column Resorter puts the columns in a readable order.
And the last trick is the Wait node. The USPTO API has limits, and if you send too many requests it simply stops working for some 10-15 minutes. To avoid this I add a Wait node and execute the loop once every 4 minutes; this gets around the problem and gives me as many results as I need.
! This is not the best approach and should not be taken as a guideline. It is just something that works in KNIME, and I hope the community will not see my inaccuracies as an attempt to misguide you.
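
For readers who think in Python rather than nodes, the loop described above could be sketched roughly as follows. This is not the workflow itself: the function names, the `fetch` and `sleep` parameters and the 240-second default are assumptions for illustration, the URL and its parameters are the ones from the join(...) expression, and whether the endpoint still responds is not checked here.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://developer.uspto.gov/ibd-api/v1/application/publications"


def build_url(search_text, assignee, start, rows=100):
    """Assemble the request URL, like the String Manipulation node does."""
    query = urlencode({
        "searchText": search_text,
        "assigneeEntityName": assignee,
        "start": start,
        "rows": rows,
        "largeTextSearchFlag": "Y",
    })
    return BASE_URL + "?" + query


def fetch_pages(fetch, pages, search_text, assignee, rows=100,
                wait_seconds=240, sleep=time.sleep):
    """Call fetch(url) once per page and pause between calls, like the Wait node."""
    results = []
    for counter in range(pages):  # counter plays the role of the loop counter
        if counter > 0:
            sleep(wait_seconds)  # pause only between requests
        start = counter * rows  # the Math Formula step: counter x rows
        results.append(fetch(build_url(search_text, assignee, start, rows)))
    return results


# Dry run without network access: "fetch" just returns the URL it was given
for url in fetch_pages(lambda u: u, pages=2, search_text="PHONE",
                       assignee="APPLE", sleep=lambda seconds: None):
    print(url)
```

For a real run, `fetch` would be replaced by a function that performs the HTTP GET and returns the response body.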

Many thanks for your contribution. It is quite difficult to follow the clever way you have implemented your workflow. Would it be possible for you to upload it?

Is it also possible to search for a specific set of IPC classes through the USPTO API?

That would be great!

Hello @DmitryIvanov76,

I have tried your approach but I get a warning that “No JSON path was specified, please enter at least one expression!”



image

Here are some pictures of the nodes I have. Within the String Manipulation node I don't know what to do …

Registers 0.1. sample.knwf (36.3 KB)

This workflow collects the first 200 VW patents and can easily be adjusted to your needs.
Please have a look at the configuration of each node and at the API documentation; I'm sure you will easily understand how it works and improve it :blush:
Have a great day!

Wow, that helped a lot, many thanks!
One question: where can I set the maximum number of patents? In the code within the String Manipulation and Math Formula (Variable) nodes?
Or should I change the number in Counting Loop Start to “2” if I want to download 200 patents in total, since you already applied the “x 100” multiplier in the Math Formula node?

I wish you a great day as well!

In this workflow you have two choices:

  1. Change the number of counts in the loop node settings.
  2. Change the number of patents extracted at a time (1-100) in String Manipulation (Variable). Just set the number of rows extracted per count in …"&rows=100&largeTextSearchFlag=Y" and change the multiplier in the preceding Math Formula (Variable) accordingly.
    If you need to extract fewer than 100 patents, the easiest way is to set the counting loop to “1” and follow procedure [2].
    Likewise, if you need 500 patents, just set the number of loops to 5 and wait for some time…
    As I mentioned, this is just a scheme and certainly not perfect. E.g. you could make the multiplier and the number of rows extracted per loop a single variable; then you would not need to change the settings of two nodes… I didn't think of it myself :blush:
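
The idea of driving both the multiplier and the rows setting from one variable can be illustrated in a few lines of Python. This is only a sketch of the arithmetic, not of the KNIME nodes; the names `loops_needed`, `request_parameters` and `rows_per_request` are made up for illustration.

```python
import math


def loops_needed(total_patents, rows_per_request):
    """Number of loop iterations needed to cover total_patents."""
    return math.ceil(total_patents / rows_per_request)


def request_parameters(total_patents, rows_per_request=100):
    """Return the start/rows pair for every iteration, derived from one variable."""
    if not 1 <= rows_per_request <= 100:
        raise ValueError("rows_per_request must be between 1 and 100")
    loops = loops_needed(total_patents, rows_per_request)
    # The same rows_per_request value serves as multiplier and as rows setting
    return [{"start": counter * rows_per_request, "rows": rows_per_request}
            for counter in range(loops)]


print(request_parameters(200))
print(request_parameters(50, rows_per_request=50))
```

Because the start offset is always the counter times `rows_per_request`, changing that single value keeps the two settings consistent.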

Great, now I understand. The last sentence is a little tricky to implement. Do you mean a metanode, or is that another story? :smiley: