Help with RSS parsing

RPS · June 21, 2020, 1:29am

Hello KNIME users,

I am a relatively recent user of KNIME and need help with the attached workflow. I am also not an IT/Software professional.

I am trying to extract publication metadata from DBLP. Metadata is available for individual authors and as a test case, I am attempting to extract information for the following author:
Albert Zomaya
Being new to KNIME, I downloaded a workflow that is available on the hub and followed it.

My workflow is also attached: KNIME_project_DBLP.knwf (14.1 KB)

The RSS feed data in the Description column needs to be cleaned. I have done this using the Cell Splitter, but I am unable to scale it up across the full dataset.

How do I clean up all the ‘garbage’ in the Description column of the RSS output? Are the Column Filter and Cell Splitter the right tools to use?

Any help will be greatly appreciated.

Thanks,
RPS
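
For readers with the same question, here is a minimal Python sketch of one way to clean a Description cell. It assumes the ‘garbage’ is HTML markup and escaped entities, which the thread does not confirm. The function name `clean_description` and the sample string are hypothetical and not taken from the real feed.

```python
from html import unescape
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_description(raw):
    """Return the plain text of one RSS description cell (assumed to hold HTML)."""
    parser = _TextExtractor()
    parser.feed(unescape(raw))  # undo escaping such as &lt;p&gt; before parsing
    parser.close()              # flush any text still buffered by the parser
    # Collapse the whitespace runs left behind where tags were removed.
    return " ".join("".join(parser.parts).split())


# Hypothetical cell value for illustration only.
sample = "&lt;p&gt;Example  &lt;i&gt;Title&lt;/i&gt; A.&lt;/p&gt; Example Journal 1(2): 3-4 (2020)"
print(clean_description(sample))
```

Because the function takes one string and returns one string, it can be applied row by row to the whole column, which is the scaling step that manual cell splitting makes hard.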

Matt_D · June 22, 2020, 2:57pm

Hi @RPS,

I’ve not had a chance to look at this properly, but I noticed you can return the data as XML too.

Albert Y Zomaya

There’s an XPath tool which may make this process easier. I’m not an expert, but I know some of the respondents on here have more XML knowledge, so they may be able to parse this for you.
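
To illustrate the XPath idea outside KNIME, here is a rough Python sketch. XPath expressions run against parsed XML rather than raw text, so the string is converted to an XML tree first. The sample document, its element names, and the helper `select_text` are assumptions for illustration and should be checked against the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like a per-author XML export. The element and
# attribute names are assumptions, not copied from the real data.
SAMPLE_XML = """
<dblpperson name="Example Author">
  <r><article key="journals/ex/Author20">
    <title>Example Title A.</title><year>2020</year>
  </article></r>
  <r><inproceedings key="conf/ex/Author19">
    <title>Example Title B.</title><year>2019</year>
  </inproceedings></r>
</dblpperson>
"""


def select_text(xml_text, path):
    """Parse a string into an XML tree, then return the text of every match."""
    root = ET.fromstring(xml_text)  # the string-to-XML conversion step
    return [node.text for node in root.findall(path)]


print(select_text(SAMPLE_XML, ".//title"))         # every title in the file
print(select_text(SAMPLE_XML, ".//article/year"))  # years of articles only
```

Note that `ElementTree` supports only a subset of XPath, which is enough for simple path selections like these.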

RPS · June 22, 2020, 10:33pm

Hi @Matt_D,

Thanks for your response. I have tried the XPath tool without success. In fact, I cannot even configure it and get the message: “The dialogue cannot be opened for the following reason: No column spec compatible to XMLValue”.

The thing is that I get a similar error in Alteryx (which I am more familiar with) so I am beginning to wonder if the data structure (for both the RSS feed and XML file) is flawed? Since I am not a computer guy, I can’t say if this is the case.

Regards.

Matt_D · June 23, 2020, 12:39pm

Hi @RPS,

I used to use Alteryx, loved that software.

I can get some time one morning this week to give you a hand with this, and I have already made some progress for you. Is there something specific you need from the data?

I spotted that there are different keys within the file, relating to:

article key
book key
incollection key
inproceedings key
proceedings key

These all contain slightly different XML structures (data). It’s a pain for a novice like me, but I’m learning a great deal, so thank you!
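
One way to see how the record types differ is to tally them and list the fields each one uses. The Python sketch below does that on a made-up sample. The `<r>` wrapper, the field names, and the helpers `count_record_types` and `fields_by_type` are assumptions for illustration, not confirmed against the real file.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical sample; the structure and field names are assumptions.
SAMPLE_XML = """
<dblpperson name="Example Author">
  <r><article key="journals/ex/A20">
    <title>Example Title A.</title><journal>Example Journal</journal><year>2020</year>
  </article></r>
  <r><inproceedings key="conf/ex/A19">
    <title>Example Title B.</title><booktitle>Example Conf</booktitle><year>2019</year>
  </inproceedings></r>
  <r><book key="books/ex/A18">
    <title>Example Title C.</title><publisher>Example Press</publisher><year>2018</year>
  </book></r>
</dblpperson>
"""


def count_record_types(xml_text):
    """Count records per type, assuming each <r> wraps one record element."""
    root = ET.fromstring(xml_text)
    return Counter(record.tag for wrapper in root.iter("r") for record in wrapper)


def fields_by_type(xml_text):
    """Map each record type to the sorted field names that appear under it."""
    root = ET.fromstring(xml_text)
    fields = {}
    for wrapper in root.iter("r"):
        for record in wrapper:
            fields.setdefault(record.tag, set()).update(child.tag for child in record)
    return {tag: sorted(names) for tag, names in fields.items()}


print(count_record_types(SAMPLE_XML))
print(fields_by_type(SAMPLE_XML))
```

Seeing the field list per type first makes it easier to decide which headers to extract for each kind of publication.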

If you can narrow down what you need (specific data headers), it’ll be easier to help you out with the specifics instead of trying to ingest it all. I don’t know anything about this data, which makes it tough…

Let me know, I’m sure we can solve this.

Matt

RPS · June 23, 2020, 11:05pm

Hey Matt,
Good to be able to talk to someone who has also used Alteryx (and so can feel some of the pain I am feeling).

This is an extract of research publishing data, and the different keys that you mention are the different ‘platforms’ (if you will). For instance, the article key relates to all research publications in journals. Likewise, book relates to published books, and proceedings relates to publications presented at various conferences.

For a start, if I could extract journal publications, that would be great. One thing I have done (and I am building a parallel workflow in Alteryx) is to download the RSS feed and use the XML parser to read the file. Doing this, I am able to separate the different publication types using a filter (I get a GUID field in Alteryx which I cannot see in KNIME).
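
As a rough Python sketch of that first step, the code below keeps only journal articles and flattens them into table rows. It assumes journal publications appear as `<article>` elements carrying a `key` attribute, as the key names in this thread suggest. The sample data, field names, and the helpers `journal_rows` and `to_csv` are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; the structure and field names are assumptions.
SAMPLE_XML = """
<dblpperson name="Example Author">
  <r><article key="journals/ex/A20">
    <title>Example Title A.</title><journal>Example Journal</journal><year>2020</year>
  </article></r>
  <r><inproceedings key="conf/ex/A19">
    <title>Example Title B.</title><booktitle>Example Conf</booktitle><year>2019</year>
  </inproceedings></r>
</dblpperson>
"""

COLUMNS = ["key", "title", "journal", "year"]


def journal_rows(xml_text):
    """Return one dict per <article> element, skipping every other record type."""
    root = ET.fromstring(xml_text)
    rows = []
    for article in root.iter("article"):
        row = {"key": article.get("key", "")}
        for field in COLUMNS[1:]:
            # Missing fields become empty strings so every row has all columns.
            row[field] = article.findtext(field, default="")
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(journal_rows(SAMPLE_XML)))
```

The resulting CSV can be read back into KNIME or Alteryx as an ordinary table, which sidesteps the differing structures of the other record types.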

I am glad that this is helping you learn, as it is helping me. I have a meeting with a colleague at work who knows a bit more about XML than I do, and I hope to find out more. I will let you know if I get ahead or find something that would help with this workflow.

I am also trying to replicate all of my Alteryx workflows into KNIME so that will be a challenge in itself.

Thanks again for your help, it is greatly appreciated.

Cheers
RPS

ipazin · June 29, 2020, 2:58pm

Hello there!

Just to add that, for an easier transition, there is a free guide/book: From Alteryx to KNIME.

@RPS have you made any progress on your workflow?

@Matt_D welcome to KNIME Community!

Br,
Ivan

Matt_D · July 17, 2020, 2:04pm

@RPS hello!

I had an unexpected absence, so I wasn’t able to work on this for you. Did you find a solution?

Happy to continue to help if you need me to.

Thanks,

Matt

DemandEngineer · October 30, 2020, 9:59pm

Hi @RPS,

I’m also coming from Alteryx… hope this helps you:
