Extracting text from PowerPoint files and saving the content to Excel

s3marube · April 14, 2021, 9:56am

Hi guys,

I am trying to extract data from a folder that contains about 100 PowerPoint files. These files all have the same structure, and the main goal is to extract just a few pieces of information, such as the contact person, the main issue, and whether the customer's issue was solved.

I have already used the Tika Parser to get the content and the title inside KNIME, but I am struggling with extracting the specific text from the content column. Maybe you guys can help me with my issue!

Many thanks in advance!!

Regards,

Max

takbb · April 14, 2021, 3:14pm

This looked like an interesting challenge that I’d not tried before, so maybe the attached workflow will help, but I’m sure some people will chip in if there are other/better ways to tackle this.

I haven’t added the “write to Excel” part, as I figure that’s not the challenging bit, and exactly what you will want to write is going to depend on your data. What I’ve done is work on the assumption that your PowerPoint files are saved in pptx format, in which case they consist of zipped-up XML files.

So the task then is to loop through your files, and for each one, unzip it into a temp folder, then loop through all the newly unzipped xml files. Only certain xml files are going to be of relevance to this process, and from investigation I’ve made the assumption that the only ones you’ll want will be ones containing /ppt/slides/ in their path name.
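Outside KNIME, the same filtering idea can be sketched in a few lines of Python. This is only an illustration of the approach, not what the workflow does internally; the function names `list_slide_parts` and `read_slide_parts` are made up for the example. It reads the archive directly instead of unzipping to a temp folder, so the entry names have no leading slash.

```python
import zipfile


def list_slide_parts(pptx):
    """Return the names of the slide XML entries inside a .pptx archive.

    `pptx` may be a file path or a binary file-like object.
    Entries such as ppt/slides/_rels/slide1.xml.rels are skipped
    because they do not end in ".xml".
    """
    with zipfile.ZipFile(pptx) as archive:
        return [
            name
            for name in archive.namelist()
            if name.startswith("ppt/slides/") and name.endswith(".xml")
        ]


def read_slide_parts(pptx):
    """Return a dict mapping each slide entry name to its raw XML bytes."""
    with zipfile.ZipFile(pptx) as archive:
        return {
            name: archive.read(name)
            for name in archive.namelist()
            if name.startswith("ppt/slides/") and name.endswith(".xml")
        }
```

To cover a whole folder you would call `read_slide_parts` once per pptx file, which corresponds to the outer loop in the workflow.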

For each one of those, we extract the XML using the XPath node, and I came up with the following XPath query to grab the contents of any text boxes:

/p:sld/p:cSld/p:spTree/*/p:txBody

That worked in my quick tests, but there may be other text you want that this doesn’t pick up, so you may need to do a bit of trial-and-error and other research on that bit.
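For anyone who wants to try that query outside KNIME, here is a rough Python equivalent using only the standard library. It is a sketch under a few assumptions: the function name `extract_text_boxes` and the sample slide (including its "Contact person" label) are invented for illustration, and the path is written relative to the `p:sld` root element because ElementTree does not accept absolute paths on an element. The two namespace URIs are the standard PresentationML and DrawingML ones.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in slide XML files.
NS = {
    "p": "http://schemas.openxmlformats.org/presentationml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
}


def extract_text_boxes(slide_xml):
    """Return one string per p:txBody found directly under the shape tree."""
    root = ET.fromstring(slide_xml)  # the root element is p:sld
    boxes = []
    # Same idea as /p:sld/p:cSld/p:spTree/*/p:txBody, relative to the root.
    for body in root.findall("p:cSld/p:spTree/*/p:txBody", NS):
        paragraphs = []
        for para in body.findall("a:p", NS):
            # A paragraph can be split into several runs; join their a:t text.
            text = "".join(t.text or "" for t in para.iter("{%s}t" % NS["a"]))
            if text:
                paragraphs.append(text)
        boxes.append("\n".join(paragraphs))
    return boxes


# Hypothetical, heavily trimmed slide used only to demonstrate the function.
SAMPLE_SLIDE = """<p:sld
    xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"
    xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
  <p:cSld><p:spTree>
    <p:sp><p:txBody>
      <a:p><a:r><a:t>Contact person: </a:t></a:r><a:r><a:t>Jane Doe</a:t></a:r></a:p>
    </p:txBody></p:sp>
    <p:sp><p:txBody>
      <a:p><a:r><a:t>Main issue</a:t></a:r></a:p>
      <a:p><a:r><a:t>Login fails</a:t></a:r></a:p>
    </p:txBody></p:sp>
  </p:spTree></p:cSld>
</p:sld>"""

if __name__ == "__main__":
    for box in extract_text_boxes(SAMPLE_SLIDE):
        print(repr(box))
```

Because the path only looks one level below `p:spTree`, text inside grouped shapes or tables sits deeper in the tree and would not be matched, which is one example of the text this query can miss.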

Once it has all that, you end up with a table of xml file names (containing the slide number) and some text. Without seeing your powerpoint files, I don’t know how you are going to organise your text for output to Excel, but maybe this gives you a starting point?
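As a rough idea of how that table could be tidied before export, the Python sketch below pulls the slide number out of each XML file name, sorts numerically (so slide 10 comes after slide 2), and writes CSV text that Excel can open. The column names, the `slide_number` and `rows_to_csv` helpers, and the choice of CSV rather than a native xlsx file are all assumptions made for the example; in KNIME you would use an Excel writer node instead.

```python
import csv
import io
import re

# Matches names such as "ppt/slides/slide12.xml" and captures the number.
SLIDE_RE = re.compile(r"slide(\d+)\.xml$")


def slide_number(part_name):
    """Return the slide number embedded in an XML file name, or None."""
    match = SLIDE_RE.search(part_name)
    return int(match.group(1)) if match else None


def rows_to_csv(rows):
    """Turn (pptx_name, part_name, text) tuples into CSV text.

    Rows are ordered by presentation name, then by numeric slide number.
    """
    ordered = sorted(rows, key=lambda r: (r[0], slide_number(r[1]) or 0))
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(["file", "slide", "text"])
    for pptx_name, part_name, text in ordered:
        writer.writerow([pptx_name, slide_number(part_name), text])
    return buffer.getvalue()
```

Picking out specific fields such as the contact person would still need a further step that depends on how the text is laid out on your slides.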

KNIME_extract_from_ppt.knwf (133.5 KB)
[edit: updated to modify row filter to include only xml files]

I hope that helps

p.s. thinking about it I probably should have used decompress instead of the unzip (legacy) node, but you get the idea…

takbb · April 15, 2021, 5:10am

I realised that while my flow potentially extracted the text, it did not answer your specific question as you said you were using the Tika parser, which (my bad) I had completely overlooked.

I hadn’t used Tika before, but the equivalent workflow to the one I did with XML is attached (and far simpler!).

Does this help, or does this just get you to where you are so far?

KNIME_extract_from_ppt_with_tika.knwf (65.0 KB)

I think to provide further help with any specific problems, a sample of one of your powerpoints (without any confidential information) and a snippet of the workflow you have so far would be needed.

s3marube · April 15, 2021, 11:57am

Hi takbb,

Sorry for my late response. I have tried your solution and it works fine for me!
Thank you for your help! You saved my day

Regards,
Max
