This looked like an interesting challenge that I’d not tried before, so maybe the attached workflow will help, but I’m sure some people will chip in if there are other/better ways to tackle this.
I haven’t added the “write to Excel” part, as I figure that’s not the challenging bit, and exactly what you will want to write is going to depend on your data. I’ve worked on the assumption that your PowerPoint files are saved in pptx format, in which case each one is a zip archive of XML files.
So the task then is to loop through your files, and for each one, unzip it into a temp folder, then loop through all the newly unzipped xml files. Only certain xml files are going to be of relevance to this process, and from investigation I’ve made the assumption that the only ones you’ll want will be ones containing /ppt/slides/ in their path name.
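For anyone who wants to sanity-check the same idea outside KNIME, here is a minimal Python sketch of that filtering step. It is not part of the attached workflow: the function names are made up for illustration, and it reads the zip archive directly rather than unzipping to a temp folder. Paths inside the archive have no leading slash, so the check is for `ppt/slides/` at the start of the name.

```python
import zipfile
from pathlib import Path


def slide_xml_names(pptx_path):
    """Return the archive member names of the slide XML parts in one .pptx file."""
    with zipfile.ZipFile(pptx_path) as archive:
        return sorted(
            name
            for name in archive.namelist()
            # Keep only slide parts such as ppt/slides/slide1.xml. The suffix
            # check also drops ppt/slides/_rels/slide1.xml.rels.
            if name.startswith("ppt/slides/") and name.endswith(".xml")
        )


def slides_in_folder(folder):
    """Map each .pptx file name in a folder to its slide XML member names."""
    return {
        path.name: slide_xml_names(path)
        for path in sorted(Path(folder).glob("*.pptx"))
    }
```

Calling `slides_in_folder("some/folder")` (the folder path is a placeholder) gives one entry per pptx file, which corresponds roughly to the outer loop in the workflow.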
For each one of those, we extract the text from the XML using the XPath node, and I came up with the following XPath query to grab the contents of any text boxes:
That worked in my quick tests, but there may be other text you want that this doesn’t pick up, so you may need to do a bit of trial-and-error and other research on that bit.
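If you want to experiment with the extraction outside KNIME, a rough Python equivalent might look like the sketch below. The expression `.//a:t` is my assumption here and is not necessarily the query used in the workflow: it picks up every DrawingML text run in the slide, which covers ordinary text boxes but may also pull in text you don't want. The function name `slide_text` is just illustrative.

```python
import xml.etree.ElementTree as ET

# DrawingML namespace, which is where the <a:t> text runs in a slide live.
NS = {"a": "http://schemas.openxmlformats.org/drawingml/2006/main"}


def slide_text(xml_bytes):
    """Return the text of every <a:t> run in one slide's XML, in document order."""
    root = ET.fromstring(xml_bytes)
    # Assumption: all wanted text sits in <a:t> elements; empty runs are skipped.
    return [node.text for node in root.findall(".//a:t", NS) if node.text]
```

You would feed it the bytes of a single slide file, for example the contents of `ppt/slides/slide1.xml`.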
Once it has all that, you end up with a table of XML file names (containing the slide number) and some text. Without seeing your PowerPoint files, I don’t know how you are going to organise your text for output to Excel, but maybe this gives you a starting point?
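As one possible way to shape that table, here is a small Python sketch that pulls the slide number out of each file name and produces (slide number, text) rows sorted numerically. The helper names and the dictionary input format are assumptions for illustration only, and how you actually want the rows laid out will depend on your data.

```python
import re


def slide_number(member_name):
    """Return the number in a name like ppt/slides/slide12.xml, or None."""
    match = re.search(r"slide(\d+)\.xml$", member_name)
    return int(match.group(1)) if match else None


def to_rows(texts_by_member):
    """Flatten {xml file name: [text, ...]} into (slide number, text) rows."""
    rows = []
    for name, texts in texts_by_member.items():
        number = slide_number(name)
        if number is None:
            continue  # not a slide file, so leave it out
        for text in texts:
            rows.append((number, text))
    # Sort numerically so slide 10 comes after slide 2; the sort is stable,
    # so the text order within each slide is preserved.
    rows.sort(key=lambda row: row[0])
    return rows
```

Rows in this form could then be written out with whatever Excel writer you prefer.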
KNIME_extract_from_ppt.knwf (133.5 KB)
[edit: updated to modify row filter to include only xml files]
I hope that helps
p.s. thinking about it I probably should have used decompress instead of the unzip (legacy) node, but you get the idea…