JSON clean up from multiple PDF pages.

bruce_231 · February 1, 2021, 6:47pm

I'm using Azure Form Recognizer to read PDFs for some data capture. After a few trials, I got a clean CSV extract from the JSON file. So far, so easy. The problem is that with PDFs you can get 2 or 3 tables on a single page, and since each table has its own row and column index numbers, pivoting the results from a multi-page PDF gives you a mess.

So my new workflow splits every table into a clean “row”. My JSON-to-table output (the last step) is OK, but ideally I need to loop through this output and clean it up a bit, because I need to pivot on all 8 rows. For each row, I group on “row index”, pivot on columnIndex, and use “text” as the aggregation.
So basically the output should be 14 separate CSV tables…
I have uploaded my workflow and the source file (JSON): Layout-Result-Pages from 2018 AR RIO (2).json (1.5 MB)
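
For readers who want to see the pivot logic outside the workflow, here is a minimal Python sketch of the same idea: treat every table separately, group its cells on the row index, spread them on the column index, and use the cell text as the value. It assumes the Form Recognizer layout JSON is shaped as `analyzeResult` → `pageResults` → `tables` → `cells`, with `rowIndex`, `columnIndex` and `text` on each cell; that shape is an assumption and should be checked against the actual file. The function names and the sample data are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def pivot_table(cells):
    """Pivot one table's cells: group on rowIndex, spread on columnIndex, value is text."""
    grid = defaultdict(dict)
    for cell in cells:
        grid[cell["rowIndex"]][cell["columnIndex"]] = cell.get("text", "")
    if not grid:
        return []
    n_cols = max(col for row in grid.values() for col in row) + 1
    # Cells missing from the JSON become empty strings
    return [[grid[r].get(c, "") for c in range(n_cols)] for r in sorted(grid)]


def tables_to_csv(result):
    """Return {(page, table_number): csv_text}, one entry per table found."""
    out = {}
    # Assumed layout: analyzeResult -> pageResults -> tables -> cells
    for page in result.get("analyzeResult", {}).get("pageResults", []):
        for t_idx, table in enumerate(page.get("tables", []), start=1):
            buf = io.StringIO()
            writer = csv.writer(buf, lineterminator="\n")
            writer.writerows(pivot_table(table.get("cells", [])))
            out[(page.get("page"), t_idx)] = buf.getvalue()
    return out


# Made-up sample: one page holding two tables whose indices both start at 0
sample = {
    "analyzeResult": {
        "pageResults": [
            {
                "page": 1,
                "tables": [
                    {"cells": [
                        {"rowIndex": 0, "columnIndex": 0, "text": "Year"},
                        {"rowIndex": 0, "columnIndex": 1, "text": "Revenue"},
                        {"rowIndex": 1, "columnIndex": 0, "text": "2018"},
                        {"rowIndex": 1, "columnIndex": 1, "text": "100"},
                    ]},
                    {"cells": [
                        {"rowIndex": 0, "columnIndex": 0, "text": "Item"},
                        {"rowIndex": 0, "columnIndex": 1, "text": "Cost"},
                    ]},
                ],
            }
        ]
    }
}

csv_by_table = tables_to_csv(sample)
```

Because each table is pivoted on its own, the repeated row and column indices on a page never collide. To get separate CSV files, load the real JSON with `json.load` and write each value of `csv_by_table` to a file named after its page and table number.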

julian.bunzel · February 4, 2021, 5:40pm

Hey @bruce_231,

I will have a look and get back to you shortly!

Best,
Julian

bruce_231 · February 4, 2021, 6:30pm

Hi Julian,
Thanks for picking this up! I actually figured out a second “part” of the workflow that solves the issue. My new problem is how to connect both workflows… Creating the JSON files (done) and transforming the JSON files to CSV (done) both work, but they are not connected. I raised another post about that,
and I’m looking at the suggestion there (a Call Workflow node)…
