I’m trying to extract some basic information from a few thousand poorly formatted Excel files.
I’d like to search for something in the top 5 rows to identify the kind of document (e.g., invoice vs bill of lading), and then search through the document for other key terms to identify other pieces of information.
Unfortunately, the files are not all formatted exactly the same, but they will follow certain conventions. Usually there will be a data label and then the data will be to the right or below the data label. Sometimes there will be blank rows or columns in between.
I’m trying to do this with a series of row filters and transpose nodes, but it doesn’t feel like the most straightforward or robust way to do it. I’m attaching an example of a couple different invoice files and the way that I’m going about this. Any suggestions for improvement? I’m thinking I should be able to search for a term, and then search for the next piece of data either in the same row or column, and return that value, without having to chop everything up in rows, but I can’t figure it out. Maybe this is the kind of problem that’s best suited for a code snippet of some kind? I’m attaching a sample workflow of the kind of problems I’m facing and how I’m going about solving it. In my actual workflow, I’m looping through the files in one directory.
As you can see, the workflow I’m building for Doc 1 wouldn’t work for Doc 2, because even though the files are similar, the columns and rows are shifted around a bit.
Also, sometimes there is a line item description that spans two rows. In Doc 1, I’ve demonstrated this by putting “consulting” and “services” on two separate rows. The row with “Services” wouldn’t ever have any other data on that line. Any way to join those two cells when I’m doing this?
Thanks! This is helpful. I like the way you’ve approached it. Do you have any idea for how to get “Consulting Services” in a column labeled “Description” on Row0?
Actually, this might be easier to see what I’m trying to do if I show a file that’s a little closer to what I’m working with. I’m attaching a redacted version of a sample source file with two worksheets. One is “Source”. That’s close to what most of these files will look like, but they do vary a bit in how they’re prepared. Some have more rows than others in certain sections.
I’m trying to add the relevant rows in the attached Excel file from A1 to H24 to my workflow.
I’m attaching the workflow I’ve been building to extract other pieces of data using the Column Expressions node, but I can’t figure out how to extract these rows. I’m thinking I should be able to search for “Description” and “Check total” and use those as variables to select a subset of data, but I don’t know how to get an index for those fields.
I will see if I can take a look later. Here is an example where several separate tables have been extracted from one Excel table - with a way to preserve the column names.