Extracting Information from poorly formatted Excel Files

I’m trying to extract some basic information from a few thousand poorly formatted Excel files.

I’d like to search for something in the top 5 rows to identify the kind of document (e.g., invoice vs bill of lading), and then search through the document for other key terms to identify other pieces of information.

Unfortunately, the files are not all formatted exactly the same, but they will follow certain conventions. Usually there will be a data label and then the data will be to the right or below the data label. Sometimes there will be blank rows or columns in between.

I’m trying to do this with a series of row filters and transpose nodes, but it doesn’t feel like the most straightforward or robust way to do it. I’m attaching an example of a couple different invoice files and the way that I’m going about this. Any suggestions for improvement? I’m thinking I should be able to search for a term, and then search for the next piece of data either in the same row or column, and return that value, without having to chop everything up in rows, but I can’t figure it out. Maybe this is the kind of problem that’s best suited for a code snippet of some kind? I’m attaching a sample workflow of the kind of problems I’m facing and how I’m going about solving it. In my actual workflow, I’m looping through the files in one directory.

As you can see, the workflow I’m building for Doc 1 wouldn’t work for Doc 2, because even though the files are similar, the columns and rows are shifted around a bit.

Sample.knwf (26.3 KB)

Also, sometimes there is a line item description that spans two rows. In Doc 1, I’ve demonstrated this by putting “consulting” and “services” on two separate rows. The row with “Services” wouldn’t ever have any other data on that line. Any way to join those two cells when I’m doing this?

Thanks!

Hi @stevelp

I put some nodes together in this workflow Sample_2.knwf (22.8 KB) , Maybe it helps in someway to find a solution for your question.


gr. Hans

2 Likes

Thanks! This is helpful. I like the way you’ve approached it. Do you have any idea for how to get “Consulting Services” in a column labeled “Description” on Row0?

1 Like

Actually, this might be easier to see what I’m trying to do if I show a file that’s a little closer to what I’m working with. I’m attaching a redacted version of a sample source file with two worksheets. One is “Source”. That’s close to what most of these files will look like, but they do vary a bit in how they’re prepared. Some have more rows than others in certain sections.

Check Request 1.xlsx (11.7 KB)

I’m trying to add the relevant rows in the attached Excel file from A1 to H24 to my workflow.

I’m attaching the workflow I’ve been building to extract other pieces of data using the Column Expressions node, but I can’t figure out how to extract these rows. I’m thinking I should be able to search for “Description” and “Check total” and use those as variables to select a subset of data, but I don’t know how to get an index for those fields.

Read Files2.knwf (35.4 KB)

I will see if I can take a look later. Here is an example where several separate tables have been extracted from one Excel table - with a way to preserve the column names.

Maybe you could also get some inspiration there.

2 Likes

@stevelp - OK I took a first look at this. I think it is possible but will not be easy. A strategy would go something like this:

  • identify the table/data block of each sheet
  • it should have some indicator words and the end might be marked by the first line which contains a “:”
  • this data block would be imported along the lines of the link I posted
  • these “:” might be the key to converting these informations
  • all data behind a colon would belong to whatever is before
  • if you encounter information in a line without colon it would belong to the cell of the above colon
  • some cleaning and converting of data must happen ($ marked cells to number)
  • every line will get an artificial ID (new_id) that marks the position so other information (from below might then be attached)
  • it could be helpful if you know some items that would always have a name
  • the lonesome information at the top will be the head
  • if a line only contains one colon it can just be split, if it is more than one you will have to treat them separately

1 Like

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.