Read files in folder - different row structures

JWebb · November 7, 2022, 8:09am

This has probably been asked before though I am not quite sure how to describe it in order ot actually find it in searches!

I have a big stack of Excel files with roughly the same column template (I can deal with that) but the dat starts on different rows. Some of them have a single blank row before the headers and some of them have a blank row and a merged row (which I am not interested in keeping) first.

The question is: is there a good way to consolidate Excel spreadsheets from one folder if the headers are on slightly different rows? Perhaps a variable to force it to only read a given row down to the end? Thanks for the input in advance

aworker · November 7, 2022, 8:42am

Hi @JWebb

If you know in advance at which row to start reading the Excel data, then it is possible to configure it based on the value in a variable. Otherwise, you would need to do two readings, one to gather where to start reading (just for that), and a second one to do the right reading of your Excel file based on the right row starting.

Please let us know which of the two scenarios is yours to help you further. Uploading here one of your Excel files to play with and eventually provide a solution would be much appreciated

Hope it helps.

Best
Ael

JWebb · November 7, 2022, 9:55am

Thanks, I’ve actually almost got it! I used your idea of doing two reads to find the row numbers and using variables there. The only trouble which remains is the column structure is different. Normally, this is not a problem, but the option “Fail if specs differ” is checked and greyed out so I cannot uncheck it like I normally do! Any hints on what I could do there?

Normally, I would just uncheck that and I could end up with extra columns I can just handle further down the line.

aworker · November 7, 2022, 10:03am

Hi @JWebb

Glad the idea seems to work. Concerning the checked option “Fail if specs differ”, I have never faced this problem.

Maybe you could try using a new fresh -Excel Reader- node from the node depository to uncheck it first before you set any variable dependency and then, once it is unchecked, you could set up the variable dependency as expected. Would this work ? Otherwise, please reach out again and we will try other options from there.

Best
Ael

JWebb · November 7, 2022, 10:15am

I think I have figured out the reason. Because I am reading each individual file path rather than the files in folder (so I can evaluate where the extra rows are in each file), the option is not available.

Time to rethink the way I did it!

aworker · November 7, 2022, 10:19am

It makes perfect sense
Maybe it would be better to refer to files individually to avoid this problem. For this, you can use the -List Files/Folders- node beforehand to set with the exact Excel file name the variable that you are using.

Hope it helps.

Best,
Ael

gonhaddock · November 7, 2022, 10:33am

Hello @JWebb
You could use @takbb 's workflow “Find Header Row For Excel”.

I did a modification based in previous workflow, that fits the excel once read; then you don’t need to read the file twice. You can nest this workflow in a loop…

BR

aworker · November 7, 2022, 11:46am

Hi @gonhaddock

Thanks for your post and links to these two different solutions. Just a couple of thoughts about both solutions in your links.

@takbb solution is implementing the idea I mentioned in my previous post. I believe it is a really nice solution since it is ready to use “off-the-shell”. However, one needs to be careful to make sure that it detects and fits the correct format of your own Excel file.

Indeed, your second solution does not need a 2nd read of the Excel file. However, one may eventually need to manually handle, verify / set all the column types which is automatically done by the Excel Reader- node when it is correctly configured and used to eventually read the Excel file. This may be a bit of hassle, specially when reading Excel files with a big variety of column types.

Again, thanks for bringing these two solutions and links to the arena.

Best
Ael

gonhaddock · November 7, 2022, 12:02pm

Hi @aworker
You are totally right, I am just bringing some concepts, as tools. To be honest, I’ve faced the most complicated Excel data gathering that you can imagine. However and for many reasons, when an Excel data ETL becomes spicy in terms of column types, or other ‘creative’ excel implementations and uses; I think that the best solution is to move straight forward into R

Thanks for your comments.
BR

JWebb · November 7, 2022, 1:03pm

To be completely honest, in the time it would take me to learn the R required (it took me 2.5hrs to calculate a mean last week !) I could probably prepare them all manually. However, I can guarantee this will not be the only time I am sent some spreadsheet messes, so I will definitely take all the ideas on board and try them out.

Now, to maybe try and get a bit more R-proficient…!

Daniel_Weikert · November 7, 2022, 5:42pm

If only the row is different and the structure the same why not reading them with A,B,C column names together and then filter the empty rows (if necessary) and the header rows out afterwards. Header can be replaced afterwards as well and any loop is skipped. (Nobody of us likes them anyway )
br

system · February 5, 2023, 5:42pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.