This has probably been asked before though I am not quite sure how to describe it in order ot actually find it in searches!
I have a big stack of Excel files with roughly the same column template (I can deal with that) but the dat starts on different rows. Some of them have a single blank row before the headers and some of them have a blank row and a merged row (which I am not interested in keeping) first.
The question is: is there a good way to consolidate Excel spreadsheets from one folder if the headers are on slightly different rows? Perhaps a variable to force it to only read a given row down to the end? Thanks for the input in advance
If you know in advance at which row to start reading the Excel data, then it is possible to configure it based on the value in a variable. Otherwise, you would need to do two readings, one to gather where to start reading (just for that), and a second one to do the right reading of your Excel file based on the right row starting.
Please let us know which of the two scenarios is yours to help you further. Uploading here one of your Excel files to play with and eventually provide a solution would be much appreciated
Thanks, I’ve actually almost got it! I used your idea of doing two reads to find the row numbers and using variables there. The only trouble which remains is the column structure is different. Normally, this is not a problem, but the option “Fail if specs differ” is checked and greyed out so I cannot uncheck it like I normally do! Any hints on what I could do there?
Normally, I would just uncheck that and I could end up with extra columns I can just handle further down the line.
Glad the idea seems to work. Concerning the checked option “Fail if specs differ”, I have never faced this problem.
Maybe you could try using a new fresh -Excel Reader- node from the node depository to uncheck it first before you set any variable dependency and then, once it is unchecked, you could set up the variable dependency as expected. Would this work ? Otherwise, please reach out again and we will try other options from there.
I think I have figured out the reason. Because I am reading each individual file path rather than the files in folder (so I can evaluate where the extra rows are in each file), the option is not available.
It makes perfect sense
Maybe it would be better to refer to files individually to avoid this problem. For this, you can use the -List Files/Folders- node beforehand to set with the exact Excel file name the variable that you are using.
Thanks for your post and links to these two different solutions. Just a couple of thoughts about both solutions in your links.
@takbb solution is implementing the idea I mentioned in my previous post. I believe it is a really nice solution since it is ready to use “off-the-shell”. However, one needs to be careful to make sure that it detects and fits the correct format of your own Excel file.
Indeed, your second solution does not need a 2nd read of the Excel file. However, one may eventually need to manually handle, verify / set all the column types which is automatically done by the Excel Reader- node when it is correctly configured and used to eventually read the Excel file. This may be a bit of hassle, specially when reading Excel files with a big variety of column types.
Again, thanks for bringing these two solutions and links to the arena.
You are totally right, I am just bringing some concepts, as tools. To be honest, I’ve faced the most complicated Excel data gathering that you can imagine. However and for many reasons, when an Excel data ETL becomes spicy in terms of column types, or other ‘creative’ excel implementations and uses; I think that the best solution is to move straight forward into R
To be completely honest, in the time it would take me to learn the R required (it took me 2.5hrs to calculate a mean last week !) I could probably prepare them all manually. However, I can guarantee this will not be the only time I am sent some spreadsheet messes, so I will definitely take all the ideas on board and try them out.
Now, to maybe try and get a bit more R-proficient…!
If only the row is different and the structure the same why not reading them with A,B,C column names together and then filter the empty rows (if necessary) and the header rows out afterwards. Header can be replaced afterwards as well and any loop is skipped. (Nobody of us likes them anyway )