Convert 31-column table to 2,046 columns

Hello,

I currently have a process in an Access database that uses a VBA loop to transpose data from a 3-column table into a table with 60 columns. The data from 20 rows is consolidated into 1 row.
The data starts off in the 3-column table like this:
[image: the source data in its 3-column form]

The transposed result looks like this. It is not a straight transposition of the 3 columns; rather, I have consolidated multiple rows into a single row. You can see in the image that the first row’s data for item #1 is in columns 1-3, then item #2 is in columns 4-6, and so on.

The code to do this is here. It basically loops through each record and manually builds the transposed table. This may seem like it would be slow, but it is actually lightning fast.
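For illustration only, here is a minimal pandas sketch of the same fold. This is not the actual VBA; the column names and sample data are invented.

```python
import pandas as pd

ROWS_PER_GROUP = 20  # 20 source rows become 1 output row

# toy 3-column source table (invented data)
src = pd.DataFrame({
    "item":  range(1, 101),
    "descr": [f"descr {i}" for i in range(1, 101)],
    "value": [i * 1.5 for i in range(1, 101)],
})

# group = which output row a source row lands in, seq = its slot within that row
src["group"] = src.index // ROWS_PER_GROUP
src["seq"] = src.index % ROWS_PER_GROUP

# pivot so every (column, seq) pair becomes its own output column
wide = src.pivot(index="group", columns="seq", values=["item", "descr", "value"])
wide.columns = [f"{col}_{seq + 1}" for col, seq in wide.columns]

# reorder so the columns read item_1, descr_1, value_1, item_2, ... as described above
ordered = [f"{col}_{i}" for i in range(1, ROWS_PER_GROUP + 1)
           for col in ("item", "descr", "value")]
wide = wide[ordered].reset_index(drop=True)

print(wide.shape)  # (5, 60): 100 rows / 20 = 5 rows, 3 cols * 20 = 60 cols
```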

The purpose of this is to reduce the number of records that Access needs to upload to a Teradata server. This table can have as many as 10,000 rows, and Access is slow. By transposing in this way I can cut the row count of my upload table by a factor of 20 and increase the upload speed dramatically.
Once my table is uploaded to Teradata, I run a union query to rebuild it into its normal form. Teradata is extremely fast, so this is no problem.
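Conceptually the rebuild is just the inverse fold: the union query stacks the column groups back on top of each other. A pandas sketch of the idea (invented names, not the actual Teradata SQL):

```python
import pandas as pd

ROWS_PER_GROUP = 2  # kept tiny so the example is easy to follow

# a toy "wide" table where every row holds 2 folded source rows
wide = pd.DataFrame({
    "item_1": [1, 3], "descr_1": ["a", "c"], "value_1": [1.5, 4.5],
    "item_2": [2, 4], "descr_2": ["b", "d"], "value_2": [3.0, 6.0],
})

# stack the column groups back on top of each other (this is the UNION step)
parts = []
for i in range(1, ROWS_PER_GROUP + 1):
    part = wide[[f"item_{i}", f"descr_{i}", f"value_{i}"]].copy()
    part.columns = ["item", "descr", "value"]
    parts.append(part)

narrow = pd.concat(parts, ignore_index=True)
print(narrow)  # 4 rows and the original 3 columns again
```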

I have tried a few nodes such as Transpose and Pivot, but they do not seem to work the same way my VBA loop does, which lets me consolidate multiple rows into a single row while limiting the number of columns. The Transpose node will transpose everything, but that results in millions of columns, which cannot be uploaded.

The reason I need this is that I run multiple ETL workflows moving data from a DB2 server into a Teradata database. These tables can be hundreds of millions of rows long, and KNIME takes about an hour to move just 10 million of them. I would like to transpose the data in a similar fashion to my Access VBA loop, from a table with 31 fields to one with 2,046 (Teradata tables max out at 2,048 columns). This would give me a much shorter table to upload, and I believe it would significantly speed up my workflows for large tables.
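The 2,046 target follows directly from that column limit; a quick check of the arithmetic (purely illustrative Python):

```python
SOURCE_COLS = 31          # fields in the DB2 table
TERADATA_MAX_COLS = 2048  # Teradata column limit mentioned above

rows_per_output_row = TERADATA_MAX_COLS // SOURCE_COLS  # 66 source rows per upload row
wide_cols = SOURCE_COLS * rows_per_output_row           # 31 * 66 = 2,046 columns
print(rows_per_output_row, wide_cols)                   # 66 2046
```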

My current workflow is very simple.
[image: the current workflow]

Are there any nodes that perform similarly to my Access loop? Or is it possible to code something similar in a JS node? Or is there something I am missing about uploading large amounts of data? My batch size on the DB Writer is set to 100,000.

Hi @bmclain

To restructure your data, I’m thinking about something like this: convert_x_cols_many_cols.knwf (47.9 KB).


I don’t know about the performance with a huge data set and the loops, but maybe this helps.
gr. Hans


Hi @HansS

Thank you very much for your quick reply. Your solution works exactly like my VBA loop and does what I needed. It took a bit of playing around to find the settings that would speed up my process, but here are the results of my test.

Initially, I wanted to build as many columns as Teradata will allow, but to keep the math in the loop simple I decided to transpose 50 rows into 1 row. I set the first Chunk Loop Start to 1,000,000 and the second one to 20,000. This created a table with 1,550 columns (31 original columns × 50 iterations). However, the Writer node could not seem to upload this table at all, probably due to memory issues. I did try reducing the batch size, which allowed it to show progress, but it was much slower than the normal upload process.

After further testing, I settled on transposing only 5 rows at a time, so the table I finally used for testing has only 31 × 5 = 155 columns. The Writer node seemed happy with this. I will keep testing with 2, 3, and 10 rows at a time to see if I can find a faster runtime, and I will also test on a typical week of data, which is about 10 million rows. What I am really trying to do is find a method to upload a lot more rows than 10 million without it taking all day.

I tested a sample set of 1 million rows to get a benchmark for the original workflow I posted: just a standard Reader and Writer setup with a 100,000 batch size on the Writer.
Using this setup, the Writer takes 4.5 minutes to upload 1 million records.
Using the transposed data, the Writer takes just under 3 minutes and the transposition itself takes 27 seconds, for a total of about 3 minutes 15 seconds, which is roughly a 28% reduction in total time (195 seconds vs. 270 seconds). The union at the end is instant.

I expected a much bigger performance increase, but I am happy to see that it works.

The only remaining issue I have is this:

  1. The loop you provided doesn’t work if the number of input rows cannot be divided evenly by the group size I am using. For example, my sample set was 1 million rows, so I could fold those in groups of 5 to build a table with 200,000 rows. However, if I alter the input table to 1,000,001 rows, the loop fails, because 1,000,001 is not evenly divisible by 5. Do you know of a way to fix this?

@bmclain I once built an example of how to download chunks of data from a database with the last chunk being smaller than the ones before. Maybe you could adapt that?
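This is not the linked workflow, just a generic Python illustration of the "last chunk smaller" idea:

```python
def chunk_bounds(total_rows: int, chunk_size: int):
    """Yield (start, end) row offsets; the final chunk simply absorbs the remainder."""
    for start in range(0, total_rows, chunk_size):
        yield start, min(start + chunk_size, total_rows)

# 1,000,001 rows in chunks of 5: 200,000 full chunks plus one final 1-row chunk
bounds = list(chunk_bounds(1_000_001, 5))
print(len(bounds))   # 200001
print(bounds[-1])    # (1000000, 1000001)
```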


Thank you @mlauber71
I will take a look at your workflow and try to include it in my own.


Hello @bmclain,

For your data transformation you can get by with just the Pivoting node (no loops) if you add group and sequence columns to your table. Avoiding loops might be a good idea if you are working with 10+ million rows. See here how to simply add group and sequence columns to your table:

And here is a workflow with the Pivoting node, under the assumption that you have already added these columns:
convert_x_cols_many_cols_ipazin.knwf (21.0 KB)
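As a plain-Python illustration of what those two helper columns contain (this is not the attached workflow; the Pivoting node would then roughly use "group" as the group column and "sequence" as the pivot column):

```python
import pandas as pd

ROWS_PER_GROUP = 66  # 31 source columns * 66 = 2,046 pivoted columns

def add_group_and_sequence(df: pd.DataFrame, rows_per_group: int = ROWS_PER_GROUP) -> pd.DataFrame:
    """group = which output row a record lands in, sequence = its slot within that row."""
    out = df.reset_index(drop=True)
    out["group"] = out.index // rows_per_group
    out["sequence"] = out.index % rows_per_group
    return out

# tiny demo with an invented single-column table
demo = add_group_and_sequence(pd.DataFrame({"col_1": range(200)}), rows_per_group=66)
print(demo.tail(3))  # row 199 lands in group 3, sequence 1
```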

Regarding writing to Teradata, you can explore these nodes:

Br,
Ivan


Hi @bmclain, I agree with @ipazin: if you can avoid loops with a huge data set, it’s better, especially a loop within a loop.

It’s also a good idea NOT to initiate your TD DB Connector (Node 27) when your workflow starts, as it will sit idle until all your data is processed; if you stay with your loop solution, that idle/wait time can be long and your connection could expire. It’s better to link your Loop End (or Pivoting, or whatever the last node of your data processing is) to your TD_Credentials_BM1 component via the flow variable port, like you did when linking Node 22 to the DB SQL Executor. That way your TD DB Connector will only initiate after the data is ready to be written.


With Column to Grid you should be able to avoid the second chunk loop; however, I am not sure whether this node has a loop built into it. It would be interesting to see whether the performance is better.

edit: or maybe some combination of Column Aggregator and then a Lag Column; after that you split the aggregations back into separate columns.
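Roughly, that aggregate / lag / split idea would look like this in pandas (just a sketch of the concept with invented names and data, not the actual KNIME nodes):

```python
import pandas as pd

K = 3  # rows folded per output row

df = pd.DataFrame({"a": range(9), "b": list("abcdefghi")})

# 1) pack each row's columns into one delimited string (the Column Aggregator step)
packed = df.astype(str).agg(";".join, axis=1)

# 2) add shifted copies of that string (the Lag Column step, shown here as a look-ahead;
#    a backward lag works the same if you keep the last row of each block instead)
lagged = pd.concat({f"lag_{i}": packed.shift(-i) for i in range(K)}, axis=1)

# 3) keep one row per block of K rows
every_kth = lagged.iloc[::K]

# 4) split the packed strings back into separate columns (the splitter step)
wide = pd.concat(
    [every_kth[c].str.split(";", expand=True).add_prefix(f"{c}_") for c in every_kth.columns],
    axis=1,
)
print(wide.shape)  # (3, 6): 9 rows / 3 = 3 rows, 2 source cols * 3 = 6 cols
```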
br
