Explosion of the CSV files containing List / JSON data format

Hi,

I am reading my data from a CSV file . Some of my columns contain cells that store data in the form of lists or JSON file formats. I was wondering if there were any options in KNIME that could explode JSON and also list cells into several rows. As my CSV file is very big(35G), I am looking fo an efficient way as the file size would increase after the explosion.

Examples with columns containing cells with List / JSON: (Green colur is what I expected).
Screenshot 2022-07-28 185148

Welcome to the forum @NeginZarbakhsh.

In your first example, where does the Day information come from?

In both examples, what happens if there’s more than one row in the initial table?

Where does the Sample information disappear to?

Can you upload examples of csv files (a couple dozen rows is fine) that you’re trying to process along with some more details of your desired output?

Thanks!

My data store weekly sales information for each store and I attached 3 rows of my data.
In your first example, where does the Day information come from?
Column G stores the sale amount for the seven days as a list data type. I have two options for day information.
Option 1: I have opening and closing week dates stored in cells D &E.
Option 2: All of my lists start on Sundays, so I can create new rows for all rows beginning with Sunday and ending with Saturday.

In both examples, what happens if there’s more than one row in the initial table?
For the first example, each row should be duplicated by seven times.

Can you upload examples of csv files (a couple dozen rows is fine) that you’re trying to process along with some more details of your desired output?
In the first example, I have only 3 rows, and the desired output would be 21 rows as all of the three rows should be duplicated for 7 days and assign the sale amount to each day.

For the JSON format, the key and values are different for each sample. so I need to repeat the rows based on that.

The first picture shows 3 rows of the data.


The second picture shows the desired output for example 1.

That explanation is helpful. Can you upload the text file here so that we could download it and use it in workflow?

sample.txt (20.0 KB)

Having the data is great.

Let’s first deal with the list columns. In the following screenshots I’ve filtered out a few columns just to make it easier to follow along.

Since you’re loading a text/csv file, KNIME doesn’t know that the list columns (e.g. visits_by_day) are lists. KNIME reads this as a string column, as shown by the S icon in the column header:

The first step is to convert this column into a KNIME list. I removed the square brackets using the Regex Extractor node:

After renaming the column so that it has the original name, I use the Cell Splitter node to split the column using the comma as the delimiter, and format the output as a list:

image

Now we’re back to where we started, but now KNIME knows it’s a list column. So we can use an Ungroup node to explode the list:

30 list items multiplied by 5 rows gives 150 rows.

image

3 Likes

Now let’s deal with a column like visitor_home_cgbs.

The same concept applies. When the text/csv file is loaded, KNIME doesn’t know that the column contains JSON.

A String to JSON node will take care of that conversion:

image

To flatten the JSON, we use the JSON Path node. This one is a little tricky but if you have problems let me know and I can walk you through it.

Now we have 2 lists, and we can ungroup:

I used a Regex Extractor to clean up the Key column:

And then this is the result after cleaning up the column names:

image

Both examples are available on the KNIME Hub:

2 Likes

Hi @elsamuel,

That is a great solution!
Thanks!

One question, do you think it is also possible to add also another column that includes the weekday’s names / or numbers from 1 to 7?

That is possible, but the workflow is a little more complex:

The idea is to go row by row and use the start and end date information to generate a series of dates (as rows), from which a day number and day name can be extracted. At the end, this series of dates gets appended to the main stream. I’m assuming the values in the main stream are in chronological order.

I’ve updated the previous workflow on the KNIME Hub

2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.