I have a dataset with a few million rows. I’m trying to visualize it as a scatter plot of sorts and need two different dates (for example sales date and production date) on the X and Y axes.
In Python I was creating this visual with a 2D line chart and a line size of 0 → works fine there.
In KNIME I want to use the Python View to get the same visual, but the Python View never finishes executing.
This is my code:
from io import BytesIO
import matplotlib.pyplot as plt
# Create buffer to write into
buffer = BytesIO()
# Create plot and write it into the buffer
plt.plot(input_table['purchase_date'], input_table['sales_date'], 'o', ms=1)
plt.savefig(buffer, format='svg')
# The output is the content of the buffer
output_image = buffer.getvalue()
Any ideas why it takes less than a minute to run in Python directly, while here it gets stuck at ~15%?
@TheLeo you could try a direct export from within the Python node, then import the image into KNIME and see if that works.
Also have you tried with just a few dates just to see if the code runs at all?
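A minimal sketch of the direct export suggested above, assuming matplotlib, NumPy and pandas are available in the node's Python environment. The helper name export_scatter, the output path and the synthetic demo table are illustrative assumptions, not part of the original workflow; inside the node you would pass input_table instead of demo.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend, no GUI window needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def export_scatter(table, x_col, y_col, path):
    """Draw x_col against y_col as small markers and save the figure as a PNG file."""
    fig, ax = plt.subplots()
    ax.plot(table[x_col], table[y_col], "o", ms=1)
    fig.savefig(path, format="png", dpi=150)
    plt.close(fig)  # release the figure's memory
    return path


# Small synthetic stand-in for input_table (made-up dates, assumption)
rng = np.random.default_rng(0)
start = pd.Timestamp("2020-01-01")
demo = pd.DataFrame({
    "purchase_date": start + pd.to_timedelta(rng.integers(0, 365, 1000), unit="D"),
    "sales_date": start + pd.to_timedelta(rng.integers(0, 365, 1000), unit="D"),
})

# Hypothetical output location; pick any path the workflow can read later
out_path = export_scatter(
    demo, "purchase_date", "sales_date",
    os.path.join(tempfile.gettempdir(), "scatter_demo.png"),
)
```

The written file can then be imported into KNIME in a later step. A raster format such as PNG keeps the file size independent of the number of points, which is why it is used in this sketch.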
Yes, the code executes. In the dialog I can run it with about 5,000 rows and I get the image I’m looking for.
But even there, if I increase the rows to about 200,000 I can’t run the preview anymore; it’s just stuck at 50%…
A possible way to get a “reasonable” scatter plot from a huge number of points is to randomly undersample the points before visualizing them. The result will be much clearer and easier to understand than a plot of the whole set. Otherwise the plot gets messy (looking more like a stain than a scatter plot because the points overlap). In other words, the full scatter plot is far less informative than one drawn from an undersampled data set.
Statistically speaking, this is a sound way to solve the problem, because I guess what you want to know is the overall distribution of your points in the scatter plot. For instance, a random sample of 10,000 points is generally representative enough of a distribution of millions of points, provided they follow a reasonable distribution.
The only scenario in which I see it making sense to plot a whole set of millions of points is when the visualization tool implicitly handles the undersampling in the background and recalculates a more refined view when you zoom into a region of the scatter plot. The Python code you posted here cannot do that. Therefore, the data needs to be undersampled beforehand to get a reasonable scatter plot.
Random undersampling can be done in KNIME with the -Partitioning- or -Row Sampling- nodes, making sure that the random sampling option is selected in these nodes.
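If you prefer to do the undersampling inside the Python script instead of with a KNIME node, a rough sketch with pandas could look like this. The helper name undersample, the default of 10,000 points, the seed and the synthetic demo table are assumptions for illustration; in the node you would call it on input_table before plotting.

```python
import numpy as np
import pandas as pd


def undersample(table, n_points=10_000, seed=42):
    """Return a uniform random sample of at most n_points rows, without replacement."""
    if len(table) <= n_points:
        return table  # already small enough, nothing to do
    return table.sample(n=n_points, random_state=seed)


# Synthetic stand-in for input_table (made-up dates, assumption)
rng = np.random.default_rng(0)
n_rows = 200_000
start = pd.Timestamp("2020-01-01")
purchase_offset = rng.integers(0, 365, n_rows)
demo = pd.DataFrame({
    "purchase_date": start + pd.to_timedelta(purchase_offset, unit="D"),
    "sales_date": start + pd.to_timedelta(purchase_offset + rng.integers(0, 60, n_rows), unit="D"),
})

# Only the sampled rows would be handed to plt.plot(...)
sampled = undersample(demo)
```

Fixing the seed makes the plot reproducible between runs; drop it if you want a different sample each time.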
Hope this hint helps.