How best to generate OHLC data from stock prices?

Hi,
I have large csv datasets and I want to create the OHLC aggregation typical in the financial world (for charting).

The input csvs look like this: timestamp, stock price, traded volume
1497511900,4404.96,1.176680350000
1497511900,4403.99,0.434099140000
1497511922,4404.94,0.127097050000

I want to create a 5-min OHLC “compression”. In case some don’t know the format, here is an explanation of how the output csv should look:

At the beginning of the data you start at the first data point and create a 5-min interval aligned to a round time, e.g.
2018/05/03 15:15:00 - 15:19:59
Now you take the first stock price that lies in the interval and save it in a new column called O (for “open”),
you take the max of all stock prices within that interval and write it in the column H (for “high”),
the min goes into L (for “low”),
and the last price within the interval you save in a column C (for “close”).

the second interval would be from 15:20:00 - 15:24:59
the third one 15:25:00 - 15:29:59
and so on
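In code terms, the mapping of a tick to its interval can be sketched like this (a hypothetical Python illustration, not part of any existing tool):

```python
from datetime import datetime, timedelta

def interval_start(ts: datetime, minutes: int = 5) -> datetime:
    """Floor a timestamp to the start of its n-minute interval."""
    floored_min = ts.minute - ts.minute % minutes
    return ts.replace(minute=floored_min, second=0, microsecond=0)

t = datetime(2018, 5, 3, 15, 17, 42)
start = interval_start(t)
end = start + timedelta(minutes=5) - timedelta(seconds=1)
print(start, "-", end)  # 2018-05-03 15:15:00 - 2018-05-03 15:19:59
```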

**Is there a ready-made solution for this? Since the input files will be quite large (several GB), it should be as efficient as possible, and I couldn’t find anything ready-made.**

Nice evening to the board!

It is possible to create candlestick charts in the BIRT report editor.

You’d create a workflow to pre-process and filter the time series data appropriately, and at the end of your workflow you’d add a Data to Report node. Inside the report editor, you’d add a chart of type Stock and format it the way you want.

If you post some example data, it will be easier to experiment with a solution.

Hi elsamuel,
the input data looks like this: Input_Testdata.xml (411.6 KB)
The second column is the price (which should be aggregated to OHLC); the last column is the volume, which should simply be summed per interval.

The real files would be around 2-4 GB.
Candlestick uses a similar approach, but I actually don’t want to produce a chart in KNIME. I want to write the data to a csv/txt file that should look like this: WantedOutput.xml (136 Bytes)

( I could not upload csv here that’s why i saved them as .xml)
This is more or less the standard format for countless stock market charting tools.
The data is not meaningful; I just typed it in for illustration.
So when the timestamp in the first file reads “20.05.2020 15:25”,
it should aggregate the OHLC data for the interval from 15:25:00 - 15:29:59.

hope that clarifies, greetings…

That does clarify things, and I now see what you’re trying to do.

Here’s a workflow that I think can get you started:

Steps:

  1. Convert the timestamp to a Date&Time format
  2. Use a Window Loop Start to define windows of whatever period you’d like to analyze. In this example I chose periods of 5 mins, but this is easily changed.
  3. Use a GroupBy node to aggregate the price data. I chose first, maximum, minimum, and last for open, high, low, and close respectively. I also aggregated volume using sum.
  4. Rename columns
  5. Export as a csv file
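For comparison, the same five steps can be sketched outside KNIME in pandas (the column names here are assumptions based on the sample rows posted above):

```python
import io
import pandas as pd

# Hypothetical tick data in the poster's format: timestamp, price, volume
raw = io.StringIO(
    "1497511900,4404.96,1.17668035\n"
    "1497511900,4403.99,0.43409914\n"
    "1497511922,4404.94,0.12709705\n"
)
df = pd.read_csv(raw, names=["timestamp", "price", "volume"])

# Step 1: convert the unix timestamp to a date/time index
df["time"] = pd.to_datetime(df["timestamp"], unit="s")
df = df.set_index("time")

# Steps 2-4: 5-minute windows, OHLC on price, sum on volume, rename columns
bars = df["price"].resample("5min", closed="left", label="left").ohlc()
bars["volume"] = df["volume"].resample("5min", closed="left", label="left").sum()
bars.columns = ["O", "H", "L", "C", "V"]

# Step 5: export as csv
bars.to_csv("ohlc_5min.csv")
```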

Good luck! Let me know if you have any other questions

1 Like

Hi Elsamuel,
thank you for that workflow! Window Loop Start is really an interesting function. The problem is: it worked fine for a small test file with 100K rows, but a 1.6 GB file ran for about 60 min and then this error came up:

ERROR Loop End 3:5 Caught “NullPointerException”: null
WARN GroupBy 3:13 No grouping column included. Aggregate complete table.

Edit: in the logs it shows this:

2020-04-04 18:17:16,372 : WARN : KNIME-Worker-53-Window Loop Start 3:8 : : Node : GroupBy : 3:9 : No grouping column included. Aggregate complete table.
2020-04-04 18:17:16,383 : ERROR : KNIME-Worker-55-Loop End 3:5 : : LocalNodeExecutionJob : Loop End : 3:5 : Caught “NullPointerException”: null
java.lang.NullPointerException
at java.util.ArrayDeque.addLast(ArrayDeque.java:249)
at org.xerial.snappy.buffer.CachedBufferAllocator.release(CachedBufferAllocator.java:77)
at org.xerial.snappy.SnappyOutputStream.close(SnappyOutputStream.java:424)
at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
at org.knime.core.data.container.BlockableOutputStream.close(BlockableOutputStream.java:160)
at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
at org.knime.core.data.container.LongUTFDataOutputStream.close(LongUTFDataOutputStream.java:90)
at org.knime.core.data.container.DCObjectOutputVersion2.close(DCObjectOutputVersion2.java:152)
at org.knime.core.data.container.DefaultTableStoreWriter.close(DefaultTableStoreWriter.java:146)
at org.knime.core.data.container.Buffer.performClear(Buffer.java:1983)
at org.knime.core.data.container.Buffer$SoftRefLRULifecycle.onClear(Buffer.java:2739)
at org.knime.core.data.container.Buffer.clear(Buffer.java:1962)
at org.knime.core.data.container.ContainerTable.clear(ContainerTable.java:266)
at org.knime.core.node.BufferedDataTable.clearSingle(BufferedDataTable.java:947)
at org.knime.core.node.Node.cleanOutPorts(Node.java:1623)
at org.knime.core.node.workflow.NativeNodeContainer.cleanOutPorts(NativeNodeContainer.java:586)
at org.knime.core.node.workflow.WorkflowManager.restartLoop(WorkflowManager.java:3276)
at org.knime.core.node.workflow.WorkflowManager.doAfterExecution(WorkflowManager.java:3144)
at org.knime.core.node.workflow.NodeContainer.notifyParentExecuteFinished(NodeContainer.java:630)
at org.knime.core.node.workflow.NodeExecutionJob.internalRun(NodeExecutionJob.java:248)
at org.knime.core.node.workflow.NodeExecutionJob.run(NodeExecutionJob.java:124)
at org.knime.core.util.ThreadUtils$RunnableWithContextImpl.runWithContext(ThreadUtils.java:334)
at org.knime.core.util.ThreadUtils$RunnableWithContext.run(ThreadUtils.java:210)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.knime.core.util.ThreadPool$MyFuture.run(ThreadPool.java:123)
at org.knime.core.util.ThreadPool$Worker.run(ThreadPool.java:246)
2020-04-04 18:49:16,656 : WARN : main : : Node : GroupBy : 3:13 : No grouping column included. Aggregate complete table.
2020-04-04 19:27:28,887 : WARN : main : : Node : GroupBy : 4:9 : No grouping column included. Aggregate complete table.

The first entry, “no grouping column included”, is probably not a problem; it looks like it is shown for every iteration.
Any tricks for this “NullPointerException”: null? Should I monitor something specific?
It seemed to me like some buffer overflowed, but the RAM usage on my machine never went over 5.3 GB. Or is there some other workaround, like streaming?

Edit2: I cannot cancel or even delete the Window Loop Start node. I could only cancel the Loop End node and the nodes in between.
Thanks for your help so far have a good weekend!

Hi there @TotalDataLoss,

This warning is not a problem, as it only notifies you that no grouping columns were chosen in the GroupBy node.

Regarding the NullPointerException in the Loop End node: is there a possibility you could share the data so we can investigate? I can send you a PM if you don’t want to share it publicly.

Br,
Ivan

My first reaction is that there’s something sketchy happening in the table. The Loop End node gets really unhappy if it doesn’t have identical table specs for each iteration.

If you know what iteration the loop was on when it failed, you can figure out what rows it was processing, and then use an upstream row filter node to only run those rows. If these rows don’t run successfully then maybe you’ve found the problem.

1 Like

Hi Ivan and elsamuel,
yes, you can actually download the data I used for testing; it is publicly available bitcoin price data:
http://api.bitcoincharts.com/v1/csv/bitstampUSD.csv.gz

@elsamuel: I will try to test this later, but my gut feeling is that this is not the case (the data comes directly from an API and is not aggregated or anything). My gut feeling is that this window function breaks down once files grow over 1 GB. A friend of mine told me something similar a while ago about Python. I think it was in pandas, with some form of rolling window function, which sounded very similar to the Window Loop node. If files grew too big, it even produced erroneous output data, but I don’t remember exactly.
Anyway, does somebody know how I could find out in which line/iteration this happens? I couldn’t find it in the logs, and Window Loop does not seem to have a flow variable that counts the iterations. What else could I do?

greetings…

I forgot, this is the workflow I used:

This is almost identical to the workflow elsamuel posted; I only changed some settings, but it’s better for testing to have identical data and process, so I posted it.

Greetings

I just used the .csv.gz as the input and ran my workflow. It completed successfully after about 70 mins, producing 756242 rows of OHLC data.

I’ll be testing it again with the workflow you posted to see if I can reproduce your error.

What version of KNIME are you using? I’m on 4.1.2 on a Windows machine.

2 Likes

Hi elsamuel,

  1. Could you post the output file(s)? I ran the workflow again, and this time it finished without error too. But now I am a bit worried about data integrity and would like to compare the results with yours. Actually, I would be interested in both files: the result from your workflow, and from mine too if you ran it again. In a similar project, a Python script had the tendency to sometimes finish and sometimes not, but analysis of the data showed that data points were missing. This error occurred with files of roughly >= 1 GB in size, so I would like to check that nothing similar happens in KNIME.

  2. Is there another way/node to achieve a similar result, but such that the timestamp in the output file is the beginning of the interval and not the date&time of the first entry that lies within that interval? For example, in my version of the workflow I set the “start at” time to 21:45:00; what Window Loop does is take the first actual database entry that lies in that interval, which is:

    “2013-09-05T21:45:13”
    and the second line is:
    “2013-09-05T21:51:02”

What I am looking for is that the timestamp would be the starting time of the interval, like:

     "2013-09-05T21:45:00" 
     "2013-09-05T21:50:00"
     "2013-09-05T21:55:00"

and so on
That is actually the way almost all financial software does this: it takes the beginning of the defined interval at a “round time”. I actually kind of like the way KNIME does it - the first timestamp is more “correct”. But it is not very compatible with other data, and if done the standard way, I could compare it to some old data that was aggregated before KNIME and check the accuracy better. (Another problem is that KNIME’s way leads to gaps between two bars, which could cause other problems, so it would be nice to have both options anyway.)

I thought about taking the unix timestamp and rounding it down to the beginning of the time interval, but whatever I tried, nothing seemed to work - I couldn’t come up with a way to aggregate the data into the fixed timeframe. Maybe there is a way to do this?
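Rounding the unix timestamp down to the interval start does work for grouping; the trick is to use the floored value itself as the grouping key. A pandas sketch under that idea (column names and sample ticks are assumptions; the timestamps land near the 2013-09-05 21:45 example above):

```python
import pandas as pd

# Hypothetical ticks around 2013-09-05 21:45 UTC (unix seconds)
df = pd.DataFrame({
    "timestamp": [1378417513, 1378417862, 1378417920],
    "price":     [120.10, 120.50, 119.90],
    "volume":    [1.0, 2.0, 3.0],
})

# Floor each timestamp to the start of its 5-minute (300 s) interval
df["bin"] = df["timestamp"] - df["timestamp"] % 300

bars = df.groupby("bin").agg(
    O=("price", "first"),
    H=("price", "max"),
    L=("price", "min"),
    C=("price", "last"),
    V=("volume", "sum"),
).reset_index()

# The bin column is the "round" interval start the charting tools expect
bars["bin"] = pd.to_datetime(bars["bin"], unit="s")
```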

Thanks for your professional help so far!
Have a great day

PS: KNIME version 4.1.2, Windows. And I checked the knime.ini: RAM usage was set to just 2048 MB (via -Xmx2048m); I increased it to 3072 MB…

1 Like

Here’s the output data from my workflow: https://www.dropbox.com/s/9nrk5u3jbsmbmww/OHLC_data.csv

I tried your workflow twice and both times, the Window Loop Start node timed out. The first time, it hung after about an hour, the second time, it hung after about 2 hours. I’m not sure why this happened. As far as I can tell, the only difference between your workflow and mine is that you start your window loop at a different time.

Going back to look more closely at the actual data, I see that I chose a window size (5 mins) that was far too small. I’ve also noted your desire to have better control over the time interval reporting, so I made some changes to my workflow. Here’s what I’ve played around with so far:

  1. The user specifies a start and end time, as well as a duration. These variables are used to:
    1. filter the original data so that only rows in the desired time window are processed further
    2. define bins into which the filtered data will be sorted
  2. The filtered price-volume data is assigned a bin
  3. The GroupBy node groups the data by bin, then carries out the aggregation
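A rough outside-KNIME analogue of these three steps (the start, end, and duration variables and the column names are assumptions for illustration):

```python
import pandas as pd

# Assumed user-supplied window: start, end, and bin duration
start, end, duration = "2020-03-02", "2020-03-07", "1D"

df = pd.DataFrame({
    "time":   pd.to_datetime(["2020-03-02 10:00", "2020-03-02 15:00",
                              "2020-03-04 09:30", "2020-03-08 12:00"]),
    "price":  [8900.0, 8950.0, 8800.0, 9100.0],
    "volume": [1.0, 2.0, 3.0, 4.0],
})

# 1. keep only rows inside the desired time window (right boundary excluded)
df = df[(df["time"] >= start) & (df["time"] < end)].copy()

# 2. assign each row to a bin defined by pre-computed edges
edges = pd.date_range(start, end, freq=duration)
df["bin"] = pd.cut(df["time"], bins=edges, right=False, labels=edges[:-1])

# 3. group by bin and carry out the aggregation
bars = df.groupby("bin", observed=True).agg(
    open=("price", "first"), high=("price", "max"),
    low=("price", "min"), close=("price", "last"),
    volume=("volume", "sum"),
)
```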

To shorten the runtime, I tested it using March 2 - 6, 2020 as the time window of interest and defined an interval of 1 day. I tried to cross-check the output against an actual BTC candlestick chart, but the numbers didn’t quite match up. There may be gaps in the csv file.

The outputs and workflow are in the following KNIME Hub folder:

2 Likes

Hi elsamuel,
thanks for posting!

I tried your workflow twice and both times, the Window Loop Start node timed out. The first time, it hung after about an hour, the second time, it hung after about 2 hours. I’m not sure why this happened.

The Window Loop seems to crash easily. I had similar troubles. The first time it ran through, but it took very long: 2:45 h. Then I found the “Process in Memory” checkbox in the GroupBy node and checked it - I thought that would make it quicker. But unfortunately, after ~2:30 h it crashed the whole KNIME instance. When I restarted KNIME, even the settings of some nodes of that workflow had reverted to old settings… Something strange was going on. I just hope that when it did run through, the data was correct! Great that you posted yours.

Going back to look more closely at the actual data, I see that I chose a window size (5 mins) that was far too small.

Ah, the 5-min window is a requirement that cannot be changed. The process before was a cumbersome way of doing things with Access - and Access crashed under the 1.6 GB file. I had heard of KNIME before, so I thought this was the right time to check it out. Anyway.

Unfortunately I have been pretty busy with other stuff today (and still am).
I will have a look at it tomorrow and answer you.

Thanks so far elsamuel! Nice evening

Hi elsamuel,
I want to give a quick update on where I am at. Unfortunately, depressing news.
Today I compared the data that KNIME aggregated with Window Loop against some old data (aggregated with Access etc.).

Comparing the two datasets showed literally thousands of differences in the output file.
I spent the day researching this. What I found out is that the “Window Loop” node has several bugs. (Note: there might be more errors; I went through the first couple of errors, not all of them.)

Could someone from support please look into this, like ipazin or someone else - this needs some development attention. I would also be interested in a working solution, since I can’t use Window Loop (and right now I wouldn’t recommend it to anybody else). Here is what I found out so far:

  • Output csv file contained double entries like:
    21.11.2011 12:10:00,2.45,2.45,2.45,2.45,39000
    21.11.2011 12:10:00,2.45,2.45,2.45,2.45,39000
    Or
    14.09.2011 07:00:00,5.58,5.58,5.58,5.58,12549
    14.09.2011 07:00:00,5.58,5.58,5.58,5.58,12549
    but there are many more. This can probably be fixed with some form of duplicate removal.

More serious is that the process does not exclude the right border as it should.
The Window Loop aggregated data showed, for the candle of 27.2.2012 13:30, a volume of 30 300; comparing this with the old data that was aggregated without KNIME showed 25 500. The difference is 4800. Researching this in the input data (I changed the timestamp from unix to readable):
27.2.2012 13:33:59;5,08;5300
27.2.2012 13:34:12;5,08;5200
27.2.2012 13:34:24;5,08;5100
27.2.2012 13:34:36;5,08;5000
27.2.2012 13:34:50;5,08;4900
27.2.2012 13:35:00;5,08;4800

It shows that Window Loop wrongly included the right border, since that tick sits exactly on the edge. If one aggregates data to OHLCV, the standard is to go from the lower bound to the last tick (row in the database) before the next step begins.
The documentation describes it that way too:

Step size
The step size is the distance between the starting point of one iteration and the starting point of the next. It is defined in terms of number of rows covered (row based) or time.
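The half-open convention means a tick sitting exactly on the boundary (13:35:00 here) must open the next bar, not close the current one. Using the rows quoted above, a pandas sketch with closed="left" reproduces the expected 25 500:

```python
import pandas as pd

# The six ticks from the post, timestamp;price;volume (volume only needed here)
ticks = pd.DataFrame({
    "time": pd.to_datetime([
        "2012-02-27 13:33:59", "2012-02-27 13:34:12", "2012-02-27 13:34:24",
        "2012-02-27 13:34:36", "2012-02-27 13:34:50", "2012-02-27 13:35:00",
    ]),
    "volume": [5300, 5200, 5100, 5000, 4900, 4800],
}).set_index("time")

# closed="left" makes each bar the interval [start, start+5min)
vol = ticks["volume"].resample("5min", closed="left", label="left").sum()

# The 13:35:00 tick falls into the 13:35 bar, so the 13:30 bar
# must NOT include its 4800 of volume
```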

Another problem: all in all, the process crashes or doesn’t finish most of the time. elsamuel had similar problems too. I ran the process 4 times, and 3 times it crashed/didn’t finish. The data above is from the one time it did finish. And when it did finish, it was a pretty slow process too…

To reproduce:
Input Data: http://api.bitcoincharts.com/v1/csv/bitstampUSD.csv.gz
https://hub.knime.com/totaldataloss/spaces/Public/latest/Tickdata%20to%20OHLC_V2

@elsamuel thanks for sharing your data; I ran your process again last night, but it crashed after just 2K lines. The good news: the aggregated data was the same in both files - so, except for the duplicate entries, the aggregation itself seems to be stable in what it does.

I will try to have a look at your new process tomorrow.

Again, to support: if there is a better way of doing this, I would be really interested in a solution or a better idea of how to do it.

Greetings…

Hi elSamuel,

I tried your workflow twice and both times, the Window Loop Start node timed out. The first time, it hung after about an hour, the second time, it hung after about 2 hours. I’m not sure why this happened. As far as I can tell, the only difference between your workflow and mine is that you start your window loop at a different time.

Yes, as I wrote, it was quite unreliable for me too, and it happened with your very similar process as well. So I think it’s not the changed “start at” time, but rather the Window Loop node in general. :anguished:

About the performance of Window Loop: I noticed that a data file with 100K rows finishes in 8 seconds. That would mean the 35.75 million lines of the full file should finish in roughly 48 mins. But it actually took much longer: 2:45 h (the one time it did finish). So I wonder: is there a way in KNIME to check whether the RAM assigned to KNIME is insufficient and it had to write to disk, so that one can know if RAM is the problem?

Anyway. Busy week, but now I have had time to study your process. It was interesting to see how you got the table from the Create Date&Time Range node into the other table with the Cross Joiner node :+1: I had thought about how to do this for a long time, but couldn’t come up with a solution.

The problem with that solution is that the Cross Joiner combines every row of table 1 with every row of table 2. With just 10K rows and a 1 h interval it already took over 2 min to finish; with 36 million rows and a 5-min interval it would take about 1430 h to finish. You already mentioned this.

So I thought maybe I can build this in a way that is faster and avoids the errors of Window Loop. I was experimenting with the following:

Put the value of the first time interval as a unix timestamp into a flow variable:
TimestampFirstCandle = 1442824500
UpperBound = TimestampFirstCandle + 300 (5 min = 300 sec)
CandleNumber = 1 (the number of the interval, so that one can GroupBy it later)
Then use a Math node to check whether Timestamp < UpperBound; if yes, write the variable CandleNumber to a column.
If Timestamp >= UpperBound, increase CandleNumber by 1 AND increase UpperBound by 300.

But I didn’t get anywhere, since with the combination of checking data fields for conditions, then changing data fields and flow variables, and checking several conditions at the same time, I always ran into a problem. But I must say my knowledge of loops is very limited; I think there must be a way to do this. And it’s probably faster to compare unix timestamps than fully written dates.
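For what it’s worth, the bin-number idea itself is sound. As a plain loop it would look roughly like this (variable names taken from the description above; the tick values are made up for illustration):

```python
# Assign an increasing candle number to each tick, advancing the upper
# bound by 300 s whenever a tick crosses it (mirrors the flow-variable idea)
TimestampFirstCandle = 1442824500
interval = 300  # 5 min

ticks = [1442824510, 1442824600, 1442824805, 1442825110]  # unix seconds

candle_number = 1
upper_bound = TimestampFirstCandle + interval
numbered = []
for ts in ticks:
    # advance in a while-loop so empty intervals are skipped correctly
    while ts >= upper_bound:
        candle_number += 1
        upper_bound += interval
    numbered.append((ts, candle_number))
```

One can then GroupBy the candle number. Note the inner while: a single if would mis-number ticks after a gap of more than one empty interval.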

So I hope support can fix the Window Loop node or has an idea how one can build an efficient enough loop; otherwise I would be out of luck with KNIME. That would be sad - I really liked the graphical drag & drop interface. :cold_sweat:

Nice weekend…

Hi @TotalDataLoss -

I have read through this thread, but must confess I haven’t digested everything yet - there is a lot of detail. One thing that did jump out at me earlier was that you increased your available RAM in KNIME at one point to 3072 MB. This still strikes me as pretty low if you are dealing with files that are close to 2 GB themselves. Any chance you can bump it up significantly? I would try for something like 8GB or even more, if possible.

Hi ScottF,
unfortunately I just have access to this machine with 8 GB RAM total, so I guess I could bump it up by 1-2 GB. But I cannot fathom that all the inconsistencies/errors in the output are due to RAM - maybe that the process stopped often, yes. But especially the error that the volume of the right-border tick 27.2.2012 13:35:00;5,08;4800 was included in the sum, which it shouldn’t be. Second, there are a ton of duplicate entries, which there shouldn’t be - it’s a bit worrisome what the node did there (compressed data should never have duplicates like this). So it would really help if you (or someone) could run the workflow with the input data (links in the post) and have a look at the mentioned problems.

Thanks for reading all of it, but the main part really is this (I quote it again for you):

Comparing the two datasets showed literally thousands of differences in the output file.
I spent the day researching this. What I found out is that the “Window Loop” node has several bugs. (Note: there might be more errors; I went through the first couple of errors, not all of them.)

Could someone from support please look into this, like ipazin or someone else - this needs some development attention. I would also be interested in a working solution, since I can’t use Window Loop (and right now I wouldn’t recommend it to anybody else). Here is what I found out so far:

  • Output csv file contained double entries like:
    21.11.2011 12:10:00,2.45,2.45,2.45,2.45,39000
    21.11.2011 12:10:00,2.45,2.45,2.45,2.45,39000
    Or
    14.09.2011 07:00:00,5.58,5.58,5.58,5.58,12549
    14.09.2011 07:00:00,5.58,5.58,5.58,5.58,12549
    but there are many more. This can probably be fixed with some form of duplicate removal.

More serious is that the process does not exclude the right border as it should.
The Window Loop aggregated data showed, for the candle of 27.2.2012 13:30, a volume of 30 300; comparing this with the old data that was aggregated without KNIME showed 25 500. The difference is 4800. Researching this in the input data (I changed the timestamp from unix to readable):
27.2.2012 13:33:59;5,08;5300
27.2.2012 13:34:12;5,08;5200
27.2.2012 13:34:24;5,08;5100
27.2.2012 13:34:36;5,08;5000
27.2.2012 13:34:50;5,08;4900
27.2.2012 13:35:00;5,08;4800

It shows that Window Loop wrongly included the right border, since that tick sits exactly on the edge. If one aggregates data to OHLCV, the standard is to go from the lower bound to the last tick (row in the database) before the next step begins.
The documentation describes it that way too:

Step size
The step size is the distance between the starting point of one iteration and the starting point of the next. It is defined in terms of number of rows covered (row based) or time.

Another problem: all in all, the process crashes or doesn’t finish most of the time. elsamuel had similar problems too. I ran the process 4 times, and 3 times it crashed/didn’t finish. The data above is from the one time it did finish. And when it did finish, it was a pretty slow process too…

To reproduce:
Input Data: http://api.bitcoincharts.com/v1/csv/bitstampUSD.csv.gz
https://hub.knime.com/totaldataloss/spaces/Public/latest/Tickdata%20to%20OHLC_V2

Hi there @TotalDataLoss,

I’ll check it and get back to you.

Br,
Ivan

Hi there @TotalDataLoss,

have read both of your topics, investigated a bit and gave it some thought. So in theory the Window Loop node should be used in this case. However, the duplication problem you are seeing is known, and there is a ticket in the system to handle it (internal reference: AP-11610). Considering that, and the amount of data you have (which causes slowness and memory issues), I suggest a different approach (without a loop). I will add my suggestion in a new topic - once I polish it successfully :smiley:

Br,
Ivan

1 Like

Hi Ipazin,
Ah, you already linked this to the other thread in case someone is looking for a solution.

However problem with duplication you are seeing is known and there is ticket in system to handle it (internal reference: AP-11610)
It’s not just the duplicates; more worrisome, I found that the right border was included (which it shouldn’t be). Maybe that can be added as a second ticket…

Considering that and amount of data you have (which causes slowness and memory issues) I suggest different approach (without a loop).

Your prediction was so right :smiley:

1 Like