Referencing previous row

Hello everyone!

I’ve just downloaded and started using KNIME, so I’m a complete newbie. I was wondering if it is possible to reference data from the previous row in the table. I have a data table that contains values by dates:

Date Value1 Value2 …
1 2 3
2 4 5

What I need is to calculate the change in the value relative to the previous date:

Date Value1 ChangeValue1…
1 2 -
2 4 (4-2)/2

So I need to reference the value from the previous row somehow. Can anybody show me how to do that?

Thanks.

Hi,

As far as I know this is not possible in KNIME, at least not with the existing nodes. I posted a while back on this: http://www.knime.org/node/245 .

Here is my workaround.

Take table one

RowID Entity Value
Row1 A 123
Row2 A 124
Row3 A 125
Row4 B 125
Row5 B 124
Row6 B 123

Make a second table with a new RowID offset by the desired lag period.

RowID Entity Value
Row2 A 123
Row3 A 124
Row4 A 125
Row5 B 125
Row6 B 124
Row7 B 123

Then execute an inner join.

RowID Entity Value Entity_lag Value_lag
Row2 A 124 A 123
Row3 A 125 A 124
Row4 B 125 A 125
Row5 B 124 B 125
Row6 B 123 B 124

You’ll notice that you now have fewer rows, and that some rows in the merged table don’t match. You’ll need to filter out the rows with non-matching entities.
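
Since I can’t post a workflow right now, here is the same logic as a rough plain-Java sketch (the Row type, the map-based “tables” and the lag of 1 are just illustrative, this is not KNIME code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch of the lag-by-join workaround (not KNIME code).
// The Row type, the map-based "tables" and the lag of 1 are illustrative only.
public class LagByJoin {
    record Row(String entity, double value) {}

    public static void main(String[] args) {
        Map<Integer, Row> table = new LinkedHashMap<>();
        table.put(1, new Row("A", 123));
        table.put(2, new Row("A", 124));
        table.put(3, new Row("A", 125));
        table.put(4, new Row("B", 125));
        table.put(5, new Row("B", 124));
        table.put(6, new Row("B", 123));

        // Step 1: second table with the row key shifted by the lag period.
        int lag = 1;
        Map<Integer, Row> lagged = new LinkedHashMap<>();
        table.forEach((rowId, row) -> lagged.put(rowId + lag, row));

        // Step 2: inner join on the row key.
        // Step 3: keep only rows whose entities match (drops the A/B boundary row).
        table.forEach((rowId, row) -> {
            Row prev = lagged.get(rowId);
            if (prev != null && prev.entity().equals(row.entity())) {
                double change = (row.value() - prev.value()) / prev.value();
                System.out.printf("Row%d %s: %.0f -> %.0f, change = %.4f%n",
                        rowId, row.entity(), prev.value(), row.value(), change);
            }
        });
    }
}
```

Once the lagged value sits next to the current one, the relative change from the original question falls out of the join directly.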

I apologize if this answer is a little unclear; I don’t have KNIME on the machine I’m using right now. I’m happy to answer any questions and help however I can. It would be really useful for data manipulation tasks to have this kind of functionality supported in KNIME in an easier way. Specifically, from the SQL standard, the functionality provided by RANK, LAG, OVER, PARTITION BY, GROUP BY, and WINDOW would be useful when working with transactional, longitudinal, and time-series data.

That being said, KNIME is a fantastic tool that is coming along well.

What type of work are you doing? What area do you work in? I’m an analyst in the marketing/sales area of a financial firm. I tend to work with this type of data often in my tasks.

Best regards,

Jay

Hello Jay,

Thank you for your answer. I was thinking about creating a duplicate table, but was hoping to find a more straightforward solution. I ended up preparing the data with a small external application.
This is not for my work; I’m just playing with different data mining tools.
The inputs are financial data: some mutual fund prices and Russian stock market indices (the only real-world data I can get).
I’m a software developer and these tests have nothing to do with my job. I’m just fascinated by data mining and machine learning, and I’m hoping to get a real-world data mining task sometime in the future.

Best regards,
Max

Hi,

Working with data encompassing multiple observations (time series, event, longitudinal) is a very useful ability. Unfortunately, most packages are highly specialized and memory-bound.

The big advantage of KNIME is its ability to work, in most cases, with datasets larger than available memory. I have worked with a lot of different packages (statistical, data mining, etc.) in both research and application environments, and KNIME is quite excellent, especially given how new it is and that it’s open source.

There are multitudes of real datasets available for public use. Many come from competitions such as the ones listed here: http://www.kdnuggets.com/datasets/competitions.html . The only thing is that these tend to hide many, if not most, of the initial data preparation steps. Financial data is another source, but it can be hard (or expensive) to come by once you get into more specialized information. Financial datasets can also become very large quickly, so KNIME-type scalability is very critical.

What other tools have you been working with?

Best regards,

Jay

Hi,

Indeed, KNIME seems to be a pretty powerful tool. What I liked most at first glance is that:

  1. It is based on a visual workflow. Another very powerful open source data mining tool, RapidMiner, has a tree-based workflow, and for me that is much harder to work with.
  2. In KNIME it is possible to create R and Python nodes. I’m not an expert in either, but I think this feature will make it possible to do some cool things.

Other tools I’ve tried are Clementine, Statistica, PolyAnalyst, KXEN, and RapidMiner. I only spent several hours in Clementine and Statistica, and it has been a while since I used them, so I cannot say much about them. Three weeks ago I was able to get an evaluation of KXEN, but it was only for 15 days.
PolyAnalyst is the tool I have used a lot. I first tried it about 4 or 5 years ago, when it was version 4.6. Now I’m trying to work with PolyAnalyst 6.0, but it is still in beta. One feature of PolyAnalyst I like very much is its so-called exploration engine, “Find Laws”. In a few words, it is an evolutionary algorithm that creates human-readable formulas expressing relationships in the data. On my data this algorithm generated models that were more robust than any other algorithm, including decision trees, neural networks, and SVMs.

At the moment I’m exploring open source tools: RapidMiner and KNIME. I also had a look at Weka, but its algorithms are included in RapidMiner and KNIME, so it is much easier to use them from those tools. RapidMiner impressed me with a huge number of different algorithms. Unfortunately, some of its features are inconvenient, like the tree-based workflow.

Best regards,
Max

Hi,

PolyAnalyst is one tool I haven’t worked with in the past. Is it scalable to large datasets? Can it do the type of data prep you were asking about here? I’ve worked with Clementine, which is quite a nice tool. I really like R, but as with most tools the wall I quickly run into is scalability. I’ve found it hard to find a complete tool for pulling datasets, exploring/profiling data, transforming data, sampling, modeling, model evaluation, performance estimation, and ultimately some form of deployment (PMML, predicted values, etc.).

RapidMiner implemented the “windowed” modeling meta-scheme I suggested a while back, and I think it’s still in there. That is a type of testing useful for financial data, as most attributes are not time-invariant. The issue with that program, again, is scaling to larger datasets. I’m with you too; I find several of the main aspects of the tree workflow cumbersome to work with. If they could hide the need for some of their process operators, it would simplify things.

The rule engine sounds interesting. Are you working on investment models for personal use?

All the best,

Jay

Info on the RapidMiner functionality:

http://sourceforge.net/forum/message.php?msg_id=4340331

http://sourceforge.net/tracker/index.php?func=detail&atid=667393&aid=1729081&group_id=114160

Hi,

I haven’t used large datasets in PolyAnalyst. My typical configuration has at most several thousand rows and 30 to 50 columns. But here are some quotes from the PolyAnalyst 5 help:

| Algorithm | Max records | Max practical attributes |
|---|---|---|
| Clustering | 3,000,000 | 3,000 |
| Decision Tree | 5,000,000 | 3,000 |
| Decision Forest | 10,000,000 | 3,000 |
| Find Dependencies | 1,000,000 | 3,000 |
| Find Laws | 1,000,000 | not given |
| Linear Regression | unlimited | 3,000 |
| Market Basket Analysis | 3,000,000 (*) | not given |
| Memory Based Reasoning | 100,000 | 300 |
| PolyNet Predictor | 1,000,000 | 3,000 |

(*) In the Transactional Basket Analysis implementation of this algorithm, where each purchased item is represented by a separate record, the maximum number of records is 100,000,000.

But these numbers are for versions 4.6 and 5.0 of PolyAnalyst, which are now 7 years old. Version 6.0 is still in development and there is a 64-bit version. I think it will have more impressive numbers.

In PolyAnalyst I can easily access the previous or next row. It has its own formula language, and accessing columns from different rows is done like this:

[Col1]{-10}

where the number in curly braces is the offset between rows. In this case I’m accessing Col1 from the row that is 10 rows behind the current row. The offset can also be positive. So it’s pretty easy.
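
In case the notation is unclear, this is roughly what it does, sketched in Java (the names are mine, this is not PolyAnalyst code):

```java
// Rough Java analogue of PolyAnalyst's [Col1]{offset}: read Col1 from the row
// `offset` positions away from the current one. All names here are illustrative.
public class OffsetLookup {
    static double colAt(double[] col1, int currentRow, int offset) {
        int target = currentRow + offset;            // e.g. offset = -10 looks 10 rows back
        return (target >= 0 && target < col1.length) ? col1[target] : Double.NaN;
    }

    public static void main(String[] args) {
        double[] col1 = {100, 102, 101, 105, 107};
        System.out.println(colAt(col1, 3, -1));      // value one row behind row 3 -> 101.0
    }
}
```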

The most scalable tool I’ve ever seen is MATLAB with its parallel processing toolbox. But I haven’t used that toolbox myself, because I have never had such large datasets. And of course MATLAB is not a data mining or business intelligence tool per se, so it might require some additional programming.

I guess there’s no universal tool, at least not at the moment. I constantly find myself preparing data with one tool (or even with some little program I wrote) and using that data in another tool. Data preparation is the most painful part of data mining, I guess.

Several years ago I even wrote a program that implemented a data mining algorithm, the so-called Group Method of Data Handling (GMDH), because I couldn’t find an implementation in any other tool. GMDH has a number of interesting features: it automatically controls the complexity of the model, it can be used on a very small number of samples, and it is fast. But it isn’t a popular approach, for some reason.

Yes, I’m developing an investment model for myself. I’m not trying to predict stock or index prices, because I’ve tried that several times before and couldn’t create any useful model for the Russian stock market. Instead, I have created a model that predicts a particular mutual fund’s price. The broker company only provides this price with a one-day lag because, as they say, there’s a complicated price calculation process involved. I was able to create a pretty accurate and robust model for this price.

Best regards,
Max

Hi,

Thanks for the info.

Yes, data prep is not only difficult but also critical. I often find that simpler modeling techniques can obtain superior results when the problem is properly characterized through data prep.

Some simple automations would be useful: creating graphics for a number of variables at once, for example (histograms, etc.), and computing continuous and categorical summary statistics (maybe with the graphics?) for a number of variables. Reshaping and aggregating multiple observations per entity is important and useful, as is the ability to transform a number of variables at once and avoid the “node explosion” that makes workflows cumbersome and difficult to work with.

Other things, like enhancements to existing modeling nodes to provide more information on the developed models, are helpful and probably another good area for improving the existing KNIME nodes.

Little things like these go a long way. Unfortunately, I’ve been insulated from actual programming by having used SAS, SPSS, S-Plus/R, MATLAB, RDBMSs & SQL, etc. So right now I am working to get my head around Java, Eclipse, and KNIME in one go. I’ve managed to make some “edits” which have been helpful in my work, but these were very ad hoc and simple modifications. I have started working on a more complete augmentation of the statistics gathered and stored, to enhance downstream data exploration and modeling nodes, but unfortunately my time is somewhat limited right now. Thankfully, my dislike for how many pieces of software are licensed is propelling me forward here.

My professional work centers around customer/product/campaign analysis in a sales & marketing department, but I’ve been developing financial models for a while now and am always interested in chatting.

I’d have thought we would have drawn some commentary from other KNIME users by now. Anyone, please feel free to comment. I noticed from the release of the KNIME support packages that there may be other “business” users out there. Who out there is using KNIME for customer analytics?

Best regards,

Jay

All -

I recently got interested in KNIME, primarily driven by economics. I just wanted to introduce myself and share a few thoughts. I run a small analytics practice and have typically relied on commercial tools for my consulting engagements. More specifically, I have been an SPSS Clementine user for about 10+ years (and to a very, very small extent a SAS EM user) and a real power user in the sense that I basically use it all waking hours : )

I’m really upset at these commercial vendors. For the copy of Clementine that I have, I paid about 30k. On top of that, now, unless my client buys a license of Clementine, they want a cut of all the consulting projects the tool is used in. SAS has the same model as well. The notion these vendors have is that it is the tool that delivers value to the clients, and not the miner/modeler. Anyway, I am getting really interested in open source data mining solutions. I had tried WEKA in the past, and it was not robust enough for commercial applications.

At first glance, KNIME seems to have come very far. It probably still has a long way to go to get close to commercially hardened tools (which benefit from many years and thousands of users’ feedback), but it is very encouraging. In one day, I was able to learn the tool well enough to get around and manipulate data, and even build models and examine them. One thing is bugging me that I haven’t figured out yet: when I connect a linear regression learner to a regression predictor, I can’t seem to make it work. The light on the predictor stays red. I can see the model in the learner (the coefficients), and I am able to connect the output from the learner to the predictor, but I can’t execute the predictor. Any help for this newbie is appreciated.

I see that they don’t have logistic regression, which is one I use a lot.

BTW, Max, I run into that problem very often as well. In Clementine there are some very nifty functions to do that (OFFSET, etc.). And within a day of using KNIME, I was wondering the same thing: does KNIME support that? Because I run into it often. When I have to rely on databases to achieve it, I have used the approach Jay suggested as a workaround.

Appreciate your read of my ramblings.

Satheesh

Hi Satheesh,

Thanks for jumping in! I find it very hard to gauge the KNIME community that’s out there most of the time. Clementine is probably the nicest tool I’ve used as well, but unfortunately, as you’ve mentioned, the SPSS licensing isn’t the greatest (à la SAS…).

No, there is no logistic regression yet. There is also presently no support for OFFSET/LAG in KNIME. I’ve been thinking about where this would fit, perhaps in the Java Snippet node? That node presently performs its action once per row. I wonder what it would take to extend it so as to give it access to one or more “prior” rows. Perhaps this also fits into the GroupBy node; most modern RDBMSs support this now.
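
To make the idea a bit more concrete, here is a very rough plain-Java sketch of what I mean; this is definitely not KNIME’s actual node API, just the shape of it: the framework would keep a small buffer of prior rows and hand it to the per-row code along with the current row.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Rough sketch of the idea only -- not KNIME's actual node API. A per-row
// processor is handed the current value plus a small buffer of prior values.
public class PriorRowSketch {
    interface RowProcessor {
        // prior.peekLast() would be the immediately preceding row's value.
        void process(double current, Deque<Double> prior);
    }

    static void run(List<Double> column, int window, RowProcessor processor) {
        Deque<Double> prior = new ArrayDeque<>();
        for (double value : column) {
            processor.process(value, prior);
            prior.addLast(value);                 // remember this row for later rows
            if (prior.size() > window) {
                prior.removeFirst();              // keep only `window` prior rows
            }
        }
    }

    public static void main(String[] args) {
        // Relative change vs. the previous row, as in the original question.
        run(List.of(2.0, 4.0, 5.0), 1, (current, prior) -> {
            Double previous = prior.peekLast();
            System.out.println(previous == null
                    ? current + ": -"
                    : current + ": " + (current - previous) / previous);
        });
    }
}
```

With a buffer of one prior row, the per-row code can compute exactly the relative change Max asked about at the top of the thread.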

Presently at the top of my wish list are:

  1. ranking inside an arbitrary number of levels (groupings)
  2. an interface around the Java Snippet and Math nodes allowing the creation of multiple variables inside one node
  3. an extension to 2) above: the ability to operate on a number of columns (“all numeric”, etc… perhaps in the Math node?)

In general: new modeling methods (a lot of typical statistical stuff), scaling of the platform towards working with temporally oriented datasets (at least so base data can be prepared and fed to models), and handling many variables. This centers around data manipulation capabilities such as those described in this thread and others, as well as variable derivation interfaces and statistical summaries/graphics for a number of continuous and categorical variables at once.

I’ve been starting on these things on my own, but much of it is in the form of small extensions inside existing nodes… I am far from a Java programmer, or even a programmer at all. My world has mostly been the command/scripting languages inside statistical software and relational databases. So I’ve been working to learn Java/Eclipse and KNIME all at once.

Is the linear regression node showing a warning symbol or anything? Is there any message when you put your mouse over it? So the learner executes and completes correctly and you can see the view, etc., but it simply won’t connect to the predictor? Is there any difference in the structure of the dataset you’re connecting to the predictor?

Best regards,

Jay

Hi Satheesh,

thanks, Jay, for asking all the right questions: the red error marker should show a pretty good explanation when you mouse over it. Can you try feeding in the training data, just to quickly find out whether that works? If so, the structure of your test data may not match what the predictor expects.

As for logistic regression: it’s on our list, but I cannot promise it will be part of the next release. However, a basic time series plugin is in the works and should show up on the labs pages soon.

Cheers,
Michael

PS: As for the community: recently, new minor releases tend to get over 2,000 downloads in the following week or so.

Hi Michael,

Thanks for the update!

Best regards,

Jay

I just came across this looooooong forum thread and wanted to quickly come back to the initial problem, which was about “referencing the previous row”. We have added that feature to the Java Snippet node; it will be available in v2.1. We added an editor for a custom class header, which allows the user to define class fields to store, e.g., the values from a previous row. We also added array support (using KNIME collection cells).
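
In plain Java terms, the pattern the class fields enable looks roughly like the stripped-down sketch below (the names are only for illustration; in the node you put the field into the custom class header and the per-row code into the snippet body):

```java
// Stripped-down plain-Java illustration of the class-field pattern. The names
// are illustrative; this is not the snippet dialog itself.
public class PreviousRowSnippet {
    // Class field: survives from one row invocation to the next.
    private double previousValue = Double.NaN;

    // Stands in for the code executed once per row.
    String processRow(double value1) {
        String change = Double.isNaN(previousValue)
                ? "-"                                         // first row has no predecessor
                : String.valueOf((value1 - previousValue) / previousValue);
        previousValue = value1;                               // remember for the next row
        return change;
    }

    public static void main(String[] args) {
        PreviousRowSnippet snippet = new PreviousRowSnippet();
        for (double v : new double[]{2, 4, 5}) {
            System.out.println(v + " -> " + snippet.processRow(v));
        }
    }
}
```

With a single field like this, the relative change asked about in the very first post becomes a one-liner per row.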

Cheers,
Bernd

Hi Bernd,

Yes the thread got wayyy off topic! :wink:

Amazing! Thanks, Bernd. It sounds like you are done, but if you need external testers or anything to help the process along, please let me know.

One quick question: how is “previous” set up for a column? Is it one back or some number back?

Thanks,

Jay

Question is - when will 2.1 be out? :wink:
 
Cheers,
E.

…if everything goes well, at the end of October. We are shooting for a code freeze in three weeks.

Michael

> how is “previous” set up for a column? Is it one back or some number back?

Hi Jay,

I don’t understand your question. Are you talking about previous rows (but wrote “column”), or did you actually mean previous columns? If it’s the latter, you will need to clarify.

As for previous rows: that pretty much depends on what global fields you define. If you use a Java collection, you can get the entire table into memory. Not that we would recommend doing this, but… yes, it’s possible.

Hi Bernd,

I was thinking of a single field. The question was in regard to how “previous” is defined for a row, as you thought.

Thanks,

Jay

This looks promising, thanks guys! :slight_smile:

E.