Hi,
I haven’t used large datasets in PolyAnalyst. My typical configuration has at most several thousand rows and 30 to 50 columns. But here are some quotes from Polyanalyst 5 help:
Clustering
Maximum of 3,000,000 records
Max practical number of attributes: 3,000
Decision Tree
Maximum of 5,000,000 records
Max practical number of attributes: 3000
Decision Forest
Maximum of 10,000,000 records
Max practical number of attributes: 3000
Find Dependencies
Maximum of 1,000,000 records
Max practical number of attributes: 3000
Find Laws
Maximum of 1,000,000 records
Max practical number of attributes: ------
Linear Regression
Maximum - Unlimited
Max practical number of attributes: 3,000
Market Basket Analysis
Maximum of 3,000,000 records
Max practical number of attributes ------
In the Transactional Basket Analysis implementation of this algorithm, where each purchased item is represented by a separate record, the maximum number of records is 100,000,000.
Memory Based Reasoning
Maximum of 100,000 records
Max practical number of attributes: 300
PolyNet Predictor
Maximum of 1,000,000 records
Max practical number of attributes: 3,000
But these numbers are for the version 4.6 and 5.0 of Polyanalyst which is now 7 years old. The 6.0 is still in development and there is 64 bit version. I think it will have more impressive numbers.
In polyanalyst I can easily access previous or next row. It has it’s own formula language and accessing columns from different rows is done like this:
[Col1]{-10}
where the number in curly braces is the distance between rows. In this case I’m accessing Col1 from the row which is 10 rows behind the current row. This can be positive number also. So it’s pretty easy.
The most scalable tool I’ve ever seen is MATLAB with it’s parallel processing toolbox. But I haven’t used this toolbox myself, because I have never had such large datasets. And of course it’s not a data mining or business intelligence tool per se, so it might require some additional programming.
I guess there’s no universal tool. At least at this moment. I constantly find myself preparing data with one tool or even with some little program I wrote and using this data in another tool. Data preparation is the most painful part of data mining, I guess.
Several years ago I even wrote one program that implemented Data Mining algorithm, the so called Group Method For Data Handling (GMDH), because I couldn’t find its implementation in any other tool. GMDH has a number of interesting features: it automatically controls the complexity of the model, it can be used on a very small number of samples. It is also fast. But it isn’t a popular approach for some reason.
Yes, I’m developing an investment model for myself. I’m not trying to predict stock or index prices, because I’ve tried to do this several times before and couldn’t create any useful model for russian stock market. Now I have created a model that predict a particular mutual fund price. The broker company only provides this price with one day lag, because as they said there’s a complicated price calculation process involved. I was able to create a pretty accurate and robust model for this price.
Best regards,
Max