I’ve been experimenting with the Association Rule Learner for some time and have a question concerning the “underlying data structure” setting.
From the node description I understand that both settings, “ARRAY” and “TIDList” lead to correct results. The setting affects only memory use and running time.
In my case the difference is a bit extreme: With setting “TIDList” it takes about 15 minutes to obtain results, while with “ARRAY” the result comes within seconds. The results are identical.
My dataset contains about 40.000 baskets with 45.000 distinct articles, the bitvector contains 1,3 million 1s.
My questions: what’s behind the “underlying data structure” setting? Where does the huge time difference come from? Different algorithms? Are there circumstances where I should better use “TIDList”?
Thanks in advance for your answer(s)
the "ARRAY" implementation uses a so-called prefix tree which stores the itemsets as integer arrays. This is fast in processing but very memory intensive, especially if your frequent itemsets are quite long (containing many co-occurring items). The maximum size of the prefix tree is 2^n -1 (n being the number of items).
the "TIDList" implementation stores for each transaction only those items which are present in this transaction in a list and then combines those transaction id lists in order to find the frequent ones. This is much slower but more memory efficient if you have sparse data. The time complexity of this algorithm is in O(m2^n) (m being the number of transactions and n the number of items), whereas the maximal size is only m*n: a list for each transaction with each item set.
In general, the size of the search space is dominated by length of the largest maximal frequent itemset.
Hope that answers you question?