Monday, August 26, 2013

Apriori algorithm with R

The apriori algorithm is used to discover association rules. But what are association rules?

Association rules are about discovering patterns in data, usually transactional data such as sales (each product in a purchase is an item) or temporal events (purchases in sequential order). They can also be applied to text, where each item would be a word.

So what is the trick behind it? The apriori algorithm mainly counts how many times each item (or set of items) appears, and then calculates metrics such as "support" and "confidence" in each iteration.
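To make the counting idea concrete, here is a minimal sketch in base R of the first two passes: count single items, drop the infrequent ones, and only then count pairs built from the survivors. The toy transactions, the `min_support` value and the helper `support_of` are my own assumptions for illustration; this is not how the arules package implements it internally.

```r
# Toy transactions (hypothetical data, not from a real dataset).
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "milk"),
  c("bread", "butter", "jam")
)
min_support <- 0.5

# Fraction of transactions that contain every item in `itemset`.
support_of <- function(itemset) {
  mean(sapply(transactions, function(tx) all(itemset %in% tx)))
}

# Pass 1: keep the single items that reach the minimum support.
items <- sort(unique(unlist(transactions)))
item_support <- sapply(items, support_of)
frequent_items <- items[item_support >= min_support]

# Pass 2: build candidate pairs only from frequent items, then count them.
pairs <- combn(frequent_items, 2, simplify = FALSE)
pair_support <- sapply(pairs, support_of)
frequent_pairs <- pairs[pair_support >= min_support]

print(frequent_items)
print(frequent_pairs)
```

Here "jam" appears in only one of four transactions, so it is dropped in the first pass and never shows up in a candidate pair. That pruning is what keeps the number of counts manageable.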

Here are a few association rule concepts.

Support: the proportion of transactions in which an item appears.
X: the number of transactions in which the item appears.
N: the total number of transactions.

S(x) = X/N
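As a quick check of the formula, this base R sketch computes the support of one item and then of every item at once. The four toy transactions are made up for the example, and I assume an item is listed at most once per transaction.

```r
# Toy transactions (hypothetical data).
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("milk"),
  c("bread", "butter", "milk")
)

N <- length(transactions)                                     # number of transactions
X <- sum(sapply(transactions, function(tx) "bread" %in% tx))  # transactions with bread
support_bread <- X / N                                        # 3 / 4 = 0.75

# Support of every item in one go.
item_support <- table(unlist(transactions)) / N
print(item_support)
```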

Confidence: how reliable a rule is. For a rule A => B, it is the proportion of the transactions containing A that also contain B, that is, S(A and B) / S(A).
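For a rule such as bread => milk, the confidence is the number of transactions that contain both items divided by the number that contain bread. Here is a small base R sketch of that; the toy data and the helper `contains` are assumptions of mine for illustration.

```r
# Toy transactions (hypothetical data).
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("milk"),
  c("bread", "butter", "milk")
)

# TRUE for each transaction that contains every item in `itemset`.
contains <- function(itemset) {
  sapply(transactions, function(tx) all(itemset %in% tx))
}

n_bread <- sum(contains("bread"))             # 3 transactions contain bread
n_both  <- sum(contains(c("bread", "milk")))  # 2 contain bread and milk

confidence_bread_milk <- n_both / n_bread     # 2 / 3
print(confidence_bread_milk)
```

So two out of every three baskets with bread also have milk, which is well below a threshold like 0.9 and would not be reported as a rule at that level.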

So, the transaction format could be:

Single.
Taking the sales example, in this format each line represents one product, so a transaction with several products spans several lines that all refer to the same transaction.
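A small sketch of what single-format data could look like, and how to group it back into baskets with base R. The lines and the column names `id` and `item` are invented for the example.

```r
# Single format: one "transaction id,item" pair per line (hypothetical data).
single_lines <- c(
  "1,bread",
  "1,milk",
  "2,bread",
  "2,butter",
  "3,milk"
)

single <- read.csv(text = single_lines, header = FALSE,
                   col.names = c("id", "item"),
                   stringsAsFactors = FALSE)

# Group the items by transaction id to recover the baskets.
baskets <- split(single$item, single$id)
print(baskets)
```

With arules, the same kind of file can be read directly using read.transactions with format="single" and cols pointing at the id and item columns.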

Basket sparse sequential.

Each line represents a transaction, so you get a sparse format where the number of columns varies from row to row, instead of a CSV format with the same number of columns in every row.
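To illustrate, here are three made-up lines in this format, parsed with base R. Note that each row ends up with a different number of items.

```r
# Basket sparse format: one transaction per line, items separated by commas
# (hypothetical data).
basket_lines <- c(
  "bread,milk",
  "bread,butter,jam",
  "milk"
)

baskets <- strsplit(basket_lines, ",", fixed = TRUE)

# Number of items in each transaction: 2, 3 and 1.
print(lengths(baskets))
```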


Basket.

Each line represents a transaction, but every row has the same number of columns, so with a large number of products this can be a nightmare if your machine doesn't have a lot of memory. This format is supported by SPSS (Clementine or Modeler).
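To see why memory becomes a problem, this base R sketch turns three made-up baskets into the equal-columns layout: one column per product, TRUE where the product was bought. The data is an assumption of mine; real files would have far more columns.

```r
# Toy baskets (hypothetical data).
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "jam"),
  c("milk")
)

# One column per distinct product, one row per transaction.
items <- sort(unique(unlist(baskets)))
dense <- t(sapply(baskets, function(b) items %in% b))
colnames(dense) <- items

print(dense)
print(dim(dense))  # 3 transactions x 4 products
```

Every transaction stores a value for every product, bought or not, so the table grows with transactions times products even when most cells are FALSE.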





Well, first we need to install these packages: "arules", "arulesViz" and "arulesSequences".
R supports the basket sparse and single formats; here I used the basket sparse format.

install.packages("arules")
install.packages("arulesViz")
install.packages("arulesSequences")
library(arules)     # read.transactions, apriori, inspect
library(arulesViz)  # plot methods for rules

We need to define the support and the confidence.
You can edit these values in the file arules.r.

support1 <- 0.2    # a low support, because I want to
                   # see what happens at this level
support2 <- 0.7    # a higher support
confidence <- 0.9  # confidence should usually be over 0.8

tr <- read.transactions("transacciones.basket",
                        sep = ",",
                        cols = 1,  # first column holds the transaction id
                        format = "basket")
image(tr)
summary(tr)

The image plot is like a heatmap where we can see where clusters are
or which products are bought most often. If the product list is too big,
it is not useful. On the other hand, "summary" shows us an overview.

itemFrequencyPlot(tr, supp=support1)

The command above makes a bar chart of the items whose support is at least support1.

And here we execute the apriori algorithm with the transaction data (tr) and the parameters we defined before:

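rules <- apriori(tr, parameter = list(supp = support1, conf = confidence))
inspect(rules)  # print the rules with their support and confidence
plot(rules, method = "graph", control = list(type = "items"))  # rules drawn as a graph of items
plot(rules, method = "grouped")  # rules grouped into a matrix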
