.. toctree:: :maxdepth: 2 :hidden: .. _bigmler-association: Association subcommand ====================== Association Discovery is a popular method to find out relations among values in high-dimensional datasets. A common case where association discovery is often used is market basket analysis. This analysis seeks for customer shopping patterns across large transactional datasets. For instance, do customers who buy hamburgers and ketchup also consume bread? Businesses use those insights to make decisions on promotions and product placements. Association Discovery can also be used for other purposes such as early incident detection, web usage analysis, or software intrusion detection. In BigML, the Association resource object can be built from any dataset, and its results are a list of association rules between the items in the dataset. In the example case, the corresponding association rule would have hamburguers and ketchup as the items at the left hand side of the association rule and bread would be the item at the right hand side. Both sides in this association rule are related, in the sense that observing the items in the left hand side implies observing the items in the right hand side. There are some metrics to ponder the quality of these association rules: - Support: the proportion of instances which contain an itemset. For an association rule, it means the number of instances in the dataset which contain the rule's antecedent and rule's consequent together over the total number of instances (N) in the dataset. It gives a measure of the importance of the rule. Association rules have to satisfy a minimum support constraint (i.e., min_support). - Coverage: the support of the antedecent of an association rule. It measures how often a rule can be applied. - Confidence or (strength): The probability of seeing the rule's consequent under the condition that the instances also contain the rule's antecedent. Confidence is computed using the support of the association rule over the coverage. That is, the percentage of instances which contain the consequent and antecedent together over the number of instances which only contain the antecedent. Confidence is directed and gives different values for the association rules Antecedent → Consequent and Consequent → Antecedent. Association rules also need to satisfy a minimum confidence constraint (i.e., min_confidence). - Leverage: the difference of the support of the association rule (i.e., the antecedent and consequent appearing together) and what would be expected if antecedent and consequent where statistically independent. This is a value between -1 and 1. A positive value suggests a positive relationship and a negative value suggests a negative relationship. 0 indicates independence. Lift: how many times more often antecedent and consequent occur together than expected if they where statistically independent. A value of 1 suggests that there is no relationship between the antecedent and the consequent. Higher values suggest stronger positive relationships. Lower values suggest stronger negative relationships (the presence of the antecedent reduces the likelihood of the consequent) As to the items used in association rules, each type of field is parsed to extract items for the rules as follows: - Categorical: each different value (class) will be considered a separate item. - Text: each unique term will be considered a separate item. - Items: each different item in the items summary will be considered. - Numeric: Values will be converted into categorical by making a segmentation of the values. For example, a numeric field with values ranging from 0 to 600 split into 3 segments: segment 1 → [0, 200), segment 2 → [200, 400), segment 3 → [400, 600]. You can refine the behavior of the transformation using `discretization `_ and `field_discretizations `_. The ``bigmler association`` subcommand will discover the association rules present in your datasets. Starting from the raw data in your files: .. code-block:: bash bigmler association --train my_file.csv will generate the ``source``, ``dataset`` and ``association`` objects required to present the association rules hidden in your data. You can also limit the number of rules extracted using the ``--max-k`` option .. code-block:: bash bigmler association --dataset dataset/532db2b637203f3f1a000103 \ --max-k 20 With the prior command only 20 association rules will be extracted. Similarly, you can change the search strategy used to find them .. code-block:: bash bigmler association --dataset dataset/532db2b637203f3f1a000103 \ --search-strategy confidence In this case, the ``confidence`` is used (the default value being ``leverage``). Association Specific Subcommand Options ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ===================================== ========================================= ``--association-attributes`` Path to a JSON file containing attributes (any of the updatable attributes described in the `developers section `_ ) for the association ``--max-k`` K Maximum number of rules to be found ``--search-strategy`` STRATEGY Strategy used when searching for the associations. The possible values are: confidence, coverage, leverage, lift, support ===================================== =========================================