.. toctree:: :maxdepth: 2 :hidden: Multi-labeled categories in training data ========================================= Sometimes the information you want to predict is not a single category but a set of complementary categories. In this case, training data is usually presented as a row of features and an objective field that contains the associated set of categories joined by some kind of delimiter. BigMLer can also handle this scenario. Let's say you have a simple file .. code-block:: bash color,year,sex,class red,2000,male,"Student,Teenager" green,1990,female,"Student,Adult" red,1995,female,"Teenager,Adult" with information about a group of people and we want to predict the ``class`` another person will fall into. As you can see, each record has more than one ``class`` per person (for example, the first person is labeled as being both a ``Student`` and a ``Teenager``) and they are all stored in the ``class`` field by concatenating all the applicable labels using ``,`` as separator. Each of these labels is, 'per se', an objective to be predicted, and that's what we can rely on BigMLer to do. The simplest multi-label command in BigMLer is .. code-block:: bash bigmler --multi-label --train data/tiny_multilabel.csv First, it will analyze the training file to extract all the ``labels`` stored in the objective field. Then, a new extended file will be generated from it by adding a new field per label. Each generated field will contain a boolean set to ``True`` if the associated label is in the objective field and ``False`` otherwise .. code-block:: bash color,year,sex,class - Adult,class - Student,class - Teenager red,2000,male,False,True,True green,1990,female,True,True,False red,1995,female,True,False,True This new file will be fed to BigML to build a ``source``, a ``dataset`` and a set of ``models`` using four input fields: the first three fields as input features and one of the label fields as objective. Thus, each of the classes that label the training set can be predicted independently using one of the models. But, naturally, when predicting a multi-labeled field you expect to obtain all the labels that qualify the input features at once, as you provide them in the training data records. That's also what BigMLer does. The syntax to predict using multi-labeled training data sets is similar to the single labeled case .. code-block:: bash bigmler --multi-label --train data/tiny_multilabel.csv \ --test data/tiny_test_multilabel.csv the main difference being that the ouput file ``predictions.csv`` will have the following structure .. code-block:: bash "Adult,Student","0.34237,0.20654" "Adult,Teenager","0.34237,0.34237" where the first column contains the ``class`` prediction and the second one the confidences for each label prediction. If the models predict ``True`` for more than one label, the prediction is presented as a sequence of labels (and their corresponding confidences) delimited by ``,``. As you may have noted, BigMLer uses ``,`` both as default training data fields separator and as label separator. You can change this behaviour by using the ``--training-separator``, ``--label-separator`` and ``--test-separator`` flags to use different one-character separators .. code-block:: bash bigmler --multi-label --train data/multilabel.tsv \ --test data/test_multilabel.tsv --training-separator '\t' \ --test-separator '\t' --label-separator ':' This command would use the ``tab`` character as train and test data field delimiter and ``:`` as label delimiter (the examples in the tests set use ``,`` as field delimiter and ':' as label separator). You can also choose to restrict the prediction to a subset of labels using the ``--labels`` flag. The flag should be set to a comma-separated list of labels. Setting this flag can also reduce the processing time for the training file, because BigMLer will rely on them to produce the extended version of the training file. Be careful, though, to avoid typos in the labels in this case, or no objective fields will be created. Following the previous example .. code-block:: bash bigmler --multi-label --train data/multilabel.csv \ --test data/test_multilabel.csv --label-separator ':' \ --labels Adult,Student will limit the predictions to the ``Adult`` and ``Student`` classes, leaving out the ``Teenager`` classification. Multi-labeled predictions can also be computed using ensembles, one for each label. To create an ensemble prediction, use the ``--number-of-models`` option that will set the number of models in each ensemble .. code-block:: bash bigmler --multi-label --train data/multilabel.csv \ --number-of-models 20 --label-separator ':' \ --test data/test_multilabel.csv The ids of the ensembles will be stored in an ``ensembles`` file in the output directory, and can be used in other predictions by setting the ``--ensembles`` option .. code-block:: bash bigmler --multi-label --ensembles multilabel/ensembles \ --test data/test_multilabel.csv or you can retrieve all previously tagged ensembles with ``--ensemble-tag`` .. code-block:: bash bigmler --multi-label --ensemble-tag multilabel \ --test data/test_multilabel.csv Multi-labeled resources ======================= The resources generated from a multi-labeled training data file can also be recovered and used to generate more multi-labeled predictions. As in the single-labeled case .. code-block:: bash bigmler --multi-label --source source/522521bf37203f412f000100 \ --test data/test_multilabel.csv would generate a dataset and the corresponding set of models needed to create a ``predictions.csv`` file that contains the multi-labeled predictions. Similarly, starting from a previously created multi-labeled dataset .. code-block:: bash bigmler --multi-label --dataset source/522521bf37203f412fac0135 \ --test data/test_multilabel.csv --output multilabel/predictions.csv creates a bunch of models, one per label, and predicts storing the results of each operation in the ``multilabel`` directory, and finally .. code-block:: bash bigmler --multi-label --models multilabel/models \ --test data/test_multilabel.csv will retrieve the set of models created in the last example and use them in new predictions. In addition, for these three cases you can restrict the labels to predict to a subset of the complete list available in the original objective field. The ``--labels`` option can be set to a comma-separated list of the selected labels in order to do so. The ``--model-tag`` can be used as well to retrieve multi-labeled models and predict with them .. code-block:: bash bigmler --multi-label --model-tag my_multilabel \ --test data/test_multilabel.csv Finally, BigMLer is also able to handle training files with more than one multi-labeled field. Using the ``--multi-label-fields`` option you can settle the fields that will be expanded as containing multiple labels in the generated source and dataset. .. code-block:: bash bigmler --multi-label --multi-label-fields class,type \ --train data/multilabel_multi.csv --objective class This command creates a source (and its corresponding dataset) where both the ``class`` and ``type`` fields have been analysed to create a new field per label. Then the ``--objective`` option sets ``class`` to be the objective field and only the models needed to predict this field are created. You could also create a new multi-label prediction for another multi-label field, ``type`` in this case, by issuing a new BigMLer command that uses the previously generated dataset as starting point .. code-block:: bash bigmler --multi-label --dataset dataset/52cafddb035d07269000075b \ --objective type This would generate the models needed to predict ``type``. It's important to remark that the models used to predict ``class`` in the first example will use the rest of fields (including ``type`` as well as the ones generated by expanding it) to build the prediction tree. If you don't want this fields to be used in the model construction, you can set the ``--model-fields`` option to exclude them. For instance, if ``type`` has two labels, ``label1`` and ``label2``, then excluding them from the models that predict ``class`` could be achieved using .. code-block:: bash bigmler --multi-label --dataset dataset/52cafddb035d07269000075b \ --objective class --model-fields=' -type,-type - label1,-type - label2' You can also generate new fields applying aggregation functions such as ``count``, ``first`` or ``last`` on the labels of the multi label fields. The option ``--label-aggregates`` can be set to a comma-separated list of these functions and a new column per multi label field and aggregation function will be added to your source .. code-block:: bash bigmler --multi-label --train data/multilabel.csv \ --label-separator ':' --label-aggregates count,last \ --objective class will generate ``class - count`` and ``class - last`` in addition to the set of per label fields. Multi-label evaluations ----------------------- Multi-label predictions are computed using a set of binary models (or ensembles), one for each label to predict. Each model can be evaluated to check its performance. In order to do so, you can mimic the commands explained in the ``evaluations`` section for the single-label models and ensembles. Starting from a local CSV file .. code-block:: bash bigmler --multi-label --train data/multilabel.csv \ --label-separator ":" --evaluate will build the source, dataset and model objects for you using a random 80% portion of data in your training file. After that, the remaining 20% of the data will be run through each of the models to obtain an evaluation of the corresponding model. BigMLer retrieves all evaluations and saves them locally in json and txt format. They are named using the objective field name and the value of the label that they refer to. Finally, it averages the results obtained in all the evaluations to generate a mean evaluation stored in the ``evaluation.txt`` and ``evaluation.json`` files. As an example, if your objective field name is ``class`` and the labels it contains are ``Adult,Student``, the generated files will be .. code-block:: bash Generated files: MonNov0413_201326 - evaluations - extended_multilabel.csv - source - evaluation_class_student.txt - models - evaluation_class_adult.json - dataset - evaluation.json - evaluation.txt - evaluation_class_student.json - bigmler_sessions - evaluation_class_adult.txt You can use the same procedure with a previously existing multi-label source or dataset .. code-block:: bash bigmler --multi-label --source source/50a1e520eabcb404cd0000d1 \ --evaluate bigmler --multi-label --dataset dataset/50a1f441035d0706d9000371 \ --evaluate Finally, you can also evaluate a preexisting set of models or ensembles using a separate set of data stored in a file or a previous dataset .. code-block:: bash bigmler --multi-label --models MonNov0413_201326/models \ --test data/test_multilabel.csv --evaluate bigmler --multi-label --ensembles MonNov0413_201328/ensembles \ --dataset dataset/50a1f441035d0706d9000371 --evaluate Multi-label Options ^^^^^^^^^^^^^^^^^^^ ======================================= ======================================= ``--multi-label`` Use multiple labels in the objective field ``--labels`` Comma-separated list of labels used ``--training-separator`` *SEPARATOR* Character used as field separator in train data field ``--label-separator`` *SEPARATOR* Character used as label separator in the multi-labeled objective field ======================================= =======================================