Multi-labeled categories in training data

Sometimes the information you want to predict is not a single category but a set of complementary categories. In this case, training data is usually presented as a row of features and an objective field that contains the associated set of categories joined by some kind of delimiter. BigMLer can also handle this scenario.

Let’s say you have a simple file

color,year,sex,class
red,2000,male,"Student,Teenager"
green,1990,female,"Student,Adult"
red,1995,female,"Teenager,Adult"

with information about a group of people and we want to predict the class another person will fall into. As you can see, each record has more than one class per person (for example, the first person is labeled as being both a Student and a Teenager) and they are all stored in the class field by concatenating all the applicable labels using , as separator. Each of these labels is, ‘per se’, an objective to be predicted, and that’s what we can rely on BigMLer to do.

The simplest multi-label command in BigMLer is

bigmler --multi-label --train data/tiny_multilabel.csv

First, it will analyze the training file to extract all the labels stored in the objective field. Then, a new extended file will be generated from it by adding a new field per label. Each generated field will contain a boolean set to True if the associated label is in the objective field and False otherwise

color,year,sex,class - Adult,class - Student,class - Teenager
red,2000,male,False,True,True
green,1990,female,True,True,False
red,1995,female,True,False,True

This new file will be fed to BigML to build a source, a dataset and a set of models using four input fields: the first three fields as input features and one of the label fields as objective. Thus, each of the classes that label the training set can be predicted independently using one of the models.

But, naturally, when predicting a multi-labeled field you expect to obtain all the labels that qualify the input features at once, as you provide them in the training data records. That’s also what BigMLer does. The syntax to predict using multi-labeled training data sets is similar to the single labeled case

bigmler --multi-label --train data/tiny_multilabel.csv \
        --test data/tiny_test_multilabel.csv

the main difference being that the ouput file predictions.csv will have the following structure

"Adult,Student","0.34237,0.20654"
"Adult,Teenager","0.34237,0.34237"

where the first column contains the class prediction and the second one the confidences for each label prediction. If the models predict True for more than one label, the prediction is presented as a sequence of labels (and their corresponding confidences) delimited by ,.

As you may have noted, BigMLer uses , both as default training data fields separator and as label separator. You can change this behaviour by using the --training-separator, --label-separator and --test-separator flags to use different one-character separators

bigmler --multi-label --train data/multilabel.tsv \
        --test data/test_multilabel.tsv --training-separator '\t' \
        --test-separator '\t' --label-separator ':'

This command would use the tab character as train and test data field delimiter and : as label delimiter (the examples in the tests set use , as field delimiter and ‘:’ as label separator).

You can also choose to restrict the prediction to a subset of labels using the --labels flag. The flag should be set to a comma-separated list of labels. Setting this flag can also reduce the processing time for the training file, because BigMLer will rely on them to produce the extended version of the training file. Be careful, though, to avoid typos in the labels in this case, or no objective fields will be created. Following the previous example

bigmler --multi-label --train data/multilabel.csv \
        --test data/test_multilabel.csv --label-separator ':' \
        --labels Adult,Student

will limit the predictions to the Adult and Student classes, leaving out the Teenager classification.

Multi-labeled predictions can also be computed using ensembles, one for each label. To create an ensemble prediction, use the --number-of-models option that will set the number of models in each ensemble

bigmler --multi-label --train data/multilabel.csv \
        --number-of-models 20 --label-separator ':' \
        --test data/test_multilabel.csv

The ids of the ensembles will be stored in an ensembles file in the output directory, and can be used in other predictions by setting the --ensembles option

bigmler --multi-label --ensembles multilabel/ensembles \
        --test data/test_multilabel.csv

or you can retrieve all previously tagged ensembles with --ensemble-tag

bigmler --multi-label --ensemble-tag multilabel \
        --test data/test_multilabel.csv

Multi-labeled resources

The resources generated from a multi-labeled training data file can also be recovered and used to generate more multi-labeled predictions. As in the single-labeled case

bigmler --multi-label --source source/522521bf37203f412f000100 \
        --test data/test_multilabel.csv

would generate a dataset and the corresponding set of models needed to create a predictions.csv file that contains the multi-labeled predictions.

Similarly, starting from a previously created multi-labeled dataset

bigmler --multi-label --dataset source/522521bf37203f412fac0135 \
        --test data/test_multilabel.csv --output multilabel/predictions.csv

creates a bunch of models, one per label, and predicts storing the results of each operation in the multilabel directory, and finally

bigmler --multi-label --models multilabel/models \
        --test data/test_multilabel.csv

will retrieve the set of models created in the last example and use them in new predictions. In addition, for these three cases you can restrict the labels to predict to a subset of the complete list available in the original objective field. The --labels option can be set to a comma-separated list of the selected labels in order to do so.

The --model-tag can be used as well to retrieve multi-labeled models and predict with them

bigmler --multi-label --model-tag my_multilabel \
        --test data/test_multilabel.csv

Finally, BigMLer is also able to handle training files with more than one multi-labeled field. Using the --multi-label-fields option you can settle the fields that will be expanded as containing multiple labels in the generated source and dataset.

bigmler --multi-label --multi-label-fields class,type \
        --train data/multilabel_multi.csv --objective class

This command creates a source (and its corresponding dataset) where both the class and type fields have been analysed to create a new field per label. Then the --objective option sets class to be the objective field and only the models needed to predict this field are created. You could also create a new multi-label prediction for another multi-label field, type in this case, by issuing a new BigMLer command that uses the previously generated dataset as starting point

bigmler --multi-label --dataset dataset/52cafddb035d07269000075b \
        --objective type

This would generate the models needed to predict type. It’s important to remark that the models used to predict class in the first example will use the rest of fields (including type as well as the ones generated by expanding it) to build the prediction tree. If you don’t want this fields to be used in the model construction, you can set the --model-fields option to exclude them. For instance, if type has two labels, label1 and label2, then excluding them from the models that predict class could be achieved using

bigmler --multi-label --dataset dataset/52cafddb035d07269000075b \
        --objective class
        --model-fields=' -type,-type - label1,-type - label2'

You can also generate new fields applying aggregation functions such as count, first or last on the labels of the multi label fields. The option --label-aggregates can be set to a comma-separated list of these functions and a new column per multi label field and aggregation function will be added to your source

bigmler --multi-label --train data/multilabel.csv \
        --label-separator ':' --label-aggregates count,last \
        --objective class

will generate class - count and class - last in addition to the set of per label fields.

Multi-label evaluations

Multi-label predictions are computed using a set of binary models (or ensembles), one for each label to predict. Each model can be evaluated to check its performance. In order to do so, you can mimic the commands explained in the evaluations section for the single-label models and ensembles. Starting from a local CSV file

bigmler --multi-label --train data/multilabel.csv \
        --label-separator ":" --evaluate

will build the source, dataset and model objects for you using a random 80% portion of data in your training file. After that, the remaining 20% of the data will be run through each of the models to obtain an evaluation of the corresponding model. BigMLer retrieves all evaluations and saves them locally in json and txt format. They are named using the objective field name and the value of the label that they refer to. Finally, it averages the results obtained in all the evaluations to generate a mean evaluation stored in the evaluation.txt and evaluation.json files. As an example, if your objective field name is class and the labels it contains are Adult,Student, the generated files will be

Generated files:

MonNov0413_201326

evaluations

extended_multilabel.csv

source

evaluation_class_student.txt

models

evaluation_class_adult.json

dataset

evaluation.json

evaluation.txt

evaluation_class_student.json

bigmler_sessions

evaluation_class_adult.txt

You can use the same procedure with a previously existing multi-label source or dataset

bigmler --multi-label --source source/50a1e520eabcb404cd0000d1 \
        --evaluate
bigmler --multi-label --dataset dataset/50a1f441035d0706d9000371 \
        --evaluate

Finally, you can also evaluate a preexisting set of models or ensembles using a separate set of data stored in a file or a previous dataset

bigmler --multi-label --models MonNov0413_201326/models \
        --test data/test_multilabel.csv --evaluate
bigmler --multi-label --ensembles MonNov0413_201328/ensembles \
        --dataset dataset/50a1f441035d0706d9000371 --evaluate

Multi-label Options

`--multi-label`	Use multiple labels in the objective field
`--labels`	Comma-separated list of labels used
`--training-separator` SEPARATOR	Character used as field separator in train data field
`--label-separator` SEPARATOR	Character used as label separator in the multi-labeled objective field