Multi-labeled categories in training data
Sometimes the information you want to predict is not a single category but a set of complementary categories. In this case, training data is usually presented as a row of features and an objective field that contains the associated set of categories joined by some kind of delimiter. BigMLer can also handle this scenario.
Let’s say you have a simple file
color,year,sex,class
red,2000,male,"Student,Teenager"
green,1990,female,"Student,Adult"
red,1995,female,"Teenager,Adult"
with information about a group of people, and you want to predict the
classes another person will fall into. As you can see, each record has more
than one class per person (for example, the first person is labeled as
being both a Student and a Teenager), and all of them are stored in the
class field by concatenating the applicable labels using , as
separator. Each of these labels is, per se, an objective to be predicted, and
that is what BigMLer can do for you.
The simplest multi-label command in BigMLer is
bigmler --multi-label --train data/tiny_multilabel.csv
First, it will analyze the training file to extract all the labels stored
in the objective field. Then, a new extended file will be generated
from it by adding a new field per label. Each generated field will contain
a boolean set to True if the associated label is in the objective field
and False otherwise
color,year,sex,class - Adult,class - Student,class - Teenager
red,2000,male,False,True,True
green,1990,female,True,True,False
red,1995,female,True,False,True
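The expansion step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not BigMLer’s actual implementation: the function name expand_multilabel and its parameters are hypothetical, and the sketch assumes that the original objective column is dropped from the extended file and that the label fields are sorted alphabetically, as the listing above suggests.

```python
import csv
import io

# Sample training data copied from the example above.
TRAINING = '''color,year,sex,class
red,2000,male,"Student,Teenager"
green,1990,female,"Student,Adult"
red,1995,female,"Teenager,Adult"
'''


def expand_multilabel(csv_text, objective="class", label_separator=","):
    """Hypothetical helper: add one boolean column per label in `objective`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    features = [name for name in reader.fieldnames if name != objective]
    # Every distinct label found in the objective field, sorted alphabetically.
    labels = sorted({
        label.strip()
        for row in rows
        for label in row[objective].split(label_separator)
        if label.strip()
    })
    header = features + [f"{objective} - {label}" for label in labels]
    expanded = []
    for row in rows:
        present = {label.strip()
                   for label in row[objective].split(label_separator)}
        # True when the label is in this row's objective value, else False.
        expanded.append([row[name] for name in features]
                        + [label in present for label in labels])
    return header, expanded


header, expanded = expand_multilabel(TRAINING)
print(",".join(header))
for row in expanded:
    print(",".join(str(value) for value in row))
```

For the sample data, the printed rows match the extended file shown above.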
This new file will be fed to BigML to build a source, a dataset and
a set of models, one per label. Each model uses four fields: the first
three as input features and one of the label fields as objective. Thus, each
of the classes that label the training set can be predicted independently using
one of the models.
But, naturally, when predicting a multi-labeled field you expect to obtain, all at once, every label that applies to the input features, just as the training data records provide them. That’s also what BigMLer does. The syntax to predict using multi-labeled training data sets is similar to the single-labeled case
bigmler --multi-label --train data/tiny_multilabel.csv \
--test data/tiny_test_multilabel.csv
the main difference being that the output file predictions.csv will have
the following structure
"Adult,Student","0.34237,0.20654"
"Adult,Teenager","0.34237,0.34237"
where the first column contains the class prediction and the second one the
confidences for each label prediction. If the models predict True for
more than one label, the prediction is presented as a sequence of labels
(and their corresponding confidences) delimited by ,.
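A file with this structure can be read back into (label, confidence) pairs with a short script. The following is a minimal sketch, assuming the two-column layout shown above and assuming that the confidences are joined with the same delimiter as the labels; parse_predictions is a hypothetical helper, not part of BigMLer.

```python
import csv
import io

# Sample rows copied from the predictions.csv example above.
PREDICTIONS = '''"Adult,Student","0.34237,0.20654"
"Adult,Teenager","0.34237,0.34237"
'''


def parse_predictions(text, label_separator=","):
    """Hypothetical helper: return one list of (label, confidence) per row."""
    parsed = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        labels, confidences = row[0], row[1]
        names = [name for name in labels.split(label_separator) if name]
        values = [float(value)
                  for value in confidences.split(label_separator) if value]
        # Pair each predicted label with its confidence, in order.
        parsed.append(list(zip(names, values)))
    return parsed


predictions = parse_predictions(PREDICTIONS)
for pairs in predictions:
    print(pairs)
```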
As you may have noted, BigMLer uses , both as the default field separator
for training data and as the default label separator. You can change this
behaviour by using the --training-separator, --label-separator and
--test-separator flags to set different one-character separators
bigmler --multi-label --train data/multilabel.tsv \
--test data/test_multilabel.tsv --training-separator '\t' \
--test-separator '\t' --label-separator ':'
This command would use the tab character as train and test data field
delimiter and : as label delimiter (the examples in the tests set use
, as field delimiter and ‘:’ as label separator).
You can also choose to restrict the prediction to a subset of labels using
the --labels flag. The flag should be set to a comma-separated list of
labels. Setting this flag can also reduce the processing time for the
training file, because BigMLer will rely on them to produce the extended
version of the training file. Be careful, though, to avoid typos in the labels
in this case, or no objective fields will be created. Following the previous
example
bigmler --multi-label --train data/multilabel.csv \
--test data/test_multilabel.csv --label-separator ':' \
--labels Adult,Student
will limit the predictions to the Adult and Student classes, leaving
out the Teenager classification.
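Since a typo in --labels silently leaves you without objective fields, it can be worth checking the list against the training data before launching the command. The sketch below shows one way to do that, assuming : as label separator as in the example; unknown_labels is a hypothetical helper and the objective values are illustrative.

```python
def unknown_labels(requested, objective_values, label_separator=":"):
    """Hypothetical pre-check: return requested labels missing from the data."""
    available = {
        label.strip()
        for value in objective_values
        for label in value.split(label_separator)
    }
    # --labels takes a comma-separated list, whatever the label separator is.
    wanted = [label.strip() for label in requested.split(",")]
    return [label for label in wanted if label not in available]


# Illustrative contents of the objective field, one entry per training row.
objective_column = ["Student:Teenager", "Student:Adult", "Teenager:Adult"]

typos = unknown_labels("Adult,Studnet", objective_column)
clean = unknown_labels("Adult,Student", objective_column)
print(typos, clean)
```

An empty list means every requested label appears at least once in the objective field.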
Multi-labeled predictions can also be computed using ensembles, one for each
label. To create an ensemble prediction, use the --number-of-models option,
which sets the number of models in each ensemble
bigmler --multi-label --train data/multilabel.csv \
--number-of-models 20 --label-separator ':' \
--test data/test_multilabel.csv
The ids of the ensembles will be stored in an ensembles file in the output
directory, and can be used in other predictions by setting the --ensembles
option
bigmler --multi-label --ensembles multilabel/ensembles \
--test data/test_multilabel.csv
or you can retrieve all previously tagged ensembles with --ensemble-tag
bigmler --multi-label --ensemble-tag multilabel \
--test data/test_multilabel.csv
Multi-labeled resources
The resources generated from a multi-labeled training data file can also be recovered and used to generate more multi-labeled predictions. As in the single-labeled case
bigmler --multi-label --source source/522521bf37203f412f000100 \
--test data/test_multilabel.csv
would generate a dataset and the corresponding set of models needed to create
a predictions.csv file that contains the multi-labeled predictions.
Similarly, starting from a previously created multi-labeled dataset
bigmler --multi-label --dataset dataset/522521bf37203f412fac0135 \
--test data/test_multilabel.csv --output multilabel/predictions.csv
creates one model per label and computes the predictions, storing the results
of each operation in the multilabel directory. Finally,
bigmler --multi-label --models multilabel/models \
--test data/test_multilabel.csv
will retrieve the set of models created in the last example and use them in new
predictions. In addition, for these three cases you can restrict the labels
to predict to a subset of the complete list available in the original objective
field. The --labels option can be set to a comma-separated list of the
selected labels in order to do so.
The --model-tag option can also be used to retrieve multi-labeled
models and predict with them
bigmler --multi-label --model-tag my_multilabel \
--test data/test_multilabel.csv
Finally, BigMLer is also able to handle training files with more than one
multi-labeled field. Using the --multi-label-fields option you can
set the fields that will be expanded as containing multiple labels
in the generated source and dataset.
bigmler --multi-label --multi-label-fields class,type \
--train data/multilabel_multi.csv --objective class
This command creates a source (and its corresponding dataset)
where both the class and type fields have been analysed
to create a new field per label. Then the --objective option sets class
to be the objective field and only the models needed to predict this field
are created. You could also create a new multi-label prediction for another
multi-label field, type in this case, by issuing a new BigMLer command
that uses the previously generated dataset as starting point
bigmler --multi-label --dataset dataset/52cafddb035d07269000075b \
--objective type
This would generate the models needed to predict type. Note that the
models used to predict class in the first example will
use the rest of the fields (including type as well as the ones generated
by expanding it) to build the prediction tree. If you don’t want these
fields to be used in the model construction, you can set the --model-fields
option to exclude them. For instance, if type has two labels, label1
and label2, then excluding them from the models that predict
class could be achieved using
bigmler --multi-label --dataset dataset/52cafddb035d07269000075b \
--objective class \
--model-fields=' -type,-type - label1,-type - label2'
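When a multi-label field has many labels, writing the exclusion list by hand is error-prone. The sketch below builds the value from the field name and its labels, assuming the generated fields follow the "field - label" naming used in this example; model_fields_exclusions is a hypothetical helper. Note that the command above passes the value with a leading space inside the quotes, which the sketch leaves to the caller.

```python
def model_fields_exclusions(field, labels):
    """Hypothetical helper: build the exclusion list for --model-fields."""
    # The multi-label field itself plus one generated field per label.
    names = [field] + [f"{field} - {label}" for label in labels]
    # A leading "-" marks each field as excluded.
    return ",".join(f"-{name}" for name in names)


value = model_fields_exclusions("type", ["label1", "label2"])
print(value)
```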
You can also generate new fields by applying aggregation functions such as
count, first or last to the labels of the multi-label fields. The
option --label-aggregates can be set to a comma-separated list of these
functions, and a new column per multi-label field and aggregation function
will be added to your source
bigmler --multi-label --train data/multilabel.csv \
--label-separator ':' --label-aggregates count,last \
--objective class
will generate class - count and class - last in addition to the set
of per label fields.
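What these aggregation functions compute for a single record can be sketched as follows. This is an illustration under assumptions, not BigMLer’s code: label_aggregates is a hypothetical helper, the objective value is illustrative, and the exact representation of the aggregated values in the generated source may differ.

```python
def label_aggregates(value, field="class", functions=("count", "last"),
                     label_separator=":"):
    """Hypothetical helper: aggregate the labels of one multi-label value."""
    labels = [label.strip()
              for label in value.split(label_separator) if label.strip()]
    computed = {
        "count": len(labels),                     # number of labels
        "first": labels[0] if labels else None,   # first label listed
        "last": labels[-1] if labels else None,   # last label listed
    }
    # One new column per requested function, named "<field> - <function>".
    return {f"{field} - {name}": computed[name] for name in functions}


aggregates = label_aggregates("Student:Teenager")
print(aggregates)
```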
Multi-label evaluations
Multi-label predictions are computed using a set of binary models
(or ensembles), one for each label to predict. Each model can be evaluated
to check its performance. In order to do so, you can mimic the commands
explained in the evaluations section for the single-label models and
ensembles. Starting from a local CSV file
bigmler --multi-label --train data/multilabel.csv \
--label-separator ":" --evaluate
will build the source, dataset and model objects for you using a
random 80% portion of data in your training file. After that, the remaining 20%
of the data will be run through each of the models to obtain an evaluation of
the corresponding model. BigMLer retrieves all evaluations and saves
them locally in json and txt format. They are named using the objective field
name and the value of the label that they refer to. Finally, it averages the
results obtained in all the evaluations to generate a mean evaluation stored
in the evaluation.txt and evaluation.json files. As an example,
if your objective field name is class and the labels it contains are
Adult,Student, the generated files will be
Generated files:
MonNov0413_201326
 - evaluations
 - extended_multilabel.csv
 - source
 - evaluation_class_student.txt
 - models
 - evaluation_class_adult.json
 - dataset
 - evaluation.json
 - evaluation.txt
 - evaluation_class_student.json
 - bigmler_sessions
 - evaluation_class_adult.txt
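The naming and averaging of the per-label evaluations can be sketched as follows. The layout of the real evaluation JSON files is not shown here, so the sketch works on flat dictionaries with made-up metric names and values; both helpers are hypothetical, and the lowercasing of the label is inferred from the file names in the listing above.

```python
def evaluation_file_name(objective, label, extension):
    """Hypothetical helper: file name for one per-label evaluation."""
    return f"evaluation_{objective}_{label}.{extension}".lower()


def mean_evaluation(evaluations):
    """Hypothetical helper: average each metric shared by all evaluations."""
    shared = set(evaluations[0])
    for evaluation in evaluations[1:]:
        shared &= set(evaluation)
    return {
        metric: sum(e[metric] for e in evaluations) / len(evaluations)
        for metric in sorted(shared)
    }


# Made-up per-label metrics, used only to illustrate the averaging.
per_label = {
    "Adult": {"accuracy": 0.75, "precision": 0.5},
    "Student": {"accuracy": 0.25, "precision": 1.0},
}

names = [evaluation_file_name("class", label, "json") for label in per_label]
mean = mean_evaluation(list(per_label.values()))
print(names)
print(mean)
```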
You can use the same procedure with a previously existing multi-label source or dataset
bigmler --multi-label --source source/50a1e520eabcb404cd0000d1 \
--evaluate
bigmler --multi-label --dataset dataset/50a1f441035d0706d9000371 \
--evaluate
Finally, you can also evaluate a preexisting set of models or ensembles using a separate set of data stored in a file or a previous dataset
bigmler --multi-label --models MonNov0413_201326/models \
--test data/test_multilabel.csv --evaluate
bigmler --multi-label --ensembles MonNov0413_201328/ensembles \
--dataset dataset/50a1f441035d0706d9000371 --evaluate
Multi-label Options
| --multi-label | Use multiple labels in the objective field |
| --labels | Comma-separated list of labels used |
| --training-separator | Character used as field separator in the training data file |
| --label-separator | Character used as label separator in the multi-labeled objective field |