Analyze subcommand
In addition to the main BigMLer capabilities explained so far, there’s a
subcommand bigmler analyze with more options to evaluate the performance
of your models. For instance,
bigmler analyze --dataset dataset/5357eb2637203f1668000004 \
--cross-validation --k-folds 5
will create a k-fold cross-validation by dividing the data in your dataset into
the number of parts given in --k-folds. Each evaluation is then created by
selecting one of the parts as the test set and using the rest of the data
to build the model that is tested. The generated
evaluations are placed in your output directory and their average is stored in
evaluation.txt and evaluation.json.
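The splitting scheme described above can be sketched in a few lines of Python. This is only an illustration of the general k-fold idea, not BigMLer's own code: the function names and the way instances are assigned to folds are assumptions made for the example.

```python
def k_fold_splits(n_instances, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_instances))
    # Illustrative assignment: fold i takes every k-th instance from offset i.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # The training set is made of all the remaining folds.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def average(scores):
    """Average the scores of the k evaluations."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for train, test in k_fold_splits(10, 5):
        print("train:", train, "test:", test)
```

Each instance is used for testing exactly once, so averaging the k evaluation scores gives an estimate that does not depend on a single train/test split.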
Similarly, you’ll be able to create an evaluation for ensembles by using the
same command above and adding the options that define the ensembles’ properties,
such as --number-of-models, --sample-rate, --randomize or
--replacement:
bigmler analyze --dataset dataset/5357eb2637203f1668000004 \
--cross-validation --k-folds 5 --number-of-models 20 \
--sample-rate 0.8 --replacement
More insights can be drawn from the bigmler analyze --features command. Here,
the aim is to analyze the complete set of features
in your dataset and single out the ones that produce models with better
evaluation scores. By default, the score used is accuracy for categorical
objective fields and r-squared for regressions.
bigmler analyze --dataset dataset/5357eb2637203f1668000004 \
--features
This command uses an algorithm for smart feature selection as described in this
blog post
that evaluates models built by using subsets of features. It starts by
building one model per feature, chooses the subset of features used in the
model that scores best and, from there on, repeats the procedure
by adding another of the available features in the dataset to the chosen
subset. The iteration stops when no improvement in score is found for a number
of repetitions that can be controlled using the --staleness option
(default is 5). There’s
also a --penalty option (default is 0.1%, that is, 0.001) that sets the amount
subtracted from the score per feature added to the
subset. This penalty is intended
to mitigate overfitting, but it also favors models which are quicker to build
and evaluate. The evaluations for the scores are k-fold cross-validations.
The number of folds is 5 by default, but you can change it
to whatever suits your needs using the --k-folds option.
bigmler analyze --dataset dataset/5357eb2637203f1668000004 \
--features --k-folds 10 --staleness 3 --penalty 0.002
This command would select the best subset of features using 10-fold
cross-validation and a 0.2% penalty per feature, stopping after 3 non-improving iterations.
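The selection loop can be pictured with the following simplified sketch. It is a plain greedy search written for illustration and may differ from what BigMLer actually runs. The names select_features, toy_score and GAINS are hypothetical, and toy_score stands in for the k-fold cross-validated score that a real run would obtain from BigML.

```python
# Hypothetical per-feature contributions, used only to make the example runnable.
GAINS = {"a": 0.6, "b": 0.3, "c": 0.0005, "d": 0.0}


def toy_score(subset):
    """Stand-in for the cross-validated score of a model built on `subset`."""
    return sum(GAINS[feature] for feature in subset)


def select_features(features, score_fn, staleness=5, penalty=0.001):
    """Greedy forward selection with a per-feature penalty and a staleness limit."""
    best_subset, best_score = [], float("-inf")
    current, remaining, stale = [], list(features), 0
    while remaining and stale < staleness:
        # Try adding each remaining feature to the current subset.
        candidates = []
        for feature in remaining:
            subset = current + [feature]
            # The penalty is subtracted once per feature in the subset.
            score = score_fn(subset) - penalty * len(subset)
            candidates.append((score, subset, feature))
        score, current, chosen = max(candidates, key=lambda c: c[0])
        remaining.remove(chosen)
        if score > best_score:
            best_subset, best_score, stale = current, score, 0
        else:
            stale += 1  # one more iteration without improvement
    return best_subset, best_score


if __name__ == "__main__":
    print(select_features(["a", "b", "c", "d"], toy_score, staleness=3, penalty=0.002))
```

In this toy setting the features c and d add less to the score than the penalty they cost, so the search keeps only a and b.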
Depending on the machine learning problem you intend to tackle, you might
want to optimize another evaluation metric, such as precision or
recall. The --optimize option lets you set the evaluation
metric you’d like to optimize.
bigmler analyze --dataset dataset/5357eb2637203f1668000004 \
--features --optimize recall
For categorical models, the evaluation values are obtained by counting
the positive and negative matches for all the instances in
the test set, but sometimes it can be more useful to optimize the
performance of the model for a single category. This can be especially
important in highly imbalanced datasets or when the cost function is
mainly associated with one of the existing classes in the objective field.
Using --optimize-category you can set the category whose evaluation
metrics you’d like to optimize:
bigmler analyze --dataset dataset/5357eb2637203f1668000004 \
--features --optimize recall \
--optimize-category Iris-setosa
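To make the idea of a per-category metric concrete, here is a small sketch that computes recall for one category using the standard definition (true positives divided by all the instances that really belong to the category). The function name and the sample labels are made up for the example, and nothing here reflects how BigML computes its evaluations internally.

```python
def category_recall(actual, predicted, category):
    """Recall restricted to a single category of the objective field."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == category and p == category)
    false_neg = sum(1 for a, p in pairs if a == category and p != category)
    total = true_pos + false_neg
    # If the category never appears in the test set, report 0.0.
    return true_pos / total if total else 0.0


if __name__ == "__main__":
    # Made-up labels for illustration only.
    actual = ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-virginica", "Iris-setosa"]
    predicted = ["Iris-setosa", "Iris-versicolor", "Iris-versicolor",
                 "Iris-virginica", "Iris-setosa"]
    print(category_recall(actual, predicted, "Iris-setosa"))
```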
You should be aware that the smart feature selection command still generates
a high number of BigML resources. Using k as the k-folds number and
n as the number of explored feature sets, it will generate k
datasets (1/k-th of the instances each), and k * n models and
evaluations. Setting --max-parallel-models and
--max-parallel-evaluations to higher values (up to k) can partially
speed up the creation process because resources will be created
in parallel. Keep in mind, though, that this parallelization is
limited by the task limit associated with your subscription or account type.
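As a quick back-of-the-envelope check, the counts given above can be written as a tiny helper. The function name is hypothetical and it only restates the k and k * n figures from the text.

```python
def estimate_resources(k, n):
    """Resources created for k folds and n explored feature sets."""
    return {
        "datasets": k,         # one dataset per fold
        "models": k * n,       # one model per fold and feature set
        "evaluations": k * n,  # one evaluation per model
    }


if __name__ == "__main__":
    # For example, 5 folds and 12 explored feature sets.
    print(estimate_resources(5, 12))
```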
As another optimization method, the bigmler analyze --nodes subcommand
will find the best-performing model for you by changing the number of nodes
in its tree. You provide the --min-nodes and --max-nodes values that define
the range, and --nodes-step controls the increment in each step. The command
runs a k-fold evaluation (see the --k-folds option) on a model built with each
node threshold in your range and tries to optimize the evaluation metric you
chose (again, the default is accuracy). If improvement stops (see
the --staleness option) or the node threshold reaches the --max-nodes
limit, the process ends and shows the node threshold that
led to the best score.
bigmler analyze --dataset dataset/5357eb2637203f1668000004 \
--nodes --min-nodes 10 \
--max-nodes 200 --nodes-step 50
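The search over node thresholds can be sketched as a simple loop. This is an illustrative reading of the description above, not BigMLer's implementation. The function names are hypothetical and toy_score replaces the k-fold evaluation that a real run would request from BigML. The default values mirror the ones listed in the options table.

```python
def search_node_threshold(score_fn, min_nodes=3, max_nodes=2000,
                          nodes_step=50, staleness=5):
    """Scan node thresholds and keep the one with the best score."""
    best_nodes, best_score = None, float("-inf")
    stale = 0
    nodes = min_nodes
    # Stop at --max-nodes or after `staleness` steps without improvement.
    while nodes <= max_nodes and stale < staleness:
        score = score_fn(nodes)
        if score > best_score:
            best_nodes, best_score, stale = nodes, score, 0
        else:
            stale += 1
        nodes += nodes_step
    return best_nodes, best_score


def toy_score(nodes):
    """Made-up score that peaks at 110 nodes, for illustration only."""
    return 1 - abs(nodes - 110) / 1000


if __name__ == "__main__":
    # Same range as the command above: thresholds 10, 60, 110 and 160.
    print(search_node_threshold(toy_score, min_nodes=10, max_nodes=200, nodes_step=50))
```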
When working with random forests, you can also change the number of
random_candidates, that is, the number of fields chosen at random when the models
in the forest are built. Using bigmler analyze --random-fields, the number
of random_candidates will range from 1 to the number of fields in the
origin dataset, and BigMLer will cross-validate the random forests to determine
which random_candidates number gives the best performance.
bigmler analyze --dataset dataset/5357eb2637203f1668000004 \
--random-fields
Please note that, in general, the exact choice of fields selected as random candidates might be more important than their actual number. However, in some marginal cases (e.g., datasets with a high number of noise features) the number of random candidates can impact tree performance significantly.
For any of these options (--features, --nodes and --random-fields)
you can add the --predictions-csv flag to the bigmler analyze
command. The results will then include a CSV file that stores the predictions
obtained in the evaluations that gave the best score. The file content includes
the data in your original dataset tagged by k-fold and the prediction and
confidence obtained. This file will be placed in an internal folder of your
chosen output directory.
bigmler analyze --dataset dataset/5357eb2637203f1668000004 \
--features --output-dir my_features --predictions-csv
The output directory for this command is my_features and it will
contain all the information about the resources generated when testing
the different feature combinations
organized in subfolders. The k-fold datasets’
IDs will be stored in an inner test directory. The IDs of the resources
created when testing each combination of features will be stored in
kfold1, kfold2, etc. folders inside the test directory.
If the best-scoring prediction
models are the ones in the kfold4 folder, then the predictions CSV file
will be stored in a new folder named kfold4_pred.
Analyze subcommand Options
| --cross-validation | Sets the k-fold cross-validation mode |
| --k-folds | Number of folds used in k-fold cross-validation (default is 5) |
| --features | Sets the smart selection features mode |
| --staleness | Number of iterations with no improvement that is considered the limit for the analysis to stop (default is 5) |
| --penalty | Coefficient used to penalize models with many features in the smart selection features mode (default is 0.001). Also used in node threshold selection (default is 0) |
| --optimize | Metric that is being optimized in the smart selection features mode or the node threshold search mode (default is accuracy) |
| --optimize-category | Category whose metric is being optimized in the smart selection features mode or the node threshold search mode (only for categorical models) |
| --nodes | Sets the node threshold search mode |
| --min-nodes | Minimum number of nodes to start the node threshold search mode (default 3) |
| --max-nodes | Maximum number of nodes to end the node threshold search mode (default 2000) |
| --nodes-step | Step in the node threshold search iteration (default 50) |
| --exclude-features | Comma-separated list of features in the dataset to be excluded from the features analysis |
|
Causes the training set to be run
through the anomaly detector generating
a batch anomaly score. Only used with
the |