Analyze subcommand

In addition to the main BigMLer capabilities explained so far, there’s a subcommand bigmler analyze with more options to evaluate the performance of your models. For instance

bigmler analyze --dataset dataset/5357eb2637203f1668000004 \
                --cross-validation --k-folds 5

will create a k-fold cross-validation by dividing the data in your dataset into the number of parts given in --k-folds. Evaluations are then created by selecting one of the parts as the test set and using the rest of the data to build the model to be tested. The generated evaluations are placed in your output directory and their average is stored in evaluation.txt and evaluation.json.
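The splitting procedure can be sketched in a few lines of Python. This is only an illustration of the k-fold idea, not BigMLer's code: the function names, the shuffling and the seed are assumptions made for the example, and the real command creates remote datasets, models and evaluations instead of in-memory lists.

```python
import random


def k_fold_splits(rows, k, seed=0):
    """Yield one (train, test) pair per fold (hypothetical helper)."""
    shuffled = list(rows)
    # Shuffling with a fixed seed is an assumption of this sketch.
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every k-th row starting at position i.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test


def average(scores):
    """Average the per-fold scores, as done for the final evaluation."""
    return sum(scores) / len(scores)


# Ten instances split into 5 folds: each test set holds 2 of them.
splits = list(k_fold_splits(range(10), k=5))
```

Each instance lands in exactly one test set, so every row is used once for testing and k - 1 times for training.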

Similarly, you’ll be able to create an evaluation for ensembles by using the same command as above and adding the options that define the ensemble’s properties, such as --number-of-models, --sample-rate, --randomize or --replacement:

bigmler analyze --dataset dataset/5357eb2637203f1668000004 \
                --cross-validation --k-folds 5 --number-of-models 20 \
                --sample-rate 0.8 --replacement

More insights can be drawn from the bigmler analyze --features command. Its aim is to analyze the complete set of features in your dataset and single out the ones that produce models with better evaluation scores. The scores used are accuracy for categorical objective fields and r-squared for regressions.

bigmler analyze --dataset dataset/5357eb2637203f1668000004 \
                --features

This command uses an algorithm for smart feature selection, as described in this blog post, that evaluates models built by using subsets of features. It starts by building one model per feature, chooses the subset of features used in the model that scores best and, from there on, repeats the procedure by adding another of the available features in the dataset to the chosen subset. The iteration stops when no improvement in score is found for a number of repetitions that can be controlled using the --staleness option (default is 5). There’s also a --penalty option (default is 0.1%) that sets the amount that is subtracted from the score per feature added to the subset. This penalty is intended to mitigate overfitting, but it also favors models that are quicker to build and evaluate. The evaluations used for the scores are k-fold cross-validations. The number of folds is 5 by default, but you can change it to whatever suits your needs using the --k-folds option.

bigmler analyze --dataset dataset/5357eb2637203f1668000004 \
                --features --k-folds 10 --staleness 3 --penalty 0.002

This command would select the best subset of features using 10-fold cross-validation and a 0.2% penalty per feature, stopping after 3 non-improving iterations.
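The interplay between --staleness and --penalty may be easier to follow in a simplified sketch. The code below is a plain greedy forward selection written for illustration only: it is not BigMLer's implementation, and select_features, toy_evaluate and the toy scores are hypothetical. In the real command, each call to evaluate would correspond to a k-fold cross-validation.

```python
def select_features(features, evaluate, staleness=5, penalty=0.001):
    """Greedy forward selection with a per-feature penalty (sketch)."""
    selected = []                      # subset being grown
    best_subset, best_score = [], float("-inf")
    stale = 0                          # iterations without improvement
    remaining = list(features)
    while remaining and stale < staleness:
        # Score every one-feature extension of the current subset,
        # subtracting the penalty once per feature in the candidate.
        scored = [
            (evaluate(selected + [f]) - penalty * (len(selected) + 1), f)
            for f in remaining
        ]
        score, feature = max(scored)
        selected = selected + [feature]
        remaining.remove(feature)
        if score > best_score:
            best_subset, best_score = selected, score
            stale = 0
        else:
            stale += 1
    return best_subset, best_score


# Toy scoring function (an assumption): only two features carry signal.
USEFUL = {"petal length": 0.60, "petal width": 0.30}


def toy_evaluate(subset):
    return sum(USEFUL.get(f, 0.0) for f in subset)


FEATURES = ["sepal length", "sepal width", "petal length", "petal width"]
subset, score = select_features(FEATURES, toy_evaluate,
                                staleness=3, penalty=0.002)
```

With these toy scores the search keeps the two petal features: adding either sepal feature leaves the raw score unchanged, so the penalty makes the larger subsets score slightly worse.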

Depending on the machine learning problem you intend to tackle, you might want to optimize another evaluation metric, such as precision or recall. The --optimize option allows you to set the evaluation metric you’d like to optimize.

bigmler analyze --dataset dataset/5357eb2637203f1668000004 \
                --features --optimize recall

For categorical models, the evaluation values are obtained by counting the positive and negative matches for all the instances in the test set, but sometimes it can be more useful to optimize the performance of the model for a single category. This can be especially important in highly unbalanced datasets or when the cost function is mainly associated with one of the existing classes in the objective field. Using --optimize-category you can set the category whose evaluation metrics you’d like to optimize:

bigmler analyze --dataset dataset/5357eb2637203f1668000004 \
                --features --optimize recall \
                --optimize-category Iris-setosa
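To make the single-category idea concrete, here is a minimal sketch of how recall for one category can be computed from actual and predicted labels. The helper name and the sample labels are made up for the example, and the way BigML combines per-category values into its overall metrics is not covered here.

```python
def recall_for(category, actual, predicted):
    """Fraction of instances of `category` that were predicted as such."""
    relevant = [p for a, p in zip(actual, predicted) if a == category]
    if not relevant:
        return 0.0
    return sum(p == category for p in relevant) / len(relevant)


# Made-up test-set labels and predictions.
actual = ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"]
predicted = ["Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica"]

setosa_recall = recall_for("Iris-setosa", actual, predicted)
versicolor_recall = recall_for("Iris-versicolor", actual, predicted)
```

Here one of the two Iris-setosa instances is missed, so its recall is lower than that of the other categories even though most predictions are correct overall.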

You should be aware that the smart feature selection command still generates a high number of BigML resources. Using k as the k-folds number and n as the number of explored feature sets, it will generate k datasets (each holding 1/k of the instances) and k * n models and evaluations. Setting --max-parallel-models and --max-parallel-evaluations to higher values (up to k) can partially speed up the creation process, because resources will be created in parallel. Keep in mind, though, that this parallelization is limited by the task limit associated with your subscription or account type.
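The resource count above is simple arithmetic, as the following small helper shows. The function name is hypothetical, and the figures only cover the resources mentioned in the paragraph.

```python
def resource_counts(k, n):
    """Resources created for k folds and n explored feature sets."""
    return {"datasets": k, "models": k * n, "evaluations": k * n}


# 5 folds and 12 explored feature sets.
counts = resource_counts(k=5, n=12)
```

For 5 folds and 12 explored feature sets this gives 5 datasets, 60 models and 60 evaluations, which is why the task limit of your account matters.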

As another optimization method, the bigmler analyze --nodes subcommand will find the best performing model for you by changing the number of nodes in its tree. You provide the --min-nodes and --max-nodes values that define the range, and --nodes-step controls the increment in each step. The command runs a k-fold evaluation (see the --k-folds option) on a model built with each node threshold in your range and tries to optimize the evaluation metric you chose (again, the default is accuracy). If improvement stops (see the --staleness option) or the node threshold reaches the --max-nodes limit, the process ends and shows the node threshold that led to the best score.

bigmler analyze --dataset dataset/5357eb2637203f1668000004 \
                --nodes --min-nodes 10 \
                --max-nodes 200 --nodes-step 50
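The search loop can be sketched as follows. This is an illustration under assumptions, not BigMLer's code: search_node_threshold and toy_evaluate are hypothetical, evaluate stands in for a k-fold cross-validation of a model built with the given node threshold, and how the real command treats a --max-nodes value that the step does not land on exactly is not specified here.

```python
def search_node_threshold(evaluate, min_nodes=3, max_nodes=2000,
                          step=50, staleness=5):
    """Try node thresholds from min_nodes up to max_nodes (sketch)."""
    best_nodes, best_score = None, float("-inf")
    stale = 0                # consecutive steps without improvement
    tried = []
    nodes = min_nodes
    while nodes <= max_nodes and stale < staleness:
        tried.append(nodes)
        score = evaluate(nodes)
        if score > best_score:
            best_nodes, best_score = nodes, score
            stale = 0
        else:
            stale += 1
        nodes += step
    return best_nodes, best_score, tried


def toy_evaluate(nodes):
    # Made-up score that peaks at 110 nodes.
    return 1 - abs(nodes - 110) / 1000


best_nodes, best_score, tried = search_node_threshold(
    toy_evaluate, min_nodes=10, max_nodes=200, step=50)
```

With the values used in the command above, this sketch evaluates the thresholds 10, 60, 110 and 160.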

When working with random forests, you can also change the number of random_candidates, that is, the number of fields chosen at random when the models in the forest are built. Using bigmler analyze --random-fields, the number of random_candidates will range from 1 to the number of fields in the origin dataset, and BigMLer will cross-validate the random forests to determine which random_candidates number gives the best performance.

bigmler analyze --dataset dataset/5357eb2637203f1668000004 \
                --random-fields

Please note that, in general, the exact choice of fields selected as random candidates might be more important than their actual number. However, in some marginal cases (e.g. datasets with a high number of noise features) the number of random candidates can impact tree performance significantly.

For any of these options (--features, --nodes and --random-fields) you can add the --predictions-csv flag to the bigmler analyze command. The results will then include a CSV file that stores the predictions obtained in the evaluations that gave the best score. The file content includes the data in your original dataset tagged by k-fold and the prediction and confidence obtained. This file will be placed in an internal folder of your chosen output directory.

bigmler analyze --dataset dataset/5357eb2637203f1668000004 \
                --features --output-dir my_features --predictions-csv

The output directory for this command is my_features and it will contain all the information about the resources generated when testing the different feature combinations organized in subfolders. The k-fold datasets’ IDs will be stored in an inner test directory. The IDs of the resources created when testing each combination of features will be stored in kfold1, kfold2, etc. folders inside the test directory. If the best-scoring prediction models are the ones in the kfold4 folder, then the predictions CSV file will be stored in a new folder named kfold4_pred.

Analyze subcommand Options

--cross-validation

Sets the k-fold cross-validation mode

--k-folds

Number of folds used in k-fold cross-validation (default is 5)

--features

Sets the smart selection features mode

--staleness INTEGER

Number of iterations with no improvement that is considered the limit for the analysis to stop (default is 5)

--penalty FLOAT

Coefficient used to penalize models with many features in the smart selection features mode (default is 0.001). Also used in node threshold selection (default is 0)

--optimize METRIC

Metric that is being optimized in the smart selection features mode or the node threshold search mode (default is accuracy)

--optimize-category CATEGORY

Category whose metric is being optimized in the smart selection features mode or the node threshold search mode (only for categorical models)

--nodes

Sets the node threshold search mode

--min-nodes INTEGER

Minimum number of nodes to start the node threshold search mode (default 3)

--max-nodes INTEGER

Maximum number of nodes to end the node threshold search mode (default 2000)

--nodes-step INTEGER

Step in the node threshold search iteration (default 50)

--exclude-features FEATURES

Comma-separated list of features in the dataset to be excluded from the features analysis

--score

Causes the training set to be run through the anomaly detector generating a batch anomaly score. Only used with the --remote flag.