Anomaly subcommand
The bigmler anomaly subcommand generates all the resources needed to buid
an anomaly detection model and/or predict the anomaly scores associated to your
test data. As usual, the simplest call
bigmler anomaly --train data/tiny_kdd.csv
uploads the data in the data/tiny_kdd.csv file and generates
the corresponding source, dataset and anomaly objects in BigML. You
can use any of the generated objects to produce new anomaly detectors.
For instance, you could set a subgroup of the fields of the generated dataset
to produce a different anomaly detector by using
bigmler anomaly --dataset dataset/53b1f71437203f5ac30004ed \
--anomaly-fields="-urgent"
that would exclude the field urgent from the anomaly detector
creation input fields. You can also change the number of top anomalies
enclosed in the anomaly detector list and the number of trees that the anomaly
detector iforest uses. The default values are 10 top anomalies and 128 trees
per iforest:
bigmler anomaly --dataset dataset/53b1f71437203f5ac30004ed \
--top-n 15 --forest-size 50
with this code, the anomaly detector is built using an iforest of 50 trees and will produce a list of the 15 top anomalies.
Similarly to the models and datasets, the generated anomaly detectors
can be shared using the --shared option, e.g.
bigmler anomaly --source source/53b1f71437203f5ac30004e0 \
--shared
will generate a secret link for both the created dataset and anomaly detector that can be used to share the resource selectively.
The anomaly detector can be used to assign an anomaly score to each new input data set. The anomaly score is a number between 0 (not anomalous) and 1 (highest anomaly). The command
bigmler anomaly --anomaly anomaly/53b1f71437203f5ac30005c0 \
--test data/test_kdd.csv
would produce a file anomaly_scores.csv with the anomaly score associated
to each input. When the command is executed, the anomaly detector
information is downloaded
to your local computer and the anomaly score predictions are computed locally,
with no more latencies involved. Just in case you prefer to use BigML
to compute the anomaly score predictions remotely, you can do so too
bigmler anomaly --anomaly anomaly/53b1f71437203f5ac30005c0 \
--test data/my_test.csv --remote
would create a remote source and dataset from the test file data,
generate a batch anomaly score also remotely and finally
download the result to your computer. If you prefer the result not to be
dowloaded but to be stored as a new dataset remotely, add --no-csv and
to-dataset to the command line. This can be specially helpful when
dealing with a high number of scores or when adding to the final result
the original dataset fields with --prediction-info full, that may result
in a large CSV to be created as output.
Similarly, you can split your data in train/test datasets to build the anomaly detector and create batch anomaly scores with the test portion of data
bigmler anomaly --train data/tiny_kdd.csv --test-split 0.2 --remote
or if you want to apply the anomaly detector on the same training data set to create a batch anomaly score, use:
bigmler anomaly --train data/tiny_kdd.csv --score --remote
To extract the top anomalies as a new dataset, or to exclude from the training
dataset the top anomalies in the anomaly detector, set the
--anomalies-dataset to ìn or out respectively:
bigmler anomaly --dataset dataset/53b1f71437203f5ac30004ed \
--anomalies-dataset out
will create a new dataset excluding the top anomalous instances according to the anomaly detector.
Anomaly Specific Subcommand Options
|
BigML anomaly Id |
|
Path to a file containing anomaly/ids. One anomaly per line (e.g., anomaly/4f824203ce80051) |
|
No anomaly detector will be generated |
|
Comma-separated list of fields that will be used in the anomaly detector construction |
|
Number of listed top anomalies |
|
Number of models in the anomaly detector iforest |
|
Path to a JSON file containing attributes (any of the updatable attributes described in the developers section ) to be used in the anomaly creation call |
|
Path to a JSON file containing the anomaly info |
|
Seed to generate deterministic anomalies |
|
Comma-separated list of fields to be kept for reference but not used in the anomaly detector bulding process |
|
Path to a JSON file containing attributes (any of the updatable attributes described in the developers section ) to be used in the anomaly score creation call |
|
Path to a JSON file containing attributes (any of the updatable attributes described in the developers section ) to be used in the batch anomaly score creation call |
|
Separates from the training
dataset the top anomalous
instances enclosed in the
top anomalies list and generates
a new dataset including them
( |