Quick Start
Let’s see some basic usage examples. Check the installation and authentication sections below if you are not familiar with BigML.
Basics
You can create a new model just with
bigmler --train data/iris.csv
If you check your dashboard at BigML, you will see a new source, dataset, and model. Isn’t it magic?
You can generate predictions for a test set using
bigmler --train data/iris.csv --test data/test_iris.csv
You can also specify a file name to save the newly created predictions
bigmler --train data/iris.csv --test data/test_iris.csv --output predictions
If you do not specify the path to an output file, BigMLer will auto-generate
one for you under a .bigmler_outputs directory.
The new directory will be named after the current date and time
(e.g., MonNov1212_174715/predictions.csv). With the --prediction-info
flag set to brief, only the prediction result will be stored (the default is
normal and includes confidence information). You can also set it to
full if you prefer the result to be presented as a row with your test
input data followed by the corresponding prediction. To include a header row
in the prediction file you can set --prediction-header. For both the
--prediction-info full and --prediction-info brief options, if you
want to include a subset of the fields in your test file you can select them by
setting --prediction-fields to a comma-separated list of them. Then
bigmler --train data/iris.csv --test data/test_iris.csv \
--prediction-info full --prediction-header \
--prediction-fields 'petal length','petal width'
will include in the generated predictions file a headers row
petal length,petal width,species,confidence
and only the values of petal length and petal width will be shown
before the objective field prediction species.
A different objective field (the field that you want to predict) can be
selected using
bigmler --train data/iris.csv --test data/test_iris.csv \
--objective 'sepal length'
If you do not explicitly specify an objective field, BigML will default to the last column in your dataset. You can also select the field by its column number instead of its name (when --no-train-header is used, for instance).
Also, if your test file uses a particular field separator for its data,
you can tell BigMLer using --test-separator.
For example, if your test file uses the tab character as field separator the
call should be like
bigmler --train data/iris.csv --test data/test_iris.tsv \
--test-separator '\t'
The model’s predictions in BigMLer are based on the mean of the distribution
of training values in the predicted node. In case you would like to use the
median instead, you could just add the --median flag to your command
bigmler --train data/grades.csv --test data/test_grades.csv \
--median
Note that this flag can only be applied to regression models.
If you don’t provide a file name for your training source, BigMLer will try to read it from the standard input
cat data/iris.csv | bigmler --train
or you can also read the test info from there
cat data/test_iris.csv | bigmler --train data/iris.csv --test
BigMLer will try to use the locale of the model both to create a new source
(if the --train flag is used) and to interpret test data. In case
it fails, it will try en_US.UTF-8
or English_United States.1252 and a warning message will be printed.
If you want to change this behaviour you can specify your preferred locale
bigmler --train data/iris.csv --test data/test_iris.csv \
--locale "English_United States.1252"
If you check the .bigmler_outputs folder in your working directory
you will see that BigMLer creates a file with the
model ids that have been generated (e.g., FriNov0912_223645/models).
This file is handy if then you want to use those model ids to generate local
predictions. BigMLer also creates a file with the dataset id that has been
generated (e.g., TueNov1312_003451/dataset) and another one summarizing
the steps taken in the session progress: bigmler_sessions. You can also
store a copy of every created or retrieved resource in your output directory
(e.g., .bigmler_outputs/TueNov1312_003451/model_50c23e5e035d07305a00004f)
by setting the flag --store.
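The timestamped directory names shown above (for example FriNov0912_223645) look like an abbreviated weekday, month, day and two-digit year followed by the time of day. As a rough sketch, assuming that pattern holds (it is inferred from the examples in this section, not taken from BigMLer's code), a name of that shape could be rebuilt in Python. The helper session_dir_name is hypothetical.

```python
from datetime import datetime


def session_dir_name(moment: datetime) -> str:
    """Build a directory name shaped like MonNov1212_174715 (assumed pattern)."""
    # %a%b%d%y -> weekday, month, day of month, two-digit year
    # %H%M%S   -> hours, minutes, seconds
    return moment.strftime("%a%b%d%y_%H%M%S")


# Name for a session started on 12 November 2012 at 17:47:15
print(session_dir_name(datetime(2012, 11, 12, 17, 47, 15)))
```

The weekday and month abbreviations depend on the active locale; Python uses English ones unless the program changes it.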
Remote Predictions
All the predictions we saw in the previous section are computed locally on
your computer. BigMLer allows you to ask for a remote computation by adding
the --remote flag. Remote computations are treated as batch computations.
This means that your test data will be loaded in BigML as a regular source and
the corresponding dataset will be created and fed as input data to your
model to generate a remote batch prediction object. BigMLer will download
the predictions file created as a result of this batch prediction and
save it to local storage just as it did for local predictions
bigmler --train data/iris.csv --test data/test_iris.csv \
--remote --output my_dir/remote_predictions.csv
This command will create a source, dataset and model for your training data,
a source and dataset for your test data and a batch prediction using the model
and the test dataset. The results will be stored in the
my_dir/remote_predictions.csv file. If you prefer the result not to be
downloaded but to be stored as a new dataset remotely, add --no-csv and
--to-dataset to the command line. This can be especially helpful when
dealing with a high number of scores or when adding to the final result
the original dataset fields with --prediction-info full, which may result
in a large CSV being created as output. Other output configurations can be
set by using the --batch-prediction-attributes option pointing to a JSON
file that contains the desired attributes, like:
{"probabilities": true,
"all_fields": true}
In case you prefer BigMLer to issue
one-by-one remote prediction calls, you can use the --no-batch flag
bigmler --train data/iris.csv --test data/test_iris.csv \
--remote --no-batch
External Connectors
Data can be uploaded to BigML from local and remote public files, as you will
see in the sources section. It can also be extracted
from an external database manager like PostgreSQL, MySQL, Elasticsearch or
SQL Server. An externalconnector resource can be created in BigML to use it
as a data feed.
bigmler connector --host my_data.hostname.com \
--port 1234 \
--engine postgresql \
--user my_username \
--password my_password \
--database my_database \
--output-dir out
This command will generate the externalconnector and the corresponding
external connector ID will be stored in the external_connector file of
your out directory. Using this ID as reference and the query of choice
when creating a source in BigML, you will be able to connect and upload
data to the platform.
Remote Sources
You can create models using remote sources as well. You just need a valid URL that points to your data. BigML recognizes a growing list of schemes (http, https, s3, azure, odata, etc.). For example
bigmler --train https://test:test@static.bigml.com/csv/iris.csv
bigmler --train "s3://bigml-public/csv/iris.csv?access-key=[your-access-key]&secret-key=[your-secret-key]"
bigmler --train "azure://csv/diabetes.csv?AccountName=bigmlpublic"
bigmler --train 'odata://api.datamarket.azure.com/www.bcn.cat/BCNOFFERING0005/v1/CARRegistration?$top=100'
Also, you can use an existing connector to an external source (see the external connectors section). The connector ID and the particular query must be placed in a JSON file:
bigmler --train my_connector.json
where the JSON file should contain the following structure:
{"source": "postgresql",
"externalconnector_id": "51901f4337203f3a9a000215",
"query": "select * from my_table"}
Can you imagine how powerful this feature is? You can create predictive models for huge amounts of data without using your local CPU, memory, disk or bandwidth. Welcome to the cloud!!!
To learn more about other sources and options, please check the Source subcommand.
Ensembles
You can also easily create ensembles. For example, using bagging is as easy as
bigmler --train data/iris.csv --test data/test_iris.csv \
--number-of-models 10 --sample-rate 0.75 --replacement \
--tag my_ensemble
To create a random decision forest just use the --randomize option
bigmler --train data/iris.csv --test data/test_iris.csv \
--number-of-models 10 --sample-rate 0.75 --replacement \
--tag my_random_forest --randomize
The fields to choose from will be randomized at each split, creating a random decision forest whose models, when used together, will improve on the prediction performance of the individual models.
To create a boosted trees ensemble use the --boosting option
bigmler --train data/iris.csv --test data/test_iris.csv \
--boosting --tag my_boosted_trees
or add the --boosting-iterations limit
bigmler --train data/iris.csv --test data/test_iris.csv \
--boosting-iterations 10 --sample-rate 0.75 --replacement \
--tag my_boosted_trees
Once you have an existing ensemble, you can use it to predict. You can do so with the command
bigmler --ensemble ensemble/51901f4337203f3a9a000215 \
--test data/test_iris.csv
Or if you want to evaluate it
bigmler --ensemble ensemble/51901f4337203f3a9a000215 \
--test data/iris.csv --evaluate
There are some more advanced options that can help you build local predictions
with your ensembles.
When the number of local models becomes quite large, holding all the models in
memory may exhaust your resources. To avoid this problem you can use the
--max-batch-models flag, which controls how many local models are held
in memory at the same time
bigmler --train data/iris.csv --test data/test_iris.csv \
--number-of-models 10 --sample-rate 0.75 --max-batch-models 5
The predictions generated when using this option will be stored in one file per model, named after the model’s id (e.g. model_50c23e5e035d07305a00004f__predictions.csv). Each line contains the prediction, its confidence, the node’s distribution and the node’s total number of instances. The default value for --max-batch-models is 10.
When using ensembles, the models’ predictions are combined to issue a final
prediction. There are several different methods to build the combination.
You can choose plurality, confidence weighted, probability weighted
or threshold using the --method flag
bigmler --train data/iris.csv --test data/test_iris.csv \
--number-of-models 10 --sample-rate 0.75 \
--method "confidence weighted"
For classification ensembles, the combination is made by majority vote:
plurality weights each model’s prediction as one vote,
confidence weighted uses confidences as weight for the prediction,
probability weighted uses the probability of the class in the distribution
of classes in the node as weight, and threshold uses an integer number
as threshold and a class name to issue the prediction: if the votes for
the chosen class reach the threshold value, then the class is predicted
and plurality for the rest of predictions is used otherwise
bigmler --train data/iris.csv --test data/test_iris.csv \
--number-of-models 10 --sample-rate 0.75 \
--method threshold --threshold 4 --class 'Iris-setosa'
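To make the classification rules above more concrete, here is a minimal sketch of plurality, confidence weighted and threshold voting. It is an illustration written for this section, not BigMLer's implementation: combine_votes and its arguments are hypothetical names, each model's prediction is reduced to a (class, confidence) pair, and probability weighted is left out because it needs the node distributions.

```python
from collections import defaultdict


def combine_votes(predictions, method="plurality", threshold=None, category=None):
    """Combine the (class, confidence) pairs issued by the models of an ensemble."""
    if method == "threshold":
        # Predict the chosen class as soon as it gathers enough votes ...
        votes_for_class = sum(1 for label, _ in predictions if label == category)
        if votes_for_class >= threshold:
            return category
        # ... otherwise use plurality on the rest of the predictions.
        predictions = [pair for pair in predictions if pair[0] != category]
        method = "plurality"
    totals = defaultdict(float)
    for label, confidence in predictions:
        # plurality: one vote per model; confidence weighted: vote = confidence
        totals[label] += 1.0 if method == "plurality" else confidence
    if not totals:
        raise ValueError("no predictions left to combine")
    # Ties are resolved in favour of the class seen first (a choice of this sketch)
    return max(totals, key=totals.get)


votes = [("Iris-versicolor", 0.4), ("Iris-versicolor", 0.3),
         ("Iris-versicolor", 0.35), ("Iris-setosa", 0.95), ("Iris-setosa", 0.9)]
print(combine_votes(votes))
print(combine_votes(votes, method="confidence weighted"))
print(combine_votes(votes, method="threshold", threshold=2, category="Iris-setosa"))
```

With these sample votes, three unsure models outvote two confident ones under plurality, while confidence weighting reverses the result.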
For regression ensembles, the predicted values are averaged: plurality
again weights each predicted value as one,
confidence weighted weights each prediction according to the associated
error and probability weighted gives the same results as plurality.
As in the model’s case, you can base your prediction on the median of the
predicted node’s distribution by adding --median to your BigMLer command.
It is also possible to gradually enlarge the number of models that build your
prediction. You can build more than one ensemble for the same test data and
combine the votes of all of them by using the --combine-votes flag
followed by the comma-separated list of directories where predictions are
stored. For instance
bigmler --train data/iris.csv --test data/test_iris.csv \
--number-of-models 20 --sample-rate 0.75 \
--output ./dir1/predictions.csv
bigmler --dataset dataset/50c23e5e035d07305a000056 \
--test data/test_iris.csv --number-of-models 20 \
--sample-rate 0.75 --output ./dir2/predictions.csv
bigmler --combine-votes ./dir1,./dir2
would generate a set of 20 prediction files, one for each model, in ./dir1,
a similar set in ./dir2 and combine all of them to generate the final
prediction.
Making your Dataset and Model public or sharing it privately
Creating a model and making it public in BigML’s gallery is as easy as
bigmler --train data/iris.csv --white-box
If you just want to share it as a black-box model just use
bigmler --train data/iris.csv --black-box
If you also want to make public your dataset
bigmler --train data/iris.csv --public-dataset
You can also share your datasets, models and evaluations privately with
whomever you choose by generating a private link. The --shared flag will
create such a link
bigmler --dataset dataset/534487ef37203f0d6b000894 --shared --no-model
and the link will be listed in the output of the command
bigmler --dataset dataset/534487ef37203f0d6b000894 --shared --no-model
[2014-04-18 09:29:27] Retrieving dataset. https://bigml.com/dashboard/dataset/534487ef37203f0d6b000894
[2014-04-18 09:29:30] Updating dataset. https://bigml.com/dashboard/dataset/534487ef37203f0d6b000894
[2014-04-18 09:29:30] Shared dataset link. https://bigml.com/shared/dataset/8VPwG7Ny39g1mXBRD1sKQLuHrqE
or can also be found in the information panel for the resource through the web interface.
Descriptive information
Before making your model public, you will probably want to add a name, a category, a description, and tags to your resources. This is easy too. For example
bigmler --train data/iris.csv --name "My model" --category 6 \
--description data/description.txt --tag iris --tag my_tag
Please note:
You can get a full list of BigML category codes here.
Descriptions are provided in a text file that can also include markdown.
Many tags can be added to the same resource.
Use --no_tag if you do not want default BigMLer tags to be added.
BigMLer will add the name, category, description, and tags to all the newly created resources in each request.
Projects
Each resource created in BigML can be associated to a project. Projects are
intended for organizational purposes, and BigMLer can create projects
each time a source is created using a --project
option. For instance
bigmler --train data/iris.csv --project "my new project"
will first check for the existence of a project by that name. If it exists,
BigMLer will associate the source, dataset and model resources with this project.
If it doesn’t, a new project is created and then associated.
You can also associate resources to any project in your account
by specifying the option --project-id followed by its id
bigmler --train data/iris.csv --project-id project/524487ef37203f0d6b000894
Note: Once a source has been associated to a project, all the resources
derived from this source will be automatically associated to the same
project.
You can also create projects or update their properties by using the bigmler
project subcommand. In particular, when projects need
to be created in an organization, the --organization option has to
be added to inform about the ID of the organization where the project should
be created:
bigmler project --organization organization/524487ef37203f0d6b000594 \
--name "my new project"
Only allowed users can create projects in organizations. If you are not the
owner or an administrator, please check your permissions with them first.
You can learn more about organizations at the
API documentation.
You can also create resources in an organization’s project if your user
has the right privileges. In order to do that, you should add the
--org-project option followed by the organization’s project ID.
bigmler --train data/iris.csv \
--org-project project/524487ef37203f0d6b000894
Using the existing resources in BigML
You don’t need to create a model from scratch every time that you use BigMLer. You can generate predictions for a test set using a previously generated model, cluster, etc. The example shows how you would do that for a tree model:
bigmler --model model/50a1f43deabcb404d3000079 --test data/test_iris.csv
You can also use a number of models providing a file with a model/id per line
bigmler --models TueDec0412_174148/models --test data/test_iris.csv
Or all the models that were tagged with a specific tag
bigmler --model-tag my_tag --test data/test_iris.csv
The same can be extended to any other subcommand, like bigmler cluster,
using the correct option (--cluster cluster/50a1f43deabcb404d3000da2,
--clusters TueDec0412_174148/clusters and --cluster-tag my_tag).
Please check each subcommand’s available options for details.
You can also use a previously generated dataset to create a new model
bigmler --dataset dataset/50a1f441035d0706d9000371
You can also input the dataset from a file
bigmler --datasets iris_dataset
A previously generated source can also be used to generate a new dataset and model
bigmler --source source/50a1e520eabcb404cd0000d1
And test sources and datasets can also be referenced by id in new BigMLer requests for remote predictions
bigmler --model model/52af53a437203f1cfe0001f0 --remote \
--test-source source/52b0cbe637203f1d3e0015db
bigmler --model model/52af53a437203f1cfe0001f0 --remote \
--test-dataset dataset/52b0fb5637203f5c4f000018
Evaluations
BigMLer can also help you to measure the performance of your supervised models (decision trees, ensembles, deepnets, linear regressions and logistic regressions). The simplest way to build a model and evaluate it all at once is
bigmler --train data/iris.csv --evaluate
which will build the source, dataset and model objects for you using 80% of the data in your training file chosen at random. After that, the remaining 20% of the data will be run through the model to obtain the corresponding evaluation.
The same procedure is available for ensembles:
bigmler --train data/iris.csv --number-of-models 10 --evaluate
for deepnets
bigmler deepnet --train data/iris.csv --evaluate
for linear regressions
bigmler linear-regression --train data/iris.csv --evaluate
and for logistic regressions:
bigmler logistic-regression --train data/iris.csv --evaluate
You can use the same procedure with a previously existing source or dataset
bigmler --source source/50a1e520eabcb404cd0000d1 --evaluate
bigmler --dataset dataset/50a1f441035d0706d9000371 --evaluate
The results of an evaluation are stored in both txt and json files. Their contents follow the description given in the Developers guide, evaluation section, and vary depending on whether the model is a classification or a regression one.
Finally, you can also evaluate a preexisting model using a separate set of data stored in a file or a previous dataset
bigmler --model model/50a1f43deabcb404d3000079 --test data/iris.csv \
--evaluate
bigmler --model model/50a1f43deabcb404d3000079 \
--test-dataset dataset/50a1f441035d0706d9000371 --evaluate
As for predictions, you can specify a particular file name to store the evaluation in
bigmler --train data/iris.csv --evaluate --output my_dir/evaluation
Cross-validation
If you need cross-validation techniques to assess which parameters (like
the ones related to different kinds of pruning) can improve the quality of your
models, you can use the --cross-validation-rate flag to set the
part of your training data that will be held out for cross-validation. BigMLer
will use a Monte-Carlo cross-validation variant, building 2*n different
models, each of which is constructed from a subset of the training data,
randomly holding out n% of the instances. The held-out data will then be
used to evaluate the corresponding model. For instance, both
bigmler --train data/iris.csv --cross-validation-rate 0.02
bigmler --dataset dataset/519029ae37203f3a9a0002bf \
--cross-validation-rate 0.02
will hold out 2% of the training data to evaluate a model built upon the
remaining 98%. The evaluations will be averaged and the result saved
in json and human-readable formats in cross-validation.json and
cross-validation.txt respectively. Of course, in this kind of
cross-validation you can choose the number of evaluations yourself by
setting the --number-of-evaluations flag. You should just keep in mind
that it must be high enough to ensure low variance, for instance
bigmler --train data/iris.csv --cross-validation-rate 0.1 \
--number-of-evaluations 20
The --max-parallel-evaluations flag will help you limit the number of
parallel evaluation creation calls.
bigmler --train data/iris.csv --cross-validation-rate 0.1 \
--number-of-evaluations 20 --max-parallel-evaluations 2
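The sampling scheme behind this kind of cross-validation can be sketched in a few lines of Python. This is only an illustration under simple assumptions: rows are identified by their index, monte_carlo_splits and average are hypothetical helpers, and the models and evaluations themselves are built remotely by BigML, whose exact sampling may differ.

```python
import random


def monte_carlo_splits(n_rows, rate, n_evaluations, seed="bigmler-sketch"):
    """Yield one (train_rows, held_out_rows) pair of index lists per evaluation."""
    rng = random.Random(seed)
    n_held_out = max(1, round(n_rows * rate))
    for _ in range(n_evaluations):
        # Each round draws a fresh random hold-out set of the same size
        held_out = set(rng.sample(range(n_rows), n_held_out))
        train = [row for row in range(n_rows) if row not in held_out]
        yield train, sorted(held_out)


def average(scores):
    """Average the per-evaluation scores into a single figure."""
    return sum(scores) / len(scores)


# 150 rows, 10% held out, 20 evaluations (as in the command above)
splits = list(monte_carlo_splits(n_rows=150, rate=0.1, n_evaluations=20))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Because every round samples independently, the same row can be held out in several evaluations, which is why a higher number of evaluations helps to reduce variance.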
Configuring Datasets and Models
What if your raw data isn’t in the format that BigML expects? We have good news: you can use a number of options to configure your sources, datasets, and models.
Most resources in BigML contain information about the fields used in the
resource construction. Sources contain information about the name, label,
description and type of the fields detected in the data you upload.
In addition to that, datasets contain the information of the values that
each field contains, whether they have missing values or errors and even
if they are preferred fields or non-preferred (fields that are not expected
to convey real information to the model, like user IDs or constant fields).
This information is available in the “fields” attribute of each resource,
but BigMLer can extract it and build a CSV file with a summary of it.
bigmler --source source/50a1f43deabcb404d3010079 \
--export-fields fields_summary.csv \
--output-dir summary
By using this command, BigMLer will create a fields_summary.csv file
in a summary output directory. The file will contain a headers row and
the fields information available in the source, namely the field column,
field ID, field name, field label and field description of each field. If you
execute the same command on a dataset
bigmler --dataset dataset/50a1f43deabcb404d3010079 \
--export-fields fields_summary.csv \
--output-dir summary
you will also see the number of missing values and errors found in each field and an excerpt of the values and errors.
But then, imagine that you want to alter BigML’s default field names or the ones provided by the training set header and capitalize them, or even add a label or a description to each field. You can use several methods. Write a text file with one change per line and use it as follows
bigmler --train data/iris.csv --field-attributes fields.csv
where fields.csv would be
0,'SEPAL LENGTH','label for SEPAL LENGTH','description for SEPAL LENGTH'
1,'SEPAL WIDTH','label for SEPAL WIDTH','description for SEPAL WIDTH'
2,'PETAL LENGTH','label for PETAL LENGTH','description for PETAL LENGTH'
3,'PETAL WIDTH','label for PETAL WIDTH','description for PETAL WIDTH'
4,'SPECIES','label for SPECIES','description for SPECIES'
The number on the left in each line is the column number of the field in your source and is followed by the new field’s name, label and description.
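If the source has many fields, a file like fields.csv can be generated rather than typed. The following sketch assumes the single-quoted layout shown above and a list of field names in column order. The helper field_attributes_csv is a hypothetical name, and the label and description texts are placeholders.

```python
import csv
import io


def field_attributes_csv(names):
    """Return the text of a fields.csv that upper-cases every field name."""
    buffer = io.StringIO()
    # Column numbers stay bare, every text value is wrapped in single quotes
    writer = csv.writer(buffer, quotechar="'", quoting=csv.QUOTE_NONNUMERIC,
                        lineterminator="\n")
    for column, name in enumerate(names):
        new_name = name.upper()
        writer.writerow([column, new_name, f"label for {new_name}",
                         f"description for {new_name}"])
    return buffer.getvalue()


iris_fields = ["sepal length", "sepal width", "petal length", "petal width",
               "species"]
print(field_attributes_csv(iris_fields), end="")
```

The returned text can then be saved as fields.csv and passed to --field-attributes.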
Similarly you can also alter the auto-detect type behavior from BigML assigning specific types to specific fields
bigmler --train data/iris.csv --types types.txt
where types.txt would be
0, 'numeric'
1, 'numeric'
2, 'numeric'
3, 'numeric'
4, 'categorical'
Finally, the same summary file that could be built with the --export-fields
option can be used to modify the updatable information in sources
and datasets. Just edit the CSV file with your favourite editor setting
the new values for the fields and use:
bigmler --source source/50a1f43deabcb404d3010079 \
--import-fields summary/fields_summary.csv
to update the names, labels, descriptions or types of the fields with the ones
in the summary/fields_summary.csv file.
You could
also use this option to change the preferred attributes for each
of the fields. This transformation is made at the dataset level,
so in the prior code it will be applied once a dataset is created from
the referred source. You might as well act
on an existing dataset:
bigmler --dataset dataset/50a1f43deabcb404d3010079 \
--import-fields summary/fields_summary.csv
In order to update more detailed
source options, you can use the --source-attributes option pointing
to a file path that contains the configuration settings to be modified
in JSON format
bigmler --source source/52b8a12037203f48bc00000a \
--source-attributes my_dir/attributes.json --no-dataset
Let’s say this source has a text field with id 000001. The
attributes.json to change its text parsing mode to full field contents
would read
{"fields": {"000001": {"term_analysis": {"token_mode": "full_terms_only"}}}}
You can also reference a field by its column number in these JSON structures.
If the field to be modified is in the second column (column index starts at 0)
then the contents of the attributes.json file could equally be
{"fields": {"1": {"term_analysis": {"token_mode": "full_terms_only"}}}}
The source-attributes JSON can contain any of the updatable attributes
described in the
developers section.
You can specify the fields that you want to include in the dataset by naming
them explicitly
bigmler --train data/iris.csv \
--dataset-fields 'sepal length','sepal width','species'
or the fields that you want to include as predictors in the model
bigmler --train data/iris.csv --model-fields 'sepal length','sepal width'
You can also specify the chosen fields by adding or removing the ones you
choose to the list of preferred fields of the previous resource. Just prefix
their names with + or - respectively. For example,
you could create a model from an existing dataset using all their fields but
the sepal length by saying
bigmler --dataset dataset/50a1f441035d0706d9000371 \
--model-fields -'sepal length'
When evaluating, you can map the fields of the evaluated model to those of the test dataset. Write a file with the field column of the model and the field column of the dataset separated by a comma, one pair per line, and use the --fields-map flag to specify the name of the file
bigmler --dataset dataset/50a1f441035d0706d9000371 \
--model model/50a1f43deabcb404d3000079 --evaluate \
--fields-map fields_map.txt
where fields_map.txt would contain
0, 1
1, 0
2, 2
3, 3
4, 4
if the first two fields had been reversed.
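To see what such a map expresses, the sketch below parses the pairs and reorders one dataset row into the model's column order. It is only an illustration of the mapping: read_fields_map and row_in_model_order are hypothetical helpers, and BigMLer applies the map remotely rather than by rewriting rows.

```python
def read_fields_map(text):
    """Parse 'model column, dataset column' lines into a dict of integers."""
    mapping = {}
    for line in text.splitlines():
        if line.strip():
            model_column, dataset_column = (int(part) for part in line.split(","))
            mapping[model_column] = dataset_column
    return mapping


def row_in_model_order(dataset_row, mapping):
    """Reorder a dataset row so its values follow the model's column order."""
    return [dataset_row[mapping[column]] for column in sorted(mapping)]


fields_map = read_fields_map("0, 1\n1, 0\n2, 2\n3, 3\n4, 4\n")
# A test row whose first two columns are swapped with respect to the model
print(row_in_model_order([3.5, 5.1, 1.4, 0.2, "Iris-setosa"], fields_map))
```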
Finally, you can also tell BigML whether your training and test set come with a header row or not. For example, if both come without header
bigmler --train data/iris_nh.csv --test data/test_iris_nh.csv \
--no-train-header --no-test-header
Splitting Datasets
When following the usual procedure to evaluate your models you’ll need to
separate the available data into two sets: the training set and the test set. With
BigMLer you won’t need to create two separate physical files. Instead, you
can set a --test-split flag that will set the percentage of data used to
build the test set and leave the rest for training. For instance
bigmler --train data/iris.csv --test-split 0.2 --name iris --evaluate
will build a source with your entire file contents, create the corresponding
dataset and split it in two: a test dataset with 20% of instances and a
training dataset with the remaining 80%. Then, a model will be created based on
the training set data and evaluated using the test set. By default, the split is
deterministic, so every time you issue the same command you will get the
same split datasets. If you want to generate
different splits from a single dataset you can set the --seed option to a
different string in every call
bigmler --train data/iris.csv --test-split 0.2 --name iris \
--seed my_random_string_382734627364 --evaluate
Advanced Dataset management
As you can find in BigML’s API documentation on datasets, besides the basic name, label and description that we discussed in previous sections, there are many more configurable options in a dataset resource. As an example, to publish a dataset in the gallery and set its price you could use
{"private": false, "price": 120.4}
Similarly, you might want to add fields to your existing dataset by combining
some of its fields or simply tagging their rows. Using BigMLer, you can set the
--new-fields option to a file path that contains a JSON structure that
describes the fields you want to select or exclude from the original dataset,
or the ones you want to combine and
the Flatline expression to
combine them. This structure
must follow the rules of a specific language described in the Transformations
item of the developers
section
bigmler --dataset dataset/52b8a12037203f48bc00000a \
--new-fields my_dir/generators.json
To see a simple example, should you want to include all the fields but the
one with id 000001 and add a new one with a label depending on whether
the value of the field sepal length is smaller than 1,
you would write in generators.json
{"all_but": ["000001"], "new_fields": [{"name": "new_field", "field": "(if (< (f \"sepal length\") 1) \"small\" \"big\")"}]}
Or, as another example, to tag the outliers of the same field one could use
{"new_fields": [{"name": "outlier?", "field": "(if (within-percentiles? \"sepal length\" 0.5 0.95) \"normal\" \"outlier\")"}]}
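Escaping the quotes of a Flatline expression by hand is error-prone, so one option is to let a JSON library do it. The sketch below rebuilds the first generators.json example from a plain Python string. The variable names are made up for the illustration, and saving the result to my_dir/generators.json is left to the reader.

```python
import json

# The Flatline expression is written as a plain string; json.dumps adds the
# backslash escapes that the JSON file needs.
small_or_big = '(if (< (f "sepal length") 1) "small" "big")'

generators = {
    "all_but": ["000001"],
    "new_fields": [{"name": "new_field", "field": small_or_big}],
}

generators_json = json.dumps(generators)
print(generators_json)
```

Loading the text back with json.loads is a quick way to confirm that the file is valid JSON before passing it to --new-fields.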
You can also export the contents of a generated dataset by using the
--to-csv option. Thus,
bigmler --dataset dataset/52b8a12037203f48bc00000a \
--to-csv my_dataset.csv --no-model
will create a CSV file named my_dataset.csv in the default directory
created by BigMLer to place the command output files. If no file name is given,
the file will be named after the dataset id.
A dataset can also be generated as the union of several datasets using the
flag --multi-dataset. The datasets will be read from a file specified
in the --datasets option and the file must contain one dataset id per line.
bigmler --datasets my_datasets --multi-dataset --no-model
This syntax is used when all the datasets in the my_datasets file share
a common field structure, so the correspondence of the fields of all the
datasets is straightforward. In the general case, the multi-dataset will
inherit the field structure of the first component dataset.
If you want to build a multi-dataset with
datasets whose fields do not share the same column layout, you can specify
which fields correspond to the ones of the first dataset
by mapping the fields of the rest of the datasets to them.
The option --multi-dataset-attributes can point to a JSON
file that contains such a map. The command line syntax would then be
bigmler --datasets my_datasets --multi-dataset \
--multi-dataset-attributes my_fields_map.json \
--no-model
and for a simple case where the second dataset had flipped the first and second fields with respect to the first one, the file would read
{"fields_maps": {"dataset/53330bce37203f222e00004b": {"000000": "000001",
"000001": "000000"}}
}
where dataset/53330bce37203f222e00004b would be the id of the
second dataset in the multi-dataset.
Model Weights
To deal with imbalanced datasets, BigMLer offers three options: --balance,
--weight-field and --objective-weights.
For classification models, the --balance flag will cause all the classes
in the dataset to
contribute evenly. A weight will be assigned automatically to each
instance. This weight is
inversely proportional to the number of instances in the class it belongs to,
in order to ensure even distribution for the classes.
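As a small illustration of weights that are inversely proportional to class size, the sketch below derives one weight per class from a list of labels. The scaling (the most frequent class gets weight 1) is an assumption made for the example, and balance_weights is a hypothetical helper. BigML computes the actual weights when --balance is used.

```python
from collections import Counter


def balance_weights(labels):
    """Give each class a weight inversely proportional to its instance count."""
    counts = Counter(labels)
    largest = max(counts.values())
    # The most frequent class gets weight 1, rarer classes get larger weights
    return {label: largest / count for label, count in counts.items()}


labels = ["Iris-setosa"] * 90 + ["Iris-versicolor"] * 10
weights = balance_weights(labels)
print(weights)
```

With these weights the total weight of every class is the same, so the classes contribute evenly.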
You can also use a field in the dataset that contains the weight you would like
to use for each instance. Using the --weight-field option followed by
the field name or column number will cause BigMLer to use its data as instance
weight. This is valid for both regression and classification models.
The --objective-weights option is used in classification models to
tell BigMLer what weight is assigned to each class. The option accepts
a path to a CSV file that should contain one class,weight pair
per row
bigmler --dataset dataset/52b8a12037203f48bc00000a \
--objective-weights my_weights.csv
where the my_weights.csv file could read
Iris-setosa,5
Iris-versicolor,3
so that BigMLer would associate a weight of 5 to the Iris-setosa
class and 3 to the Iris-versicolor class. For additional classes
in the model, like Iris-virginica in the previous example,
weight 1 is used as default. All specified weights must be non-negative
numbers (with either integer or real values) and at least one of them must
be non-zero.
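The rules above (default weight 1 for unlisted classes, non-negative values, at least one non-zero weight) can be sketched as follows. The function read_objective_weights is a hypothetical helper written for this illustration, not BigMLer's own code.

```python
import csv
import io


def read_objective_weights(handle, classes, default=1.0):
    """Read class,weight rows and fill in the default for unlisted classes."""
    specified = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        name, value = row[0].strip(), float(row[1])
        if value < 0:
            raise ValueError("weights must be non-negative: %r" % (row,))
        specified[name] = value
    if specified and not any(specified.values()):
        raise ValueError("at least one weight must be non-zero")
    return {name: specified.get(name, default) for name in classes}


# An in-memory stand-in for the my_weights.csv file of the example.
my_weights = io.StringIO("Iris-setosa,5\nIris-versicolor,3\n")
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
weights = read_objective_weights(my_weights, classes)
print(weights)
```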
Predictions’ missing strategy
Sometimes the available data lacks some of the features our models use to
predict. On these occasions, BigML offers two different ways of handling
input data with missing values, that is to say, two missing strategies. Suppose
the path to the prediction reaches a split point that checks
the value of a field which is missing in your input data. With the
last prediction strategy, the final prediction will be the prediction for
the last node in the path before that point. With the proportional
strategy, it will be a weighted average of all the predictions for the final
nodes reached considering that both branches of the split are possible.
BigMLer adds the --missing-strategy option, which can be set either to
last or proportional to choose the behavior in such cases. Last
prediction is the default when this option is not set.
bigmler --model model/52b8a12037203f48bc00001a \
--missing-strategy proportional --test my_test.csv
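The difference between both strategies can be seen on a toy regression tree. The sketch below is an illustration under stated assumptions: the nested-dictionary tree layout, the field names x and y, and the use of instance counts as averaging weights are all hypothetical choices made for this example, not BigML's internal representation.

```python
def predict(node, row, strategy="last"):
    """Return the prediction of a toy tree for `row` with the given strategy."""
    return _walk(node, row, strategy)[0]


def _walk(node, row, strategy):
    # Returns a (prediction, instance count) pair for the part of the tree reached.
    if "field" not in node:  # leaf node
        return node["prediction"], node["count"]
    value = row.get(node["field"])
    if value is None:
        if strategy == "last":
            # Stop here: use the prediction of the node holding the split.
            return node["prediction"], node["count"]
        # Proportional: follow both branches and average by instance count.
        results = [_walk(child, row, strategy)
                   for child in (node["left"], node["right"])]
        total = sum(count for _, count in results)
        return sum(pred * count for pred, count in results) / total, total
    child = node["left"] if value <= node["threshold"] else node["right"]
    return _walk(child, row, strategy)


tree = {
    "field": "x", "threshold": 5, "prediction": 10.0, "count": 10,
    "left": {
        "field": "y", "threshold": 1, "prediction": 4.0, "count": 6,
        "left": {"prediction": 2.0, "count": 3},
        "right": {"prediction": 6.0, "count": 3},
    },
    "right": {"prediction": 19.0, "count": 4},
}

row = {"y": 0}  # the value of x is missing
last_value = predict(tree, row, "last")
proportional_value = predict(tree, row, "proportional")
print(last_value, proportional_value)
```

With the last prediction strategy the root prediction is returned, while the proportional strategy averages the two leaves that remain reachable when x is unknown.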
Models with missing splits
Another configuration argument that can change models when
the training data has instances with missing values in some of its features
is --missing-splits. By setting this flag, the model building algorithm
will be able to include the instances
that have missing values for the field used to split the data in each node
in one of the stemming branches. This will, obviously, affect also the
predictions given by the model for input data with missing values. Here’s an
example to build
a model using missing-splits and predict with it.
bigmler --dataset dataset/52b8a12037203f48bc00023b \
--missing-splits --test my_test.csv
Filtering Sources
Imagine that you have created a new source and that you want to create a specific dataset that keeps only the rows of the source that meet certain criteria. You can do that using a JSON expression as follows
bigmler --source source/50a2bb64035d0706db0006cc --json-filter filter.json
where filter.json is a file containing an expression like this
["<", 7.00, ["field", "000000"]]
or a LISP expression as follows
bigmler --source source/50a2bb64035d0706db0006cc --lisp-filter filter.lisp
where filter.lisp is a file containing an expression like this
(< 7.00 (field "sepal length"))
For more details, see BigML’s API documentation on filtering rows.
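To make the meaning of the JSON filter concrete, here is a small local evaluator for prefix expressions of that shape. It is a sketch only: the real filter is applied by BigML remotely and supports a much richer language, while this hypothetical evaluate function handles just the field accessor and four comparison operators.

```python
import json
import operator

# Only a handful of comparison operators, chosen for this illustration.
OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def evaluate(expression, row):
    """Evaluate a prefix expression against a row keyed by field id."""
    if not isinstance(expression, list):
        return expression  # a literal, such as 7.00
    head, *args = expression
    if head == "field":
        return row[args[0]]
    left, right = (evaluate(arg, row) for arg in args)
    return OPERATORS[head](left, right)


json_filter = json.loads('["<", 7.00, ["field", "000000"]]')
rows = [{"000000": 5.1}, {"000000": 7.7}]

# Keeps the rows where 7.00 is lower than the value of field 000000.
kept = [row for row in rows if evaluate(json_filter, row)]
print(kept)
```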
High number of Categories
In BigML there’s a limit in the number of categories of a categorical
objective field. This limit is set to ensure the quality of the resulting
models. This may become a restriction when dealing with
categorical objective fields with a high number of categories. To cope with
these cases, BigMLer offers the --max-categories option. If it is set to a number
lower than the mentioned limit, the existing categories will be organized in
subsets of that size. Then the original dataset will be copied many times, one
per subset, and its objective field will only keep the categories belonging to
each subset plus a generic ***** other ***** category that will summarize
the rest of categories. Then a model will be created from each dataset and
the test data will be run through them to generate partial predictions. The
final prediction will be extracted by choosing the class with highest
confidence from the distributions obtained for
each model’s prediction ignoring the ***** other ***** generic category.
For instance, to use the same iris.csv example, you could do
bigmler --train data/iris.csv --max-categories 1 \
--test data/test_iris.csv --objective species
This command would generate a source and dataset object, as usual, but then,
as the total number of categories is three and --max-categories is set to 1,
three more datasets will be created, one per category. After generating
the corresponding models, the test data will be run through them and their
predictions combined to obtain the final predictions file. The same procedure
would be applied if starting from a preexisting source or dataset using the
--source or --dataset options. Please note that the --objective
flag is mandatory in this case to ensure that the right categorical field
is selected as objective field.
The --method option accepts a new combine value to use this kind of
combination. You can use it if you need to create a new group of predictions
based on the same models produced in the first example. By providing the path
to the model ids file
bigmler --models my_dir/models --method combine \
--test data/new_test.csv
the new predictions will be created. Also, you could use the set of datasets
created in the first case as starting point. Their ids are stored in a
dataset_parts file that can be found in the output location
bigmler --dataset my_dir/dataset_parts --method combine \
--test data/test.csv
This command would cause a new set of models, one per dataset, to be generated and their predictions would be combined in a final predictions file.
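The procedure described in this section (splitting the categories in subsets, relabeling the rest as the generic category and combining the partial predictions) can be sketched as follows. This is a simplified illustration: the function names are hypothetical and each partial prediction is reduced to a single (category, confidence) pair rather than a full distribution.

```python
OTHER = "***** other *****"


def category_subsets(categories, max_categories):
    """Split the list of categories in consecutive subsets of the given size."""
    return [categories[start:start + max_categories]
            for start in range(0, len(categories), max_categories)]


def relabel(values, subset):
    """Keep the categories in `subset` and replace the rest with OTHER."""
    keep = set(subset)
    return [value if value in keep else OTHER for value in values]


def combine(partial_predictions):
    """Pick the (category, confidence) pair with highest confidence, ignoring OTHER."""
    candidates = [pair for pair in partial_predictions if pair[0] != OTHER]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[1])


categories = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
subsets = category_subsets(categories, 1)

# Objective field values for the first of the three derived datasets.
objective_values = ["Iris-setosa", "Iris-virginica", "Iris-setosa"]
first_dataset_values = relabel(objective_values, subsets[0])

# One partial prediction per model for a single test row.
final = combine([(OTHER, 0.9), ("Iris-setosa", 0.6), ("Iris-virginica", 0.3)])
print(subsets, first_dataset_values, final)
```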
Additional Features
Using local models to predict
Most of the previously described commands need the remote resources to
be downloaded to work. For instance, when you want to create a new
model from an existing dataset, BigMLer is going to download the dataset
JSON structure to extract the fields and objective field information,
and only then ask for the model creation. As mentioned,
the --store flag forces BigMLer to store the downloaded JSON
structures in local files inside your output directory. If you use that flag
when building a model with BigMLer, then the model is stored in your computer.
This model file contains all the information you need in order to make
new predictions, so you can use the
--model-file option to set the path to this file and predict
the value of your objective field for new input data with no reference at all
to your remote resources. You could even delete the original remote model and
work exclusively with the locally downloaded file
bigmler --model-file my_dir/model_532db2b637203f3f1a000136 \
--test data/test_iris.csv
The same is available for clusters
bigmler cluster --cluster-file my_dir/cluster_532db2b637203f3f1a000348 \
--test data/test_diabetes.csv
anomaly detectors
bigmler anomaly --anomaly-file my_dir/anomaly_532db2b637203f3f1a00053a \
--test data/test_kdd.csv
logistic regressions
bigmler logistic-regression \
--logistic-file my_dir/logisticregression_532db2b637203f3f1a00053a \
--test data/test_diabetes.csv
linear regressions
bigmler linear-regression \
--linear-file my_dir/linearregression_532db2b637203f3f1a00053a \
--test data/test_diabetes.csv
topic models
bigmler topic-model \
--topic-model-file my_dir/topicmodel_532db2b637203f3f1a00053a \
--test data/test_spam.csv
time series
bigmler time-series \
--time-series-file my_dir/timeseries_532db2b637203f5f1a00053a \
--horizon 20
deepnets
bigmler deepnets --deepnet-file my_dir/deepnet_532db2b637203f5f1a00053a \
--test data/test_diabetes.csv
Even for ensembles
bigmler --ensemble-file my_dir/ensemble_532db2b637203f3f1a00053b \
--test data/test_iris.csv
In this case, the models included in the ensemble are expected to be stored in the same directory as the local ensemble file. Otherwise, they are downloaded.
Resuming Previous Commands
Network connection failures or other external causes can interrupt a BigMLer command. To resume a command ended by an unexpected event you can issue
bigmler --resume
BigMLer keeps track of each command you issue in a .bigmler file and of
the output directory in .bigmler_dir_stack of your working directory.
Then --resume will recover the last issued command and try to continue
work from the point it was stopped. There’s also a --stack-level flag
bigmler --resume --stack-level 1
to allow resuming a previous command in the stack. In the example, the one before the last.
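The way --stack-level selects a command can be pictured as indexing a list of issued commands from its end. The sketch below only illustrates that indexing: the in-memory list and the function command_at_level are hypothetical, and it assumes that level 0 stands for the last command, which is what the example above suggests. It says nothing about how BigMLer actually stores its .bigmler file.

```python
def command_at_level(history, stack_level=0):
    """Return the command found `stack_level` positions before the last one."""
    if not 0 <= stack_level < len(history):
        raise IndexError("no command stored at that stack level")
    return history[-(stack_level + 1)]


# Commands in the order they were issued (oldest first).
history = [
    "bigmler --train data/iris.csv",
    "bigmler --train data/iris.csv --test data/test_iris.csv",
]

print(command_at_level(history))     # the last command
print(command_at_level(history, 1))  # the one before the last
```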
Building reports
The resources generated in the execution of a BigMLer command are listed in
the standard output by default,
but they can be summarized as well in a Gazibit format.
Gazibit is a platform where you can create interactive
presentations in a
flexible and dynamic way. Using BigMLer’s --reports gazibit option you’ll
be able to generate a Gazibit summary report of your newly created
resources. In
case you use also the --shared flag, a second template will be generated
where the links for the shared resources will be used. Both reports will be
stored in the reports subdirectory of your output directory, where all of
the files generated by the BigMLer command are. Thus,
bigmler --train data/iris.csv --reports gazibit --shared \
--output-dir my_dir
will generate two files: gazibit.json and gazibit_shared.json in a
reports subdirectory of your my_dir directory. In case you provide
your Gazibit token in the GAZIBIT_TOKEN environment variable, they will
also be uploaded to your account in Gazibit. Upload can be avoided by
using the --no-upload flag.
User Chosen Defaults
BigMLer will look for a bigmler.ini file in the working directory, where
users can personalize the default values they like for the most relevant flags.
The options should be written in a config style, e.g.
[BigMLer]
dev = true
resources_log = ./my_log.log
as you can see, under a [BigMLer] section the file should contain one line
per option. Dashes in flags are transformed to underscores in options.
The example would keep development mode on and would log all created
resources to my_log.log for any new bigmler command issued under the
same working directory if none of the related flags are set.
Naturally, the default values given in this file will be overridden by the corresponding flag value in the present command. To follow the previous example, if you use
bigmler --train data/iris.csv --resources-log ./another_log.log
in the same working directory, the value of the flag will take precedence and
resources will be logged in another_log.log. For boolean-valued flags,
such as --replacement, you’ll need to use the associated negative
flags to override the default behaviour. That is, if you
want to avoid storing the downloaded resource JSON information,
you should use the --no-store flag.
bigmler --train data/iris.csv --no-store
Each negative flag is paired with the flag it is opposed to, for instance --no-store as opposed to --store.
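The precedence between the defaults file and the command line can be sketched with Python's standard configparser module. This is an illustration of the behaviour described in this section, not BigMLer's actual implementation: the functions flag_to_option and effective_options are hypothetical, and negative flags are not modeled.

```python
import configparser

INI_TEXT = """
[BigMLer]
dev = true
resources_log = ./my_log.log
"""


def flag_to_option(flag):
    """Turn a flag such as --resources-log into the option name resources_log."""
    return flag.lstrip("-").replace("-", "_")


def effective_options(ini_text, command_flags):
    """Merge the file defaults with the flags given in the command."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    options = dict(parser["BigMLer"]) if parser.has_section("BigMLer") else {}
    for flag, value in command_flags.items():
        options[flag_to_option(flag)] = value  # the command line wins
    return options


options = effective_options(INI_TEXT, {"--resources-log": "./another_log.log"})
print(options)
```

Values read from the file are plain strings (for example "true"), so a real implementation would still need to convert them to the right type.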
Optional Arguments
General configuration
|
BigML’s username. If left unspecified, it will default to the
values of the |
|
BigML’s api_key. If left unspecified, it will default to the
values of the |
|
Activates debug level and shows log info for each https request |
Basic Functionality
|
Full path to a training set. It can be a remote URL to a (gzipped or compressed) CSV file. The protocol schemes can be http, https, s3, azure, odata |
|
Full path to a test set. A file containing the data that you want to input to generate predictions |
|
The column number of the Objective Field (the field that you want to predict) or its name |
|
Full path to a file to save predictions. If unspecified, it will default to an auto-generated file created by BigMLer. It overrides |
|
Directory where all the session files will be stored. It is overridden by |
|
Prediction method used: |
|
The pruning applied in building the model. Its allowed values are |
|
The strategy applied when predicting if a missing value is found in a model split. Its allowed values are |
|
Turns on the missing_splits flag in model creation. The model splits can include in one of its branches the data with missing values |
|
Turns on evaluation mode |
|
Retries command execution |
|
Level of the retried command in the stack |
|
Fraction of the training data held out for Monte-Carlo cross-validation |
|
Number of runs that will be used in cross-validation |
|
Maximum number of evaluations to create in parallel |
|
Project name for the project to be associated to newly created sources |
|
Project id for the project to be associated to newly created sources |
|
Project id for the project of an Organization |
|
Causes the output of a batch prediction, batch centroid or batch anomaly score not to be downloaded as a CSV file |
|
Causes the output of a batch prediction, batch centroid or batch anomaly score to be stored remotely as a new dataset |
|
Predictions for single models are returned based on the median of the distribution in the predicted node |
Meta information
|
Name for the resources in BigML. |
|
Category code. See full list. |
|
Path to a file with a description in plain text or markdown |
|
Tag to later retrieve new resources |
|
Puts BigMLer default tag if no other tag is given |
Data Configuration
|
The training set file has no header |
|
The test set file has no header |
|
Path to a file describing field attributes. One definition per line (e.g., 0,’Last Name’) |
|
Path to a file describing field types. One definition per line (e.g., 0, ‘numeric’) |
|
Path to a file describing test field attributes. One definition per line (e.g., 0,’Last Name’) |
|
Path to a file describing test field types. One definition per line (e.g., 0, ‘numeric’) |
|
Comma-separated list of field column numbers to include in the dataset |
|
Comma-separated list of input fields (predictors) to create the model |
|
Path to a file containing a JSON expression with attributes to be used as arguments (any of the updatable attributes described in the developers section) in create source calls |
|
Path to a file containing a JSON expression with attributes to be used as arguments (any of the updatable attributes described in the developers section) in create dataset calls |
|
Path to a file containing a JSON expression with attributes to be used as arguments (any of the updatable attributes described in the developers section) in create model calls |
|
Path to a file containing a JSON expression with attributes to be used as arguments (any of the updatable attributes described in the developers section) in create ensemble calls |
|
Path to a file containing a JSON expression with attributes to be used as arguments (any of the updatable attributes described in the developers section) in create evaluation calls |
|
Path to a file containing a JSON expression with attributes to be used as arguments (any of the updatable attributes described in the developers section) in create batch prediction calls |
|
Path to a file containing a JSON expression to filter the source |
|
Path to a file containing a LISP expression to filter the source |
|
Locale code string |
|
Path to a file containing the dataset to model fields map for evaluation |
|
Character used as test data field separator |
|
Include a headers row in the prediction file |
|
Comma-separated list of fields of the test file to be included in the prediction file |
|
Sets the maximum number of categories that will be used in a dataset. When more categories are found, new datasets are generated to analyze the remaining categories |
|
Path to a file containing a JSON expression used to generate a new dataset with new fields created via Flatline (https://github.com/bigmlcom/flatline) by combining or setting their values |
|
Maximum number of nodes to grow the tree with |
|
Automatically balance data to treat all classes evenly |
|
Field name or column number that contains the weights to be used for each instance |
|
Creates a secret link for every dataset, model or evaluation used in the command |
|
Report formats: “gazibit” |
|
Disables reports upload |
|
Sets the evaluation mode that uses the list of test datasets and extracts one each time to test the model built with the rest of them (k-fold cross-validation) |
|
Character used as separator in multi-valued arguments (default is comma) |
|
Turns off the missing_splits flag in model creation. |
Remote Resources
|
BigML source Id |
|
BigML dataset Id |
|
Path to a file containing a dataset Id |
|
BigML model Id |
|
Path to a file containing model/ids. One model per line (e.g., model/4f824203ce80053) |
|
BigML ensemble Id |
|
Path to a file containing ensembles Ids |
|
BigML test source Id (only for remote predictions) |
|
BigML test dataset Id (only for remote predictions) |
|
Path to the file that contains datasets ids used in evaluations, one id per line. |
|
BigML source Id |
|
BigML dataset Id |
|
Computes predictions remotely (in batch mode by default) |
|
Remote predictions are computed individually |
|
Ensemble’s local predictions are computed storing the predictions of each model in a separate local file before combining them (the default is --fast, which keeps each model’s prediction in memory) |
|
Retrieve models that were tagged with tag |
|
Retrieve ensembles that were tagged with tag |
Ensembles
|
Number of models to create |
|
Sample rate to use (a float between 0.01 and 1) |
|
Use replacement when sampling |
|
Max number of models to create in parallel |
|
Max number of local models to be predicted from in parallel. For ensembles with more models than this number, predictions are stored in files as they are computed, and are eventually retrieved and combined |
|
Use a random set of fields to split on |
|
Combines the votes of models generated in a list of directories |
|
Ensemble sampling rate for bagging |
|
Value used as seed in ensembles random selections |
|
Don’t use replacement when bagging |
|
Create a boosted ensemble |
|
Maximum number of iterations used in boosted ensembles. |
|
The portion of the dataset that will be held out for testing at the end of every iteration in boosted ensembles (between 0 and 1) |
|
Causes the out of bag samples not to be tested after every iteration in boosted ensembles. |
|
It controls how aggressively the boosting algorithm will fit the data in boosted ensembles (between 0 and 1) |
|
Causes the out of bag samples not to be tested after every iteration to choose the gradient step size in boosted ensembles. |
If you are not choosing to create an ensemble, make sure that you tag your models conveniently so that you can then retrieve them later to generate predictions.
Public Resources
|
Makes newly created dataset public |
|
Makes newly created model a public black-box |
|
Makes newly created model a public white-box |
|
Sets the price for a public model |
|
Sets the price for a public dataset |
|
Sets the credits consumed by prediction |
Notice that datasets and models will be made public without assigning any price to them.
Local Resources
|
Path to a JSON file containing the model info |
|
Path to a JSON file containing the ensemble info |
Fancy Options
|
Does not create a model. BigMLer will only create a source |
|
Does not create a model. BigMLer will only create a dataset |
|
Keeps a log of the resources generated in each command |
|
Shows the version number |
|
Turns on (1) or off (0) the verbosity. |
|
Clears the |
|
Stores every created or retrieved resource in your output directory |
BigMLer encodings and locale
All data uploaded to BigML (and used in BigMLer) is expected to be UTF-8
encoded. The data itself, besides its encoding,
can contain information in different languages. English is the default
language, but that can be set to a different value using --locale. Setting
the language determines the conventions for parsing number literals
(decimal separator), dates, etc.
Also, BigMLer will write information to your console and local files.
Most Operating Systems will also accept UTF-8 output, which is used
by default. However, Windows systems may need a different encoding.
You can specify this encoding
in the BIGML_SYS_ENCODING environment variable. If the variable is absent,
BigMLer will try to guess the system encoding.
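The lookup order described above can be sketched as follows. How BigMLer actually guesses the encoding is not detailed here, so falling back to locale.getpreferredencoding (and finally to UTF-8) is an assumption made for this illustration, as is the function name system_encoding.

```python
import locale
import os


def system_encoding(environ=None):
    """Return BIGML_SYS_ENCODING if set, otherwise a guess of the system encoding."""
    environ = os.environ if environ is None else environ
    return (environ.get("BIGML_SYS_ENCODING")
            or locale.getpreferredencoding(False)  # assumed way of guessing
            or "utf-8")


print(system_encoding({"BIGML_SYS_ENCODING": "cp1252"}))
print(system_encoding({}))
```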