Reify subcommand
This subcommand extracts the information in the existing resources to determine
the arguments that were used when they were created,
and generates scripts that can be used to reproduce them. By default, the
scripts are written in Python. The usual starting
point for BigML resources is a source created from inline, local or remote
data. Thus, the script keeps analyzing the chain of calls that led to a
certain resource until the root source is found.
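The backtracking described above can be pictured with a small sketch. The `origins` mapping and the `creation_chain` helper are hypothetical: the mapping is written by hand instead of being retrieved from BigML, each resource is simplified to a single parent, and the code is not BigMLer's actual implementation.

```python
# Hypothetical illustration: each resource ID maps to the resource it was
# created from. Real resources would be retrieved through the BigML API.
origins = {
    "evaluation/1": "model/1",
    "model/1": "dataset/1",
    "dataset/1": "source/1",
    "source/1": None,  # root: created from inline, local or remote data
}


def creation_chain(resource_id, origins):
    """Return the resources from the root source down to resource_id."""
    chain = []
    current = resource_id
    while current is not None:
        chain.append(current)
        current = origins[current]
    # The generated script must create the root first, so reverse the walk.
    return list(reversed(chain))


print(creation_chain("evaluation/1", origins))
```

The walk stops at the entry whose origin is `None`, which plays the role of the root source.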
The simplest example would be:
bigmler reify --id source/55d77ba60d052e23430027bb
which will output:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Python code to reify source/5bd431db3980b574bb0145bf
Generated by BigMLer
"""
def main():
    from bigml.api import BigML
    api = BigML()
    source_url1 = "https://static.bigml.com/csv/iris.csv"
    source1 = api.create_source(source_url1)
    api.ok(source1)
    args = \
        {'fields': {'000000': {'name': 'sepal length', 'optype': 'numeric'},
                    '000001': {'name': 'sepal width', 'optype': 'numeric'},
                    '000002': {'name': 'petal length', 'optype': 'numeric'},
                    '000003': {'name': 'petal width', 'optype': 'numeric'},
                    '000004': {'name': 'species',
                               'optype': 'categorical',
                               'term_analysis': {'enabled': True}}}}
    source2 = api.update_source(source1, args)
    api.ok(source2)
if __name__ == "__main__":
    main()
According to this output, the source was created from a remote file
located at https://static.bigml.com/csv/iris.csv
and the type of each of its fields is described and stored to ensure
that they match the ones in the resource.
This script will be stored in the command output directory and named
reify.py (you can specify a different name and location using the
--output option).
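If you want to launch the command from Python, a minimal sketch could build the argument list first. The `reify_command` helper and the `my_dir/my_source.py` path are made up for this illustration; only the `--id` and `--output` options come from the text above.

```python
import shlex


def reify_command(resource_id, output=None):
    """Build the argument list for a `bigmler reify` call."""
    args = ["bigmler", "reify", "--id", resource_id]
    if output is not None:
        # --output changes the name and location of the generated script
        args += ["--output", output]
    return args


command = reify_command("source/55d77ba60d052e23430027bb",
                        output="my_dir/my_source.py")
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run(command, check=True)`, assuming BigMLer is installed and your credentials are set.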
Other resources will have more complex workflows and more user-given
attributes. Let’s see for instance the
script to generate an evaluation from a train/test split of a source that
was created using the
bigmler --train data/iris.csv --evaluate command:
bigmler reify --id evaluation/55d919850d052e234b000833
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Python code to reify evaluation/5be371a02774cb26da00061c
Generated by BigMLer
"""
def main():
    from bigml.api import BigML
    api = BigML()
    source1_file = "iris.csv"
    args = \
        {'category': 12,
         'description': 'Created using BigMLer',
         'fields': {'000000': {'name': 'sepal length', 'optype': 'numeric'},
                    '000001': {'name': 'sepal width', 'optype': 'numeric'},
                    '000002': {'name': 'petal length', 'optype': 'numeric'},
                    '000003': {'name': 'petal width', 'optype': 'numeric'},
                    '000004': {'name': 'species',
                               'optype': 'categorical',
                               'term_analysis': {'enabled': True}}},
         'tags': ['BigMLer', 'BigMLer_ThuNov0818_001323']}
    source2 = api.create_source(source1_file, args)
    api.ok(source2)
    args = \
        {'category': 12,
         'description': 'Created using BigMLer',
         'objective_field': {'id': '000004'},
         'tags': ['BigMLer', 'BigMLer_ThuNov0818_001323']}
    dataset1 = api.create_dataset(source2, args)
    api.ok(dataset1)
    args = \
        {'category': 12,
         'description': 'Created using BigMLer',
         'sample_rate': 0.8,
         'seed': 'BigML, Machine Learning made easy',
         'split_candidates': 32,
         'tags': ['BigMLer', 'BigMLer_ThuNov0818_001323']}
    model1 = api.create_model(dataset1, args)
    api.ok(model1)
    args = \
        {'category': 12,
         'description': 'Created using BigMLer',
         'fields_map': {'000001': '000001',
                        '000002': '000002',
                        '000003': '000003',
                        '000004': '000004'},
         'operating_kind': 'probability',
         'out_of_bag': True,
         'sample_rate': 0.8,
         'seed': 'BigML, Machine Learning made easy',
         'tags': ['BigMLer', 'BigMLer_ThuNov0818_001323']}
    evaluation1 = api.create_evaluation(model1, dataset1, args)
    api.ok(evaluation1)
if __name__ == "__main__":
    main()
As you can see, BigMLer has added default category,
description and tags attributes, has built the model on 80% of the data,
and has used the out_of_bag attribute so that the
evaluation uses the remaining part of the dataset as test data.
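The idea behind combining sample_rate, seed and out_of_bag can be sketched in plain Python. This is only an illustration of complementary, reproducible subsets: the `split_rows` helper is hypothetical and the shuffling used here is not BigML's sampling algorithm.

```python
import random


def split_rows(row_ids, sample_rate, seed):
    """Split row_ids into (in_bag, out_of_bag) deterministically by seed."""
    rng = random.Random(seed)
    shuffled = list(row_ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * sample_rate)
    in_bag = sorted(shuffled[:cut])       # rows used to train the model
    out_of_bag = sorted(shuffled[cut:])   # remaining rows, used as test data
    return in_bag, out_of_bag


train, test = split_rows(range(150), 0.8, "BigML, Machine Learning made easy")
print(len(train), len(test))  # 120 30
```

Because the same seed and sample rate always select the same rows, the evaluation can pick exactly the rows the model never saw.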
The bigmler reify command can also generate other types of
output depending on the
choice of the --language option. The available options are python
(the default), nb and whizzml.
The nb option will generate a Jupyter notebook file.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reified resource: evaluation/5be371a02774cb26da00061c"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remember to set your credentials in the BIGML_USERNAME and BIGML_API_KEY environment variables."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from bigml.api import BigML\n",
"api = BigML()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Add the inputs for the workflow"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"source1_file = \"iris.csv\""
]
},
...
]
}
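Since the notebook is plain JSON, its code cells can be inspected without Jupyter. The sketch below uses a shortened, hand-written copy of the structure shown above; the `code_cells` helper is hypothetical, and a real file could be loaded with `json.load` instead of the inline dictionary.

```python
# A shortened notebook with the same cell layout as the output shown above.
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["Reified resource: evaluation/5be371a02774cb26da00061c"]},
        {"cell_type": "code", "execution_count": None, "metadata": {},
         "outputs": [],
         "source": ["from bigml.api import BigML\n", "api = BigML()"]},
        {"cell_type": "code", "execution_count": None, "metadata": {},
         "outputs": [],
         "source": ["source1_file = \"iris.csv\""]},
    ]
}


def code_cells(notebook):
    """Return the source of every code cell, one string per cell."""
    return ["".join(cell["source"])
            for cell in notebook["cells"]
            if cell["cell_type"] == "code"]


for number, code in enumerate(code_cells(notebook), start=1):
    print("# cell %s" % number)
    print(code)
```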
We can also reify any
resource and obtain the WhizzML script that would recreate it using
--language whizzml:
;;Step 1
;;WhizzML for resource: BigMLer_ThuNov0818_001323
;;(5 fields (1 categorical, 4 numeric))
;;source/5be371949252734ec7000938
;;created by mmartin
(define source2
(update-and-wait source1
{"fields"
{"000000" {"name" "sepal length" "optype" "numeric"}
"000001" {"name" "sepal width" "optype" "numeric"}
"000002" {"name" "petal length" "optype" "numeric"}
"000003" {"name" "petal width" "optype" "numeric"}
"000004"
{"name" "species"
"optype" "categorical"
"term_analysis" {"enabled" true}}}
"category" 12
"description" "Created using BigMLer"
"tags" ["BigMLer" "BigMLer_ThuNov0818_001323"]}))
;;Step 2
;;WhizzML for resource: BigMLer_ThuNov0818_001323
;;(150 instances, 5 fields (1 categorical, 4 numeric))
;;dataset/5be371972774cb26d5000954
;;created by mmartin
(define dataset1
(create-and-wait-dataset {"source" source2
"description" "Created using BigMLer"
"category" 12
"tags" ["BigMLer" "BigMLer_ThuNov0818_001323"]
"objective_field" {"id" "000004"}}))
;;Step 3
;;WhizzML for resource: BigMLer_ThuNov0818_001323
;;(512-node, pruned, deterministic order, sample rate=0.8)
;;model/5be3719a2774cb26d60020fa
;;created by mmartin
(define model1
(create-and-wait-model {"dataset" dataset1
"description" "Created using BigMLer"
"category" 12
"tags" ["BigMLer" "BigMLer_ThuNov0818_001323"]
"sample_rate" 0.8
"seed" "BigML, Machine Learning made easy"
"split_candidates" 32}))
;;Step 4
;;WhizzML for resource: BigMLer_ThuNov0818_001323
;;(512-node, pruned, deterministic order, sample rate=0.8, operating kind=probability, sample rate=0.2, out of bag)
;;evaluation/5be371a02774cb26da00061c
;;created by mmartin
(define evaluation1
(create-and-wait-evaluation {"description" "Created using BigMLer"
"category" 12
"tags"
["BigMLer" "BigMLer_ThuNov0818_001323"]
"fields_map"
{"000001" "000001"
"000002" "000002"
"000003" "000003"
"000004" "000004"}
"sample_rate" 0.8
"seed" "BigML, Machine Learning made easy"
"operating_kind" "probability"
"out_of_bag" true
"dataset" dataset1
"model" model1}))
(define output-evaluation evaluation1)
Reify Subcommand Options
|
ID for the resource to be reified |
|
Language to be used for the script: python (default), nb or whizzml |
|
Path to the file where the script will be stored |
|
Causes the fields information to be added to the source arguments |
Execute subcommand
This subcommand creates and executes scripts in WhizzML (BigML’s automation language). With WhizzML you can program any specific workflow that involves Machine Learning resources like datasets, models, etc. You just write a script using the directives in the reference manual and upload it to BigML, where it will be available as one more resource in your dashboard. Scripts can also be shared and published in the gallery, so you can reuse other users’ scripts and execute them. These operations can also be done using the bigmler execute subcommand.
The simplest example is executing some basic code, like adding two numbers:
bigmler execute --code "(+ 1 2)" --output-dir simple_exe
With this command, bigmler will generate a script in BigML whose source code
is the one given as a string in the --code option. The script ID will
be stored in a file called scripts in the simple_exe
directory. After that, the
script will be executed, so a new resource called execution will be
created in BigML, and the corresponding ID will be stored in the
execution file of the output directory.
Similarly, the result of the execution will be stored
in whizzml_results.txt and whizzml_results.json
(in human-readable format and JSON respectively) in the
directory set in the --output-dir option. You can also use the code
stored in a file with the --code-file option.
Adding the --no-execute flag to the command will cause the process to
stop right after the script creation. You can also compile your code as a
library to be used in many scripts by setting the --to-library flag.
bigmler execute --code-file my_library.whizzml --to-library
Existing scripts can be referenced for execution with the --script option:
bigmler execute --script script/50a2bb64035d0706db000643
or the script ID can be read from a file:
bigmler execute --scripts simple_exe/scripts
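A file like simple_exe/scripts can also be read from your own code. The sketch below assumes one ID per line and that IDs follow the pattern of the examples in this document (resource type, a slash and 24 hexadecimal characters); both the pattern and the `read_script_ids` helper are assumptions made for illustration.

```python
import re

# Assumption: IDs look like script/50a2bb64035d0706db000643.
SCRIPT_ID = re.compile(r"^script/[0-9a-f]{24}$")


def read_script_ids(lines):
    """Return the valid script IDs found in an iterable of text lines."""
    ids = []
    for line in lines:
        candidate = line.strip()
        if SCRIPT_ID.match(candidate):
            ids.append(candidate)
    return ids


sample = ["script/50a2bb64035d0706db000643\n", "\n", "not an id\n"]
print(read_script_ids(sample))
```

To read a real file, pass the open file object instead of the sample list, for instance `read_script_ids(open("simple_exe/scripts"))`.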
The script we used as an example is very simple and needs no additional
parameter. But, in general, scripts
will have input parameters and output variables. The inputs define the script
signature and must be declared in order to create the script. The outputs
are optional and any variable in the script can be declared to be an output.
Both inputs and outputs can be declared using the --declare-inputs and
--declare-outputs options. These options must contain the path
to the JSON file where the information about the
inputs and outputs (respectively) is stored.
bigmler execute --code '(define addition (+ a b))' \
--declare-inputs my_inputs_dec.json \
--declare-outputs my_outputs_dec.json \
--no-execute
In this example, the my_inputs_dec.json file could contain
[{"name": "a",
"default": 0,
"type": "number"},
{"name": "b",
"default": 0,
"type": "number",
"description": "second number to add"}]
and my_outputs_dec.json
[{"name": "addition",
"type": "number"}]
so that the value of the addition variable would be returned as
output in the execution results.
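The two declaration files can be generated from Python rather than written by hand. This is a minimal sketch: the `write_json` helper is hypothetical and a temporary directory is used so that the example does not touch your working folder.

```python
import json
import os
import tempfile

inputs_declaration = [
    {"name": "a", "default": 0, "type": "number"},
    {"name": "b", "default": 0, "type": "number",
     "description": "second number to add"},
]
outputs_declaration = [{"name": "addition", "type": "number"}]


def write_json(path, content):
    """Write content to path as JSON and return the path."""
    with open(path, "w") as handler:
        json.dump(content, handler, indent=2)
    return path


work_dir = tempfile.mkdtemp()
inputs_path = write_json(os.path.join(work_dir, "my_inputs_dec.json"),
                         inputs_declaration)
outputs_path = write_json(os.path.join(work_dir, "my_outputs_dec.json"),
                          outputs_declaration)
print(inputs_path, outputs_path)
```

The resulting paths are the values you would give to --declare-inputs and --declare-outputs.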
Additionally, a script can import libraries. The list of libraries to be
used as imports can be added to the command with the option --imports
followed by a comma-separated list of library IDs.
Once the script has been created and its inputs and outputs declared, to
execute it you’ll need to provide a value for each input. This can be
done using --inputs, which also points to a JSON file where
each input should have its corresponding value.
bigmler execute --script script/50a2bb64035d0706db000643 \
--inputs my_inputs.json
where the my_inputs.json file would contain:
[["a", 1],
["b", 2]]
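If the values live in a Python dictionary, the list-of-pairs layout shown above can be produced with a short helper. The `execution_inputs` name is made up for this sketch.

```python
import json


def execution_inputs(values):
    """Turn a {name: value} dict into the [[name, value], ...] layout."""
    return [[name, value] for name, value in values.items()]


print(json.dumps(execution_inputs({"a": 1, "b": 2})))
# [["a", 1], ["b", 2]]
```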
For more details about the syntax to declare inputs and outputs, please refer to the Developers documentation.
You can also provide default configuration attributes
for the resources generated in an execution. Add the
--creation-defaults option followed by the path
to a JSON file that contains a dictionary whose keys are the resource types
to which the configuration defaults apply and whose values are the
configuration attributes set by default.
bigmler execute --code-file my_script.whizzml \
--creation-defaults defaults.json
For instance, if my_script.whizzml creates an ensemble from a remote
file:
(define file "s3://bigml-public/csv/iris.csv")
(define source (create-and-wait-source {"remote" file}))
(define dataset (create-and-wait-dataset {"source" source}))
(define ensemble (create-and-wait-ensemble {"dataset" dataset}))
and defaults.json contains
{
"source": {
"project": "project/54d9553bf0a5ea5fc0000016"
},
"ensemble": {
"number_of_models": 100, "sample_rate": 0.9
}
}
the source created by the script will be associated with the given project, and the ensemble will have 100 models and a 0.9 sample rate. If the source code in your script explicitly specifies a different value, that value takes precedence over these defaults.
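The precedence rule can be illustrated with a dictionary merge. The merge itself happens when the execution runs, so the `effective_arguments` helper below is only a hypothetical model of the rule, not the actual implementation, and the dataset ID is a placeholder.

```python
creation_defaults = {
    "source": {"project": "project/54d9553bf0a5ea5fc0000016"},
    "ensemble": {"number_of_models": 100, "sample_rate": 0.9},
}


def effective_arguments(resource_type, explicit_args, defaults):
    """Merge the defaults for resource_type with the arguments in the code."""
    merged = dict(defaults.get(resource_type, {}))
    merged.update(explicit_args)  # explicit values win over defaults
    return merged


# The script sets sample_rate itself, so only number_of_models is defaulted.
print(effective_arguments("ensemble",
                          {"dataset": "dataset/placeholder",
                           "sample_rate": 0.5},
                          creation_defaults))
```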
Execute Subcommand Options
|
WhizzML source code to be executed |
|
Path to the file that contains WhizzML source code |
|
Path to the JSON file with the default configurations for created resources. Please, see details in the API Developers documentation |
|
Path to the JSON file with the description of the input parameters. Please, see details in the API Developers documentation |
|
Path to the JSON file with the description of the script outputs. Please, see details in the API Developers documentation |
|
Path to a file that contains the location of the files to be embedded in the script as libraries |
|
BigML execution ID |
|
BigML execution JSON structure file |
|
Select executions tagged with EXECUTION_TAG |
|
Path to a file containing execution/ids. Just one execution per line (e.g., execution/50a20697035d0706da0004a4) |
|
Comma-separated list of library IDs to be included as imports in scripts or other libraries |
|
Path to the JSON file with the description of the execution inputs for a list of scripts |
|
Path to the JSON file with the description of the execution inputs. Please, see details in the API Developers documentation |
|
Path to a file containing libraries/ids. Just one library per line (e.g., library/50a20697035d0706da0004a4) |
|
BigML library Id. |
|
BigML library JSON structure file. |
|
Select libraries tagged with tag to be deleted |
|
Path to the JSON file with the names of the output parameters. Please, see details in the API Developers documentation |
|
BigML script Id. |
|
BigML script JSON structure file. |
|
Select script tagged with tag to be deleted |
|
Path to a file containing script/ids. Just one script per line (e.g., script/50a20697035d0706da0004a4). |
|
Boolean that causes the code to be compiled and stored as a library |
Whizzml subcommand
This subcommand creates packages of scripts and libraries in WhizzML
(BigML’s automation
language) based on the information provided by a metadata.json
file. These operations
can also be performed individually using the bigmler execute subcommand, but
bigmler whizzml reads the components of the package, and for each
component analyzes the corresponding metadata.json file to identify
the kind of code (script or library) that it contains and creates the
corresponding
resource in BigML. The metadata.json is expected to contain the
name, kind, description, inputs and outputs needed to create the script.
As an example,
{
"name": "Example of whizzml script",
"description": "Test example of a whizzml script that adds two numbers",
"kind": "script",
"source_code": "code.whizzml",
"inputs": [
{
"name": "a",
"type": "number",
"description": "First number"
},
{
"name": "b",
"type": "number",
"description": "Second number"
}
],
"outputs": [
{
"name": "addition",
"type": "number",
"description": "Sum of the numbers"
}
]
}
describes a script whose code is to be found in the code.whizzml file.
The script will have two inputs a and b and one output: addition.
In order to create this script, you can type the following command:
bigmler whizzml --package-dir my_package --output-dir creation_log
and bigmler will:
- look for the metadata.json file located in the my_package directory.
- parse the JSON, identify that it defines a script and look for its code in the code.whizzml file.
- create the corresponding BigML script resource, adding as arguments the ones provided in inputs, outputs, name and description.
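The first two steps above can be sketched locally in Python. The `describe_component` helper is hypothetical, the package is built in a temporary directory, and the last step (creating the resource in BigML) is left out because it needs credentials.

```python
import json
import os
import tempfile


def describe_component(package_dir):
    """Read metadata.json in package_dir and load its source code file."""
    with open(os.path.join(package_dir, "metadata.json")) as handler:
        metadata = json.load(handler)
    code_path = os.path.join(package_dir, metadata["source_code"])
    with open(code_path) as handler:
        source_code = handler.read()
    return {"kind": metadata["kind"],
            "name": metadata["name"],
            "inputs": metadata.get("inputs", []),
            "outputs": metadata.get("outputs", []),
            "source_code": source_code}


# Build a throwaway package to try the function on.
package_dir = tempfile.mkdtemp()
with open(os.path.join(package_dir, "metadata.json"), "w") as handler:
    json.dump({"name": "Example of whizzml script",
               "kind": "script",
               "source_code": "code.whizzml",
               "inputs": [{"name": "a", "type": "number"},
                          {"name": "b", "type": "number"}],
               "outputs": [{"name": "addition", "type": "number"}]},
              handler)
with open(os.path.join(package_dir, "code.whizzml"), "w") as handler:
    handler.write("(define addition (+ a b))")

component = describe_component(package_dir)
print(component["kind"], component["source_code"])
```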
Packages can contain more than one script. In this case, a nested directory
structure is expected. The metadata.json file for a package with many
components should include the name of the directories where these components
can be found:
{
"name": "Best k",
"description": "Library and scripts implementing Pham-Dimov-Nguyen k selection algorithm",
"kind": "package",
"components":[
"best-k-means",
"cluster",
"evaluation",
"batchcentroid"
]
}
In this example, each string in the components attribute list corresponds
to one directory where a new script or library (with its corresponding
metadata.json descriptor) is stored. Then, using bigmler whizzml
for this composite package will create each of the component scripts or
libraries. It will also handle dependencies, using the IDs of the created
libraries as imports for the scripts when needed. The metadata.json
that corresponds to a library is simpler than the one used for the script,
the difference being that kind in this case will be set to library
and no inputs or outputs are provided.
{
"name": "Best K-Means",
"description": "Best K-Means Clustering using the Pham, Dimov, and Nguyen Algorithm",
"kind": "library",
"source_code": "library.whizzml"
}
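The nested layout described above can be traversed with a short recursive function. This is a sketch under stated assumptions: `list_components` and `write_metadata` are hypothetical helpers, the package is a two-component toy built in a temporary directory, and the "Cluster" name is invented for the example.

```python
import json
import os
import tempfile


def list_components(package_dir):
    """Return (kind, directory) pairs for every script or library found."""
    with open(os.path.join(package_dir, "metadata.json")) as handler:
        metadata = json.load(handler)
    if metadata["kind"] != "package":
        return [(metadata["kind"], package_dir)]
    found = []
    for component in metadata.get("components", []):
        found.extend(list_components(os.path.join(package_dir, component)))
    return found


def write_metadata(directory, metadata):
    """Create directory if needed and store metadata.json in it."""
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, "metadata.json"), "w") as handler:
        json.dump(metadata, handler)


root = tempfile.mkdtemp()
write_metadata(root, {"name": "Best k", "kind": "package",
                      "components": ["best-k-means", "cluster"]})
write_metadata(os.path.join(root, "best-k-means"),
               {"name": "Best K-Means", "kind": "library",
                "source_code": "library.whizzml"})
write_metadata(os.path.join(root, "cluster"),
               {"name": "Cluster", "kind": "script",
                "source_code": "script.whizzml"})

for kind, directory in list_components(root):
    print(kind, os.path.basename(directory))
```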
To include a library in the list of imports of a script, the imports
attribute is used in the script’s metadata.json. The imports
should be the list of folders that contain each library source code and
metadata.
{
"name": "Compute Best K-means Batchcentroid",
"description": "Basic script to use the best-kmeans library",
"kind": "script",
"source_code": "script.whizzml",
"imports": ["../best-k-means"],
"inputs": [
{
"name": "dataset",
"type": "dataset-id",
"description": "Dataset ID"
},
{
"name": "cluster-args",
"type": "map",
"description": "Map of args for clustering (excluding dataset and k) for k search",
"default": {}
},
{
"name": "k-min",
"type": "number",
"description": "Minimum value of k for search"
},
{
"name": "k-max",
"type": "number",
"description": "Maximum value of k for search"
},
{
"name": "bestcluster-args",
"type": "map",
"description": "Map of args for clustering (excluding dataset and k) for optimal k",
"default": {}
},
{
"name": "clean",
"type": "boolean",
"description": "Delete intermediate objects created during computation"
},
{
"name": "logf",
"type": "boolean",
"description": "Generate log entries"
}
],
"outputs": [
{
"name": "best-batchcentroid",
"type": "string",
"description": "Batchcentroid ID"
}
]
}
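The entries in imports can be turned into package directories with a path join. The sketch assumes that each entry is relative to the folder holding the script's metadata.json, which the "../best-k-means" value suggests but the text does not state; `resolve_imports` is a hypothetical helper.

```python
import os


def resolve_imports(component_dir, metadata):
    """Return normalized directories for the entries in `imports`."""
    # Assumption: entries are relative to the component's own directory.
    return [os.path.normpath(os.path.join(component_dir, entry))
            for entry in metadata.get("imports", [])]


metadata = {"kind": "script", "imports": ["../best-k-means"]}
print(resolve_imports(os.path.join("my_package", "batchcentroid"), metadata))
```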
Also, existing scripts or libraries can be downloaded and stored in the
user’s file system following the structure and conventions needed to upload
them again to BigML. To do so, use the --from option followed by the
script or library ID, with --package-dir pointing to the storage folder.
bigmler whizzml --package-dir package_bck \
--from script/5a3ae0f14006833a070003a4
If the script is self-contained, the
previous command will create a package_bck folder where the
corresponding metadata.json file will store all the attributes, like the
name and description of the script, its inputs and outputs, the kind of
resource (script or library) and a source_code attribute that will
contain the name of the file where the source code will be placed.
Other complex scripts (and libraries) may not be self-contained and
will be importing functions defined in WhizzML libraries.
In that case, the package_bck folder will contain a list
of subdirectories, one per script or imported library. Each subdirectory will
contain the information about either the script or the library as described
in the previous paragraph. The --package-dir containing folder will in this
case also contain a metadata.json where the list of subfolders is stored
in its components attribute so that each of them can be generated and
imported correctly. It also contains the name and description of the
downloaded script and the kind attribute will be set to package.
Whizzml Subcommand Options
|
Directory that stores the package files |
|
It causes the subcommand to embed the libraries code in the package scripts instead of creating libraries and importing them |
|
ID of the script that is used as source for the package files |
Retrain subcommand
This subcommand can be used to retrain an existing modeling resource (model,
ensemble, deepnet, etc.) by adding new data to it. In BigML, resources are
immutable to ensure traceability, but at the same time they are reproducible.
Therefore, any model can be rebuilt using the data stored in a new consolidated
dataset or even from a list of existing datasets. That’s retraining the model
and the bigmler retrain
subcommand provides a simple way to do it.
In the basic use case, different parameters and model types are tried and evaluated until the best-performing model is found. Then you can call:
bigmler retrain --id model/5a3ae0f14006833a070003a4 --add data/iris.csv \
--output-dir retrain_directory
so that the data in your local data/iris.csv file is uploaded to the
platform and all the steps that led to your existing model are reproduced to
create a new merged dataset that will be used to retrain your model. The
command output will contain the URL that you need to call to ensure you
always use the latest version of your model. The URL will look like:
Instead of using the original model ID, you can choose to add a unique tag
to your modeling resource and use that as reference:
bigmler retrain --ensemble-tag my_ensemble --add data/iris.csv \
--output-dir retrain_directory
In this case, the resource to retrain is an ensemble that has been
previously tagged as my_ensemble. The bigmler retrain command will
look for the newest ensemble that contains that tag and, after uploading and
consolidating your data with the data previously used in the ensemble, it will
rebuild it. The URL that returns the latest version
of the ensemble will also use this tag as reference:
In a different scenario, you might want to retrain your model from a list of datasets, for instance training an anomaly detector using the data of the last 6 months. This means that you don’t want your data to be merged. Rather you would like to use a window over the list of available datasets.
bigmler retrain --ensemble-tag my_ensemble --add data/iris.csv \
--window-size 6 --output-dir retrain_directory
In this case, adding the --window-size option to your command will cause
the dataset created by uploading your new data to be added to the list of
datasets as a separate resource. Then the model will be rebuilt using the number
of datasets set in --window-size.
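The effect of the window can be pictured with a list slice. This is only an illustration: the `apply_window` helper and the dataset IDs are made up, and keeping the most recently added datasets is an assumption based on the "last 6 months" example.

```python
def apply_window(existing, new_dataset, window_size):
    """Return the datasets a rebuild would use after adding new_dataset."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    datasets = list(existing) + [new_dataset]
    # Assumption: the window keeps the most recently added datasets.
    return datasets[-window_size:]


# Hypothetical IDs: one dataset per month, oldest first.
monthly = ["dataset/month-%s" % month for month in range(1, 7)]
print(apply_window(monthly, "dataset/month-7", 6))
```

With six existing datasets and a window of 6, the oldest dataset drops out and the new one takes its place.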
The operations run by bigmler retrain are mainly run on BigML’s servers
using WhizzML scripts. These scripts are created in the user’s
account the first time you run the command, but they can also be recreated
by using the --upgrade flag in any bigmler retrain command call.
Retrain Subcommand Options
|
ID for the resource to be retrained |
|
Maximum number of datasets to be used |
|
Path to the file that contains the data to be added |
|
Causes the scripts that rebuild the models to be recreated |
|
Retrieves models that were tagged with tag |
|
Retrieves ensembles that were tagged with tag |
|
Retrieves clusters that were tagged with tag |
|
Retrieves anomalies that were tagged with tag |
|
Retrieves logistic regressions that were tagged with tag |
|
Retrieves linear regressions that were tagged with tag |
|
Retrieves topic models that were tagged with tag |
|
Retrieves time series that were tagged with tag |
|
Retrieves associations that were tagged with tag |
|
Retrieves deepnets that were tagged with tag |