Reify subcommand

This subcommand extracts the information in existing resources to determine the arguments that were used when they were created, and generates scripts that can be used to reproduce them. By default, the scripts are written in Python. The usual starting point for BigML resources is a source created from inline, local or remote data. Thus, the subcommand keeps analyzing the chain of calls that led to a certain resource until the root source is found.

The simplest example would be:

bigmler reify --id source/55d77ba60d052e23430027bb

that will output:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Python code to reify source/5bd431db3980b574bb0145bf

Generated by BigMLer
"""


def main():

    from bigml.api import BigML
    api = BigML()
    source_url1 = "https://static.bigml.com/csv/iris.csv"
    source1 = api.create_source(source_url1)
    api.ok(source1)

    args = \
        {'fields': {'000000': {'name': 'sepal length', 'optype': 'numeric'},
                    '000001': {'name': 'sepal width', 'optype': 'numeric'},
                    '000002': {'name': 'petal length', 'optype': 'numeric'},
                    '000003': {'name': 'petal width', 'optype': 'numeric'},
                    '000004': {'name': 'species',
                               'optype': 'categorical',
                               'term_analysis': {'enabled': True}}}}
    source2 = api.update_source(source1, args)
    api.ok(source2)

if __name__ == "__main__":
    main()

According to this output, the source was created from a remote file located at https://static.bigml.com/csv/iris.csv, and the type of each of its fields is described and stored to ensure that it matches the one in the resource.

This script will be stored in the command output directory and named reify.py (you can specify a different name and location using the --output option).
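If you call bigmler reify from a Python program, the command line can be assembled from the options documented in this section. The following is a minimal sketch, not part of BigMLer: the reify_command helper and the my_dir/my_reify.py path are hypothetical, and the code only builds the argument list without running it.

```python
import shlex


def reify_command(resource_id, language=None, output=None):
    """Build the argument list for a `bigmler reify` call.

    `reify_command` is a hypothetical helper, not part of BigMLer. It only
    uses the --id, --language and --output options documented here.
    """
    args = ["bigmler", "reify", "--id", resource_id]
    if language is not None:
        args += ["--language", language]
    if output is not None:
        args += ["--output", output]
    return args


# Hypothetical output path, used only for illustration
command = reify_command("source/55d77ba60d052e23430027bb",
                        output="my_dir/my_reify.py")
# Shell-quoted version of the command, handy for logging
command_line = shlex.join(command)
```

The resulting list can be handed to subprocess.run(command, check=True) on a machine where bigmler is installed and your credentials are set.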

Other resources will have more complex workflows and more user-given attributes. Let’s see for instance the script to generate an evaluation from a train/test split of a source that was created using the bigmler --train data/iris.csv --evaluate command:

bigmler reify --id evaluation/55d919850d052e234b000833
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Python code to reify evaluation/5be371a02774cb26da00061c

Generated by BigMLer
"""


def main():

    from bigml.api import BigML
    api = BigML()
    source1_file = "iris.csv"
    args = \
        {'category': 12,
         'description': 'Created using BigMLer',
         'fields': {'000000': {'name': 'sepal length', 'optype': 'numeric'},
                    '000001': {'name': 'sepal width', 'optype': 'numeric'},
                    '000002': {'name': 'petal length', 'optype': 'numeric'},
                    '000003': {'name': 'petal width', 'optype': 'numeric'},
                    '000004': {'name': 'species',
                               'optype': 'categorical',
                               'term_analysis': {'enabled': True}}},
         'tags': ['BigMLer', 'BigMLer_ThuNov0818_001323']}
    source2 = api.create_source(source1_file, args)
    api.ok(source2)

    args = \
        {'category': 12,
         'description': 'Created using BigMLer',
         'objective_field': {'id': '000004'},
         'tags': ['BigMLer', 'BigMLer_ThuNov0818_001323']}
    dataset1 = api.create_dataset(source2, args)
    api.ok(dataset1)

    args = \
        {'category': 12,
         'description': 'Created using BigMLer',
         'sample_rate': 0.8,
         'seed': 'BigML, Machine Learning made easy',
         'split_candidates': 32,
         'tags': ['BigMLer', 'BigMLer_ThuNov0818_001323']}
    model1 = api.create_model(dataset1, args)
    api.ok(model1)

    args = \
        {'category': 12,
         'description': 'Created using BigMLer',
         'fields_map': {'000001': '000001',
                        '000002': '000002',
                        '000003': '000003',
                        '000004': '000004'},
         'operating_kind': 'probability',
         'out_of_bag': True,
         'sample_rate': 0.8,
         'seed': 'BigML, Machine Learning made easy',
         'tags': ['BigMLer', 'BigMLer_ThuNov0818_001323']}
    evaluation1 = api.create_evaluation(model1, dataset1, args)
    api.ok(evaluation1)

if __name__ == "__main__":
    main()

As you can see, BigMLer has added default category, description and tags attributes, has built the model on 80% of the data (sample_rate set to 0.8), and has set the out_of_bag attribute in the evaluation so that the remaining 20% of the dataset is used as test data.
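The role of sample_rate, seed and out_of_bag can be pictured with a small local experiment. This is only an illustration under stated assumptions: Python's random module stands in for BigML's sampling, so the rows it selects will not be the ones BigML selects, and split_rows is a hypothetical helper. What it does show is that a fixed seed and rate produce a repeatable split, and that the out-of-bag rows are exactly the ones left out of the sample.

```python
import random


def split_rows(n_rows, sample_rate, seed):
    """Return (in_bag, out_of_bag) row indices for a deterministic sample.

    Illustration only: this is not BigML's sampling algorithm.
    """
    rng = random.Random(seed)
    in_bag = set(rng.sample(range(n_rows), round(n_rows * sample_rate)))
    out_of_bag = [row for row in range(n_rows) if row not in in_bag]
    return sorted(in_bag), out_of_bag


seed = "BigML, Machine Learning made easy"
# The iris dataset has 150 instances: 120 go to training, 30 to testing
train_rows, test_rows = split_rows(150, 0.8, seed)
# Calling it again with the same seed and rate gives the same split
same_split = split_rows(150, 0.8, seed) == (train_rows, test_rows)
```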

The bigmler reify command can also generate other types of output depending on the choice of the --language option. The available options are python (the default), nb and whizzml. The nb option will generate a Jupyter notebook file:

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reified resource: evaluation/5be371a02774cb26da00061c"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remember to set your credentials in the BIGML_USERNAME and BIGML_API_KEY environment variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bigml.api import BigML\n",
    "api = BigML()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Add the inputs for the workflow"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "source1_file = \"iris.csv\""
   ]
  },
  ...
 ]
}
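Since the notebook is a plain JSON document, its code can be inspected without opening Jupyter. The sketch below assumes the standard notebook layout shown above (a cells list whose entries have cell_type and source keys). The notebook_code helper is hypothetical and the sample notebook is a shortened copy of the generated one.

```python
import json


def notebook_code(notebook):
    """Return the source of every code cell, in order, as a list of strings."""
    return ["".join(cell["source"])
            for cell in notebook["cells"]
            if cell["cell_type"] == "code"]


# A shortened copy of the notebook above (only two of its cells)
notebook = json.loads("""
{"cells": [
  {"cell_type": "markdown", "metadata": {},
   "source": ["Add the inputs for the workflow"]},
  {"cell_type": "code", "execution_count": null, "metadata": {},
   "outputs": [],
   "source": ["from bigml.api import BigML\\n", "api = BigML()"]}
]}
""")
code_cells = notebook_code(notebook)
```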

We can also reify any resource and obtain the WhizzML script that would recreate it using --language whizzml:

;;Step 1
;;WhizzML for resource: BigMLer_ThuNov0818_001323
;;(5 fields (1 categorical, 4 numeric))
;;source/5be371949252734ec7000938
;;created by mmartin
(define source2
  (update-and-wait source1
                   {"fields"
                    {"000000" {"name" "sepal length" "optype" "numeric"}
                     "000001" {"name" "sepal width" "optype" "numeric"}
                     "000002" {"name" "petal length" "optype" "numeric"}
                     "000003" {"name" "petal width" "optype" "numeric"}
                     "000004"
                     {"name" "species"
                      "optype" "categorical"
                      "term_analysis" {"enabled" true}}}
                    "category" 12
                    "description" "Created using BigMLer"
                    "tags" ["BigMLer" "BigMLer_ThuNov0818_001323"]}))

;;Step 2
;;WhizzML for resource: BigMLer_ThuNov0818_001323
;;(150 instances, 5 fields (1 categorical, 4 numeric))
;;dataset/5be371972774cb26d5000954
;;created by mmartin
(define dataset1
  (create-and-wait-dataset {"source" source2
                            "description" "Created using BigMLer"
                            "category" 12
                            "tags" ["BigMLer" "BigMLer_ThuNov0818_001323"]
                            "objective_field" {"id" "000004"}}))

;;Step 3
;;WhizzML for resource: BigMLer_ThuNov0818_001323
;;(512-node, pruned, deterministic order, sample rate=0.8)
;;model/5be3719a2774cb26d60020fa
;;created by mmartin
(define model1
  (create-and-wait-model {"dataset" dataset1
                          "description" "Created using BigMLer"
                          "category" 12
                          "tags" ["BigMLer" "BigMLer_ThuNov0818_001323"]
                          "sample_rate" 0.8
                          "seed" "BigML, Machine Learning made easy"
                          "split_candidates" 32}))

;;Step 4
;;WhizzML for resource: BigMLer_ThuNov0818_001323
;;(512-node, pruned, deterministic order, sample rate=0.8, operating kind=probability, sample rate=0.2, out of bag)
;;evaluation/5be371a02774cb26da00061c
;;created by mmartin
(define evaluation1
  (create-and-wait-evaluation {"description" "Created using BigMLer"
                               "category" 12
                               "tags"
                               ["BigMLer" "BigMLer_ThuNov0818_001323"]
                               "fields_map"
                               {"000001" "000001"
                                "000002" "000002"
                                "000003" "000003"
                                "000004" "000004"}
                               "sample_rate" 0.8
                               "seed" "BigML, Machine Learning made easy"
                               "operating_kind" "probability"
                               "out_of_bag" true
                               "dataset" dataset1
                               "model" model1}))
(define output-evaluation evaluation1)

Reify Subcommand Options

--id RESOURCE_ID

ID for the resource to be reified

--language SCRIPTING_LANG

Language to be used for the script. The available options are python (the default), nb and whizzml

--output PATH

Path to the file where the script will be stored

--add-fields

Causes the fields information to be added to the source arguments

Execute subcommand

This subcommand creates and executes scripts in WhizzML (BigML’s automation language). With WhizzML you can program any specific workflow that involves Machine Learning resources like datasets, models, etc. You just write a script using the directives in the reference manual and upload it to BigML, where it will be available as one more resource in your dashboard. Scripts can also be shared and published in the gallery, so you can reuse other users’ scripts and execute them. These operations can also be done using the bigmler execute subcommand.

The simplest example is executing some basic code, like adding two numbers:

bigmler execute --code "(+ 1 2)" --output-dir simple_exe

With this command, bigmler will generate a script in BigML whose source code is the one given as a string in the --code option. The script ID will be stored in a file called scripts in the simple_exe directory. After that, the script will be executed, so a new resource called execution will be created in BigML, and the corresponding ID will be stored in the execution file of the output directory. Similarly, the result of the execution will be stored in whizzml_results.txt and whizzml_results.json (in human-readable format and JSON respectively) in the directory set in the --output-dir option. You can also use the code stored in a file with the --code-file option.

Adding the --no-execute flag to the command will cause the process to stop right after the script creation. You can also compile your code as a library to be used in many scripts by setting the --to-library flag.

bigmler execute --code-file my_library.whizzml --to-library

Existing scripts can be referenced for execution with the --script option:

bigmler execute --script script/50a2bb64035d0706db000643

or the script ID can be read from a file:

bigmler execute --scripts simple_exe/scripts

The script we used as an example is very simple and needs no additional parameter. But, in general, scripts will have input parameters and output variables. The inputs define the script signature and must be declared in order to create the script. The outputs are optional and any variable in the script can be declared to be an output. Both inputs and outputs can be declared using the --declare-inputs and --declare-outputs options. These options must contain the path to the JSON file where the information about the inputs and outputs (respectively) is stored.

bigmler execute --code '(define addition (+ a b))' \
                --declare-inputs my_inputs_dec.json \
                --declare-outputs my_outputs_dec.json \
                --no-execute

in this example, the my_inputs_dec.json file could contain

[{"name": "a",
  "default": 0,
  "type": "number"},
 {"name": "b",
  "default": 0,
  "type": "number",
  "description": "second number to add"}]

and my_outputs_dec.json

[{"name": "addition",
  "type": "number"}]

so that the value of the addition variable would be returned as output in the execution results.
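The two declaration files can also be produced from Python. The following is a minimal sketch using only the standard library. The check_declaration and write_declarations helpers are hypothetical, and the check only looks for the name and type keys used in the example above (BigML performs its own validation when the script is created).

```python
import json
import os
import tempfile

inputs_declaration = [
    {"name": "a", "default": 0, "type": "number"},
    {"name": "b", "default": 0, "type": "number",
     "description": "second number to add"},
]
outputs_declaration = [{"name": "addition", "type": "number"}]


def check_declaration(declaration):
    """Check that every entry has the `name` and `type` keys used above."""
    for entry in declaration:
        missing = {"name", "type"} - set(entry)
        if missing:
            raise ValueError(f"{entry} lacks {sorted(missing)}")


def write_declarations(directory):
    """Write both declaration files and return their paths."""
    paths = []
    for filename, declaration in (
            ("my_inputs_dec.json", inputs_declaration),
            ("my_outputs_dec.json", outputs_declaration)):
        check_declaration(declaration)
        path = os.path.join(directory, filename)
        with open(path, "w") as handler:
            json.dump(declaration, handler, indent=2)
        paths.append(path)
    return paths


# A temporary directory keeps the example from touching your files
declaration_paths = write_declarations(tempfile.mkdtemp())
```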

Additionally, a script can import libraries. The list of libraries to be used as imports can be added to the command with the option --imports followed by a comma-separated list of library IDs.

Once the script has been created and its inputs and outputs declared, you’ll need to provide a value for each input in order to execute it. This can be done using the --inputs option, which also points to a JSON file where each input is given its corresponding value.

bigmler execute --script script/50a2bb64035d0706db000643 \
                --inputs my_inputs.json

where the my_inputs.json file would contain:

[["a", 1],
 ["b", 2]]

For more details about the syntax to declare inputs and outputs, please refer to the Developers documentation.
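If the input values live in a Python dictionary, they can be converted to the list of name and value pairs used in my_inputs.json. This is a small sketch, and inputs_as_pairs is a hypothetical helper name.

```python
import json


def inputs_as_pairs(values):
    """Turn {"a": 1, "b": 2} into a list of [name, value] pairs."""
    return [[name, value] for name, value in values.items()]


# Text to be saved as the my_inputs.json file
inputs_json = json.dumps(inputs_as_pairs({"a": 1, "b": 2}))
```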

You can also provide default configuration attributes for the resources generated in an execution. Add the --creation-defaults option followed by the path to a JSON file that contains a dictionary whose keys are the resource types to which the configuration defaults apply and whose values are the configuration attributes set by default.

bigmler execute --code-file my_script.whizzml \
                --creation-defaults defaults.json

For instance, if my_script.whizzml creates an ensemble from a remote file:

(define file "s3://bigml-public/csv/iris.csv")
(define source (create-and-wait-source {"remote" file}))
(define dataset (create-and-wait-dataset {"source" source}))
(define ensemble (create-and-wait-ensemble {"dataset" dataset}))

and defaults.json contains

{
    "source": {
    "project": "project/54d9553bf0a5ea5fc0000016"
    },
    "ensemble": {
    "number_of_models": 100, "sample_rate": 0.9
    }
}

the source created by the script will be associated to the given project and the ensemble will have 100 models and a 0.9 sample rate unless the source code in your script explicitly specifies a different value, in which case it takes precedence over these defaults.
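The precedence rule can be pictured as a dictionary merge in which the arguments written in the script override the defaults for the same resource type. The sketch below only models that rule. The effective_arguments helper is hypothetical and is not how BigML implements it internally.

```python
def effective_arguments(resource_type, explicit_args, creation_defaults):
    """Combine creation defaults with the arguments given in the script.

    Defaults for `resource_type` are applied first, so any key that also
    appears in `explicit_args` keeps the explicit value.
    """
    merged = dict(creation_defaults.get(resource_type, {}))
    merged.update(explicit_args)
    return merged


creation_defaults = {
    "source": {"project": "project/54d9553bf0a5ea5fc0000016"},
    "ensemble": {"number_of_models": 100, "sample_rate": 0.9},
}

# The script sets sample_rate itself, so the 0.9 default is ignored
ensemble_args = effective_arguments(
    "ensemble", {"dataset": "dataset1", "sample_rate": 0.5},
    creation_defaults)
```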

Execute Subcommand Options

--code SOURCE_CODE

WhizzML source code to be executed

--code-file PATH

Path to the file that contains WhizzML source code

--creation-defaults RESOURCE_DEFAULTS

Path to the JSON file with the default configurations for created resources. Please, see details in the API Developers documentation

--declare-inputs INPUTS_DECLARATION

Path to the JSON file with the description of the input parameters. Please, see details in the API Developers documentation

--declare-outputs OUTPUTS_DECLARATION

Path to the JSON file with the description of the script outputs. Please, see details in the API Developers documentation

--embedded-libraries PATH

Path to a file that contains the location of the files to be embedded in the script as libraries

--execution EXECUTION_ID

BigML execution ID

--execution-file EXECUTION_FILE

BigML execution JSON structure file

--execution-tag EXECUTION_TAG

Select executions tagged with EXECUTION_TAG

--executions EXECUTIONS

Path to a file containing execution/ids. Just one execution per line (e.g., execution/50a20697035d0706da0004a4)

--imports LIBRARIES

Comma-separated list of libraries IDs to be included as imports in scripts or other libraries

--input-maps INPUT_MAPS

Path to the JSON file with the description of the execution inputs for a list of scripts

--inputs INPUTS

Path to the JSON file with the description of the execution inputs. Please, see details in the API Developers documentation

--libraries LIBRARIES

Path to a file containing libraries/ids. Just one library per line (e.g., library/50a20697035d0706da0004a4)

--library LIBRARY

BigML library Id.

--library-file LIBRARY_FILE

BigML library JSON structure file.

--library-tag LIBRARY_TAG

Select libraries tagged with tag to be deleted

--outputs OUTPUTS

Path to the JSON file with the names of the output parameters. Please, see details in the API Developers documentation

--script SCRIPT

BigML script Id.

--script-file SCRIPT_FILE

BigML script JSON structure file.

--script-tag SCRIPT_TAG

Select script tagged with tag to be deleted

--scripts SCRIPTS

Path to a file containing script/ids. Just one script per line (e.g., script/50a20697035d0706da0004a4).

--to-library

Boolean that causes the code to be compiled and stored as a library

Whizzml subcommand

This subcommand creates packages of scripts and libraries in WhizzML (BigML’s automation language) based on the information provided by a metadata.json file. These operations can also be performed individually using the bigmler execute subcommand, but bigmler whizzml reads the components of the package, and for each component analyzes the corresponding metadata.json file to identify the kind of code (script or library) that it contains and creates the corresponding resource in BigML. The metadata.json is expected to contain the name, kind, description, inputs and outputs needed to create the script. As an example,

{
  "name": "Example of whizzml script",
  "description": "Test example of a whizzml script that adds two numbers",
  "kind": "script",
  "source_code": "code.whizzml",
  "inputs": [
      {
          "name": "a",
          "type": "number",
          "description": "First number"
      },
      {
          "name": "b",
          "type": "number",
          "description": "Second number"
      }
  ],
  "outputs": [
      {
          "name": "addition",
          "type": "number",
          "description": "Sum of the numbers"
      }
  ]
}

describes a script whose code is to be found in the code.whizzml file. The script will have two inputs a and b and one output: addition.

In order to create this script, you can type the following command:

bigmler whizzml --package-dir my_package --output-dir creation_log

and bigmler will:

  • look for the metadata.json file located in the my_package directory.

  • parse the JSON, identify that it defines a script and look for its code in the code.whizzml file

  • create the corresponding BigML script resource, adding as arguments the ones provided in inputs, outputs, name and description.
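The steps above can be sketched locally in Python. This is an illustration, not BigMLer's code: read_component is a hypothetical helper, the package is created in a temporary directory, and the final step (creating the resource in BigML) is left out.

```python
import json
import os
import tempfile


def read_component(package_dir):
    """Return (kind, creation arguments, path of the source code file)."""
    with open(os.path.join(package_dir, "metadata.json")) as handler:
        metadata = json.load(handler)
    kind = metadata["kind"]
    code_path = os.path.join(package_dir, metadata["source_code"])
    # Only the attributes mentioned above are kept as arguments
    keys = ("name", "description", "inputs", "outputs")
    arguments = {key: metadata[key] for key in keys if key in metadata}
    return kind, arguments, code_path


# Build a tiny throwaway package to try the function on
my_package = tempfile.mkdtemp()
with open(os.path.join(my_package, "metadata.json"), "w") as handler:
    json.dump({"name": "Example of whizzml script",
               "kind": "script",
               "source_code": "code.whizzml",
               "inputs": [{"name": "a", "type": "number"},
                          {"name": "b", "type": "number"}],
               "outputs": [{"name": "addition", "type": "number"}]},
              handler)
with open(os.path.join(my_package, "code.whizzml"), "w") as handler:
    handler.write("(define addition (+ a b))")

kind, arguments, code_path = read_component(my_package)
```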

Packages can contain more than one script. In this case, a nested directory structure is expected. The metadata.json file for a package with many components should include the name of the directories where these components can be found:

{
  "name": "Best k",
  "description": "Library and scripts implementing Pham-Dimov-Nguyen k selection algorithm",
  "kind": "package",
  "components":[
    "best-k-means",
    "cluster",
    "evaluation",
    "batchcentroid"
  ]
}

In this example, each string in the components attribute list corresponds to one directory where a new script or library (with its corresponding metadata.json descriptor) is stored. Then, using bigmler whizzml for this composite package will create each of the component scripts or libraries. It will also handle dependencies, using the IDs of the created libraries as imports for the scripts when needed. The metadata.json that corresponds to a library is simpler than the one used for a script: kind is set to library and no inputs or outputs are provided.

{
  "name": "Best K-Means",
  "description": "Best K-Means Clustering using the Pham, Dimov, and Nguyen Algorithm",
  "kind": "library",
  "source_code": "library.whizzml"
}

To include a library in the list of imports of a script, the imports attribute is used in the script’s metadata.json. The imports should be the list of folders that contain each library source code and metadata.

{
  "name": "Compute Best K-means Batchcentroid",
  "description": "Basic script to use the best-kmeans library",
  "kind": "script",
  "source_code": "script.whizzml",
  "imports": ["../best-k-means"],
  "inputs": [
    {
      "name": "dataset",
      "type": "dataset-id",
      "description": "Dataset ID"
    },
    {
      "name": "cluster-args",
      "type": "map",
      "description": "Map of args for clustering (excluding dataset and k) for k search",
      "default": {}
    },
    {
      "name": "k-min",
      "type": "number",
      "description": "Minimum value of k for search"
    },
    {
      "name": "k-max",
      "type": "number",
      "description": "Maximum value of k for search"
    },
    {
      "name": "bestcluster-args",
      "type": "map",
      "description": "Map of args for clustering (excluding dataset and k) for optimal k",
      "default": {}
    },
    {
      "name": "clean",
      "type": "boolean",
      "description": "Delete intermediate objects created during computation"
    },
    {
      "name": "logf",
      "type": "boolean",
      "description": "Generate log entries"
    }
  ],
  "outputs": [
    {
      "name": "best-batchcentroid",
      "type": "string",
      "description": "Batchcentroid ID"
    }
  ]
}
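Because a script can only import libraries that already exist, the components of a package have to be created in dependency order. The sketch below shows one way to compute such an order with the standard library's graphlib module (Python 3.9 or later). It rests on two assumptions: that each imports path is relative to the component's own directory, as "../best-k-means" suggests, and that the script above lives in the batchcentroid directory. The creation_order helper is hypothetical and does not describe BigMLer's internals.

```python
import posixpath
from graphlib import TopologicalSorter


def creation_order(components):
    """Order component directories so imported libraries come first.

    `components` maps each directory name to the `imports` list found in
    its metadata.json (an empty list when nothing is imported).
    """
    graph = {}
    for directory, imports in components.items():
        # Resolve "../best-k-means" against the component's directory
        graph[directory] = {
            posixpath.normpath(posixpath.join(directory, path))
            for path in imports}
    return list(TopologicalSorter(graph).static_order())


order = creation_order({
    "best-k-means": [],
    "batchcentroid": ["../best-k-means"],
})
```

Here the best-k-means library is placed before the script that imports it, so its ID would be available when the script is created.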

Also, existing scripts or libraries can be downloaded and stored in the user’s file system following the structure and conventions needed to upload them again to BigML. To do so, use the --from option followed by the script or library ID, and the --package-dir option pointing to the storage folder.

bigmler whizzml --package-dir package_bck \
                --from script/5a3ae0f14006833a070003a4

If the script is self-contained, the previous command will create a package_bck folder where the corresponding metadata.json file will store all the attributes, like the name and description of the script, its inputs and outputs, the kind of resource (script or library) and a source_code attribute that will contain the name of the file where the source code will be placed.

Other complex scripts (and libraries) may not be self-contained and will be importing functions defined in WhizzML libraries. In that case, the package_bck folder will contain a list of subdirectories, one per script or imported library. Each subdirectory will contain the information about either the script or the library as described in the previous paragraph. The --package-dir containing folder will in this case also contain a metadata.json where the list of subfolders is stored in its components attribute so that each of them can be generated and imported correctly. It also contains the name and description of the downloaded script and the kind attribute will be set to package.

Whizzml Subcommand Options

--package-dir DIR

Directory that stores the package files

--embed-libs

It causes the subcommand to embed the libraries code in the package scripts instead of creating libraries and importing them

--from SCRIPT_ID

ID of the script that is used as source for the package files

Retrain subcommand

This subcommand can be used to retrain an existing modeling resource (model, ensemble, deepnet, etc.) by adding new data to it. In BigML, resources are immutable to ensure traceability, but at the same time they are reproducible. Therefore, any model can be rebuilt using the data stored in a new consolidated dataset or even from a list of existing datasets. That’s retraining the model and the bigmler retrain subcommand provides a simple way to do it.

In the basic use case, different parameters and model types are tried and evaluated till the best performing model is found. Then you can call:

bigmler retrain --id model/5a3ae0f14006833a070003a4 --add data/iris.csv \
                --output-dir retrain_directory

so that the data in your local data/iris.csv file is uploaded to the platform and all the steps that led to your existing model are reproduced to create a new merged dataset that will be used to retrain your model. The command output will contain the URL that you need to call to ensure you always use the latest version of your model. The URL will look like:


https://bigml.io/andromeda/model?username=my_user;api_key=my_api_key;limit=1;full=yes;tags=retrain:model/5a3ae0f14006833a070003a4

Instead of using the original model ID, you can choose to add a unique tag to your modeling resource and use that as reference:

bigmler retrain --ensemble-tag my_ensemble --add data/iris.csv \
                --output-dir retrain_directory

In this case, the resource to retrain is an ensemble that has been previously tagged as my_ensemble. The bigmler retrain command will look for the newest ensemble that contains that tag and, after uploading your data and consolidating it with the data previously used in the ensemble, will rebuild it. The URL that retrieves the latest version of the ensemble will also use this tag as reference:


https://bigml.io/andromeda/ensemble?username=my_user;api_key=my_api_key;limit=1;full=yes;tags=my_ensemble
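Both URLs follow the same pattern, so a client that needs the latest version of a resource can build them from its parts. This is a minimal sketch that simply reproduces the format shown above. The latest_resource_url helper is hypothetical, and my_user and my_api_key are placeholders for your own credentials.

```python
def latest_resource_url(resource_type, username, api_key, tag,
                        base="https://bigml.io/andromeda"):
    """Build the URL that returns the newest resource carrying `tag`."""
    query = ";".join([
        f"username={username}",
        f"api_key={api_key}",
        "limit=1",      # only the most recent resource
        "full=yes",     # return the complete resource
        f"tags={tag}",
    ])
    return f"{base}/{resource_type}?{query}"


# Reference by original model ID (the tag has the retrain: prefix)
model_url = latest_resource_url(
    "model", "my_user", "my_api_key",
    "retrain:model/5a3ae0f14006833a070003a4")
# Reference by a user-given tag
ensemble_url = latest_resource_url(
    "ensemble", "my_user", "my_api_key", "my_ensemble")
```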

In a different scenario, you might want to retrain your model from a list of datasets, for instance training an anomaly detector using the data of the last 6 months. This means that you don’t want your data to be merged. Rather you would like to use a window over the list of available datasets.

bigmler retrain --ensemble-tag my_ensemble --add data/iris.csv \
                --window-size 6 --output-dir retrain_directory

In this case, adding the --window-size option to your command will cause the dataset created by uploading your new data to be added to the list of datasets as a separate resource. Then the model will be rebuilt using the number of datasets set as --window-size.
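The effect of the window can be pictured with a plain Python list. The sketch assumes that the window keeps the most recently added datasets, which is the natural reading of the "last 6 months" example but is not spelled out here. The datasets_for_retrain helper and the dataset IDs are hypothetical.

```python
def datasets_for_retrain(existing, new_dataset, window_size):
    """Append the new dataset and keep at most `window_size` of them.

    Assumes the oldest datasets are the ones dropped from the window.
    """
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    datasets = list(existing) + [new_dataset]
    return datasets[-window_size:]


# Six hypothetical monthly datasets, oldest first
monthly = [f"dataset/month{i}" for i in range(1, 7)]
# Adding a seventh month with a window of 6 drops the oldest one
window = datasets_for_retrain(monthly, "dataset/month7", window_size=6)
```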

The operations run by bigmler retrain are mainly run in BigML’s servers using WhizzML scripts. These scripts are created in the user’s account the first time you run the command, but they can also be recreated by using the --upgrade flag in any bigmler retrain command call.

Retrain Subcommand Options

--id RESOURCE_ID

ID for the resource to be retrained

--window-size SIZE

Maximum number of datasets to be used

--add PATH

Path to the file that contains the data to be added

--upgrade

Causes the scripts that rebuild the models to be recreated

--model-tag TAG

Retrieves models that were tagged with tag

--ensemble-tag TAG

Retrieves ensembles that were tagged with tag

--cluster-tag TAG

Retrieves clusters that were tagged with tag

--anomaly-tag TAG

Retrieves anomalies that were tagged with tag

--logistic-regression-tag TAG

Retrieves logistic regressions that were tagged with tag

--linear-regression-tag TAG

Retrieves linear regressions that were tagged with tag

--topic-model-tag TAG

Retrieves topic models that were tagged with tag

--time-series-tag TAG

Retrieves time series that were tagged with tag

--association-tag TAG

Retrieves associations that were tagged with tag

--deepnet-tag TAG

Retrieves deepnets that were tagged with tag