.. toctree::
   :maxdepth: 2
   :hidden:

.. _bigmler-reify:

Reify subcommand
================

This subcommand extracts the information in the existing resources to determine
the arguments that were used when they were created,
and generates scripts that could be used to reproduce them. By default,
the scripts are generated in ``Python``.

The usual starting point for BigML resources is a ``source`` created from
inline, local or remote data. Thus, the script keeps analyzing the chain of
calls that led to a certain resource until the root ``source`` is found.

The simplest example would be:

.. code-block:: bash

    bigmler reify --id source/55d77ba60d052e23430027bb

which will output:

.. code-block:: python

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    """Python code to reify source/5bd431db3980b574bb0145bf

    Generated by BigMLer
    """


    def main():

        from bigml.api import BigML
        api = BigML()

        source_url1 = "https://static.bigml.com/csv/iris.csv"
        source1 = api.create_source(source_url1)
        api.ok(source1)

        args = \
            {'fields': {'000000': {'name': 'sepal length', 'optype': 'numeric'},
                        '000001': {'name': 'sepal width', 'optype': 'numeric'},
                        '000002': {'name': 'petal length', 'optype': 'numeric'},
                        '000003': {'name': 'petal width', 'optype': 'numeric'},
                        '000004': {'name': 'species',
                                   'optype': 'categorical',
                                   'term_analysis': {'enabled': True}}}}
        source2 = api.update_source(source1, args)
        api.ok(source2)

    if __name__ == "__main__":
        main()

According to this output, the source was created from a remote file
located at ``https://static.bigml.com/csv/iris.csv`` and the type of each
of its fields is described and stored to ensure that it matches the one in
the resource. This script will be stored in the command output directory and
named ``reify.py`` (you can specify a different name and location using the
``--output`` option).

Other resources will have more complex workflows and more user-given
attributes.
Let's see for instance the script to generate an evaluation from a train/test
split of a source that was created using the
``bigmler --train data/iris.csv --evaluate`` command:

.. code-block:: bash

    bigmler reify --id evaluation/55d919850d052e234b000833

.. code-block:: python

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    """Python code to reify evaluation/5be371a02774cb26da00061c

    Generated by BigMLer
    """


    def main():

        from bigml.api import BigML
        api = BigML()

        source1_file = "iris.csv"
        args = \
            {'category': 12,
             'description': 'Created using BigMLer',
             'fields': {'000000': {'name': 'sepal length', 'optype': 'numeric'},
                        '000001': {'name': 'sepal width', 'optype': 'numeric'},
                        '000002': {'name': 'petal length', 'optype': 'numeric'},
                        '000003': {'name': 'petal width', 'optype': 'numeric'},
                        '000004': {'name': 'species',
                                   'optype': 'categorical',
                                   'term_analysis': {'enabled': True}}},
             'tags': ['BigMLer', 'BigMLer_ThuNov0818_001323']}
        source2 = api.create_source(source1_file, args)
        api.ok(source2)

        args = \
            {'category': 12,
             'description': 'Created using BigMLer',
             'objective_field': {'id': '000004'},
             'tags': ['BigMLer', 'BigMLer_ThuNov0818_001323']}
        dataset1 = api.create_dataset(source2, args)
        api.ok(dataset1)

        args = \
            {'category': 12,
             'description': 'Created using BigMLer',
             'sample_rate': 0.8,
             'seed': 'BigML, Machine Learning made easy',
             'split_candidates': 32,
             'tags': ['BigMLer', 'BigMLer_ThuNov0818_001323']}
        model1 = api.create_model(dataset1, args)
        api.ok(model1)

        args = \
            {'category': 12,
             'description': 'Created using BigMLer',
             'fields_map': {'000001': '000001',
                            '000002': '000002',
                            '000003': '000003',
                            '000004': '000004'},
             'operating_kind': 'probability',
             'out_of_bag': True,
             'sample_rate': 0.8,
             'seed': 'BigML, Machine Learning made easy',
             'tags': ['BigMLer', 'BigMLer_ThuNov0818_001323']}
        evaluation1 = api.create_evaluation(model1, dataset1, args)
        api.ok(evaluation1)

    if __name__ == "__main__":
        main()

As you can see, BigMLer has added default ``category``, ``description`` and
``tags`` attributes, has built the model on 80% of the data and used the
``out_of_bag`` attribute in the evaluation so that the remaining part of the
dataset is used as test data.

The ``bigmler reify`` command can also generate other types of output
depending on the choice of the ``--language`` option. The available options
are ``python`` (the default), ``nb`` and ``whizzml``. The ``nb`` option will
generate a Jupyter notebook file.

.. code-block:: json

    {
      "cells": [
        {
          "cell_type": "markdown",
          "metadata": {},
          "source": [
            "Reified resource: evaluation/5be371a02774cb26da00061c"
          ]
        },
        {
          "cell_type": "markdown",
          "metadata": {},
          "source": [
            "Remember to set your credentials in the BIGML_USERNAME and BIGML_API_KEY environment variables."
          ]
        },
        {
          "cell_type": "code",
          "execution_count": null,
          "metadata": {},
          "outputs": [],
          "source": [
            "from bigml.api import BigML\n",
            "api = BigML()"
          ]
        },
        {
          "cell_type": "markdown",
          "metadata": {},
          "source": [
            "Add the inputs for the workflow"
          ]
        },
        {
          "cell_type": "code",
          "execution_count": null,
          "metadata": {},
          "outputs": [],
          "source": [
            "source1_file = \"iris.csv\""
          ]
        },
        ...
      ]
    }

We can also reify any resource and obtain the WhizzML script that would
recreate it using ``--language whizzml``:

.. code-block::

    ;;Step 1
    ;;WhizzML for resource: BigMLer_ThuNov0818_001323
    ;;(5 fields (1 categorical, 4 numeric))
    ;;source/5be371949252734ec7000938
    ;;created by mmartin

    (define source2
      (update-and-wait source1
                       {"fields"
                        {"000000" {"name" "sepal length" "optype" "numeric"}
                         "000001" {"name" "sepal width" "optype" "numeric"}
                         "000002" {"name" "petal length" "optype" "numeric"}
                         "000003" {"name" "petal width" "optype" "numeric"}
                         "000004" {"name" "species"
                                   "optype" "categorical"
                                   "term_analysis" {"enabled" true}}}
                        "category" 12
                        "description" "Created using BigMLer"
                        "tags" ["BigMLer" "BigMLer_ThuNov0818_001323"]}))

    ;;Step 2
    ;;WhizzML for resource: BigMLer_ThuNov0818_001323
    ;;(150 instances, 5 fields (1 categorical, 4 numeric))
    ;;dataset/5be371972774cb26d5000954
    ;;created by mmartin

    (define dataset1
      (create-and-wait-dataset
        {"source" source2
         "description" "Created using BigMLer"
         "category" 12
         "tags" ["BigMLer" "BigMLer_ThuNov0818_001323"]
         "objective_field" {"id" "000004"}}))

    ;;Step 3
    ;;WhizzML for resource: BigMLer_ThuNov0818_001323
    ;;(512-node, pruned, deterministic order, sample rate=0.8)
    ;;model/5be3719a2774cb26d60020fa
    ;;created by mmartin

    (define model1
      (create-and-wait-model
        {"dataset" dataset1
         "description" "Created using BigMLer"
         "category" 12
         "tags" ["BigMLer" "BigMLer_ThuNov0818_001323"]
         "sample_rate" 0.8
         "seed" "BigML, Machine Learning made easy"
         "split_candidates" 32}))

    ;;Step 4
    ;;WhizzML for resource: BigMLer_ThuNov0818_001323
    ;;(512-node, pruned, deterministic order, sample rate=0.8, operating kind=probability, sample rate=0.2, out of bag)
    ;;evaluation/5be371a02774cb26da00061c
    ;;created by mmartin

    (define evaluation1
      (create-and-wait-evaluation
        {"description" "Created using BigMLer"
         "category" 12
         "tags" ["BigMLer" "BigMLer_ThuNov0818_001323"]
         "fields_map" {"000001" "000001"
                       "000002" "000002"
                       "000003" "000003"
                       "000004" "000004"}
         "sample_rate" 0.8
         "seed" "BigML, Machine Learning made easy"
         "operating_kind" "probability"
         "out_of_bag" true
         "dataset" dataset1
         "model" model1}))

    (define output-evaluation evaluation1)

Reify Subcommand Options
^^^^^^^^^^^^^^^^^^^^^^^^

===================================== =========================================
``--id`` *RESOURCE_ID*                ID for the resource to be reified
``--language`` *SCRIPTING_LANG*       Language to be used for the script.
                                      Available options are ``python``
                                      (default), ``nb`` and ``whizzml``
``--output`` *PATH*                   Path to the file where the script will
                                      be stored
``--add-fields``                      Causes the fields information to be
                                      added to the source arguments
===================================== =========================================

.. _bigmler-execute:

Execute subcommand
==================

This subcommand creates and executes scripts in WhizzML (BigML's automation
language). With WhizzML you can program any specific workflow that involves
Machine Learning resources like datasets, models, etc. You just write a
script using the directives in the reference manual
and upload it to BigML, where it will be available as one more resource in
your dashboard. Scripts can also be shared and published in the gallery,
so you can reuse other users' scripts and execute them. These operations
can also be done using the ``bigmler execute`` subcommand.

The simplest example is executing some basic code, like adding two numbers:

.. code-block:: bash

    bigmler execute --code "(+ 1 2)" --output-dir simple_exe

With this command, bigmler will generate a script in BigML whose source code
is the one given as a string in the ``--code`` option. The script ID will
be stored in a file called ``scripts`` in the ``simple_exe`` directory. After
that, the script will be executed, so a new resource called ``execution``
will be created in BigML, and the corresponding ID will be stored in the
``execution`` file of the output directory.
Similarly, the result of the execution will be stored in
``whizzml_results.txt`` and ``whizzml_results.json``
(in human-readable format and JSON respectively) in the
directory set in the ``--output-dir`` option.
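
The ID files written to the output directory can be read back from Python.
The following is a minimal sketch, assuming the files hold one resource ID
per line (the layout described for the ``--scripts`` option); the
``read_resource_ids`` helper is a hypothetical name and is not part of BigMLer.

.. code-block:: python

    from pathlib import Path


    def read_resource_ids(path):
        """Return the non-empty lines of an ID file (assumed one ID per line)."""
        lines = Path(path).read_text().splitlines()
        return [line.strip() for line in lines if line.strip()]

For the example above, ``read_resource_ids("simple_exe/scripts")`` would
return a list with the ID of the script that was created.
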
You can also use the code stored in a file with the ``--code-file`` option.
Adding the ``--no-execute`` flag to the command will cause the process to
stop right after the script creation. You can also compile your code as a
library to be used in many scripts by setting the ``--to-library`` flag.

.. code-block:: bash

    bigmler execute --code-file my_library.whizzml --to-library

Existing scripts can be referenced for execution with the ``--script`` option:

.. code-block:: bash

    bigmler execute --script script/50a2bb64035d0706db000643

or the script ID can be read from a file:

.. code-block:: bash

    bigmler execute --scripts simple_exe/scripts

The script we used as an example is very simple and needs no additional
parameter. But, in general, scripts will have input parameters and output
variables. The inputs define the script signature and must be declared in
order to create the script. The outputs are optional and any variable in the
script can be declared to be an output. Both inputs and outputs can be
declared using the ``--declare-inputs`` and ``--declare-outputs`` options.
These options must contain the path to the JSON file where the information
about the inputs and outputs (respectively) is stored.

.. code-block:: bash

    bigmler execute --code '(define addition (+ a b))' \
                    --declare-inputs my_inputs_dec.json \
                    --declare-outputs my_outputs_dec.json \
                    --no-execute

in this example, the ``my_inputs_dec.json`` file could contain

.. code-block:: json

    [{"name": "a",
      "default": 0,
      "type": "number"},
     {"name": "b",
      "default": 0,
      "type": "number",
      "description": "second number to add"}]

and ``my_outputs_dec.json``

.. code-block:: json

    [{"name": "addition", "type": "number"}]

so that the value of the ``addition`` variable would be returned as
output in the execution results. Additionally, a script can import libraries.
The list of libraries to be used as imports can be added to the command
with the option ``--imports`` followed by a comma-separated list of library
IDs.
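
The declaration files shown above can also be generated programmatically.
The sketch below writes both JSON files with the contents of the example;
the ``write_declarations`` helper is a hypothetical name used only for
illustration.

.. code-block:: python

    import json
    from pathlib import Path


    def write_declarations(directory):
        """Write the example input and output declaration files."""
        inputs = [
            {"name": "a", "default": 0, "type": "number"},
            {"name": "b", "default": 0, "type": "number",
             "description": "second number to add"},
        ]
        outputs = [{"name": "addition", "type": "number"}]
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        inputs_path = directory / "my_inputs_dec.json"
        outputs_path = directory / "my_outputs_dec.json"
        inputs_path.write_text(json.dumps(inputs, indent=2))
        outputs_path.write_text(json.dumps(outputs, indent=2))
        return inputs_path, outputs_path

The returned paths can then be passed to ``--declare-inputs`` and
``--declare-outputs`` respectively.
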
Once the script has been created and its inputs and outputs declared,
to execute it you'll need to provide a value for each input. This can be
done using ``--inputs``, which also points to a JSON file where
each input should have its corresponding value.

.. code-block:: bash

    bigmler execute --script script/50a2bb64035d0706db000643 \
                    --inputs my_inputs.json

where the ``my_inputs.json`` file would contain:

.. code-block:: json

    [["a", 1], ["b", 2]]

For more details about the syntax to declare inputs and outputs, please
refer to the Developers documentation.

You can also provide default configuration attributes
for the resources generated in an execution. Add the
``--creation-defaults`` option followed by the path
to a JSON file that contains a dictionary whose keys are the resource types
to which the configuration defaults apply and whose values are the
configuration attributes set by default.

.. code-block:: bash

    bigmler execute --code-file my_script.whizzml \
                    --creation-defaults defaults.json

For instance, if ``my_script.whizzml`` creates an ensemble from a remote
file:

.. code-block:: bash

    (define file "s3://bigml-public/csv/iris.csv")
    (define source (create-and-wait-source {"remote" file}))
    (define dataset (create-and-wait-dataset {"source" source}))
    (define ensemble (create-and-wait-ensemble {"dataset" dataset}))

and ``defaults.json`` contains

.. code-block:: json

    {
        "source": {
            "project": "project/54d9553bf0a5ea5fc0000016"
        },
        "ensemble": {
            "number_of_models": 100, "sample_rate": 0.9
        }
    }

the source created by the script will be associated to the given project
and the ensemble will have 100 models and a 0.9 sample rate unless the source
code in your script explicitly specifies a different value, in which case
it takes precedence over these defaults.

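
The precedence rule can be pictured as a dictionary merge in which explicit
arguments win. The following sketch only illustrates that rule; it is not
how BigML applies the defaults internally, and ``effective_arguments`` is a
hypothetical helper.

.. code-block:: python

    def effective_arguments(resource_type, explicit_args, creation_defaults):
        """Combine the defaults for a resource type with explicit arguments.

        Values given explicitly in the script override the defaults.
        """
        merged = dict(creation_defaults.get(resource_type, {}))
        merged.update(explicit_args)
        return merged


    creation_defaults = {
        "source": {"project": "project/54d9553bf0a5ea5fc0000016"},
        "ensemble": {"number_of_models": 100, "sample_rate": 0.9},
    }

    # The script sets its own sample rate, so only number_of_models
    # is taken from the defaults.
    print(effective_arguments("ensemble", {"sample_rate": 0.5},
                              creation_defaults))
    # → {'number_of_models': 100, 'sample_rate': 0.5}

Resource types with no entry in the defaults file, such as the dataset in
the example script, are simply created with the arguments given in the code.
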

Execute Subcommand Options
^^^^^^^^^^^^^^^^^^^^^^^^^^

=============================================== ===============================
``--code`` *SOURCE_CODE*                        WhizzML source code to be
                                                executed
``--code-file`` *PATH*                          Path to the file that contains
                                                WhizzML source code
``--creation-defaults`` *RESOURCE_DEFAULTS*     Path to the JSON file with the
                                                default configurations for
                                                created resources. Please, see
                                                details in the API Developers
                                                documentation
``--declare-inputs`` *INPUTS_DECLARATION*       Path to the JSON file with the
                                                description of the input
                                                parameters. Please, see
                                                details in the API Developers
                                                documentation
``--declare-outputs`` *OUTPUTS_DECLARATION*     Path to the JSON file with the
                                                description of the script
                                                outputs. Please, see
                                                details in the API Developers
                                                documentation
``--embedded-libraries`` *PATH*                 Path to a file that contains
                                                the location of the files to
                                                be embedded in the script as
                                                libraries
``--execution`` *EXECUTION_ID*                  BigML execution ID
``--execution-file`` *EXECUTION_FILE*           BigML execution JSON structure
                                                file
``--execution-tag`` *EXECUTION_TAG*             Select executions tagged with
                                                EXECUTION_TAG
``--executions`` *EXECUTIONS*                   Path to a file containing
                                                execution/ids. Just one
                                                execution per line (e.g.,
                                                execution/50a20697035d0706da0004a4)
``--imports`` *LIBRARIES*                       Comma-separated list of
                                                library IDs to be included as
                                                imports in scripts or other
                                                libraries
``--input-maps`` *INPUT_MAPS*                   Path to the JSON file with the
                                                description of the execution
                                                inputs for a list of scripts
``--inputs`` *INPUTS*                           Path to the JSON file with the
                                                description of the execution
                                                inputs. Please, see
                                                details in the API Developers
                                                documentation
``--libraries`` *LIBRARIES*                     Path to a file containing
                                                libraries/ids. Just one
                                                library per line (e.g.,
                                                library/50a20697035d0706da0004a4)
``--library`` *LIBRARY*                         BigML library Id.
``--library-file`` *LIBRARY_FILE*               BigML library JSON structure
                                                file.
``--library-tag`` *LIBRARY_TAG*                 Select libraries tagged with
                                                tag to be deleted
``--outputs`` *OUTPUTS*                         Path to the JSON file with the
                                                names of the output
                                                parameters. Please, see
                                                details in the API Developers
                                                documentation
``--script`` *SCRIPT*                           BigML script Id.
``--script-file`` *SCRIPT_FILE*                 BigML script JSON structure
                                                file.
``--script-tag`` *SCRIPT_TAG*                   Select scripts tagged with tag
                                                to be deleted
``--scripts`` *SCRIPTS*                         Path to a file containing
                                                script/ids. Just one script
                                                per line (e.g.,
                                                script/50a20697035d0706da0004a4).
``--to-library``                                Boolean that causes the code
                                                to be compiled and stored as
                                                a library
=============================================== ===============================

.. _bigmler-whizzml:

Whizzml subcommand
==================

This subcommand creates packages of scripts and libraries in WhizzML
(BigML's automation language) based on the information provided by a
``metadata.json`` file. These operations
can also be performed individually using the ``bigmler execute`` subcommand,
but ``bigmler whizzml`` reads the components of the package, and for each
component analyzes the corresponding ``metadata.json`` file to identify
the kind of code (script or library) that it contains and creates the
corresponding resource in BigML. The ``metadata.json`` is expected to contain
the name, kind, description, inputs and outputs needed to create the script.
As an example,

.. code-block:: json

    {
      "name": "Example of whizzml script",
      "description": "Test example of a whizzml script that adds two numbers",
      "kind": "script",
      "source_code": "code.whizzml",
      "inputs": [
          {
              "name": "a",
              "type": "number",
              "description": "First number"
          },
          {
              "name": "b",
              "type": "number",
              "description": "Second number"
          }
      ],
      "outputs": [
          {
              "name": "addition",
              "type": "number",
              "description": "Sum of the numbers"
          }
      ]
    }

describes a script whose code is to be found in the ``code.whizzml`` file.
The script will have two inputs ``a`` and ``b`` and one output: ``addition``.
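
A package directory holding this descriptor can be assembled with a few lines
of Python. This is a minimal sketch: ``write_script_package`` is a
hypothetical helper, and the contents written to ``code.whizzml`` are an
assumption borrowed from the addition example in the ``bigmler execute``
section.

.. code-block:: python

    import json
    from pathlib import Path


    def write_script_package(package_dir):
        """Create a directory with metadata.json and code.whizzml."""
        package_dir = Path(package_dir)
        package_dir.mkdir(parents=True, exist_ok=True)
        metadata = {
            "name": "Example of whizzml script",
            "description": "Test example of a whizzml script that adds two numbers",
            "kind": "script",
            "source_code": "code.whizzml",
            "inputs": [
                {"name": "a", "type": "number", "description": "First number"},
                {"name": "b", "type": "number", "description": "Second number"},
            ],
            "outputs": [
                {"name": "addition", "type": "number",
                 "description": "Sum of the numbers"},
            ],
        }
        # Assumed source code: the addition example used earlier on.
        (package_dir / "code.whizzml").write_text("(define addition (+ a b))\n")
        (package_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
        return package_dir

Calling ``write_script_package("my_package")`` leaves a directory that can be
passed to the ``--package-dir`` option.
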
In order to create this script, you can type the following command:

.. code-block:: bash

    bigmler whizzml --package-dir my_package --output-dir creation_log

and bigmler will:

- look for the ``metadata.json`` file located in the ``my_package``
  directory.
- parse the JSON, identify that it defines a script and look for its code in
  the ``code.whizzml`` file
- create the corresponding BigML script resource, adding as arguments the ones
  provided in ``inputs``, ``outputs``, ``name`` and ``description``.

Packages can contain more than one script. In this case, a nested directory
structure is expected. The ``metadata.json`` file for a package with many
components should include the name of the directories where these components
can be found:

.. code-block:: json

    {
      "name": "Best k",
      "description": "Library and scripts implementing Pham-Dimov-Nguyen k selection algorithm",
      "kind": "package",
      "components":[
        "best-k-means",
        "cluster",
        "evaluation",
        "batchcentroid"
      ]
    }

In this example, each string in the ``components`` attributes list corresponds
to one directory where a new script or library (with its corresponding
``metadata.json`` descriptor) is stored. Then, using ``bigmler whizzml``
for this composite package will create each of the component scripts or
libraries. It will also handle dependencies, using the IDs of the created
libraries as imports for the scripts when needed.

The ``metadata.json`` that corresponds to a library is simpler than the one
used for the script, the difference being that ``kind`` in this case will be
set to ``library`` and no inputs or outputs are provided.

.. code-block:: json

    {
      "name": "Best K-Means",
      "description": "Best K-Means Clustering using the Pham, Dimov, and Nguyen Algorithm",
      "kind": "library",
      "source_code": "library.whizzml"
    }

To include a library in the list of imports of a script, the ``imports``
attribute is used in the script's ``metadata.json``.
The imports should be the list of folders that contain each library source
code and metadata.

.. code-block:: json

    {
      "name": "Compute Best K-means Batchcentroid",
      "description": "Basic script to use the best-kmeans library",
      "kind": "script",
      "source_code": "script.whizzml",
      "imports": ["../best-k-means"],
      "inputs": [
        {
          "name": "dataset",
          "type": "dataset-id",
          "description": "Dataset ID"
        },
        {
          "name": "cluster-args",
          "type": "map",
          "description": "Map of args for clustering (excluding dataset and k) for k search",
          "default": {}
        },
        {
          "name": "k-min",
          "type": "number",
          "description": "Minimum value of k for search"
        },
        {
          "name": "k-max",
          "type": "number",
          "description": "Maximum value of k for search"
        },
        {
          "name": "bestcluster-args",
          "type": "map",
          "description": "Map of args for clustering (excluding dataset and k) for optimal k",
          "default": {}
        },
        {
          "name": "clean",
          "type": "boolean",
          "description": "Delete intermediate objects created during computation"
        },
        {
          "name": "logf",
          "type": "boolean",
          "description": "Generate log entries"
        }
      ],
      "outputs": [
        {
          "name": "best-batchcentroid",
          "type": "string",
          "description": "Batchcentroid ID"
        }
      ]
    }

Also, existing scripts or libraries can be downloaded and stored in the
user's file system following the structure and conventions needed to be
uploaded again to BigML. To do so, use the ``--from`` option followed by
the script or library ID, with ``--package-dir`` pointing to the storage
folder.

.. code-block:: bash

    bigmler whizzml --package-dir package_bck \
                    --from script/5a3ae0f14006833a070003a4

If the script is self-contained, the previous command will create a
``package_bck`` folder where the corresponding ``metadata.json`` file will
store all the attributes, like the name and description of the script, its
inputs and outputs, the kind of resource (script or library) and a
``source_code`` attribute that will contain the name of the file where the
source code will be placed.
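
The directory conventions used by packages can be inspected from Python.
The sketch below walks a package folder and reports the scripts and
libraries it finds, relying only on the ``kind`` and ``components``
attributes of ``metadata.json`` as described in this section;
``list_components`` is a hypothetical helper and is not part of BigMLer.

.. code-block:: python

    import json
    from pathlib import Path


    def list_components(package_dir):
        """Return (directory, kind, name) for each script or library found."""
        package_dir = Path(package_dir)
        metadata = json.loads((package_dir / "metadata.json").read_text())
        if metadata.get("kind") != "package":
            # A single script or library: the directory describes itself.
            return [(str(package_dir), metadata.get("kind"),
                     metadata.get("name"))]
        found = []
        for component in metadata.get("components", []):
            # Each component is a subdirectory with its own metadata.json.
            found.extend(list_components(package_dir / component))
        return found

For a self-contained script the result is a single entry, while a composite
package yields one entry per component directory.
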
Other complex scripts (and libraries) may not be self-contained and will be
importing functions defined in WhizzML libraries. In that case, the
``package_bck`` folder will contain a list of subdirectories, one per
script or imported library. Each subdirectory will contain the information
about either the script or the library as described in the previous
paragraph. The ``--package-dir`` containing folder will in this case also
contain a ``metadata.json`` where the list of subfolders is stored in its
``components`` attribute so that each of them can be generated and imported
correctly. It also contains the name and description of the downloaded
script and the ``kind`` attribute will be set to ``package``.

Whizzml Subcommand Options
^^^^^^^^^^^^^^^^^^^^^^^^^^

=============================================== ===============================
``--package-dir`` *DIR*                         Directory that stores the
                                                package files
``--embed-libs``                                It causes the subcommand to
                                                embed the libraries code in
                                                the package scripts instead of
                                                creating libraries and
                                                importing them
``--from`` *SCRIPT ID*                          ID of the script that is used
                                                as source for the package
                                                files
=============================================== ===============================

.. _bigmler-retrain:

Retrain subcommand
==================

This subcommand can be used to retrain an existing modeling resource (model,
ensemble, deepnet, etc.) by adding new data to it. In BigML, resources are
immutable to ensure traceability, but at the same time they are reproducible.
Therefore, any model can be rebuilt using the data stored in a new
consolidated dataset or even from a list of existing datasets. This is what
retraining a model means, and the ``bigmler retrain`` subcommand provides a
simple way to do it.

In the basic use case, different parameters and model types are tried and
evaluated till the best performing model is found. Then you can call:

.. code-block:: bash

    bigmler retrain --id model/5a3ae0f14006833a070003a4 --add data/iris.csv \
                    --output-dir retrain_directory

so that the data in your local ``data/iris.csv`` file is uploaded to the
platform and all the steps that led to your existing model are
reproduced to create a new merged dataset that will be used to retrain your
model. The command output will contain the URL that you need to call to
ensure you always use the latest version of your model. The URL will look
like:

.. code-block:: bash

    https://bigml.io/andromeda/model?username=my_user;api_key=my_api_key;limit=1;full=yes;tags=retrain:model/5a3ae0f14006833a070003a4

Instead of using the original model ID, you can choose to add a unique
``tag`` to your modeling resource and use that as reference:

.. code-block:: bash

    bigmler retrain --ensemble-tag my_ensemble --add data/iris.csv \
                    --output-dir retrain_directory

In this case, the resource to retrain is an ensemble that has been
previously tagged as ``my_ensemble``. The ``bigmler retrain`` command will
look for the newest ensemble that contains that tag and, after uploading and
consolidating your data with the one previously used in the ensemble, it will
rebuild it. The URL that returns the latest version of the ensemble will
also use this tag as reference:

.. code-block:: bash

    https://bigml.io/andromeda/ensemble?username=my_user;api_key=my_api_key;limit=1;full=yes;tags=my_ensemble

In a different scenario, you might want to retrain your model from a list
of datasets, for instance training an anomaly detector using the data of the
last 6 months. This means that you don't want your data to be merged. Rather
you would like to use a window over the list of available datasets.

.. code-block:: bash

    bigmler retrain --ensemble-tag my_ensemble --add data/iris.csv \
                    --window-size 6 --output-dir retrain_directory

In this case, adding the ``--window-size`` option to your command will cause
the dataset created by uploading your new data to be added to the list of
datasets as a separate resource. Then the model will be rebuilt using the
number of datasets set as ``--window-size``.

The operations run by ``bigmler retrain`` are mainly run on BigML's servers
using WhizzML scripts. These scripts are created in the user's account the
first time you run the command, but they can also be recreated by using the
``--upgrade`` flag in any ``bigmler retrain`` command call.

Retrain Subcommand Options
^^^^^^^^^^^^^^^^^^^^^^^^^^

===================================== =========================================
``--id`` *RESOURCE_ID*                ID for the resource to be retrained
``--window-size`` *SIZE*              Maximum number of datasets to be used
``--add`` *PATH*                      Path to the file that contains the data
                                      to be added
``--upgrade``                         Causes the scripts that generate the
                                      models rebuild to be recreated
``--model-tag`` *TAG*                 Retrieves models that were tagged with
                                      tag
``--ensemble-tag`` *TAG*              Retrieves ensembles that were tagged
                                      with tag
``--cluster-tag`` *TAG*               Retrieves clusters that were tagged with
                                      tag
``--anomaly-tag`` *TAG*               Retrieves anomalies that were tagged
                                      with tag
``--logistic-regression-tag`` *TAG*   Retrieves logistic regressions that were
                                      tagged with tag
``--linear-regression-tag`` *TAG*     Retrieves linear regressions that were
                                      tagged with tag
``--topic-model-tag`` *TAG*           Retrieves topic models that were tagged
                                      with tag
``--time-series-tag`` *TAG*           Retrieves time series that were tagged
                                      with tag
``--association-tag`` *TAG*           Retrieves associations that were tagged
                                      with tag
``--deepnet-tag`` *TAG*               Retrieves deepnets that were tagged with
                                      tag
===================================== =========================================
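
The effect of ``--window-size`` can be pictured with a short sketch. It
assumes that the window keeps the most recently added datasets, which is an
interpretation of the "last 6 months" example rather than a documented rule;
the ``datasets_in_window`` helper and the dataset IDs are hypothetical.

.. code-block:: python

    def datasets_in_window(existing_datasets, new_dataset, window_size):
        """Append the new dataset and keep at most window_size datasets.

        Assumption: the most recently added datasets are the ones kept.
        """
        if window_size < 1:
            raise ValueError("window_size must be a positive integer")
        datasets = list(existing_datasets) + [new_dataset]
        return datasets[-window_size:]


    monthly = ["dataset/jan", "dataset/feb", "dataset/mar"]
    print(datasets_in_window(monthly, "dataset/apr", 2))
    # → ['dataset/mar', 'dataset/apr']

When fewer datasets than the window size are available, all of them are
returned.
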