Dataset subcommand

In addition to the main BigMLer capabilities explained so far, there is a bigmler dataset subcommand that creates datasets, either from data files and sources or by transforming existing datasets. For example,

bigmler dataset --file iris.csv \
                --output-dir my_directory

will create a source and a dataset by uploading the iris.csv file to BigML.

You can also create datasets by applying transformations to one or several existing datasets.

To merge datasets, use the --merge option:

bigmler dataset --datasets my_datasets/dataset \
                --merge \
                --output-dir my_directory

The file my_datasets/dataset should contain dataset IDs, one per line. The datasets to be merged are expected to share the same field structure. Their rows are simply appended into a single resulting dataset, whose ID is stored in the my_directory/dataset_multi file.
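As a minimal sketch, the IDs file expected by --datasets can be created from the shell. The two dataset IDs below are placeholders reused from the examples in this section; replace them with IDs from your own account.

```shell
# Placeholder dataset IDs (assumption): substitute your own.
mkdir -p my_datasets
printf '%s\n' \
    "dataset/5357eb2637203f1668000004" \
    "dataset/5357eb2637203f1668000007" > my_datasets/dataset

# One ID per line, as the --datasets option expects.
cat my_datasets/dataset
```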

Datasets can also be juxtaposed, that is, placed side by side so that their fields are combined:

bigmler dataset --datasets my_datasets/dataset \
                --juxtapose \
                --output-dir my_directory

In this case, the generated dataset ID will be stored in the my_directory/dataset_gen file. Each row of the new dataset will contain all the fields of the datasets found in my_datasets/dataset.
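The difference between merging and juxtaposing can be pictured with a purely local analogy. The sketch below uses small invented CSV files and standard Unix tools in place of BigML datasets; it only illustrates the shape of the result and is not how BigMLer performs the operations.

```shell
# Local analogy only: small CSV files stand in for BigML datasets.
demo_dir=$(mktemp -d)
printf 'a,b\n1,2\n3,4\n' > "$demo_dir/first.csv"
printf 'a,b\n5,6\n' > "$demo_dir/second.csv"
printf 'c\n7\n8\n' > "$demo_dir/extra.csv"

# Merge: same fields, rows of the second file appended to the first
# (the repeated header line is skipped).
{ cat "$demo_dir/first.csv"; tail -n +2 "$demo_dir/second.csv"; } > "$demo_dir/merged.csv"

# Juxtapose: rows are paired up and their fields placed side by side.
paste -d, "$demo_dir/first.csv" "$demo_dir/extra.csv" > "$demo_dir/juxtaposed.csv"

cat "$demo_dir/merged.csv"
cat "$demo_dir/juxtaposed.csv"
```

The merged file keeps the two columns and gains a row, while the juxtaposed file keeps the row count and gains a column.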

If you need to join datasets, you can do so by using an SQL expression like:

bigmler dataset --datasets-json "[{\"id\": \"dataset/5357eb2637203f1668000004\"}, {\"id\": \"dataset/5357eb2637203f1668000007\"}]" \
                --sql-query "select A.*,B.* from A join B on A.\`000000\` = B.\`000000\`" \
                --output-dir my_directory

The --datasets-json option should contain a JSON string that describes the datasets to be used in the SQL query. Letters from A to Z are used to refer to these datasets in the SQL expression: the first dataset in the list is represented by A, the second by B, and so on.
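Escaping double quotes and backticks by hand is error-prone. One possible alternative, sketched here, is to keep the JSON and the SQL in single-quoted shell variables, where both characters are literal. The variable names and the join_datasets wrapper are hypothetical helpers, not part of BigMLer, and the dataset IDs are placeholders.

```shell
# Single quotes keep the double quotes and backticks literal, so no
# backslash escapes are needed. The IDs are placeholders.
DATASETS_JSON='[{"id": "dataset/5357eb2637203f1668000004"}, {"id": "dataset/5357eb2637203f1668000007"}]'

# A is the first entry of the JSON list, B the second.
SQL_QUERY='select A.*, B.* from A join B on A.`000000` = B.`000000`'

# Hypothetical wrapper (not part of BigMLer): defined here, called when ready.
join_datasets() {
    bigmler dataset --datasets-json "$DATASETS_JSON" \
                    --sql-query "$SQL_QUERY" \
                    --output-dir my_directory
}
```

Calling join_datasets then runs the join with the two variables passed as single arguments.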

Similarly, an SQL expression can be used to generate an aggregation:

bigmler dataset --dataset dataset/5357eb2637203f1668000004 \
                --sql-query "select A.\`species\`, avg(\`petal length\`) as apl from A group by A.\`species\`" \
                --output-dir my_directory

or to pivot the data:

bigmler dataset --dataset dataset/5357eb2637203f1668000004 \
                --sql-query "select cat_avg(\`petal length\`, \`species\`, 'Iris-setosa') from A group by A.\`petal width\`" \
                --output-dir my_directory

This computes, for each petal width value, the average of the petal length field over the rows whose species field contains the Iris-setosa category.
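To make the arithmetic concrete, the following sketch reproduces a category-conditional average locally with awk. The rows are invented toy values, and this is not BigML's implementation of cat_avg. It only prints groups that have at least one matching row, because the text above does not state what cat_avg returns for groups without matches.

```shell
# Toy rows (invented for illustration): petal length, petal width, species.
iris_sample=$(mktemp)
printf '%s\n' \
    'petal length,petal width,species' \
    '1.4,0.2,Iris-setosa' \
    '1.6,0.2,Iris-setosa' \
    '1.5,0.4,Iris-setosa' \
    '4.5,1.5,Iris-versicolor' > "$iris_sample"

# Group by petal width, averaging petal length over Iris-setosa rows only.
result=$(awk -F, 'NR > 1 && $3 == "Iris-setosa" { sum[$2] += $1; n[$2]++ }
                  END { for (w in n) printf "%s,%.2f\n", w, sum[w] / n[w] }' \
             "$iris_sample" | sort)

printf '%s\n' "$result"
```

With these rows, the group with petal width 0.2 averages 1.4 and 1.6, and the Iris-versicolor row does not contribute to any average.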

Dataset subcommand Options

--file

Path to the data file

--merge

Causes the datasets in the command to be merged

--juxtapose

Causes the rows in the datasets referenced in the command to be juxtaposed

--sql-query QUERY

SQL expression describing the transformation

--json-query PATH

Path to a JSON file that contains the SQL query describing the transformation

--sql-output-fields PATH

Path to a JSON file describing the types and properties of the fields created as output of the SQL transformation defined with --sql-query or --json-query