Dataset subcommand
In addition to the main BigMLer capabilities explained so far, there’s a
subcommand bigmler dataset that can be used to create datasets either
from data files and sources or by transforming datasets.
bigmler dataset --file iris.csv \
--output-dir my_directory
will create a source and a dataset by uploading the iris.csv file to
BigML.
You can also create datasets by applying many transformations to one or several existing datasets.
To merge datasets, you can use the --merge option
bigmler dataset --datasets my_datasets/dataset \
--merge \
--output-dir my_directory
The file my_datasets/dataset should contain dataset IDs, one per line.
The datasets to be merged are expected to share the same fields structure and
their rows will be just added in a single resulting dataset, whose ID will
be stored in a my_directory/dataset_multi file.
Datasets can also be juxtaposed.
bigmler dataset --datasets my_datasets/dataset \
--juxtapose \
--output-dir my_directory
In this case, the generated dataset ID will be stored in the
my_directory/dataset_gen file. Each row of the new dataset
will contain all the fields of the datasets found in my_datasets/dataset.
If you need to join datasets, you can do so by using an SQL expression like:
bigmler dataset --datasets-json "[{\"id\": \"dataset/5357eb2637203f1668000004\", \"id\": \"dataset/5357eb2637203f1668000007\"}]" \
--sql-query "select A.*,B.* from A join B on A.\`000000\` = \`B.000000\`" \
--output-dir my_directory
the --datasets-json option should contain a JSON string that describes the
datasets to be used in the SQL query. Letters from A to Z are used
to refer to these datasets in the SQL expression. First dataset in the list is
represented by A, the second by B, etc.
Similarly, the SQL expression can be used to generate an aggregation.
bigmler dataset --dataset dataset/5357eb2637203f1668000004 \
--sql-query "select A.\`species\`, avg(\`petal length\`) as apl from A group by A.\`species\`" \
--output-dir my_directory
or to use for pivoting
bigmler dataset --dataset dataset/5357eb2637203f1668000004 \
--sql-query "select cat_avg(\`petal length\`, \`species\`, 'Iris-setosa') from A group by A.\`petal width\`" \
--output-dir my_directory
that will create the average of the petal length field value for the rows
whose species field contains the Iris-setosa category.
Dataset subcommand Options
|
Path to the data file |
|
Causes the datasets in the command to be merged |
|
Causes the rows in the datasets referenced in the command to be juxtaposed |
|
SQL expression describing the transformation |
|
Path to a JSON file that contains the SQL query describing the transformation |
|
Path to a JSON file describing the fields
types and properties created as output
of the SQL transformation created with
|