.. toctree:: :maxdepth: 2 :hidden: .. _bigmler-dataset: Dataset subcommand ================== In addition to the main BigMLer capabilities explained so far, there's a subcommand ``bigmler dataset`` that can be used to create datasets either from data files and sources or by transforming datasets. .. code-block:: bash bigmler dataset --file iris.csv \ --output-dir my_directory will create a source and a dataset by uploading the ``iris.csv`` file to BigML. You can also create datasets by applying many transformations to one or several existing datasets. To merge datasets, you can use the ``--merge`` option .. code-block:: bash bigmler dataset --datasets my_datasets/dataset \ --merge \ --output-dir my_directory The file ``my_datasets/dataset`` should contain dataset IDs, one per line. The datasets to be merged are expected to share the same fields structure and their rows will be just added in a single resulting dataset, whose ID will be stored in a ``my_directory/dataset_multi`` file. Datasets can also be juxtaposed. .. code-block:: bash bigmler dataset --datasets my_datasets/dataset \ --juxtapose \ --output-dir my_directory In this case, the generated dataset ID will be stored in the ``my_directory/dataset_gen`` file. Each row of the new dataset will contain all the fields of the datasets found in ``my_datasets/dataset``. If you need to join datasets, you can do so by using an SQL expression like: .. code-block:: bash bigmler dataset --datasets-json "[{\"id\": \"dataset/5357eb2637203f1668000004\", \"id\": \"dataset/5357eb2637203f1668000007\"}]" \ --sql-query "select A.*,B.* from A join B on A.\`000000\` = \`B.000000\`" \ --output-dir my_directory the ``--datasets-json`` option should contain a JSON string that describes the datasets to be used in the SQL query. Letters from ``A`` to ``Z`` are used to refer to these datasets in the SQL expression. First dataset in the list is represented by ``A``, the second by ``B``, etc. Similarly, the SQL expression can be used to generate an aggregation. .. code-block:: bash bigmler dataset --dataset dataset/5357eb2637203f1668000004 \ --sql-query "select A.\`species\`, avg(\`petal length\`) as apl from A group by A.\`species\`" \ --output-dir my_directory or to use for pivoting .. code-block:: bash bigmler dataset --dataset dataset/5357eb2637203f1668000004 \ --sql-query "select cat_avg(\`petal length\`, \`species\`, 'Iris-setosa') from A group by A.\`petal width\`" \ --output-dir my_directory that will create the average of the ``petal length`` field value for the rows whose ``species`` field contains the ``Iris-setosa`` category. Dataset subcommand Options ^^^^^^^^^^^^^^^^^^^^^^^^^^ ===================================== ========================================= ``--file`` Path to the data file ``--merge`` Causes the datasets in the command to be merged ``--juxtapose`` Causes the rows in the datasets referenced in the command to be juxtaposed ``--sql-query`` *QUERY* SQL expression describing the transformation ``--json-query`` *PATH* Path to a JSON file that contains the SQL query describing the transformation ``--sql-output-fields`` *PATH* Path to a JSON file describing the fields types and properties created as output of the SQL transformation created with ``--sql-query`` or ``--json-query`` ===================================== =========================================