.. toctree::
   :maxdepth: 2
   :hidden:

.. _bigmler-sample:

Sample subcommand
=================

You can extract samples from your datasets in BigML using the
``bigmler sample`` subcommand. When a new sample is requested, a copy
of the dataset is stored in a special format in an in-memory cache.
This sample can then be used, before its expiration time, to
extract data from the related dataset by setting options such as the
number of rows or the fields to be retrieved.

You can begin from scratch by uploading your data to BigML, creating the
corresponding source and dataset, and extracting your sample from it:

.. code-block:: bash

    bigmler sample --train data/iris.csv --rows 10 --row-offset 20

This command will create a source, a dataset and a sample object, whose id
will be stored in the ``samples`` file in the output directory. It will then
extract 10 rows of data, starting from the 21st, and store them in the
``sample.csv`` file.

You can also reuse an existing sample by using its id in the command:

.. code-block:: bash

    bigmler sample --sample sample/53b1f71437203f5ac303d5c0 \
                   --sample-header --row-order-by="-petal length" \
                   --row-fields "petal length,petal width" --mode linear

This command will create a new ``sample.csv`` file with a header row where
only the ``petal length`` and ``petal width`` fields are retrieved. The
``--mode linear`` option causes the first available rows to be returned, and
the ``--row-order-by="-petal length"`` option returns these rows sorted in
descending order according to the contents of ``petal length``.

You can also add some statistical information to the sample rows by using
the ``--stat-field`` or ``--stat-fields`` options. Adding them to the command
will generate a ``stat-info.json`` file where the Pearson's and Spearman's
correlations and the linear regression terms will be stored in JSON format.

You can also apply a filter to select the sample rows by the values in their
fields using the ``--fields-filter`` option.
This must be set to a string containing the conditions that must be met,
expressed using field ids and values.

.. code-block:: bash

    bigmler sample --sample sample/53b1f71437203f5ac303d5c0 \
                   --fields-filter "000001=&!000004=Iris-setosa"

With this command, only rows where field id ``000001`` is missing and
field id ``000004`` is not ``Iris-setosa`` will be retrieved. You can check
the available operators and syntax in the samples' developers doc.

More available options can be found in the
:ref:`Samples Subcommand Options <sample_options>` section.

.. _sample_options:

Samples Subcommand Options
^^^^^^^^^^^^^^^^^^^^^^^^^^

============================================= =================================
``--sample`` *SAMPLE*                         BigML sample Id
``--samples`` *PATH*                          Path to a file containing
                                              sample/ids. One sample per line
                                              (e.g., sample/4f824203ce80051)
``--no-sample``                               No sample will be generated
``--sample-fields`` *FIELD_NAMES*             Comma-separated list of fields
                                              that will be used in the sample
                                              construction
``--sample-attributes`` *PATH*                Path to a JSON file containing
                                              attributes (any of the updatable
                                              attributes described in the
                                              developers section) to be used
                                              in the sample creation call
``--fields-filter`` *QUERY*                   Query string that will be used as
                                              a filter before selecting the
                                              sample rows. The query string can
                                              be built using the field ids,
                                              their values and the usual
                                              operators. You can see some
                                              examples in the developers
                                              section
``--sample-header``                           Adds a header row to the
                                              ``sample.csv`` output
``--row-index``                               Prepends a column to the sample
                                              rows with the absolute row number
``--occurrence``                              Prepends a column to the sample
                                              rows with the number of
                                              occurrences of each row. When
                                              used with ``--row-index``, the
                                              occurrence column will be placed
                                              after the index column
``--precision``                               Decimal numbers precision
``--rows`` *SIZE*                             Number of rows returned
``--row-offset`` *OFFSET*                     Skip the given number of rows
``--row-order-by`` *FIELD_NAME*               Field name whose values will be
                                              used to sort the returned rows
``--row-fields`` *FIELD_NAMES*                Comma-separated list of fields
                                              that will be returned in the
                                              sample
``--stat-fields`` *FIELD_NAME,FIELD_NAME*     Two comma-separated numeric field
                                              names that will be used to
                                              compute their Pearson's and
                                              Spearman's correlations and
                                              linear regression terms
``--stat-field`` *FIELD_NAME*                 Numeric field that will be used
                                              to compute Pearson's and
                                              Spearman's correlations and
                                              linear regression terms against
                                              the rest of the numeric fields in
                                              the sample
``--unique``                                  Repeated rows are removed from
                                              the sample
============================================= =================================
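
As a rough sketch of how the options in the table can be combined, the
following script assembles a ``bigmler sample`` call that reuses the sample id
from the examples above and requests statistics for two numeric fields. The
``build_sample_cmd`` helper is hypothetical and not part of BigMLer; it only
prints the command so that it can be reviewed before being run. Combining
``--sample-header`` and ``--unique`` with ``--stat-fields`` in a single call is
an assumption made for illustration.

.. code-block:: bash

    #!/bin/sh
    # Sample id reused from the examples in this document.
    SAMPLE_ID="sample/53b1f71437203f5ac303d5c0"

    # build_sample_cmd is a hypothetical helper (not part of BigMLer).
    #   $1: sample id
    #   $2: two comma-separated numeric field names
    # It prints the command line instead of executing it.
    build_sample_cmd() {
        printf 'bigmler sample --sample %s --sample-header --unique --stat-fields "%s"\n' \
            "$1" "$2"
    }

    build_sample_cmd "$SAMPLE_ID" "petal length,petal width"

Piping the printed line to ``sh`` would execute it. Following the description
of ``--stat-fields`` above, the correlations and linear regression terms for
the two fields would then be expected in the ``stat-info.json`` file.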