Sample subcommand

You can extract samples from your datasets in BigML using the bigmler sample subcommand. When a new sample is requested, a copy of the dataset is stored in a special format in an in-memory cache. This sample can then be used, before its expiration time, to extract data from the related dataset by setting options such as the number of rows or the fields to be retrieved. You can start from scratch by uploading your data to BigML, creating the corresponding source and dataset, and extracting your sample from it:

bigmler sample --train data/iris.csv --rows 10 --row-offset 20

This command will create a source, a dataset, and a sample object, whose id will be stored in the samples file in the output directory. It will then extract 10 rows of data, starting from the 21st, and store them in the sample.csv file.
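To make the effect of --rows and --row-offset concrete, here is a minimal Python sketch of the same selection applied to a local list. It only illustrates the offset arithmetic: select_rows is a hypothetical helper, not part of bigmler, and it ignores how BigML samples and orders the rows.

```python
def select_rows(rows, size, offset=0):
    """Skip `offset` rows, then return at most `size` rows."""
    return rows[offset:offset + size]


# 150 rows numbered from 1, standing in for the rows of a dataset.
rows = list(range(1, 151))
selected = select_rows(rows, size=10, offset=20)
print(selected[0], selected[-1])  # prints "21 30"
```

With an offset of 20 the first 20 rows are skipped, so the first returned row is the 21st.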

You can reuse an existing sample by using its id in the command.

bigmler sample --sample sample/53b1f71437203f5ac303d5c0 \
               --sample-header --row-order-by="-petal length" \
               --row-fields "petal length,petal width" --mode linear

This command will create a new sample.csv file with a header row where only the petal length and petal width fields are retrieved. The --mode linear option causes the first available rows to be returned, and the --row-order-by="-petal length" option returns these rows sorted in descending order of petal length.
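Once the sample.csv file exists, it can be inspected with Python's standard csv module. The following is a minimal sketch that assumes the file was generated with --sample-header, so its first row holds the field names. read_sample and is_descending are hypothetical helpers, and the path you pass depends on your output directory.

```python
import csv


def read_sample(path):
    """Read a sample CSV that has a header row into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def is_descending(rows, field):
    """Check that the numeric values of `field` never increase."""
    values = [float(row[field]) for row in rows]
    return all(a >= b for a, b in zip(values, values[1:]))


# Example use (the path is an assumption, adjust it to your output directory):
# rows = read_sample("my_output_dir/sample.csv")
# print(is_descending(rows, "petal length"))
```

The second helper is a quick way to confirm that a descending --row-order-by setting was applied to the retrieved rows.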

You can also add some statistical information to the sample rows by using the --stat-field or --stat-fields options. Adding them to the command will generate a stat-info.json file where the Pearson’s and Spearman’s correlations and the linear regression terms will be stored in JSON format.
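For readers unfamiliar with these statistics, the sketch below computes them locally for two numeric columns using the textbook definitions. It is an illustration only: it does not reproduce BigML's own computation or the layout of stat-info.json, and all function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def linear_regression(xs, ys):
    """Least-squares slope and intercept of y as a function of x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx
```

Pearson's correlation measures linear association, while Spearman's only depends on the ordering of the values, which makes it less sensitive to outliers.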

You can also filter the sample rows by the values in their fields using the --fields-filter option. The option must be set to a string containing the conditions that the rows must meet, expressed using field ids and values.

bigmler sample --sample sample/53b1f71437203f5ac303d5c0 \
               --fields-filter "000001=&!000004=Iris-setosa"

With this command, only rows where field id 000001 is missing and field id 000004 is not Iris-setosa will be retrieved. You can check the available operators and syntax in the samples’ developers doc. More available options can be found in the Samples Subcommand Options section.
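The two conditions in that filter can be pictured with a local Python equivalent. This sketch does not parse the query string or call BigML. It simply applies the same two tests to rows held as dictionaries keyed by field id, assuming a missing value is represented as None. The rows are made-up illustration data.

```python
def matches(row):
    """True when field 000001 is missing and field 000004 is not Iris-setosa."""
    return row.get("000001") is None and row.get("000004") != "Iris-setosa"


# Made-up rows keyed by field id, for illustration only.
rows = [
    {"000001": None, "000004": "Iris-virginica"},  # kept
    {"000001": 3.5, "000004": "Iris-virginica"},   # dropped: 000001 present
    {"000001": None, "000004": "Iris-setosa"},     # dropped: Iris-setosa
]
kept = [row for row in rows if matches(row)]
print(len(kept))  # prints 1
```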

.. _sample_options:

Samples Subcommand Options

`--sample` SAMPLE	BigML sample Id
`--samples` PATH	Path to a file containing sample/ids. One sample per line (e.g., sample/4f824203ce80051)
`--no-sample`	No sample will be generated
`--sample-fields` FIELD_NAMES	Comma-separated list of fields that will be used in the sample construction
`--sample-attributes` PATH	Path to a JSON file containing attributes (any of the updatable attributes described in the developers section) to be used in the sample creation call
`--fields-filter` QUERY	Query string that will be used as filter before selecting the sample rows. The query string can be built using the field ids, their values and the usual operators. You can see some examples in the developers section
`--sample-header`	Adds a headers row to the sample.csv output
`--row-index`	Prepends a column to the sample rows with the absolute row number
`--occurrence`	Prepends a column to the sample rows with the number of occurrences of each row. When used with --row-index, the occurrence column will be placed after the index column
`--precision`	Decimal numbers precision
`--rows` SIZE	Number of rows returned
`--row-offset` OFFSET	Skip the given number of rows
`--row-order-by` FIELD_NAME	Field name whose values will be used to sort the returned rows. Prefix the name with - to sort in descending order (e.g., "-petal length")
`--row-fields` FIELD_NAMES	Comma-separated list of fields that will be returned in the sample
`--stat-fields` FIELD_NAME,FIELD_NAME	Two comma-separated numeric field names that will be used to compute their Pearson’s and Spearman’s correlations and linear regression terms
`--stat-field` FIELD_NAME	Numeric field that will be used to compute Pearson’s and Spearman’s correlations and linear regression terms against the rest of numeric fields in the sample
`--unique`	Repeated rows are removed from the sample
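As a rough picture of what --occurrence and --unique describe, the sketch below removes repeated rows from a local list and prepends each row's occurrence count. How BigML combines the two options is an assumption here, and unique_with_occurrences is a hypothetical helper working on made-up rows.

```python
from collections import Counter


def unique_with_occurrences(rows):
    """Return each distinct row once, prefixed by how many times it appeared."""
    counts = Counter(tuple(row) for row in rows)
    # Counter keeps first-seen order, so the original row order is preserved.
    return [[count, *row] for row, count in counts.items()]


rows = [[5.1, "Iris-setosa"], [4.9, "Iris-setosa"], [5.1, "Iris-setosa"]]
print(unique_with_occurrences(rows))
# prints [[2, 5.1, 'Iris-setosa'], [1, 4.9, 'Iris-setosa']]
```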