Cluster subcommand

Just as the simple bigmler command can generate all the resources leading to finding models and predictions for a supervised learning problem, the bigmler cluster subcommand will follow the steps to generate clusters and predict the centroids associated to your test data. To mimic what we saw in the bigmler command section, the simplest call is

bigmler cluster --train data/diabetes.csv

This command will upload the data in the data/diabetes.csv file and generate the corresponding source, dataset and cluster objects in BigML. You can use any of the generated objects to produce new clusters. For instance, you could set a subgroup of the fields of the generated dataset to produce a different cluster by using

bigmler cluster --dataset dataset/53b1f71437203f5ac30004ed \
                --cluster-fields="-blood pressure"

that would exclude the field blood pressure from the cluster creation input fields.

Similarly to the models and datasets, the generated clusters can be shared using the --shared option, e.g.

bigmler cluster --source source/53b1f71437203f5ac30004e0 \
                --shared

will generate a secret link for both the created dataset and cluster that can be used to share the resource selectively.

As models were used to generate predictions (class names in classification problems and an estimated number for regressions), clusters can be used to predict the subgroup of data that our input data is more similar to. Each subgroup is represented by its centroid, and the centroid is labelled by a centroid name. Thus, a cluster would classify our test data by assigning to each input an associated centroid name. The command

bigmler cluster --cluster cluster/53b1f71437203f5ac30004f0 \
                --test data/my_test.csv

would produce a file centroids.csv with the centroid name associated to each input. When the command is executed, the cluster information is downloaded to your local computer and the centroid predictions are computed locally, with no more latencies involved. Just in case you prefer to use BigML to compute the centroid predictions remotely, you can do so too

bigmler cluster --cluster cluster/53b1f71437203f5ac30004f0 \
                --test data/my_test.csv --remote

would create a remote source and dataset from the test file data, generate a batch centroid also remotely and finally download the result to your computer. If you prefer the result not to be dowloaded but to be stored as a new dataset remotely, add --no-csv and to-dataset to the command line. This can be specially helpful when dealing with a high number of scores or when adding to the final result the original dataset fields with --prediction-info full, that may result in a large CSV to be created as output.

The k-means algorithm used in clustering can only use training data that has no missing values in their numeric fields. Any data that does not comply with that is discarded in cluster construction, so you should ensure that enough number of rows in your training data file has non-missing values in their numeric fields for the cluster to be built and relevant. Similarly, the cluster cannot issue a centroid prediction for input data that has missing values in its numeric fields, so centroid predictions will give a “-” string as output in this case.

You can change the number of centroids used to group the data in the clustering procedure

bigmler cluster --dataset dataset/53b1f71437203f5ac30004ed \
                --k 3

And also generate the datasets associated to each centroid of a cluster. Using the --cluster-datasets option

bigmler cluster –cluster cluster/53b1f71437203f5ac30004f0
–cluster-datasets “Cluster 1,Cluster 2”

you can generate the datasets associated to a comma-separated list of centroid names. If no centroid name is provided, all datasets are generated.

Similarly, you can generate the models to predict if one instance is associated to each centroid of a cluster. Using the --cluster-models option

bigmler cluster –cluster cluster/53b1f71437203f5ac30004f0
–cluster-models “Cluster 1,Cluster 2”

you can generate the models associated to a comma-separated list of centroid names. If no centroid name is provided, all models are generated. Models can be useful to see which features are important to determine whether a certain instance belongs to a concrete cluster.

Cluster Specific Subcommand Options

`--cluster` CLUSTER	BigML cluster Id
`--clusters` PATH	Path to a file containing cluster/ids. One cluster per line (e.g., cluster/4f824203ce80051)
`--k` NUMBER_OF_CENTROIDS	Number of final centroids in the clustering
`--no-cluster`	No cluster will be generated
`--cluster-fields`	Comma-separated list of fields that will be used in the cluster construction
`--cluster-attributes` PATH	Path to a JSON file containing attributes (any of the updatable attributes described in the developers section ) to be used in the cluster creation call
`--cluster-datasets` CENTROID_NAMES	Comma-separated list of centroid names to generate the related datasets from a cluster. If no CENTROID_NAMES argument is provided all datasets are generated
`--cluster-file` PATH	Path to a JSON file containing the cluster info
`--cluster-seed` SEED	Seed to generate deterministic clusters
`--centroid-attributes` PATH	Path to a JSON file containing attributes (any of the updatable attributes described in the developers section ) to be used in the centroid creation call
`--batch-centroid-attributes` PATH	Path to a JSON file containing attributes (any of the updatable attributes described in the developers section ) to be used in the batch centroid creation call
`--cluster-models` CENTROID_NAMES	Comma-separated list of centroid names to generate the related models from a cluster. If no CENTROID_NAMES argument is provided all models are generated
`--summary-fields` SUMMARY_FIELDS	Comma-separated list of fields to be kept for reference but not used in the cluster bulding process
`--default-numeric-value` DEFAULT	The value used by default if a numeric field is missing. Spline interpolation is used by default and other options are “mean”, “median”, “minimum”, “maximum” and “zero”