.. toctree::
   :maxdepth: 2
   :hidden:

.. _bigmler-cluster:

Cluster subcommand
==================

Just as the simple ``bigmler`` command can generate all the resources
leading to models and predictions for a supervised learning problem, the
``bigmler cluster`` subcommand follows the steps needed to generate
clusters and predict the centroids associated with your test data. To
mimic what we saw in the ``bigmler`` command section, the simplest call is

.. code-block:: bash

    bigmler cluster --train data/diabetes.csv

This command will upload the data in the ``data/diabetes.csv`` file and
generate the corresponding ``source``, ``dataset`` and ``cluster`` objects
in BigML. You can use any of the generated objects to produce new clusters.
For instance, you could use a subset of the fields of the generated dataset
to produce a different cluster:

.. code-block:: bash

    bigmler cluster --dataset dataset/53b1f71437203f5ac30004ed \
                    --cluster-fields="-blood pressure"

This call excludes the field ``blood pressure`` from the input fields used
to create the cluster.

As with models and datasets, the generated clusters can be shared
using the ``--shared`` option, e.g.

.. code-block:: bash

    bigmler cluster --source source/53b1f71437203f5ac30004e0 \
                    --shared

will generate a secret link for both the created dataset and cluster that
can be used to share the resource selectively.

Just as models are used to generate predictions (class names in
classification problems and an estimated number in regressions), clusters
can be used to predict the subgroup of data that our input data is most
similar to. Each subgroup is represented by its centroid, and the centroid
is labelled by a centroid name. Thus, a cluster classifies our test data
by assigning an associated centroid name to each input. The command

.. code-block:: bash

    bigmler cluster --cluster cluster/53b1f71437203f5ac30004f0 \
                    --test data/my_test.csv

would produce a file ``centroids.csv`` with the centroid name associated
to each input. When the command is executed, the cluster information is
downloaded to your local computer and the centroid predictions are computed
locally, with no further network latency. If you prefer to use BigML to
compute the centroid predictions remotely, you can do so too:

.. code-block:: bash

    bigmler cluster --cluster cluster/53b1f71437203f5ac30004f0 \
                    --test data/my_test.csv --remote

would create a remote source and dataset from the test file data,
generate a ``batch centroid`` also remotely and finally download the result
to your computer. If you prefer the result not to be
downloaded but to be stored as a new dataset remotely, add ``--no-csv`` and
``--to-dataset`` to the command line. This can be especially helpful when
dealing with a large number of centroid predictions or when adding the
original dataset fields to the final result with ``--prediction-info full``,
which may produce a large CSV as output.

The k-means algorithm used in clustering can only use training data that
has no missing values in its numeric fields. Any row that does not comply
with this is discarded during cluster construction, so you should ensure
that enough rows in your training data file have non-missing values in
their numeric fields for the cluster to be built and to be relevant.
Similarly, the cluster cannot issue a centroid prediction for input data
that has missing values in its numeric fields, so centroid predictions
will give a "-" string as output in this case.

You can change the number of centroids used to group the data in the
clustering procedure

.. code-block:: bash

    bigmler cluster --dataset dataset/53b1f71437203f5ac30004ed \
                    --k 3

You can also generate the datasets associated to each centroid of a cluster.
Using the ``--cluster-datasets`` option

.. code-block:: bash

    bigmler cluster --cluster cluster/53b1f71437203f5ac30004f0 \
                    --cluster-datasets "Cluster 1,Cluster 2"

you can generate the datasets associated to a comma-separated list of
centroid names. If no centroid name is provided, all datasets are generated.

Similarly, you can generate the models that predict whether an instance is
associated to each centroid of a cluster. Using the ``--cluster-models``
option

.. code-block:: bash

    bigmler cluster --cluster cluster/53b1f71437203f5ac30004f0 \
                    --cluster-models "Cluster 1,Cluster 2"

you can generate the models associated to a comma-separated list of
centroid names. If no centroid name is provided, all models are generated.
Models can be useful to see which features are important in determining
whether a certain instance belongs to a given cluster.

Cluster Specific Subcommand Options
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

========================================= =====================================
``--cluster`` *CLUSTER*                   BigML cluster Id
``--clusters`` *PATH*                     Path to a file containing
                                          cluster/ids. One cluster per line
                                          (e.g., cluster/4f824203ce80051)
``--k`` *NUMBER_OF_CENTROIDS*             Number of final centroids in the
                                          clustering
``--no-cluster``                          No cluster will be generated
``--cluster-fields``                      Comma-separated list of fields that
                                          will be used in the cluster
                                          construction
``--cluster-attributes`` *PATH*           Path to a JSON file containing
                                          attributes (any of the updatable
                                          attributes described in the
                                          developers section) to be used in
                                          the cluster creation call
``--cluster-datasets`` *CENTROID_NAMES*   Comma-separated list of centroid
                                          names to generate the related
                                          datasets from a cluster. If no
                                          CENTROID_NAMES argument is provided,
                                          all datasets are generated
``--cluster-file`` *PATH*                 Path to a JSON file containing the
                                          cluster info
``--cluster-seed`` *SEED*                 Seed to generate deterministic
                                          clusters
``--centroid-attributes`` *PATH*          Path to a JSON file containing
                                          attributes (any of the updatable
                                          attributes described in the
                                          developers section) to be used in
                                          the centroid creation call
``--batch-centroid-attributes`` *PATH*    Path to a JSON file containing
                                          attributes (any of the updatable
                                          attributes described in the
                                          developers section) to be used in
                                          the batch centroid creation call
``--cluster-models`` *CENTROID_NAMES*     Comma-separated list of centroid
                                          names to generate the related
                                          models from a cluster. If no
                                          CENTROID_NAMES argument is provided,
                                          all models are generated
``--summary-fields`` *SUMMARY_FIELDS*     Comma-separated list of fields to be
                                          kept for reference but not used in
                                          the cluster building process
``--default-numeric-value`` *DEFAULT*     The value used by default if a
                                          numeric field is missing. Spline
                                          interpolation is used by default and
                                          other options are "mean", "median",
                                          "minimum", "maximum" and "zero"
========================================= =====================================
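
As a rough illustration of how several of the options in the table can be
combined, the sketch below builds a single ``bigmler cluster`` call from a
bash array and prints it instead of running it. The dataset id is the one
used in the examples of this section, the seed value ``my-seed`` is an
arbitrary placeholder, and the ``CLUSTER_OPTS`` and ``DATASET`` variable
names are only illustrative; none of them are required by BigMLer.

.. code-block:: bash

    # Dataset id reused from the examples in this section; use your own.
    DATASET="dataset/53b1f71437203f5ac30004ed"

    # Keep the options in an array so each one can carry a comment.
    CLUSTER_OPTS=(
        --dataset "$DATASET"
        --k 3                          # number of final centroids
        --cluster-seed "my-seed"       # arbitrary seed, for repeatable clusters
        --default-numeric-value mean   # value used when a numeric field is missing
    )

    # Print the command for review; drop "echo" to actually run it.
    echo bigmler cluster "${CLUSTER_OPTS[@]}"

Printing the command first makes it easy to check the final option list
before any resources are created in BigML.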