PCA subcommand
The bigmler pca subcommand generates all the
resources needed to build
a PCA model and use it to generate projections.
PCA (Principal Component Analysis) is an unsupervised
learning method for dimensionality reduction that finds new features
capturing as much of the variation in the data as possible. The new features
are built as linear combinations of the original features in the dataset.
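To make the idea of "linear combinations of the original features" concrete, here is a minimal sketch of a two-field PCA in plain Python. It is only an illustration of the general technique, not BigML's implementation (which may, for instance, standardize fields or handle non-numeric ones); the function names and the sample rows are made up for this example.

```python
import math


def fit_pca_2d(rows):
    """Fit a PCA on (x, y) rows; return (means, components, variances)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
    half_trace = (sxx + syy) / 2
    radius = math.hypot((sxx - syy) / 2, sxy)
    variances = [half_trace + radius, half_trace - radius]
    # Angle of the direction of maximum variance (first component).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    components = [
        (math.cos(theta), math.sin(theta)),
        (-math.sin(theta), math.cos(theta)),
    ]
    return (mean_x, mean_y), components, variances


def project(row, means, components):
    """Each new feature is a weighted sum of the centered original fields."""
    cx, cy = row[0] - means[0], row[1] - means[1]
    return [wx * cx + wy * cy for wx, wy in components]


# Made-up rows lying exactly on the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
means, components, variances = fit_pca_2d(rows)
projection = project((3.0, 6.0), means, components)
print(variances, projection)
```

Because the sample rows lie on a single line, all the variance falls on the first component and the second one is numerically zero, which is the situation where dropping a component loses no information.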
The simplest call to build a PCA is:
bigmler pca --train data/iris.csv
This command uploads the data in the data/iris.csv file and generates
the corresponding source, dataset and PCA
objects in BigML. You
can use any of the generated objects to produce new PCAs.
For instance, you could select a subset of the fields of the generated dataset
to produce a different PCA model by using
bigmler pca --dataset dataset/53b1f71437203f5ac30004ed \
--pca-fields="-sepal length"
This would exclude the field sepal length from the input fields
used to create the PCA model.
As with models and datasets, the generated PCAs
can be shared using the --shared option, e.g.
bigmler pca --source source/53b1f71437203f5ac30004e0 \
--shared
This will generate a secret link for both the created dataset and the PCA, which can be used to share the resources selectively.
The PCA can be used to assign a projection (a set of new components) to each input instance. The command
bigmler pca \
--pca pca/5331f71435203f5ac30005c0 \
--test data/test_iris.csv \
--output projections.csv
would produce a file projections.csv with the projections associated
to each input. Note that when building projections for
a supervised learning problem, the objective field should never be part of
the PCA input fields, because including it would cause leakage.
To remove the objective field, you can use the --exclude-objective
flag. Also, the train/test split should be done before creating the PCA from
the training dataset, so that no information from the test set leaks
into the new components.
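The order of operations described above can be sketched in plain Python. This is a hedged illustration of the workflow only: the helper names and the synthetic rows are hypothetical, and the "fit" step is reduced to computing the centering means that a PCA would learn, which is enough to show which data each step is allowed to see.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle deterministically and split into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def drop_field(rows, field):
    """Remove one field (here, the objective) from every row."""
    return [{k: v for k, v in row.items() if k != field} for row in rows]


def fit_means(rows):
    """Stand-in for fitting: learn per-field means from the given rows only."""
    return {f: sum(r[f] for r in rows) / len(rows) for f in rows[0]}


def center(rows, means):
    """Apply previously learned means to any set of rows."""
    return [{f: r[f] - means[f] for f in means} for r in rows]


# Synthetic rows, made up for this sketch; "species" plays the objective field.
rows = [
    {"sepal length": 5.0 + 0.1 * i, "petal length": 1.0 + 0.2 * i, "species": i % 3}
    for i in range(10)
]

train, test = split_rows(rows)               # 1. split before anything is fitted
train_inputs = drop_field(train, "species")  # 2. the objective never enters the PCA
test_inputs = drop_field(test, "species")
means = fit_means(train_inputs)              # 3. fit on the training rows only
test_centered = center(test_inputs, means)   # 4. reuse the fitted values on test rows
```

The key point is that step 3 never sees the test rows or the objective field, so neither can influence the components.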
You can also change some parameters of the
PCA model, such as --max-components or --variance-threshold,
to select the number of components to be used in the projection.
Please check the PCA section
of the API documentation for a detailed
description of the available arguments. For example:
bigmler pca --dataset dataset/53b1f71437203f5ac30004ed \
--max-components 4 \
--test data/test_iris.csv \
--output projections.csv
With this command, only the first 4 components of the PCA are used to generate projections, thus reducing the dimensionality of the dataset to 4.
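As a rough illustration of how a component cap and a variance threshold can each determine the number of components kept, consider the sketch below. It assumes the threshold means "keep the smallest number of leading components whose cumulative share of the variance reaches the threshold"; how BigML interprets --variance-threshold exactly should be checked in the API documentation. The function name and the variance values are hypothetical.

```python
def select_components(variances, max_components=None, variance_threshold=None):
    """Return how many leading components to keep.

    `variances` must be sorted from largest to smallest.
    """
    total = sum(variances)
    keep = len(variances)
    if variance_threshold is not None:
        cumulative = 0.0
        for count, variance in enumerate(variances, start=1):
            cumulative += variance
            # Stop at the first component that reaches the requested share.
            if cumulative / total >= variance_threshold:
                keep = count
                break
    if max_components is not None:
        keep = min(keep, max_components)
    return keep


# Made-up component variances, largest first.
variances = [4.2, 0.24, 0.08, 0.02]
print(select_components(variances, variance_threshold=0.95))
print(select_components(variances, max_components=1, variance_threshold=0.95))
```

When both limits are given, this sketch applies the stricter of the two.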
When the previous command is executed, the PCA information is downloaded to your local computer and the PCA projections are computed locally, with no further latency involved. If you prefer to use BigML to compute the projections remotely, you can do that too:
bigmler pca \
--pca pca/53b1f71435203f5ac30005c0 \
--test data/my_test.csv --remote
would create a remote source and dataset from the test file data,
generate a batch projection, also remotely, and finally
download the result to your computer. If you prefer the result not to be
downloaded but to be stored as a new dataset remotely, add --no-csv and
--to-dataset to the command line. Some output format configurations can
be controlled using the --projection-header option, which causes
the headers of the fields to be placed as a first row in the projections file,
or the --projection-fields option, which can be set to all or to
a comma-separated list of fields of the original dataset that will be included
in the projections file before the projection components.
Other output configurations can be
set by using the --batch-projection-attributes option pointing to a JSON
file that contains the desired attributes, like:
{"output_fields": ["petal length", "sepal length"],
"all_fields": true}
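One way to produce such a file without hand-editing JSON is to write it from a short script. The sketch below uses the attributes shown above; the file name batch_projection_attributes.json is an arbitrary choice for this example, and the command is only assembled and printed, not executed.

```python
import json
import shlex
from pathlib import Path

# Attributes taken from the example above.
attributes = {
    "output_fields": ["petal length", "sepal length"],
    "all_fields": True,
}

# Hypothetical file name; any path can be used.
attributes_path = Path("batch_projection_attributes.json")
attributes_path.write_text(json.dumps(attributes, indent=2))

# Assemble the command line that would use the file (printed, not run).
command = [
    "bigmler", "pca",
    "--pca", "pca/53b1f71435203f5ac30005c0",
    "--test", "data/my_test.csv",
    "--remote",
    "--batch-projection-attributes", str(attributes_path),
]
print(shlex.join(command))
```

Generating the file with json.dumps guarantees valid JSON (for example, true instead of Python's True), which avoids a common source of errors in attribute files.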
PCA Subcommand Options
Option | Description
--pca | BigML PCA Id
--pcas | Path to a file containing PCA ids, one PCA per line (e.g., pca/4f824203ce80051)
--no-pca | No PCA will be generated
--pca-file | Path to a file containing a JSON PCA structure
--pca-fields | Comma-separated list of fields that will be used in the PCA construction
--pca-attributes | Path to a JSON file that contains the attributes to configure the PCA
--max-components | Maximum number of components to be used in projections
--variance-threshold | Maximum variance covered with the subset of components to be used in projections
--exclude-objective | When set, excludes the objective field in the dataset from the PCA input fields
--batch-projection-attributes | Path to a JSON file that contains the attributes to configure the batch projection
--projection-header | When set, adds a headers row at the top of the generated projections file
--projection-fields | Comma-separated list of fields in the test set to be added to the projections file. Use all to include every field
--no-no-pca | PCA will be generated