PCA subcommand

The bigmler pca subcommand generates all the resources needed to buid a PCA model and use it to predict. The PCA model is an unsupervised learning method for dimensionality reduction that tries to find new features that can maximize the description of the data variation. The new features are built as linear combinations of the original features in the dataset.

The simplest call to build a PCA is:

bigmler pca --train data/iris.csv

uploads the data in the data/iris.csv file and generates the corresponding source, dataset and pca objects in BigML. You can use any of the generated objects to produce new PCAs. For instance, you could set a subgroup of the fields of the generated dataset to produce a different PCA model by using

bigmler pca --dataset dataset/53b1f71437203f5ac30004ed \
            --pca-fields="-sepal length"

that would exclude the field sepal length from the PCA model creation input fields.

Similarly to the models and datasets, the generated PCAs can be shared using the --shared option, e.g.

bigmler pca --source source/53b1f71437203f5ac30004e0 \
            --shared

will generate a secret link for both the created dataset and PCA, that can be used to share the resource selectively.

The PCA can be used to assign a projection (a set of new components) to each input data set. The command

bigmler pca \
        --pca pca/5331f71435203f5ac30005c0 \
        --test data/test_iris.csv \
        --output projections.csv

would produce a file projections.csv with the projections associated to each input. It’s important to remark that to build projections for a supervised learning problem the objective field should never be part of the PCA input fields. Including the objective in the PCA would cause leakage. In order to remove the objective field, you can use the --exclude-objective flag. Also, the train/test split should be done before creating the PCA from the training dataset to avoid leakage from the test set data in the new components.

You can also change some parameters in the PCA model, like the --max-components or --variance-threshold to select the number of components to be used in the projection. Please check the PCA section of the API documentation for a detailed description of the available arguments.

bigmler pca --dataset dataset/53b1f71437203f5ac30004ed \
            --max-components 4 \
            --test data/test_iris.csv \
            --output projections.csv

with this code, only the first 4 components of the PCA are used to generate projections, reducing thus the dimensionality of the dataset to 4.

When previous command is executed, the PCA information is downloaded to your local computer and the PCA projections are computed locally, with no more latencies involved. Just in case you prefer to use BigML to compute the projections remotely, you can do so too

bigmler pca
        --pca pca/53b1f71435203f5ac30005c0 \
        --test data/my_test.csv --remote

would create a remote source and dataset from the test file data, generate a batch projection also remotely and finally download the result to your computer. If you prefer the result not to be dowloaded but to be stored as a new dataset remotely, add --no-csv and to-dataset to the command line. Some output format configurations can be controlled using the --projection-header option, that causes the headers of the fields to be placed as a first row in the projections file, or the --projection-fields option, that can be set to all or to a comma-separated list of fields of the original dataset that will be included in the projections file before the projection components. Other output configurations can be set by using the --batch-projection-attributes option pointing to a JSON file that contains the desired attributes, like:

{"output_fields": ["petal length", "sepal length"],
 "all_fields": true}

PCA Subcommand Options

--pca PCA

BigML PCA Id

--pcas PATH

Path to a file containing PCA/ids. One PCA per line (e.g., pca/4f824203ce80051)

--no-pca

No PCA will be generated

--pca-file PATH

Path to a file containing a JSON PCA structure.

--pca-fields PCA_FIELDS

Comma-separated list of fields that will be used in the PCA construction

--pca-attributes PATH

Path to a JSON file that contains the attributes to configure the PCA

--max-components INTEGER

Maximum number of components to be used in projections

--variance-threshold NUMBER

Maximum variance covered with the subset of components to be used in projections

--exclude-objective

When set, excludes the objective field in the dataset from the PCA input fields

--batch-projection-attributes PATH

Path to a JSON file that contains the attributes to configure the batch projection

--projection-header

When set, adds a headers row at the top of the generated projections file

--projection-fields FIELDS

Comma-separated list of field in the test set to be added to the projections file. Use all to include all fields

--no-no-pca

PCA will be generated