Source subcommand

The bigmler source subcommand can be used to create sources from data files either with or without images. Regular CSV files will generate a Source resource in BigML whereas collections of images will generate composite sources.

Composite Sources

A Composite Source is an arbitrary collection of other BigML Sources. The Sources in a composite are called components. When all the components have the same fields, the composite itself will inherit those fields, and you will be able to create a dataset from it: the result will just be the concatenation of all the rows extracted from each component source inside the composite.

You could put together a list of CSV sources, or maybe a couple of CSV files and an ARFF file with the same exact fields, and the resulting composite will inherit those fields and behave like a single source for all practical purposes.
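The inheritance rule above can be pictured with a small toy model. This is only a sketch of the behaviour described in the text, not BigML code: the `Source` and `Composite` classes and the sample field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """Toy stand-in for a BigML source: a field list plus parsed rows."""
    fields: list
    rows: list


@dataclass
class Composite:
    """Toy stand-in for a composite source holding component sources."""
    components: list = field(default_factory=list)

    def fields(self):
        # The composite inherits fields only when every component agrees.
        if not self.components:
            return None
        first = self.components[0].fields
        if all(c.fields == first for c in self.components):
            return first
        return None

    def rows(self):
        # A dataset built from the composite is the concatenation of the
        # rows extracted from each component.
        if self.fields() is None:
            raise ValueError("components do not share the same fields")
        return [row for c in self.components for row in c.rows]


csv_a = Source(["age", "plan"], [[31, "basic"], [45, "pro"]])
csv_b = Source(["age", "plan"], [[27, "pro"]])
composite = Composite([csv_a, csv_b])
print(composite.fields())  # ['age', 'plan']
print(composite.rows())    # [[31, 'basic'], [45, 'pro'], [27, 'pro']]
```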

Like any other source, a (possibly empty) composite is created open, meaning that you can modify it. For composites, modifying means performing one of the following operations:

  • Adding components

bigmler source --source source/4f603fe203ce89bb2d000000 \
               --add-sources source/4f603fe203ce89bb2d000001,source/4f603fe203ce89bb2d000002 \
               --output-dir final-composite

  • Removing components

bigmler source --source source/4f603fe203ce89bb2d000000 \
               --remove-sources source/4f603fe203ce89bb2d000001,source/4f603fe203ce89bb2d000002 \
               --output-dir final-composite

  • Replacing the full list of components

bigmler source --source source/4f603fe203ce89bb2d000000 \
               --replace-sources source/4f603fe203ce89bb2d000001,source/4f603fe203ce89bb2d000002 \
               --output-dir final-composite
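
When these calls are scripted, the three commands above differ only in one flag, so they can be assembled by a small helper. The following is a minimal sketch in Python: the function name `composite_edit_command` is hypothetical, and only the flags shown in the commands above are used. The resulting list could be handed to `subprocess.run`, assuming `bigmler` is installed and credentials are configured.

```python
import shlex


def composite_edit_command(composite_id, operation, component_ids, output_dir):
    """Return the argument list for one composite-editing bigmler call."""
    flags = {
        "add": "--add-sources",
        "remove": "--remove-sources",
        "replace": "--replace-sources",
    }
    if operation not in flags:
        raise ValueError(f"unknown operation: {operation}")
    return [
        "bigmler", "source",
        "--source", composite_id,
        # Component ids are passed as one comma-separated string.
        flags[operation], ",".join(component_ids),
        "--output-dir", output_dir,
    ]


cmd = composite_edit_command(
    "source/4f603fe203ce89bb2d000000",
    "add",
    ["source/4f603fe203ce89bb2d000001", "source/4f603fe203ce89bb2d000002"],
    "final-composite",
)
print(shlex.join(cmd))
```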

A source can belong to as many composites as you wish, and composites can be nested, with the only limitation that a composite can only be a component if it’s closed (non-editable).

When a source belongs to one or more composites, it cannot be modified, regardless of whether it’s open or closed. That way all composites see the same version of the source all the time.

As you add components to a composite or remove them from it, the composite checks the compatibility of the fields of all its components and updates its own set of fields. Adding and removing sources is thus analogous to changing the parsing specification of, say, a CSV source: that operation can also change the collection of fields (and even the number of rows) extracted from the CSV.

Once you have finished adding components to a composite and want to use it to create datasets, you must close it. When you close a composite, all its components will be automatically closed for you.

Unlike all other kinds of source, composites created this way must be explicitly closed by an API call or UI action before a dataset can be created from them. That is mainly to avoid closing a composite by mistake, for instance while several collaborators are still working on it. Since a composite can have a huge number of components and closing it also closes every one of them, the operation may be relatively slow.
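
The open/closed rules described in this section can be summarised in a toy model. This is a sketch of the rules as stated in the text, not the BigML API: the class and method names are hypothetical.

```python
class Source:
    """Toy model of a source; not the BigML API."""

    def __init__(self, name):
        self.name = name
        self.closed = False
        self.composites = []  # composites that contain this source

    def can_be_modified(self):
        # A source that belongs to any composite is frozen, open or not.
        return not self.closed and not self.composites

    def close(self):
        self.closed = True


class Composite(Source):
    """Toy model of a composite source."""

    def __init__(self, name):
        super().__init__(name)
        self.components = []

    def add(self, source):
        if self.closed:
            raise ValueError("a closed composite cannot be edited")
        if isinstance(source, Composite) and not source.closed:
            raise ValueError("only closed composites can be components")
        self.components.append(source)
        source.composites.append(self)

    def close(self):
        # Closing a composite closes every component as well.
        for component in self.components:
            component.close()
        super().close()


image_a, image_b = Source("a"), Source("b")
album = Composite("album")
album.add(image_a)
album.add(image_b)
print(image_a.can_be_modified())       # False: it now belongs to a composite
album.close()
print(image_a.closed, image_b.closed)  # True True
```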

As an alternative to combining pre-existing sources into a composite, one can also upload a zip or tar file containing more than one file. BigML will then automatically create one source for each file inside the archive, and put them all together in a composite source.

Image Feature Extraction

BigML provides configurable Image Analysis extraction capabilities for Composites built on images. The Composite configuration options include automatically computing as new features the dimensions, average pixels, level histogram, histogram of gradients, wavelet subbands and even using features derived from pre-trained CNNs. For more detail, you can check the Image Analysis API documentation for composites.

All these options are in turn configurable when creating sources using BigMLer.

bigmler source --data cats.zip --dimensions --HOG \
               --pretrained-cnn mobilenet \
               --output-dir final-composite

Thanks to those new features, all kinds of models (not only Deepnets) can be built taking advantage of the image information. To learn the options available for image extraction configuration, see :ref:image-feature-extraction.

Annotated images as Composite Sources

BigML also lets you use images to build your Machine Learning models. In order to use images in BigML, each image file needs to be uploaded and transformed into a Source object, and the collection of images that will become your training data is handled in BigML as a collection of Sources. However, this collection of sources is in turn a Source (to be precise, a Composite Source). Each row in a Composite Source can contain one or more images, but it can also contain other fields related to those images, like labels, used in classification, or regions, used in object detection.

When storing images in a repository, it is common practice to keep them in directories or compressed files. The related fields, like labels or regions, are usually stored as additional files where some attribute points to the image they refer to. In BigML Composite Sources, though, images and annotations can be consolidated as different fields of the composite source, so that every row of data in the composite source contains the source created by uploading the related image plus the annotation fields associated with it.

As there’s not a single standard procedure to create and store these image and annotation files, BigMLer tries to give options that encompass most of the usual scenarios. We’ll see some examples using the specific bigmler source subcommand.

First scenario: We only need to upload images and they are already stored in a single compressed file.

bigmler source --train my_images.zip --output-dir output

In this case, the my_images.zip file is uploaded and a new composite source is created containing the images.

Second scenario: Images are stored in a directory.

bigmler source --train ./my_images_directory --output-dir output

The BigMLer command creates a local compressed file that contains the images stored in the directory given in the --train option. The compressed file is stored in the output directory and is then uploaded to BigML, resulting in a composite source.
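
To picture the local step, here is a rough Python equivalent of compressing an image directory into the output directory. It is only a sketch, not BigMLer's actual implementation: the function name `zip_images`, the archive name and the list of image extensions are assumptions made for illustration.

```python
import tempfile
import zipfile
from pathlib import Path

# Assumption: the extensions that count as images in this sketch.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def zip_images(image_dir, output_dir):
    """Compress the images under image_dir into output_dir/<dirname>.zip."""
    image_dir = Path(image_dir).resolve()
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    archive = output_dir / f"{image_dir.name}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(image_dir.rglob("*")):
            if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
                # Keep the directory name as the top-level folder in the zip.
                zf.write(path, path.relative_to(image_dir.parent).as_posix())
    return archive


# Demonstration on a throwaway directory with one image and one other file.
with tempfile.TemporaryDirectory() as tmp:
    images = Path(tmp) / "my_images_directory"
    images.mkdir()
    (images / "cat1.jpg").write_bytes(b"not a real image")
    (images / "notes.txt").write_text("skipped")
    archive = zip_images(images, Path(tmp) / "output")
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
print(names)  # ['my_images_directory/cat1.jpg']
```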

Third scenario: The images are stored in a directory and they have associated annotations which have been stored in an annotations JSON file.

bigmler source --train ./my_images_directory \
               --annotations-file annotations.json \
               --output-dir output

BigML uses a BigML-COCO syntax to provide the labels associated with images. The annotations file should contain a list of dictionaries, where each dictionary corresponds to one of the images. The reference to the annotated image is provided in the file attribute.

[{"file": "my_images/image1.jpg",
  "label": "label1"},
 {"file": "my_images/image2.jpg",
  "label": "label1"},
 {"file": "my_images/image3.jpg",
  "label": "label2"}]

In this case, the previous bigmler source command will zip the images contained in my_images_directory, upload them and create the corresponding composite source, and finally add a new field named label to the composite source, where the labels provided in the annotations.json file will be stored.
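
An annotations file with the structure shown above can be produced and sanity-checked with a few lines of Python before running the command. This is a minimal sketch: the helper names `write_annotations` and `check_annotations` are hypothetical, and the check only covers what the text states (a list of dictionaries, each with a file attribute).

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def write_annotations(labels_by_file, path):
    """Write {file: label} pairs as a list of {"file", "label"} dicts."""
    entries = [{"file": name, "label": label}
               for name, label in sorted(labels_by_file.items())]
    Path(path).write_text(json.dumps(entries, indent=2))
    return entries


def check_annotations(path):
    """Load an annotations file and count the images carrying each label."""
    entries = json.loads(Path(path).read_text())
    if not isinstance(entries, list):
        raise ValueError("expected a list of dictionaries")
    for entry in entries:
        if not isinstance(entry, dict) or "file" not in entry:
            raise ValueError(f"entry without a 'file' attribute: {entry!r}")
    return Counter(entry.get("label") for entry in entries)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "annotations.json"
    write_annotations(
        {
            "my_images/image1.jpg": "label1",
            "my_images/image2.jpg": "label1",
            "my_images/image3.jpg": "label2",
        },
        target,
    )
    counts = check_annotations(target)
print(dict(counts))  # {'label1': 2, 'label2': 1}
```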

These are the basic scenarios, but other annotation syntaxes, like VOC, YOLO or COCO, are also accepted. For the first two, the annotations are provided separately, in one file per image, so you need to provide the directory where these files are stored and the annotations language as options:

bigmler source --train ./my_images_directory \
               --annotations-dir ./annotations_directory \
               --annotations-language VOC \
               --output-dir output

Each annotation file can contain a folder attribute. It is interpreted as a subfolder that is appended to the given --train path on a per-image basis.

In contrast, COCO annotations are provided in a single file. In that case, you can point to the file using the --annotations-file option.
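
As an illustration of the folder rule, the sketch below reads the folder and filename elements of a Pascal VOC XML annotation and joins them to the training path. The way BigMLer resolves the path internally is an assumption based on the description above, and `voc_image_path` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def voc_image_path(annotation_xml, train_dir):
    """Resolve the image a VOC annotation refers to, relative to train_dir."""
    root = ET.fromstring(annotation_xml)
    folder = (root.findtext("folder") or "").strip()
    filename = (root.findtext("filename") or "").strip()
    if not filename:
        raise ValueError("annotation has no filename element")
    # An empty folder leaves the image directly under train_dir.
    return Path(train_dir) / folder / filename


voc_xml = """
<annotation>
  <folder>cats</folder>
  <filename>image1.jpg</filename>
  <object>
    <name>cat</name>
    <bndbox><xmin>12</xmin><ymin>20</ymin><xmax>112</xmax><ymax>100</ymax></bndbox>
  </object>
</annotation>
"""
resolved = voc_image_path(voc_xml, "./my_images_directory")
print(resolved.as_posix())  # my_images_directory/cats/image1.jpg
```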

bigmler source --train ./my_images_directory \
               --annotations-file ./my_coco_annotations.json \
               --annotations-language COCO \
               --output-dir output
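
Before uploading, it can help to check which labels a COCO file assigns to each image. The sketch below assumes the standard COCO layout (images, categories and annotations lists linked by ids) already loaded into a dictionary, for instance with json.load; the helper name `coco_labels_by_image` and the sample data are made up for illustration.

```python
from collections import defaultdict


def coco_labels_by_image(coco):
    """Map each image file name to the category names annotated on it."""
    names = {category["id"]: category["name"] for category in coco["categories"]}
    files = {image["id"]: image["file_name"] for image in coco["images"]}
    labels = defaultdict(list)
    for annotation in coco["annotations"]:
        file_name = files[annotation["image_id"]]
        labels[file_name].append(names[annotation["category_id"]])
    return dict(labels)


# Made-up sample; bbox follows the COCO [x, y, width, height] convention.
coco = {
    "images": [
        {"id": 1, "file_name": "image1.jpg"},
        {"id": 2, "file_name": "image2.jpg"},
    ],
    "categories": [{"id": 10, "name": "cat"}, {"id": 11, "name": "dog"}],
    "annotations": [
        {"image_id": 1, "category_id": 10, "bbox": [12, 20, 100, 80]},
        {"image_id": 1, "category_id": 11, "bbox": [150, 30, 60, 90]},
        {"image_id": 2, "category_id": 10, "bbox": [5, 5, 40, 40]},
    ],
}
labels = coco_labels_by_image(coco)
print(labels)  # {'image1.jpg': ['cat', 'dog'], 'image2.jpg': ['cat']}
```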

The created composite sources are editable up until you close them explicitly or you create a dataset from them. While editable, more annotations can be added to an existing source. For instance, to add annotations to the source generated in the third scenario, source/61373ea6520f903f48000001, we could use:

bigmler source --source source/61373ea6520f903f48000001 \
               --images-file my_images.zip \
               --annotations-file new_annotations.json \
               --output-dir output

Sometimes, a dataset's annotations are created on one set of images and need to be applied to a different one. For instance, map images can be annotated more accurately if they contain information layers that will not be available at prediction time. In those cases, annotations can be exported by downloading the dataset information as a CSV. Then, they can be added to a new composite source created from raw map images. In the example below, a new source is created in a BigML Organization project, specified using --org-project, and the --annotations-file option points to the downloaded dataset information, which contains a regions field called objects. Raw images should be named identically to the ones used in the original dataset and stored in the directory provided by the --data option. Using the --annotated-only flag will ensure that only images that have been annotated are uploaded to the new source.

bigmler source --org-project project/672546c1e2bd384653292ab0 \
               --annotations-file ./download/annotated_dataset.csv \
               --data "./raw-images" \
               --annotations-language CSV \
               --output-dir ./upload-raw \
               --annotations-field objects \
               --annotated-only

Source subcommand Options

--data PATH

Path to the data file or directory (if more than one file should be uploaded)

--images-file PATH

Path to a compressed file that contains images

--annotations-file PATH

Path to the file that contains the annotations for images

--annotations-dir DIRECTORY

Path to a directory that contains annotation files, one per image

--annotations-language LANGUAGE

Language that sets the syntax of the annotations, e.g. VOC, YOLO, COCO or CSV

--source SOURCE ID

Id for the source that will be updated

--source-in PATH

Path to a file that contains source Ids (one per line). The last one is used as the source Id for updates

--sources-in PATH

Path to a file that contains source Ids (one per line). All of them are used as the list of sources to create a composite source

--close

Causes a source to be closed for editing

--open

If the source is closed, clones the source into a new one open for editing

--add-sources STRING

Adds a comma-separated list of sources to a composite source

--delete-sources STRING

Deletes a comma-separated list of sources from the composite source and also individually if they don’t belong to another composite

--remove-sources STRING

Removes a comma-separated list of sources from the composite source, keeping them as individual sources

--rows-values-json PATH

Path to a JSON file that contains the values for some rows and fields

--rows-indices STRING

Comma-separated list of indices of the rows that will be affected by the --rows-values-json option

--rows-components STRING

Comma-separated list of source Ids that will be affected by the --rows-values-json option

Image Analysis Specific Options

--no-image-analysis

Disables Image Feature Extraction (only Deepnets will be able to use the image information)

--dimensions

Enables Image dimensions extraction

--average-pixels

Enables Image average pixels extraction

--level-histogram

Enables color level histogram extraction

--HOG

Enables histogram of gradients extraction

--ws-level

Enables wavelet subbands extraction and sets the number of iterations.

--pretrained-cnn

Enables extraction of particular pretrained CNN features. The available options for CNNs are: mobilenet, mobilenetv2 and resnet18
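
When the source creation is scripted, the image analysis flags listed above can be assembled from a few keyword arguments. The following is a minimal sketch: the helper name `image_analysis_args` is hypothetical, and passing the number of iterations as the value of --ws-level is an assumption drawn from its description.

```python
PRETRAINED_CNNS = {"mobilenet", "mobilenetv2", "resnet18"}


def image_analysis_args(dimensions=False, average_pixels=False,
                        level_histogram=False, hog=False,
                        ws_level=None, pretrained_cnn=None):
    """Translate keyword settings into bigmler image analysis flags."""
    args = []
    if dimensions:
        args.append("--dimensions")
    if average_pixels:
        args.append("--average-pixels")
    if level_histogram:
        args.append("--level-histogram")
    if hog:
        args.append("--HOG")
    if ws_level is not None:
        # Assumption: the number of iterations follows the flag as its value.
        args += ["--ws-level", str(ws_level)]
    if pretrained_cnn is not None:
        if pretrained_cnn not in PRETRAINED_CNNS:
            raise ValueError(f"unsupported pretrained CNN: {pretrained_cnn}")
        args += ["--pretrained-cnn", pretrained_cnn]
    return args


# Rebuilds the arguments of the cats.zip example shown earlier.
command = [
    "bigmler", "source", "--data", "cats.zip",
    *image_analysis_args(dimensions=True, hog=True, pretrained_cnn="mobilenet"),
    "--output-dir", "final-composite",
]
print(" ".join(command))
```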