Get Started with Neutrino

Neutrino is a deep learning library for optimizing and accelerating deep neural networks to make them faster, smaller, and more energy-efficient. Neural network designers can specify a variety of pre-trained models, datasets, and target computation constraints and ask the engine to optimize the network. High-level APIs are provided to make the optimization process easy and transparent to the user. Neutrino can be biased to concentrate on compression (reducing the disk size taken by the model) or latency (reducing the forward call’s execution time).

[Figure: overview of the Neutrino optimization engine]

Note

Currently we support MLP/CNN-based deep learning architectures.

Follow these simple steps to learn how to use Neutrino in your project.

Choose a Framework

Neutrino supports the PyTorch framework (with TensorFlow support coming soon). Framework support comes as a separate package; once it is installed, the framework object needs to be instantiated and passed to the engine.

from neutrino.framework.torch_framework import TorchFramework
framework = TorchFramework()

Choose a Dataset

The engine expects you to provide your dataset as a data_splits dictionary mapping string names (keys) to dataloaders (values). The engine always reads the train key in data_splits to access training data. You can control which split the engine uses for validation by passing the eval_split argument to the Neutrino configuration. Alternatively, you can use one of the formatted, ready-to-use benchmark datasets from deeplite-torch-zoo.

Example:

import torch
import torchvision
import torchvision.transforms as transforms


def get_cifar100_dataset(dataroot, batch_size):
    # training split with standard CIFAR-100 augmentation and normalization
    trainset = torchvision.datasets.CIFAR100(root=dataroot,
                                             train=True,
                                             download=True,
                                             transform=transforms.Compose([
                                                 transforms.RandomCrop(32, padding=4),
                                                 transforms.RandomHorizontalFlip(),
                                                 transforms.ToTensor(),
                                                 transforms.Normalize((0.4914, 0.4822, 0.4465),
                                                                      (0.2023, 0.1994, 0.2010))
                                             ]))
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                              shuffle=True, num_workers=4, pin_memory=True)

    # test split with normalization only
    testset = torchvision.datasets.CIFAR100(root=dataroot,
                                            train=False,
                                            download=True,
                                            transform=transforms.Compose([
                                                transforms.ToTensor(),
                                                transforms.Normalize((0.4914, 0.4822, 0.4465),
                                                                     (0.2023, 0.1994, 0.2010))
                                            ]))
    testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                             shuffle=False, num_workers=4, pin_memory=True)

    # dictionary of dataloaders expected by the engine as data_splits
    return {
            'train': trainloader,
            'test': testloader
            }
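
The returned dictionary can be passed directly to the engine as data_splits. A minimal usage sketch (the data root path and batch size below are illustrative values):

# './data' and batch_size=128 are illustrative; use the values from your own training setup
data_splits = get_cifar100_dataset(dataroot='./data', batch_size=128)
trainloader = data_splits['train']  # always read by the engine
testloader = data_splits['test']    # used for evaluation when eval_split='test'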

Note

You must use the same splits for both training and optimizing your model. If you used a subset of the training data as a validation set, use the same training/validation split for the optimization process.

Note

Please use the same batch size for the optimization process as the one used to train the original network.

Choose a Model

The next step is to define the pre-trained model you want to optimize as the reference model. You can use your own pre-trained custom model or a publicly available one. The model must be compatible with the framework you chose; for example, a PyTorch model must be a subclass of torch.nn.Module. Alternatively, you can use one of the pretrained models from deeplite-torch-zoo.

Example:

import torch

# Option 1: load your own pre-trained model
reference_model = TheModelClass(*args, **kwargs)
reference_model.load_state_dict(torch.load(PATH))

# Option 2: use the torchvision model zoo
import torchvision.models as models
reference_model = models.resnet18(pretrained=True)

# Option 3: use the Neutrino zoo (deeplite-torch-zoo)
from deeplite_torch_zoo.wrappers.wrapper import get_model_by_name
reference_model = get_model_by_name(model_name=args.arch,
                                    dataset_name=args.dataset,
                                    pretrained=True,
                                    progress=True)

Run Optimization Engine

We provide a simple yet powerful process with multiple user-guided controls to optimize your models. First, instantiate the Neutrino class and pass the required arguments: data_splits, reference_model, and framework. In addition, a config dictionary needs to be supplied with the optimization parameter and any other parameters that configure the optimization and training process.

There are three optimization modes provided by Neutrino: compression, latency, and quantization. Each makes use of the common Neutrino config parameters as well as mode-specific parameters. If you are just getting started, compression mode is our recommended first step.

Neutrino Configuration

You can pass several parameters to the Neutrino engine through the config. Every Neutrino job makes use of a config dictionary with parameters described below.

optimization

Select which optimization mode the engine should use. The engine currently supports:

compression: maximizes reduction of the model’s size on disk (in bytes)

latency: maximizes reduction of the model execution time

quantization: compresses the model with quantization and reduces execution time when deployed with Deeplite RT

Keep in mind that compression mode may also improve latency, and latency mode may also reduce model size.
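
For example, a minimal sketch of selecting a mode through the config (the delta value is illustrative):

config = {
    'optimization': 'compression',  # or 'latency' / 'quantization'
    'delta': 1                      # acceptable performance drop, see below
}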

Note

The default behavior is compression. Currently, the quantization mode is available only for the Production version of Deeplite Neutrino. Refer to how to upgrade.

device

Whether to use GPU or CPU for the optimization process. This is typically the same machine you would use to train your model. For modern deep learning and computer vision models/datasets, we recommend using a GPU. Keep in mind that ‘device’ does NOT dictate the device you deploy your model on for inference. Once you start a job, it is not possible to switch from CPU to GPU after the engine has been initialized on CPU.

use_horovod

Activates distributed training through Horovod. Please read Running on multi-gpu on a single machine for more information. Neutrino will linearly scale the learning rate by the number of GPUs.

Important

Currently, the multi-GPU support is available only for the Production version of Deeplite Neutrino. Refer to how to upgrade.

eval_key

Name of the evaluation metric the engine listens to while optimizing for delta (e.g. ‘accuracy’, ‘mAP’). More details can be found in Types of Tasks, and in Going Deeper with Neutrino for creating a customized evaluation function.

from deeplite.torch_profiler.torch_inference import TorchEvaluationFunction

class EvalAccuracy(TorchEvaluationFunction):
    def _compute_inference(self, model, data_loader, **kwargs):
        total_acc = ...foo accuracy calculation...
        return {'accuracy': 100. * (total_acc / float(len(data_loader)))}

eval_key = 'accuracy' # matches with the dictionary key returned by EvalAccuracy()
optimized_model = Neutrino(eval_func=EvalAccuracy(),
                           ...foo other arguments...)

eval_split

Name of the key in the data_splits dictionary on which to run the evaluation function and fetch the evaluation metric.

data_splits = {'train': foo_trainloader,
               'test': foo_testloader}

eval_split = 'test' # matches with the dictionary key of data_splits for validation dataset
optimized_model = Neutrino(data=data_splits,
                           ...foo other arguments...)

Compression Configuration

The compression optimization mode makes use of the following config parameters:

delta

The acceptable performance drop for your model. Delta must be on the same scale as your performance metric. For example, use a delta between 0 and 1.0 if your performance metric is between 0 and 1.0 (e.g. your model has 0.758 mAP), or a delta between 0 and 100 if your performance metric is between 0 and 100 (e.g. 78% Top1 accuracy).
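
As an illustrative sketch, the same tolerance expressed on two different metric scales:

# metric reported on a 0-100 scale (e.g. 78% Top1 accuracy)
config = {'delta': 1}     # tolerate up to a 1-point accuracy drop

# metric reported on a 0-1.0 scale (e.g. 0.758 mAP)
config = {'delta': 0.01}  # tolerate up to a 0.01 mAP drop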

level

The engine has two levels of optimization that control how much computing resources you want to allocate to the process: Level 1 and Level 2. By default, the engine runs at level 1. Please note that level 2 may take roughly twice as long to complete as level 1, but it will produce a more compressed result. Currently, the engine only supports level 1 for object detection tasks.

deepsearch

In conjunction with levels, it is possible to use the deepsearch flag. This is a powerful feature that will produce even more optimized results. It activates a finer-grained optimization search that consumes as much of the allotted delta as possible; however, it makes the optimization process longer.
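
Putting the compression parameters together, a minimal sketch of a compression-mode config (values are illustrative):

config = {
    'optimization': 'compression',
    'delta': 1,          # acceptable drop, on the same scale as the metric
    'level': 2,          # allocate more compute for a more compressed result
    'deepsearch': True   # finer-grained search; longer optimization time
}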

Latency Configuration

The latency optimization mode makes use of the delta parameter in the same way as the compression mode.
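
A corresponding sketch for latency mode (the delta value is illustrative):

config = {
    'optimization': 'latency',
    'delta': 1   # acceptable performance drop, on the same scale as the metric
}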

Quantization Configuration

The quantization optimization mode is activated by adding the key 'custom_compression' to the config dictionary, with a nested dictionary defining the quantization parameters. There are two methods for configuring quantization: rules-based quantization with quantization_args, or a layerwise configuration with layers.

quantization_args

Passing a dictionary under the key 'quantization_args' activates rules-based model quantization. Its parameters control which layers of the network are quantized to ultra-low precision.

'quantize_conv11': bool, default=False. Activates quantization of pointwise (1x1) convolution layers

'skip_layers': list[int], default=None. Skips quantization of layers with the given indices. Layers are indexed by traversal order of the model's computational graph

'skip_layers_ratio': float in range [0.0, 1.0], default=0.0. Skips quantization of the first skip_layers_ratio * n_layers layers

config = {
    'custom_compression': {
        'quantization_args': {
            'quantize_conv11': True,
            'skip_layers': [8, 9, 10],
            'skip_layers_ratio': 0.1
        }
    }
}

layers

Passing a dictionary under the key 'layers' enables a layerwise quantization configuration. Any layer not specified in the dictionary will remain at FP32 precision. The layer names are defined by the underlying framework; for torch this corresponds to the names returned by model.named_modules().

This custom_compression dictionary is formatted as follows:

config = {
    'custom_compression': {
        'layers': {
            'model.block.0.conv1': {
                'precision': 2
            },
            'model.block.1.conv2': {
                'precision': 2
            },
            # ...
        }
    }
}

Export

A dictionary with the desired export format(s). By default, the optimized models will be exported in the Neutrino pickle format. Additionally, we support other export formats, including PyTorch TorchScript, ONNX, TensorFlow Lite (TFLite), and dlrt. The optimized model can be exported to more than one format: ['onnx', 'jit', 'tflite', 'dlrt']. Quantized models are only exported to our proprietary dlrt format. You can also specify a customized path for the exported model file.

Important

Currently, exporting to jit, onnx, and dlrt is supported by default in Neutrino. If you would like to use tflite export, additionally install the converter: pip install deeplite-model-converter[all]

'export': {
    'format': ['onnx'],
    'kwargs': {
        'root_path': <your_dir>,
        'precision': 'fp32',  # ('fp32' or 'fp16'), only for onnx, dlrt formats
        'resolutions': [(32, 32), (36, 36)]  # list of tuples, only for onnx, dlrt formats
    }
}

ONNX/DLRT Export Options

resolutions

By default, the onnx model is exported with both a dynamic input resolution and a fixed input resolution matching the training dataset resolution. If you wish to deploy the model with a different input resolution, you can specify the desired resolution(s) as shown in the export example.

precision

Set the ‘precision’ keyword argument to ‘fp16’ if you want the engine to export the optimized model in FP16. Please note that some operations need FP32 and onnx cannot convert them to FP16. Currently, this option is only available for classification tasks and the onnx export format.

Output

The Python object of the optimized model is returned by the Neutrino.run() function call. The following output is obtained when the export format is provided as ['onnx', 'jit']. The engine exports the reference model in FP32 and the optimized model in FP32 or FP16 (see precision) in onnx format, with both a dynamic input resolution and a fixed input resolution. The dynamic-input model is also exported to PyTorch jit (TorchScript) format and to the proprietary Neutrino pickle format, as follows:

Reference Model has been exported to Neutrino pickle format: /WORKING_DIR/ref_model.pkl
Reference Model has been exported to pytorch jit format: /WORKING_DIR/ref_model_jit.pt
Reference Model has been exported to onnx format: /WORKING_DIR/ref_modelfp32_dynamic_shape.onnx
Reference Model, fixed input resolution, exported to onnx format: /WORKING_DIR/ref_model32x32fp32.onnx
Optimized Model has been exported to Neutrino pickle format: /WORKING_DIR/opt_model.pkl
Optimized Model has been exported to pytorch jit format: /WORKING_DIR/opt_model_jit.pt
Optimized Model has been exported to onnx format: /WORKING_DIR/opt_modelfp32_dynamic_shape.onnx
Optimized Model, fixed input resolution, exported to onnx format: /WORKING_DIR/opt_model32x32fp32.onnx
OR
Optimized Model has been exported to onnx format: /WORKING_DIR/opt_modelfp16_dynamic_shape.onnx (if fp16 is enabled)
Optimized Model, fixed input resolution, exported to onnx format: /WORKING_DIR/opt_model32x32fp16.onnx (if fp16 is enabled)

Important

For classification models, the community version returns the second best opt_model at the end of the optimization process. Consider upgrading to the production version to obtain the most optimized model produced by Deeplite Neutrino. Refer to how to upgrade.

Important

For object detection and segmentation models, the community version displays the results of the optimization process, including all the optimized metric values. To obtain the optimized model produced by Deeplite Neutrino, consider upgrading to the production version. Refer to how to upgrade.

Neutrino Pickle Format

Neutrino saves both the provided reference model and the optimized model to disk in an encrypted, proprietary pickle format. These will be available at the following paths: /WORKING_DIR/ref_model.pkl and /WORKING_DIR/opt_model.pkl. You can load the Neutrino pickle format using our custom load function, as follows:

from neutrino.framework.torch_framework import TorchFramework
from neutrino.job import Neutrino

# instantiate the original (reference) model architecture
original_model = TheModelClass(*args, **kwargs)

# load Neutrino pickle format model
pytorch_optimized_model = Neutrino.load_from_pickle(TorchFramework(),
                                                    '/WORKING_DIR/opt_model.pkl',
                                                    original_model)

The Neutrino.load_from_pickle function will load the model in pickle format and return a PyTorch native object. This model can be used for further processing with Neutrino, for profiling with the Deeplite Profiler, or for any downstream application.
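
As a minimal sketch, the returned object behaves like any other PyTorch module; the 32x32 RGB input below assumes a CIFAR-style model and is illustrative:

import torch

pytorch_optimized_model.eval()
with torch.no_grad():
    dummy_input = torch.randn(1, 3, 32, 32)  # match your model's expected input shape
    output = pytorch_optimized_model(dummy_input)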

Running a Job

Finally, call the run function of the Neutrino class to start the optimization process.

from neutrino.framework.torch_framework import TorchFramework
from neutrino.job import Neutrino
config = {
    'deepsearch': args.deepsearch, #(boolean), (default = False)
    'delta': args.delta, # (between 0 and 100), (default = 1)
    'device': args.device, # 'GPU' or 'CPU' (default = 'GPU')
    'use_horovod': args.horovod, #(boolean), (default = False)
    'level': args.level, # int {1, 2}, (default = 1)
    'export':{'format': ['onnx'], # ['onnx', 'jit', 'tflite'] (default = None)
              'kwargs': {'precision': precision}, # ('fp16' or 'fp32') (default = 'fp32')
             }
}

data_splits = {'train': trainloader,
               'test': testloader}

reference_model = TheModelClass(*args, **kwargs)
reference_model.load_state_dict(torch.load(PATH))

opt_model = Neutrino(framework=TorchFramework(),
                     data=data_splits,
                     model=reference_model,
                     config=config).run(dryrun=args.dryrun) #dryrun is boolean and it is False by default

Note

It is recommended to run the engine in dryrun mode first to check that everything runs properly on your machine. Dryrun forces the engine to run to the end without performing any heavy, time-consuming computation.

Types of Tasks

By default, Neutrino is wired for optimizing a classification task that has a fairly simple setup. This imposes tight constraints on the assumed structure of how tensors flow from the data loader, to the model, to the loss function, and to the evaluation. For example, the classification task assumes the loss is CrossEntropy, the evaluation is GetAccuracy, and the eval_key in the config is ‘accuracy’. For more details, and for how to use Neutrino on more intricate tasks, please read Going Deeper with Neutrino.

Performance Considerations

Important

The optimization process may take several hours depending on the model complexity, constraints and dataset.

  • Tighter constraints make the optimization process harder. For instance, it is harder to find a good optimized model with delta=1% compared to delta=5%. This is due to the nature of the optimization process: there are fewer possible solutions under tighter constraints, so the engine needs more time to explore and find them.

  • Dataset size also impacts the optimization time. High-resolution images or large datasets may slow down the optimization process.

  • The number of classes in the dataset can impact the optimization process. With more classes, more of the network’s capacity is needed to learn, which means less opportunity to shrink the network.

  • Model complexity also impacts the optimization time.

Environment Variables

Optional environment variables that can be set to configure the Neutrino engine.

  • NEUTRINO_HOME - The absolute path to the directory where the engine stores its data (such as checkpoints, logs, etc.) [default=~/.neutrino]

  • NEUTRINO_LICENSE - Contains the license key.

  • NEUTRINO_LICENSE_FILE - The absolute path where the license file can be found.
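
As a sketch, these variables can also be set from Python before the engine is imported or initialized (the paths below are illustrative):

import os

os.environ['NEUTRINO_HOME'] = os.path.expanduser('~/.neutrino')        # engine data directory
os.environ['NEUTRINO_LICENSE_FILE'] = '/path/to/neutrino_license.lic'  # illustrative path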

Code Examples

To make it quick and easy for you to test Neutrino, we provide some pre-defined scenarios. It is recommended to run the example code on different pre-defined models/datasets to ensure the engine works on your machine before you optimize your custom model/dataset.