Barracuda: AI Network Inference
Unity Labs works on advancing state-of-the-art research and has developed an efficient neural network inference engine: Barracuda. Deep learning has long been confined to supercomputers and offline computation, but with steadily growing compute power, real-time use on consumer hardware is becoming practical fast. With Barracuda, Unity Labs hopes to put this capability into creators' hands quickly. Thanks to ML-Agents, neural networks are already used for some AI applications in game development, but many more applications remain to be demonstrated inside a real-time game engine, for example deep learning super sampling, ambient occlusion, global illumination, and style transfer.
Effects built on Barracuda that can currently be found online include image style transfer (similar to a post-processing effect), face recognition, facial capture, and gesture recognition; see the related pages for more examples.
1. ONNX
1.1 Overview
The Open Neural Network Exchange (ONNX) format is a standard for representing deep learning models, allowing models to be transferred between different frameworks.
ONNX is an open file format designed for machine learning and used to store trained models. It lets different AI frameworks (such as PyTorch and MXNet) store and exchange model data in the same format. The ONNX specification and code are developed jointly by companies including Microsoft, Amazon, Facebook, and IBM, and are hosted as open source on GitHub.
Deep learning frameworks that officially support loading ONNX models for inference currently include Caffe2, PyTorch, MXNet, ML.NET, TensorRT, and Microsoft CNTK; TensorFlow also supports ONNX unofficially. For example, exporting an ONNX model from PyTorch:
import torch

# model, input1, input2, input_names and output_names are assumed to be
# defined earlier in the training script.
torch.onnx.export(
    model,                      # the trained torch.nn.Module
    (input1, input2),           # example inputs used to trace the graph
    "./trynet.onnx",            # output file path
    verbose=True,
    input_names=input_names,    # names assigned to the graph inputs
    output_names=output_names   # names assigned to the graph outputs
)
Some of the information in an ONNX file can be visualized, for example the file format version (ONNX v7) and the exporter (PyTorch 1.10); all of this metadata is stored inside the ONNX file itself. Below is a model opened with Netron:
Internally, an ONNX file records its node information as Protobuf. Loading an ONNX file yields a ModelProto, which contains version information, producer information, and a GraphProto. The GraphProto in turn contains four repeated arrays: node (of type NodeProto), input (of type ValueInfoProto), output (of type ValueInfoProto), and initializer (of type TensorProto). node holds all of the model's compute nodes, input holds the model's input nodes, output holds all of its output nodes, and initializer holds all of its weight parameters.
- ModelProto
  - GraphProto
    - NodeProto
      - AttributeProto
    - ValueInfoProto
    - TensorProto
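The hierarchy above can be illustrated with a minimal pure-Python sketch. The field names follow the real ONNX schema, but the classes here are hand-written stand-ins for illustration, not the actual `onnx` package:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AttributeProto:     # an operator attribute, e.g. kernel_shape
    name: str
    value: object

@dataclass
class NodeProto:          # one compute node, e.g. a Conv or Relu
    op_type: str
    input: List[str]
    output: List[str]
    attribute: List[AttributeProto] = field(default_factory=list)

@dataclass
class ValueInfoProto:     # a graph input/output: name plus shape
    name: str
    shape: List[int]

@dataclass
class TensorProto:        # a weight tensor stored in initializer
    name: str
    dims: List[int]

@dataclass
class GraphProto:         # the four repeated arrays described above
    node: List[NodeProto] = field(default_factory=list)
    input: List[ValueInfoProto] = field(default_factory=list)
    output: List[ValueInfoProto] = field(default_factory=list)
    initializer: List[TensorProto] = field(default_factory=list)

@dataclass
class ModelProto:         # top level: version + producer + graph
    ir_version: int
    producer_name: str
    graph: GraphProto

# A hand-built toy model: one Conv node with one weight tensor.
model = ModelProto(
    ir_version=7,
    producer_name="pytorch",
    graph=GraphProto(
        node=[NodeProto("Conv", ["x", "w"], ["y"])],
        input=[ValueInfoProto("x", [1, 3, 224, 224])],
        output=[ValueInfoProto("y", [1, 16, 222, 222])],
        initializer=[TensorProto("w", [16, 3, 3, 3])],
    ),
)
print(model.producer_name, model.graph.node[0].op_type)  # pytorch Conv
```

Walking `model.graph.node` in topological order is essentially what an ONNX importer such as Barracuda's does when it rebuilds the network.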
2.1 How Barracuda Handles ONNX
Barracuda has built-in ONNX parsing and can preview model information. When a model file with a .onnx or .nn extension is selected in Unity, the ONNX conversion interface is first called to parse the model's internal structure, and the ONNXModelImporterEditor class then draws the EditorGUI.
This inspector view clearly shows the network's inputs, outputs, version, memory footprint, and other information.
Operator Extension
Going through Barracuda's operator support, the mainstream layers such as Dense, Conv, Upsample, MaxPool, Normalization, and LSTM are all covered. The full support list is as follows:
public enum Type
{
/// <summary>
/// Dense layer
/// </summary>
Dense = 1,
/// <summary>
/// Matrix multiplication layer
/// </summary>
MatMul = 2,
/// <summary>
/// Rank-3 Dense Layer
/// </summary>
Dense3 = 3,
/// <summary>
/// 2D Convolution layer
/// </summary>
Conv2D = 20,
/// <summary>
/// Depthwise Convolution layer
/// </summary>
DepthwiseConv2D = 21,
/// <summary>
/// Transpose 2D Convolution layer
/// </summary>
Conv2DTrans = 22,
/// <summary>
/// Upsampling layer
/// </summary>
Upsample2D = 23,
/// <summary>
/// Max Pool layer
/// </summary>
MaxPool2D = 25,
/// <summary>
/// Average Pool layer
/// </summary>
AvgPool2D = 26,
/// <summary>
/// Global Max Pool layer
/// </summary>
GlobalMaxPool2D = 27,
/// <summary>
/// Global Average Pool layer
/// </summary>
GlobalAvgPool2D = 28,
/// <summary>
/// Border / Padding layer
/// </summary>
Border2D = 29,
/// <summary>
/// 3D Convolution layer
/// </summary>
Conv3D = 30,
/// <summary>
/// Transpose 3D Convolution layer (not yet implemented)
/// </summary>
Conv3DTrans = 32, // TODO: NOT IMPLEMENTED
/// <summary>
/// 3D Upsampling layer
/// </summary>
Upsample3D = 33,
/// <summary>
/// 3D Max Pool layer (not yet implemented)
/// </summary>
MaxPool3D = 35, // TODO: NOT IMPLEMENTED
/// <summary>
/// 3D Average Pool layer (not yet implemented)
/// </summary>
AvgPool3D = 36, // TODO: NOT IMPLEMENTED
/// <summary>
/// 3D Global Max Pool layer (not yet implemented)
/// </summary>
GlobalMaxPool3D = 37, // TODO: NOT IMPLEMENTED
/// <summary>
/// 3D Global Average Pool layer (not yet implemented)
/// </summary>
GlobalAvgPool3D = 38, // TODO: NOT IMPLEMENTED
/// <summary>
/// 3D Border / Padding layer
/// </summary>
Border3D = 39,
/// <summary>
/// Activation layer, see `Activation` enum for activation types
/// </summary>
Activation = 50,
/// <summary>
/// Scale + Bias layer
/// </summary>
ScaleBias = 51,
/// <summary>
/// Normalization layer
/// </summary>
Normalization = 52,
/// <summary>
/// LRN (Local Response Normalization) layer
/// </summary>
LRN = 53,
/// <summary>
/// Dropout layer (does nothing in inference)
/// </summary>
Dropout = 60,
/// <summary>
/// Random sampling from normal distribution layer
/// </summary>
RandomNormal = 64,
/// <summary>
/// Random sampling from uniform distribution layer
/// </summary>
RandomUniform = 65,
/// <summary>
/// Random sampling from multinomial distribution layer
/// </summary>
Multinomial = 66,
/// <summary>
/// OneHot layer
/// </summary>
OneHot = 67,
/// <summary>
/// TopK indices layer
/// </summary>
TopKIndices = 68,
/// <summary>
/// TopK values layer
/// </summary>
TopKValues = 69,
/// <summary>
/// NonZero layer
/// </summary>
NonZero = 70,
/// <summary>
/// Range layer
/// </summary>
Range = 71,
/// <summary>
/// Addition layer
/// </summary>
Add = 100,
/// <summary>
/// Subtraction layer
/// </summary>
Sub = 101,
/// <summary>
/// Multiplication layer
/// </summary>
Mul = 102,
/// <summary>
/// Division layer
/// </summary>
Div = 103,
/// <summary>
/// Power layer
/// </summary>
Pow = 104,
/// <summary>
/// Min layer
/// </summary>
Min = 110,
/// <summary>
/// Max layer
/// </summary>
Max = 111,
/// <summary>
/// Mean layer
/// </summary>
Mean = 112,
/// <summary>
/// Reduce L1 layer (not yet implemented)
/// </summary>
ReduceL1 = 120, // TODO: NOT IMPLEMENTED
/// <summary>
/// Reduce L2 layer (not yet implemented)
/// </summary>
ReduceL2 = 121, // TODO: NOT IMPLEMENTED
/// <summary>
/// Reduce LogSum layer (not yet implemented)
/// </summary>
ReduceLogSum = 122, // TODO: NOT IMPLEMENTED
/// <summary>
/// Reduce LogSumExp layer (not yet implemented)
/// </summary>
ReduceLogSumExp = 123, // TODO: NOT IMPLEMENTED
/// <summary>
/// Reduce with Max layer
/// </summary>
ReduceMax = 124,
/// <summary>
/// Reduce with Mean layer
/// </summary>
ReduceMean = 125,
/// <summary>
/// Reduce with Min layer
/// </summary>
ReduceMin = 126,
/// <summary>
/// Reduce with Prod layer
/// </summary>
ReduceProd = 127,
/// <summary>
/// Reduce with Sum layer
/// </summary>
ReduceSum = 128,
/// <summary>
/// Reduce with SumSquare layer (not yet implemented)
/// </summary>
ReduceSumSquare = 129, // TODO: NOT IMPLEMENTED
/// <summary>
/// Logic operation: Greater layer
/// </summary>
Greater = 140,
/// <summary>
/// Logic operation: GreaterEqual layer
/// </summary>
GreaterEqual = 141,
/// <summary>
/// Logic operation: Less layer
/// </summary>
Less = 142,
/// <summary>
/// Logic operation: LessEqual layer
/// </summary>
LessEqual = 143,
/// <summary>
/// Logic operation: Equal layer
/// </summary>
Equal = 144,
/// <summary>
/// Logic operation: LogicalOr layer
/// </summary>
LogicalOr = 145,
/// <summary>
/// Logic operation: LogicalAnd layer
/// </summary>
LogicalAnd = 146,
/// <summary>
/// Logic operation: LogicalNot layer
/// </summary>
LogicalNot = 147,
/// <summary>
/// Logic operation: LogicalXor layer
/// </summary>
LogicalXor = 148,
/// <summary>
/// Logic operation: Where layer
/// </summary>
Where = 149,
/// <summary>
/// Logic operation: Sign layer
/// </summary>
Sign = 150,
/// <summary>
/// Reflection padding layer
/// </summary>
Pad2DReflect = 160,
/// <summary>
/// Symmetric padding layer
/// </summary>
Pad2DSymmetric = 161,
/// <summary>
/// Edge padding layer
/// </summary>
Pad2DEdge = 162,
/// <summary>
/// ArgMax layer
/// </summary>
ArgMax = 163,
/// <summary>
/// ArgMin layer
/// </summary>
ArgMin = 164,
/// <summary>
/// ConstantOfShape layer
/// </summary>
ConstantOfShape = 199,
/// <summary>
/// Flatten layer
/// </summary>
Flatten = 200,
/// <summary>
/// Reshape layer
/// </summary>
Reshape = 201,
/// <summary>
/// Transpose layer
/// </summary>
Transpose = 202,
/// <summary>
/// Squeeze layer (not fully supported)
/// </summary>
Squeeze = 203, // TODO: NOT IMPLEMENTED
/// <summary>
/// Unsqueeze layer (not fully supported)
/// </summary>
Unsqueeze = 204, // TODO: NOT IMPLEMENTED
/// <summary>
/// Gather layer
/// </summary>
Gather = 205,
/// <summary>
/// Depth to space layer
/// </summary>
DepthToSpace = 206,
/// <summary>
/// Space to depth layer
/// </summary>
SpaceToDepth = 207,
/// <summary>
/// Expand layer
/// </summary>
Expand = 208,
/// <summary>
/// 2D Resample layer
/// </summary>
Resample2D = 209,
/// <summary>
/// Concat layer
/// </summary>
Concat = 210,
/// <summary>
/// Strided slice layer
/// </summary>
StridedSlice = 211,
/// <summary>
/// Tile layer
/// </summary>
Tile = 212,
/// <summary>
/// Shape layer
/// </summary>
Shape = 213,
/// <summary>
/// Non max suppression layer
/// </summary>
NonMaxSuppression = 214,
/// <summary>
/// LSTM
/// </summary>
LSTM = 215,
/// <summary>
/// Constant load layer (for internal use)
/// </summary>
Load = 255
}
Barracuda provides multiple implementations of each operator, some CPU-based and some GPU-based; the CPU side also has Burst-compiled versions, while the GPU side is implemented with compute shaders. Take the Conv2D operator as an example: nine implementations in total can be found here:
Note: VerboseOps is not an implementation; it only prints layer information, such as the Weights and bias parameters, presumably to make debugging easier.
Nor is every enum value implemented, for example ReduceL1, ReduceL2, and ReduceLogSum:
if (l.type == Layer.Type.ReduceL1 ||
    l.type == Layer.Type.ReduceL2 ||
    l.type == Layer.Type.ReduceLogSum ||
    l.type == Layer.Type.ReduceLogSumExp ||
    l.type == Layer.Type.ReduceSumSquare)
{
    throw new NotImplementedException("This reduction operation is not implemented yet!");
}
To extend Barracuda with a new operator, define the corresponding interface in IOps.cs and implement it in the relevant classes: to use Burst, implement the interface in the BurstCPUOps class; to run on the GPU, implement it in the ComputeOps class. Then add the matching enum value and, in the GenericWorker class, create the operator's Layer based on that enum value.
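The dispatch pattern described above can be sketched in a few lines of Python. The real interface is IOps in C#; the class and method names below are simplified stand-ins for illustration only:

```python
from abc import ABC, abstractmethod

# Simplified stand-in for Barracuda's IOps interface: one method per operator.
class IOps(ABC):
    @abstractmethod
    def Conv2D(self, x, w):
        ...

# CPU backend (stand-in for the Burst-compiled BurstCPUOps).
class BurstCPUOps(IOps):
    def Conv2D(self, x, w):
        return ("cpu", x, w)

# GPU backend (stand-in for ComputeOps, which uses compute shaders).
class ComputeOps(IOps):
    def Conv2D(self, x, w):
        return ("gpu", x, w)

# Stand-in for GenericWorker: given a backend, execute each layer by
# switching on its type and calling the matching IOps method.
def execute(layers, ops: IOps):
    results = []
    for layer_type, args in layers:
        if layer_type == "Conv2D":
            results.append(ops.Conv2D(*args))
        else:
            raise NotImplementedError(layer_type)
    return results

print(execute([("Conv2D", ("x", "w"))], BurstCPUOps()))  # [('cpu', 'x', 'w')]
```

Adding an operator then means touching exactly the places listed above: a new method on the interface, an implementation per backend, and a new branch in the worker's dispatch.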
The Network Model
Barracuda is best suited to processing images. A model's input is usually a RenderTexture or a RenderTexture array, and Barracuda provides an interface for converting between render textures and tensors.
Executing the Network
Create a worker with WorkerFactory, then call Execute to run the network:
model.layers = layerList;
Model.Input input = model.inputs[1];
input.shape[0] = 0;
input.shape[1] = 1080; // TODO: get framebuffer size rather than hardcoded value
input.shape[2] = 1920;
input.shape[3] = 3;
model.inputs = new List<Model.Input> { model.inputs[1] };

// Create the worker; the model contains all of the layer parameters
worker = WorkerFactory.CreateWorker(WorkerFactory.ValidateType(internalSetup.workerType), model, verbose);

Dictionary<string, Tensor> temp = new Dictionary<string, Tensor>();
var inputTensor = new Tensor(input.shape, input.name);
temp.Add("frame", inputTensor);

// Run the network
worker.Execute(temp);
Fetching the results:
If you know a layer's name, you can fetch the result of every executed layer, including the output of the network's final layer:
var tensors = worker.PeekConstants(layerNameToPatch[i]);
What you get here is a Tensor; to use it for display, it still has to be converted back into a Texture/RenderTexture.
Closing Remarks
Barracuda can be thought of as Unity's version of a framework like TensorFlow or PyTorch, and because the Unity engine is cross-platform, it naturally supports many targets, especially mobile platforms (Android/iOS). One difference from TensorFlow and similar frameworks, however, is that TensorFlow maintains a session and runs the entire network on the GPU, whereas Barracuda involves CPU-GPU interaction at every layer, which reduces runtime efficiency to some degree.