
Gluon HybridBlocks Walk-Through

https://gluon.mxnet.io/chapter07_distributed-learning/hybridize.html

The tutorials we've seen so far adopt the imperative, or define-by-run, programming paradigm. It might not even occur to you to give this style of programming a name, because it's how we always write Python programs.

Take for example a prototypical program written below in pseudo-Python. We grab some input arrays, we compute upon them to produce some intermediate values, and finally we produce the result that we actually care about.

def our_function(A, B, C, D):
    # Compute some intermediate values
    E = basic_function1(A, B)
    F = basic_function2(C, D)

    # Finally, produce the thing you really care about
    G = basic_function3(E, F)
    return G

# Load up some data
W = some_stuff()
X = some_stuff()
Y = some_stuff()
Z = some_stuff()

result = our_function(W, X, Y, Z)

As you might expect, when we compute E we're actually performing some numerical operation, like multiplication, and returning an array that we assign to the variable E. The same goes for F. And if we want to perform this computation many times by putting these lines in a function, each call will still step through these three lines of Python.

The advantage of this approach is that it's so natural it might not even occur to some people that there is another way. But the disadvantage is that it's slow.

That’s because we are constantly engaging the Python execution environment (which is slow) even though our entire function performs the same three low-level operations in the same sequence every time.

It also holds on to the intermediate values E and F until the function returns, even though we can see they're not needed after G is computed. We could have made this program more memory-efficient by reusing the memory of E or F to store the result G.

There actually is a different way to do things. It's called symbolic programming, and most of the early deep learning libraries, including Theano and TensorFlow, embraced this approach exclusively. You might also have heard it referred to as declarative programming or define-then-run programming. These all mean the same thing. The approach consists of three basic steps:

  • Define a computation workflow, like a pass through a neural network, using placeholder data

  • Compile the program into a format independent of the front-end language, e.g. a Python-independent format

  • Invoke the compiled function, feeding it real data

Revisiting our previous pseudo-Python example, a symbolic version of the same program might look something like this:

# Create some placeholders to stand in for real data that might be supplied to the compiled function.
A = placeholder()
B = placeholder()
C = placeholder()
D = placeholder()

# Compute some intermediate values
E = symbolic_function1(A, B)
F = symbolic_function2(C, D)

# Finally, produce the thing you really care about
G = symbolic_function3(E, F)

our_function = library.compile(inputs=[A, B, C, D], outputs=[G])

# Load up some data
W = some_stuff()
X = some_stuff()
Y = some_stuff()
Z = some_stuff()

result = our_function(W, X, Y, Z)

Here, when we run the line E = symbolic_function1(A, B), no numerical computation actually happens. Instead, the symbolic library notes the way that E is related to A and B and records this information. We don't do any actual computation; we just make a roadmap for how to go from inputs to outputs. Because we can draw all of the variables and operations (both inputs and intermediate values) as nodes, and the relationships between nodes as edges, we call the resulting roadmap a computational graph. In the symbolic approach, we first define the entire graph and then compile it.

Imperative Programs Tend to be More Flexible

When you're using an imperative-style library from Python, you are writing in Python. Nearly anything that would be intuitive to write in Python you can accelerate by calling down to the imperative deep learning library in the appropriate places. On the other hand, when you write a symbolic program, you may not have access to all the familiar Python constructs, like iteration. Imperative programs are also much easier to debug: because all the intermediate values hang around, it's easy to introspect the program later, and we can simply stick print statements between operations.
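
For instance, here is a toy sketch (using mxnet.ndarray; the variable names are made up) of mixing native Python control flow and printing with array operations:

from mxnet import nd

x = nd.random_normal(shape=(4,))
for step in range(3):              # native Python loop
    x = x * 2
    if x.sum().asscalar() < 0:     # branch on a concrete intermediate value
        x = nd.relu(x)
    print('step', step, x)         # inspect any intermediate result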

In short, from a developer's standpoint, imperative programs are just better. They're a joy to work with: you don't have the tricky indirection of working with placeholders, you can do anything that you can do with native Python, and faster debugging means you get to try out more ideas. But the catch is that imperative programs are comparatively slow.

Symbolic Programs Tend to be More Efficient

The main advantage of symbolic programs is efficiency, both in terms of memory and speed. Let's revisit our toy example from before. Consider the following program:

import numpy as np
a = np.ones(10)
b = np.ones(10) * 2
c = b * a
d = c + 1
...

Assume that each cell in the array occupies 8 bytes of memory. How much memory do we need to execute this program in the Python console? As an imperative program we need to allocate memory at each line, which leaves us allocating 4 arrays of size 10. So we'll need 4 * 10 * 8 = 320 bytes. On the other hand, if we built a computation graph and knew in advance that we only needed d, we could reuse the memory originally allocated for intermediate values. For example, by performing computations in-place, we might recycle the bits allocated for b to store c, and then recycle the bits allocated for c to store d. In the end, we could cut our memory requirement in half, requiring just 2 * 10 * 8 = 160 bytes.
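
As a rough illustration of the in-place idea (a NumPy sketch of what a graph engine could decide to do automatically; here we do it by hand):

import numpy as np

a = np.ones(10)
b = np.ones(10) * 2

np.multiply(b, a, out=b)   # b's buffer now holds c = b * a
np.add(b, 1, out=b)        # the same buffer now holds d = c + 1

# Only a's and b's buffers were ever allocated: 2 * 10 * 8 = 160 bytes.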

Symbolic programs can also perform another kind of optimization, called operation folding. Returning to our toy example, the multiplication and addition operations can be folded into a single operation. If the computation runs on a GPU, one GPU kernel will be executed instead of two. In fact, this is one way operations are hand-crafted in optimized libraries such as CXXNet and Caffe. Operation folding improves computation efficiency. Note that you can't perform operation folding in imperative programs, because the intermediate values might be referenced in the future. Operation folding is possible in symbolic programs because we get the entire computation graph in advance, before actually doing any calculation, giving us a clear specification of which values will be needed and which will not.
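
To make folding concrete, here is a sketch in plain Python: the unfolded version makes two passes over the data and materializes the intermediate c, while the folded version computes b * a + 1 in a single pass (in a real engine this would be one fused kernel, not a Python loop):

import numpy as np

a = np.ones(10)
b = np.ones(10) * 2

# Unfolded: two passes, one intermediate array.
c = b * a
d = c + 1

# Folded sketch: a single pass with no intermediate array.
d_folded = np.empty_like(a)
for i in range(len(a)):
    d_folded[i] = b[i] * a[i] + 1

assert np.allclose(d, d_folded)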

Getting the best of both worlds with MXNet Gluon’s HybridBlocks

Most libraries deal with the imperative / symbolic design problem by simply choosing a side. Theano and the frameworks it inspired, like TensorFlow, went the symbolic way. And because the first versions of MXNet optimized for performance, they also went symbolic. Chainer and its descendants, like PyTorch, are fully imperative. In designing MXNet Gluon, we asked the following question: is it possible to get all of the benefits of imperative programming but still exploit, whenever possible, the speed and memory efficiency of symbolic programming? In other words, a user should be able to use Gluon fully imperatively. And if they never want their lives to be more complicated, they can get on just fine imagining that the story ends there. But when a user needs production-level performance, it should be easy to compile the entire compute graph, or at least large subsets of it.

MXNet accomplishes this through the use of HybridBlocks. Each HybridBlock can run fully imperatively, defining its computation with real functions acting on real inputs. But HybridBlocks are also capable of running symbolically, acting on placeholders. Gluon hides most of this under the hood, so you'll only need to know how it works when you want to write your own layers. Given a HybridBlock whose forward computation consists of going through other HybridBlocks, you can compile that section of the network by calling the HybridBlock's .hybridize() method.

All of MXNet’s predefined layers are HybridBlocks. This means that any network consisting entirely of predefined MXNet layers can be compiled and run at much faster speeds by calling .hybridize().

HybridSequential

We already learned how to use Sequential to stack layers. The regular Sequential can be built from regular Blocks, so it too has to be a regular Block. However, when you want to build a network with a sequential structure and run it at crazy speeds, you can construct your network using HybridSequential instead. The functionality is the same as Sequential:

In [1]:
import mxnet as mx
from mxnet.gluon import nn
from mxnet import nd

def get_net():
    # construct a MLP
    net = nn.HybridSequential()
    with net.name_scope():
        net.add(nn.Dense(256, activation="relu"))
        net.add(nn.Dense(128, activation="relu"))
        net.add(nn.Dense(2))
    # initialize the parameters
    net.collect_params().initialize()
    return net

# forward
x = nd.random_normal(shape=(1, 512))
net = get_net()
print('=== net(x) ==={}'.format(net(x)))
=== net(x) ===
[[ 0.16526183 -0.14005636]]
<NDArray 1x2 @cpu(0)>

To compile and optimize the HybridSequential, we can then call its hybridize method. Only HybridBlocks, e.g. HybridSequential, can be compiled. But you can still call hybridize on a normal Block, and its HybridBlock children will be compiled instead (see the sketch after the cell below). We will talk more about HybridBlocks later.

In [2]:
net.hybridize()
print('=== net(x) ==={}'.format(net(x)))
=== net(x) ===
[[ 0.16526183 -0.14005636]]
<NDArray 1x2 @cpu(0)>
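
To illustrate that last point, here is a minimal sketch (the class and variable names are made up) of a regular Block whose HybridBlock child gets compiled when we hybridize the parent:

from mxnet import gluon, nd
from mxnet.gluon import nn

class MixedNet(gluon.Block):                    # a regular, non-hybrid Block
    def __init__(self, **kwargs):
        super(MixedNet, self).__init__(**kwargs)
        with self.name_scope():
            self.body = nn.HybridSequential()   # hybridizable child
            with self.body.name_scope():
                self.body.add(nn.Dense(64, activation="relu"))
                self.body.add(nn.Dense(2))

    def forward(self, x):
        # this Python code keeps running imperatively on every call
        return self.body(x)

mixed = MixedNet()
mixed.collect_params().initialize()
mixed.hybridize()   # MixedNet itself stays imperative; self.body gets compiled
y = mixed(nd.random_normal(shape=(1, 16)))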

Performance

To get a sense of the speedup from hybridizing, we can compare the performance before and after hybridizing by measuring in either case the time it takes to make 1000 forward passes through the network.

In [3]:
from time import time
def bench(net, x):
    mx.nd.waitall()
    start = time()
    for i in range(1000):
        y = net(x)
    mx.nd.waitall()
    return time() - start

net = get_net()
print('Before hybridizing: %.4f sec'%(bench(net, x)))
net.hybridize()
print('After hybridizing: %.4f sec'%(bench(net, x)))
Before hybridizing: 0.4646 sec
After hybridizing: 0.2424 sec

As you can see, hybridizing gives a significant performance boost, almost 2x the speed.

Get the symbolic program

Previously, we fed net with NDArray data x, and net(x) returned the forward result. Now if we feed it a Symbol placeholder, the corresponding symbolic program will be returned.

In [4]:
from mxnet import sym
x = sym.var('data')
print('=== input data holder ===')
print(x)

y = net(x)
print('\n=== the symbolic program of net===')
print(y)

y_json = y.tojson()
print('\n=== the according json definition===')
print(y_json)
=== input data holder ===
<Symbol data>

=== the symbolic program of net===
<Symbol hybridsequential1_dense2_fwd>

=== the according json definition===
{
  "nodes": [
    {
      "op": "null",
      "name": "data",
      "inputs": []
    },
    {
      "op": "null",
      "name": "hybridsequential1_dense0_weight",
      "attr": {
        "__dtype__": "0",
        "__lr_mult__": "1.0",
        "__shape__": "(256, 0)",
        "__wd_mult__": "1.0"
      },
      "inputs": []
    },
    {
      "op": "null",
      "name": "hybridsequential1_dense0_bias",
      "attr": {
        "__dtype__": "0",
        "__init__": "zeros",
        "__lr_mult__": "1.0",
        "__shape__": "(256,)",
        "__wd_mult__": "1.0"
      },
      "inputs": []
    },
    {
      "op": "FullyConnected",
      "name": "hybridsequential1_dense0_fwd",
      "attr": {"num_hidden": "256"},
      "inputs": [[0, 0, 0], [1, 0, 0], [2, 0, 0]]
    },
    {
      "op": "Activation",
      "name": "hybridsequential1_dense0_relu_fwd",
      "attr": {"act_type": "relu"},
      "inputs": [[3, 0, 0]]
    },
    {
      "op": "null",
      "name": "hybridsequential1_dense1_weight",
      "attr": {
        "__dtype__": "0",
        "__lr_mult__": "1.0",
        "__shape__": "(128, 0)",
        "__wd_mult__": "1.0"
      },
      "inputs": []
    },
    {
      "op": "null",
      "name": "hybridsequential1_dense1_bias",
      "attr": {
        "__dtype__": "0",
        "__init__": "zeros",
        "__lr_mult__": "1.0",
        "__shape__": "(128,)",
        "__wd_mult__": "1.0"
      },
      "inputs": []
    },
    {
      "op": "FullyConnected",
      "name": "hybridsequential1_dense1_fwd",
      "attr": {"num_hidden": "128"},
      "inputs": [[4, 0, 0], [5, 0, 0], [6, 0, 0]]
    },
    {
      "op": "Activation",
      "name": "hybridsequential1_dense1_relu_fwd",
      "attr": {"act_type": "relu"},
      "inputs": [[7, 0, 0]]
    },
    {
      "op": "null",
      "name": "hybridsequential1_dense2_weight",
      "attr": {
        "__dtype__": "0",
        "__lr_mult__": "1.0",
        "__shape__": "(2, 0)",
        "__wd_mult__": "1.0"
      },
      "inputs": []
    },
    {
      "op": "null",
      "name": "hybridsequential1_dense2_bias",
      "attr": {
        "__dtype__": "0",
        "__init__": "zeros",
        "__lr_mult__": "1.0",
        "__shape__": "(2,)",
        "__wd_mult__": "1.0"
      },
      "inputs": []
    },
    {
      "op": "FullyConnected",
      "name": "hybridsequential1_dense2_fwd",
      "attr": {"num_hidden": "2"},
      "inputs": [[8, 0, 0], [9, 0, 0], [10, 0, 0]]
    }
  ],
  "arg_nodes": [0, 1, 2, 5, 6, 9, 10],
  "node_row_ptr": [
    0,
    1,
    2,
    3,
    4,
    5,
    6,
    7,
    8,
    9,
    10,
    11,
    12
  ],
  "heads": [[11, 0, 0]],
  "attrs": {"mxnet_version": ["int", 1001]}
}

Now we can save both the program and the parameters to disk, so that they can be loaded later not only in Python, but also in all other supported languages, such as C++, R, and Scala.

In [5]:
y.save('model.json')
net.save_params('model.params')
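
As a quick check (a sketch; mx.sym.load reads a symbol saved as JSON), the saved graph can be loaded back as a Symbol:

import mxnet as mx

loaded = mx.sym.load('model.json')
print(loaded)                    # expected: <Symbol hybridsequential1_dense2_fwd>
print(loaded.list_arguments())   # 'data' plus the weight and bias names above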

HybridBlock

Now let's dive deeper into how hybridize works. Remember that Gluon networks are composed of Blocks, each of which subclasses gluon.Block. With normal Blocks, we just need to define a forward function that takes an input x and computes the result of the forward pass through the network. MXNet can figure out the backward pass for us automatically with autograd.
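
For contrast with the HybridBlock defined below, a plain Block version of the same network might look like this (a sketch; nn and nd are the modules imported earlier):

from mxnet import gluon

class PlainNet(gluon.Block):
    def __init__(self, **kwargs):
        super(PlainNet, self).__init__(**kwargs)
        with self.name_scope():
            self.fc1 = nn.Dense(256)
            self.fc2 = nn.Dense(128)
            self.fc3 = nn.Dense(2)

    def forward(self, x):
        # plain imperative forward pass on NDArrays
        x = nd.relu(self.fc1(x))
        x = nd.relu(self.fc2(x))
        return self.fc3(x)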

To define a HybridBlock, we instead have a hybrid_forward function:

In [6]:
from mxnet import gluon

class Net(gluon.HybridBlock):
    def __init__(self, **kwargs):
        super(Net, self).__init__(**kwargs)
        with self.name_scope():
            self.fc1 = nn.Dense(256)
            self.fc2 = nn.Dense(128)
            self.fc3 = nn.Dense(2)

    def hybrid_forward(self, F, x):
        # F is a function space that depends on the type of x
        # If x's type is NDArray, then F will be mxnet.nd
        # If x's type is Symbol, then F will be mxnet.sym
        print('type(x): {}, F: {}'.format(
                type(x).__name__, F.__name__))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

The hybrid_forward function takes an additional input, F, which stands for a backend. This exploits one awesome feature of MXNet: it has both a symbolic API (mxnet.symbol) and an imperative API (mxnet.ndarray), and so far we've only focused on the latter. Owing to fortuitous historical reasons, the imperative and symbolic interfaces support roughly the same API. They have many of the same functions (currently about 90% overlap), and when they do, they support the same arguments in the same order. When we define hybrid_forward, we pass in F. When running in imperative mode, hybrid_forward is called with F as mxnet.ndarray and x as some NDArray input. When we compile with hybridize, F will be mxnet.symbol and x will be some placeholder or intermediate symbolic value. Once we call hybridize, the net is compiled, so we'll never need to call hybrid_forward again.

Let’s demonstrate how this all works by feeding some data through the network twice. We’ll do this for both a regular network and a hybridized net. You’ll see that in the first case, hybrid_forward is actually called twice.

In [7]:
net = Net()
net.collect_params().initialize()
x = nd.random_normal(shape=(1, 512))
print('=== 1st forward ===')
y = net(x)
print('=== 2nd forward ===')
y = net(x)
=== 1st forward ===
type(x): NDArray, F: mxnet.ndarray
=== 2nd forward ===
type(x): NDArray, F: mxnet.ndarray

Now run it again after hybridizing.

In [8]:
net.hybridize()
print('=== 1st forward ===')
y = net(x)
print('=== 2nd forward ===')
y = net(x)
=== 1st forward ===
type(x): Symbol, F: mxnet.symbol
=== 2nd forward ===

It differs from the previous execution in two aspects:

  1. The input data type is now Symbol, even though we fed an NDArray into net, because gluon implicitly constructed a symbolic data placeholder.

  2. hybrid_forward is called only once, the first time we run net(x). That's because gluon constructs the symbolic program on the first forward pass and then keeps it for reuse.

One main reason the network is faster after hybridizing is that we no longer need to repeatedly invoke the Python forward function; all computations stay within the highly efficient C++ backend engine.

But the potential drawback is the loss of flexibility when writing the forward function. In other words, inserting print statements for debugging, or Python control logic such as if and for, into the forward function is no longer possible.
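
For example, a hybrid_forward like the following (a deliberately broken sketch) works imperatively but fails once hybridized, because a Symbol is just a graph node and carries no concrete values to branch on:

class BadNet(gluon.HybridBlock):
    def __init__(self, **kwargs):
        super(BadNet, self).__init__(**kwargs)
        with self.name_scope():
            self.fc = nn.Dense(2)

    def hybrid_forward(self, F, x):
        # fine when x is an NDArray; breaks after hybridize(), since a
        # Symbol has no value for asscalar() to return
        if x.sum().asscalar() > 0:
            x = F.relu(x)
        return self.fc(x)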

Conclusion

Through HybridSequential and HybridBlock, we can convert an imperative program into a symbolic program by calling hybridize.
