
Commit cc28fb4

Merge remote-tracking branch 'upstream/develop' into mkldnn_pool
2 parents 971acff + b5a448f commit cc28fb4


67 files changed, +1630 −645 lines

doc/design/block.md

Lines changed: 338 additions & 0 deletions
@@ -0,0 +1,338 @@

# Design Doc: Block and Scope

## The Representation of Computation

Both deep learning systems and programming languages help users describe computation procedures. These systems use various representations of computation:

- Caffe, Torch, and Paddle: sequences of layers.
- TensorFlow, Caffe2, Mxnet: graphs of operators.
- PaddlePaddle: nested blocks, like C++ and Java programs.

## Block in Programming Languages and Deep Learning

In programming languages, a block is a pair of curly braces that includes local variable definitions and a sequence of instructions, or operators.

Blocks work with control flow structures like `if`, `else`, and `for`, which have equivalents in deep learning:

| programming languages | PaddlePaddle          |
|-----------------------|-----------------------|
| for, while loop       | RNN, WhileOp          |
| if, if-else, switch   | IfElseOp, SwitchOp    |
| sequential execution  | a sequence of layers  |

A key difference is that a C++ program describes a one-pass computation, whereas a deep learning program describes both the forward and backward passes.

## Stack Frames and the Scope Hierarchy

The existence of the backward pass makes the execution of a block differ between traditional programs and PaddlePaddle:

| programming languages  | PaddlePaddle                          |
|------------------------|---------------------------------------|
| stack                  | scope hierarchy                       |
| stack frame            | scope                                 |
| push at entering block | push at entering block                |
| pop at leaving block   | destroy when the minibatch completes  |

1. In traditional programs:

   - When the execution enters the left curly brace of a block, the runtime pushes a frame onto the stack, where it realizes local variables.
   - After the execution leaves the right curly brace, the runtime pops the frame.
   - The maximum number of frames in the stack is the maximum depth of nested blocks.

1. In PaddlePaddle:

   - When the execution enters a block, PaddlePaddle adds a new scope, where it realizes variables.
   - PaddlePaddle doesn't pop a scope after the execution of the block, because variables therein are to be used by the backward pass. So it has a stack forest known as a *scope hierarchy*.
   - The height of the highest tree is the maximum depth of nested blocks.
   - After processing a minibatch, PaddlePaddle destroys the scope hierarchy.
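
To make the scope hierarchy concrete, here is a minimal C++ sketch, not the actual PaddlePaddle `Scope` API: each block entry creates a child scope that keeps a pointer to its parent, and nothing is released until the whole hierarchy is destroyed after the minibatch.

```c++
#include <map>
#include <memory>
#include <string>
#include <vector>

struct Variable { /* payload omitted in this sketch */ };

class Scope {
 public:
  explicit Scope(Scope* parent = nullptr) : parent_(parent) {}

  // Create a child scope when execution enters a nested block.
  Scope* NewChildScope() {
    children_.emplace_back(new Scope(this));
    return children_.back().get();
  }

  // "Realize" a local variable in this scope.
  Variable* NewVar(const std::string& name) { return &vars_[name]; }

  // Look a variable up locally first, then in the enclosing scopes.
  Variable* FindVar(const std::string& name) {
    auto it = vars_.find(name);
    if (it != vars_.end()) return &it->second;
    return parent_ ? parent_->FindVar(name) : nullptr;
  }

 private:
  Scope* parent_;
  std::map<std::string, Variable> vars_;
  std::vector<std::unique_ptr<Scope>> children_;  // kept alive for the backward pass
};

int main() {
  Scope global;                              // scope of the global block
  Scope* step = global.NewChildScope();      // scope of a nested block, e.g. an RNN step
  step->NewVar("h");
  Variable* h = step->FindVar("h");          // still reachable for the backward pass
  (void)h;
  // The whole hierarchy is destroyed only when `global` goes out of scope,
  // i.e. after the minibatch completes.
  return 0;
}
```
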
## Use Blocks in C++ and PaddlePaddle Programs

Let us consolidate the discussion by presenting some examples.

### Blocks with `if-else` and `IfElseOp`

The following C++ program shows how blocks are used with the `if-else` structure:

```c++
int x = 10;
int y = 20;
int out;
bool cond = false;
if (cond) {
  int z = x + y;
  out = softmax(z);
} else {
  int z = fc(x);
  out = z;
}
```

An equivalent PaddlePaddle program from the design doc of the [IfElseOp operator](./if_else_op.md) is as follows:

```python
import paddle as pd

x = var(10)
y = var(20)
cond = var(false)
ie = pd.create_ifelseop(inputs=[x], output_num=1)
with ie.true_block():
    x = ie.inputs(true, 0)
    z = operator.add(x, y)
    ie.set_output(true, 0, operator.softmax(z))
with ie.false_block():
    x = ie.inputs(false, 0)
    z = layer.fc(x)
    ie.set_output(false, 0, z)
out = ie(cond)
```

In both examples, the left branch computes `softmax(x+y)` and the right branch computes `fc(x)`.

A difference is that variables in the C++ program contain scalar values, whereas those in the PaddlePaddle program are mini-batches of instances. The `ie.inputs(true, 0)` invocation returns the instances in the 0-th input, `x`, that correspond to true values in `cond` as the local variable `x`, whereas `ie.inputs(false, 0)` returns the instances corresponding to false values.
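
As a rough illustration of the gather semantics described above, and only a sketch rather than the real `IfElseOp` implementation, the following C++ snippet partitions a mini-batch by a boolean condition: rows whose `cond` entry is true feed the true block, and the rest feed the false block.

```c++
#include <cassert>
#include <vector>

struct Split {
  std::vector<float> true_part;   // instances fed to the true block
  std::vector<float> false_part;  // instances fed to the false block
};

// Partition the 0-th input by the boolean condition, row by row.
Split SplitByCond(const std::vector<float>& input, const std::vector<bool>& cond) {
  assert(input.size() == cond.size());
  Split out;
  for (size_t i = 0; i < input.size(); ++i) {
    (cond[i] ? out.true_part : out.false_part).push_back(input[i]);
  }
  return out;
}

int main() {
  std::vector<float> x = {1.f, 2.f, 3.f, 4.f};
  std::vector<bool> cond = {true, false, true, false};
  Split s = SplitByCond(x, cond);
  // s.true_part == {1, 3} feeds the true block; s.false_part == {2, 4} feeds the false block.
  return 0;
}
```
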
### Blocks with `for` and `RNNOp`

The following RNN model from the [RNN design doc](./rnn.md)

```python
x = sequence([10, 20, 30])
m = var(0)
W = tensor()
U = tensor()

rnn = create_rnn(inputs=[x])
with rnn.stepnet() as net:
    x = net.set_inputs(0)
    h = net.add_memory(init=m)
    fc_out = pd.matmul(W, x)
    hidden_out = pd.matmul(U, h.pre(n=1))
    sum = pd.add_two(fc_out, hidden_out)
    act = pd.sigmoid(sum)
    h.update(act)                        # update memory with act
    net.set_outputs(0, act, hidden_out)  # two outputs

o1, o2 = rnn()
print o1, o2
```

has its equivalent C++ program as follows:

```c++
int x[] = {10, 20, 30};
int m = 0;
int W = some_value();
int U = some_other_value();

int mem[sizeof(x) / sizeof(x[0]) + 1];
int o1[sizeof(x) / sizeof(x[0]) + 1];
int o2[sizeof(x) / sizeof(x[0]) + 1];
for (int i = 1; i <= sizeof(x) / sizeof(x[0]); ++i) {
  int xi = x[i - 1];
  if (i == 1) mem[0] = m;
  int fc_out = W * xi;
  int hidden_out = U * mem[i - 1];
  int sum = fc_out + hidden_out;
  int act = sigmoid(sum);
  mem[i] = act;
  o1[i] = act;
  o2[i] = hidden_out;
}

print_array(o1);
print_array(o2);
```

## Compilation and Execution

Like TensorFlow programs, a PaddlePaddle program is written in Python. The first part describes a neural network as a protobuf message, and the second part executes the message for training or inference.

The generation of this protobuf message is like how a compiler generates a binary executable file. The execution of the message is like how the OS executes the binary file.

## The "Binary Executable File Format"

The definition of the protobuf message is as follows:

```protobuf
message BlockDesc {
  repeated VarDesc vars = 1;
  repeated OpDesc ops = 2;
}
```

The step net in the above RNN example would look like:

```
BlockDesc {
  vars = {
    VarDesc {...} // x
    VarDesc {...} // h
    VarDesc {...} // fc_out
    VarDesc {...} // hidden_out
    VarDesc {...} // sum
    VarDesc {...} // act
  }
  ops = {
    OpDesc {...} // matmul
    OpDesc {...} // add_two
    OpDesc {...} // sigmoid
  }
};
```

Also, the RNN operator in the above example is serialized into a protobuf message of type `OpDesc` and would look like:

```
OpDesc {
  inputs = {0} // the index of x
  outputs = {5, 3} // indices of act and hidden_out
  attrs {
    "memories" : {1} // the index of h
    "step_net" : <above step net>
  }
};
```

This `OpDesc` value is in the `ops` field of the `BlockDesc` value representing the global block.

## The Compilation of Blocks

During the generation of the Protobuf message, the Block should store VarDesc (the Protobuf message which describes Variable) and OpDesc (the Protobuf message which describes Operator).

A VarDesc in a block should have its own name scope, so that local variables do not affect the parent block's name scope.
A child block's name scope should inherit the parent's, so that an OpDesc in the child block can reference a VarDesc stored in the parent block. For example:

```python
a = pd.Variable(shape=[20, 20])
b = pd.fc(a, params=["fc.w", "fc.b"])

rnn = pd.create_rnn()
with rnn.stepnet() as net:
    x = net.set_inputs(a)
    # reuse fc's parameter
    fc_without_b = pd.get_variable("fc.w")
    net.set_outputs(fc_without_b)

out = rnn()
```

The method `pd.get_variable` retrieves a Variable by name. A Variable may be stored in a parent block but retrieved in a child block, so a block should have a variable scope that supports inheritance.

223+
In compiler design, the symbol table is a data structure created and maintained by compilers to store information about the occurrence of various entities such as variable names, function names, classes, etc.
224+
225+
To store the definition of variables and operators, we define a C++ class `SymbolTable`, like the one used in compilers.
226+
227+
`SymbolTable` can do the following stuff:
228+
229+
- store the definitions (some names and attributes) of variables and operators,
230+
- to verify if a variable was declared,
231+
- to make it possible to implement type checking (offer Protobuf message pointers to `InferShape` handlers).
232+
233+
234+
```c++
235+
// Information in SymbolTable is enough to trace the dependency graph. So maybe
236+
// the Eval() interface takes a SymbolTable is enough.
237+
class SymbolTable {
238+
public:
239+
SymbolTable(SymbolTable* parent) : parent_(parent) {}
240+
241+
OpDesc* NewOp(const string& name="");
242+
243+
// TODO determine whether name is generated by python or C++
244+
// currently assume that a unique name will be generated by C++ if the
245+
// argument name left default.
246+
VarDesc* NewVar(const string& name="");
247+
248+
// find a VarDesc by name, if recursive true, find parent's SymbolTable
249+
// recursively.
250+
// this interface is introduced to support InferShape, find protobuf messages
251+
// of variables and operators, pass pointers into InferShape.
252+
// operator
253+
//
254+
// NOTE maybe some C++ classes such as VarDescBuilder and OpDescBuilder should
255+
// be proposed and embedded into pybind to enable python operate on C++ pointers.
256+
VarDesc* FindVar(const string& name, bool recursive=true);
257+
258+
OpDesc* FindOp(const string& name);
259+
260+
BlockDesc Compile() const;
261+
262+
private:
263+
SymbolTable* parent_;
264+
265+
map<string, OpDesc> ops_;
266+
map<string, VarDesc> vars_;
267+
};
268+
```
269+
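
For illustration, the recursive lookup in `FindVar` might be implemented as follows. This is an assumed sketch that relies on the class declaration above, not the actual implementation:

```c++
// Sketch only: search the local table first, then walk up the parent chain
// when `recursive` is true.
VarDesc* SymbolTable::FindVar(const string& name, bool recursive) {
  auto it = vars_.find(name);
  if (it != vars_.end()) {
    return &it->second;  // found in the current block's table
  }
  if (recursive && parent_ != nullptr) {
    return parent_->FindVar(name, recursive);  // fall back to the parent block
  }
  return nullptr;  // not declared anywhere in the hierarchy
}
```
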
After all the descriptions of variables and operators are added into the SymbolTable,
the block has enough information to run.

The `Block` class takes a `BlockDesc` as input and provides the `Run` and `InferShape` functions.

```c++
namespace {

class Block : public OperatorBase {
 public:
  Block(const BlockDesc& desc) : desc_(desc) {}

  void InferShape(const framework::Scope& scope) const override {
    if (!symbols_ready_) {
      CreateVariables(scope);
      CreateOperators();
    }
    // should run InferShape first.
    for (auto& op : runtime_table_.ops()) {
      op->InferShape(scope);
    }
  }

  void Run(const framework::Scope& scope,
           const platform::DeviceContext& dev_ctx) const override {
    PADDLE_ENFORCE(symbols_ready_, "operators and variables should be created first.");
    for (auto& op : runtime_table_.ops()) {
      op->Run(scope, dev_ctx);
    }
  }

  void CreateVariables(const framework::Scope& scope);
  void CreateOperators();

  // some other necessary interfaces of NetOp are listed below
  // ...

 private:
  BlockDesc desc_;
  bool symbols_ready_{false};
};

}  // namespace
```

## The Execution of Blocks

Block inherits from OperatorBase, which has a Run method.
Block's Run method runs its operators sequentially.

There is another important interface called `Eval`, which takes some arguments called targets, generates a minimal graph that treats the targets as end points, and creates a new Block. After running this block, `Eval` fetches the latest values of the targets and returns them.

The definition of Eval is as follows:

```c++
// Clean a block description by targets using the corresponding dependency graph.
// Return a new BlockDesc with a minimal number of operators.
// NOTE: we return the block's description rather than a Block so that it can be
// distributed to a cluster.
BlockDesc Prune(const BlockDesc& desc, vector<string> targets);

void Block::Eval(const vector<string>& targets,
                 const framework::Scope& scope,
                 const platform::DeviceContext& dev_ctx) {
  BlockDesc min_desc = Prune(desc_, targets);
  Block min_block(min_desc);
  min_block.Run(scope, dev_ctx);
}
```
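
As an illustration of what `Prune` could do, here is an assumed sketch with simplified stand-in types, not the actual implementation: scan the ops in reverse, keep an op if it produces a name that is still needed, and mark its inputs as needed in turn.

```c++
#include <set>
#include <string>
#include <vector>

// Simplified stand-ins for the protobuf messages, used only in this sketch.
struct SimpleOpDesc {
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
};
struct SimpleBlockDesc {
  std::vector<SimpleOpDesc> ops;
};

SimpleBlockDesc Prune(const SimpleBlockDesc& desc, const std::vector<std::string>& targets) {
  std::set<std::string> needed(targets.begin(), targets.end());
  std::vector<SimpleOpDesc> kept;
  // Walk the ops backwards so that consumers are seen before their producers.
  for (auto it = desc.ops.rbegin(); it != desc.ops.rend(); ++it) {
    bool produces_needed = false;
    for (const auto& out : it->outputs) {
      if (needed.count(out)) { produces_needed = true; break; }
    }
    if (!produces_needed) continue;       // nothing downstream needs this op
    needed.insert(it->inputs.begin(), it->inputs.end());
    kept.push_back(*it);
  }
  // Restore the original execution order.
  SimpleBlockDesc pruned;
  pruned.ops.assign(kept.rbegin(), kept.rend());
  return pruned;
}

int main() {
  SimpleBlockDesc block;
  block.ops.push_back({{"x"}, {"fc_out"}});        // fc_out = f(x)
  block.ops.push_back({{"fc_out"}, {"act"}});      // act = g(fc_out)
  block.ops.push_back({{"x"}, {"unused"}});        // not needed for "act"
  SimpleBlockDesc pruned = Prune(block, {"act"});  // keeps only the first two ops
  return pruned.ops.size() == 2 ? 0 : 1;
}
```
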

paddle/cuda/include/hl_cuda_cudnn.h

Lines changed: 3 additions & 3 deletions
```diff
@@ -22,10 +22,10 @@ limitations under the License. */
 */
 typedef enum {
     HL_POOLING_MAX = 0,
-    // average includes padded values
-    HL_POOLING_AVERAGE = 1,
     // average does not include padded values
-    HL_POOLING_AVERAGE_EXCLUDE_PADDING = 2,
+    HL_POOLING_AVERAGE = 1,
+    // average includes padded values
+    HL_POOLING_AVERAGE_INCLUDE_PADDING = 2,
     HL_POOLING_END
 } hl_pooling_mode_t;
```

paddle/cuda/include/hl_tensor_ops.h

Lines changed: 1 addition & 1 deletion
```diff
@@ -461,7 +461,7 @@ class add<float32x4_t> {
 public:
   INLINE float32x4_t operator()(const float32x4_t a,
                                 const float32x4_t b) const {
-    return vmulq_f32(a, b);
+    return vaddq_f32(a, b);
   }
 };
```
