I'm trying to solve OCR tasks based on this code.
So what shape should the input to the LSTM have? Suppose we have images of shape [batch_size, height, width, channels]; how should they be reshaped to be used as input? Like [batch_size, width, height*channels], so that width acts as the time dimension?
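To make the question concrete, here is roughly the reshape I have in mind, in numpy (all sizes are made up):

```python
import numpy as np

# dummy image batch: [batch_size, height, width, channels]
batch_size, height, width, channels = 32, 32, 100, 1
images = np.zeros((batch_size, height, width, channels), dtype=np.float32)

# put width first, then fold height*channels into one feature vector per image column
seq = images.transpose(0, 2, 1, 3)                       # [batch_size, width, height, channels]
seq = seq.reshape(batch_size, width, height * channels)  # [batch_size, width, height*channels]
```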
What if I want to have variable width? As I understand it, all sequences in a batch must have the same length (is the common trick just to pad each sequence with zeros at the end, or should batch_size be 1?).
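By padding I mean a helper along these lines (pad_batch is just a hypothetical name):

```python
import numpy as np

# hypothetical helper: zero-pad variable-width sequences to the longest one in the batch
def pad_batch(sequences):  # each item has shape [width_i, num_features]
    max_width = max(s.shape[0] for s in sequences)
    num_features = sequences[0].shape[1]
    batch = np.zeros((len(sequences), max_width, num_features), dtype=np.float32)
    seq_len = np.zeros(len(sequences), dtype=np.int32)
    for i, s in enumerate(sequences):
        batch[i, :s.shape[0], :] = s  # original values, zeros after width_i
        seq_len[i] = s.shape[0]
    return batch, seq_len  # seq_len would be fed to the RNN as sequence_length
```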
What if I want to have variable width and height? As I understand it, I need convolutional + global average pooling / spatial pyramid pooling layers before the LSTM, so the output blob will be [batch_size, feature_map_height, feature_map_width, feature_map_channels]. How should that blob be reshaped to be used as LSTM input? Like [batch_size, feature_map_width, feature_map_height*feature_map_channels]? Or can we reshape it to a single row like [batch_size, feature_map_width*feature_map_height*feature_map_channels]? It would then be like a sequence of pixels, and we lose some spatial information; will it work?
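In code, the two reshapes I'm asking about would look something like this (dummy feature-map sizes):

```python
import numpy as np

b, fh, fw, fc = 4, 8, 25, 64  # dummy feature-map sizes
feat = np.zeros((b, fh, fw, fc), dtype=np.float32)

# option 1: keep width as time, fold height*channels into the feature axis
opt1 = feat.transpose(0, 2, 1, 3).reshape(b, fw, fh * fc)  # [b, fw, fh*fc]

# option 2: one long "sequence of pixels", one value per timestep
opt2 = feat.reshape(b, fh * fw * fc, 1)  # the 2-D layout is lost here
```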
Here is the definition of the input, but I'm not sure what [batch_size, max_stepsize, num_features] means in your case:
https://github.com/igormq/ctc_tensorflow_example/blob/master/ctc_tensorflow_example.py#L90
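My current reading of that placeholder, for reference (num_features is a made-up value here, and the seq_len placeholder is my assumption about how the unpadded lengths get passed in):

```python
import tensorflow as tf  # TF 1.x, as in the linked example

num_features = 13  # made-up value; for images would this be height*channels?
# [batch_size, max_stepsize, num_features], with batch and time left dynamic
inputs = tf.placeholder(tf.float32, shape=[None, None, num_features])
# actual (unpadded) length of each sequence in the batch
seq_len = tf.placeholder(tf.int32, shape=[None])
```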
And how does the output of the LSTM depend on the input size and the max sequence length?
https://github.com/igormq/ctc_tensorflow_example/blob/master/ctc_tensorflow_example.py#L110
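As far as I can tell, the output shape would be something like this (num_hidden and num_features are made-up sizes):

```python
import tensorflow as tf  # TF 1.x, as in the linked example

num_features, num_hidden = 13, 50  # made-up sizes
inputs = tf.placeholder(tf.float32, [None, None, num_features])
seq_len = tf.placeholder(tf.int32, [None])

cell = tf.contrib.rnn.LSTMCell(num_hidden)
outputs, _ = tf.nn.dynamic_rnn(cell, inputs, sequence_length=seq_len, dtype=tf.float32)
# outputs: [batch_size, max_stepsize, num_hidden] -- the time axis follows the input,
# while the per-step feature size becomes num_hidden regardless of num_features
```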
BTW: here are some examples using 'standard' approaches in Keras+TensorFlow, which I want to complement with RNN examples:
https://github.com/mrgloom/Char-sequence-recognition