
OCR: clarification about input and output #20

@mrgloom

Description


I'm trying to solve an OCR task based on this code.

What shape should the input to the LSTM have? Suppose we have images of shape [batch_size, height, width, channels]; how should they be reshaped to be used as input? Like [batch_size, width, height*channels], so that width acts as the time dimension?
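For concreteness, here is a minimal sketch of the reshape I have in mind (NumPy, with made-up sizes):

```python
import numpy as np

# Made-up sizes: 8 grayscale images of height 32 and width 100.
batch_size, height, width, channels = 8, 32, 100, 1
images = np.zeros((batch_size, height, width, channels), dtype=np.float32)

# [batch, H, W, C] -> [batch, W, H, C] -> [batch, W, H*C]:
# width becomes the time axis, height*channels the per-step feature vector.
sequences = images.transpose(0, 2, 1, 3).reshape(batch_size, width, height * channels)
print(sequences.shape)  # (8, 100, 32)
```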

What if I want to have variable width? As I understand it, all sequences in a batch must have the same length (is the common trick just to pad each sequence with zeros at the end, or should batch_size be 1?).
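A minimal sketch of the zero-padding I mean (NumPy; the pad_batch helper and all sizes are made up):

```python
import numpy as np

def pad_batch(sequences):
    # Zero-pad a list of [width_i, num_features] arrays to the max width.
    # Returns the padded batch [batch, max_width, num_features] plus the
    # original lengths, which would be fed as sequence_length to the LSTM/CTC.
    num_features = sequences[0].shape[1]
    lengths = np.array([s.shape[0] for s in sequences], dtype=np.int32)
    padded = np.zeros((len(sequences), lengths.max(), num_features), dtype=np.float32)
    for i, s in enumerate(sequences):
        padded[i, :s.shape[0], :] = s
    return padded, lengths

# Example: three "images" of widths 50, 80, 100 with 32 features each.
batch, lengths = pad_batch([np.ones((w, 32), np.float32) for w in (50, 80, 100)])
print(batch.shape, lengths)  # (3, 100, 32) [ 50  80 100]
```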

What if I want to have variable width and height? As I understand it, I would need convolutional + global average pooling / spatial pyramid pooling layers before the LSTM, so the output blob would be [batch_size, feature_map_height, feature_map_width, feature_map_channels]. How should that blob be reshaped to be used as input to the LSTM? Like [batch_size, feature_map_width, feature_map_height*feature_map_channels]? Or can we reshape it into a single row like [batch_size, feature_map_width*feature_map_height*feature_map_channels], so it becomes a sequence of pixels? We lose some spatial information that way; will it still work?
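To illustrate the two reshapes I'm asking about (NumPy, with a hypothetical feature-map size):

```python
import numpy as np

# Hypothetical conv feature map: [batch, fh, fw, fc].
batch_size, fh, fw, fc = 8, 4, 25, 64
fmap = np.zeros((batch_size, fh, fw, fc), dtype=np.float32)

# Option A: keep width as time, flatten height*channels into the features.
seq_a = fmap.transpose(0, 2, 1, 3).reshape(batch_size, fw, fh * fc)
print(seq_a.shape)  # (8, 25, 256)

# Option B: flatten everything into one long "sequence of pixels"
# (one scalar per time step); the 2-D layout within each column is lost.
seq_b = fmap.reshape(batch_size, fh * fw * fc, 1)
print(seq_b.shape)  # (8, 6400, 1)
```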

Here is the definition of the input, but I'm not sure what [batch_size, max_stepsize, num_features] means in your case:
https://github.com/igormq/ctc_tensorflow_example/blob/master/ctc_tensorflow_example.py#L90
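My current reading of that line is something like the following (a sketch, not copied from the repo; num_features would be height*channels of the input image):

```python
import tensorflow as tf

num_features = 32  # hypothetical: image height * channels

# batch_size and max_stepsize are both dynamic (None); only the
# per-timestep feature size is fixed.
inputs = tf.placeholder(tf.float32, [None, None, num_features])
seq_len = tf.placeholder(tf.int32, [None])  # true (unpadded) length of each example
```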

And how does the output of the LSTM depend on the input size and the max sequence length?
https://github.com/igormq/ctc_tensorflow_example/blob/master/ctc_tensorflow_example.py#L110
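Here is how I currently picture the output shapes (a TF 1.x sketch with made-up sizes, not code from the repo):

```python
import tensorflow as tf

num_features, num_hidden, num_classes = 32, 128, 28  # hypothetical sizes

inputs = tf.placeholder(tf.float32, [None, None, num_features])
seq_len = tf.placeholder(tf.int32, [None])

cell = tf.nn.rnn_cell.LSTMCell(num_hidden)
# outputs: [batch_size, max_stepsize, num_hidden] -- the output keeps the
# (padded) time dimension of the input; seq_len only masks the padding.
outputs, _ = tf.nn.dynamic_rnn(cell, inputs, sequence_length=seq_len, dtype=tf.float32)

# Per-timestep projection to num_classes, as a CTC loss would expect:
logits = tf.layers.dense(outputs, num_classes)  # [batch, max_stepsize, num_classes]
```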

BTW: here are some examples using 'standard' approaches in Keras+TensorFlow, which I want to complement with RNN examples.
https://github.com/mrgloom/Char-sequence-recognition
