convolutional neural net in hardware

Alright, I've had my LogiBone sitting idle for a while now, and I just had a crazy idea that seems entirely plausible; the catch is that I'm limited in my understanding of how to program the FPGA.

Let's say I trained a convnet (using TensorFlow, for instance) for a specific task, then extracted all the computed weights and somehow fed them into the conv3x3 primitive in hard-cv.
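To make the extraction step concrete, here is a minimal sketch assuming a Keras-style TensorFlow model with biased 3x3 Conv2D layers; the model path and the `.npy` dumps are just placeholders, not a format hard-cv expects.

```python
# Sketch: dump each trained 3x3 convolution kernel so it could later be loaded
# into a hardware conv3x3 block. Model path and file names are hypothetical.
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("convnet.h5")    # hypothetical trained model

for layer in model.layers:
    if isinstance(layer, tf.keras.layers.Conv2D):
        # get_weights() returns [kernel, bias] when use_bias=True;
        # kernel shape is (3, 3, in_channels, out_channels).
        kernel, bias = layer.get_weights()
        np.save(f"{layer.name}_kernel.npy", kernel)
        np.save(f"{layer.name}_bias.npy", bias)
        print(layer.name, kernel.shape, bias.shape)
```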

So for this to work I would need many conv3x3 blocks. The image would be fed to, say, 64 conv3x3 blocks with different weights, creating 64 'feature maps'. To reduce the x,y dimensions I would probably also need to implement the 'max pool' operation, which seems pretty simple (a software reference for one such stage is sketched below).
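To pin down what one such stage computes, here is a plain NumPy reference (no FPGA code): a random frame and 64 random kernels stand in for real data and trained weights, and, as in most DNN frameworks, the 'convolution' is really a cross-correlation (no kernel flip).

```python
# Software reference for one conv3x3 + 2x2 max-pool stage.
import numpy as np

def conv3x3(image, kernel):
    """'Valid' 3x3 convolution (cross-correlation) of a 2-D image with one kernel."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2), dtype=np.float32)
    for y in range(h - 2):
        for x in range(w - 2):
            out[y, x] = np.sum(image[y:y + 3, x:x + 3] * kernel)
    return out

def maxpool2x2(fmap):
    """Non-overlapping 2x2 max pooling to halve the x,y dimensions."""
    h, w = fmap.shape
    h, w = h - h % 2, w - w % 2
    return fmap[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

image = np.random.rand(64, 64).astype(np.float32)      # stand-in input frame
kernels = np.random.rand(64, 3, 3).astype(np.float32)  # 64 stand-in 3x3 windows
feature_maps = [maxpool2x2(conv3x3(image, k)) for k in kernels]
print(len(feature_maps), feature_maps[0].shape)         # 64 maps of shape (31, 31)
```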
I imagine I would eventually be limited by the number of gates available to run a full neural network, or by the memory to store weights. The idea would then be to run the end of the network (the fully connected layers) on the CPU at that point.
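The CPU-side tail could then be little more than a matrix-vector product over the flattened feature maps; `W_fc` and `b_fc` below are stand-ins for the trained dense-layer weights, not anything extracted from a real model.

```python
# Sketch of the fully connected head running on the CPU.
import numpy as np

def fully_connected_head(feature_maps, W_fc, b_fc):
    x = np.concatenate([f.ravel() for f in feature_maps])  # flatten all maps
    logits = W_fc @ x + b_fc                                # one dense layer
    return int(np.argmax(logits))                           # predicted class
```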

I'd love to hear your thoughts on this. I think it's totally feasible, but I'm more qualified to train the net than to feed it to the hardware/software hybrid I'm thinking about.

Comments

  • Hi mtourne,

    Very good question. This sounds like a very interesting project. @jpiat specializes in image processing but has been out of town. He will address this when he gets back soon. Thanks.
  • There is some existing work to map convolutional networks trained in the Caffe framework straight to FPGA logic. What you mention (only running the computer-vision front end on the FPGA) is a good idea, and could easily be performed in logic. One problem with our Spartan-6 based platform is the limited number of DSP blocks, which would limit the number of conv3x3 operators you could run in parallel. The training of your DNN is likely to produce floating-point convolution windows, which require a lot of resources in hardware. If you can translate those floating-point convolution windows to fixed point (or train a fixed-point DNN), you could probably run up to 16 convolutions in parallel for VGA-resolution images at a limited frame rate (a small quantization sketch follows at the end of the thread). While I would really like to see a proof of concept running on our platform, I think a bigger FPGA is likely to bring you better performance (look at a Zynq platform, for instance).
  • Thank you for the answer J.

    I'll look at some reference material on convnets using integer weights instead of floats. Even if it isn't realtime, this could still make a nice PoC on a piece of hardware I already own, with a reasonable framerate.
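As a rough illustration of the float-to-fixed-point translation mentioned above, here is one possible scheme using a signed 8-bit Q1.7 format; the format and bit width are assumptions, and the right choice depends on the DSP blocks available and the accuracy loss the network can tolerate.

```python
# Quantize trained float 3x3 windows to Q1.7 fixed point (8-bit signed,
# 7 fractional bits), and measure the round-trip error.
import numpy as np

FRAC_BITS = 7                  # Q1.7 represents roughly [-1, 1)
SCALE = 1 << FRAC_BITS

def to_fixed(kernel_f32):
    q = np.round(kernel_f32 * SCALE)
    return np.clip(q, -128, 127).astype(np.int8)

def to_float(kernel_q):
    return kernel_q.astype(np.float32) / SCALE

kernel = np.random.uniform(-1, 1, (3, 3)).astype(np.float32)  # stand-in window
kernel_q = to_fixed(kernel)
print("max quantization error:", np.abs(kernel - to_float(kernel_q)).max())
```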