Technical Article

Deep Learning Algorithms for Vision Systems

January 26, 2022 by Anish Devasia

Which deep learning neural networks are best suited for industrial vision systems and why?

Human vision results from millions of years of evolution, starting from organisms gaining mutations that respond to light. The complex system is one of the most advanced vision systems among all living organisms. Human vision is controlled by the region in the brain called the occipital lobe. Though we still do not know the complete mechanism of human vision, it is an incredibly complex task. It becomes evident when we try to make computers see.

 

Need for Deep Learning

Images are made of thousands of pixels. Each pixel is simply a small, single-color square. The color of a pixel is represented as a set of numerical values corresponding to the amount of red, green, and blue (RGB) colors in the image. For example, the image color for figure 1 is [198 226 255] in the RGB scheme.

 

Figure 1. A block of color with an [198 226 255] pixel color.

 

The computer understands an image as a set of RGB values corresponding to each pixel square in the image. The unique RGB colors used on Control Automation’s website (figure 2) are the following (obtained via a Python script).

 

Figure 2. Control Automation’s logo.

 

[[  0   0   0]

 [ 49  12  12]

 [197  48  48]

 [197  48  49]

 [197  49  48]

 [197  49  49]

 [198  48  48]

 [198  48  49]

 [198  49  48]

 [198  49  49]

 [199  48  48]

 [199  48  49]

 [199  49  48]

 [199  49  49]

 [199  49  50]

 [199  50  49]

 [199  50  50]

 [200  48  49]

 [200  49  48]

 [200  49  49]

 [200  49  50]

 [200  50  49]

 [200  50  50]

 [201  49  49]

 [201  49  50]

 [201  50  49]

 [201  50  50]

 [201  50  51]

 [201  51  50]

 [201  51  51]

 [202  49  49]

 [202  49  50]

 [202  50  49]

 [202  50  50]

 [202  50  51]

 [202  51  50]

 [202  51  51]

 [203  50  50]

 [203  50  51]

 [203  51  50]

 [203  51  51]

 [204  50  50]

 [204  50  51]

 [204  51  50]

 [204  51  51]

 [204  51  52]

 [204  52  51]

 [204  52  52]

 [205  51  51]

 [205  51  52]

 [205  52  51]

 [205  52  52]

 [206  51  51]

 [206  51  52]

 [206  52  51]

 [206  52  52]

 [206  52  53]

 [206  53  52]

 [206  53  53]

 [207  52  52]

 [207  52  53]

 [207  53  52]

 [207  53  53]

 [208  52  52]

 [208  52  53]

 [208  53  52]

 [208  53  53]

 [208  53  54]

 [208  54  53]

 [208  54  54]

 [209  52  53]

 [209  53  52]

 [209  53  53]

 [209  53  54]

 [209  54  53]

 [209  54  54]

 [210  53  53]

 [210  53  54]

 [210  54  53]

 [210  54  54]

 [210  54  55]

 [210  55  54]

 [210  55  55]

 [211  53  53]

 [211  53  54]

 [211  54  53]

 [211  54  54]

 [211  54  55]

 [211  55  54]

 [211  55  55]

 [212  54  54]

 [212  54  55]

 [212  55  54]

 [212  55  55]

 [213  54  54]

 [213  54  55]

 [213  55  54]

 [213  55  55]

 [213  55  56]

 [213  56  55]

 [213  56  56]

 [214  55  55]

 [214  55  56]

 [214  56  55]

 [214  56  56]

 [215  55  55]

 [215  55  56]

 [215  56  55]

 [215  56  56]

 [215  56  57]

 [215  57  56]

 [215  57  57]

 [216  56  56]

 [216  56  57]

 [216  57  56]

 [216  57  57]

 [217  56  56]

 [217  56  57]

 [217  57  56]

 [217  57  57]

 [217  57  58]

 [217  58  57]

 [217  58  58]

 [218  56  56]

 [218  56  57]

 [218  57  56]

 [218  57  57]

 [218  57  58]

 [218  58  57]

 [218  58  58]

 [219  57  57]

 [219  57  58]

 [219  57  59]

 [219  58  57]

 [219  58  58]

 [219  58  59]

 [219  58  60]

 [219  59  58]

 [220  57  57]

 [220  57  58]

 [220  57  59]

 [220  57  60]

 [220  58  57]

 [220  58  58]

 [220  58  59]

 [220  58  60]

 [220  59  58]

 [220  59  59]

 [220  59  60]

 [220  60  59]

 [221  57  59]

 [221  57  60]

 [221  58  58]

 [221  58  59]

 [221  58  60]

 [221  59  58]

 [221  59  59]

 [221  59  60]

 [221  60  58]

 [221  60  59]

 [222  58  58]

 [222  58  59]

 [222  59  58]

 [222  59  59]

 [222  59  60]

 [222  60  59]

 [222  60  60]

 [223  58  59]

 [223  59  58]

 [223  59  59]

 [223  59  60]

 [223  60  59]

 [223  60  60]

 [224  59  59]

 [224  59  60]

 [224  60  59]

 [224  60  60]

 [224  60  61]

 [224  61  60]

 [224  61  61]

 [225  59  60]

 [225  60  59]

 [225  60  60]

 [225  60  61]

 [225  61  60]

 [225  61  61]

 [226  60  60]

 [226  60  61]

 [226  61  60]

 [226  61  61]

 [226  61  62]

 [226  62  61]

 [226  62  62]

 [227  60  60]

 [227  60  61]

 [227  61  60]

 [227  61  61]

 [227  61  62]

 [227  62  61]

 [227  62  62]

 [228  61  61]

 [228  61  62]

 [228  62  61]

 [228  62  62]

 [229  61  61]

 [229  61  62]

 [229  62  61]

 [229  62  62]]

 

The number of times each of those colors appears in the image is as follows.

 

[ 1368    16 39287   434   405   190  9274  2239  1045  1028   145   498

   537  3662   127    65    11     3    97  3877   554   965   387   379

   788  1146  3682    20    17     2    12    30    76  3570   532   540

   260  1115   807   839  9476    15   252   107  3496   574   250    90

  2658   589   460   553    83   214   204  4593   220   199    40  2830

   675  1046  4206   147   347   363  1005    43    32     4    10    11

  2228   392   741   210   739   755   734  2145     8     5     1    21

   105  1557  3787   454   139  5369   295   337   321   785    27    74

    96  3845   232   230    67  1640  8482  1049   971   187   354   290

  5721    54    57    10  1236   321   305   252   280   614  1152  4126

   225    28     1     3    68    56  4220   390   446   200   694   834

     2   884  1987    46     2     1    39  2193   218     8   195  5844

   451     5   432   145     1     1     3     4  1887  1005     6   701

  1353     2     5     2   112  1430   343  9064   147   135    25     2

     2  3175   795   985   699   253  2042   617  2680    37    29     7

    11    20  5327   785   945  1610   845  1438  1059  3187     2     1

     1    31   134   124  7809   767  8255   284  1621  1139  1129  1400

    66   208   217  1774]

 

This is all a computer knows about the image. It does not understand the words in the image are “CONTROL” and “AUTOMATION.” The computer also does not understand “O” in the image is shaped to resemble gear teeth. Computers cannot understand what a logo is. For that, machine learning and deep learning are required.

Note: The image used for the python script was a .webp file with no background. This is the reason for the absence of white RGB ([256 256 256]) in the result. But the white background is visible with the image on this page is the color of the webpage. There is no white color in the actual image.

 

Convolutional Neural Network (CNN)

The first step for vision systems is understanding objects in an image. To do so, convolutional neural networks (CNN) are required. Using deep learning with artificial neural networks (ANN) for image recognition is highly resource-intensive. CNN is a better solution for image recognition.

Take, for example, a computer wanting to identify “T” in the logo of this website. We know that T is made by a horizontal line perched atop a vertical line. We can use filters with neural networks to identify the contents of an image. To identify the letter T, we need three filters.

  •    A horizontal line
  •    A vertical line
  •    The horizontal line is atop the vertical line

CNN performs a convolutional operation to identify these three filters. The CNN algorithm takes a grid of pixels and contrasts them with each filter to create a feature map. This feature map can identify whether a particular filter is present in the image.

If all three filters are present, the letter T is identified. Similarly, a well-trained CNN can identify all letters from images. Scaled-up CNNs can also be trained to detect various objects.

CNNs require fewer resources compared to simple ANNs to accomplish the same task. In simple words, CNN creates a shortcut to ANN methodology. Using CNN, computers can understand objects in an image. Applying CNNs on the logo of this website computer can:

  • Identify the letters in the logo of this website.
  • Understand that it is made up of two separate words.
  • The two words are CONTROL and AUTOMATION
  • The first O in the logo is shaped like a gear, etc.

 

Recurrent Neural Network (RNN)

Vision systems are not solely based on identifying images. They should also understand what is happening in a video clip or a video stream. Luckily, videos are simply a lot of pictures in sequence. But, identifying the objects in each image is not sufficient to interpret a video.

 

Figure 3. An adjustable guard with an open/close function so that operators can provide maintenance. Image used courtesy of OSHA

 

A video with a machine guard moving up and down comprises several images. The position of the guard changes in every image. Identifying the guard in every image is insufficient.

When all images are analyzed, the position of the guard changes from image to image. We can conclude the guard is moving. The correct image sequence and position changes helps identify the guard is moving from the top to the bottom.

CNNs, however, can only identify objects in each image. A recurrent neural network (RNN) is needed to string the insights from all the images and identify the video’s context. CNNs can only identify spatial data, which helps identify objects in the data. RNN can work with temporal data—data with a time component.

RNNs are used in datasets where sequences are important. CNN analyzes each of the video frames. The output from CNN is fed to another model with RNN to handle the temporal information attached with videos.

The RNN utilized the CNN’s output. In the context of vision systems, RNN handles video data, but most use cases for RNN are in natural language processing (NLP) applications.

Mimicking human vision is one of the most challenging tasks AI systems can undertake. It is an immense and complex procedure to identify each pixel in the image, recognize objects, and string it together with the context from the rest of the images in the video stream. CNN can recognize objects in each video frame, while RNN can bring together the video’s context.

As with every machine learning model, CNNs and RNNs have to be trained with data to be usable. In most applications, the amount of data required to train these deep learning models is humongous, which is why advanced vision systems are not yet a reality.


Featured image used courtesy of TELEDYNE FLIR