Intro to Image Classification


Image Classification is the task of sorting pictures into different classes. The most common example is classifying an image as Cat or Dog. Notice, however, that the computer is completely ignorant of what Cat or Dog means in this process, and from the computer's perspective it would make absolutely no difference if we labeled the Cat as Cute Being and the Dog as Loyal Creature instead. The training, validation and testing accuracy would stay exactly the same regardless of what name we put on each category. To capture this abstraction, we usually represent each label as a distinct integer value. Each image, along the same line of thought, is nothing more than a 3-D array (height * width * channels) of integer values ranging from 0 to 255.
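To make this concrete, here is a minimal sketch in plain Python. The helper names (encode_labels, accuracy) and the toy data are made up for illustration and do not come from any library. It shows that accuracy depends only on the integer codes, not on the names we give the classes, and it builds a tiny image as a height * width * channels array of integers between 0 and 255.

```python
def encode_labels(names):
    """Map each distinct label name to an integer, in order of first appearance."""
    mapping = {}
    for name in names:
        if name not in mapping:
            mapping[name] = len(mapping)
    return [mapping[name] for name in names], mapping


def accuracy(predicted, actual):
    """Fraction of positions where the predicted code equals the true code."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Toy ground truth and toy predictions (made-up data for illustration).
truth_names = ["Cat", "Dog", "Dog", "Cat"]
pred_names = ["Cat", "Dog", "Cat", "Cat"]

# The same data with the classes renamed.
rename = {"Cat": "Cute Being", "Dog": "Loyal Creature"}
truth_renamed = [rename[n] for n in truth_names]
pred_renamed = [rename[n] for n in pred_names]

# Encode the ground truth, then reuse its mapping for the predictions.
truth_codes, mapping = encode_labels(truth_names)
pred_codes = [mapping[n] for n in pred_names]

truth_codes_2, mapping_2 = encode_labels(truth_renamed)
pred_codes_2 = [mapping_2[n] for n in pred_renamed]

acc_original = accuracy(pred_codes, truth_codes)
acc_renamed = accuracy(pred_codes_2, truth_codes_2)
print(acc_original, acc_renamed)  # both values are identical

# A tiny 2 x 3 image with 3 channels: image[row][column][channel].
height, width, channels = 2, 3, 3
image = [[[0, 128, 255] for _ in range(width)] for _ in range(height)]
```

Renaming the classes changes the keys of the mapping but not the integer codes, which is why the two accuracy values agree.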


Image Classification on its own is not the most powerful CV algorithm ever designed. It is a supervised learning method that works well only with a limited number of discrete outputs. However, Image Classification is the foundation of numerous other tasks, including Object Detection and Segmentation. Being able to classify an image accurately not only helps the accuracy of the tasks built on top of it, but also helps us understand the scope and mechanism of machine vision a whole lot more. A lot of CV breakthroughs happen exactly when researchers spend a long time delving into basic tasks like Edge Detection and Image Classification.

Although it sounds simple and naive, Image Classification has various applications even in its most basic form. To name a few: emotion analysis captures emotions through facial features; gesture classification relies on hand cues to capture the poser’s intention; Amazon Go utilizes object classification for instant checkout.


Even as a starter task in CV, Image Classification is a very daunting problem to attempt. To give a sense of its difficulty, consider the following scenario.

You are in a Guess It game and behind the curtain, there is an animal that could be a cat, a dog, a monkey or a guinea pig. The host gives you information on its price, weight and adorability (which is subjective and pretty much random in this context). With this information, you are given two attempts to guess the animal correctly.

This sounds hard, but not impossible, right? Except in the context of Image Classification, the host would give you a couple hundred thousand more pieces of information, the number of possible animals behind the curtain goes from 4 to 1000, the distinction between possible animals goes from cat versus dog to Cairn Terrier versus Norwich Terrier, and there is only one chance for you to guess right.

Particular challenges in this context include:

  • Occlusion
  • Different Illumination Conditions
  • Scale Variation
  • Deformation
  • Viewpoint Variation
  • Intra-class Variation

It is also a huge challenge for the computer to figure out which pieces of information are relevant to the label and which are complete noise.



Image Classification is categorizing pictures. If someone shows us this picture, we immediately think of the word ‘dog’, and dog experts can go further and identify it as a ‘Boston Terrier’.

Target Image: 300x169x3


Memorizing and making use of these names is not hard for a human, but it is way too complicated for a computer. Instead, we ask the computer to perform an easier task, which is to categorize this picture into Category A or B. In common examples, A would be dog and B would be cat. We’ll try something different here.

Category A
Category B

Image Classification is the process in which the computer tries to associate the target image with Category B above.

Contrary to our first impression, the computer is completely ignorant of the label ‘dog’ or ‘robot’, and of any idea of ‘category’. In the eyes of the computer, the target image is a 3-D array of integers, with dimensions 300x169x3 and integer values ranging from 0 to 255. The computer then needs to choose one option out of a finite number of choices to associate this array with. Upon making its choice, the computer signals us by passing back a numerical indicator.
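The sketch below illustrates that last step in plain Python. It is only an illustration under stated assumptions: the helper names, the uniform fill value of 200 and the two "prototype" intensities are all made up, and the mean-intensity rule is a toy stand-in for a trained model, not a real classifier. The array is built with its dimensions in the order written above (300, 169, 3). The point is the interface: an array of integers goes in, and a single integer index comes out.

```python
def make_image(d0, d1, d2, value):
    """Build a d0 x d1 x d2 nested list filled with one integer value (0 to 255)."""
    return [[[value] * d2 for _ in range(d1)] for _ in range(d0)]


def mean_intensity(image):
    """Average of every integer stored in the 3-D array."""
    total = 0
    count = 0
    for plane in image:
        for row in plane:
            total += sum(row)
            count += len(row)
    return total / count


def classify(image, prototypes):
    """Return the index of the prototype closest to the image's mean intensity.

    Toy decision rule for illustration only; a real model would learn
    its scoring function from labeled training data.
    """
    m = mean_intensity(image)
    scores = [-abs(m - p) for p in prototypes]
    # Index of the highest score: this is the numerical indicator passed back.
    return max(range(len(scores)), key=scores.__getitem__)


# Stand-in for the target image: 300 x 169 x 3 integers, all set to 200.
target = make_image(300, 169, 3, 200)
num_values = sum(len(row) for plane in target for row in plane)
print(num_values)  # 300 * 169 * 3 = 152100 integers

# Hypothetical prototypes: index 0 stands for Category A, index 1 for Category B.
prototypes = [50.0, 210.0]
choice = classify(target, prototypes)
print(choice)  # an integer index, not a word like 'dog' or 'robot'
```

Whatever meaning we attach to index 0 or index 1 lives entirely outside the function; the computer only ever sees and returns numbers.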