Joint Detection & Recognition
Jerod Weinman, Computer Science, Grinnell

Motivation

Figure 1. An example object class hierarchy for images.

Object detection and recognition systems, such as face detectors and face recognizers, are often trained separately and operated in a feed-forward fashion. Selecting a small number of features for these tasks is important to prevent over-fitting and reduce computation. However, when a system has such related or sequential tasks, selecting features for these tasks independently may not be optimal.

For instance, in the class hierarchy figure above, detection corresponds to finding instances in the middle column, while recognition corresponds to finding instances in the rightmost column. Previous work has focused on sharing features for category detection, but not for sub-category recognition.

We propose a framework for choosing features to be shared between detection and recognition tasks. The result is a system that achieves better performance by joint training and is faster because some features for identification have already been computed for detection.

Example Problem

Example image windows: characters, background, and cars.

We apply this framework to the distinct but highly related problems of text detection and character recognition, simultaneously with car detection and recognition. The detection tasks need only discriminate characters and cars in general from background, while the recognition tasks must identify the center character in each window or the particular type of car (e.g., SUV, sedan, van, truck).

Comparison

Figure 2. Recognition accuracy with increasing numbers of features selected independently or by our joint method.

With an independent method, features are selected separately for the categorization (detection) task and the sub-categorization (recognition) task. In our method, the features are selected jointly and simultaneously for both classifiers. This yields either better performance for the same total number of features, or faster runtime for the same overall performance.
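The contrast between the two selection strategies can be illustrated with a toy sketch. It is not the selection procedure used in this work: it assumes each feature has a fixed, additive usefulness score per task, and the feature names, scores, and function names are all hypothetical. Independent selection picks the best features for each task on its own, while joint selection ranks features by their combined usefulness and lets both classifiers use the shared pool.

```python
# Toy illustration only: assumes additive, precomputed per-feature scores.
# Feature names and scores are hypothetical, not taken from the project.
det_gain = {"edges": 9, "corners": 5, "texture": 4, "color": 1}
rec_gain = {"edges": 2, "corners": 6, "texture": 5, "color": 3}


def select_independent(det_gain, rec_gain, k_det, k_rec):
    """Pick the top-scoring features for each task separately."""
    by_det = sorted(det_gain, key=det_gain.get, reverse=True)[:k_det]
    by_rec = sorted(rec_gain, key=rec_gain.get, reverse=True)[:k_rec]
    return set(by_det), set(by_rec)


def select_joint(det_gain, rec_gain, budget):
    """Pick one shared pool of features ranked by summed score over both tasks."""
    combined = {f: det_gain[f] + rec_gain[f] for f in det_gain}
    pool = sorted(combined, key=combined.get, reverse=True)[:budget]
    return set(pool)


def total_gain(det_feats, rec_feats, det_gain, rec_gain):
    """Sum the score each classifier obtains from the features it uses."""
    return (sum(det_gain[f] for f in det_feats)
            + sum(rec_gain[f] for f in rec_feats))


# Both strategies compute two distinct features per window.
det_only, rec_only = select_independent(det_gain, rec_gain, 1, 1)
shared = select_joint(det_gain, rec_gain, 2)

independent_total = total_gain(det_only, rec_only, det_gain, rec_gain)
joint_total = total_gain(shared, shared, det_gain, rec_gain)
print(independent_total, joint_total)
```

In this toy setting both strategies compute two distinct features, but the joint pool serves both classifiers, so its total score is higher. A real system would measure usefulness through classifier performance rather than fixed additive scores.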

When features are complex or time-consuming to compute, employing features that are reusable among different system components is highly desirable. This becomes especially important as tasks for vision systems grow in breadth.
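The runtime benefit of reusable features can be sketched with a small per-window cache. This is a minimal illustration under stated assumptions: the extractors are trivial stand-ins for expensive image features, the linear scoring and the zero detection threshold are hypothetical, and none of the names come from the project itself.

```python
class FeatureCache:
    """Computes each named feature at most once for a given window."""

    def __init__(self, extractors, window):
        self.extractors = extractors  # feature name -> function of the window
        self.window = window
        self.values = {}              # features computed so far
        self.computed = 0             # number of extractor calls made

    def get(self, name):
        if name not in self.values:
            self.values[name] = self.extractors[name](self.window)
            self.computed += 1
        return self.values[name]


def linear_score(cache, weights):
    """Weighted sum of the features a classifier uses (hypothetical model)."""
    return sum(w * cache.get(name) for name, w in weights.items())


# Trivial stand-ins for costly image features.
extractors = {
    "mean": lambda w: sum(w) / len(w),
    "range": lambda w: max(w) - min(w),
    "first": lambda w: w[0],
}

window = [2, 4, 6, 8]
cache = FeatureCache(extractors, window)

detect_weights = {"mean": 1, "range": 1}   # detector uses two features
recog_weights = {"range": 2, "first": 1}   # recognizer shares "range"

label_score = None
if linear_score(cache, detect_weights) > 0:        # detection stage
    label_score = linear_score(cache, recog_weights)  # recognition stage

print(cache.computed, label_score)
```

The two classifiers reference four features in total, but only three extractor calls are made because the shared feature is computed during detection and reused during recognition.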

Example Results

Two example detection results show that text can be reliably located with our model and the ~20 features it selects.
