Deep CNNs With Spatially Weighted Pooling for Fine-Grained Car Recognition

Fine-grained car recognition aims to recognize the category information of a car, such as car make, car model, or even the year of manufacture. A number of recent studies have shown that a deep convolutional neural network (DCNN) trained on a large-scale data set can achieve impressive results at a range of generic object classification tasks. In this paper, we propose a spatially weighted pooling (SWP) strategy, which considerably improves the robustness and effectiveness of the feature representation of most dominant DCNNs. More specifically, the SWP is a novel pooling layer, which contains a predefined number of spatially weighted masks or pooling channels. The SWP pools the extracted features of DCNNs with the guidance of its learnt masks, which measure the importance of the spatial units in terms of discriminative power. Like the existing methods that apply uniform grid pooling on the convolutional feature maps of DCNNs, the proposed method can extract the convolutional features and generate the pooling channels from a single DCNN. Thus, minimal modification is needed in terms of implementation. Moreover, the parameters of the SWP layer can be learned in the end-to-end training process of the DCNN. By applying our method to several fine-grained car recognition data sets, we demonstrate that the proposed method can achieve better performance than recent approaches in the literature. We advance the state-of-the-art results by improving the accuracy from 92.6% to 93.1% on the Stanford Cars-196 data set and from 91.2% to 97.6% on the recent CompCars data set. We have also tested the proposed method on two additional large-scale data sets, with impressive results observed.
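The pooling step described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: it assumes that each mask is applied to every convolutional feature map as a spatially weighted sum, and the function name `spatially_weighted_pool`, the shapes (512 channels on a 7x7 grid), and the choice of 9 masks are hypothetical values picked for illustration. In a real network the masks would be trainable parameters updated by backpropagation rather than random arrays.

```python
import numpy as np


def spatially_weighted_pool(features, masks):
    """Pool feature maps with K spatial masks (illustrative sketch).

    features: array of shape (C, H, W), the convolutional feature maps.
    masks:    array of shape (K, H, W), one weight per spatial unit per mask.
    Returns a vector of length K * C.
    """
    if features.shape[1:] != masks.shape[1:]:
        raise ValueError("masks and features must share the same spatial size")
    # pooled[k, c] = sum over (h, w) of masks[k, h, w] * features[c, h, w]
    pooled = np.einsum("khw,chw->kc", masks, features)
    return pooled.reshape(-1)


rng = np.random.default_rng(0)
features = rng.standard_normal((512, 7, 7))    # assumed feature-map shape
masks = 0.01 * rng.standard_normal((9, 7, 7))  # stand-in for learnt masks
descriptor = spatially_weighted_pool(features, masks)

# Sanity check: a single uniform mask reduces to global average pooling.
uniform_mask = np.full((1, 7, 7), 1.0 / 49)
average_pooled = spatially_weighted_pool(features, uniform_mask)
```

Under these assumptions, uniform pooling is the special case in which every spatial unit receives the same weight, whereas learnt masks can give more weight to discriminative locations.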

Fine-grained object recognition, exemplified by fine-grained car recognition, has attracted much attention recently. Many works and datasets have been proposed in this research field [1]–[4]. Compared to other objects, cars have some unique properties, which provide a range of challenging research topics in object recognition. The enormous number of car models makes cars a rich object class. Moreover, cars have a large intra-class variation due to unconstrained poses and multiple viewpoints. Cars also have a unique hierarchical structure, which contains three levels from top to bottom.

This structure presents a direction to address fine-grained car recognition in a hierarchical way, which targets recognizing the identity of a car, such as car make, car model, and even the year of manufacture. In contrast to generic object classification [5]–[7], fine-grained car classification aims to distinguish subcategories within the same car category. Car model classification is an intra-class classification task which is made difficult by the small visual differences between subcategories, unconstrained poses, different illuminations, and cluttered backgrounds. In this paper, we mainly focus on car model classification.

A common approach for fine-grained classification tasks is the parts-based pooling strategy [8]. In this approach, various discriminative parts of the object are first localized, each corresponding to a human-specified object part. Then the local features falling into each part are pooled together to obtain a pooled feature vector used for classification. Those parts are often defined manually based on domain-specific knowledge, and part-based detectors are trained in a supervised manner. However, there are usually no human-specified part annotations in many fine-grained classification tasks. Annotating parts is significantly more challenging than collecting image labels. Furthermore, these human-specified parts may not be optimal for the specific task. Another line of research focuses on the robust feature representation of images, such as VLAD [9] and the Fisher vector [10] with SIFT features [11]. Recently, deep convolutional neural networks (DCNNs) have been shown to significantly outperform comparable methods on a wide variety of vision problems [12]–[14].
By replacing SIFT with features extracted from convolutional layers of a DCNN pre-trained on ImageNet [15], the Fisher vector with DCNN features [16] achieves state-of-the-art results on a number of classification tasks.

Although DCNNs achieve good results in generic object classification, their performance still falls below that of the aforementioned methods in fine-grained classification tasks. These DCNNs under-perform mainly because their architectures are not optimal for fine-grained objects, especially when objects are small and appear in clutter. A breakthrough was made recently by a cross-convolutional-layer pooling method [17]. This method extracts subarrays of the convolutional feature maps (CFMs) of a convolutional layer as local features and uses the CFMs of the successive convolutional layer as pooling channels. Then, the extracted features are pooled with these pooling channels to generate more robust image representations. This method achieves state-of-the-art results on several popular visual classification tasks. Later, a bilinear CNN framework [2] was proposed for the fine-grained
