Sketch-based image retrieval via CAT loss with elastic net regularization
Abstract
Fine-grained sketch-based image retrieval (FG-SBIR) is an important problem that uses free-hand human sketches as queries to perform instance-level retrieval of photos. Human sketches are generally highly abstract and iconic, which makes FG-SBIR a challenging task. Existing FG-SBIR approaches perform retrieval using a triplet loss with $ \ell_2 $ regularization or a higher-order energy function. These approaches neglect the feature gap between the two domains (sketches and photos) and require selecting a weight layer matrix, which leads to high computational complexity. In this paper, we define a new CAT loss function with elastic net regularization, based on an attention model. It can close the feature gap between the different subnetworks and capture the sparsity of the sketches. Experiments demonstrate that the proposed approach is competitive with state-of-the-art methods.
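To make the role of the regularizer concrete, the following is a minimal sketch of a ranking loss combined with an elastic net penalty. The abstract does not give the form of the CAT loss, so the standard triplet hinge term is used here as a stand-in; the function names, the margin, and the weights `lam1` and `lam2` are illustrative assumptions, not values from the paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_hinge(sketch, pos_photo, neg_photo, margin=0.3):
    """Standard triplet ranking term (a stand-in, NOT the paper's CAT loss).

    Zero when the matching photo is closer to the sketch than the
    non-matching photo by at least `margin` (hypothetical value).
    """
    gap = squared_distance(sketch, pos_photo) - squared_distance(sketch, neg_photo)
    return max(0.0, margin + gap)


def elastic_net(weights, lam1=1e-4, lam2=1e-4):
    """Elastic net penalty: lam1 * ||w||_1 + lam2 * ||w||_2^2.

    The l1 part encourages sparse weights; the squared l2 part is the
    usual weight decay. Both coefficients are hypothetical.
    """
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return lam1 * l1 + lam2 * l2_squared


def regularized_loss(sketch, pos_photo, neg_photo, weights,
                     margin=0.3, lam1=1e-4, lam2=1e-4):
    """Ranking term plus elastic net penalty on a flat list of weights."""
    return (triplet_hinge(sketch, pos_photo, neg_photo, margin)
            + elastic_net(weights, lam1, lam2))
```

Setting `lam1 = 0` recovers plain $ \ell_2 $ regularization, which is the setting the abstract attributes to existing triplet-loss approaches.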
Keywords:
- Sketch-based image retrieval
- CAT loss
- attention model
- elastic net
- feature gap
Mathematics Subject Classification: Primary: 68T45; Secondary: 68T05.
Table 1. Network structure

| Index | Layer | Type | Filter size | Filter number | Stride | Pad | Output size |
|---|---|---|---|---|---|---|---|
| 0 | | Input | - | - | - | - | $225\times225$ |
| 1 | L1 | Conv | $15\times15$ | 64 | 3 | 0 | $71\times71$ |
| 2 | | ReLU | - | - | - | - | $71\times71$ |
| 3 | | Maxpool | $3\times3$ | - | 2 | 0 | $35\times35$ |
| 4 | L2 | Conv | $5\times5$ | 128 | 1 | 0 | $31\times31$ |
| 5 | | ReLU | - | - | - | - | $31\times31$ |
| 6 | | Maxpool | $3\times3$ | - | 2 | 0 | $15\times15$ |
| 7 | L3 | Conv | $3\times3$ | 256 | 1 | 1 | $15\times15$ |
| 8 | | ReLU | - | - | - | - | $15\times15$ |
| 9 | L4 | Conv | $3\times3$ | 256 | 1 | 1 | $15\times15$ |
| 10 | | ReLU | - | - | - | - | $15\times15$ |
| 11 | L5 | Conv | $3\times3$ | 256 | 1 | 1 | $15\times15$ |
| 12 | | ReLU | - | - | - | - | $15\times15$ |
| 13 | | Maxpool | $3\times3$ | - | 2 | 0 | $7\times7$ |
| 14 | L6 | Conv (= FC) | $7\times7$ | 512 | 1 | 0 | $1\times1$ |
| 15 | | ReLU | - | - | - | - | $1\times1$ |
| 16 | | Dropout (0.55) | - | - | - | - | $1\times1$ |
| 17 | L7 | Conv (= FC) | $1\times1$ | 256 | 1 | 0 | $1\times1$ |
| 18 | | ReLU | - | - | - | - | $1\times1$ |
| 19 | | Dropout (0.55) | - | - | - | - | $1\times1$ |

Table 2. Comparative results against baselines on QMUL-shoe dataset

| QMUL-shoe | Acc.@1 | Acc.@10 |
|---|---|---|
| HOG+BoW+RankSVM | 17.39% | 67.83% |
| Deep ISN | 20.00% | 62.61% |
| Triplet SN | 52.17% | 92.17% |
| Triplet DSSA | 61.74% | 94.78% |
| Our model | 56.52% | 96.52% |

Table 3. Comparative results against baselines on QMUL-chair dataset

| QMUL-chair | Acc.@1 | Acc.@10 |
|---|---|---|
| HOG+BoW+RankSVM | 28.87% | 67.01% |
| Deep ISN | 47.42% | 82.47% |
| Triplet SN | 72.16% | 98.96% |
| Triplet DSSA | 81.44% | 95.88% |
| Our model | 81.44% | 98.97% |

Table 4. Comparative results against baselines on QMUL-handbag dataset

| QMUL-handbag | Acc.@1 | Acc.@10 |
|---|---|---|
| HOG+BoW+RankSVM | 2.38% | 10.71% |
| Deep ISN | 9.52% | 44.05% |
| Triplet SN | 39.88% | 82.14% |
| Triplet DSSA | 49.40% | 82.74% |
| Our model | 54.76% | 88.69% |

Table 5. Contributions of different components

| QMUL-shoe | Acc.@1 | Acc.@10 |
|---|---|---|
| Triplet loss+data aug | 50.43% | 93.91% |
| CAT loss+no data aug | 49.57% | 94.78% |
| Our model | 54.78% | 96.52% |

| QMUL-chair | Acc.@1 | Acc.@10 |
|---|---|---|
| Triplet loss+data aug | 78.35% | 97.94% |
| CAT loss+no data aug | 76.29% | 96.91% |
| Our model | 81.44% | 98.97% |

| QMUL-handbag | Acc.@1 | Acc.@10 |
|---|---|---|
| Triplet loss+data aug | 51.19% | 86.31% |
| CAT loss+no data aug | 51.79% | 86.90% |
| Our model | 54.76% | 88.69% |
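The output sizes listed in Table 1 can be checked with the usual convolution and pooling size formula, $ \lfloor (n + 2p - k)/s \rfloor + 1 $, for input size $ n $, filter size $ k $, stride $ s $ and padding $ p $. The short sketch below assumes this floor convention (the table itself does not state it) and traces the spatial size through the size-changing rows; the helper names are illustrative.

```python
def out_size(in_size, kernel, stride, pad):
    """Spatial output size of a conv or pooling layer (floor convention)."""
    return (in_size + 2 * pad - kernel) // stride + 1


# (name, filter size, stride, pad) for the size-changing rows of Table 1.
# ReLU and Dropout rows keep the spatial size and are omitted.
LAYERS = [
    ("L1 conv", 15, 3, 0),
    ("maxpool", 3, 2, 0),
    ("L2 conv", 5, 1, 0),
    ("maxpool", 3, 2, 0),
    ("L3 conv", 3, 1, 1),
    ("L4 conv", 3, 1, 1),
    ("L5 conv", 3, 1, 1),
    ("maxpool", 3, 2, 0),
    ("L6 conv (= FC)", 7, 1, 0),
    ("L7 conv (= FC)", 1, 1, 0),
]


def trace_sizes(input_size=225):
    """Return the spatial size after the input and after each listed layer."""
    sizes = [input_size]
    for _name, kernel, stride, pad in LAYERS:
        sizes.append(out_size(sizes[-1], kernel, stride, pad))
    return sizes


if __name__ == "__main__":
    for (name, *_), size in zip(LAYERS, trace_sizes()[1:]):
        print(f"{name}: {size}x{size}")
```

Starting from the $ 225\times225 $ input, the trace gives 71, 35, 31, 15, 15, 15, 15, 7, 1, 1, which agrees with the Output size column of Table 1.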
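Tables 2 to 5 report Acc.@1 and Acc.@10. The metric is not defined in this excerpt; the sketch below assumes the common retrieval reading, in which a query sketch counts as correct at $ K $ when its true matching photo is among the $ K $ photos closest to it. The function name and the toy distance matrix are illustrative and unrelated to the reported numbers.

```python
def acc_at_k(distance_rows, true_indices, k):
    """Fraction of queries whose true photo is among the k nearest photos.

    distance_rows[i][j] is the distance from query sketch i to photo j;
    true_indices[i] is the index of the photo that matches sketch i.
    """
    hits = 0
    for row, true_idx in zip(distance_rows, true_indices):
        # Photo indices sorted from nearest to farthest.
        ranked = sorted(range(len(row)), key=lambda j: row[j])
        if true_idx in ranked[:k]:
            hits += 1
    return hits / len(true_indices)


if __name__ == "__main__":
    # Toy example: 3 query sketches, 3 gallery photos.
    distances = [
        [0.1, 0.5, 0.9],  # true photo 0 is ranked first
        [0.7, 0.2, 0.4],  # true photo 2 is ranked second
        [0.3, 0.6, 0.8],  # true photo 2 is ranked third
    ]
    truth = [0, 2, 2]
    for k in (1, 2, 3):
        print(f"Acc.@{k} = {acc_at_k(distances, truth, k):.2%}")
```

In the toy example one of three queries is correct at $ K = 1 $, two at $ K = 2 $ and all three at $ K = 3 $.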
- Figure 1. Architecture of the model
- Figure 2. Examples of stroke removal