Sketch-based image retrieval via CAT loss with elastic net regularization

Jia Cai; Guanglong Xu; Zhensheng Hu

doi:10.3934/mfc.2020013


2020, Volume 3, Issue 4: 219-227. Doi: 10.3934/mfc.2020013



1. School of Statistics and Mathematics, Big Data and Educational Statistics Application Laboratory, Collaborative Innovation Development Center of Pearl River Delta Science & Technology Finance Industry, Guangdong University of Finance & Economics, Guangzhou, Guangdong, 510320, China
2. School of Statistics and Mathematics, Guangdong University of Finance & Economics, Guangzhou, Guangdong, 510320, China
3. Information Science School, Guangdong University of Finance & Economics, Guangzhou, Guangdong, 510320, China

^* Corresponding author: Jia Cai

Received: December 2019

Revised: March 2020

Early access: June 2020

Published: November 2020

The first author is supported partially by National Natural Science Foundation of China (11871167, 11671171), Science and Technology Program of Guangzhou (201707010228), Special Support Plan for High-Level Talents of Guangdong Province (2019TQ05X571), Foundation of Guangdong Educational Committee (2019KZDZX1023), Project of Collaborative Innovation Development Center of Pearl River Delta Science & Technology Finance Industry (19XT01), National Social Science Foundation (19AJY027), and Natural Science Foundation of Guangdong (2016A030313710).


Abstract

Fine-grained sketch-based image retrieval (FG-SBIR) is an important problem that uses free-hand human sketches as queries to perform instance-level retrieval of photos. Human sketches are generally highly abstract and iconic, which makes FG-SBIR a challenging task. Existing FG-SBIR approaches perform retrieval using a triplet loss with $ \ell_2 $ regularization or a higher-order energy function. These approaches neglect the feature gap between the two domains (sketches and photos) and need to select the weight layer matrix, which yields high computational complexity. In this paper, we define a new CAT loss function with elastic net regularization based on an attention model. It can close the feature gap between the different subnetworks and embody the sparsity of the sketches. Experiments demonstrate that the proposed approach is competitive with state-of-the-art methods.
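The abstract names two ingredients, a triplet-style ranking loss and elastic net regularization, without giving formulas. The following is a minimal sketch of how such a combination is commonly written; it is not the paper's CAT loss, whose definition does not appear here. The function names, the margin, and the penalty weights `l1` and `l2` are all hypothetical, and the elastic net term is taken in its usual form as a weighted sum of the $ \ell_1 $ norm and the squared $ \ell_2 $ norm of the weights.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_hinge(sketch, pos_photo, neg_photo, margin=0.3):
    """Generic triplet ranking loss (assumed form, not the CAT loss).

    It is zero once the matching photo is closer to the sketch than the
    non-matching photo by at least `margin`.
    """
    d_pos = squared_distance(sketch, pos_photo)
    d_neg = squared_distance(sketch, neg_photo)
    return max(0.0, margin + d_pos - d_neg)


def elastic_net_penalty(weights, l1=1e-4, l2=1e-4):
    """Elastic net penalty: l1 * sum|w| + l2 * sum(w^2).

    The l1 part encourages sparse weights; the l2 part keeps them small.
    """
    return l1 * sum(abs(w) for w in weights) + l2 * sum(w * w for w in weights)


def regularized_triplet_loss(sketch, pos_photo, neg_photo, weights,
                             margin=0.3, l1=1e-4, l2=1e-4):
    """Ranking loss plus elastic net penalty on a flat list of weights."""
    return (triplet_hinge(sketch, pos_photo, neg_photo, margin)
            + elastic_net_penalty(weights, l1, l2))
```

Setting `l1=0` recovers a plain $ \ell_2 $-regularized triplet loss, the setting the abstract attributes to existing approaches; the $ \ell_1 $ term is the part that can push weights to exactly zero.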


Mathematics Subject Classification: Primary: 68T45; Secondary: 68T05.


Figure 1. Architecture of the model


Figure 2. Examples of stroke removal


Table 1. Network structure

Index	Layer	Type	Filter size	Filter number	Stride	Pad	Output size
0		Input	-	-	-	-	$225\times225$
1	L1	Conv	$15\times15$	64	3	0	$71\times71$
2		ReLU	-	-	-	-	$71\times71$
3		Maxpool	$3\times3$	-	2	0	$35\times35$
4	L2	Conv	$5\times5$	128	1	0	$31\times31$
5		ReLU	-	-	-	-	$31\times31$
6		Maxpool	$3\times3$	-	2	0	$15\times15$
7	L3	Conv	$3\times3$	256	1	1	$15\times15$
8		ReLU	-	-	-	-	$15\times15$
9	L4	Conv	$3\times3$	256	1	1	$15\times15$
10		ReLU	-	-	-	-	$15\times15$
11	L5	Conv	$3\times3$	256	1	1	$15\times15$
12		ReLU	-	-	-	-	$15\times15$
13		Maxpool	$3\times3$	-	2	0	$7\times7$
14	L6	Conv (= FC)	$7\times7$	512	1	0	$1\times1$
15		ReLU	-	-	-	-	$1\times1$
16		Dropout (0.55)	-	-	-	-	$1\times1$
17	L7	Conv (= FC)	$1\times1$	256	1	0	$1\times1$
18		ReLU	-	-	-	-	$1\times1$
19		Dropout (0.55)	-	-	-	-	$1\times1$
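The output sizes in Table 1 can be checked with the standard convolution and pooling size formula, output = floor((input + 2·pad − filter) / stride) + 1. The sketch below is only an arithmetic check of the table; the names `out_size`, `LAYERS` and `trace_sizes` are illustrative and not taken from the paper. ReLU and dropout rows are omitted because they do not change the spatial size.

```python
def out_size(n, k, stride, pad):
    """Spatial output size of a conv or pooling layer on an n x n input."""
    return (n + 2 * pad - k) // stride + 1


# (name, filter size, stride, pad) for each size-changing row of Table 1.
LAYERS = [
    ("L1 conv", 15, 3, 0),
    ("maxpool", 3, 2, 0),
    ("L2 conv", 5, 1, 0),
    ("maxpool", 3, 2, 0),
    ("L3 conv", 3, 1, 1),
    ("L4 conv", 3, 1, 1),
    ("L5 conv", 3, 1, 1),
    ("maxpool", 3, 2, 0),
    ("L6 conv (= FC)", 7, 1, 0),
    ("L7 conv (= FC)", 1, 1, 0),
]


def trace_sizes(n=225):
    """Return the spatial size after the input and after each layer."""
    sizes = [n]
    for _name, k, stride, pad in LAYERS:
        n = out_size(n, k, stride, pad)
        sizes.append(n)
    return sizes


if __name__ == "__main__":
    for (name, *_), size in zip(LAYERS, trace_sizes()[1:]):
        print(f"{name:15s} -> {size} x {size}")
```

Every division in this chain is exact, so the result does not depend on whether a framework rounds pooling sizes down or up.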


Table 2. Comparative results against baselines on QMUL-shoe dataset

QMUL-shoe	Acc.@1	Acc.@10
HOG+BoW+RankSVM	17.39%	67.83%
Deep ISN	20.00%	62.61%
Triplet SN	52.17%	92.17%
Triplet DSSA	61.74%	94.78%
Our model	56.52%	96.52%
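The tables report Acc.@1 and Acc.@10 without defining them here. A minimal sketch, assuming the usual retrieval meaning (the fraction of query sketches whose true-match photo is ranked within the top K) and assuming that sketch i is paired with photo i; the function name and the toy distance matrix are hypothetical:

```python
def acc_at_k(distances, k):
    """Fraction of queries whose true match is ranked in the top k.

    distances[i][j] is the distance between query sketch i and gallery
    photo j. The true match of sketch i is assumed to be photo i.
    """
    hits = 0
    for i, row in enumerate(distances):
        # Rank of the true match = number of photos strictly closer to the query.
        rank = sum(1 for d in row if d < row[i])
        if rank < k:
            hits += 1
    return hits / len(distances)


if __name__ == "__main__":
    toy = [
        [0.1, 0.5, 0.9],  # true match ranked first
        [0.4, 0.3, 0.2],  # true match ranked second
        [0.7, 0.2, 0.6],  # true match ranked second
    ]
    print(acc_at_k(toy, 1), acc_at_k(toy, 2))
```

Under this convention Acc.@10 can never be lower than Acc.@1, since a match ranked first is also within the top ten.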


Table 3. Comparative results against baselines on QMUL-chair dataset

QMUL-chair	Acc.@1	Acc.@10
HOG+BoW+RankSVM	28.87%	67.01%
Deep ISN	47.42%	82.47%
Triplet SN	72.16%	98.96%
Triplet DSSA	81.44%	95.88%
Our model	81.44%	98.97%


Table 4. Comparative results against baselines on QMUL-handbag dataset

QMUL-handbag	Acc.@1	Acc.@10
HOG+BoW+RankSVM	2.38%	10.71%
Deep ISN	9.52%	44.05%
Triplet SN	39.88%	82.14%
Triplet DSSA	49.40%	82.74%
Our model	54.76%	88.69%


Table 5. Contributions of different components

Dataset	Method	Acc.@1	Acc.@10
QMUL-shoe	Triplet loss+data aug	50.43%	93.91%
QMUL-shoe	CAT loss+no data aug	49.57%	94.78%
QMUL-shoe	Our model	54.78%	96.52%
QMUL-chair	Triplet loss+data aug	78.35%	97.94%
QMUL-chair	CAT loss+no data aug	76.29%	96.91%
QMUL-chair	Our model	81.44%	98.97%
QMUL-handbag	Triplet loss+data aug	51.19%	86.31%
QMUL-handbag	CAT loss+no data aug	51.79%	86.90%
QMUL-handbag	Our model	54.76%	88.69%


