Learning multiple solutions to computer vision problems

Deshpande, Aditya Rajiv

Permalink

https://hdl.handle.net/2142/107964

Description

Title

Learning multiple solutions to computer vision problems

Author(s)

Deshpande, Aditya Rajiv

Issue Date

2020-05-05

Director of Research (if dissertation) or Advisor (if thesis)

Forsyth, David

Doctoral Committee Chair(s)

Forsyth, David

Committee Member(s)

Schwing, Alexander G.
Lazebnik, Svetlana
Batra, Dhruv

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

Computer Science, Computer Vision, Image Colorization, Image Captioning

Abstract

Advances in general-purpose computing on GPUs [1, 2, 3, 4, 5, 6, 7] have led to a resurgence of deep learning methods in computer vision, and these methods have since produced major successes in the field. Prominent examples include progress on image classification [8, 9, 10], image segmentation [11, 12, 13, 14, 15], object detection [16, 17, 12, 18], and vision-language tasks such as image captioning [19, 20, 21] and visual question answering [22, 23, 24, 25, 26]. Convolutional neural networks [27, 28] and/or recurrent neural networks [29] trained to regress to a single value or classify to a single class label are the workhorse of most of these methods. However, many computer vision problems are ambiguous, i.e. they have more than one plausible solution. Therefore, we need methods that (a) estimate the multi-modal (i.e. with multiple peaks) probability distribution in the output space, and (b) produce diverse and meaningful solutions from the estimated multi-modal probability distribution.

In this thesis, we tackle ambiguous problems (which have multiple solutions) such as image colorization, image captioning and scene-graph prediction. Our strategy to generate multiple solutions is as follows. (i) We first generate multiple proposals given the input image. These proposals are not to be confused with the object or region proposals of detection networks. Each proposal encodes the properties/characteristics of the corresponding solution/output before the output is generated. (ii) Given a proposal, we then generate the solution/output, which (approximately) adheres to the constraints specified in the proposal. Multiple proposals allow us to generate multiple solutions.

More than one colorization is feasible for a grey-level image. For image colorization, we develop a regression-based method to generate a single colorization. Then, we use histograms as proposals with this regression method. Our histogram proposals encode different color schemes desired in the output colorizations. Thus, using different histograms allows our method to generate multiple realistic colorizations. Histogram proposals are difficult for a user to input; therefore, we also develop a method that uses automatically generated (or learned) latent proposals. Our latent proposal method uses a combination of variational auto-encoders [30] and mixture density networks [31] to perform multiple colorization. To the best of our knowledge, this is the first method that demonstrates learned multiple colorizations.

For image captioning, the goal is to produce a sentence (i.e. caption) that describes the input image. Any given image can be described in many ways; therefore, multiple captions are correct. We show that convolutional networks can be used to model the language (or output caption), which was previously done using recurrent networks only. We find that convolutional networks produce posteriors over output words with higher entropy. Therefore, more unique words and n-grams get sampled in the output captions. This demonstrates that the posterior probability modeled by convolutional neural networks encodes more diversity and is multi-modal. Then, we use part-of-speech as our proposals with our convolutional captioning method to sample multiple captions. Part-of-speech encodes different language/syntactic structure of the output caption. Therefore, our method generates captions that have meaningfully different language or sentence structure. We show that our sampling of captions is fast, accurate and diverse.

Finally, we propose a new method to build scene graphs that uses object-object (or paired object) proposals. Our part-of-speech technique helped add syntactic diversity to the multiple captions. In future work, scene graphs can be used to add semantic diversity to the captioning methods and help obtain diverse captions.
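The link the abstract draws between the entropy of a word posterior and the diversity of sampled words can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the four-word vocabulary, the two hand-written distributions, and the helper names `entropy` and `sample_unique` are hypothetical and are not taken from the dissertation, which uses trained captioning networks rather than fixed distributions.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sample_unique(vocab, probs, n_samples, seed=0):
    """Draw n_samples words from the distribution and return the distinct ones."""
    rng = random.Random(seed)
    return set(rng.choices(vocab, weights=probs, k=n_samples))


# Hypothetical next-word posteriors over a toy vocabulary (illustrative only).
vocab = ["dog", "puppy", "animal", "pet"]
peaked = [0.97, 0.01, 0.01, 0.01]  # low entropy: almost always "dog"
flat = [0.25, 0.25, 0.25, 0.25]    # maximum entropy for four words

h_peaked = entropy(peaked)
h_flat = entropy(flat)

print(f"entropy (peaked): {h_peaked:.3f} nats")
print(f"entropy (flat):   {h_flat:.3f} nats")
print("words sampled (peaked):", sorted(sample_unique(vocab, peaked, 20)))
print("words sampled (flat):  ", sorted(sample_unique(vocab, flat, 20)))
```

A flatter posterior spreads probability mass over more words, so repeated sampling tends to visit more of the vocabulary. This is only the general intuition behind the abstract's claim, not a reproduction of its experiments.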

Graduation Semester

2020-05

Type of Resource

Thesis

Owning Collections

Graduate Dissertations and Theses at Illinois (primary)

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Computer Science

Dissertations and Theses from the Dept. of Computer Science
