
C-Reference: Improving 2D to 3D Object Pose Estimation Accuracy via Crowdsourced Joint Object Estimation

Published: 29 May 2020

Abstract

Converting widely available 2D images and videos, captured using an RGB camera, to 3D can help accelerate the training of machine learning systems in spatial reasoning domains ranging from in-home assistive robots to augmented reality to autonomous vehicles. However, automating this task is challenging because it requires not only accurately estimating object location and orientation, but also knowing camera properties that are currently unknown (e.g., focal length). A scalable way to combat this problem is to leverage people's spatial understanding of scenes by crowdsourcing visual annotations of 3D object properties. Unfortunately, getting people to directly estimate 3D properties reliably is difficult due to the limitations of image resolution, human motor accuracy, and people's 3D perception (i.e., humans do not "see" depth like a laser range finder). In this paper, we propose a crowd-machine hybrid approach that jointly uses crowds' approximate measurements of multiple in-scene objects to estimate the 3D state of a single target object. Our approach can generate accurate estimates of the target object by combining heterogeneous knowledge from multiple contributors about different objects that share a spatial relationship with the target object. We evaluate our joint object estimation approach with 363 crowd workers and show that our method can reduce errors in the target object's 3D location estimation by over 40%, while requiring only 35% as much human time. Our work introduces a novel way to enable groups of people with different perspectives and knowledge to achieve more accurate collective performance on challenging visual annotation tasks.
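
To make the idea of combining approximate measurements of several in-scene objects more concrete, the following is a minimal sketch and not the method from the paper. It assumes each contributor's annotation of a reference object yields an estimated 3D position for that object, that the offset from the reference object to the target is known, and that each estimate comes with a noise level. The class `ReferenceObservation`, the function `fuse_target_position`, and the use of an inverse-variance weighted mean are all illustrative assumptions; the author tags suggest the paper itself relies on optimization with soft constraints.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class ReferenceObservation:
    """One crowd-derived estimate tied to a reference object (hypothetical structure)."""
    reference_position: Vec3  # estimated 3D position of the reference object
    offset_to_target: Vec3    # assumed known relationship: target minus reference
    sigma: float              # assumed standard deviation of this estimate


def fuse_target_position(observations: Sequence[ReferenceObservation]) -> Vec3:
    """Inverse-variance weighted mean of the target positions implied by each observation."""
    if not observations:
        raise ValueError("need at least one observation")
    total_weight = 0.0
    weighted_sum = [0.0, 0.0, 0.0]
    for obs in observations:
        if obs.sigma <= 0:
            raise ValueError("sigma must be positive")
        weight = 1.0 / obs.sigma ** 2  # less noisy estimates count for more
        total_weight += weight
        for axis in range(3):
            # Target position implied by this reference object alone.
            implied = obs.reference_position[axis] + obs.offset_to_target[axis]
            weighted_sum[axis] += weight * implied
    return (
        weighted_sum[0] / total_weight,
        weighted_sum[1] / total_weight,
        weighted_sum[2] / total_weight,
    )


if __name__ == "__main__":
    observations = [
        ReferenceObservation((1.0, 0.0, 5.0), (1.0, 0.0, 0.0), sigma=1.0),
        ReferenceObservation((3.0, 0.0, 5.0), (1.0, 0.0, 0.0), sigma=2.0),
    ]
    print(fuse_target_position(observations))
```

In this sketch the first observation implies a target at x = 2 and the second at x = 4; because the second is assumed to be twice as noisy, the fused estimate sits closer to the first. A weighted mean is only the simplest way to pool such evidence, and a real system would also have to handle the unknown camera properties the abstract mentions.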



Published In

Proceedings of the ACM on Human-Computer Interaction, Volume 4, Issue CSCW1
CSCW
May 2020
1285 pages
EISSN: 2573-0142
DOI: 10.1145/3403424
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 May 2020
Published in PACMHCI Volume 4, Issue CSCW1


Author Tags

  1. 3D pose estimation
  2. answer aggregation
  3. computer vision
  4. crowdsourcing
  5. human computation
  6. optimization
  7. soft constraints

Qualifiers

  • Research-article
