-
SIGMA: An Open-Source Interactive System for Mixed-Reality Task Assistance Research
Authors:
Dan Bohus,
Sean Andrist,
Nick Saw,
Ann Paradiso,
Ishani Chakraborty,
Mahdi Rad
Abstract:
We introduce an open-source system called SIGMA (short for "Situated Interactive Guidance, Monitoring, and Assistance") as a platform for conducting research on task-assistive agents in mixed-reality scenarios. The system leverages the sensing and rendering affordances of a head-mounted mixed-reality device in conjunction with large language and vision models to guide users step by step through procedural tasks. We present the system's core capabilities, discuss its overall design and implementation, and outline directions for future research enabled by the system. SIGMA is easily extensible and provides a useful basis for future research at the intersection of mixed reality and AI. By open-sourcing an end-to-end implementation, we aim to lower the barrier to entry, accelerate research in this space, and chart a path towards community-driven end-to-end evaluation of large language, vision, and multimodal models in the context of real-world interactive applications.
Submitted 16 May, 2024;
originally announced May 2024.
-
Conformalized Physics-Informed Neural Networks
Authors:
Lena Podina,
Mahdi Torabi Rad,
Mohammad Kohandel
Abstract:
Physics-informed neural networks (PINNs) are an influential method of solving differential equations and estimating their parameters given data. However, since they make use of neural networks, they provide only a point estimate of differential equation parameters, as well as the solution at any given point, without any measure of uncertainty. Ensemble and Bayesian methods have been previously applied to quantify the uncertainty of PINNs, but these methods may require making strong assumptions on the data-generating process, and can be computationally expensive. Here, we introduce Conformalized PINNs (C-PINNs) that, without making any additional assumptions, utilize the framework of conformal prediction to quantify the uncertainty of PINNs by providing intervals that have finite-sample, distribution-free statistical validity.
Submitted 13 May, 2024;
originally announced May 2024.
-
Scaling Instructable Agents Across Many Simulated Worlds
Authors:
SIMA Team,
Maria Abi Raad,
Arun Ahuja,
Catarina Barros,
Frederic Besse,
Andrew Bolt,
Adrian Bolton,
Bethanie Brownfield,
Gavin Buttimore,
Max Cant,
Sarah Chakera,
Stephanie C. Y. Chan,
Jeff Clune,
Adrian Collister,
Vikki Copeman,
Alex Cullum,
Ishita Dasgupta,
Dario de Cesare,
Julia Di Trapani,
Yani Donchev,
Emma Dunleavy,
Martin Engelcke,
Ryan Faulkner,
Frankie Garcia,
Charles Gbadamosi
, et al. (68 additional authors not shown)
Abstract:
Building embodied AI systems that can follow arbitrary language instructions in any 3D environment is a key challenge for creating general AI. Accomplishing this goal requires learning to ground language in perception and embodied actions, in order to accomplish complex tasks. The Scalable, Instructable, Multiworld Agent (SIMA) project tackles this by training agents to follow free-form instructions across a diverse range of virtual 3D environments, including curated research environments as well as open-ended, commercial video games. Our goal is to develop an instructable agent that can accomplish anything a human can do in any simulated 3D environment. Our approach focuses on language-driven generality while imposing minimal assumptions. Our agents interact with environments in real-time using a generic, human-like interface: the inputs are image observations and language instructions and the outputs are keyboard-and-mouse actions. This general approach is challenging, but it allows agents to ground language across many visually complex and semantically rich environments while also allowing us to readily run agents in new environments. In this paper we describe our motivation and goal, the initial progress we have made, and promising preliminary results on several diverse research environments and a variety of commercial video games.
Submitted 17 April, 2024; v1 submitted 13 March, 2024;
originally announced April 2024.
-
Evaluating Frontier Models for Dangerous Capabilities
Authors:
Mary Phuong,
Matthew Aitchison,
Elliot Catt,
Sarah Cogan,
Alexandre Kaskasoli,
Victoria Krakovna,
David Lindner,
Matthew Rahtz,
Yannis Assael,
Sarah Hodkinson,
Heidi Howard,
Tom Lieberum,
Ramana Kumar,
Maria Abi Raad,
Albert Webson,
Lewis Ho,
Sharon Lin,
Sebastian Farquhar,
Marcus Hutter,
Gregoire Deletang,
Anian Ruoss,
Seliem El-Sayed,
Sasha Brown,
Anca Dragan,
Rohin Shah
, et al. (2 additional authors not shown)
Abstract:
To understand the risks posed by a new AI system, we must understand what it can and cannot do. Building on prior work, we introduce a programme of new "dangerous capability" evaluations and pilot them on Gemini 1.0 models. Our evaluations cover four areas: (1) persuasion and deception; (2) cyber-security; (3) self-proliferation; and (4) self-reasoning. We do not find evidence of strong dangerous capabilities in the models we evaluated, but we flag early warning signs. Our goal is to help advance a rigorous science of dangerous capability evaluation, in preparation for future models.
Submitted 5 April, 2024; v1 submitted 20 March, 2024;
originally announced March 2024.
-
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Authors:
Gemini Team,
Petko Georgiev,
Ving Ian Lei,
Ryan Burnell,
Libin Bai,
Anmol Gulati,
Garrett Tanzer,
Damien Vincent,
Zhufeng Pan,
Shibo Wang,
Soroosh Mariooryad,
Yifan Ding,
Xinyang Geng,
Fred Alcober,
Roy Frostig,
Mark Omernick,
Lexi Walker,
Cosmin Paduraru,
Christina Sorokin,
Andrea Tacchetti,
Colin Gaffney,
Samira Daruki,
Olcan Sercinoglu,
Zach Gleicher,
Juliette Love
, et al. (1092 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
Submitted 14 June, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
RFBES at SemEval-2024 Task 8: Investigating Syntactic and Semantic Features for Distinguishing AI-Generated and Human-Written Texts
Authors:
Mohammad Heydari Rad,
Farhan Farsi,
Shayan Bali,
Romina Etezadi,
Mehrnoush Shamsfard
Abstract:
Nowadays, the usage of Large Language Models (LLMs) has increased, and LLMs have been used to generate texts in different languages and for different tasks. Additionally, due to the participation of remarkable companies such as Google and OpenAI, LLMs are now more accessible, and people can easily use them. However, an important issue is how we can detect AI-generated texts from human-written ones. In this article, we have investigated the problem of AI-generated text detection from two different aspects: semantics and syntax. Finally, we presented an AI model that can distinguish AI-generated texts from human-written ones with high accuracy on both multilingual and monolingual tasks using the M4 dataset. According to our results, using a semantic approach would be more helpful for detection. However, there is a lot of room for improvement in the syntactic approach, and it would be a good approach for future work.
Submitted 18 February, 2024;
originally announced February 2024.
-
Gemini: A Family of Highly Capable Multimodal Models
Authors:
Gemini Team,
Rohan Anil,
Sebastian Borgeaud,
Jean-Baptiste Alayrac,
Jiahui Yu,
Radu Soricut,
Johan Schalkwyk,
Andrew M. Dai,
Anja Hauth,
Katie Millican,
David Silver,
Melvin Johnson,
Ioannis Antonoglou,
Julian Schrittwieser,
Amelia Glaese,
Jilin Chen,
Emily Pitler,
Timothy Lillicrap,
Angeliki Lazaridou,
Orhan Firat,
James Molloy,
Michael Isard,
Paul R. Barham,
Tom Hennigan,
Benjamin Lee
, et al. (1325 additional authors not shown)
Abstract:
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
Submitted 17 June, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World
Authors:
Xin Wang,
Taein Kwon,
Mahdi Rad,
Bowen Pan,
Ishani Chakraborty,
Sean Andrist,
Dan Bohus,
Ashley Feniello,
Bugra Tekin,
Felipe Vieira Frujeri,
Neel Joshi,
Marc Pollefeys
Abstract:
Building an interactive AI assistant that can perceive, reason, and collaborate with humans in the real world has been a long-standing pursuit in the AI community. This work is part of a broader research effort to develop intelligent agents that can interactively guide humans through performing tasks in the physical world. As a first step in this direction, we introduce HoloAssist, a large-scale egocentric human interaction dataset, where two people collaboratively complete physical manipulation tasks. The task performer executes the task while wearing a mixed-reality headset that captures seven synchronized data streams. The task instructor watches the performer's egocentric video in real time and guides them verbally. By augmenting the data with action and conversational annotations and observing the rich behaviors of various participants, we present key insights into how human assistants correct mistakes, intervene in the task completion procedure, and ground their instructions to the environment. HoloAssist spans 166 hours of data captured by 350 unique instructor-performer pairs. Furthermore, we construct and present benchmarks on mistake detection, intervention type prediction, and hand forecasting, along with detailed analysis. We expect HoloAssist will provide an important resource for building AI assistants that can fluidly collaborate with humans in the real world. Data can be downloaded at https://holoassist.github.io/.
Submitted 29 September, 2023;
originally announced September 2023.
-
CaSAR: Contact-aware Skeletal Action Recognition
Authors:
Junan Lin,
Zhichao Sun,
Enjie Cao,
Taein Kwon,
Mahdi Rad,
Marc Pollefeys
Abstract:
Skeletal Action recognition from an egocentric view is important for applications such as interfaces in AR/VR glasses and human-robot interaction, where the device has limited resources. Most of the existing skeletal action recognition approaches use 3D coordinates of hand joints and 8-corner rectangular bounding boxes of objects as inputs, but they do not capture how the hands and objects interact with each other within the spatial context. In this paper, we present a new framework called Contact-aware Skeletal Action Recognition (CaSAR). It uses novel representations of hand-object interaction that encompass spatial information: 1) contact points where the hand joints meet the objects, 2) distant points where the hand joints are far away from the object and nearly not involved in the current action. Our framework is able to learn how the hands touch or stay away from the objects for each frame of the action sequence, and use this information to predict the action class. We demonstrate that our approach achieves the state-of-the-art accuracy of 91.3% and 98.4% on two public datasets, H2O and FPHA, respectively.
Submitted 17 September, 2023;
originally announced September 2023.
-
Intermittent in-situ high-resolution X-ray microscopy of 400-nm porous glass under uniaxial compression: study of pore changes and crack formation
Authors:
Sebastian Schäfer,
François Willot,
Mansoureh Norouzi Rad,
Stephen T. Kelly,
Dirk Enke,
Juliana Martins de Souza e Silva
Abstract:
The properties of porous glasses and their field of application strongly depend on the characteristics of the void space. Understanding the relationship between their porous structure and failure behaviour can contribute to the development of porous glasses with long-term reliability optimized for specific applications. In the present work, we used X-ray computed tomography with nanometric resolution (nano-CT) to image a controlled pore glass (CPG) with 400 nm-sized pores whilst undergoing uniaxial compression in-situ to emulate a stress process. Our results show that in-situ nano-CT provides an ideal platform for identifying the mechanisms of damage within glass with pores of 400 nm, as it allowed the tracking of the pores and struts change of shape during compression until specimen failure. We have also applied computational tools to quantify the microstructural changes within the CPG sample by mapping the displacements and strain fields, and to numerically simulate the behaviour of the CPG using a Fast Fourier Transform/phase-field method. Both experimental and numerical data show local shear deformation, organized along bands, consistent with the appearance and propagation of +/- 45 degrees cracks.
Submitted 4 July, 2023;
originally announced July 2023.
-
Broadband high-resolution integrated spectrometer architecture & data processing method
Authors:
Mehedi Hasan,
Gazi Mahamud Hasan,
Houman Ghorbani,
Mohammad Rad,
Peng Liu,
Eric Bernier,
Trevor Hall
Abstract:
Up-to-date network telemetry is the key enabler for resource optimization by capacity scaling, fault recovery, and network reconfiguration among other means. Reliable optical performance monitoring in general and, specifically, the monitoring of the spectral profile of WDM signals in fixed- and flex- grid architectures across the entire C-band, remains challenging. This article describes a two-stage spectrometer architecture amenable to integration on a single chip that can measure quantitatively the spectrum across the entire C-band with a resolution of 1 GHz approximately. The first stage consists of a ring resonator with intra-ring phase shifter to provide a tuneable fine filter. The second stage makes use of an AWG subsystem and novel processing algorithm to synthesize a tuneable coarse filter with a flat passband which isolates individual resonances of a multiplicity of ring resonances. Due to its maturity and low loss, CMOS compatible Si$_3$N$_4$ is chosen for integration. A fabricated ring resonator functioning over the entire C-band with 1.3 GHz FWHM bandwidth resonances tunable over a complete free spectral range of 50 GHz is experimentally demonstrated. The complete system operation is demonstrated using an industry standard simulation tool and AWG constructor data. The operation of the circuit is invariant to the optical path length between individual components substantially improving robustness to fabrication process variations.
Submitted 16 January, 2023;
originally announced January 2023.
-
A biologically interfaced evolvable organic pattern classifier
Authors:
Jennifer Gerasimov,
Deyu Tu,
Vivek Hitaishi,
Padinhare Cholakkal Harikesh,
Chi-Yuan Yang,
Tobias Abrahamsson,
Meysam Rad,
Mary J. Donahue,
Malin Silverå Ejneby,
Magnus Berggren,
Robert Forchheimer,
Simone Fabiano
Abstract:
Future brain-computer interfaces will require local and highly individualized signal processing of fully integrated electronic circuits within the nervous system and other living tissue. New devices will need to be developed that can receive data from a sensor array, process data into meaningful information, and translate that information into a format that living systems can interpret. Here, we report the first example of interfacing a hardware-based pattern classifier with a biological nerve. The classifier implements the Widrow-Hoff learning algorithm on an array of evolvable organic electrochemical transistors (EOECTs). The EOECTs' channel conductance is modulated in situ by electropolymerizing the semiconductor material within the channel, allowing for low voltage operation, high reproducibility, and an improvement in state retention of two orders of magnitude over state-of-the-art OECT devices. The organic classifier is interfaced with a biological nerve using an organic electrochemical spiking neuron to translate the classifier's output to a simulated action potential. The latter is then used to stimulate muscle contraction selectively based on the input pattern, thus paving the way for the development of closed-loop therapeutic systems.
Submitted 29 November, 2022;
originally announced November 2022.
-
On Radical of Intuitionistic Fuzzy Primary Submodule
Authors:
Abbas Taherpour,
Shaban Ghalandarzadeh,
Parastoo Malakooti Rad,
Parvin Safari
Abstract:
In this paper, we further study the theory of Intuitionistic fuzzy submodules and we will define intuitionistic fuzzy primary submodule with the help of the definition of a radical submodule, and we also study the properties of these submodules. Furthermore, homomorphic image and pre-image of intuitionistic fuzzy primary submodule are investigated.
Submitted 20 July, 2022;
originally announced July 2022.
-
Automatic detection of equiaxed dendrites using computer vision neural networks
Authors:
A. Viardin,
K. Noth,
M. Torabi Rad,
L. Sturz
Abstract:
Equiaxed dendrites are frequently encountered in solidification. They typically form in large numbers, which makes their detection, localization, and tracking practically impossible for a human eye. In this paper, we show how recent progress in the field of machine learning can be leveraged to tackle this problem and we present a computer vision neural network to automatically detect equiaxed dendrites. Our network is trained using phase-field simulation results, and proper data augmentation allows the detection task to be performed in solidification conditions entirely different from those simulated for training. For example, here we show how it can successfully detect dendrites of various sizes in a microgravity solidification experiment. We discuss challenges in training such a network along with our solutions for them, and compare the performance of the neural network with traditional shape-detection methods.
Submitted 15 July, 2022;
originally announced July 2022.
-
MCTS with Refinement for Proposals Selection Games in Scene Understanding
Authors:
Sinisa Stekovic,
Mahdi Rad,
Alireza Moradi,
Friedrich Fraundorfer,
Vincent Lepetit
Abstract:
We propose a novel method applicable in many scene understanding problems that adapts the Monte Carlo Tree Search (MCTS) algorithm, originally designed to learn to play games of high-state complexity. From a generated pool of proposals, our method jointly selects and optimizes proposals that minimize the objective term. In our first application for floor plan reconstruction from point clouds, our method selects and refines the room proposals, modelled as 2D polygons, by optimizing on an objective function combining the fitness as predicted by a deep network and regularizing terms on the room shapes. We also introduce a novel differentiable method for rendering the polygonal shapes of these proposals. Our evaluations on the recent and challenging Structured3D and Floor-SP datasets show significant improvements over the state-of-the-art, without imposing hard constraints nor assumptions on the floor plan configurations. In our second application, we extend our approach to reconstruct general 3D room layouts from a color image and obtain accurate room layouts. We also show that our differentiable renderer can easily be extended for rendering 3D planar polygons and polygon embeddings. Our method shows high performance on the Matterport3D-Layout dataset, without introducing hard constraints on room layout configurations.
Submitted 7 July, 2022;
originally announced July 2022.
-
PROMISSING: Pruning Missing Values in Neural Networks
Authors:
Seyed Mostafa Kia,
Nastaran Mohammadian Rad,
Daniel van Opstal,
Bart van Schie,
Andre F. Marquand,
Josien Pluim,
Wiepke Cahn,
Hugo G. Schnack
Abstract:
While data are the primary fuel for machine learning models, they often suffer from missing values, especially when collected in real-world scenarios. However, many off-the-shelf machine learning models, including artificial neural network models, are unable to handle these missing values directly. Therefore, extra data preprocessing and curation steps, such as data imputation, are inevitable before learning and prediction processes. In this study, we propose a simple and intuitive yet effective method for pruning missing values (PROMISSING) during learning and inference steps in neural networks. In this method, there is no need to remove or impute the missing values; instead, the missing values are treated as a new source of information (representing what we do not know). Our experiments on simulated data, several classification and regression benchmarks, and a multi-modal clinical dataset show that PROMISSING results in similar prediction performance compared to various imputation techniques. In addition, our experiments show that models trained using PROMISSING techniques become less decisive in their predictions when facing incomplete samples with many unknowns. This finding hopefully advances machine learning models from being pure predicting machines to more realistic thinkers that can also say "I do not know" when facing incomplete sources of information.
Submitted 3 June, 2022;
originally announced June 2022.
-
Optical Wavelength Meter with Machine Learning Enhanced Precision
Authors:
Gazi Mahamud Hasan,
Mehedi Hasan,
Peng Liu,
Mohammad Rad,
Eric Bernier,
Trevor James Hall
Abstract:
Diverse applications in photonics and microwave engineering require a means of measurement of the instantaneous frequency of a signal. A photonic implementation typically applies an interferometer equipped with three or more output ports to measure the frequency dependent phase shift provided by an optical delay line. The components constituting the interferometer are prone to impairments, which result in erroneous measurements. It is shown that the information to be retrieved is encoded by a three-component vector that lies on a circular cone within a three-dimensional Cartesian object space. The measured data belongs to the image of the object space under a linear map that describes the action of the interferometer. Assisted by a learning algorithm, an inverse map from the image space into the object space is constructed. The inverse map compensates for a variety of impairments while being robust to noise. Simulation results demonstrate that, to the extent the interferometer model captures all significant impairments, a precision limited only by the level of random noise is attainable. A wavelength meter architecture is fabricated on a Si3N4 photonic integration platform to prove the method experimentally. Applied to the measured data, greater than an order of magnitude improvement in precision is achieved by the proposed method compared to the conventional method.
Submitted 14 March, 2022;
originally announced March 2022.
-
Geometry of triple junctions during grain boundary premelting
Authors:
M. Torabi Rad,
G. Boussinot,
M. Apel
Abstract:
Grain Boundaries (GB) whose energy is larger than twice the energy of the solid/liquid interface exhibit the premelting phenomenon, for which an atomically thin liquid layer develops at temperatures slightly below the bulk melting temperature. Premelting can have a severe impact on the structural integrity of a polycrystalline material and on the mechanical high temperature properties, also in the context of crack formation during the very last stages of solidification. The triple junction between a dry GB and the two solid/liquid interfaces of a liquid layer propagating along the GB cannot be defined from macroscopic continuum properties and surface tension equilibria in terms of Young's law. We show how incorporating atomistic scale physics using a disjoining potential regularizes the state of the triple junction and yields an equilibrium with a well-defined microscopic contact angle. We support this finding by dynamical simulations using a multi-phase field model with obstacle potential for both purely kinetic and diffusive conditions. Generally, our results should provide insights on the dynamics of GB phase transitions, of which the complex phenomena associated with liquid metal embrittlement are an example.
Submitted 15 November, 2021;
originally announced November 2021.
-
Self-Supervised Generative Style Transfer for One-Shot Medical Image Segmentation
Authors:
Devavrat Tomar,
Behzad Bozorgtabar,
Manana Lortkipanidze,
Guillaume Vray,
Mohammad Saeed Rad,
Jean-Philippe Thiran
Abstract:
In medical image segmentation, supervised deep networks' success comes at the cost of requiring abundant labeled data. While asking domain experts to annotate only one or a few of the cohort's images is feasible, annotating all available images is impractical. This issue is further exacerbated when pre-trained deep networks are exposed to a new image dataset from an unfamiliar distribution. Using available open-source data for ad-hoc transfer learning or hand-tuned techniques for data augmentation only provides suboptimal solutions. Motivated by atlas-based segmentation, we propose a novel volumetric self-supervised learning for data augmentation capable of synthesizing volumetric image-segmentation pairs via learning transformations from a single labeled atlas to the unlabeled data. Our work's central tenet benefits from a combined view of one-shot generative learning and the proposed self-supervised training strategy that clusters unlabeled volumetric images with similar styles together. Unlike previous methods, our method does not require input volumes at inference time to synthesize new images. Instead, it can generate diversified volumetric image-segmentation pairs from a prior distribution given a single or multi-site dataset. Augmented data generated by our method used to train the segmentation network provide significant improvements over state-of-the-art deep one-shot learning methods on the task of brain MRI segmentation. Ablation studies further exemplified that the proposed appearance model and joint training are crucial to synthesize realistic examples compared to existing medical registration methods. The code, data, and models are available at https://github.com/devavratTomar/SST.
Submitted 5 October, 2021;
originally announced October 2021.
-
Circuit design and integration feasibility of a high-resolution broadband on-chip spectral monitor
Authors:
Mehedi Hasan,
Gazi Mahamud Hasan,
Houman Ghorbani,
Mohammad Rad,
Peng Liu,
Eric Bernier,
Trevor Hall
Abstract:
Up-to-date network telemetry is the key enabler for resource optimization by a variety of means including capacity scaling, fault recovery, network reconfiguration. Reliable optical performance monitoring in general and specifically the monitoring of the spectral profile of WDM signals in fixed- and flex-grid architecture across the entire C-band remains challenging. This article describes a spectrometer circuit architecture along with an original data processing algorithm that combined can measure the spectrum quantitatively across the entire C-band aiming at 1 GHz resolution bandwidth. The circuit is composed of a scanning ring resonator followed by a parallel arrangement of AWGs with interlaced channel spectra. The comb of ring resonances provides the high resolution and the algorithm creates a virtual tuneable AWG that isolates individual resonances of the comb within the flat pass-band of its synthesized channels. The parallel arrangement of AWGs may be replaced by a time multiplexed multi-input port AWG. The feasibility of a ring resonator functioning over whole C-band is experimentally validated. Full tuning of the comb of resonances over a free spectral range is achieved with a high-resolution bandwidth of 1.30 GHz. Due to its maturity and low loss, CMOS compatible silicon nitride is chosen for integration. Additionally, the whole system demonstration is presented using industry standard simulation tool. The architecture is robust to fabrication process variations owing to its data processing approach.
Submitted 11 August, 2021;
originally announced August 2021.
-
Hybrid Deep Neural Network for Brachial Plexus Nerve Segmentation in Ultrasound Images
Authors:
Juul P. A. van Boxtel,
Vincent R. J. Vousten,
Josien Pluim,
Nastaran Mohammadian Rad
Abstract:
Ultrasound-guided regional anesthesia (UGRA) can replace general anesthesia (GA), improving pain control and recovery time. This method can be applied on the brachial plexus (BP) after clavicular surgeries. However, identification of the BP from ultrasound (US) images is difficult, even for trained professionals. To address this problem, convolutional neural networks (CNNs) and more advanced deep neural networks (DNNs) can be used for identification and segmentation of the BP nerve region. In this paper, we propose a hybrid model consisting of a classification model followed by a segmentation model to segment BP nerve regions in ultrasound images. A CNN model is employed as a classifier to precisely select the images with the BP region. Then, a U-net or M-net model is used for the segmentation. Our experimental results indicate that the proposed hybrid model significantly improves the segmentation performance over a single segmentation model.
Submitted 1 June, 2021;
originally announced June 2021.
-
Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation
Authors:
Shreyas Hampali,
Sayan Deb Sarkar,
Mahdi Rad,
Vincent Lepetit
Abstract:
We propose a robust and accurate method for estimating the 3D poses of two hands in close interaction from a single color image. This is a very challenging problem, as large occlusions and many confusions between the joints may happen. State-of-the-art methods solve this problem by regressing a heatmap for each joint, which requires solving two problems simultaneously: localizing the joints and recognizing them. In this work, we propose to separate these tasks by relying on a CNN to first localize joints as 2D keypoints, and on self-attention between the CNN features at these keypoints to associate them with the corresponding hand joint. The resulting architecture, which we call "Keypoint Transformer", is highly efficient as it achieves state-of-the-art performance with roughly half the number of model parameters on the InterHand2.6M dataset. We also show it can be easily extended to estimate the 3D pose of an object manipulated by one or two hands with high performance. Moreover, we created a new dataset of more than 75,000 images of two hands manipulating an object fully annotated in 3D and will make it publicly available.
Submitted 19 April, 2022; v1 submitted 29 April, 2021;
originally announced April 2021.
-
Test-Time Adaptation for Super-Resolution: You Only Need to Overfit on a Few More Images
Authors:
Mohammad Saeed Rad,
Thomas Yu,
Behzad Bozorgtabar,
Jean-Philippe Thiran
Abstract:
Existing reference (RF)-based super-resolution (SR) models try to improve perceptual quality in SR under the assumption of the availability of high-resolution RF images paired with low-resolution (LR) inputs at testing. As the RF images should be similar in terms of content, colors, contrast, etc. to the test image, this hinders the applicability in a real scenario. Other approaches to increase the perceptual quality of images, including perceptual loss and adversarial losses, tend to dramatically decrease fidelity to the ground-truth through significant decreases in PSNR/SSIM. Addressing both issues, we propose a simple yet universal approach to improve the perceptual quality of the HR prediction from a pre-trained SR network on a given LR input by further fine-tuning the SR network on a subset of images from the training dataset with similar patterns of activation as the initial HR prediction, with respect to the filters of a feature extractor. In particular, we show the effects of fine-tuning on these images in terms of the perceptual quality and PSNR/SSIM values. Contrary to perceptually driven approaches, we demonstrate that the fine-tuned network produces a HR prediction with both greater perceptual quality and minimal changes to the PSNR/SSIM with respect to the initial HR prediction. Further, we present novel numerical experiments concerning the filters of SR networks, where we show through filter correlation, that the filters of the fine-tuned network from our method are closer to "ideal" filters, than those of the baseline network or a network fine-tuned on random images.
Submitted 6 April, 2021;
originally announced April 2021.
-
MonteFloor: Extending MCTS for Reconstructing Accurate Large-Scale Floor Plans
Authors:
Sinisa Stekovic,
Mahdi Rad,
Friedrich Fraundorfer,
Vincent Lepetit
Abstract:
We propose a novel method for reconstructing floor plans from noisy 3D point clouds. Our main contribution is a principled approach that relies on the Monte Carlo Tree Search (MCTS) algorithm to maximize a suitable objective function efficiently despite the complexity of the problem. Like previous work, we first project the input point cloud to a top view to create a density map and extract room proposals from it. Our method selects and optimizes the polygonal shapes of these room proposals jointly to fit the density map and outputs an accurate vectorized floor map even for large complex scenes. To do this, we adapted MCTS, an algorithm originally designed to learn to play games, to select the room proposals by maximizing an objective function combining the fitness with the density map as predicted by a deep network and regularizing terms on the room shapes. We also introduce a refinement step to MCTS that adjusts the shape of the room proposals. For this step, we propose a novel differentiable method for rendering the polygonal shapes of these proposals. We evaluate our method on the recent and challenging Structured3D and Floor-SP datasets and show a significant improvement over the state-of-the-art, without imposing any hard constraints nor assumptions on the floor plan configurations.
Submitted 13 September, 2021; v1 submitted 20 March, 2021;
originally announced March 2021.
-
On Theory-training Neural Networks to Infer the Solution of Highly Coupled Differential Equations
Authors:
M. Torabi Rad,
A. Viardin,
M. Apel
Abstract:
Deep neural networks are transforming fields ranging from computer vision to computational medicine, and we recently extended their application to the field of phase-change heat transfer by introducing theory-trained neural networks (TTNs) for a solidification problem \cite{TTN}. Here, we present general, in-depth, and empirical insights into theory-training networks for learning the solution of highly coupled differential equations. We analyze the deteriorating effects of the oscillating loss on the ability of a network to satisfy the equations at the training data points, measured by the final training loss, and on the accuracy of the inferred solution. We introduce a theory-training technique that, by leveraging regularization, eliminates those oscillations, decreases the final training loss, and improves the accuracy of the inferred solution, with no additional computational cost. Then, we present guidelines that allow a systematic search for the network that has the optimal training time and inference accuracy for a given set of equations; following these guidelines can reduce the number of tedious training iterations in that search. Finally, a comparison between theory-training and the rival, conventional method of solving differential equations using discretization attests to the advantages of theory-training not being necessarily limited to high-dimensional sets of equations. The comparison also reveals a limitation of the current theory-training framework that may limit its application in domains where extreme accuracies are necessary.
Submitted 10 February, 2021; v1 submitted 9 February, 2021;
originally announced February 2021.
-
Benefiting from Bicubically Down-Sampled Images for Learning Real-World Image Super-Resolution
Authors:
Mohammad Saeed Rad,
Thomas Yu,
Claudiu Musat,
Hazim Kemal Ekenel,
Behzad Bozorgtabar,
Jean-Philippe Thiran
Abstract:
Super-resolution (SR) has traditionally been based on pairs of high-resolution images (HR) and their low-resolution (LR) counterparts obtained artificially with bicubic downsampling. However, in real-world SR, there is a large variety of realistic image degradations and analytically modeling these realistic degradations can prove quite difficult. In this work, we propose to handle real-world SR by splitting this ill-posed problem into two comparatively more well-posed steps. First, we train a network to transform real LR images to the space of bicubically downsampled images in a supervised manner, by using both real LR/HR pairs and synthetic pairs. Second, we take a generic SR network trained on bicubically downsampled images to super-resolve the transformed LR image. The first step of the pipeline addresses the problem by registering the large variety of degraded images to a common, well understood space of images. The second step then leverages the already impressive performance of SR on bicubically downsampled images, sidestepping the issues of end-to-end training on datasets with many different image degradations. We demonstrate the effectiveness of our proposed method by comparing it to recent methods in real-world SR and show that our proposed approach outperforms the state-of-the-art works in terms of both qualitative and quantitative results, as well as results of an extensive user study conducted on several real image datasets.
Submitted 5 November, 2020; v1 submitted 6 July, 2020;
originally announced July 2020.
-
ALCN: Adaptive Local Contrast Normalization
Authors:
Mahdi Rad,
Peter M. Roth,
Vincent Lepetit
Abstract:
To make Robotics and Augmented Reality applications robust to illumination changes, the current trend is to train a Deep Network with training images captured under many different lighting conditions. Unfortunately, creating such a training set is a very unwieldy and complex task. We therefore propose a novel illumination normalization method that can easily be used for different problems with challenging illumination conditions. Our preliminary experiments show that among current normalization methods, the Difference-of-Gaussians method remains a very good baseline, and we introduce a novel illumination normalization model that generalizes it. Our key insight is then that the normalization parameters should depend on the input image, and we aim to train a Convolutional Neural Network to predict these parameters from the input image. This, however, cannot be done in a supervised manner, as the optimal parameters are not known a priori. We thus designed a method to train this network jointly with another network that aims to recognize objects under different illuminations: The latter network performs well when the former network predicts good values for the normalization parameters. We show that our method significantly outperforms standard normalization methods and also appears to be universal since it does not have to be re-trained for each new application. Our method improves the robustness to light changes of state-of-the-art 3D object detection and face recognition methods.
Submitted 15 April, 2020;
originally announced April 2020.
-
Stability for Hawkes processes with inhibition
Authors:
Mads Bonde Raad,
Eva Löcherbach
Abstract:
We consider a multivariate non-linear Hawkes process in a multi-class setup where particles are organised within two populations of possibly different sizes, such that one of the populations acts excitatory on the system while the other population acts inhibitory on the system. The goal of this note is to present a class of Hawkes Processes with stable dynamics without assumptions on the spectral radius of the associated weight function matrix. This illustrates how inhibition in a Hawkes system significantly affects the stability properties of the system.
Submitted 4 April, 2020;
originally announced April 2020.
-
Measuring Generalisation to Unseen Viewpoints, Articulations, Shapes and Objects for 3D Hand Pose Estimation under Hand-Object Interaction
Authors:
Anil Armagan,
Guillermo Garcia-Hernando,
Seungryul Baek,
Shreyas Hampali,
Mahdi Rad,
Zhaohui Zhang,
Shipeng Xie,
MingXiu Chen,
Boshen Zhang,
Fu Xiong,
Yang Xiao,
Zhiguo Cao,
Junsong Yuan,
Pengfei Ren,
Weiting Huang,
Haifeng Sun,
Marek Hrúz,
Jakub Kanis,
Zdeněk Krňoul,
Qingfu Wan,
Shile Li,
Linlin Yang,
Dongheui Lee,
Angela Yao,
Weiguo Zhou
, et al. (10 additional authors not shown)
Abstract:
We study how well different types of approaches generalise in the task of 3D hand pose estimation under single hand scenarios and hand-object interaction. We show that the accuracy of state-of-the-art methods can drop, and that they fail mostly on poses absent from the training set. Unfortunately, since the space of hand poses is highly dimensional, it is inherently not feasible to cover the whole space densely, despite recent efforts in collecting large-scale training datasets. This sampling problem is even more severe when hands are interacting with objects and/or inputs are RGB rather than depth images, as RGB images also vary with lighting conditions and colors. To address these issues, we designed a public challenge (HANDS'19) to evaluate the abilities of current 3D hand pose estimators (HPEs) to interpolate and extrapolate the poses of a training set. More exactly, HANDS'19 is designed (a) to evaluate the influence of both depth and color modalities on 3D hand pose estimation, under the presence or absence of objects; (b) to assess the generalisation abilities w.r.t. four main axes: shapes, articulations, viewpoints, and objects; (c) to explore the use of a synthetic hand model to fill the gaps of current datasets. Through the challenge, the overall accuracy has dramatically improved over the baseline, especially on extrapolation tasks, from 27mm to 13mm mean joint error. Our analyses highlight the impacts of: Data pre-processing, ensemble approaches, the use of a parametric 3D hand model (MANO), and different HPE methods/backbones.
Submitted 10 September, 2020; v1 submitted 30 March, 2020;
originally announced March 2020.
-
General 3D Room Layout from a Single View by Render-and-Compare
Authors:
Sinisa Stekovic,
Shreyas Hampali,
Mahdi Rad,
Sayan Deb Sarkar,
Friedrich Fraundorfer,
Vincent Lepetit
Abstract:
We present a novel method to reconstruct the 3D layout of a room (walls, floors, ceilings) from a single perspective view in challenging conditions, by contrast with previous single-view methods restricted to cuboid-shaped layouts. This input view can consist of a color image only, but considering a depth map results in a more accurate reconstruction. Our approach is formalized as solving a constrained discrete optimization problem to find the set of 3D polygons that constitute the layout. In order to deal with occlusions between components of the layout, which is a problem ignored by previous works, we introduce an analysis-by-synthesis method to iteratively refine the 3D layout estimate. As no dataset was available to evaluate our method quantitatively, we created one together with several appropriate metrics. Our dataset consists of 293 images from ScanNet, which we annotated with precise 3D layouts. It offers three times more samples than the popular NYUv2 303 benchmark, and a much larger variety of layouts.
Submitted 21 July, 2020; v1 submitted 7 January, 2020;
originally announced January 2020.
-
Theory-training deep neural networks for an alloy solidification benchmark problem
Authors:
M. Torabi Rad,
A. Viardin,
G. J. Schmitz,
M. Apel
Abstract:
Deep neural networks are machine learning tools that are transforming fields ranging from speech recognition to computational medicine. In this study, we extend their application to the field of alloy solidification modeling. To that end, and for the first time in the field, theory-trained deep neural networks (TTNs) for solidification are introduced. These networks are trained using the framework founded by Raissi et al.[1-3] and a theory that consists of a mathematical macroscale solidification model and the boundary and initial conditions of a well-known solidification benchmark problem. One of the main advantages of TTNs is that they do not need any prior knowledge of the solution of the governing equations or any external data for training. Using the built-in capabilities in TensorFlow, networks with different widths and depths are trained, and their predictions are examined in detail to verify that they satisfy both the model equations and the initial and boundary conditions of the benchmark problem. Issues that are critical in theory-training are identified, and guidelines that can be used in the future for successful and efficient training of similar networks are proposed. Through this study, theory-trained deep neural networks are shown to be a viable tool to simulate alloy solidification problems.
Submitted 20 December, 2019;
originally announced December 2019.
-
Circuit architecture of a sub-GHz resolution panoramic C band on-chip spectral sensor
Authors:
Mehedi Hasan,
Mohammad Rad,
Gazi Mahamud Hasan,
Peng Liu,
Patric Dumais,
Eric Bernier,
Trevor Hall
Abstract:
Monitoring the state of the optical network is a key enabler for programmability of network functions, protocols and efficient use of the spectrum. A particular challenge is to provide the SDN-EON controller with a panoramic view of the complete state of the optical spectrum. This paper describes the architecture for compact on-chip spectrometry targeting high resolution across the entire C-band to reliably and accurately measure the spectral profile of WDM signals in fixed and flex-grid architectures. An industry standard software tool is used to validate the performance of the spectrometer. The fabrication of the proposed design is found to be practical.
Submitted 24 October, 2019;
originally announced December 2019.
-
Artificial Intelligence Approaches
Authors:
Yingjie Hu,
Wenwen Li,
Dawn Wright,
Orhun Aydin,
Daniel Wilson,
Omar Maher,
Mansour Raad
Abstract:
Artificial Intelligence (AI) has received tremendous attention from academia, industry, and the general public in recent years. The integration of geography and AI, or GeoAI, provides novel approaches for addressing a variety of problems in the natural environment and our human society. This entry briefly reviews the recent development of AI with a focus on machine learning and deep learning approaches. We discuss the integration of AI with geography and particularly geographic information science, and present a number of GeoAI applications and possible future directions.
Submitted 27 August, 2019;
originally announced August 2019.
-
SROBB: Targeted Perceptual Loss for Single Image Super-Resolution
Authors:
Mohammad Saeed Rad,
Behzad Bozorgtabar,
Urs-Viktor Marti,
Max Basler,
Hazim Kemal Ekenel,
Jean-Philippe Thiran
Abstract:
By benefiting from perceptual losses, recent studies have improved significantly the performance of the super-resolution task, where a high-resolution image is resolved from its low-resolution counterpart. Although such objective functions generate near-photorealistic results, their capability is limited, since they estimate the reconstruction error for an entire image in the same way, without considering any semantic information. In this paper, we propose a novel method to benefit from perceptual loss in a more objective way. We optimize a deep network-based decoder with a targeted objective function that penalizes images at different semantic levels using the corresponding terms. In particular, the proposed method leverages our proposed OBB (Object, Background and Boundary) labels, generated from segmentation labels, to estimate a suitable perceptual loss for boundaries, while considering texture similarity for backgrounds. We show that our proposed approach results in more realistic textures and sharper edges, and outperforms other state-of-the-art algorithms in terms of both qualitative results on standard benchmarks and results of extensive user studies.
Submitted 20 August, 2019;
originally announced August 2019.
-
Benefiting from Multitask Learning to Improve Single Image Super-Resolution
Authors:
Mohammad Saeed Rad,
Behzad Bozorgtabar,
Claudiu Musat,
Urs-Viktor Marti,
Max Basler,
Hazim Kemal Ekenel,
Jean-Philippe Thiran
Abstract:
Despite significant progress toward super-resolving more realistic images with deeper convolutional neural networks (CNNs), reconstructing fine and natural textures remains a challenging problem. Recent works on single image super-resolution (SISR) are mostly based on optimizing pixel-wise and content-wise similarity between recovered and high-resolution (HR) images and do not benefit from the recognizability of semantic classes. In this paper, we introduce a novel approach that uses categorical information to tackle the SISR problem; we present a decoder architecture able to extract and use semantic information to super-resolve a given image through multitask learning, simultaneously for image super-resolution and semantic segmentation. To explore categorical information during training, the proposed decoder employs only one shared deep network for two task-specific output layers. At run time, only the layers producing the HR image are used and no segmentation label is required. Extensive perceptual experiments and a user study on images randomly selected from the COCO-Stuff dataset demonstrate the effectiveness of our proposed method, which outperforms state-of-the-art methods.
Submitted 29 July, 2019;
originally announced July 2019.
-
HOnnotate: A method for 3D Annotation of Hand and Object Poses
Authors:
Shreyas Hampali,
Mahdi Rad,
Markus Oberweger,
Vincent Lepetit
Abstract:
We propose a method for annotating images of a hand manipulating an object with the 3D poses of both the hand and the object, together with a dataset created using this method. Our motivation is the current lack of annotated real images for this problem, as estimating the 3D poses is challenging, mostly because of the mutual occlusions between the hand and the object. To tackle this challenge, we capture sequences with one or several RGB-D cameras and jointly optimize the 3D hand and object poses over all the frames simultaneously. This method allows us to automatically annotate each frame with accurate estimates of the poses, despite large mutual occlusions. With this method, we created HO-3D, the first markerless dataset of color images with 3D annotations for both the hand and object. This dataset is currently made of 77,558 frames, 68 sequences, 10 persons, and 10 objects. Using our dataset, we develop a single RGB image-based method to predict the hand pose when interacting with objects under severe occlusions and show it generalizes to objects not seen in the dataset.
Submitted 30 May, 2020; v1 submitted 2 July, 2019;
originally announced July 2019.
-
Renewal Time Points for Hawkes Processes
Authors:
Mads Bonde Raad
Abstract:
In the last decade Hawkes processes have received much attention as models for functional connectivity in neural spiking networks and other dynamical systems with a cascade behavior. In this paper we establish a renewal approach for analyzing this process. We consider the ordinary nonlinear Hawkes process as well as the more recently described age dependent Hawkes process. We construct renewal times and establish moment results for these. This makes it possible to study the Hawkes process as a Markov chain. As an application, we prove asymptotic results such as a functional CLT and a time-average CLT.
Submitted 8 June, 2019; v1 submitted 5 June, 2019;
originally announced June 2019.
-
Using Photorealistic Face Synthesis and Domain Adaptation to Improve Facial Expression Analysis
Authors:
Behzad Bozorgtabar,
Mohammad Saeed Rad,
Hazim Kemal Ekenel,
Jean-Philippe Thiran
Abstract:
Cross-domain synthesizing realistic faces to learn deep models has attracted increasing attention for facial expression analysis as it helps to improve the performance of expression recognition accuracy despite having small number of real training images. However, learning from synthetic face images can be problematic due to the distribution discrepancy between low-quality synthetic images and real face images and may not achieve the desired performance when the learned model applies to real world scenarios. To this end, we propose a new attribute guided face image synthesis to perform a translation between multiple image domains using a single model. In addition, we adopt the proposed model to learn from synthetic faces by matching the feature distributions between different domains while preserving each domain's characteristics. We evaluate the effectiveness of the proposed approach on several face datasets on generating realistic face images. We demonstrate that the expression recognition performance can be enhanced by benefiting from our face synthesis model. Moreover, we also conduct experiments on a near-infrared dataset containing facial expression videos of drivers to assess the performance using in-the-wild data for driver emotion recognition.
Submitted 17 May, 2019;
originally announced May 2019.
-
Learn to synthesize and synthesize to learn
Authors:
Behzad Bozorgtabar,
Mohammad Saeed Rad,
Hazım Kemal Ekenel,
Jean-Philippe Thiran
Abstract:
Attribute guided face image synthesis aims to manipulate attributes on a face image. Most existing methods for image-to-image translation can either perform a fixed translation between any two image domains using a single attribute or require training data with the attributes of interest for each subject. Therefore, these methods can only train one specific model for each pair of image domains, which limits their ability to deal with more than two domains. Another disadvantage of these methods is that they often suffer from the common problem of mode collapse, which degrades the quality of the generated images. To overcome these shortcomings, we propose an attribute guided face image generation method using a single model, which is capable of synthesizing multiple photo-realistic face images conditioned on the attributes of interest. In addition, we adopt the proposed model to increase the realism of simulated face images while preserving the face characteristics. Compared to existing models, synthetic face images generated by our method present good photorealistic quality on several face datasets. Finally, we demonstrate that generated facial images can be used for synthetic data augmentation and improve the performance of the classifier used for facial expression recognition.
Submitted 1 May, 2019;
originally announced May 2019.
-
Beyond traditional coatings, a review on thermal sprayed functional and smart coatings
Authors:
Daniel Tejero-Martin,
Milad Rezvani Rad,
André McDonald,
Tanvir Hussain
Abstract:
Thermal spraying has been present for over a century, being greatly refined and optimised during this time, becoming nowadays a reliable and cost-efficient method to deposit thick coatings with a wide variety of feedstock materials and substrates. Thermal sprayed coatings have been successfully applied in fields such as aerospace or electricity production, becoming an essential component of today's industry. To overpass the traditional capabilities of those coatings, new functionalities and coherent responses are being integrated, opening the field of functional and smart coatings. The aim of this paper is to present a comprehensive review of the current state of functional and smart coatings produced using thermal spraying deposition. It will first describe the different thermal spraying technologies, with a focus on how different techniques achieve the thermal and kinetic energy required to form a coating, as well as the environment to which feedstock particles are exposed in terms of temperature and velocity. It will then deal with the state-of-the-art functional and smart coatings applied using thermal spraying techniques, with a discussion on the fundamentals on which the coatings are designed, the efficiency of its performance and the industrial applications, both current and potential. The inherent designing flexibility of thermal sprayed functional and smart coatings has been exploited to explore exciting new possibilities on many different fields. Applications such as anti-bacterial and anti-fouling coatings, superhydrophobic surfaces, electrical and heating devices for functional coatings and self-healing, self-lubricating and sensors for smart coatings are here presented and discussed. All these exciting developments pave the way for the numerous applications that are to come in the next decade, making the field of thermal sprayed coatings a unique opportunity for research.
Submitted 25 March, 2019; v1 submitted 13 November, 2018;
originally announced November 2018.
-
Domain Transfer for 3D Pose Estimation from Color Images without Manual Annotations
Authors:
Mahdi Rad,
Markus Oberweger,
Vincent Lepetit
Abstract:
We introduce a novel learning method for 3D pose estimation from color images. While acquiring annotations for color images is a difficult task, our approach circumvents this problem by learning a mapping from paired color and depth images captured with an RGB-D camera. We jointly learn the pose from synthetic depth images that are easy to generate, and learn to align these synthetic depth images with the real depth images. We show our approach for the task of 3D hand pose estimation and 3D object pose estimation, both from color images only. Our method achieves performances comparable to state-of-the-art methods on popular benchmark datasets, without requiring any annotations for the color images.
Submitted 21 February, 2019; v1 submitted 8 October, 2018;
originally announced October 2018.
-
Age Dependent Hawkes Process
Authors:
Mads Bonde Raad,
Susanne Ditlevsen,
Eva Löcherbach
Abstract:
In the last decade, Hawkes processes have received a lot of attention as good models for functional connectivity in neural spiking networks. In this paper we consider a variant of this process, the Age Dependent Hawkes process, which incorporates individual post-jump behaviour into the framework of the usual Hawkes model. This makes it possible to model recovery properties such as refractory periods, where the effects of the network are momentarily suppressed or altered. We show how classical stability results for Hawkes processes can be improved by introducing age into the system. In particular, we need neither to bound the intensities a priori nor to impose any conditions on the Lipschitz constants. When the interactions between neurons are of mean field type, we study large network limits and establish the propagation of chaos property of the system.
Submitted 7 October, 2019; v1 submitted 17 June, 2018;
originally announced June 2018.
-
Making Deep Heatmaps Robust to Partial Occlusions for 3D Object Pose Estimation
Authors:
Markus Oberweger,
Mahdi Rad,
Vincent Lepetit
Abstract:
We introduce a novel method for robust and accurate 3D object pose estimation from a single color image under large occlusions. Following recent approaches, we first predict the 2D projections of 3D points related to the target object and then compute the 3D pose from these correspondences using a geometric method. Unfortunately, as the results of our experiments show, predicting these 2D projections using a regular CNN or a Convolutional Pose Machine is highly sensitive to partial occlusions, even when these methods are trained with partially occluded examples. Our solution is to predict heatmaps from multiple small patches independently and to accumulate the results to obtain accurate and robust predictions. Training subsequently becomes challenging because patches with similar appearances but different positions on the object correspond to different heatmaps. However, we provide a simple yet effective solution to deal with such ambiguities. We show that our approach outperforms existing methods on two challenging datasets: The Occluded LineMOD dataset and the YCB-Video dataset, both exhibiting cluttered scenes with highly occluded objects. Project website: https://www.tugraz.at/institute/icg/research/team-lepetit/research-projects/robust-object-pose-estimation/
Submitted 26 July, 2018; v1 submitted 11 April, 2018;
originally announced April 2018.
-
Feature Mapping for Learning Fast and Accurate 3D Pose Inference from Synthetic Images
Authors:
Mahdi Rad,
Markus Oberweger,
Vincent Lepetit
Abstract:
We propose a simple and efficient method for exploiting synthetic images when training a Deep Network to predict a 3D pose from an image. The ability of using synthetic images for training a Deep Network is extremely valuable as it is easy to create a virtually infinite training set made of such images, while capturing and annotating real images can be very cumbersome. However, synthetic images do not resemble real images exactly, and using them for training can result in suboptimal performance. It was recently shown that for exemplar-based approaches, it is possible to learn a mapping from the exemplar representations of real images to the exemplar representations of synthetic images. In this paper, we show that this approach is more general, and that a network can also be applied after the mapping to infer a 3D pose: At run time, given a real image of the target object, we first compute the features for the image, map them to the feature space of synthetic images, and finally use the resulting features as input to another network which predicts the 3D pose. Since this network can be trained very effectively by using synthetic images, it performs very well in practice, and inference is faster and more accurate than with an exemplar-based approach. We demonstrate our approach on the LINEMOD dataset for 3D object pose estimation from color images, and the NYU dataset for 3D hand pose estimation from depth maps. We show that it allows us to outperform the state-of-the-art on both datasets.
Submitted 26 March, 2018; v1 submitted 11 December, 2017;
originally announced December 2017.
-
On graphs of bounded semilattices
Authors:
Parastoo Malakooti Rad,
Peyman Nasehpour
Abstract:
In this paper, we introduce the graph $G(S)$ of a bounded semilattice $S$, which is a generalization of the intersection graph of the substructures of an algebraic structure. We prove some general theorems about these graphs; as an example, we show that if $S$ is a product of three or more chains, then $G(S)$ is Eulerian if and only if either the length of every chain is even or all the chains are of length one. We also show that if $G(S)$ contains a cycle, then $girth(G(S)) = 3$. Finally, we show that if $(S,+,\cdot,0,1)$ is a dually atomic bounded distributive lattice whose set of dual atoms is nonempty, and the graph $G(S)$ of $S$ has no isolated vertex, then $G(S)$ is connected with $diam(G(S))\leq 4$.
Submitted 5 November, 2018; v1 submitted 3 November, 2017;
originally announced November 2017.
-
A Computer Vision System to Localize and Classify Wastes on the Streets
Authors:
Mohammad Saeed Rad,
Andreas von Kaenel,
Andre Droux,
Francois Tieche,
Nabil Ouerhani,
Hazim Kemal Ekenel,
Jean-Philippe Thiran
Abstract:
Littering quantification is an important step for improving the cleanliness of cities. When human interpretation is too cumbersome or in some cases impossible, an objective index of cleanliness could reduce littering through awareness actions. In this paper, we present a fully automated computer vision application for littering quantification based on images taken from streets and sidewalks. We employ a deep learning based framework to localize and classify different types of waste. Since no waste dataset was available, we built our own acquisition system mounted on a vehicle. The collected images contain different types of waste and are annotated for training and benchmarking the developed system. Our results on real case scenarios show accurate detection of littering on varied backgrounds.
Submitted 31 October, 2017;
originally announced October 2017.
-
Deep Learning for Automatic Stereotypical Motor Movement Detection using Wearable Sensors in Autism Spectrum Disorders
Authors:
Nastaran Mohammadian Rad,
Seyed Mostafa Kia,
Calogero Zarbo,
Twan van Laarhoven,
Giuseppe Jurman,
Paola Venuti,
Elena Marchiori,
Cesare Furlanello
Abstract:
Autism Spectrum Disorders are associated with atypical movements, of which stereotypical motor movements (SMMs) interfere with learning and social interaction. The automatic SMM detection using inertial measurement units (IMU) remains complex due to the strong intra and inter-subject variability, especially when handcrafted features are extracted from the signal. We propose a new application of the deep learning to facilitate automatic SMM detection using multi-axis IMUs. We use a convolutional neural network (CNN) to learn a discriminative feature space from raw data. We show how the CNN can be used for parameter transfer learning to enhance the detection rate on longitudinal data. We also combine the long short-term memory (LSTM) with CNN to model the temporal patterns in a sequence of multi-axis signals. Further, we employ ensemble learning to combine multiple LSTM learners into a more robust SMM detector. Our results show that: 1) feature learning outperforms handcrafted features; 2) parameter transfer learning is beneficial in longitudinal settings; 3) using LSTM to learn the temporal dynamic of signals enhances the detection rate especially for skewed training data; 4) an ensemble of LSTMs provides more accurate and stable detectors. These findings provide a significant step toward accurate SMM detection in real-time scenarios.
Submitted 14 September, 2017;
originally announced September 2017.
-
ALCN: Meta-Learning for Contrast Normalization Applied to Robust 3D Pose Estimation
Authors:
Mahdi Rad,
Peter M. Roth,
Vincent Lepetit
Abstract:
To be robust to illumination changes when detecting objects in images, the current trend is to train a Deep Network with training images captured under many different lighting conditions. Unfortunately, creating such a training set is very cumbersome, or sometimes even impossible, for some applications such as 3D pose estimation of specific objects, which is the application we focus on in this paper. We therefore propose a novel illumination normalization method that lets us learn to detect objects and estimate their 3D pose under challenging illumination conditions from very few training samples. Our key insight is that normalization parameters should adapt to the input image. In particular, we realized this via a Convolutional Neural Network trained to predict the parameters of a generalization of the Difference-of-Gaussians method. We show that our method significantly outperforms standard normalization methods and demonstrate it on two challenging 3D detection and pose estimation problems.
Submitted 31 August, 2017;
originally announced August 2017.
-
A dataset for Computer-Aided Detection of Pulmonary Embolism in CTA images
Authors:
Mojtaba Masoudi,
Hamidreza Pourreza,
Mahdi Saadatmand Tarzjan,
Fateme Shafiee Zargar,
Masoud Pezeshki Rad,
Noushin Eftekhari
Abstract:
Today, researchers in the field of Pulmonary Embolism (PE) analysis need a publicly available dataset to assess and compare their methods. Different systems have been designed for the detection of PE, but none of them have used a public dataset; all papers have used their own private datasets. To fill this gap, we have collected 5160 slices of computed tomography angiography (CTA) images acquired from 20 patients, and after labeling of the images by experts in this field, we provide a reliable dataset which is now publicly available. In some situations, PE detection can be difficult, for example when it occurs in the peripheral branches or when patients have pulmonary diseases (such as parenchymal disease). Therefore, the efficiency of CAD systems highly depends on the dataset. In the given dataset, 66% of PE are located in peripheral branches, and different pulmonary diseases are also included.
Submitted 5 July, 2017;
originally announced July 2017.
-
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth
Authors:
Mahdi Rad,
Vincent Lepetit
Abstract:
We introduce a novel method for 3D object detection and pose estimation from color images only. We first use segmentation to detect the objects of interest in 2D even in presence of partial occlusions and cluttered background. By contrast with recent patch-based methods, we rely on a "holistic" approach: We apply to the detected objects a Convolutional Neural Network (CNN) trained to predict their 3D poses in the form of 2D projections of the corners of their 3D bounding boxes. This, however, is not sufficient for handling objects from the recent T-LESS dataset: These objects exhibit an axis of rotational symmetry, and the similarity of two images of such an object under two different poses makes training the CNN challenging. We solve this problem by restricting the range of poses used for training, and by introducing a classifier to identify the range of a pose at run-time before estimating it. We also use an optional additional step that refines the predicted poses. We improve the state-of-the-art on the LINEMOD dataset from 73.7% to 89.3% of correctly registered RGB frames. We are also the first to report results on the Occlusion dataset using color images only. We obtain 54% of frames passing the Pose 6D criterion on average on several sequences of the T-LESS dataset, compared to the 67% of the state-of-the-art on the same sequences which uses both color and depth. The full approach is also scalable, as a single network can be trained for multiple objects simultaneously.
Submitted 26 March, 2018; v1 submitted 31 March, 2017;
originally announced March 2017.