I am a Researcher in Video Processing in the Sigmedia group of the Department of Electronic & Electrical Engineering at Trinity College Dublin.
My research is concerned with media signal processing applications related to film post-production and film restoration. The aim of my research is to develop tools that can enhance the cinematic experience of immersion.

Phone: +353 1896 3818
Address: AAP 2.18, 5 College Green, Trinity College Dublin
Interactive object cutout tools are the cornerstone of the image editing workflow. Recent deep-learning based interactive segmentation algorithms have made significant progress in handling complex images, and rough binary selections can typically be obtained with just a few clicks. Yet, deep learning techniques tend to plateau once this rough selection has been reached. In this work, we interpret this plateau as the inability of current algorithms to sufficiently leverage each user interaction, and also as a limitation of current training/testing datasets. We propose a novel interactive architecture and a novel training scheme that are both tailored to better exploit the user workflow. We also show that significant improvements can be further gained by introducing a synthetic training dataset that is specifically designed for complex object boundaries. Comprehensive experiments support our approach, and our network achieves state-of-the-art performance.
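As an illustration of how user clicks are commonly fed to an interactive segmentation network, the minimal sketch below encodes positive and negative clicks as Gaussian guidance maps that can be concatenated with the image. This is a generic convention, not necessarily the exact encoding used by the architecture above; the function name and parameters are illustrative.

```python
import numpy as np

def click_guidance_maps(clicks, height, width, sigma=10.0):
    """Encode positive/negative user clicks as two Gaussian heatmap channels.

    `clicks` is a list of (row, col, is_positive) tuples. The resulting
    maps are typically concatenated with the RGB image (and the current
    mask) to form the network input.
    """
    maps = np.zeros((2, height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for row, col, is_positive in clicks:
        dist2 = (ys - row) ** 2 + (xs - col) ** 2
        heat = np.exp(-dist2 / (2.0 * sigma ** 2))
        channel = 0 if is_positive else 1
        maps[channel] = np.maximum(maps[channel], heat)
    return maps

# Example: one foreground click and one background click on a 256x256 image.
guidance = click_guidance_maps([(100, 120, True), (30, 40, False)], 256, 256)
print(guidance.shape)  # (2, 256, 256)
```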
Cutting out an object and estimating its opacity mask, known as image matting, is a key task in many image editing applications. Deep learning approaches have made significant progress by adapting the encoder-decoder architecture of segmentation networks. However, most existing networks only predict the alpha matte, and post-processing methods must then be used to recover the original foreground and background colours in the transparent regions. Recently, two methods have shown improved results by also estimating the foreground colours, but at a significant computational and memory cost. In this paper, we propose a low-cost modification to alpha matting networks to also predict the foreground and background colours. We study variations of the training regime and explore a wide range of existing and novel loss functions for the joint prediction. Our method achieves state-of-the-art performance on the Adobe Composition-1k dataset for alpha matte and composite colour quality.
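For context, matting assumes the compositing model I = alpha*F + (1 - alpha)*B, so predicting F and B alongside alpha allows the recomposited image to be checked against the input. The sketch below shows one plausible composition loss under that model; it is an assumed example, not the paper's exact loss (the paper explores a wider range of functions).

```python
import numpy as np

def composition_loss(alpha, fg, bg, image):
    """L1 loss between the input image and the composite alpha*F + (1 - alpha)*B.

    alpha: (H, W) in [0, 1]; fg, bg, image: (H, W, 3) in [0, 1].
    One common way to supervise joint alpha/foreground/background prediction.
    """
    composite = alpha[..., None] * fg + (1.0 - alpha[..., None]) * bg
    return np.mean(np.abs(composite - image))
```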
The current state-of-the-art alpha matting methods mainly rely on the trimap as the secondary and only guidance to estimate alpha. This paper investigates the effect of utilising the background information, in addition to the trimap, in the process of alpha calculation. To achieve this goal, a state-of-the-art method, AlphaGAN, is adopted and modified to process the background information as an extra input channel. Extensive experiments are performed to analyse the effect of the background information in image and video matting, such as training with mildly and heavily distorted backgrounds. Based on the quantitative evaluations performed on the Adobe Composition-1k dataset, the proposed pipeline significantly outperforms the state-of-the-art methods on the AlphaMatting benchmark metrics.
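A minimal sketch of what "background as an extra input channel" can look like in practice: stacking the image, trimap and known background into a single multi-channel tensor before it enters the network. The channel layout and count here are assumptions for illustration, not the paper's exact input format.

```python
import numpy as np

def build_matting_input(image, trimap, background):
    """Stack image (H, W, 3), trimap (H, W, 1) and background (H, W, 3)
    into a single 7-channel input, as one way of feeding the known
    background to a matting network alongside the usual image + trimap.
    """
    return np.concatenate([image, trimap, background], axis=-1)  # (H, W, 7)
```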
Online videos have witnessed unprecedented growth over the last decade, owing to a wide range of content creation. This provides advertisement and marketing agencies with a plethora of opportunities for targeted advertisements. Such techniques involve replacing an existing advertisement in a video frame with a new advertisement. However, such post-processing of online videos is mostly done manually by video editors, which is cumbersome and time-consuming. In this paper, we propose DeepAds -- a deep neural network, based on a simple encoder-decoder architecture, that can accurately localize the position of an advert in a video frame. Our approach of localizing billboards in outdoor scenes using neural nets is the first of its kind and achieves the best performance. We benchmark our proposed method against other semantic segmentation algorithms on a public dataset of outdoor scenes with manually annotated billboard binary maps.
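To make the encoder-decoder idea concrete, here is a toy stand-in written with PyTorch: a few strided convolutions downsample the frame and transposed convolutions upsample back to a one-channel billboard probability map. This is only an illustration of the general architecture family, not the DeepAds network itself.

```python
import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    """Illustrative encoder-decoder producing a 1-channel billboard mask.
    A toy stand-in for the general architecture family, not DeepAds itself."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return torch.sigmoid(self.decoder(self.encoder(x)))

# A 256x256 RGB frame in, a 256x256 probability map out.
mask = TinyEncoderDecoder()(torch.rand(1, 3, 256, 256))
print(mask.shape)  # torch.Size([1, 1, 256, 256])
```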
With the advent of faster internet services and the growth of multimedia content, we observe a massive growth in the number of online videos. Users generate this video content at an unprecedented rate, owing to the use of smartphones and other hand-held video capture devices. This creates immense potential for advertising and marketing agencies to create personalized content for users. In this paper, we attempt to assist video editors in generating augmented video content by proposing candidate spaces in video frames. We propose and release a large-scale dataset of outdoor scenes, along with manually annotated maps of candidate spaces. We also benchmark several deep-learning based semantic segmentation algorithms on this proposed dataset.
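The abstract does not state which metrics are used for the benchmark; intersection-over-union (IoU) against the manually annotated maps is a common choice for this kind of evaluation and is sketched below for reference.

```python
import numpy as np

def intersection_over_union(pred_mask, gt_mask):
    """IoU between two binary masks; a standard metric for comparing
    predicted candidate-space masks against manually annotated maps."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty
    return np.logical_and(pred, gt).sum() / union
```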
2018 25th IEEE International Conference on Image Processing (ICIP), 2018
Many defocus blur estimation methods have been proposed in recent years but, when applied to video sequences in a frame-by-frame manner, they typically exhibit temporal inconsistencies or flickering. This paper presents a temporal coherence scheme that can be coupled with any existing defocus blur estimation method for still images, aiming to produce spatio-temporally coherent defocus blur map videos. The proposed method is based on the design of a Kalman filter which is applied at the patch level. Experimental results show that the proposed method can smooth out undesirable temporal fluctuations whilst still preserving the abrupt local appearance changes due to motion, occlusions or dis-occlusions.
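A minimal sketch of the general idea of per-patch temporal filtering: a scalar Kalman filter smoothing the frame-by-frame blur estimate of a single patch. The constant state model and noise variances here are illustrative assumptions; the paper's actual filter design, including its handling of motion and occlusions, is not reproduced.

```python
import numpy as np

def smooth_patch_blur(measurements, process_var=1e-3, measurement_var=1e-2):
    """Temporally smooth a per-patch blur estimate with a scalar Kalman filter.

    `measurements` is the sequence of frame-by-frame blur values for one patch.
    """
    estimate, variance = measurements[0], 1.0
    smoothed = [estimate]
    for z in measurements[1:]:
        # Predict: blur assumed locally constant, so only uncertainty grows.
        variance += process_var
        # Update: blend the prediction with the new frame's measurement.
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)
        variance *= (1.0 - gain)
        smoothed.append(estimate)
    return np.array(smoothed)
```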
This paper provides an overview of our participation in the TRECVID 2018 Storytelling Linking task. Our approach uses an RNN-based neural network to learn a semantic representation of text (news topics), images and videos (collected from Twitter and Flickr posts) in the same latent space. We applied a two-stage (pre-train + fine-tune) learning architecture to train and adjust the model (using Flickr30k and labels from online search as additional data). During the search phase of the task, we take a different strategy to generate five different runs by leveraging video-length normalization and controlling the training source.
A combination of computer vision and projector-based illumination opens the possibility for a new type of computer vision technology. One of them is augmented reality: selectively illuminating the scene to improve or manipulate how the reality itself, rather than its display, appears to a human. One such example is the Smart Headlight being developed at Carnegie Mellon University's Robotics Institute. The project team has been working on a set of new capabilities for the headlight, such as making rain drops and snowflakes disappear, allowing the high beams to always be on without glare, and enhancing the appearance of objects of interest. Using the Smart Headlight as an example, this talk will further discuss various ideas, concepts and possible applications of coaxial and non-coaxial projector-camera systems. About the speaker: Professor Takeo Kanade is the U. A. and Helen Whitaker University Professor of Computer Science and Robotics at Carnegie Mellon University.
Online video advertising gives content providers the ability to deliver compelling content, reach a growing audience, and generate additional revenue from online media. Recently, advertising strategies have been designed to look for the original advert(s) in a video frame and replace them with new adverts. These strategies, popularly known as product placement or embedded marketing, greatly help marketing agencies to reach a wider audience. However, in the existing literature, the detection of candidate frames in a video sequence for the purpose of advert integration is done manually. In this paper, we propose a deep-learning architecture called ADNet that automatically detects the presence of advertisements in video frames. Our approach is the first of its kind that automatically detects the presence of adverts in a video frame, and achieves state-of-the-art results on a public dataset.
This research project aims to provide an AI-enhanced productivity tool for media generation to video editors and producers. The goal is to improve productivity among producers and artists in terms of augmenting video content with new objects or effects, in a natural and appealing way. Furthermore, the project aims to bridge the gap between offline augmented reality technologies, occlusion handling, and camera tracking.
Proceedings of the 2021 International Conference on Multimodal Interaction, 2021
Live video comments, or “danmu”, are an emerging social feature on Asian online video platforms. These time-synchronous comments are overlaid on the video playback and uniquely enrich the viewing experience, engaging hundreds of millions of users in rich community discussions. The presence of danmu comments has become a determining factor for video popularity. Recent work has proposed a model to automatically generate comments, but very little work has so far considered the problem of where to insert the comments in the video timeline. In this work, we propose to address both the what and where of automatic danmu generation by jointly predicting the danmu comment content to be generated, as well as its optimal insertion point in the video timeline. Our model exploits the video visual content, subtitles, audio signals, and any existing surrounding comments in one unified architecture, and can handle scenarios where videos are already heavily commented or have no comments yet. Experiments show that our proposed unified framework generally outperforms state-of-the-art comment generation methods.