2019 IEEE International Workshop on Information Forensics and Security (WIFS), 2019
The vulnerability of deep neural networks to adversarial attacks currently represents one of the most challenging open problems in the deep learning field. The NeurIPS 2018 work that obtained the best paper award proposed a new paradigm for defining deep neural networks with continuous internal activations. In this kind of network, dubbed Neural ODE Networks, a continuous hidden state can be defined via parametric ordinary differential equations, and its dynamics can be adjusted to build representations for a given task, such as image classification. In this paper, we analyze the robustness of image classifiers implemented as ODE Nets to adversarial attacks and compare it to that of standard deep models. We show that Neural ODE Nets are natively more robust to adversarial attacks than state-of-the-art residual networks, and that some of their intrinsic properties, such as adaptive computation cost, open new directions to further increase the robustness of deep-learned models. Moreover, thanks to the continuity of the hidden state, we are able to follow the perturbation injected by manipulated inputs and pinpoint the part of the internal dynamics that is most responsible for the misclassification.
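As a rough illustration of the continuous-hidden-state idea, the following is a minimal PyTorch sketch of an ODE block; fixed-step Euler integration stands in for the adaptive solvers that ODE Nets actually use, and all names and layer sizes (ODEFunc, ODEBlock, dim=64) are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """Parametric dynamics f(h, t): defines dh/dt for the hidden state."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, t, h):
        return self.net(h)

class ODEBlock(nn.Module):
    """Evolves the hidden state from t=0 to t=1 with fixed-step Euler.
    Adaptive solvers (e.g., Dormand-Prince) would adjust the number of
    evaluations per input, which is the 'adaptive computation cost'
    the abstract mentions."""
    def __init__(self, func, steps=10):
        super().__init__()
        self.func, self.steps = func, steps

    def forward(self, h):
        dt = 1.0 / self.steps
        t = 0.0
        for _ in range(self.steps):
            h = h + dt * self.func(t, h)  # Euler update: h <- h + dt * f(t, h)
            t += dt
        return h

block = ODEBlock(ODEFunc(dim=64))
out = block(torch.randn(8, 64))  # continuous transformation of an input batch
```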
Facial expressions play a fundamental role in human communication, and their study, a multidisciplinary subject, embraces a great variety of research fields, from psychology to computer science. In Deep Learning, the recognition of facial expressions is a task named Facial Expression Recognition (FER): the goal of a learning model is to classify human emotions starting from a facial image of a given subject. Typically, face images are acquired by cameras that have, by nature, different characteristics, such as the output resolution. Moreover, other circumstances might involve cameras placed far from the observed scene, thus yielding faces with very low resolutions. Therefore, since the FER task might involve analyzing face images acquired from heterogeneous sources, it is plausible to expect that resolution plays a vital role. In such a context, we propose a multi-resolution training approach to solve ...
Proceedings of the 1st International Workshop on Multimedia AI against Disinformation
Deepfake generation techniques are evolving at a rapid pace, making it possible to create realistic manipulated images and videos and endangering the serenity of modern society. The continual emergence of new and varied techniques brings with it a further problem to be faced, namely the ability of deepfake detection models to update themselves promptly in order to identify manipulations carried out using even the most recent methods. This is an extremely complex problem to solve, as training a model requires large amounts of data, which are difficult to obtain if the deepfake generation method is too recent. Moreover, continuously retraining a network would be unfeasible. In this paper, we ask whether, among the various deep learning techniques, there is one able to generalise the concept of deepfake to such an extent that it does not remain tied to the specific deepfake generation methods used in the training set. We compared a Vision Transformer with an EfficientNetV2 in a cross-forgery context based on the ForgeryNet dataset. From our experiments, it emerges that EfficientNetV2 has a greater tendency to specialize, often obtaining better results on the generation methods seen during training, while Vision Transformers exhibit a superior generalization ability that makes them more competent even on images generated with new methodologies. CCS CONCEPTS • Applied computing → Computer forensics; • Computing methodologies → Computer vision.
In recent years, Quantum Computing has witnessed massive improvements in terms of available resources and algorithm development. The ability to harness quantum phenomena to solve computational problems is a long-standing dream that has drawn the scientific community's interest since the late 1980s. In such a context, we propose our contribution. First, we introduce basic concepts related to quantum computations; then we explain the core functionalities of technologies that implement the Gate Model and Adiabatic Quantum Computing paradigms. Finally, we gather, compare and analyze the current state of the art concerning Quantum Perceptron and Quantum Neural Network implementations.
Space exploration has always been a source of inspiration for humankind, and thanks to modern telescopes, it is now possible to observe celestial bodies far away from us. With a growing number of real and imaginary images of space available on the web and exploiting modern Deep Learning architectures such as Generative Adversarial Networks, it is now possible to generate new representations of space. In this research, using a Lightweight GAN, a dataset of images obtained from the web, and the Galaxy Zoo Dataset, we have generated thousands of new images of celestial bodies, galaxies, and finally, by combining them, a wide view of the universe. The code for reproducing our results is publicly available at https://github.com/davide-coccomini/GAN-Universe, and the generated images can be explored at https://davide-coccomini.github.io/GANUniverse/.
A new approach for video-stream filtering that makes use of the features representing video content and exploits the properties of metric spaces can help reduce the filtering receiver's computational load.
Neural networks are said to be biologically inspired since they mimic the behavior of real neurons. However, several processes in state-of-the-art neural networks, including Deep Convolutional Neural Networks (DCNNs), are far from those found in animal brains. One relevant difference is the training process. In state-of-the-art artificial neural networks, the training process is based on backpropagation and Stochastic Gradient Descent (SGD) optimization. However, studies in neuroscience strongly suggest that this kind of process does not occur in the biological brain. Rather, learning methods based on Spike-Timing-Dependent Plasticity (STDP) or the Hebbian learning rule seem to be more plausible, according to neuroscientists. In this paper, we investigate the use of the Hebbian learning rule when training Deep Neural Networks for image classification by proposing a novel weight update rule for shared kernels in DCNNs. We perform experiments using the CIFAR-10 dataset in which we employ Hebbian learning, along with SGD, to train parts of the model or whole networks for the task of image classification, and we discuss their performance thoroughly, considering both effectiveness and efficiency aspects.
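To make the idea concrete, here is a generic Hebbian step for a shared convolutional kernel, sketched in PyTorch: each filter is reinforced in proportion to the correlation between its output activation (post-synaptic) and the input patch that produced it (pre-synaptic), averaged over all positions where the kernel is shared. This is a textbook Hebbian sketch under our own assumptions, not the paper's novel update rule.

```python
import torch
import torch.nn.functional as F

def hebbian_conv_update(weight, x, lr=1e-3):
    """One Hebbian step for a shared conv kernel (generic sketch, not the
    paper's exact rule). weight: (out_ch, in_ch, k, k); x: (B, in_ch, H, W)."""
    out_ch, in_ch, k, _ = weight.shape
    y = F.conv2d(x, weight)                 # post-synaptic activations (B, out_ch, H', W')
    patches = F.unfold(x, kernel_size=k)    # pre-synaptic patches (B, in_ch*k*k, L)
    y_flat = y.flatten(2)                   # (B, out_ch, L), L = H'*W'
    # Pre/post correlation, summed over batch and positions, then normalized.
    corr = torch.einsum('bol,bpl->op', y_flat, patches) / (x.size(0) * y_flat.size(2))
    return weight + lr * corr.view_as(weight)

w = torch.randn(16, 3, 5, 5)
w = hebbian_conv_update(w, torch.randn(8, 3, 32, 32))
```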
Face verification is a key task in many application fields, such as security and surveillance. Several approaches and methodologies are currently used to try to determine whether two faces belong to the same person. Among these, facial landmarks are very important in forensics, since the distance between some characteristic points of a face can be used as an objective measure in court during trials. However, the accuracy of approaches based on facial landmarks in verifying whether a face belongs to a given person is often not very good. Recently, deep learning approaches have been proposed to address the face verification problem, with very good results. In this paper, we compare the accuracy of facial landmarks and deep learning approaches in performing the face verification task. Our experiments, conducted on a real case scenario, show that the deep learning approach greatly outperforms the facial landmarks approach in accuracy. Keywords–Face Verification; Facial Landmarks;...
Deep-learning approaches in data-driven modeling rely on learning a finite number of transformations (and representations) of the data that are structured in a hierarchy and are often instantiated as deep neural networks (and their internal activations). State-of-the-art models for visual data usually implement deep residual learning: the network learns to predict a finite number of discrete updates that are applied to the internal network state to enrich it. Pushing the residual learning idea to the limit, ODE Net—a novel network formulation involving continuously evolving internal representations that gained the best paper award at NeurIPS 2018—has been recently proposed. Differently from traditional neural networks, in this model the dynamics of the internal states are defined by an ordinary differential equation with learnable parameters that defines a continuous transformation of the input representation. These representations can be computed using standard ODE solvers, and t...
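The limit alluded to above can be written explicitly. A standard derivation, consistent with the ODE Net formulation, with h_t the hidden state and f the learned update:

```latex
h_{t+1} = h_t + f(h_t; \theta)
\;\;\xrightarrow[\;\Delta t \to 0\;]{}\;\;
\frac{dh(t)}{dt} = f(h(t), t; \theta),
\qquad
h(T) = h(0) + \int_0^T f(h(t), t; \theta)\, dt .
```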
Relational reasoning is an emerging theme in Machine Learning in general and in Computer Vision in particular. DeepMind has recently proposed a module called Relation Network (RN) that has shown impressive results on visual question answering tasks. Unfortunately, the implementation of the proposed approach was not public. To reproduce their experiments and extend their approach in the context of Information Retrieval, we had to re-implement everything, testing many parameters and conducting many experiments. Our implementation is now public on GitHub and is already used by a large community of researchers. Furthermore, we recently presented a variant of the relation network module that we called Aggregated Visual Features RN (AVF-RN). This network can produce and aggregate, at inference time, compact visual relationship-aware features for the Relational-CBIR (R-CBIR) task. R-CBIR consists of retrieving images with given relationships among objects. In this paper, we discuss the d...
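For reference, the core RN computation from the original DeepMind paper is RN(O) = f_φ( Σ_{i,j} g_θ(o_i, o_j) ); a minimal PyTorch sketch follows, with illustrative dimensions and without the question-embedding conditioning used for VQA.

```python
import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    """Core Relation Network: apply g to every ordered pair of object
    features, sum the pair embeddings, and map the sum through f."""
    def __init__(self, obj_dim=32, hidden=128, out_dim=10):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * obj_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden), nn.ReLU())
        self.f = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                               nn.Linear(hidden, out_dim))

    def forward(self, objects):                        # objects: (B, N, obj_dim)
        B, N, D = objects.shape
        oi = objects.unsqueeze(2).expand(B, N, N, D)   # o_i broadcast over j
        oj = objects.unsqueeze(1).expand(B, N, N, D)   # o_j broadcast over i
        pairs = torch.cat([oi, oj], dim=-1).view(B, N * N, 2 * D)
        return self.f(self.g(pairs).sum(dim=1))        # sum over all pairs

rn = RelationNetwork()
logits = rn(torch.randn(4, 8, 32))  # 4 scenes, 8 objects each
```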
Soccer analytics is attracting increasing interest in academia and industry, thanks to the availability of data that describe all the spatio-temporal events that occur in each match. These events (e.g., passes, shots, fouls) are collected manually by human operators, constituting a considerable cost for data providers in terms of time and economic resources. In this paper, we describe PassNet, a method to recognize the most frequent events in soccer, i.e., passes, from video streams. Our model combines a set of artificial neural networks that perform feature extraction from video streams, object detection to identify the positions of the ball and the players, and classification of frame sequences as passes or not passes. We test PassNet on different scenarios, depending on the similarity of conditions to the match used for training. Our results show good classification results and significant improvement in the accuracy of pass detection with respect to baseline classifiers, even wh...
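PassNet's exact architecture is not reproduced here; as a hedged sketch of the final stage only, the snippet below classifies a window of per-frame feature vectors as pass / no pass with a small recurrent head, assuming feature extraction and ball/player detection happen upstream.

```python
import torch
import torch.nn as nn

class PassClassifier(nn.Module):
    """Classify a short window of per-frame features as pass / not-pass.
    A GRU summarizes the temporal dynamics of the window; all dimensions
    are illustrative assumptions, not PassNet's actual configuration."""
    def __init__(self, feat_dim=512, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)   # two classes: pass / no pass

    def forward(self, frame_feats):        # (B, T, feat_dim)
        _, h = self.rnn(frame_feats)       # h: (1, B, hidden), last hidden state
        return self.head(h.squeeze(0))

clf = PassClassifier()
scores = clf(torch.randn(4, 16, 512))      # 4 windows of 16 frames each
```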
Since the 1970s, Content-Based Image Indexing and Retrieval (CBIR) has been an active research area. Nowadays, the rapid increase of video data has paved the way for the advancement of technologies in many different communities for the creation of Content-Based Video Indexing and Retrieval (CBVIR). However, greater attention needs to be devoted to the development of effective tools for video search and browsing. In this paper, we present VISIONE, a system for large-scale video retrieval. The system integrates several content-based analysis and retrieval modules, including keyword search, spatial object-based search, and visual similarity search. In tests where users needed to find as many correct examples as possible, the similarity search proved to be the most promising option. Our implementation is based on state-of-the-art deep learning approaches for content analysis and leverages highly efficient indexing techniques to ensure scalability. Specificall...
New Trends in Image Analysis and Processing – ICIAP 2019, 2019
Convolutional neural networks have reached extremely high performance on the Face Recognition task. These models are commonly trained using high-resolution images, and for this reason their discrimination ability is usually degraded when they are tested against low-resolution images. Thus, Low-Resolution Face Recognition remains an open challenge for deep learning models. Such a scenario is of particular interest for surveillance systems, in which it usually happens that a low-resolution probe has to be matched with higher-resolution galleries. This task can be especially hard to accomplish, since the probe can have resolutions as low as 8, 16 and 24 pixels per side, while the typical input of state-of-the-art neural networks is 224 pixels per side. In this paper, we describe the training campaign we used to fine-tune a ResNet-50 architecture, with Squeeze-and-Excitation blocks, on the tasks of very-low- and mixed-resolution face recognition. For the training process we used the VGGFace2 dataset, and we then tested the performance of the final model on the IJB-B dataset; in particular, we tested the neural network on the 1:1 verification task. In our experiments we considered two different scenarios: 1) probe and gallery at the same resolution; 2) probe and gallery at mixed resolutions. Experimental results show that with our approach it is possible to improve upon state-of-the-art model performance on the low- and mixed-resolution face recognition tasks with a negligible loss at very high resolutions.
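A plausible ingredient of such a training campaign, sketched under our own assumptions (the paper's exact resolution schedule is not reproduced), is an augmentation that randomly downsamples a face crop to 8, 16 or 24 pixels per side and resizes it back to the 224-pixel network input:

```python
import random
from PIL import Image
from torchvision import transforms

class RandomDownUp:
    """Randomly downsample a face crop to a very low resolution and resize
    it back to the network input size, so the model sees mixed resolutions
    during training. Probability and resolution set are illustrative."""
    def __init__(self, sides=(8, 16, 24), out_size=224, p=0.5):
        self.sides, self.out_size, self.p = sides, out_size, p

    def __call__(self, img: Image.Image) -> Image.Image:
        if random.random() < self.p:
            side = random.choice(self.sides)
            img = img.resize((side, side), Image.BILINEAR)   # destroy detail
        return img.resize((self.out_size, self.out_size), Image.BILINEAR)

train_tf = transforms.Compose([RandomDownUp(), transforms.ToTensor()])
```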
Many approaches for approximate metric search rely on a permutation-based representation of the original data objects. The main advantage of transforming metric objects into permutations is that the latter can be efficiently indexed and searched using data structures such as inverted files and prefix trees. Typically, the permutation is obtained by ordering the identifiers of a set of pivots according to their distances from the object to be represented. In this paper, we present a novel approach to transform metric objects into permutations. It uses the object-pivot distances in combination with a metric transformation called the n-Simplex projection. The resulting permutation-based representation, named SPLX-Perm, is suitable only for the large class of metric spaces satisfying the n-point property. We tested the proposed approach on two benchmarks for similarity search. Our preliminary results are encouraging and open new perspectives for further investigations on the use of the n-Simplex projection for supporting permutation-based indexing.
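For context, the baseline pivot-permutation representation mentioned above can be computed in a few lines; this sketch illustrates the standard scheme, not SPLX-Perm itself, which first maps objects through the n-Simplex projection.

```python
import numpy as np

def permutation_repr(obj, pivots, dist=lambda a, b: np.linalg.norm(a - b)):
    """Pivot-permutation representation: rank pivot identifiers by their
    distance to the object. Two objects are deemed similar when their pivot
    rankings agree (e.g., under Spearman footrule distance)."""
    d = np.array([dist(obj, p) for p in pivots])
    return np.argsort(d)               # pivot identifiers, closest first

pivots = np.random.randn(16, 64)       # 16 pivots in a 64-d space
perm = permutation_repr(np.random.randn(64), pivots)
```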
Transformations of data objects into the Hamming space are often exploited to speed up similarity search in metric spaces. Techniques applicable in generic metric spaces require expensive learning, e.g., the selection of pivoting objects. However, when searching in a common Euclidean space, the best performance is usually achieved by transformations specifically designed for this space. We propose a novel transformation technique that provides a good trade-off between applicability and the quality of the space approximation. It uses the n-Simplex projection to transform metric objects into a low-dimensional Euclidean space, and then transforms this space into the Hamming space. We compare our approach theoretically and experimentally with several techniques for embedding metric spaces into the Hamming space. We focus on the applicability, the learning cost, and the quality of the search-space approximation.
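The paper's exact Euclidean-to-Hamming step is not reproduced here; as a generic stand-in for that second stage, sign-random-projection hashing binarizes vectors that already live in a low-dimensional Euclidean space (i.e., after an n-Simplex-style projection, which this sketch assumes has been applied):

```python
import numpy as np

def euclidean_to_hamming(X, n_bits=64, seed=0):
    """Binarize Euclidean vectors with random-hyperplane signs so that
    Hamming distance approximates angular proximity. Standard sign-random-
    projection hashing, named as a stand-in for the paper's transform."""
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((X.shape[1], n_bits))
    return (X @ H > 0).astype(np.uint8)     # one bit per hyperplane

codes = euclidean_to_hamming(np.random.randn(100, 16))
hamming = np.count_nonzero(codes[0] != codes[1])   # distance between two codes
```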
In this paper we tackle the problem of image search when the query is a short textual description of the image the user is looking for. We choose to implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation. Searching in the visual feature space has the advantage that any update to the translation model does not require reprocessing the typically huge image collection on which the search is performed. We propose Text2Vis, a neural network that generates a visual representation, in the fc6–fc7 visual feature space of an ImageNet-trained network, from a short descriptive text. Text2Vis optimizes two loss functions, using a stochastic loss-selection method. A visual-focused loss is aimed at learning the actual text-to-visual-feature mapping, while a text-focused loss is aimed at modeling the higher-level semantic concepts expressed in language and at countering the visual loss's tendency to overfit on non-relevant visual components. We report preliminary results on the MS-COCO dataset.
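A minimal sketch of the stochastic loss-selection idea, assuming a PyTorch training loop; the function and key names (visual_loss_fn, batch['visual_target'], the 0.5 sampling probability) are illustrative assumptions, not details from the paper.

```python
import random
import torch

def train_step(model, batch, opt, visual_loss_fn, text_loss_fn, p_visual=0.5):
    """Stochastic loss selection: at each step, sample one of the two
    objectives and optimize it alone. Names and probability are illustrative."""
    opt.zero_grad()
    pred = model(batch['text'])
    if random.random() < p_visual:
        loss = visual_loss_fn(pred, batch['visual_target'])  # text -> visual mapping
    else:
        loss = text_loss_fn(pred, batch['text_target'])      # semantic regularizer
    loss.backward()
    opt.step()
    return loss.item()
```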
Content-based image retrieval using Deep Learning has become very popular during the last few years. In this work, we propose an approach to indexing Deep Convolutional Neural Network features to support efficient retrieval on very large image databases. The idea is to provide a text encoding for these features, enabling the use of a text retrieval engine to perform image similarity search. In this way, we built LuQ, a robust retrieval system that combines full-text search with content-based image retrieval capabilities. In order to optimize the index occupation and the query response time, we evaluated various tuning parameters to generate the text encoding. To this end, we developed a web-based prototype to efficiently search through a dataset of 100 million images.
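The flavor of such a text encoding can be sketched as follows: each feature component is mapped to a synthetic term repeated proportionally to the component's magnitude, so a full-text engine scores images by term-frequency overlap. The quantization scale is one of the tuning parameters the abstract mentions; LuQ's exact settings are not reproduced here.

```python
import numpy as np

def feature_to_text(feat, scale=30):
    """Encode a deep-feature vector as a surrogate text document: component i
    becomes term 'fi', repeated round(feat[i] * scale) times. Negative
    components are dropped (e.g., assuming ReLU activations)."""
    feat = np.maximum(feat, 0)
    reps = np.round(feat * scale).astype(int)
    terms = []
    for i, r in enumerate(reps):
        terms.extend([f'f{i}'] * r)
    return ' '.join(terms)

doc = feature_to_text(np.random.rand(64))   # index this string with the text engine
```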
In this paper, we consider the task of recognizing epigraphs in images, such as photos taken using mobile devices. Given a set of 17,155 photos related to 14,560 epigraphs, we used a k-Nearest-Neighbor approach to perform the recognition. The contribution of this work lies in evaluating state-of-the-art visual object recognition techniques in this specific context. The experimental results show that the Vector of Locally Aggregated Descriptors (VLAD), obtained by aggregating SIFT descriptors, is the best choice for this task.
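For reference, the VLAD aggregation named above accumulates, per codeword, the residuals of the local descriptors assigned to it; a minimal NumPy sketch (normalization details such as power-law scaling vary by implementation and are simplified here):

```python
import numpy as np

def vlad(local_descs, codebook):
    """Minimal VLAD: assign each local descriptor (e.g., 128-d SIFT) to its
    nearest codeword, accumulate residuals, L2-normalize the concatenation."""
    k, d = codebook.shape
    v = np.zeros((k, d))
    dists = np.linalg.norm(local_descs[:, None, :] - codebook[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)                    # nearest codeword per descriptor
    for i, desc in enumerate(local_descs):
        v[nearest[i]] += desc - codebook[nearest[i]]  # residual accumulation
    v = v.flatten()
    return v / (np.linalg.norm(v) + 1e-12)

code = vlad(np.random.randn(200, 128), np.random.randn(64, 128))
```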
Proceedings of the 1st ACM International Conference on Multimedia Retrieval, ICMR'11, 2011
We present the VIsual Support to Interactive TOurism in Tuscany (VISITO Tuscany) project, which offers an interactive guide for tourists visiting cities of art, accessible via smartphones. The peculiarity of the system is that user interaction is mainly obtained through the use of images: to receive information on a particular monument, users just have to take a picture of it. VISITO Tuscany, using techniques of image analysis and content recognition, automatically recognizes the photographed monument, and pertinent information is displayed to the user. In this paper we illustrate how landmark recognition from mobile devices can provide tourists with relevant and customized information about various types of objects in cities of art.
27th Italian Symposium on Advanced Database Systems, 2019
Since the publication of AlexNet in 2012, Deep Convolutional Neural Network models have become the most promising and powerful technique for image representation. Specifically, the ability of their inner layers to extract high-level abstractions of the input images, called deep feature vectors, has been exploited. Such vectors live in a high-dimensional space in which an inner product, and thus a metric, is defined, which allows similarity measurements to be carried out among them. This property is particularly useful for accomplishing tasks such as Face Recognition. Indeed, to identify a person it is possible to compare deep features, used as face descriptors, from different identities by means of their similarities. Surveillance systems, among others, use this technique: deep features extracted from probe images are matched against a database of descriptors from known identities. A critical point is that the database typically contains features extracted from high-resolution images, while the probes, taken by surveillance cameras, can be at a very low resolution. Therefore, it is mandatory to have a neural network that is able to extract deep features that are robust with respect to resolution variations. In this paper we discuss a CNN-based pipeline that we built for the task of Face Recognition among images with different resolutions. The entire system relies on the ability of a CNN to extract deep features that can be used to perform a similarity search, in order to fulfill the face recognition task.
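The similarity-search step the pipeline relies on can be sketched in a few lines: L2-normalized deep features compared by cosine similarity against a gallery of known identities. The threshold value below is illustrative, not from the paper.

```python
import numpy as np

def identify(probe_feat, gallery_feats, gallery_ids, threshold=0.5):
    """Face identification by deep-feature similarity: normalize descriptors,
    score the probe against the gallery with cosine similarity, and return
    the best-matching identity if it clears an (illustrative) threshold."""
    p = probe_feat / np.linalg.norm(probe_feat)
    G = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = G @ p                           # one cosine similarity per gallery entry
    best = int(np.argmax(sims))
    if sims[best] >= threshold:
        return gallery_ids[best], float(sims[best])
    return None, float(sims[best])         # no identity confident enough

who, score = identify(np.random.rand(512), np.random.rand(100, 512), list(range(100)))
```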
Ital-IA - Convegno Nazionale CINI sull'Intelligenza Artificiale, 2019
Visits to museums or to places of interest in cities of art can be completely reinvented through modern, dynamic modes of fruition based on visual recognition and localization technologies, image search, and augmented-reality visualizations. For years, the AIMIR research group has carried out research on these topics, also holding leadership roles in national and international projects. This contribution summarizes some of the research activities carried out and the technologies used, as well as the participation in projects that have employed artificial intelligence technologies for the valorization and fruition of cultural heritage.
1 Introduction. The Artificial Intelligence for Multimedia Information Retrieval (AIMIR) research group studies artificial intelligence solutions for visual analysis, search and recognition in large-scale image databases, via mobile devices, information systems and multimedia search engines. In recent years, it has participated in numerous national and international projects in the Cultural Heritage field, developing systems that can automatically recognize, starting from an image, works of art such as paintings, statues, buildings and ancient inscriptions, perform large-scale visual searches on them, and display them in augmented reality. Consider, for example, the system http://art.isti.cnr.it/, capable of recognizing and providing information on more than 100 thousand paintings, or http://www.eagle-network.eu/image-search/, capable of visually recognizing ancient inscriptions in a database of more than one million images, even from mobile devices. The techniques developed take into account both accuracy and scalability concerns, guaranteeing systems with smooth, natural response times even in situations and contexts where the number of items to be recognized, visually localized and augmented is enormous, such as inside museums or in areas of interest of major cities of art (historic squares, cathedrals, etc.).
2 Scientific Activity. The scientific activity carried out by the AIMIR group exploits a synergy of image analysis techniques, deep learning, data structures and scalable similarity-search algorithms. The research prototypes developed have been successfully applied in the cultural heritage field, for example to recognize works of art or historic buildings, to access information in augmented reality, and to generate automatic descriptions of inadequately annotated digital material. In the field of visual recognition, approaches based on aggregations (e.g., BoW, VLAD, FV) of local image features (such as SIFT and ORB), features extracted from convolutional neural networks (CNN features), and hybrid approaches (such as the combination of FV with CNN features) have all been investigated. Hybrid approaches based on the combination of aggregations of local features and CNN features, for example, have shown high effectiveness in the recognition of ancient inscriptions [Amato et al., 2016b].
Approaches based on hand-crafted features and on deep learning have also been studied and used for automatic classification, image retrieval, visual localization and augmented-reality applications [Amato et al., 2015; Bolettieri et al., 2015; Amato et al., 2017b; Amato et al., 2017a]. Moreover, in order to perform visual searches even on databases of enormous size, innovative approximate similarity-search algorithms have been developed [Amato et al., 2014; Amato et al., 2016a; Amato et al., 2018].
3 Projects in the Cultural Heritage Field. In recent years, the AIMIR group has participated in numerous national and international projects on topics related to cultural heritage and to the analysis of image content for the automatic extraction of information enabling automatic description, recognition, classification, large-scale search, and augmented-reality access. Examples include: VISECH - Visual Engines for Cultural Heritage, a regional project that aims to advance the state of the art in automatic image analysis, developing visual recognition and localization techniques for augmented reality by means of highly scalable algorithms...
Ital-IA - Convegno Nazionale CINI sull'Intelligenza Artificiale, 2019
The widespread production of digital images and media has made it necessary to use automatic methods for their large-scale analysis and indexing. The AIMIR group of ISTI-CNR has specialized in this field for years and has embraced Deep Learning techniques based on artificial neural networks for multiple aspects of this discipline, such as the analysis, annotation and automatic description of visual content and its large-scale retrieval.
1 Scientific Activity. The Artificial Intelligence for Multimedia Information Retrieval (AIMIR) group of ISTI-CNR was historically born in a context of multimedia data management, and has therefore embraced modern AI techniques for modeling and representing such data, successfully applying them to multiple aspects of this discipline, in particular the large-scale management of perceptual visual data such as images and videos. Among the scientific activities pursued and the competences present in the group, the following stand out. Large-scale content-based image retrieval: given the amount of images produced daily by Web users, the development of automatic, scalable techniques for image understanding and retrieval is of vital importance. Exploiting data-driven deep modeling techniques such as Deep Learning, the group has specialized in the development and use of compact, effective vector representations of images extracted via convolutional neural networks (Deep Features, R-MAC). The adoption of this kind of representation has enabled the development of indexing and visual-similarity search techniques for unlabeled images with a high degree of scalability (on the order of hundreds of millions of images, see http://mifile.deepfeatures.org/), while maintaining a high level of accuracy of the search results [Amato et al., 2016a]. In this context, research activities have been carried out on the transformation of such representations through the use of permutations [Amato et al., 2014; Amato et al., 2016b] and geometric transformations [Amato et al., 2018a; Amato et al., 2018b] to facilitate their indexing. The transformations introduced make it possible to use surrogate textual representations of the visual descriptors, and therefore to employ open-source inverted-index engines traditionally used for textual documents (e.g., Elasticsearch, Apache Lucene) for the management of image databases, favoring the technology transfer of these techniques [Amato et al., 2017]. Moreover, thanks to the flexibility of deep neural networks, image retrieval techniques have been developed that address advanced problems in this discipline, such as cross-media retrieval [Carrara et al., 2017], i.e., the retrieval of unlabeled images starting from a textual description, and relational content-based image retrieval [Messina et al., 2018], where the goal is to retrieve images depicting objects with precise spatial or semantic relationships among them.
Visual sentiment analysis: in the context of the analysis of data coming from social media, the group has developed state-of-the-art competences and techniques in visual sentiment analysis [Vadicamo et al., 2017], i.e., the analysis of the sentiment conveyed by visual media, through the use of convolutional neural networks. Cross-media training techniques have been developed that exploit the large amount of noisy data coming from social media (Twitter in particular) to train state-of-the-art visual sentiment classification models without incurring labeling or training-dataset creation costs. Video-browsing systems: combining the competences listed above, the group has carried out research and development of tools for interactive large-scale video search, participating in the Video Browser Showdown competition (VBS 2019) with the VISIONE system [Amato et al., 2019]. The system integrates modules for the analysis, annotation and retrieval of visual content based on state-of-the-art deep learning techniques, and provides multiple search modes, such as search by visual similarity, by spatial location of objects, or by simple textual keywords. All the information resulting from the analyses is encoded through surrogate textual representations and indexed with performant, scalable full-text search engines.
Ital-IA - Convegno Nazionale CINI sull'Intelligenza Artificiale, 2019
In recent years, Cyber Security has acquired an increasingly broad connotation, going beyond the notion of mere computer-system security and also encompassing surveillance and security in a broad sense, exploiting the latest technologies such as artificial intelligence. This contribution presents the main research activities and some of the technologies used and developed by the AIMIR research group of ISTI-CNR, and provides an overview of the research projects, both past and currently active, in which these artificial intelligence technologies are used for the development of Cyber Security applications and services.
1 Scientific Activity. The AIMIR (Artificial Intelligence for Multimedia Information Retrieval) group is a research group of ISTI-CNR that is very active in the field of Cyber Security and has acquired considerable experience in the use of artificial intelligence techniques applied to the development of Cyber Security applications and services. In particular, convolutional neural networks (CNNs) trained with a Deep Learning approach are exploited, and in some cases developed, both for the direct classification of media and for the extraction of visual descriptors (features) from the analyzed media. These visual descriptors are then used to perform search, classification and recognition, even at large scale (for example, content searches over millions of images), using indexing techniques and classifiers. The group has developed applications and services in several fields: parking monitoring, facial recognition, person re-identification, intrusion detection, and public safety. In the following section, the research projects in which the group has been or is currently involved are briefly presented.
2 Projects. Renewable Energy Sources and ICT for Energy Sustainability is a national project funded by CNR in the context of smart cities and the use of ICT for energy saving. [Figure 1: visual monitoring of the occupancy status of the parking stalls of the Pisa research area via the Smart Parking application.] Within this project, the AIMIR group developed two applications: Smart Parking and Smart Surveillance. The Smart Parking application [Amato et al., 2016] [Amato et al., 2017] [Ciampi et al., 2018] exploits smart cameras (i.e., cameras endowed with on-board image-analysis capabilities), on which a CNN has been installed, to visually monitor the occupancy status of the parking lot of the Pisa research area (see Figure 1). Both the smart cameras and the neural network installed on board were built by the AIMIR group. An important point is that all processing is performed on board the camera: the only information transmitted outside, both for data-traffic and privacy reasons, is the textual information on the occupancy status of the individual stalls. The Smart Surveillance application [Amato et al., 2018a] [Barsocchi et al., 2018] [Kavalionak et al., 2018] is a video surveillance system capable of detecting, via facial recognition, intrusions of unauthorized people into monitored environments.
The system exploits cameras installed inside some offices of the Pisa research area and a CNN pre-trained to recognize faces with techniques...
Local descriptors are the state of the art for representing low-level visual information in object recognition. Because of their effectiveness, they are also largely used in content-based image retrieval whenever the query visually expresses a specific object to be retrieved among the images in the archive. Given that searching over local descriptors can be very costly, many recent works have proposed encoding the local descriptors in a compact representation. In this paper, we propose to embed the aggregated information into the local descriptors in order to achieve higher effectiveness. The experimental results, obtained on a widely used public dataset, reveal the potential of the approach. Even if we only tested our approach in a content-based image retrieval scenario, the idea of combining aggregated and local information is general and could be applied to other similarity search tasks. We call the proposed approach bifocal searching because of its similarity with bifocal eyeglasses, which have two parts with different focal lengths.
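A hedged sketch of the two-focal-length idea: a cheap aggregated descriptor shortlists candidates, and the costly local-descriptor comparison re-ranks only the shortlist. The database layout and the local-matching score below are our own illustrative assumptions; the paper's actual contribution, embedding aggregated information into the local descriptors themselves, is not reproduced.

```python
import numpy as np

def bifocal_search(query_agg, query_locals, db, k=100, top=10):
    """Two-stage retrieval: coarse focus ranks by aggregated-descriptor
    similarity; fine focus re-ranks the shortlist with local matching."""
    # Stage 1 (coarse focus): shortlist k candidates by aggregate similarity.
    agg = np.stack([item['agg'] for item in db])
    coarse = (agg @ query_agg).argsort()[::-1][:k]

    # Stage 2 (fine focus): mean best-match similarity over local descriptors.
    def local_score(a, b):
        sims = a @ b.T                       # pairwise descriptor similarities
        return sims.max(axis=1).mean()

    scored = [(i, local_score(query_locals, db[i]['locals'])) for i in coarse]
    return sorted(scored, key=lambda s: -s[1])[:top]

db = [{'agg': np.random.rand(64), 'locals': np.random.rand(50, 128)} for _ in range(500)]
results = bifocal_search(np.random.rand(64), np.random.rand(40, 128), db)
```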