research-article

Open access

Visualization and Visual Analytics Approaches for Image and Video Datasets: A Survey

Authors:

Shehzad Afzal,

Sohaib Ghani,

Mohamad Mazen Hittawe,

Sheikh Faisal Rashid,

Omar M. Knio,

Markus Hadwiger,

Ibrahim HoteitAuthors Info & Claims

ACM Transactions on Interactive Intelligent Systems, Volume 13, Issue 1

Article No.: 5, Pages 1 - 41

https://doi.org/10.1145/3576935

Published: 09 March 2023 Publication History

All formats PDF

Abstract

Image and video data analysis has become an increasingly important research area with applications in different domains such as security surveillance, healthcare, augmented and virtual reality, video and image editing, activity analysis and recognition, synthetic content generation, distance education, telepresence, remote sensing, sports analytics, art, non-photorealistic rendering, search engines, and social media. Recent advances in Artificial Intelligence (AI) and particularly deep learning have sparked new research challenges and led to significant advancements, especially in image and video analysis. These advancements have also resulted in significant research and development in other areas such as visualization and visual analytics, and have created new opportunities for future lines of research. In this survey article, we present the current state of the art at the intersection of visualization and visual analytics, and image and video data analysis. We categorize the visualization articles included in our survey based on different taxonomies used in visualization and visual analytics research. We review these articles in terms of task requirements, tools, datasets, and application areas. We also discuss insights based on our survey results, trends and patterns, the current focus of visualization research, and opportunities for future research.

1 Introduction

Image and video analysis research has significantly advanced in recent years due to massive growth in artificial intelligence (AI), especially in the field of deep learning. Deep learning-based frameworks have revolutionized the field of computer vision [74]. Considering the prevalence of image and video data in our daily lives, and the availability of high-performance computing infrastructure, intense research interest surrounds the field of computer vision and other related areas. Consequently, significant visualization research is being conducted in problem areas encountered when working with image and video datasets, such as medical datasets [39, 67, 141, 173], sports analytics [7, 22, 112, 114, 117, 118, 154, 169, 170], video and image editing and processing [14, 66, 72, 159, 171], video surveillance, activity and scene recognition, motion flow analysis, content analysis [3, 21, 24, 41, 60, 89, 95, 122, 160, 167], video and image search [11, 75], annotations, content summarization and synthesis [6, 42, 70, 104, 109], social media analytics [59], non-photorealistic rendering, art, video painting [45, 52, 61, 178, 178], and virtual and augmented reality [20, 106, 121, 151].

Advances in AI have significantly improved the state of the art in computer vision, but have simultaneously posed new challenges for visualization and visual analytics researchers to develop new visualization techniques and frameworks to address these challenges. This mandates a systematic review of the current state of the art at the intersection of image and video analysis and visualization research to identify the gaps that exist between the requirements of the two domains, and explore opportunities for future research.

For this purpose, we have gathered a multidisciplinary team of coauthors with visualization, machine learning, video processing, and computer vision backgrounds. To the best of our knowledge, no other recent relevant surveys have been published on the visualization and visual analytics approaches used for image and video datasets.

In the initial phase, we reviewed articles related to image and video analysis published in visualization conferences and journals. We categorized these research articles based on various taxonomies that focus on different aspects of visualization research, such as interaction, visualization, machine learning methods, data scale, and application areas. Categorizing and labeling the visualization research facilitated the study of current trends and patterns and helped identify gaps between the task requirements of the computer vision domain and the current focus of visualization research.

The major contributions of our survey article are

—

Categorization of articles in the visualization domain related to image and video datasets based on standard taxonomies. Extracted task requirements from these surveyed articles and grouped them into different application areas

—

Identification of tools, libraries, and datasets used in visualization research focused on image and video datasets and grouping them based on application areas

—

Identification of gaps, challenges, and opportunities for future collaborative research at the intersection of the computer vision and visualization domains

2 Survey Design

In this section, we present details of the survey organization, methodology, and motivation for the work.

2.1 Survey Organization

In this survey report, we first provide the scope of the work. In Section 2.3, we introduce the topic and motivation for conducting this survey. We highlight the advances in AI and deep learning that have triggered rapid advancements in computer vision, and how these advances necessitate this current survey on visualization and the visual analytics domain. We also discuss the relevant existing surveys, and how our state-of-the-art report is different from these existing surveys. We then describe our survey methodology (Section 2.4), including details of the keywords used to search for articles, the selection criteria for conferences and journals, the ranking of articles, the compilation of a final set of articles, and the scope of our survey.

In Section 3, we review the relevant articles in the visualization domain, and categorize those articles based on different taxonomies. These taxonomies include algorithms and techniques (Section 3.1), visualization techniques (Section 3.2), and application areas (Section 3.4). We also provide tables (Table 1a, 1b, and 1c) showing how surveyed articles in the visualization domain are labeled according to these taxonomies. We also provide additional insights into the high-level representative task requirements, tools, libraries, and datasets grouped based on application areas.

In Section 4, we provide details of discussions with computer vision domain experts concerning their domain specific visualization requirements for image and video analysis. In the final sections, we present the current trends and patterns, gaps in the current research, major challenges, and a discussion on future research directions. Finally, we outline some limitations of our work.

2.2 Scope

In this survey report, we aim at identifying overlaps and gaps in computer vision and visualization research focused on image and video datasets, and identify potential areas of collaborative research. We catalogue the current research occurring at the intersection of these domains in terms of techniques and algorithms, tools and libraries, datasets, application areas, and task requirements. The results and findings are summarized in the form of tables and discussions that aim at facilitating researchers working in relevant research areas in terms of guidance and identifying potential future research areas.

The tables presented throughout the article serve as a reference to highlight current trends, identify gaps in research, tools and techniques in use, and application areas that are less actively explored at the intersection of computer vision and visualization. These tables are also a reference guide for researchers to different visualization techniques for various computer vision problems.

This work can assist visualization researchers working on image and video datasets to obtain insights into the nature of task requirements of different application areas, existing tools and techniques, and research gaps. Discussions with computer vision domain experts provide additional insights based on their own experience.

This survey report can also aid computer vision researchers in understanding which type of interactive visualization solutions have already been designed for various computer vision problems and identify opportunities for collaborative research efforts.

2.3 Motivation

Visualization and visual analytics tools are used in image and video analysis in different application areas such as video surveillance, activity recognition, human motion analysis and recognition, scene interpretation, video and image editing, sports analytics, medical imaging analysis, and social media analytics. Due to advancements in the AI domain and the availability of high-performance computing infrastructure, computer vision techniques and algorithms have significantly evolved in the last few years. This, in turn, presents significant challenges, relating to different aspects of visualization design, for visualization and visual analytics researchers to address different task requirements for analyzing an image and video datasets. To identify the gaps between visualization research and the current focus of computer vision research across different dimensions, visualization literature should be explored using different visualization taxonomies. Existing visual analytics models and frameworks [125, 126] must be evaluated in the context of task requirements for analyzing an image and video datasets in the wake of recent rapid growth in computer vision. Clearly, there is a need to evaluate the current state-of-the-art in visualization research.

There are some relevant surveys available in the literature [18, 19, 71, 155, 168, 179]. ML4VIS [155] focused on understanding the current practices of employing machine learning techniques to solve visualization problems. This survey explores the relevant research to determine which visualization processes can benefit from machine learning. Moreover, how machine learning techniques can be employed for visualization problems. AI4VIS [168] explored the vision of considering visualization as a new data format (visualization data) and reviewed recent advances in applying AI techniques to this data format. The ML4VIS and AI4VIS surveys probed different sets of questions with a different focus than our survey. Yuan et al. [179] categorized visual analytics techniques based on their usage before, during, and after model building in machine learning applications. Our survey, instead, focuses specifically on image and video datasets. Borgo et al. [18, 19] conducted a survey on video-based graphics and video visualization; however, this survey was conducted almost ten years ago and did not cover recent advancements in the computer vision and visualization domains. Kyprianidis et al. [71] conducted a survey in 2012 focusing on non-photorealistic rendering (NPR) techniques to transform images and videos into artistically styled renderings. Dudley and Kristensson [38] presented a review of user interface research to design effective interfaces for interactive machine learning algorithms. There are some other survey articles [28, 54, 85, 98, 132, 181] related to machine learning models analytics and visualization, but they have primarily not considered image and video datasets.

Some other related surveys also exist, but they focus on computer vision and are not relevant to visualization. Khurana and Kushwaha [65] surveyed literature related to human activity recognition in surveillance videos. Shih [137] reviewed research focused on content-aware video analysis for sports videos rather than from a spatiotemporal viewpoint. Wang and Ji [156] conducted a survey on effective video content analysis based on direct and implicit approaches. Other notable surveys, that are not directly relevant to our survey, include [17, 36, 79, 84, 99, 102, 111, 146, 150, 176]. In this work, we provide detailed coverage of visualization and visual analytics approaches for image and video datasets. We also identify the major challenges and future research directions.

Fig. 1.

Fig. 2.

2.4 Methodology

Figure 1 shows the flowchart of the methodology we employed in this work. We initiated our survey by searching for articles from major visualization and visual analytics conferences and journals with the keywords “image”, “images”, “video”, or “videos” in their titles or abstracts. Our search included IEEE Transactions on Visualization and Computer Graphics (TVCGs), EG & VGTC Conference on Visualization (EuroVis), IEEE Visual Analytics Science & Technology (VAST), Computer Graphics Forum (CGF), IEEE Scientific Visualization (SciVis), IEEE Symposium on Information Visualization (InfoVis), Computer Graphics Applications (CG&A), IEEE Large Scale Data Analysis & Visualization (LDAV), and IEEE Pacific Visualization Symposium (PacificVis).

We collected around 150 articles and studied their titles and abstracts. Some non-relevant articles were filtered out after reading their full text. In total, we compiled 107 relevant articles. Figure 2 shows the temporal distribution of the surveyed articles according to the publishing year. It is clear from the distribution that image and video data-related problems are becoming more common due to advancements in AI and computer vision. We labeled the algorithms, visualization techniques, application areas, datasets, and so on, used in these articles, as shown in (Table 1a, 1b, and 1c). We present details of this coding along with a detailed discussion of the trends, lessons learned, and future research challenges.

The scope of this survey article was limited to articles published in major visualization conferences and journals. We did not search for any relevant visualization work for image and video data in other areas, such as big data, parallel computing, high-performance computing, and social media. There are also articles focused on interactive visualizations of deep learning networks [50, 62, 124, 144], but we considered visualization domain articles focusing only on image and video datasets.

Table 1a.

Table 1b.

Table 1c.

3 Image and Video Data Research in the Visualization Domain

In this section, we review articles in the visualization domain related to image and video datasets, and categorize them based on various taxonomies such as algorithms and techniques, visualization techniques, interaction methods, and application areas. The categorization of these articles is summarized in Table 1a, 1b, and 1c. In these tables, articles are grouped based on the year of publication, providing an overview of the work done over the years. However, within each year, the articles are not listed in any particular order as we do not think ordering is essential within a year. This table provides insights into the current coverage of these taxonomies in research relevant to image and video datasets in the visualization domain. We also discuss data types, scale, and dimensions of data used in research. Furthermore, we also identify higher-level representative task requirements based on application areas. Lastly, we provide details of the tools and libraries used in the visualization domain in the surveyed collection of articles.

3.1 Automated Techniques and Algorithms

In this section, we review the techniques and algorithms related to machine learning, statistics, and computer vision used in the surveyed articles. We grouped the algorithms and techniques into high-level categories, as shown in Table 1a, 1b, and 1c. We adopted an initial high-level categorization based on the taxonomy proposed by Patgiri [110] and adjusted these categories while reviewing the techniques and algorithms used in the surveyed articles and eventually merged similar ones.

3.1.1 Dimensionality Reduction.

Visualization of high-dimensional data is often facilitated by applying different dimensionality reduction methods by converting high-dimensional data into lower meaningful dimensions [25, 94, 127, 140, 152, 153]. Turban et al. [151] used principal component analysis (PCA) to analyze the impact of criteria correlations on the distribution of the data, and to reduce the high-dimensional data in the video dataset. Herman et al. [53] used PCA spaces to make comparisons between different models, and support methods such as PCA navigation or browsing.

3.1.2 Regression.

Identifying the multivariate relationship between data variables is one of the critical challenges in data analysis, especially when the number of variables considered is large. Regression methods coupled with visual analytics can facilitate causal analysis and can be applied in areas such as sports analytics. ForVizor system [170] facilitates the analysis of dynamic changes in player formations during soccer matches under varying temporal and spatial scenarios. Utilizing the least-squares method, the system acquires player positions by mapping the tracking result from the real world to a 2D plane.

3.1.3 Clustering.

Clustering is frequently used in visualization research to group items together based on their similarities [8, 60, 123, 151]. The effectiveness of results obtained by clustering methods varies depending on the intended usage and application, which has given rise to the development of different clustering algorithms. K-means is considered to be one of the most popular clustering algorithms used to find the centroid of clusters, where the number of clusters is represented by K [63]. In [151], an algorithm based on spatial prediction, pyramidal computation, and human vision characteristics was proposed for peripheral extension of existing video (movie) content. The authors used the K-means clustering algorithm to cluster each set of data scores (Enjoyment, Comfort, Consistency, Presence, and Emotion), where each score is assigned by a user to one video only.

Hierarchical clustering, which builds a tree-like clustering structure and maintains the relationships between different clusters, is used in many of the surveyed articles related to visualization [33, 67, 75, 119, 128, 130]. Schultz and Kindlmann [130] used adapting spectral clustering for specific image analysis tasks, which involved exploring hierarchical, spectral embeddings, and tuning parameters. Their work focused on 3D medical image analysis and they proposed a framework that maps a spectral clustering-based high-dimensional features space to a three-dimensional data space.

3.1.4 Correlation.

Correlation is a task of association and quantification of the strength and direction of the relationship between numeric variables. Botchen et al. [21] proposed a technique to dynamically detect events and activities in video by converting it into a series of snapshots. A human figure is tracked in the input video to generate spatiotemporal movement data. An optical flow descriptor is then used to characterize the motions of different body parts. Spatiotemporal cross-correlations are computed to find the similarities between the motion descriptor and the stored database of action fragments.

3.1.5 Machine Learning Methods.

With the advancements in machine learning, and especially deep learning, more advanced libraries and frameworks are being developed. Table 1a, 1b, and 1c show an increasing trend in the use of deep-learning methods in visualization research focused on image and video datasets. These results accord with discussions with the domain experts, who also emphasized that there will be an increasing focus on deep-learning methods due to advancements in computation resources.

Serrano et al. [134] proposed a technique to generate a 3D visualization from a 2D visualization, which is carried out by extracting per-frame depth map information from the video data using a convolutional neural network (CNN). VC-Net [162] utilized an end-to-end CNN-based framework to segment and visualize 3D sparse microvascular structures by leveraging information from maximum intensity projection (MIP).

Zeng et al. [182] implemented a visual analytics system to generate emotional summaries from classroom videos that uses an adapted CNN model to recognize facial expressions. Facetto [69] system integrates a CNN-based framework for cell classification that supports semi-automated analysis of high-dimensional multi-channel images in cancer studies. Zhu et al. [185] used a series of CNN-based networks to synthesize scale- and space-continuous satellite images conditioned on cartographic data.

Since processing image and video datasets is often intended to analyze content and summarize information [172], deep-learning techniques such as CNN are more commonly utilized than other deep-learning methods, as they are focused on analyzing the spatial content based on convolutional networks. In datasets and problems where analyzing temporal components of the dataset is desired, such as analyzing speech patterns, techniques such as recurrent neural network (RNN) are generally utilized. Different variants of deep-learning frameworks are used in practice. A combination of CNN and RNN was utilized by Bi et al. [13] to learn patterns of vehicle trajectories at intersections, which supports vehicle editing and generation of new simulations. Zhang et al. [184] used BASNet [120], a deep neural network-based architecture composed of encoder-decoder framework and a residual refinement module, to embed information into visualization images.

Gaining insights and understanding the internal decision process is important to trust the machine learning-based methods. Dmitriev et al. [37] utilized a visual analytics-based approach to explain the rationale behind a computer-aided diagnosis (CAD) for pancreatic lesions based on random forest (RF) and CNN components. This rationale is based on the visual analysis of these individual components.

3.1.6 Optimization.

In the surveyed articles, certain problems are modeled as optimization problems. These include affine scaling and reduction methods for iterative non-linear optimization [149], unconstrained minimization [8], and contour optimization [82]. There are also some general non-convex quadratic programming problems. Liao et al. [81] used a combination of motion analysis with user interaction to convert videos into stereoscopic videos. The authors recover dense depth-maps information by analyzing the optical flow for all frames and utilizing a quadratic programming technique to recover both quantitative and qualitative depth information.

3.1.7 Moving Average.

The moving average, also called the “rolling mean” or “moving mean”, is a well-known statistical technique to analyze data points. Different variants of the moving average are commonly used. These variants include the exponential moving average [21], the weighted moving average [25], and the simple moving average [142]. Stein et al. [142] designed an interactive system for extracting player movements and visualizing their trajectories. If an individual player’s position is incorrectly detected, it may lead to incorrect detection of the other players’ movements. The authors used a moving average filter to overcome the scenario whenever a transformed player’s position differs from the actual position.

3.1.8 Data Aggregations.

Data aggregation is a data-processing technique that gathers and compiles data in a summarized format, which can then be used for further statistical analysis or in visualization representations [122, 128, 143]. Viz-A-Viz [122] is a video analytics tool to analyze activities in videos, generating spatial, temporal, and semantic aggregations based on an activities dataset. Statistical tools like histograms are often used to aggregate data by showing the distribution of the data. A color histogram is used to represent the distribution of the colors in image and video datasets [48, 49, 142, 170]. Vian [49], a visual film annotation system, uses a 3D space-filling curve to map a color histogram to 1D feature vectors.

3.1.9 Others.

In addition to techniques and algorithms related to machine learning and statistics, the surveyed articles also employed other computer vision methods. We discuss some of these methods here.

Segmentation [42, 128] techniques are used to subdivide images into segments or parts based on different criteria, and those parts are then used in later stages of the image or video processing pipeline. There are different segmentation techniques, including color segmentation [128], graph-based segmentation [83], and semantic segmentation [72].

Multiview stereo (MVS) enables reconstruction of 3D scenes from multiple calibrated static images captured from varying view points for the same scene. Liu et al. [86] adapted an MVS algorithm to utilize point cloud for 3D reconstruction. Optical Flow encodes the pattern of movement of different objects contained within a set of images caused by the movement of the objects or the observer, and its applications include finding similarities between videos [8] and improving segmentation results [83]. Motion Estimation calculates the motion vectors in videos or a sequence of images. Sunkavalli et al. [148] utilizes affine transformations to represent camera motion for videos, which is also a form of motion estimation.

Fig. 3.

3.2 Visualization Techniques

Various visualization techniques have been used by researchers to analyze image and video datasets. Since these datasets are often complex with a large number of features, multiple coordinated visualization techniques are often used for interactive visualization [23, 25, 33, 167].

In this work, we use the taxonomy introduced by Keim [64] and Ko [68] for standard visualization techniques. In this section, we briefly review each type with examples from our surveyed articles. Table 1a, 1b, and 1c also demonstrate the categorization of visualization techniques for our surveyed articles.

3.2.1 2D Techniques.

2D techniques are among the most common standard visualization techniques used to visualize various features of image and video data. These techniques include 2D charts such as pie charts, bar charts, and histograms. These charts are also often used to show various analytics associated with the data.

In the design of an interactive tool for analyzing multiple TED talks, Wu and Qu [167] used multiple 2D plots, such as stacked line charts, sankey diagrams, and word clouds (Figure 3). For the analysis of snooker game data, Parry et al. [109] also used bar charts and line charts to display temporal information. The Motion Browser [24] tool extensively uses line charts to analyze patients’ therapy data (Figure 11). Animation can be used to show temporal changes; for example, Lobo et al.[87] efficiently demonstrated the use of animation to show temporal changes in satellite images. Overall, 2D techniques are commonly used in designing tools for visualizing image and video data [25, 117, 118, 139, 161, 169].

3.2.2 3D Techniques.

With the advancement of immersive technology, 3D visualization techniques are becoming more important. Researchers often visualize image and video datasets in 3D to show them in immersive environments, such as head mounted displays [134]. Various research works have also been proposed to convert 2D visualization into 3D visualization by extracting depth and motion information from the video data [81, 86, 134].

Parry et al. [109] introduced a 3D visualization template for summarizing events from a snooker game. Meghdadi and Irani [95] presented a 3D space-time cube for visualizing movement trajectory data. Semmo and Döllner [133] used texture mapping for rendering 3D scenes, also allowed users to interactively apply various image filters. Volume rendering is a common 3D visualization technique that is often used for visualizing medical images [16, 53, 141]. Weis et al. [164] used deep learning-based architecture to upscale the resolution of isosurfaces to higher resolution. Nguyen et al. [103] proposed a novel technique for construction of 3D mesoscale biological models by extracting statistical and spatial properties from 2D microscopy scans, and internal assembly through an interactive 3D rule specification.

Generally, visualization researchers are cautious to use 3D visualization techniques because of the inherent problems of perception and occlusion [100]. Nevertheless, 3D visualization techniques for imaging and video data are becoming more important due to the advancement and availability of immersive environments [93].

3.2.3 Geometrically Transformed Displays.

Multidimensional datasets often use geometrically transformed display techniques to visualize multiple dimensions of data. These techniques are often based on dimensionality reduction principles [153] for meaningful representations of data. Common examples of this category are scatterplots and parallel coordinates.

Höferlin et al. [57] used a cascaded scatter plot to show the quality of trained classifiers used in interactive learning by involving human experts for video visual analytics (Figure 8). In the design of GazeDx [141], the authors used scatterplot matrices for the comparison of gaze analysis data for medical images.

Parallel coordinates uses multiple parallel axes to display multidimensional data on a 2D display. Legg et al. [75] utilized parallel coordinates along with other visualization views in the design of a visual analytics system that supports sketch-based search for rugby videos. They used parallel coordinates to show similarity metrics data for each frame of the video. The PeakVizor [25] tool, which analyzes clickstreams data in Massive Open Online Courses (MOOCs) to understand the behaviors of learners, also used parallel coordinates in their design for the correlation analysis of different learner groups (Figure 7).

3.2.4 Iconic Displays.

Iconic displays are another useful technique that encodes features and attributes of data in terms of color, shape, or glyphs. Various custom glyphs-based visualizations have been designed by researchers while visualizing various features of image and video datasets [30, 39].

Chen et al. [25] introduced treemap-based glyphs to show peaks in the clickstream data of learners accessing open online courses (Figure 7). While designing a visual analytics tool to support multimodal analysis of TED talks, Wu and Qu also used a novel treemap, radar chart, and nested pie chart-based design for glyphs to encode various attributes of presentation techniques used by speakers [167] (Figure 3). In the design of the TenniVis tool, the authors also presented various novel glyphs for displaying multiple attributes of tennis-match data [118].

3.2.5 Dense Pixel Displays.

This technique encodes each pixel on the display with a particular data attribute, and can visualize the largest possible amount of data on the viewport because it can use one pixel for each data point. Researchers often use various configurations or stacking of pixels to display different information via this technique. Grid or radial layouts are the most common configurations of the pixels layout for this category.

Wu and Qu [167] presented a matrix-based grid layout color-coded view for comparing various clusters while analyzing various TED talks (Figure 3). Wu et al. [169] also used a matrix-based grid layout for the interactive analysis of table tennis game data. In the design of AnaFe [46], the authors also used a heatmap-style color-coded grid visualization for showing temporal changes in the feature progression, while performing image analysis. Overall, this visualization category is not extensively used for the visualization of image and video datasets.

3.2.6 Stacked Displays.

This category of the visualization technique is used to display hierarchical data. Researchers often use various styles of partitioning to represent the hierarchical structure of data [64]. A common example of this category is treemaps, which often encode hierarchical information by using nested rectangles. Jang et al. [60] used treemap visualization to display various motions in their design of the tool for analyzing human-motion data. Pretorius et al. [119] presented a tree-based node-link visualization for depicting the clustering hierarchy of input parameter space while performing interactive image analysis. Like dense pixel displays, this category is also not frequently used in visualizing image and video datasets.

3.3 Interaction Methods

We reviewed the articles included in the survey based on the taxonomy of interaction methods proposed by Yi et al. [177], using the interaction categories “Select”, “Explore”, “Reconfigure”, “Encode”, “Abstract”, “Filter”, and “Connect”. In the surveyed articles, basic interaction techniques such as “Select” and “Explore” were generally used. The experts mentioned, in the discussions, their desire to have an “overview, details-on-demand” (R22) functionality in deep-learning frameworks for computer vision applications. They explained that this type of functionality can improve the understanding of datasets, and can help in different preprocessing tasks. Querying and filtering was very common in articles in research areas such as sports analytics [118], activity analysis [60], and medical applications [24] (R23). “Connect” is often used in image and video data visualizations as these visualizations generally consist of multiple coordinated views.

A few visualization tools exist that include detailed interaction support. The ARIES [34] system is designed to explore interactive image processing, exploration, and manipulation. Xie et al. [172] designed a visual analytics system that has comprehensive interaction support for semantic-based image analysis. The DeHumor [157] system provides multiple linked views to support exploration of multimodal humor features at multiple levels of detail to facilitate analyzing human behavior. There are a few further works that have more limited interaction support [163].

Interaction methods can be useful in deep-learning applications as they can provide insights into the internal model structure, which is usually a black-box in computer vision applications. Shepherd is another interaction method [5, 90] that either implicitly (indirect shepherding) or explicitly (direct shepherding) helps the user to optimize the modeling process. Model selection or setting model parameters is an example of direct shepherding, whereas setting soft or hard constraints through a visual interface, such as defining distance thresholds, is an example of indirect shepherding. This method can be adapted for deep-learning techniques, we did not find any example of this interaction technique in the surveyed articles. The What-If Tool [166] is an open-source tool that enables interactive probing of machine learning models to understand their behavior. Users can evaluate the model’s performance by creating hypothetical scenarios, performing intersectional analysis, supporting flexible visualization of input data, and easily switching views. Interaction methods can facilitate understanding transfer learning between deep-learning models when training and adapting models for new tasks. Ma et al. [92] implemented a visual analytics application that helps users understand transfer learning at multiple levels (data, model, and features) through a suite of linked visualizations.

3.4 Application Areas

In this section, we summarize and group the application areas of our surveyed articles, explaining each application area with examples. Application area sub-categories were derived based on a higher-level area categorization of each article and merging similar categories. Tables 1a, 1b, and 1c present the application area coding of each surveyed article.

Fig. 4.

3.4.1 Sports Analytics.

The interactive analysis and visualization of sports data are rapidly gaining popularity [30, 113]. Sports experts use analytics and visualization charts to analyze the game data and to plan for future games accordingly. Various works have been carried out to interactively analyze the images and video data related to soccer [131], table tennis [169], tennis [118], rugby [29, 31], and other sports.

Stein et al. [142] proposed an interactive system that automatically extracts and visualizes object trajectories for visual analysis of team sports matches. They focused on soccer data and extracted the players’ movements and analytics and put them into the original video. Their proposed method was implemented on GPUs for faster processing. That work helped domain experts to achieve a better analysis of such data. Wu et al. [170] presented an interactive visual analytics system called “ForVizor” to analyze dynamic changes in soccer team formation. Their front-end system consisted of multiple coordinated graphical views to show team formation flow and the changes happening (Figure 4). Perin et al. [112] proposed a tool to allow soccer analysts to analyze different phases and events of the game and communicate their insights in terms of visual stories. Seebacher et al. [131] supported the creation of spatio-temporal queries through the placement of magnets on a virtual tactic board to perform a similarity search in massive soccer datasets. Parry et al. [109] also presented a framework for video storyboarding, summarizing the main events of the game video. Although this work focused solely on snooker videos, in practice it could be applied to videos of other sports.

Legg et al. [75] designed a visual analytics system that supports a sketch-based search for rugby videos. Their system supports model visualization (based on parallel coordinates), search space visualization, search results visualization, and an interface to accept or reject results that, in turn, can improve the model by adjusting the parameter weights of an active learning model (supervised learning). Sketch-based search supports spatiotemporal attributes like motion, position, distance, trajectory, and spatial occupancy.

Wu et al. [169] proposed a visual analytics system “iTTVis” for the exploration and analysis of table tennis data. iTTVis presents visualizations that support three main perspectives: time-oriented, statistical, and tactical analysis, and also supports correlative analysis and identification of tactical patterns along with a score timeline. Similarly, various interactive visualization tools have been designed for the analysis of tennis match data [117, 118].

Fig. 5.

3.4.2 Content Synthesis and Removal.

Content synthesis and removal have applications in areas such as gaming, entertainment, and architecture. This research facilitates designers in composing their desired content by utilizing existing natural examples [8, 42, 78, 165].

Flagg and Rehg [42] introduced crowd tubes for synthesizing video-based crowds, which are constraint-satisfying video objects placed by designers in a specific place and time in a video volume with an associated trajectory. Andrea et al. [8] utilized the similarity of optical flow between the reference video and the camera paths of a given 3D scene to generate a video of the scene resembling the reference video. The similarity in optical flow is due to the similarity in camera movement and scene geometry. Li et al. [80] proposed a framework to synthesize cartoon videos by using color information from the keyframes and animation information from the sketch. The correspondence between sketch and keyframes is used to create a blended image and then uses the estimated optical flow information from the user sketch to generate interpolated video.

3.4.3 Surveillance, Activity Analysis and Recognition, Motion Analysis and Tracking.

Visual surveillance and activity analysis and recognition are active areas of research, and interactive visualization coupled with computer vision techniques can facilitate different task requirements in this field [58, 95]. MotionFlow [60] helps to understand the patterns of gestures through visualization of motion-data sequences, and also supports comparative analysis between different gestures (Figure 5). Wang et al. [160] used silhouette as a cue for 3D-pose estimation on handheld cameras using motion capture systems. Romero et al. [122] proposed a system for activity analysis based on visualizing and analyzing overhead videos.

Video visual analytics are used to interpret data from surveillance cameras. Botchen et al. [21] investigated the inclusion of action-based details in video visualization by representing the video content in 3D to depict motion events and individual objects. In the context of video enhancement, many existing methods focus on manipulating visual content in the video. Stengel et al. [143] proposed a patch-based method to refine blurry frames of input video for eye-motion prediction when being watched.

Pattern analysis of human motions has broad applications, but some challenges remain. There are different motion pattern styles, and context information may be associated with such datasets. Visual analytics can facilitate such analysis of multidimensional spatiotemporal datasets through multiple linked visualizations.

3.4.4 Video Editing, Stylization, and Painterly Animation.

Wei et al. [163] proposed a method for converting distorted fisheye videos into natural-looking video sequences that preserve temporal coherency. This method supports interactive annotations to guide the correction process, and utilizes six distinct correction criteria expressed as quadratic energy functions. Berson et al. [12] integrated a generative RNN-based framework for editing facial animations to generate facial motions to fill or replace missing segments.

Fig. 6.

Video stylization and painterly animations [61, 88, 105, 178] are emerging areas of research and have applications in areas such as movies, social media, and entertainment. Lu et al. [88] designed a real-time video stylization system that uses object-flow construction based on a novel learning-based technique, which is sufficiently robust to overcome partial occlusions, problems in optical flows, and unknown object transformations. The method also supports different painterly styles. Yoon et al. [178] preserved spatial and temporal coherency by constructing a stabilized 3D-feature flow field using a combination of a 3D Sobel operator and smoothness based on color similarities and saliency features. Kagaya et al. [61] provided a painterly rendering system that enables spatial and temporal variation of style parameters and brush stroke orientation, supporting features such as the ability to emphasize/de-emphasize certain objects, to modify contrast between neighboring objects, and to adjust the level of abstraction. Figure 6 shows an example of two stylized frames of a blooming flower.

Fig. 7.

3.4.5 Video and Image Collection Analysis.

Keeping in view the scale and complexity of image and video collection datasets, visual analytics can facilitate analysis through multiple linked visualizations and support for interactive querying and filtering. Zahálka et al. [180] introduced II-20 (Image Insight, 2020), which dynamically models analytic categorization of image collections based on user interactions. Pan et al. [107] generated visual summaries of image collections based on content diversity, conciseness, and visual aesthetics and applied a backpropagation algorithm to optimize the layout of the collage. Wu and Qu. [167] proposed an interactive visual analytics system to support multimodal analysis of TED talks focusing on presentation styles. Multimodal content consists of frame images, text and meta-data. There are three major views: projection view (for cluster analysis), comparison view (for intracluster analysis), and video view. The authors’ analysis was focused on body postures, gestures, and rhetorical aspects of presentation (Figure 3). Recent research has focused on integrating machine intelligence with visualization to understand complex and large-scale data.

Image-Set Processing Streaming is an advanced technique in image processing that uses streams consisting of either image pixels or image sequences. Image population analysis is an essential method for understanding the evolution of a population that requires extensive computational power and memory. Ha et al. [47] presented a framework to solve this extensive computational problem with heterogeneous CPU or GPU-based systems. The authors presented an out-of-core solution that performed at the same level as that of an in-core solution, providing various examples to demonstrate the efficiency of the framework.

A comparative visualization was proposed by Schmidt et al. [128]. They proposed a multi-image view technique to visualize the similarities and differences in satellite image sets. LADV [91] used a deep-learning-based framework to learn design intentions from existing exemplars (dashboard images) or sketches to synthesize dashboard templates.

3.4.6 Distance Education and Massive Open Online Courses.

Distance education programs and MOOCs platforms like Coursera, edX, and Udacity have gained significant popularity in recent years [135]. These platforms offer great flexibility in terms of timing, courses offered, and access methods. The MOOC platforms offering such courses are interested in the web access logs (click stream data) of these courses to analyze learner interactions and engagement with the course material [26]. Visual analytics tools like PeakVizor [25] (Figure 7) enable experts to gain insights that are otherwise difficult to discover from raw data. PeakVizor features include analyzing peaks (regions of interest) in clickstreams, extracting anomalies, identifying different learner groups and their correlations in different peaks, and discovering patterns, the spatio-temporal distribution of clicks, and geographical and behavioral distribution of learners. He et al. [51] designed a visual analytics system “VUSphere” for the exploration and comparative analysis of video utilization in courses, students, and distance-learning centers.

Fig. 8.

3.4.7 Interaction Supported Learning.

Visual analytics-based methods can provide insights into classifier performance and facilitate model manipulation by interactively adjusting data labels and retraining. Höferlin et al. [57] proposed an inter-active learning-based framework that supports interactive data querying and selection, annotating data instances, iterative classifier refinement, model visualization and direct manipulation, and visual analysis of classifier performance through cascaded scatterplots (Figure 8). Their results showed that, in certain instances, this form of inter-active learning can help achieve classifier performances comparable to other learning methods within a few cycles. Work by Huang et al. [55] supports the interactive analysis and understanding of multiple attributes learning models for x-ray scattering images, by visual exploration in embedding spaces defined on multiple criteria.

3.4.8 Video Storyboard and Summarization.

Summarizing a large video to allow a user to quickly see the important events contained within it is another important application area. Researchers have also designed techniques to play non-important parts of a video at a faster rate [58, 73] or to add spatial context in the video for quick analysis. Wang et al. [158] conducted user studies to show that adding spatial context to a video helps participants to better understand it. Flagg and Rehg [42] presented a system to synthesize a crowd from the input video of natural crowds. AutoClips [136] generates videos automatically based on given data facts, utilizing a fact-driven clip library and an algorithm that selects clips, arranges them, and configures duration.

Sunkavalli et al. [148] presented a framework to generate a high-quality snapshot from a video clip along with a visual summary of the activities in the video (Figure 9). Meghdadi and Irani [95] proposed a video visual analytic system “sViSIT” to allow users to interactively search and track objects in a video. Their system automatically extracts all paths of an objects’ movements and allows them to be visualized in different views and forms. Users can query and retrieve any data from the video. Botchen et al. [21] proposed a technique to detect events and activities in a video dynamically by converting it into a series of snapshots. Parry et al. [109] also proposed a video summarization system called video storyboard, which is a summarized video with important frames and activities enhanced by illustrative annotations. Perin et al. [112] demonstrated a tool called “SoccerStories” to allow soccer experts to interactively analyze quantitative game data with game context such as player positions, player actions, and player movements. Their tool also helps to effectively communicate the revealed insights. Shu et al. [138] conducted an exploratory user study to understand the impact of different data-GIF designs on storytelling and provide guidelines for effective designs.

Fig. 9.

3.4.9 Augmented and Virtual Reality and Telepresence.

Recently, augmented and virtual reality environments have become more common for analyzing complex image and video visualizations. Head-mounted display devices are inexpensive and provide an immersive experience. Debarnardis et al. [35] evaluated various specifications of text visualizations on head mounted augmented reality displays. Serrano et al. [134] presented a technique for real-time playback of 360-degree videos in virtual reality headsets by adding parallax. The evaluation of their technique showed that the technique improves a users viewing experience. Turban et al. [151] proposed an algorithm (referred as Extrafoveal) based on spatial prediction, pyramidal computation, and human vision characteristics for peripheral extension of existing video (movie) content to improve the immersive experience.

Decomposing a video to augment the information contained within it has also been evaluated in the domain. Meka et al. [96] introduced a novel real-time method for the interactive intrinsic decomposition of scenes. Users can interactively improve the decomposition by using a mouse or through touch. The touch interaction also allows the user to place decomposition constraints directly in the 3D space. The authors’ method supports a wide variety of interactive applications, such as photorealistic recoloring, material editing, and geometry-based relighting (Figure 10). The presented method is also the foundation for many augmented reality applications. Lin et al. [83] presented a video retargeting method where 3D space-time objects are transformed by as-rigid-as-possible warping, whereas non-significant objects undergo linear rescaling. This method results in better motion and shape preservation compared to other state-of-the-art methods.

Telepresence is a technique that allows people to remotely visit and interact with other people in distant locations. Zhang et al. [183] presented a 360-degree video camera-based redirected walking robotic platform to support interaction and exploration of remote environments. The robotic platform was controlled remotely by the user wearing a head mounted display.

3.4.10 Video Stereolization.

Video Stereolization is a technique that converts a video to stereoscopic video for 3D viewing. Various algorithms and techniques are designed to capture motion and depth information in the videos to convert them into stereoscopic videos. Liao et al. [81] presented a system that combines motion analysis of a video with user interactions for conversion to stereoscopic video. Liu et al. [86] proposed a three-stage (point cloud extraction, merging, and meshing) multiview stereo algorithm based on point-clouds to generate free-viewpoint videos. Their point-cloud extraction is resistant to occlusions, noise, and lack of texture. Serrano et al. [134] also introduced a method for displaying a 360-degree video in virtual reality head mounted displays.

Some research has also been carried out to combine various projection devices and displays/recordings. Pjanic et al. [115] demonstrated a calibration method to mix different projection display devices. Their work was focused on displaying content accurately on a 3D surface.

Fig. 10.

3.4.11 Image Analysis, Editing, Summarization, and Matching.

Nearest patch matching techniques have recently emerged as a powerful tool for image and video matching, editing, and summarization, and are based on finding the most similar patch pairs between a source and target image [171]. Tan et al. [149] presented a technique to determine the distribution of paint pigment from an RGB image. This allowed users to make image editing operations in pigment space, and to perform operations such as edge enhancement, tonal adjustment, and recolor. Poco et al. [116] also introduced a technique for extracting color encodings from the bitmap images. Users can interactively verify the colors and correct them, if needed. Chartem [43] enables embedding additional information into chart images without having an impact on perception to facilitate reuse or repurposing. Chen et al. [27] proposed a framework based on R-CNN to recover 3D-shaped generalized cuboids and cylinders from a single photograph. Flyfusion [175] introduced a topology compactness strategy for the robust reconstruction of topology changes while reconstructing dynamic scenes using flying depth cameras.

Visualization of complex 3D models and scenes is difficult as the rendering of such scenes is often computationally expensive, and adding interactive navigation and analysis is challenging. Sunkavalli et al. [148] proposed a method for extracting a snapshot from a video, using an importance-based system to generate images using weighted values of image pixels. Presentation of 3D models using videos has also been proposed by Baldacci et al. [8]. This system provides many operations for a user, such as noise and blur reduction, super-resolution, and best focus selection, and also provides a visual summary of activities. Recent research has focused on optimizing input parameters of image analysis algorithms. Pretorius et al. [119] used a parameter optimization process coupled with key user requirements, and then developed a tool where users can examine the relationship between output and parameter values.

3.4.12 Medical Applications.

3D medical imaging has inspired the rapid development of visualization techniques for 3D medical image analysis [16, 44, 53, 130]. Chan et al. [24] presented a visual analytics system called “motion browser” that takes heterogeneous sensors and video-based patient hand therapy data as input, and allows users to explore the data interactively. The system comprises multiple coordinated views to allow physicians to compare and explore patient therapy data from multiple sources. The user can annotate the video data and compare it with sensors data (Figure 11). GUCCI [97] provides a suite of visualizations to compare and analyze blood flow data in the aorta of selected cohorts, helping to establish normal value ranges and derive guidelines.

In the field of medical research, many gaze analysis studies have been conducted to understand how radiologists read various types of medical images [77]. Song et al. proposed a tool called “GazeDiagnosis (GazeDx)”, which is an interactive visual analytics framework to compare gaze datasets from multiple users working with image datasets [141]. The CMed [108] system supports interactive exploration of crowdsourced medical image data annotations with the support for interactive querying, and analysis from different aspects.

Fig. 11.

3.5 Data Types, Scale, and Dimensions

Image and video datasets used in practice contain not only the visual content that forms part of the raw images and videos, but also information derived from such datasets. These datasets have diverse characteristics; they are multi-dimensional, hypervariate, spatial and temporal, heterogeneous, hierarchical, augmented, network, and multi-resolution, among others. Table 3 shows different datasets used in computer vision and visualization research focused on image and video datasets, respectively. The nature and characteristics of these datasets vary based on the relevant application areas and underlying task requirements supported in the implementation. This variety in the nature of datasets poses unique visualization challenges.

The datasets may be extended further at different stages of the processing pipeline due to additional data generated by algorithms and techniques involved in the corresponding implementations. Xie et al. [172] extracted semantic information from images and used a deep-learning framework based on CNN and LSTM to generate their descriptions. Bryan et al. [23] generated annotations to produce temporal summaries for time-varying datasets. DataClips [6] enabled interactive creation of data videos using existing data clips.

3.6 Task Requirements

We reviewed the visualization and visual analytics research related to image and video data to identify the major task requirements, and then grouped these based on the application areas. We also reviewed the recent major computer vision conferences and identified application areas relevant to image and video data analysis. This facilitated the comparison of task requirements and trends in the computer vision research and the visual analytics research related to image and video datasets.

Table 2 summarizes the major task requirements, organized into different application areas, based on the surveyed articles in the visualization domain. We identified representative higher-level task requirements (Column 3 : Task Requirements (Visualization)) aiming at explaining the needs of visualization researchers analyzing image and video datasets in the respective application areas. These task requirements not only provide an overview of the current research efforts in those application areas, but are also indicative of the volume of the research work conducted in those areas. We have also included references to selected articles relevant to different application areas in the visualization domain containing instances of these task requirements. The presentation of task requirements and application areas included in this table is not following any specific ordering strategy.

We have mentioned corresponding application areas in the computer vision domain (Column 4) by reviewing recent CVPR and ICCV program books, and those identified based on discussions with computer vision domain scientists (Column 5) (Section 4). The identification of these application areas provides us with an overview to draw comparisons between two domains, but it is certainly not an exhaustive list of application areas. This analysis helped us identify areas where there is an opportunity for collaborative future research efforts.

Application areas like “Video Editing, Stylization and Painterly Animation” have no direct overlap between the visualization and computer vision domains. In some areas, there is a partial overlap; for instance, the task requirements of visualization in “Medical Applications” include the need for analyzing heterogeneous datasets and interactive analysis of medical datasets, whereas there is more focus on annotation, localization, and segmentation in the task requirements of the computer vision domain.

The comparison of Computer Vision domain experts’ application areas and the analysis of corresponding visualization and visual analytics task requirements shows that the data preprocessing task requirements, as mentioned by domain experts, are generally applicable to multiple areas. Also, as expected, the computer vision domain experts’ application areas match more closely to the computer vision domain application areas (Table 2).

Table 2.

There are also multiple application areas that overlap in terms of similarity in task requirements in the visualization domain and those identified based on discussions with computer vision domain experts. However, areas like “Deep Learning”, “Pattern Analysis, Anomalies”, and “3D Modeling and Reconstruction” share more commonalities with the task requirements of the corresponding visualization application areas. Yet, the visualization domain task requirements are more centered toward interaction support, whereas the domain expert task requirements are more abstract and computation focused.

There are commonalities in task requirements of different application areas in the visualization domain. “Sports Analytics”, “Surveillance, Activity Analysis and Recognition, Motion Analysis and Tracking”, and “Interaction Supported Learning” have many overlaps, such as spatiotemporal pattern analysis, correlative analysis, summarization/aggregation, events and activities analysis, support for annotations, and multilevel/multifaceted search. Similarly, task requirements in “Image Analysis, Editing, Summarization and Matching”, “Video and Image Collection Analysis”, and “Augmented and Virtual Reality, and Telepresence” areas also share similarities.

Table 3.

3.7 Tools and Libraries

Table 3 provides an overview of the tools and datasets commonly used for the visualization of image and video data in the visualization domain. In the table, we group the tools and datasets used in the surveyed articles (from the major visualization conferences and journals) according to application areas defined in Section 3.4. We have provided a general overview rather than an exhaustive list of tools and datasets in each application area. We also shaded each application area of the table based on the visualization support available for that area, and categorized them into different classes based on the visualization support level. The support level represents the utilization and availability of interactive visualization tools in that area. We are not following any specific ordering strategy for application areas and tools in this table.

Based on the table, we observe that various standard libraries, such as D3, Angular, Node.js, Vue.js, and JQuery, are used for visualization and visual analytics requirements. Libraries such as Tensorflow and scikit-learn are used for machine-learning tasks. For image and video processing, OpenCV and Matlab are extensively used. For the GPU implementation of algorithms and techniques, OpenGL and Cuda are used. There are also some articles that use custom tools or have not mentioned the tools used; we labeled those articles as custom.

The color shading in Table 3 shows that the areas of Sports Analytics and Video and Image Collection Analysis have better interactive visualization tools support, followed by medical applications. On the other hand, areas like Content Synthesis and Removal, Surveillance, and Activity Analysis are not well supported.

We noticed that researchers mostly used application-oriented image and video datasets [42, 88, 148, 160, 170]. There were also a few datasets based on multiple sources [24].

4 Discussion with Domain Experts

The analysis of the visualization research publications related to image and video datasets helped us understand the current research trends in this area. It provided insights into the data and task requirements relevant to different application areas based on different taxonomies. We conducted discussions with computer domain experts to enhance our previous findings based on reviewing visualization research and to obtain a perspective from computer vision domain researchers. This helped us understand how these domain experts use visualization tools in image and video data analysis tasks and provided insights into their needs and requirements, as computer vision domain experts widely use image and video datasets. It also allowed us to understand their current research workflows, identify requirements related to computer vision research problems, learn about the challenges and limitations in their research, and recognize areas in which visual analytics can help advance computer vision research.

As we conducted discussions with only five domain experts, we do not claim that these findings are comprehensive or exhaustive; however, these discussions augment our previous findings, based on a review of visualization research, and help us to collect valuable feedback. These discussions also provided us with an overview of the datasets, visualization tools, algorithms and techniques, APIs and libraries, and computing infrastructure utilized in their research. The computer vision domain experts have at least six years of experience and are scientists working on different professional levels. Their research problems relate to 3D computer vision, 3D modeling and reconstruction, camera calibration and localization, scene interpretation, visual surveillance, activity analysis and recognition, autonomous and self-driving cars, and image classification and manipulation.

These discussions were mainly qualitative, and we took notes of the major findings. Two of the authors of this survey article conducted the discussions, with one predominantly asking questions and the other taking notes. Each discussion was approximately one hour long. Most of the questions were open-ended, and the emphasis was on finding the pain points of using visualization tools for image and video datasets in their respective research workflows. We discussed details of the experts’ application areas, their datasets, preprocessing tasks, major algorithms and techniques currently used in their research, the identification of visualization-related requirements, their current use of visualization tools in their workflows, and the APIs and frameworks utilized in their implementations.

In the initial phase, we discussed the details of the datasets currently used in the experts’ research and any general preprocessing steps involved when working with such datasets. The sizes of these datasets were mainly on the scale of hundreds of gigabytes (GBs). Some example datasets currently used by the experts include S3DIS, KITTI, 3D Point Cloud Datasets, Activity Net, Total Human Model for Safety (THUMS), YouTube Videos, ImageNet, VarCity, and Oxford RobotCar. Table 3 contains references to commonly used datasets in different application areas of visualization domain. While discussing the preprocessing stage of their work, most interviewees mentioned the difficulties involved in data normalization, cleaning, adjusting features, and exploring the overall characteristics of such datasets. They stated a need for interactive visualization tools that can both summarize and aggregate these datasets, but also enable them to make adjustments. One of the domains experts mentioned in the interview: “A visual analytics tool that can interactively provide insights about the data to facilitate preprocessing, cleaning, anomalies detection, and so on would be useful for the analysis.” Based on these discussions, we identified the following major requirements related to the preprocessing stage, which are focused on preparing the data for analysis in the later stages:

Data Preprocessing.

Provide overview of the dataset, with support for denoising, data cleaning, and outliers detection

Support for data normalization, transformation, standardization, and encoding

Adjustment in features that target all items in the dataset, such as image resizing, cropping, and rotations

Support for data augmentation and sampling methods

Almost all of the experts mentioned the prevalence and utilization of deep-learning methods in their research. One of the domain experts stated that more than 90 percent of their current research work in the domain of computer vision uses some form of deep learning. Major algorithms or techniques used by these domain experts include CNN, GCN, LSTM, GAN, PCA, and pooling methods (reducing data-dimensions). Based on our discussions, we identified the following major requirements grouped into different areas:

Support for Semantic Analysis.

Support for semantic augmentation

Semantic understanding in 3D utilizing semantic classes

Segmentation and Classification.

Annotation support in object segmentation and classification

Labeling points in the point clouds to represent their relationship with the objects

Scene Interpretation, Activity Recognition, and Tracking.

Activity analysis and recognition

R10

Trajectory analysis of moving objects

R11

Illumination changes in images and video datasets

3D Modeling and Reconstruction.

R12

Conversion of video sequences to 3D models

R13

Camera localization and 3D reconstruction

Deep Learning.

R14

Modification of internal structure and architecture of deep learning models

R15

Learn use of algorithms for certain static and dynamic scene scenarios

R16

Increase training samples

In the last phase of the discussion, we mainly focused on how interactive visualizations can support the experts’ analysis tasks. They mentioned that using interactive visualizations could help them understand the characteristics of the datasets they utilize in their work and can provide valuable insights that may help fine-tune the deep-learning models used. They expressed the desire to have interactive visualizations that can provide them with an overview of the datasets, with the support to explore details of interesting subsets of the entire dataset (overview and details-on-demand). One of the experts mentioned: “Instead of having a black-box approach where the internals of the algorithms and methods are not visible to the user of the system, a visual analytics system that opens up the black-box and provides insights into the internals of the model could help build trust and confidence in the results of the system.” Opening the black-box strategy can help understand the internals of the models and changes in patterns at different stages of learning [92, 101].

They currently use Visualization Toolkit (VTK) [129], PyVista [147], Meshlab [32], Matplotlib [56], Bokeh [15], TensorBoard [4], TensorFlow Lucid [2], LSTMVis [145], DarkSight [174], Facets [1], GANDissect [10], and NN-SVG [76] to address their visualization needs. Based on our discussion, we identified the following major requirements related to visualization support in their research workflows:

Large-Scale Data Visualization.

R17

Support for visualizing large-scale datasets

R18

Support for data aggregation and summarization

R19

Visualization framework that supports providing overview and details on demand

R20

Support for data navigation, multidimensional querying, spatial and temporal filtering

Pattern Analysis, Anomalies.

R21

Explore and highlight patterns and anomalies in the data

R22

Separation of classes and support for projection

R23

Insights into the deep network architecture, loss, activation filters, and convergence

R24

Support to avoid the manual task of labeling the datasets, noise removal, and adjustments in data items

Multiple Scenarios, Uncertainty.

R25

Uncertainty quantification in visualizations

R26

Visualization of an ensemble of images (e.g., multiple reconstruction scenes)

5 Discussion and Future Directions

Collaborative research efforts at the intersection of visual analytics and computer vision can open new avenues of research and further advance the state of the art in both of these areas. In this survey, we reviewed the research in visual analytics and visualization conferences and journals related to image and video datasets. We also obtained feedback from computer vision domain experts. We found that dealing with image and video data presents a multifaceted and unique set of challenges. These include adapting and scaling the visualization libraries to process large data; supporting interactive designs with support for multiple view visualizations; querying and filtering; sampling, summarization, and aggregation; adapting deep-learning libraries for visualization tasks; and visualization-specific benchmarks. Below, some of our findings and the existing challenges identified based on this survey are discussed. We also highlight potential future research directions.

5.1 Scalability Issues

In the computer vision domain, there is a large focus on using deep-learning techniques; in fact, more than 90 percent of current research uses these techniques, and domain experts also highlighted this point in our discussions. There are certain challenges associated with advancing the field of visual analytics to bring it in line with computer vision, when it comes to incorporating deep-learning. The tools and libraries in the machine-learning and visual analytics domains are also strongly focused on different domain-specific tasks [40], and there is a need for more collaborative efforts to design libraries and tools so that they can address the needs of domain experts from both fields.

In computer vision, deep learning is often used as a black-box, with little focus on providing insights into the internals of this black-box. On the other hand, visual analytics research intends to open up this box. To be able to trust the outcomes based on deep-learning techniques, users need insights into the decision-making process within this black box, and the rationale behind the outcomes. This builds trust on the outcomes, which is especially important if deep-learning techniques are used in applications of critical nature such as medical applications, public policy making, and law enforcement. There are also limitations associated with the availability of visualization libraries and frameworks that can enable this type of access to deep-learning techniques.

The scale of datasets used in computer vision research is extremely large (usually Gigabytes or Terabytes), and enabling visual analytics for such datasets is challenging as it may also involve integration of big-data frameworks. To provide access to data at multiple levels of detail (e.g., data, model, and features) there is an even greater need for big data frameworks to support interactive exploration.

In visualization research, there is a need to design advanced data processing frameworks that can facilitate visualization at multiple scales and granularity levels. The libraries and frameworks need to adapt and scale up to handle the exponential growth in the size of datasets. This also gives rise to unique issues of data sampling, summarization, querying, interactivity, transformation, and so on. Also, with advancements in high-performance computing technology, the use of big data frameworks and the availability of better visualization libraries for deep learning will bring more focus to the use of these techniques in visual analytics.

5.2 Insights from Surveyed Articles

Tables 1a, 1b, and 1c show the coding results based on the taxonomies of techniques and algorithms, visualization techniques, and application areas. This table can help domain researchers to understand the current trends and focus of research efforts relevant to image and video datasets. It also helps identify areas for future collaborative research. For example, areas relevant to medical applications, image and video analysis, and editing have relatively greater coverage. Below, we discuss more findings and insights based on these tables.

Application Areas.

Focusing on the coding tables of visualization and visual analytics articles in Tables 1a, 1b, and 1c, we observe that there has been a greater focus on “Image Analysis, Editing, Summarization, and Matching” compared to other areas. In the “Surveillance, Activity Analysis and Recognition, Motion Analysis and Tracking” application area, there are a very limited number of articles in comparison with the number of articles on current trends in computer vision research, where this application area is one of the actively explored research areas.

Furthermore, there was rapid growth in “Medical Applications” publications in 2021. If we look at the “techniques and algorithms” utilized in these articles, there is an increasing use of deep-learning-based techniques. In medical domain research publications, providing support for visual analysis of individual models or components at different levels of detail is important in order to build confidence in the analysis results. The support for comprehensive interactive visual querying, filtering, and visualizaton at different abstraction levels is a growing need in this area. This helps users explore and understand the rationale behind certain outcomes useful in CAD [37] and medical studies [69, 97]. In “Video Storyboard and Summarization”, there have been very few articles published since 2013.

Automated Techniques and Algorithms.

In the techniques and algorithms taxonomy (Table 1a, 1b, and 1c), we observe that there was less focus on machine-learning techniques before 2017, even though, by that time, deep learning was one of the most active research areas in computer vision. This trend may be due to the limited availability of visualization libraries related to deep learning, although general purpose libraries for deep learning were available. There is a strong separation in terms of tools and libraries used in both domains [40], but in the future there will be a pressing need for interactive visualization libraries with strong support for deep-learning frameworks due to the increasing popularity of machine learning. In recent years, the number of machine-learning related articles in visualization research has increased.

In Table 1a, 1b, and 1c, the “Other/Custom” category is prevalent in algorithms and techniques taxonomy. In the “Other/Custom” category, the techniques are mostly computer vision related; this category has the most publications, because the articles are related to image and video datasets. There is also an increasing trend in the “Clustering” and “Correlation” category. Deep learning-based frameworks may not be the only solution for image and video data analysis. As seen from our coding table, other/custom techniques are also prevalent, which shows that researchers design many custom or hybrid solutions to tackle image and video datasets. Although the recent trend shows extensive use of standard deep-learning techniques, opening the black-box is one of the pressing needs, as it not only helps explain the decisions but also enables domain experts to integrate domain knowledge during the analysis phase. This also explains the use of semi-automated approaches, extended basic models or hybrid architectures in certain cases.

Visualization Techniques.

In visualization techniques taxonomy (Table 1a, 1b, and 1c), 2D techniques are the most commonly used because image and video datasets are often 2D; thus, the visualizations designed for visualizing such datasets, and associated analytics, are also mostly 2D. The video and image datasets and information derived from such datasets are multivariate and complex in nature, so multiple linked views are often used to support their analysis. 3D is consistently being used as a visualization technique; however, its future use is projected to be more a result of advancements in immersive technologies [93]. In the surveyed articles, there is generally less focus on provided interactive support on model explorations and understanding. In the future, there is scope for providing better support for these kinds of interactions.

There is an also increasing trend of using more linked visualizations in a single dashboard. The datasets utilized in such scenarios are multi-dimensional or multi-attribute in nature. The nature of interactive explorations and interactions supported are also dependant on the underlying modeling frameworks utilized in the implementations. For example, deep-learning-based architectures may provide information at different levels (data, model, and features) and the supporting interactive exploration, analysis, and visualization of such information has unique challenges.

Immersive Analytics.

There is an increasing trend in the use of immersive environments for analytics [9, 93]. Immersive environments use 3D visualizations and have applications in areas like medical imaging and gaming. Due to advancements in computer vision in areas such as stereoscopic vision, 3D scene modeling and reconstruction, and imaging geometry, the role of immersive environments in supporting analytics will become more eminent, thus presenting an opportunity for future work in visual analytics. During our discussions with domain experts, an expert in 3D reconstruction emphasized the need for virtual reality and immersive environments (providing six degrees of freedom) in analytics applications.

Tools and Libraries.

The tools and libraries that are used in computer vision research are mostly general purpose libraries that address computer vision task requirements. Their configurations are often complicated and require deep-learning domain expertise. On the other hand, there is limited support for visualization libraries that provide the desired functionality and interfaces to access the deep-learning frameworks. There is a need for collaborative efforts at the intersection of these two areas for the development of tools and libraries that would provide support for visualization tools to be integrated within deep-learning frameworks. With the advancements in the deep-learning domain, high-level usage tools for deep learning, such as extraction of geometry, path, and object, are becoming more common; this will also facilitate advancements in visualization research focused on image and video datasets.

If we compare the tools and libraries in computer vision tools and visualization research (Table 3), the computer vision tools are mostly focused on deep learning, whereas visualization uses a combination of the tools from both domains (e.g., TensorFlow, PyTorch, D3, and Node.js).

Most visualization tools are designed to solve a particular problem or set of tasks, resulting in tools that are highly customized and not inter-operable. There is a need for visualization frameworks where multiple tools can be used to solve challenging problems. There are many application areas, and each has a unique set of requirements and challenges. Algorithms and techniques utilized in these areas also vary considerably. Studies should be conducted that provide clear guidelines for problem or application area-focused general frameworks that can also leverage the developments in other areas, such as technological advancements in deep-learning frameworks in computer vision.

Datasets.

In computer vision research, there is a focus on creating benchmark datasets. In our discussion with domain experts, they also mentioned that their focus is more on the preprocessing stage or the stage after the model training is complete in their typical research workflows dealing with dataset; their focus is not on providing visual analytics support. We also observed this when we assessed the computer vision articles included in this survey. In the visualization and visual analytics research, the datasets are more application oriented (Table 3), and they are usually in a form that enables interactive linked visualizations and utilizes the derived data, in addition to the image and video datasets.

The datasets usually used in the surveyed articles in the visualization domain are mostly derived from image and video in a semi-supervised manner. In the future, due to advancements in machine-learning techniques, we foresee that there will be more focus on the unsupervised generation of these datasets, resulting in an increase in the data scale, and an even higher demand to visualize this data.

Most datasets in our surveyed articles were either collected or generated by the authors, and are not standard datasets publicly available (demonstrated clearly in Table 3). In the computer vision domain, it is common to use standard data sets, which are publicly available; in the future, some work must be done to make various standard datasets publicly available for bench-marking and comparison.

Evaluation.

We reviewed the evaluation methods used in the surveyed articles. One common method was to evaluate the performance of the algorithms and techniques used [75, 88, 183]. In some articles, studies with domain experts were conducted to evaluate the visualization techniques [67, 142, 170]. Quantitative and qualitative evaluations were also conducted with real users [58, 151, 167]. Overall, there was less focus on evaluating the design tool in terms of perception and cognition of computer vision domain scientists. In the future, more studies are needed to evaluate how interactive visualization can better cater to the needs of computer vision domain experts.

Benchmark Tasks and Studies.

Benchmark tasks are quite common in the computer vision domain for comparing any model with the state-of-the-art model on standard tasks such as data processing, querying, and inference. There are no benchmark tasks available in the visualization domain for the use of machine learning or deep learning models. This presents an opportunity to design benchmark tasks for the benchmarking of machine learning models used in the visualization domain.

Furthermore, no studies exist to provide guidelines for the designer about users’ perception and cognition at the intersection of the computer vision and visualization domains. There is an opportunity to design such guidelines, that will help in designing better tools for image and video data analysis to improve ease of use and understanding for users.

5.3 ML Models not Designed for Visualization Domain

Machine learning models used in visualization applications are usually designed for computer vision problems. They are not tailored toward visualization use. More work is needed to effectively tailor these models to be used for visualization and visual analytics applications. In usual integration scenarios, machine learning models are considered as a black-box. On the other hand, visual analytics models are interactive with a focus on the human in the loop. The “expert in the loop” systems are designed to utilize human domain knowledge in the interactive analysis phase [69]. The routine tasks or automated approaches can be configured to trigger as a result of specific selections or interactions. Future research should evaluate how supervised and unsupervised models can be optimized by adding “human in the loop” and to interactively steer the automated algorithms.

Understanding the model behavior and performance across a wide range of input data and scenarios is important before integration into visual analytics systems. Tools like What-if [166] enable model understanding through interactive probing with minimal coding. These are open challenges and present opportunities for collaborative efforts to address these concerns. Understanding the transfer learning process while adapting models for new inputs [92] is particularly important, especially in the context of “expert in the loop” in visual analytics systems. There has been limited work in this area of visual analytics research focused on image and video datasets.

Semi-supervised and Reinforcement learning models were rare in our surveyed articles. Again, these models are not tailored toward visualization use, and there is scope for collaborative efforts to optimize their design taking into consideration the requirements of visual analytics. Other more complex and advanced machine learning techniques, such as federated learning and transform learning, are rarely used for image and video data, but we foresee this to change in the future.

5.4 Insights from Discussions with Domain Experts

During our discussions with computer vision domain experts, almost all of the experts emphasized that there is a need for visualization tools that can support interactive exploration of datasets with features like “Overview + details-on-demand”. This would help them better understand their data and apply any normalization or standardization techniques to improve the data quality before they move on to the training phase (data preprocessing). They also mentioned the limited availability of visualization tools that provide information about the internals of deep-learning networks. Currently, they mostly use it as a black-box, without gaining insights about what is going on inside the box. They also mentioned the need for interactive tools that provide information about the convergence of models during the training phase, support for interactively tuning parameters, and interactive visualization guided optimizations.

6 Limitations

Here, we discuss some of the limitations of our work. Discussions with computer vision domain experts enabled us to gather valuable insights and more practical details involved in deep-learning based implementations relevant to image and video datasets. These discussions were not exhaustive, as they were restricted to only five domain experts in computer vision. Information on the datasets, processes, algorithms and techniques, computing infrastructure, and details about the implementation and technical difficulties were occasionally not directly available from the articles. However, the discussions helped us gather this information from the domain experts.

While searching for visualization and visual analytics articles, we only focused on major visualization related conferences. There may be further relevant articles in some other related areas such as big data, high-performance computing, machine learning, and parallel computing, but each of these areas have technical issues of their own, and their inclusion would have excessively expanded the scope of our work.

In the future, we plan to build an interactive recommendation tool to explore articles, and related details included in this survey, and provide recommendations in terms of visualization techniques, automated algorithms, tools and datasets, while keeping in view the task requirements.

7 Conclusion

We presented a comprehensive survey of the state of the art in visualization and visual analytics research related to image and video datasets. We described recent advances at the intersection of computer vision and visualization to facilitate areas of visualization research, including different visualization tools, techniques, and solutions across different application areas. Overall, this also helped us to identify gaps and opportunities for future collaborative research. Discussions with experts working in the computer vision domain allowed us to identify the requirements for interactive visualizations of image and video datasets. After collection of relevant articles from the major visualization conferences and journals, we categorized the algorithms, visualization techniques, application areas, and task requirements used in these articles based on standard taxonomies. We summarized information gathered from publications in the visualization domain along with discussions with computer vision domain scientists into a systematic tabular format that facilitates comparisons and finding opportunities for future research. Lastly, we provided a detailed discussion about the insights found based on our survey results. We also presented current gaps in research, and associated limitations and challenges.

Acknowledgments

We also thank the KAUST Visualization Core Lab for their help and support.

References

[1]

2020. Facets: An Open Source Visualization Tool for Machine Learning Training Data. Retrieved September 17, 2020 from https://github.com/pair-code/facets.