Set function learning has emerged as a crucial area in machine learning, addressing the challenge of modeling functions that take sets as inputs. Unlike traditional machine learning that involves fixed-size input vectors where the order of features matters, set function learning demands methods that are invariant to permutations of the input set, presenting a unique and complex problem. This survey provides a comprehensive overview of current developments in set function learning, covering foundational theories, key methodologies, and diverse applications. We categorize and discuss existing approaches, focusing on deep learning approaches, such as DeepSets and Set Transformer-based methods, as well as other notable alternative methods beyond deep learning, offering a complete view of current models. We also introduce various applications and relevant datasets, such as point cloud processing and multi-label classification, highlighting the significant progress achieved by set function learning methods in these domains. Finally, we conclude by summarizing the current state of set function learning approaches and identifying promising future research directions, aiming to guide and inspire further advancements in this rapidly developing field.
1 Introduction
Set function learning is an emerging and rapidly developing field within machine learning [122], focusing on learning functions defined on set-structured data. In contrast to conventional learning paradigms where the order of input data significantly affects the learning process, set function learning methods are characterized by their invariance to permutations of input elements [64, 140]. This fundamental property makes them particularly effective for tasks involving unordered data. Conventional models, such as convolutional neural networks (CNNs) [59] and recurrent neural networks (RNNs) [89], have achieved significant success in tasks such as time-series analysis and natural language processing [26, 108, 111, 118], where preserving the order of input data is essential for capturing the underlying structure. However, many real-world applications involve learning from inherently unordered sets [3, 95], where conventional methods struggle because they rely heavily on input order. For example, in point cloud analysis for three-dimensional (3D) object recognition and reconstruction, the individual points representing an object’s surface are inherently unordered [96, 133, 152]. Traditional methods often require extensive preprocessing, resulting in inefficiencies or data structure distortion. Similarly, in multi-label classification, where a single instance is associated with multiple labels [105, 135, 136], treating these labels as a set is more appropriate than using traditional approaches like binary relevance. Binary relevance treats each label independently, failing to capture complex interdependencies between labels, while set function learning models can effectively capture the underlying structure of unordered data, leading to more accurate and robust predictions.
There is increasing literature proposing novel methods that are capable of handling set-structured data, opening new avenues for machine learning and set-based learning problems [83, 122]. For instance, DeepSets [140] introduces a framework for learning permutation-invariant functions, ensuring that the outputs remain unchanged regardless of the order of input elements. PointNet [95] revolutionizes point cloud processing by directly dealing with raw point sets without requiring voxelization or other preprocessing steps, simplifying the workflow and preserving data fidelity. Set Transformer [64] leverages attention mechanisms to capture complex dependencies among set elements, enhancing the model’s ability to understand intricate relationships within the data. These pioneering works have shown the significant potential of neural networks to effectively model and learn from set-structured data, leading to substantial advancements in various domains such as point cloud processing [80, 103] and recommendation systems [38, 67].
Despite the promising advancements, set function learning faces several unique challenges. A fundamental challenge is ensuring permutation-invariance [123], as the output of a set function learning model should remain unchanged regardless of the order of set elements. Another critical challenge is scalability [86], as set function learning models should be capable of handling inputs ranging from small to large sets, often with varying sizes across different instances. This variability demands models that are flexible enough to adapt to sets with arbitrary cardinality while maintaining consistent performance. Additionally, the combinatorial nature of sets leads to significant computational challenges. As the size of the ground set increases, the number of possible subsets grows exponentially, making exhaustive approaches infeasible for large-scale problems. The highly non-linear and interdependent relationships between set elements further complicate the learning process [64], requiring models that are expressive enough to capture these complex dependencies without becoming computationally intractable. Balancing these requirements while learning effectively from limited data remains a central challenge. Indeed, addressing these interconnected challenges demands specialized model architectures and innovative learning algorithms.
Given the rapidly growing interest in set function learning, we present a comprehensive survey of this promising area, providing researchers with insights into the state-of-the-art advancements. We review breakthrough papers and recent advancements, covering both theoretical foundations and practical implementations. While Kimura et al. [57] conduct a literature review of permutation-invariant neural networks, their work focuses only on a few typical methods and lacks a discussion of various applications. In contrast, as one of the very first surveys on set function learning, our work serves as a reference for anyone seeking to understand, apply, or advance this field, making several significant contributions. It provides a unified framework for understanding and categorizing diverse approaches to set function learning. The introduction of foundational theories allows interested readers to quickly capture basic concepts and engage in this area. Additionally, the systematic view of the strengths and limitations of different methodologies helps readers select the most appropriate approaches for specific tasks. The extensive discussion of applications across multiple domains underscores the broad impact and potential of set function learning methods, encouraging their adoption in new areas. Furthermore, the introduction of various datasets serves as a valuable resource for set function learning research. Finally, by identifying the challenges and future directions, we offer valuable insights for the research community, potentially inspiring new research ideas and accelerating progress in this significant area.
The rest of this survey is organized as follows: In Section 2, we formally introduce the problem of set function learning and related basic concepts. Section 3 discusses various deep learning methods for solving set function learning problems, while other approaches are covered in Section 4. The reviewed set function learning methods are summarized in Table 1. Section 5 describes various applications of set function learning models across different domains and introduces relevant datasets. Finally, we conclude and discuss future directions in Section 6.
2 Preliminaries
A set function is a function defined on sets and is particularly relevant to machine learning tasks that deal with set-structured data, such as point clouds [95, 96, 133], molecular structures [29, 31], and any other unordered collections of elements [37, 104, 141]. We begin this section by introducing the definition of a set function.
Definition 2.1 (Set Function).
For two sets \(X,Y\), a set function is defined as a mapping from \(2^{X}\) to Y, i.e., \(f:2^{X}\rightarrow Y\), where Y is the response range and can be any set, such as the set of scalars, vectors, sets, and more complex structures.
There are many common set functions and we provide some examples as follows:
Example 2.1 (Point Set Function [95, 150]).
In point cloud classification, a point cloud is a set of points in 3D space, where each point is represented by its coordinates \((x,y,z) \in \mathbb {R}^3\). The objective is to learn a point set function that can predict the label associated with the input point cloud. Formally, this point set function can be formulated as \(h(\lbrace (x_i,y_i,z_i)\rbrace _{i=1}^n)=l\), where \(\lbrace (x_i,y_i,z_i)\rbrace _{i=1}^n\) denotes the set of coordinates representing the point cloud and l is the corresponding label.
Example 2.2 (Product Cost Summarizing Function [64]).
In a recommendation system, the objective is to learn a product cost summarizing function to recommend cost-effective products to users. Formally, this product cost summarizing function can be formulated as \(c(\lbrace \boldsymbol {f}_i\rbrace _{i=1}^n)=\sum _{i=1}^n(f_{i}^1\cdot f_{i}^2)\), where \(\lbrace \boldsymbol {f}_i\rbrace _{i=1}^n \subseteq \mathbb {R}^d\) is the set of products, \(\boldsymbol {f}_i=(f_{i}^1,f_{i}^2,\ldots ,f_{i}^d)\) is a d-dimensional feature vector of product i, and \(f_{i}^1\) and \(f_{i}^2\) represent price and quantity, respectively.
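To make the permutation-invariance of such a summarizing function concrete, the following minimal Python sketch, with illustrative feature vectors of our own, computes the cost of a product set; reordering the products leaves the output unchanged.

```python
# A minimal sketch of Example 2.2: the output depends only on the set
# contents, so any reordering of the products gives the same cost.
def product_cost(products):
    """products: iterable of feature vectors (price, quantity, ...)."""
    return sum(f[0] * f[1] for f in products)

catalog = [(2.5, 4, 0.3), (1.0, 10, 1.2), (7.0, 1, 0.5)]  # illustrative products
assert product_cost(catalog) == product_cost(list(reversed(catalog)))
```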
Example 2.3 (Square Corner Prediction Function [146]).
For object detection in traffic scenes, the objective is to learn a square corner prediction function to predict bounding boxes around objects such as cars. Formally, this square corner prediction function can be formulated as \(g(\lbrace (x_i,y_i)\rbrace _{i=1}^4)=\lbrace (x_i^\prime ,y_i^\prime)\rbrace _{i=1}^4\), where \(\lbrace (x_i,y_i)\rbrace _{i=1}^4\subseteq \mathbb {R}^2\) represents the vertices of a square and \(\lbrace (x_i^\prime ,y_i^\prime)\rbrace _{i=1}^4\subseteq \mathbb {R}^2\) represents four corners of the rotated square.
Having shown some examples of set functions, we introduce supervised set function learning, a branch of set function learning, aiming to learn set functions from labeled training data. The supervised set function learning problem is defined as follows.
Definition 2.2 (Supervised Set Function Learning).
For two sets X and Y, suppose that \(\mathcal {D}\) is an unknown underlying probability distribution over \(2^X\times Y\), from which the training set D is assumed to be sampled, i.e., \(D = \lbrace (x_i,y_i)\rbrace _{i=1}^n\), where each \(x_i\in 2^X\) is an input set and \(y_i\in Y\) is the corresponding target output label. We begin by choosing a hypothesis space \(\mathcal {H}\subseteq \lbrace h:2^X \rightarrow Y\rbrace\). The goal is to find a function \(h\in \mathcal {H}\) that maps input sets \(x_i\) to outputs \(y_i\), such that the expected loss \(L_\mathcal {D}(h)\stackrel{\mathrm{def}}{=}\mathbb {E}_{(x,y)\sim \mathcal {D}}[\ell (h,(x,y))]\) is minimized, where \(\ell :\mathcal {H}\times Z\rightarrow \mathbb {R}^+\) measures the difference between the prediction \(h(x)\) and the ground truth y, and \(Z=2^X\times Y\).
Supervised set function learning has applications in various domains such as computer vision (e.g., object detection [145] and scene understanding [95]), natural language processing (e.g., text summarization [10] and relation extraction [71]), and bioinformatics (e.g., protein prediction [51] and drug discovery [53]). Some example tasks of supervised set function learning are given below and visualized in Figure 1.
Fig. 1. Visualization of examples.
Example 2.4 (Point Cloud Classification [95, 96]).
Let X be the set of all 3D points in the space and Y be the set of object categories such as sphere and cube. Each input set \(x_i \in 2^X\) represents a point cloud and the corresponding label \(y_i \in Y\) represents the object category of the point cloud. The goal is to find a function \(h \in \mathcal {H}\) that classifies each point cloud \(x_i\) into its correct category \(y_i\).
Example 2.5 (Predicting Total Cost of a Set of Products [122]).
Let X be the set of all possible products and Y be the set of possible total costs. Each input set \(x_i \in 2^X\) represents a set of products, where each product is characterized by features such as price, weight, and quantity. The corresponding label \(y_i \in Y\) represents the total cost of the set of products. The goal is to find a function \(h \in \mathcal {H}\) that accurately predicts the total cost \(y_i\) for each set of products \(x_i\).
Example 2.6 (Predicting Corners of the Rotated Square [145, 146]).
Let X be the set of all 2D points in a plane and Y be the set of possible sets of four points. Each input set \(x_i \in 2^X\) represents the vertices of a square rotated by an angle \(\theta\) around the origin. The corresponding label \(y_i \in Y\) represents the four corners of the square after rotation. The goal is to find a function \(h \in \mathcal {H}\) that predicts the four corners \(y_i\) for each set of vertices \(x_i\) given the rotation angle \(\theta\).
With a growing literature focusing on designing novel set function learning methods, we summarize three issues that should be taken into account when designing new set function learning methods, including permutation-invariance, theoretical expressive power, and scalability.
(1) Permutation-invariance: Permutation-invariance is a fundamental requirement [141] for set function learning. In tasks such as point cloud processing [95] and molecular property prediction [51], the output of models should remain consistent regardless of the order of the input set elements [53, 96, 104]. This property, known as permutation-invariance, is essential for any learning method that handles set-structured data. To show the significance of permutation-invariance, consider the scenario of using conventional CNNs for point cloud classification. In this case, the input point set should be transformed into an ordered vector before being fed into the network. However, if the points are permuted differently, then the extracted features after convolution and pooling will reflect the new order of points, potentially resulting in different classification outcomes. This variability is undesirable, since the output label should be invariant for classifying the same set of points regardless of their order. This example underscores the necessity of a permutation-invariant hypothesis space in set function learning methods. To define permutation-invariance for functions on matrices, we first introduce the permutation matrix. A permutation matrix is a square binary matrix with exactly one entry of 1 in each row and each column, and 0 elsewhere, representing a permutation of set elements. For example, when applied by left multiplication, the permutation matrix \([(0,1,0),(0,0,1),(1,0,0)]\) moves the second element to the first position, the third element to the second position, and the first element to the third position. Given an input \(\mathcal {X} \in \mathbb {R}^{N \times m}\) consisting of N m-dimensional vectors, a permutation matrix \(\Pi\) belongs to the set of all permutation matrices \(\Pi _N\). Using n to represent the dimension of the output vectors, we can formally define permutation-invariance as
Definition 2.3 (Permutation-invariance).
For each \(\mathcal {X}\in \mathbb {R}^{N \times m}\) and \(\Pi \in \Pi _N\), if \(f(\Pi \mathcal {X})=f(\mathcal {X})\) always holds, then the function \(f:\mathbb {R}^{N \times m} \rightarrow \mathbb {R}^{N \times n}\) is permutation-invariant.
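The following small numerical check, a sketch of our own using the permutation matrix from the example above, illustrates Definition 2.3: a sum-pooling function satisfies \(f(\Pi \mathcal {X})=f(\mathcal {X})\), while a flattening-based function does not.

```python
import numpy as np

X = np.random.randn(3, 4)                      # N = 3 elements, m = 4 features
Pi = np.array([[0, 1, 0],
               [0, 0, 1],
               [1, 0, 0]])                     # a permutation matrix in Pi_N

f_invariant = lambda X: X.sum(axis=0)          # sum-pooling over elements
f_sensitive = lambda X: X.reshape(-1)          # flattening depends on the order

assert np.allclose(f_invariant(Pi @ X), f_invariant(X))      # f(PiX) = f(X)
assert not np.allclose(f_sensitive(Pi @ X), f_sensitive(X))  # order-sensitive
```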
To preserve permutation-invariance, various techniques are employed; we summarize three key strategies as follows.
— Sorting is a straightforward technique employed in various learning models [123], where input set elements are sorted into a canonical ordering before being fed into the model. This mechanism essentially restricts the hypothesis space to inherently permutation-invariant functions, i.e., \(f(\mathrm{sort}(X))\), where X is the input set. However, this restriction may exclude some complex relationships that depend on the original data ordering or structure, biasing the hypothesis space towards functions that work well with a particular sorting and limiting the generalization ability of models.
— Augmenting the training data with various permutations of the input sets is a commonly used technique [121] (a minimal sketch of this strategy follows the list). The basic idea is to create multiple reordered versions of each input set and include all these versions in the training data. While this strategy keeps the original hypothesis space unchanged, it encourages the model to find approximately permutation-invariant functions within this space. This approach is more flexible than sorting and can be easily combined with existing learning methods such as RNNs [9]. However, augmenting significantly increases the data size and it is computationally infeasible to generate all possible permutations.
— Aggregating features of set elements through symmetric functions is an effective technique [12]. The key idea is to employ a permutation-invariant function to aggregate feature vectors from all elements in the input set into a unified set-level representation. This approach explicitly builds permutation-invariance into the model architecture and restricts the hypothesis space to permutation-invariant functions of the form \(f(g(\phi (x_1), \phi (x_2),\ldots , \phi (x_n)))\), where \(\lbrace x_i\rbrace _{i=1}^n\) is the input set, \(\phi\) is the encoder, g is a symmetric aggregation function such as sum (DeepSets [140]) and max (PointNet [95]), and f is a task-specific function. In fact, more complex encoders and aggregators, such as attention mechanisms [124], can expand the hypothesis space, potentially capturing higher-order element relationships.
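The sketch below illustrates the augmentation strategy referenced above; the dataset format and the number of permuted copies are illustrative assumptions rather than a prescribed recipe.

```python
import random

def augment_with_permutations(dataset, copies_per_example=3, seed=0):
    """dataset: list of (elements, label) pairs, where elements is a list."""
    rng = random.Random(seed)
    augmented = []
    for elements, label in dataset:
        augmented.append((list(elements), label))
        for _ in range(copies_per_example):        # a few random reorderings,
            shuffled = list(elements)              # not all n! permutations
            rng.shuffle(shuffled)
            augmented.append((shuffled, label))
    return augmented

data = [([(0.1, 0.2), (0.3, 0.4), (0.5, 0.6)], 1)]
print(len(augment_with_permutations(data)))        # 1 original + 3 permuted copies
```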
(2) Theoretical expressive power: Expressive power in set function learning refers to the capacity of models to represent and approximate set functions. Set function learning methods should have sufficient expressive power with theoretical guarantees to capture complex relationships between set elements and set-level features [122, 123]. For example, point cloud processing often requires capturing high-level geometric structures and patterns [95]. Insufficient expressive power can lead to underfitting and poor performance in complex tasks.
(3) Scalability: It is vital for set function learning methods to handle input sets of varying sizes and run in polynomial time [64], ensuring their practical applicability [86]. For example, in point cloud processing [96], the number of points representing an object can vary with resolution and sampling method, sometimes reaching millions. Scalable methods can also adapt to dynamic applications with growing data over time [98, 127], allowing for incremental learning and updating the model with newly available data instead of retraining from scratch.
3 Deep Learning Methods
Deep learning methods have become pivotal in addressing various learning problems [62]. The fundamental neural networks, such as CNNs [59] and RNNs [119], have achieved remarkable success in multiple tasks, such as image segmentation [23, 106], object detection [101, 102] and speech synthesis [85, 117]. However, these models implicitly incorporate regularity assumptions in their neural structures, making them less adaptable to irregular data domains such as sets, which lack a fixed ordering [140]. To handle set function learning tasks, several works extend CNNs and RNNs to set function learning, while there is also growing literature developing novel deep learning methods specialized for processing set-structured data. In this section, we introduce various deep learning methods designed for set function learning problems, categorizing them into multiple groups: CNN-based methods (Section 3.1), RNN-based methods (Section 3.2), FNN-based methods (Section 3.3), DeepSets-based methods (Section 3.4), PointNet-based methods (Section 3.5), Set Transformer–based methods (Section 3.6), Deep Set Prediction Network–based methods (Section 3.7), Deep Submodular Function–based methods (Section 3.8), and other deep learning methods (Section 3.9) that cannot be classified into the above groups.
3.1 CNN-based Methods
CNNs are highly efficient architectures due to their ability to leverage local connectivity and shared weights [47], leading to breakthroughs in a wide variety of tasks such as image processing [116]. To make full use of such advantages, some research extends CNNs to set-based learning problems. In this section, we introduce several CNN-based set function learning methods and divide them into two categories: one extends the convolution operation to set-structured data, and the other integrates CNNs with symmetric aggregations. The structure of this section is illustrated in Figure 2(a).
Fig. 2. (a) The structure of Section 3.1. (b) The structure of Section 3.4. (c) The structure of Section 3.5. (d) The structure of Section 3.6. (e) The structure of Section 3.7. (f) The structure of Section 3.8.
3.1.1 Extending Convolution Operation to Set.
The first strategy for CNNs to learn set functions is to extend the convolution operation to sets. Wendler et al. [130] propose a novel class of CNNs for set functions by introducing powerset convolution and pooling operations. Powerset convolution is designed to be shift equivariant, meaning it commutes with specified shifts on the powerset domain. The shift is defined as modifying a set function \(s(A)\) by removing a subset Q from argument A, denoted as \(T_Qs=(s_{A\setminus Q})_{A\subseteq N}\), where N is the ground set. The convolution is given in Reference [93] as \((h\ast s)_A=\sum _{Q\subseteq N}h_Qs_{A\setminus Q}\), where the filter h is a set function. The powerset convolution layer is constructed by conducting powerset convolutions on multiple channels, summarizing the feature maps as in Reference [15], with both input and output being sets of set functions. Each output set function is derived by convolving the input set functions with corresponding filters, summing the results, and applying a non-linear transformation. The powerset pooling layer reduces complexity by aggregating elements into a smaller ground set, mapping the original set function to a new one on this reduced set, and can be implemented in various ways, such as combining elements as in Reference [109] and using a simple max rule. Powerset CNNs consist of multiple powerset convolution and pooling layers, the number of which can be adjusted according to specific tasks. However, the complexity analysis [74] shows that powerset CNNs are impractical for large ground sets, as the domain of a set function (the powerset) grows exponentially with the size of the ground set. Xu et al. [133] develop SpiderCNN, a novel CNN designed specifically for processing point clouds. The core component of SpiderCNN is the SpiderConv layer, which replaces conventional convolutional layers to enable convolution on point sets. Given a function F defined on a point set \(P\subseteq \mathbb {R}^n\) and a filter \(g:\mathbb {R}^n\rightarrow \mathbb {R}\) within a sphere centered at the origin with radius \(r\in \mathbb {R}\), the SpiderConv can be formulated as
\((F\ast g)(p)=\sum _{q\in P,\, \Vert q-p\Vert \le r}F(q)\,g(p-q),\)
where \(p,q\) are points. Conventional convolution becomes a special case of SpiderConv when \(P=\mathbb {Z}^2\) is a regular grid. In SpiderConv, the filter g belongs to a parameterized filter family \(\lbrace g_w\rbrace\), which is piecewise differentiable for w and can be efficiently optimized using stochastic gradient descent (SGD). \(\lbrace g_w\rbrace\) is defined as the product of a step function and a Taylor polynomial, enabling capturing local geodesic information and ensuring expressiveness. SpiderCNN inherits the advantages of CNNs, making it effective at extracting deep features and achieving good performance in segmentation tasks.
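As a rough illustration of the operation above, the sketch below sums \(F(q)\,g(p-q)\) over the neighbors of a query point within radius r; the fixed radial function stands in for the learnable filter family \(\lbrace g_w\rbrace\), so this is a simplified sketch under our own assumptions rather than the authors' implementation.

```python
import numpy as np

def spider_conv(points, features, p, r, g):
    """points: (n, 3) array; features: (n,) array giving F(q); p: query point."""
    out = 0.0
    for q, f_q in zip(points, features):
        if np.linalg.norm(q - p) <= r:         # neighbors within the sphere
            out += f_q * g(p - q)              # F(q) * g(p - q)
    return out

g = lambda d: max(0.0, 1.0 - np.linalg.norm(d))   # stand-in for a filter g_w
pts = np.random.rand(100, 3)
feats = np.ones(100)
print(spider_conv(pts, feats, pts[0], r=0.2, g=g))
```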
3.1.2 Combining CNN with Symmetric Aggregation.
The second strategy is combining CNNs with symmetric aggregations such as mean and max pooling, where the final aggregated feature depends only on the set contents and not their order. To deal with burst image deblurring, Aittala et al. [3] propose a U-Net-inspired [106] framework with symmetric pooling operations. In this framework, each image of the input set is processed individually through identical neural networks with tied weights, producing feature vectors. These feature vectors are then combined through symmetric operations such as mean and max pooling. Eventually, the pooled features are processed through further neural network layers, outputting an estimate of the sharp image. In addition, an intermediate pooling layer is introduced, followed by \(1\times 1\) convolutions that fuse global features into local ones. This layer allows the concatenation of the pooled global state back to the local features, enabling information exchange between the set entities. Zhong et al. [151] design a CNN-based architecture called SetNet, which aggregates face descriptors into a compact descriptor. This framework is developed to enhance the efficiency and accuracy of retrieving a set of images that match a given query containing the descriptions of images for multiple identities. SetNet utilizes ResNet-50 [42] to extract features from each image, generating individual descriptors. These descriptors are aggregated into a fixed-length set-level vector using NetVLAD [4], which also helps reduce memory usage and runtime.
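A minimal sketch of this shared-encoder-plus-symmetric-pooling pattern follows; the tiny CNN and the choice of max pooling are our own illustrative stand-ins for the U-Net and NetVLAD components of the cited models.

```python
import torch
import torch.nn as nn

class SharedCNNSetEncoder(nn.Module):
    def __init__(self, channels=3, feat=32):
        super().__init__()
        # every image in the set goes through the same (tied-weight) CNN
        self.cnn = nn.Sequential(
            nn.Conv2d(channels, feat, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, images):                # images: (batch, set_size, C, H, W)
        b, n, c, h, w = images.shape
        feats = self.cnn(images.view(b * n, c, h, w)).view(b, n, -1)
        return feats.max(dim=1).values        # symmetric aggregation over the set

enc = SharedCNNSetEncoder()
burst = torch.randn(2, 4, 3, 32, 32)
assert torch.allclose(enc(burst), enc(burst[:, [2, 0, 3, 1]]))   # order-free
```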
3.2 RNN-based Methods
RNNs are efficient in recognizing patterns when handling sequentially structured data, such as time series [26, 108], speech [34, 88], and text [111, 118]. The connections in RNNs form directed cycles, enabling them to maintain a hidden state that captures temporal dependencies in input sequences [149]. This characteristic makes RNNs particularly suitable for sequential tasks. In this section, we introduce some works that extend RNNs to set function learning.
Qin et al. [97] present set-RNN, an adaptation of RNN for dealing with multi-label text classification, where the target output is a label set. Previous approaches tackling such tasks either transform the set into a predefined sequence or connect sequence probability with set probability, but these methods lack solid theoretical foundations and perform poorly in practice. The authors propose a novel training objective that maximizes set probability defined as the sum of probabilities across all sequence permutations of the set. During training, a variant of beam search is employed to approximate set probability by identifying the top K highest probability sequences. The same approximation technique is used during prediction to find the label set with the highest probability. This novel objective enhances the flexibility of set-RNN to search the best label orders, enabling it to efficiently tackle multi-label classification tasks. Inspired by Reference [97], Li et al. [68] develop a new approach called set learning, which optimizes the set probability by considering multiple permutations of structured objects. This method is applied to generative information extraction (IE) tasks, where the input is a text \(X=[x_1,x_2,\ldots ]\) and the output is a set of structured objects \(S=\lbrace s_1,s_2,\ldots \rbrace\), with each structured object consisting of several spans from X. Set learning introduces a new method to calculate the set probability, formulated as
\(P(S\mid X)=\sum _{\pi _z(Y)\in \Pi (Y)}P\big (\pi _z(Y)\mid X\big), \qquad (1)\)
where \(\Pi (Y)\) denotes all possible permutations of Y and \(\pi _z(Y)\) is a specific permutation in \(\Pi (Y)\). The set Y has the same size as S, containing all elements of S flattened into sub-sequences. Based on Seq2Seq learning [115], set learning optimizes the set probability through Equation (1) and reduces the calculation cost through permutation sampling, achieving good performance on IE tasks. However, as the size of the training data increases, the benefits of permutation sampling diminish, while the runtime significantly increases.
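The toy sketch below illustrates Equation (1): the probability of a label set is accumulated over (sampled) permutations of that set. The dummy scoring function stands in for a trained sequence model, so the numbers are purely illustrative.

```python
import itertools
import math
import random

def set_log_prob(label_set, seq_log_prob, num_samples=None):
    perms = list(itertools.permutations(label_set))
    if num_samples is not None and num_samples < len(perms):
        perms = random.sample(perms, num_samples)      # permutation sampling
    logs = [seq_log_prob(p) for p in perms]
    m = max(logs)                                      # log-sum-exp over orderings
    return m + math.log(sum(math.exp(l - m) for l in logs))

# stand-in for an RNN/Seq2Seq decoder's sequence log-probability
seq_log_prob = lambda seq: -sum(0.1 * i * len(w) for i, w in enumerate(seq))
print(set_log_prob(("sports", "politics", "tech"), seq_log_prob, num_samples=4))
```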
3.3 FNN-based Methods
Feedforward neural network (FNN), also known as Multi-Layer Perceptron (MLP), is a fundamental architecture where information flows in one direction, from the input layer through hidden layers to the output layer [13]. It is widely used for various machine learning tasks such as classification [48] and pattern recognition [99]. In this section, we discuss several methods handling set structured data based on FNNs.
Rezatofighi et al. [103] introduce an innovative deep FNN-based approach, Deep Perm-Set Network (DPSN), to address set prediction problems, where the outputs are sets with arbitrary permutation and cardinality. DPSN models the set distributions by defining discrete distributions for set cardinality and permutation variables, as well as a joint distribution over set elements given a fixed cardinality. In the scenario where the permutation is fixed during training, the output set elements are kept in a consistent order, which is suitable for tasks such as multi-label classification. The network predicts the cardinality and the state (i.e., existence scores) of each set element by optimizing a loss function that combines cardinality loss (e.g., the negative logarithm of a categorical distribution) with state loss (e.g., binary cross-entropy). The fixed permutation simplifies the learning process by eliminating the need to handle varying orderings of set elements. In the scenario of learning the distribution over permutations, the model addresses tasks where element order varies during training, such as object detection. DPSN approximates the marginalization over all possible permutations, sampling significant permutations and dynamically determining the best assignment (permutation) for each training instance using the Hungarian algorithm. The learning process optimizes the posterior distribution over the network parameters by jointly considering cardinality, permutation, and state losses. In the scenario where order does not matter, the permutation of output set elements is assumed to be uniformly distributed, applicable to tasks where the specific order of set elements is not important. The network optimizes the cardinality and state losses without considering element order, dynamically determining the assignment between network outputs and ground truth annotations during each SGD iteration. While DPSN is shown to be effective in experiments, its scalability is limited due to the exponential increase in the number of possible permutations as the set size grows.
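The assignment step described above can be sketched with the Hungarian algorithm as follows; the squared-error cost is our illustrative choice and not necessarily the loss used by DPSN.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_assignment_loss(pred, target):
    """pred, target: (n, d) arrays of set elements."""
    cost = ((pred[:, None, :] - target[None, :, :]) ** 2).sum(-1)   # (n, n) costs
    rows, cols = linear_sum_assignment(cost)       # optimal permutation (Hungarian)
    return cost[rows, cols].mean(), cols

pred, target = np.random.rand(4, 2), np.random.rand(4, 2)
loss, assignment = best_assignment_loss(pred, target)
print(loss, assignment)
```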
Yu et al. [139] design a simple framework built solely on Simplified Fully Connected Networks (SFCNs) for temporal set prediction of user behaviors. This framework first adopts an element embedding layer to learn the representations of set elements. Subsequently, the set representation is computed through the newly designed permutation-invariant functions and an SFCN is applied to capture temporal dependencies among sets. To enable interactions between elements within each set, the newly designed permutation-equivariant function is employed to establish relationships between elements. Following this, another SFCN is used to uncover implicit correlations across multiple embedding channels. Finally, the user representations are aggregated adaptively by average-pooling and the probability of each behavior’s occurrence in the next period-set is calculated using an adaptive fusing module. Notably, this work is the first to show that a simple architecture can effectively deal with temporal set prediction tasks.
3.4 DeepSets-based Methods
Designing novel deep learning methods to tackle set function learning problems has been an active research topic, since Zaheer et al. [140] propose the foundational framework known as DeepSets. This pioneering work establishes key design principles for deep permutation-invariant neural networks and outlines the essential components, such as permutation-equivariant feature extraction and permutation-invariant set pooling. In this section, we introduce the basic concepts of DeepSets and the subsequent advancements. This section is organized as in Figure 2(b).
3.4.1 DeepSets.
To handle learning tasks over set-structured data, Zaheer et al. [140] construct a permutation-invariant model. For a countable set X and a set Y, the function \(f:X\rightarrow Y\) is a valid set function, i.e., invariant to the permutation of elements in X, if and only if it can be decomposed into the form: \(\rho (\sum _{x\in X}\phi (x))\), where \(\phi\) and \(\rho\) are appropriate transformations. As for an uncountable set X with fixed size M, any continuous function f defined on X, i.e., \(f:\mathbb {R}^{d\times M}\rightarrow Y\), is permutation-invariant if and only if f can be approximated arbitrarily closely by a function of the form \(\rho (\sum _{x\in X}\phi (x))\). Therefore, any set function f can be represented in this formulation:
\(f(X)=\rho \Big (\sum _{x\in X}\phi (x)\Big). \qquad (2)\)
Based on this analysis, DeepSets is developed, capable of approximating any permutation-invariant function on the ground set X by using universal function approximators, such as neural networks, for the transformations \(\phi\) and \(\rho\). The model contains two main operations: (1) Each element \(x_m\) of the ground set X is transformed into a representation \(\phi (x_m)\) through the neural network \(\phi\). (2) The representations \(\phi (x_m)\) are summed to produce a single vector, which is then processed through network \(\rho\). The key idea is to aggregate all representations via summation and then apply nonlinear transformations through networks. In particular, the intermediate layers within DeepSets, such as \(\phi\), often exhibit permutation-equivariance, meaning that the processing of each element is independent of the order in which the elements are presented. This property ensures that the order of the elements does not affect their individual processing. Similarly to Definition 2.3, we formally define permutation-equivariance as follows.
Definition 3.1 (Permutation-equivariance).
For each \(\mathcal {X} \in \mathbb {R}^{N \times m}\) and \(\Pi \in \Pi _N\), if \(g(\Pi \mathcal {X})=\Pi g(\mathcal {X})\) always holds, then the function \(g:\mathbb {R}^{N \times m} \rightarrow \mathbb {R}^{N \times n}\) is permutation-equivariant.
The authors propose a novel formulation of permutation-equivariant functions, which can be represented as a neural network layer whose standard form is \(g_{\Theta }(\boldsymbol {x})=\sigma (\Theta \boldsymbol {x})\), where \(\Theta \in \mathbb {R}^{M\times M}\) is the weight matrix and \(\sigma :\mathbb {R}\rightarrow \mathbb {R}\) is a nonlinear function such as the sigmoid function. It is proved that \(g_{\Theta }:\mathbb {R}^M\rightarrow \mathbb {R}^M\) is permutation-equivariant if and only if all diagonal elements of \(\Theta\) are equal and all off-diagonal elements are tied together, i.e.,
\(\Theta =\lambda \boldsymbol {I}+\gamma (\boldsymbol {11}^\top), \qquad (3)\)
where \(\lambda ,\gamma \in \mathbb {R}\), \(\boldsymbol {1}=[1,\ldots ,1]^\top \in \mathbb {R}^M\), and \(\boldsymbol {I}\in \mathbb {R}^{M\times M}\) is an identity matrix. Therefore, the neural network \(g_{\Theta }(\boldsymbol {x})=\sigma (\Theta \boldsymbol {x})\) is permutation-equivariant if \(\Theta =\lambda \boldsymbol {I}+\gamma (\boldsymbol {11}^\top)\), i.e., \(g(\boldsymbol {x})=\sigma ((\lambda \boldsymbol {I}+\gamma (\boldsymbol {11}^\top)) \boldsymbol {x})\). The layer has several other variations when specifying the operations and parameters. In summary, the permutation-equivariant property of the intermediate layers in DeepSets ensures that each element is treated consistently regardless of its position in the set. The final symmetric aggregation combines these equivariant features in a permutation-invariant manner, ensuring that the order of elements does not affect the output.
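A minimal PyTorch sketch of the two building blocks discussed in this subsection follows: the permutation-equivariant layer of Equation (3) and the sum-decomposition of Equation (2). Layer widths and the sigmoid nonlinearity are illustrative choices, not those of the original paper.

```python
import torch
import torch.nn as nn

class EquivariantLayer(nn.Module):
    """g(x) = sigma((lambda*I + gamma*11^T) x) for a set of M scalars."""
    def __init__(self):
        super().__init__()
        self.lam = nn.Parameter(torch.tensor(1.0))    # shared diagonal weight
        self.gam = nn.Parameter(torch.tensor(0.1))    # tied off-diagonal weight

    def forward(self, x):                             # x: (M,)
        return torch.sigmoid(self.lam * x + self.gam * x.sum())

class DeepSets(nn.Module):
    """rho(sum_x phi(x)) with phi and rho as small MLPs."""
    def __init__(self, in_dim=3, hidden=64, out_dim=1):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, x):                             # x: (batch, set_size, in_dim)
        return self.rho(self.phi(x).sum(dim=1))       # sum-pooling, then rho

x, perm = torch.randn(5), torch.randperm(5)
layer, model = EquivariantLayer(), DeepSets()
assert torch.allclose(layer(x)[perm], layer(x[perm]), atol=1e-5)   # equivariance
xs = torch.randn(2, 10, 3)
assert torch.allclose(model(xs), model(xs[:, torch.randperm(10), :]), atol=1e-5)
```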
3.4.2 Generalization of DeepSets.
There are some works generalizing DeepSets. Maron et al. [80] focus on a principled approach to learning from unordered set elements, particularly when the elements themselves exhibit inherent symmetries. The authors propose the Deep Sets for Symmetric elements (DSS) framework, which generalizes DeepSets to accommodate additional symmetries of elements. The core innovation is the introduction of DSS layers, incorporating multiple linear layers L that are equivariant to the permutations of the set and the inherent symmetries of the set elements, such as translational symmetry in images and rotational symmetry in 3D shapes. This symmetry is represented by a group H that operates on the elements. Concretely, a DSS layer applies a transformation to each set element while also considering the aggregated information from the entire set. Based on Equation (3), the DSS layer for a set \(\lbrace x_1,\ldots ,x_n\rbrace \subseteq \mathbb {R}^d\) with symmetry group H and feature dimension d is defined by
\(L(x_1,\ldots ,x_n)_i=L_1^H(x_i)+L_2^H\Big (\sum _{j\ne i}x_j\Big),\)
which generalizes DeepSets by applying linear H-equivariant functions \(L_1^H, L_2^H\). The authors prove that DSS networks are universal approximators, provided that the individual element-wise networks are universal for the symmetry group H, addressing the issue that restricting a network to be invariant or equivariant may reduce the expressive power [79]. Consequently, DSS layers can represent any function that respects the symmetries of the set elements and the set itself. In summary, the DSS framework extends DeepSets to problems involving symmetric elements, providing a comprehensive and theoretically grounded approach for learning from sets with intrinsic symmetries. Murphy et al. [83] propose Janossy pooling, a novel model for constructing permutation-invariant functions. Janossy pooling provides a universal method by representing a permutation-invariant function as the average of a permutation-sensitive function applied to all possible reorderings of the input sequence. However, the computational cost of summarizing all permutations and backpropagating gradients is prohibitively high. To solve this issue, the authors develop three approximation methods to trade off complexity and generalization: (1) canonical orderings: elements of the input sequence are reordered according to a predefined criterion, reducing the computational cost by avoiding the need to consider all permutations; (2) k-ary dependencies: the permutation-sensitive function is restricted to depend only on subsets of k elements at a time, reducing the number of permutations considered while still capturing important interactions; (3) permutation sampling: during training, permutations are randomly sampled, so that only a small number of orderings need to be considered. These strategies enable Janossy pooling to unify and generalize existing methods, achieving competitive performance on various tasks compared to state-of-the-art techniques. Notably, DeepSets can be seen as a special case of Janossy pooling with 1-ary dependencies, where the function depends on individual elements without considering interactions beyond simple aggregation. In contrast, Janossy pooling allows for k-ary dependencies, capable of capturing higher-order interactions within the data.
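A rough sketch of Janossy pooling with permutation sampling follows; the GRU and the number of sampled permutations are our own illustrative choices for the permutation-sensitive function.

```python
import torch
import torch.nn as nn

class JanossySampled(nn.Module):
    def __init__(self, in_dim=3, hidden=32, num_perms=4):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)  # order-sensitive model
        self.num_perms = num_perms

    def forward(self, x):                      # x: (batch, set_size, in_dim)
        outs = []
        for _ in range(self.num_perms):
            perm = torch.randperm(x.size(1))   # a sampled reordering of the set
            _, h = self.rnn(x[:, perm, :])     # h: (1, batch, hidden)
            outs.append(h.squeeze(0))
        return torch.stack(outs).mean(dim=0)   # average over sampled orderings

model = JanossySampled()
print(model(torch.randn(2, 6, 3)).shape)       # torch.Size([2, 32])
```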
3.4.3 Theoretical Analysis of DeepSets.
There have been several works conducted to analyze the theoretical properties of DeepSets as it is a fundamental set function learning approach. Wagstaff et al. [122] refer to a permutation-invariant function f represented by the formulation of Equation (2) as sum-decomposable, where the combination \((\rho ,\phi)\) of function \(\phi :\mathbb {R}\rightarrow Z\) and function \(\rho :Z\rightarrow \mathbb {R}\) is a sum-decomposition with latent space Z for function f, namely, the function f is sum-decomposable via Z. They analyze the limitations of enforcing permutation-invariance using sum-pooling and derive a necessary condition that a sum-decomposition-based model with universal function representation should satisfy. It is demonstrated that sum-decomposition-based models can represent arbitrary continuous functions defined on a set of size N only if the dimension L of the latent space in which the summation is performed is no less than the set size N. To resolve the open question regarding the representation capabilities of high-dimensional DeepSets posed in Reference [122], Zweig et al. [153] conduct an expressive power analysis of DeepSets. They indicate that DeepSets require an exponentially large width to approximate certain symmetric functions, implying that the dimension L of the latent space grows exponentially with the size N and dimension D of the input set. This analysis demonstrates that DeepSets may be inherently inefficient for representing certain high-dimensional symmetric functions unless it is enhanced with mechanisms enabling interactions between set elements. Wang et al. [125] further reveal the relationship between the latent space dimension L and the expressive power of DeepSets [140], overcoming the limitations of previous works that focus solely on one-dimensional features or complex analytic activations, which are impractical due to the exponential growth of L with N and D. Considering high-dimensional features, i.e., \(D>1\), the bounds of the minimal latent space dimension L are proved to be divided into two categories according to the encoding network \(\phi\): (1) If \(\phi\) applies a linear layer with power mapping, then we can get \(N(D+1)\le L<N^5D^2\). (2) If \(\phi\) applies a linear layer and an exponential activation function, then we can get a tighter bound, \(ND\le L\le N^4D^2\). The proposed bounds imply that it is sufficient to model the latent space of DeepSets with L being \(\mathrm{poly}(N,D)\) for the universal approximation of set functions. It is also demonstrated that continuous mappings \(\phi\) and \(\rho\) are crucial for ensuring universal approximation of DeepSets. Table 2 compares the lower bounds obtained by different studies.
Table 2. The Comparison among Research on Expressiveness Analysis with Latent Space Dimension \(L\)
3.4.4 Proposing Novel Aggregating Methods.
There are some works trying to propose novel aggregating methods for set function learning. Aggregating inputs into a single representation is a common mechanism in set function learning, such as DeepSets, which utilizes sum-pooling to aggregate element-wise embeddings. Inspired by DeepSets, Abedin et al. [1] employ a set-based deep learning approach called SparseSense to handle the sparse data from passive sensors in human activity recognition (HAR) tasks. Unlike traditional methods that require dense data streams or rely on interpolation to estimate missing data points, SparseSense processes the sparse data directly, mitigating large estimation errors and long recognition delays. This method regards sparse sensor data as sets, allowing the model to focus on extracting discriminative features of all activity categories without relying on temporal correlations. The key idea is to apply a shared embedding network to project each set element into a higher-dimensional space, followed by featurewise maximum pooling to aggregate these embeddings into a fixed-size global representation for activity classification. SparseSense extends DeepSets to HAR, demonstrating that set-based neural networks can effectively handle irregular data points and tolerate missing information. Bartunov et al. [12] introduce an optimization-based aggregation method named Equilibrium Aggregation. This method generalizes existing pooling-based approaches, overcoming the limitations of existing techniques such as sum-pooling, which are constrained by their representational power. The Equilibrium Aggregation models the potential function \(F_\theta (x,y)\), which quantifies the discrepancy between each set element x and aggregation result y, as a learnable neural network with parameter \(\theta\). This layer architecture can be integrated into another multi-layer neural network to aggregate sets. The energy-minimization of Equilibrium Aggregation can be formulated as
\(\phi _\theta (X)=\mathop {\mathrm{arg\,min}}_{y}\Big (R_\theta (y)+\sum _{x\in X}F_\theta (x,y)\Big), \qquad (4)\)
where \(R_\theta (y)\) is a regularization term. The aggregation result y is computed by solving Equation (4) with numerical methods such as gradient descent. The neural network framework with Equilibrium Aggregation can be formulated as \(\rho (\phi _\theta (X))\), where \(\rho\) is a neural network. It is theoretically proved that this framework can universally approximate any continuous permutation-invariant function if the output of Equation (4) has the same size as the input set. Equilibrium Aggregation provides a more flexible framework than DeepSets by utilizing a learnable potential function, potentially achieving better performance in tasks that require more detailed data representation. Horn et al. [43] propose a novel framework called Set Functions for Time Series (SeFT) for classifying irregularly sampled time series. SeFT regards time-series data as a set of observations, addressing the issues of irregular sampling and unaligned measurements without requiring imputation. This model employs a set function \(f:S\rightarrow {\mathbb {R}^c}\) derived from Equation (2). Denoting \(s_j\) as a single observation of the time series S, the function f can be formulated as
\(f(S)=g\Big (\frac{1}{|S|}\sum _{s_j\in S}h(s_j)\Big),\)
where \(h:\Omega \rightarrow \mathbb {R}^d\) and \(g:\mathbb {R}^d\rightarrow \mathbb {R}^c\) are both neural networks, with h mapping observations from the domain \(\Omega\) to a d-dimensional latent space, and g further mapping this latent representation to the final c-dimensional classification space. A variant of positional encoding [120] is used for time encoding, employing multiple trigonometric functions at different frequencies to convert the 1-dimensional time t of each observation into a multi-dimensional input. To handle large observation sets and highlight the most relevant data points, a weighted mean aggregation approach based on scaled dot-product attention with multiple heads is designed to weigh different observations. This aggregation method independently calculates each element’s embedding, achieving a runtime and memory complexity of \({O}(n)\). SeFT extends the representation of DeepSets specifically to time series with irregular sampling, where the order of observations is not fixed and might not follow a regular interval.
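Returning to Equation (4), the following schematic sketch obtains the aggregate by a few gradient steps on the energy; the quadratic potential and the small regularizer are stand-ins for the learnable networks of Equilibrium Aggregation, so this illustrates the mechanism rather than reproducing the method.

```python
import torch

def equilibrium_aggregate(X, steps=50, lr=0.1):
    F = lambda x, y: ((x - y) ** 2).sum(-1)      # stand-in potential F_theta(x, y)
    R = lambda y: 0.01 * (y ** 2).sum()          # stand-in regularizer R_theta(y)
    y = torch.zeros(X.size(-1), requires_grad=True)
    opt = torch.optim.SGD([y], lr=lr)
    for _ in range(steps):                       # minimize R(y) + sum_x F(x, y)
        opt.zero_grad()
        energy = R(y) + F(X, y).sum()
        energy.backward()
        opt.step()
    return y.detach()

X = torch.randn(8, 4)                            # a set of 8 elements in R^4
print(equilibrium_aggregate(X))                  # close to the set mean here
```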
3.4.5 Extending DeepSets to Specific Scenarios.
There are multiple works proposing methods that build on the foundational concepts of DeepSets, extending these concepts to specific scenarios. Yi et al. [137] introduce CytoSet, designed to deal with sets of cells and predict the clinical outcome of patients. Since the order of cells’ profiles has no biological relevance in flow and mass cytometry experiments, CytoSet regards the cytometry data as a set and extracts information through a permutation-invariant neural network based on DeepSets. This approach predicts the clinical outcome from the patient sample represented as a set of cells, with each cell characterized by a vector of protein measurements. In the proposed model, several permutation-equivariant blocks, as described in Reference [140], are stacked to transform the representation of each set element. The output of these blocks is processed by max-pooling, which measures the presence of high response cells and produces an embedding vector for the set. This vector is then passed through fully connected layers to predict the clinical outcome. CytoSet extends the concept of DeepSets to clinical cytometry data analysis, generalizing CellCNN [7] and CytoDx [46], and achieving better experimental performance compared to them. Ou et al. [86] present equivariant variational inference for set function learning (EquiVSet) to predict set-valued outputs (subsets) that optimize a certain utility function over a given ground set under the optimal subset (OS) supervision oracle, where the optimal subset provides the maximum utility. They combine an energy-based method with DeepSets to construct an appropriate set mass function that increases monotonically with a set utility function. To enable training models on varying ground sets and overcome the instability caused by the high dimension of sets when directly optimizing likelihood, a scalable training and inference algorithm is proposed by utilizing the maximum likelihood principle in conjunction with mean-field inference as a surrogate. EquiVSet improves on DeepSets by modeling the utility function explicitly and handling more complex tasks involving OS oracles. Wang et al. [124] develop an effective model termed DTS-ERA, which combines the proposed Deep Temporal Sets (DTS) with Evidential Reinforced Attentions (ERA) to uncover the signature behavioral patterns of multimodal data in behavior analysis of children with autism spectrum disorder. DTS-ERA is implemented in the manner of few-shot learning, enabling it to effectively handle situations with limited data. DTS is a multimodal version of DeepSets, capable of capturing complex temporal and spatial relationships in multimodal data. It is composed of a temporal encoder and a spatial encoder, which generate feature representations that maintain temporal dependencies and spatial locality. These feature representations are then concatenated and aggregated through average-pooling to obtain the deep-set encoding. In ERA, DTS is combined with a reinforcement learning agent, where an evidential reward function is designed to learn an epistemic policy, which selects representative embeddings as attention signatures. ERA incorporates evidential learning to estimate uncertainty, allowing the model to distinguish between known and unknown regions effectively, thereby improving the reliability of the predictions.
3.5 PointNet-based Methods
PointNet [95] is another important and pioneering set function learning approach, particularly designed to deal with point clouds, taking the point sets as input and outputting labels. In this section, we introduce the fundamental concepts of PointNet and discuss several relevant works based on it. The structure of this section is outlined in Figure 2(c).
3.5.1 PointNet.
PointNet can directly process point clouds without converting them into regular data structures such as 3D voxel grids, maintaining the inherent properties of point clouds. The components of PointNet are similar to those of DeepSets, with the sum-pooling replaced by max-pooling. For a finite point set X and its element x, the set function \(f:2^X\rightarrow Y\), whose value corresponds to the semantic label of the point set, can be approximated by PointNet as
\(f(X)\approx \rho \Big (\underset{x\in X}{\mathrm{MAX}}\ \phi (x)\Big),\)
where \(\phi\) captures features of each point in X and \(\mathrm{MAX}\) denotes element-wise max-pooling. The aggregated features are passed to \(\rho\) to obtain the output. Both \(\rho\) and \(\phi\) are neural networks or other parameterized models with learnable parameters. This framework is permutation-invariant, because max-pooling yields the same output for any ordering of the points in the point set. It is theoretically proved that PointNet is capable of approximating any continuous set function if the max-pooling layer contains enough neurons.
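A minimal PointNet-style sketch of the formulation above follows; it keeps only the shared point-wise encoder, the max-pooling, and a classifier head, omitting the input and feature transform networks of the full model, with illustrative layer sizes.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, hidden=64, num_classes=10):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))     # shared per point
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_classes))

    def forward(self, points):                  # points: (batch, n_points, 3)
        return self.rho(self.phi(points).max(dim=1).values)    # max-pooling

net = TinyPointNet()
cloud = torch.randn(2, 1024, 3)
perm = torch.randperm(1024)
assert torch.allclose(net(cloud), net(cloud[:, perm, :]))      # order-free output
```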
3.5.2 Theoretical Analysis of PointNet.
The work in Reference [18] explores the expressive power of neural networks that use set pooling mechanisms. The authors introduce and analyze a variety of set pooling architectures, such as sum-pooling (DeepSets), max-pooling (PointNet), and average-pooling (normalized-DeepSets). The theoretical analysis reveals that PointNet cannot generally approximate averages of continuous functions over sets (e.g., center-of-mass), and that DeepSets is strictly more expressive than PointNet in the constant cardinality setting. This finding implies that the choice of set pooling function has a dramatic impact on the expressiveness of these networks. Unexpectedly, it is also proved that any function that can be uniformly approximated by both PointNet and normalized-DeepSets must be constant in the unbounded cardinality setting.
3.5.3 Improving Capabilities of PointNet.
There are several works that enhance PointNet, extending its applicability to more complex scenarios. To overcome the limitation that PointNet cannot learn local structures at various scales, Qi et al. [96] develop a hierarchical neural network named PointNet++, which applies PointNet recursively to nested partitions of the input point set. PointNet++ contains multiple set abstraction levels, including a sampling layer, a grouping layer, and a PointNet layer. The sampling layer chooses points to define local regions’ centroids, around which the grouping layer explores neighboring points to build local regions. Then the PointNet layer encodes local region patterns into feature vectors. In particular, the grouping layer has two implementations: multi-scale grouping and multi-resolution grouping. These methods are capable of adaptively aggregating multi-scale features with respect to corresponding point densities, thereby eliminating the impact of varying point set densities on different regions. Generally, PointNet++ begins by extracting local features that capture fine geometric structures within small neighborhoods through PointNet. These local features are then grouped into larger units and further processed to generate higher-level features. This hierarchical process is repeated iteratively until the comprehensive features of the entire point set are obtained, realizing both robustness and detail capture. While PointNet uses a global max-pooling operation to aggregate features from the entire point set, PointNet++ enhances it by introducing a multi-scale hierarchical learning process, expanding its capabilities to capture detailed local structures and handle varying point densities. In contrast to PointNet, which directly processes a point cloud by considering each point independently and utilizing max-pooling to aggregate global features, Prokudin et al. [92] design a type of residual representation termed basis point sets (BPS), which can encode a point cloud into a fixed-length vector, enabling the use of standard machine learning techniques. To construct the BPS, the point clouds are normalized to fit a ball with radius \(r\in \mathbb {R}\), from which k points are randomly sampled from a uniform distribution to obtain the basis point set. By calculating the minimal distance from each basis point to the nearest point in the point cloud, we obtain a feature vector for every point cloud. These feature vectors can be taken as inputs to learning algorithms. The point cloud classification experiments demonstrate that the framework combining MLP with BPS achieves performance comparable to PointNet, while significantly reducing the number of parameters and computational complexity.
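A small sketch of the BPS encoding described above follows; for simplicity the basis points here are drawn uniformly from a cube rather than the ball used in the original work, and each feature is the distance from a basis point to its nearest cloud point.

```python
import numpy as np

def bps_encode(cloud, k=32, seed=0):
    """cloud: (n, 3) array of points; returns a fixed-length (k,) feature vector."""
    cloud = cloud - cloud.mean(axis=0)
    cloud = cloud / np.linalg.norm(cloud, axis=1).max()       # normalize to unit ball
    rng = np.random.default_rng(seed)                         # fixed seed so the same
    basis = rng.uniform(-1.0, 1.0, size=(k, 3))               # basis is reused per cloud
    dists = np.linalg.norm(cloud[None, :, :] - basis[:, None, :], axis=-1)
    return dists.min(axis=1)                                  # nearest-point distances

print(bps_encode(np.random.rand(500, 3)).shape)               # (32,)
```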
3.6 Set Transformer–based Methods
In this section, we introduce Set Transformer [64], a powerful neural network designed for learning functions on sets, and discuss several set function learning methods built upon it. This section is organized as in Figure 2(d).
3.6.1 Set Transformer.
Set Transformer [64] is an attention-based neural network method capable of modeling interactions across input set elements, which are often overlooked by set pooling methods such as DeepSets [140] and PointNet [95]. Based on Transformer [120], Set Transformer employs permutation-equivariant self-attention to capture pairwise and higher-order interactions between elements. Suppose that \(Q,R\in \mathbb {R}^{n\times d}\) are the query set and value set, respectively, consisting of n d-dimensional vectors. To construct the Set Attention Block (SAB), the authors employ the Multihead Attention Block (MAB), which is a variant of the Transformer’s encoder, with positional encoding and dropout removed. Given matrices \(X,Y\in \mathbb {R}^{n\times d}\), the MAB with parameter \(\omega\) is defined as follows:
\(\mathrm{MAB}(X,Y)=\mathrm{LN}(H+\mathrm{rF}(H))\ \text{ with }\ H=\mathrm{LN}(X+\mathrm{Multihead}(X,Y,Y;\omega)),\)
where \(\mathrm{LN}\) is layer normalization [8] and \(\mathrm{rF}\) is an arbitrary row-wise feed-forward layer. The SAB can be formulated as \(\mathrm{SAB}(X)=\mathrm{MAB}(X,X)\). The higher-order interactions of elements can be captured by stacking multiple SABs. To reduce the high computational cost associated with self-attention, the Induced Set Attention Block (ISAB) is designed based on SAB and inspired by inducing point methods used in sparse Gaussian processes. The ISAB containing m inducing points I, i.e., m trainable d-dimensional vectors \(I\in \mathbb {R}^{m\times d}\), can be formulated as
\(\mathrm{ISAB}_m(X)=\mathrm{MAB}(X,h)\in \mathbb {R}^{n\times d}\ \text{ with }\ h=\mathrm{MAB}(I,X)\in \mathbb {R}^{m\times d},\)
where h is invariant to permutations of X and \(\mathrm{ISAB}_m(X)\) is permutation-equivariant to X. In this way, the computational time is reduced from \(O(n^2)\) in SAB to \(O(mn)\) in ISAB. The Pooling by Multihead Attention (PMA) with k seed vectors \(S\in \mathbb {R}^{k\times d}\), i.e., \(\mathrm{PMA}_k(Z)=\mathrm{MAB}(S,\mathrm{rF}(Z))\), is developed to aggregate the encoded feature set \(Z\in \mathbb {R}^{n\times d}\). This mechanism allows the model to adaptively weigh the importance of different elements in the set, which is particularly useful in scenarios requiring multiple correlated outputs, such as clustering tasks. Generally speaking, in Set Transformer, the input set is encoded by a stack of SABs or ISABs, followed by aggregation using PMA. The aggregated representation is then passed through a feed-forward network to produce the output. It is theoretically proved that Set Transformer is capable of universally approximating any set function.
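The compact sketch below mirrors the MAB and SAB definitions above using PyTorch's built-in multi-head attention; the hidden size, head count, and feed-forward layer are illustrative choices rather than the original configuration, and dropout and positional encoding are omitted as in the Set Transformer.

```python
import torch
import torch.nn as nn

class MAB(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.rff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())   # row-wise FF

    def forward(self, X, Y):
        H = self.ln1(X + self.attn(X, Y, Y, need_weights=False)[0])
        return self.ln2(H + self.rff(H))

class SAB(nn.Module):                            # SAB(X) = MAB(X, X)
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.mab = MAB(dim, heads)

    def forward(self, X):
        return self.mab(X, X)

sab = SAB()
X = torch.randn(2, 10, 64)                       # a batch of sets of 10 elements
print(sab(X).shape)                              # torch.Size([2, 10, 64])
```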
3.6.2 Improving Set Transformer.
Through gradient analysis, Zhang et al. [144] show that DeepSets and Set Transformer are prone to vanishing and exploding gradients when more layers are stacked. They also observe that layer normalization hurts performance, because its invariance reduces representation power and discards information that is potentially useful for prediction. To tackle these issues and make set neural networks deeper, DeepSets++ (DS++) and Set Transformer++ (ST++) are developed by introducing equivariant residual connections (ERC) and set norm. ERC is a refined residual connection adhering to the clean-path principle, avoiding potential gradient issues by maintaining a clean path from input to output. Set norm is a novel normalization layer that standardizes each set over the minimal number of dimensions and transforms features individually. This mechanism preserves most of the mean and variance information, avoiding the invariance issues associated with layer normalization. By integrating ERC and set norm into the encoders of DeepSets and Set Transformer, respectively, the enhanced models DS++ and ST++ are constructed. These improvements enable the models to reach greater depth with comparable performance, effectively addressing the instability of the original architectures.
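As a rough illustration of the set norm idea, the sketch below normalizes each set jointly over its element and feature dimensions and then applies a per-feature affine transform; the exact parameterization in [144] may differ, so this should be read as an assumption-laden approximation rather than the authors' layer.

```python
import torch
import torch.nn as nn

class SetNorm(nn.Module):
    """Normalize each set over all of its elements and features jointly,
    then apply a learnable per-feature scale and shift."""
    def __init__(self, d, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d))
        self.beta = nn.Parameter(torch.zeros(d))
        self.eps = eps

    def forward(self, x):              # x: (batch, n, d), one set per batch entry
        mean = x.mean(dim=(1, 2), keepdim=True)
        var = x.var(dim=(1, 2), keepdim=True, unbiased=False)
        return (x - mean) / torch.sqrt(var + self.eps) * self.gamma + self.beta

print(SetNorm(16)(torch.randn(4, 10, 16)).shape)  # torch.Size([4, 10, 16])
```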
3.6.3 Employing Set Transformer as Encoder.
There are several works employing Set Transformer as an encoder to construct new models for set function learning. Li et al. [67] propose an effective recipe representation learning model named Reciptor, which jointly processes ingredients and cooking instructions. The ingredient set is encoded by a Set Transformer, enhancing the model's ability to capture interdependence among elements. A pretrained skip-instruction model is employed to encode the cooking instructions, generating initial embeddings and providing a context-aware representation of the entire cooking process. These initial embeddings are subsequently processed by a forward long short-term memory network to produce the final instruction embeddings. To further optimize the learned embeddings, the authors utilize a novel knowledge graph-based triplet sampling loss [22], ensuring that semantically related recipes are closer in the latent space. The embeddings are refined by combining this triplet loss with a cosine similarity loss between ingredient and instruction embeddings. Reciptor outperforms baselines on two newly designed downstream classification tasks. Based on Set Transformer, Gim et al. [38] design a set-based cooking recommender called RecipeBowl, which takes a given set of ingredients and cooking tags and outputs corresponding ingredient and recipe choices. Set Transformer is employed as the encoder to build a comprehensive representation of the ingredient set. A two-way decoder maps the representation into two distinct embedding spaces: one for predicting missing ingredients and the other for recommending relevant recipes. The model is trained using a combination of a negative log-likelihood loss based on Euclidean distances and a cosine embedding loss for the recipe prediction task, ensuring that the predicted ingredients and recipes are aligned with their actual counterparts in the embedding space.
Zhang et al. [142] present efficient algorithms to address the challenges of relational reasoning in cooperative Multi-Agent Reinforcement Learning (MARL) with permutation-invariant agents. They leverage Set Transformer to implement complex relational reasoning among agents in MARL. Two algorithms are proposed: a model-free and a model-based offline MARL algorithm. The model-free approach employs transformers to estimate the action-value function, incorporating a pessimistic policy to handle distributional shifts in offline settings. The model-based approach estimates the system dynamics with transformers, also utilizing a pessimistic policy. The key contribution is deriving generalization error bounds for transformers in MARL, demonstrating that these bounds are independent of the number of agents and less sensitive to the depth of the network. Jurewicz et al. [55] develop the Set Interdependence Transformer (SIT), an efficient set encoder for set-to-sequence tasks. The set-to-sequence model is established by combining SIT with a permutation decoder. A Set Transformer serves as the base set encoder, learning permutation-equivariant representations of individual elements and a permutation-invariant representation of the entire set. SIT enhances these representations with an augmented attention mechanism to capture higher-order interdependencies. The permutation decoder uses an improved pointer attention mechanism to select elements, forming coherent output sequences. This approach effectively handles sets of varying cardinalities and generalizes well to unseen set sizes, as shown in experiments.
3.6.4 Extending Set Transformer to Meta-learning.
Lee et al. [66] propose Meta-Interpolation, a universal task augmentation method designed for few-task meta-learning. Meta-Interpolation utilizes Set Transformer to process the embeddings of support and query sets from different tasks and learn a parameterized set function, mapping sets of task embeddings to new embeddings that mix features from different tasks. This process creates new tasks that have unique features drawn from the tasks being interpolated. Bilevel optimization is employed to jointly optimize parameters of the meta-learner and the set function. The upper-level optimization aims to minimize the loss on meta-validation tasks, ensuring that this augmentation strategy improves generalization. The lower-level optimization adapts the meta-learner to augmented tasks, reducing the risk of overfitting to the limited meta-training set. This method theoretically regularizes the meta-learner by enforcing a distribution-dependent regularization, which decreases the Rademacher complexity and thus improves the generalization.
3.6.5 Other Methods Utilizing Attention Mechanisms.
There are multiple works utilizing attention mechanisms to learn set functions, similar to Set Transformer. Girgis et al. [39] develop an encoder–decoder framework, Latent Variable Sequential Set Transformers, termed AutoBots, to deal with the challenging task of predicting the future trajectories of multiple interacting agents. The permutation-equivariant encoder processes sequences of sets representing the agents' states over time, incorporating both temporal and social information through multiple Multi-Head Self-Attention modules. The decoder utilizes multiple matrices of learnable seed parameters, enabling the model to capture the multi-modal nature of future trajectories. This allows for the generation of diverse and socially consistent predictions across the entire scene in a single forward pass. The model achieves state-of-the-art performance, particularly in trajectory predictions that adhere to real-world constraints such as road layouts. Zhao et al. [150] propose Point Transformer, a novel architecture tailored for unordered 3D point sets. Point Transformer mainly consists of SortNet and a local-global attention module. SortNet is a neural network that learns to sort the input point cloud into a specific order based on selected features. Once the points are sorted, the Point Transformer layer applies local attention to aggregate features of each point from its k nearest neighbors, capturing fine-grained details and local geometric structures within small regions. Following the local attention, global attention aggregates features from the entire point cloud or larger regions, complementing the local information and enabling the network to understand the overall structure. The outputs from the local and global attention modules can be combined, either through concatenation or a weighted sum, to form a comprehensive feature representation for each point, which can be used in downstream tasks for learning the underlying shape.
3.7 Deep Set Prediction Network-based Methods
This section introduces the Deep Set Prediction Network (DSPN), an effective method for set prediction problems, and discusses relevant works that build on DSPN to enhance its capabilities. The structure of this section is illustrated in Figure 2(e).
3.7.1 Deep Set Prediction Network.
DSPN [145] is a model designed to predict sets from feature vectors, addressing the issue that previous methods such as RNNs produce discontinuous and inaccurate predictions because they ignore the unordered nature of sets. DSPN employs the same encoder module for both the encoding and decoding processes. Concretely, the encoder \(g_{\text{enc}}\) maps the input set X into a latent representation \(z=g_{\text{enc}}(X)\). The decoder \(g_{\text{dec}}\) predicts a set from this representation, i.e., \(\hat{X}=g_{\text{dec}}(z)\), by applying gradient descent from a learnable initial guess to find a set whose latent representation matches that of the input set. This process can be regarded as a nested optimization. In the inner loop, the predicted set is refined iteratively to minimize the difference between its encoding and the target representation, while in the outer loop, the weights of the model are trained by minimizing the loss between the predicted set and the true set. Formally, the representation loss and decoder are defined as
\[L_{\text{repr}}(\hat{X},z)=\big \Vert g_{\text{enc}}(\hat{X})-z\big \Vert ^2,\qquad g_{\text{dec}}(z)=\arg \min _{\hat{X}}L_{\text{repr}}(\hat{X},z),\qquad (7)\]
where the permutation-invariant \(L_{\text{repr}}\) compares the encoding of \(\hat{X}\) with the latent representation z of X. Since \(g_{\text{enc}}\) is a neural network, gradient descent is applied for T steps to solve the minimization in Equation (7), starting from an initial set \(\hat{X}^{(0)}\). At the same time, the weights of \(g_{\text{enc}}\) are trained to minimize the set loss \(L_{\text{set}}(\hat{X}^{(T)},Y)\), where \(L_{\text{set}}\) can be a Chamfer loss or a pairwise loss, so as to obtain an appropriate representation z. In general set prediction, there is no set encoder, since the input is usually a vector instead of a set; in this case, a term is added to the outer-loop loss to ensure \(g_{\text{enc}}(Y)\approx z\). DSPN shows significant improvements over traditional methods, particularly in providing accurate set predictions without requiring complex postprocessing, opening up new possibilities for set prediction problems.
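A minimal sketch of the DSPN inner loop is given below, assuming a generic differentiable permutation-invariant encoder; the hyperparameters and the toy sum-pooling encoder are illustrative stand-ins, not the configuration used in [145].

```python
import torch

def dspn_decode(encoder, z, init_set, inner_steps=10, lr=1.0):
    """Inner loop of DSPN: refine a predicted set so that its encoding matches z.

    encoder : permutation-invariant set encoder g_enc
    z       : target latent representation, shape (d_latent,)
    init_set: learnable initial guess, shape (n, d_elem)
    """
    pred = init_set.clone().requires_grad_(True)
    for _ in range(inner_steps):
        repr_loss = ((encoder(pred) - z) ** 2).sum()          # L_repr
        (grad,) = torch.autograd.grad(repr_loss, pred, create_graph=True)
        pred = pred - lr * grad                               # gradient step on the set
    return pred  # the outer loop compares pred to the target set with a set loss

# Minimal usage with a sum-pooling encoder (an illustrative stand-in for g_enc).
encoder = lambda s: torch.tanh(s @ torch.ones(3, 8)).sum(dim=0)
target_z = torch.randn(8)
init = torch.zeros(5, 3)
predicted_set = dspn_decode(encoder, target_z, init)
print(predicted_set.shape)  # torch.Size([5, 3])
```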
3.7.2 Improving DSPN.
There are several works that improve DSPN and extend its application to specific scenarios. Zhang et al. [146] design a differentiable set pooling method called FSPool, which consists of two operations, sorting and weighted summation. Instead of treating set elements as whole units, FSPool sorts each feature independently across the set elements. After sorting, a weighted sum is computed, with the weights determined by a learnable calibrator function. FSPool can handle sets of different sizes using a continuous representation of the weights. In experiments involving both bounding box and state prediction, the authors combine DSPN and other models, such as an MLP, with FSPool, max-pooling, and sum-pooling. The results indicate that simply replacing the pooling function in an existing model with FSPool leads to better results and faster convergence. By replacing the gradient descent updates of DSPN with a transformer that provides more expressive and efficient updates, Kosiorek et al. [58] propose the Transformer Set Prediction Network (TSPN), where an MLP is utilized to predict the number of points from the input embedding and decide the size of the initial predicted set. TSPN initializes the predicted set with a random set of points sampled from a learned distribution, enhancing flexibility. The transformer is employed to iteratively update set elements, leveraging the self-attention mechanism to model dependencies between elements and output the predicted set. Compared to DSPN, TSPN achieves greater expressiveness at lower computational cost. Zhang et al. [141] develop a framework called Deep Energy-based Set Prediction (DESP), which treats set prediction as a problem of conditional density estimation rather than optimization with set-specific losses. This method utilizes deep energy-based models to capture the distribution of sets given some input features. The energy function \(E_\theta (x,Y)\) assigns a scalar energy to a pair of input features x and a set Y, where a lower energy indicates a higher likelihood of the set. Given the input, the probability of a set is \(P_\theta (Y|x)=\frac{1}{Z(x;\theta)}\exp (-E_\theta (x,Y))\), where \(Z(x;\theta)\) is a partition function. In the proposed framework, two permutation-invariant energy functions \(E_{DS}(x,Y)\) and \(E_{SE}(x,Y)\) are derived from DeepSets and DSPN, respectively. These energy functions can be used to formulate deep energy-based models, which can be trained by minimizing the negative log-likelihood, allowing for the approximation of the true data distribution without requiring explicit pairwise comparison between predicted and ground truth sets. DESP utilizes a stochastically augmented prediction algorithm, which helps explore multiple modes to generate diverse outputs. The final predicted set is determined as the set with the lowest energy among all captured sets. DESP extends the capabilities of DSPN to more effectively handle the inherent complexity and stochasticity of real-world tasks.
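To illustrate the core idea of FSPool, the sketch below implements a fixed-size variant that sorts each feature independently across the set and applies a learned weighted sum; the original method instead uses a continuous piecewise-linear calibrator to handle variable set sizes, which we omit here.

```python
import torch
import torch.nn as nn

class SimpleFSPool(nn.Module):
    """Featurewise sort pooling for fixed-size sets: sort every feature
    independently across the set, then take a learned weighted sum.
    (The original FSPool uses a continuous calibrator for variable sizes.)"""
    def __init__(self, n_elements, d_features):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(n_elements, d_features))

    def forward(self, x):                       # x: (batch, n_elements, d_features)
        sorted_x, _ = torch.sort(x, dim=1, descending=True)  # per-feature sort
        return (sorted_x * self.weights).sum(dim=1)          # (batch, d_features)

pooled = SimpleFSPool(10, 16)(torch.randn(4, 10, 16))
print(pooled.shape)  # torch.Size([4, 16])
```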
Zhang et al. [148] define a relaxation of common set equivariance, the multiset equivariance, which does not require equal elements in a multiset to remain equal after transformation. This property is crucial for handling multisets with duplicate elements, enabling models to process them with different strategies. Additionally, the authors propose exclusive multiset equivariance, which describes models that are multiset equivariant but not set equivariant, aiming to tackle the issue that set-equivariant functions cannot represent certain functions on multisets. It is proved that DSPN satisfies the exclusive multiset equivariance when selecting the appropriate set encoder. To reduce memory and computational requirements, the implicit DSPN (iDSPN) is developed by employing approximate implicit differentiation to replace the gradient descent of DSPN. This method avoids storing intermediate gradient steps by directly computing the gradient at the optimal point, making the optimization process more efficient. iDSPN shows superior performance compared to traditional set-equivariant models, especially in handling multisets and large-scale set prediction tasks.
3.8 Deep Submodular Function-based Methods
Submodular functions form an important subclass of set functions, and learning submodular functions is therefore an important area within set function learning. In this section, we introduce Deep Submodular Functions (DSF), which focus on learning submodular functions, and discuss related research that improves the capabilities of DSF. The structure of this section is outlined in Figure 2(f). We begin by introducing the definition of a submodular function.
Definition 3.2 (Submodular Function).
Given a ground set V and a set function \(f: 2^V \rightarrow \mathbb {R}\), f is submodular if for any two sets \(A,B\subseteq V\), it holds that \(f(A)+f(B) \ge f(A\cup B)+f(A\cap B)\). In particular, if \(f(A) + f(B) = f(A\cup B) + f(A\cap B)\) holds for all \(A,B\subseteq V\), then f is modular.
Submodular functions are extensively used in machine learning [28, 49, 64, 140] and have several important properties: (1) Diminishing returns: For any two sets \(A,B\) such that \(A \subseteq B \subseteq V\) and any element \(s\notin B\), a submodular function f satisfies \(f(A\cup \lbrace s\rbrace)-f(A) \ge f(B\cup \lbrace s\rbrace)-f(B)\), which means the incremental gain of adding an element to a set decreases as the set becomes larger. (2) Natural concavity: Submodular functions are viewed as the discrete analog of concave functions, because the property of diminishing returns is akin to the definition of concavity. (3) Modularity: This property implies additivity, meaning that a modular function f satisfies \(f(A\cup B)=f(A)+f(B)\) for disjoint sets A and B. Modularity simplifies machine learning tasks by ensuring linearity [14, 27], particularly in feature selection [28], where it facilitates the computation of the relevance of feature sets. (4) Monotonicity: The value of a monotone submodular function does not decrease when additional elements are added to a set.
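The diminishing-returns property can be checked concretely on a coverage function, a classic monotone submodular function; the toy ground set below is purely illustrative.

```python
# Coverage functions are a classic example of monotone submodular functions:
# f(A) = |union of the sets indexed by A|.
ground = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6},
    "d": {6, 7},
}

def coverage(A):
    covered = set()
    for key in A:
        covered |= ground[key]
    return len(covered)

A = {"a"}
B = {"a", "b", "c"}          # A is a subset of B
s = "d"

gain_A = coverage(A | {s}) - coverage(A)   # marginal gain of s w.r.t. the smaller set
gain_B = coverage(B | {s}) - coverage(B)   # marginal gain of s w.r.t. the larger set
print(gain_A, gain_B)        # 2 1  -> gain_A >= gain_B (diminishing returns)
```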
3.8.1 Deep Submodular Functions.
DSFs [28] are a special class of submodular functions that strictly generalizes many existing submodular functions and inherits some properties from them. For example, DSF can represent decomposable submodular functions, which can be expressed as sums of concave functions composed with modular functions. A notable subclass of these functions is the feature-based submodular function [128], which can be formulated as \(f(X)=\sum _{u\in U}w_u\phi _u(m_u(X))\), where \(\phi _u\) denotes a non-decreasing univariate normalized concave function, \(m_u\) represents a feature-specific modular function, and \(w_u\) is a feature weight, all of which are non-negative. To overcome the limitation that features themselves cannot interact in feature-based submodular functions, an additional layer of nested concave functions is employed, i.e., \(f(X)=\sum _{s\in S}\omega _s\phi _s(\sum _{u\in U}w_{s,u}\phi _u(m_u(X)))\), where S denotes a set of meta-features, \(\omega _s\) represents a meta-feature weight, and \(\phi _s\) is a non-decreasing concave function. The term \(w_{s,u}\) is the weight of feature u under meta-feature s. By recursively applying such layers, we can derive the DSF. Consider a series of disjoint sets \(V^{(0)},V^{(1)},V^{(2)},\ldots ,V^{(K)}\), where \(V^{(0)}\) is the ground set, \(V^{(1)}\) is the feature set, \(V^{(2)}\) is the meta-feature set, \(V^{(3)}\) is the meta-meta-feature set, and so on up to \(V^{(K)}\), with each set representing a layer. Denoting the size of \(V^{(i)}\) as \(d^i=|V^{(i)}|\), we can employ a matrix \(w^{(i)}\in \mathbb {R}_+^{d^i\times d^{i-1}}, i\in \lbrace 1,2,\ldots ,K\rbrace\) to connect two consecutive layers. The element at row \(v^i\) and column \(v^{i-1}\) of \(w^{(i)}\) is \(w_{v^i}^{i}(v^{i-1})\), so that each row \(w_{v^i}^{i}:V^{(i-1)}\rightarrow \mathbb {R}_+\) defines a modular function over \(V^{(i-1)}\); the matrix thus contains \(d^i\) such modular functions. Moreover, given non-negative non-decreasing concave functions \(\phi _{v^i}:\mathbb {R}_+\rightarrow \mathbb {R}_+\) and any set \(A\subseteq V^{(0)}\), a K-layer DSF \(f:2^{V^{(0)}}\rightarrow \mathbb {R}_+\) can be formulated as
\[f(A)=\phi _{v^{K}}\Bigg (\sum _{v^{K-1}\in V^{(K-1)}}w_{v^{K}}^{K}(v^{K-1})\,\phi _{v^{K-1}}\Bigg (\cdots \sum _{v^{1}\in V^{(1)}}w_{v^{2}}^{2}(v^{1})\,\phi _{v^{1}}\Big (\sum _{a\in A}w_{v^{1}}^{1}(a)\Big)\cdots \Bigg)\Bigg),\]
where the innermost sum evaluates the first-layer modular functions \(w_{v^{1}}^{1}\) on A.
Having shown the definition of DSF, it can be seen that a DSF is composed of multiple layers, with each layer taking a non-negative linear combination of the previous layer's outputs followed by the application of a concave function. This hierarchical structure enables DSF to capture complex interactions within the data. The layered structure of DSF shares similarities with deep neural networks (DNNs), allowing DNN learning techniques to be extended to DSFs. The authors utilize a max-margin learning approach tailored to maintain submodularity when training DSFs. This learning process adjusts the parameters of the DSF to maximize a margin-based objective, ensuring that the learned function assigns high values to desired subsets while penalizing undesired ones.
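As an illustration of this layered construction, the sketch below evaluates a two-layer DSF with square-root concave functions and random non-negative weights; all sizes and the choice of concave function are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_ground, n_features, n_meta = 6, 4, 2
W1 = rng.random((n_features, n_ground))   # non-negative modular weights (layer 1)
W2 = rng.random((n_meta, n_features))     # non-negative meta-feature weights (layer 2)
w_out = rng.random(n_meta)                # final non-negative mixture weights

phi = np.sqrt                             # non-decreasing concave function

def dsf(A):
    """Two-layer deep submodular function evaluated on a subset A of {0,...,5}."""
    indicator = np.zeros(n_ground)
    indicator[list(A)] = 1.0
    layer1 = phi(W1 @ indicator)          # concave over modular functions
    layer2 = phi(W2 @ layer1)             # concave over non-negative combinations
    return float(w_out @ layer2)

# Diminishing returns still holds for the composed function.
print(dsf({0, 1, 5}) - dsf({0, 1}), ">=", dsf({0, 1, 2, 3, 5}) - dsf({0, 1, 2, 3}))
```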
3.8.2 Extending DSF to Specific Scenarios.
There are some works that extend DSF to specific scenarios and improve its capabilities. Ghadimi et al. [35] propose a novel model called the deep submodular network (DSN), which combines the principles of deep learning with submodular optimization for multi-document summarization. DSN is similar to DSF, with the key difference that DSN employs both modular and submodular functions to construct network blocks, whereas DSF only utilizes modular functions. Consequently, DSN generalizes DSF, making it applicable in a wider range of scenarios. DSN is trained with the L-BFGS-B algorithm [20], which is memory-efficient and suitable for maintaining non-negative weights. Manupriya et al. [76] design a novel approach called Submodular Ensembled Attribution for Neural Networks (SEA-NN), which aims to interpret the contribution of each input feature to a neural network's output, particularly in image-based tasks. The core component of SEA-NN is a submodular score function learned by a DSF, which combines several existing gradient-based attribution methods, such as Integrated Gradients and Smooth Integrated Gradients, to offset the biases of the individual methods. It is trained using heatmaps generated by baseline attribution methods, so as to increase the scores of features that are highly relevant and specific. The learned scoring function re-evaluates the importance of input features by assessing the marginal gain of each feature, reducing the attribution scores of redundant features that may be present in the raw attribution maps. SEA-NN is model-agnostic and can be applied to various scenarios.
3.8.3 Addressing the Limitation of DSF.
DSF models submodular functions as nested compositions of concave functions over modular functions, but it does not provide a method for selecting these concave functions, complicating its practical application. To bridge this gap, De et al. [25] introduce a novel family of neural networks, FLEXSUBNET, to estimate both monotone and non-monotone submodular functions. FLEXSUBNET models submodular functions by recursively applying concave functions to modular functions and allows these concave functions to be learned from data, enhancing expressiveness. The core of FLEXSUBNET is a simple recursive chain, a restricted topology in which each node of the chain shares the same learnable concave function. Depending on the monotonicity of the target function, two scenarios are considered: (1) A monotone submodular function is learned through a recursive model that, at each step, computes a linear combination of a previously computed submodular function and a modular function, which is then passed through a learnable concave function to produce a composed submodular function. (2) A non-monotone submodular function is also learned through a recursive model, where a non-monotone concave function is applied to a modular function. The model can be trained using (set, value) pairs or (perimeter-set, high-value-subset) pairs, with applications in subset selection tasks where high-value subsets need to be extracted from larger sets.
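The recursive-chain idea can be sketched as follows, with a fixed mixing coefficient and a fixed square-root concave function standing in for the learned components of FLEXSUBNET; this is an illustrative approximation of the monotone case, not the authors' architecture.

```python
import numpy as np

def flexsubnet_like(A, modular_weights, lam=0.5, phi=np.sqrt):
    """Recursive chain sketch: at each step, combine the previous submodular value
    with a new non-negative modular function and pass through a concave function."""
    f = 0.0
    for w in modular_weights:            # one modular function per recursion step
        m = float(w[list(A)].sum()) if A else 0.0
        f = phi(lam * f + (1.0 - lam) * m)
    return f

rng = np.random.default_rng(1)
weights = rng.random((3, 6))             # three steps over a ground set of size 6
print(flexsubnet_like({0, 2, 4}, weights))
```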
3.9 Other Deep Learning Methods
In this section, we introduce other deep learning methods for set function learning, such as GaitSet and RepSet, which do not fall into the preceding categories.
Skianis et al. [113] propose RepSet, a novel permutation-invariant neural network designed to address learning problems over sets of vectors. This model generates several hidden sets, with each containing a set of d-dimensional vectors. The correspondence between the input set and these hidden sets is established using a bipartite matching algorithm. These hidden sets can be updated through backpropagation during training to obtain the representation, which is then passed to a fully-connected layer to compute the output. In addition, ApproxRepSet, a relaxed version of RepSet that leverages fast matrix computations, is designed to handle large sets efficiently.
Li et al. [69] design an ordinary differential equation (ODE)-based method called Exchangeable Neural ODE (ExNODE), capable of capturing the interdependencies between set elements. ExNODE can be applied to both generative and discriminative tasks. For generative tasks, the method employs continuous normalizing flows to model the distribution of sets and generate new samples. In set classification tasks, the embedding vector v of input x is \(v=\mathrm{MaxPool}(\mathrm{ExNODE}\:\mathrm{Solve}(\phi (x)))\), where the linear function \(\phi\) expands the feature dimensions of each input set element and the ExNODE learns the feature representations, which are aggregated by max-pooling. This model achieves fewer parameters and greater efficiency in point cloud classification and set likelihood estimation tasks.
Chao et al. [21] develop a gait recognition network called GaitSet, which regards a gait as a set of independent frames containing gait silhouettes. In this framework, a CNN extracts features from each frame of the gait set independently, capturing detailed spatial information. A Multilayer Global Pipeline extracts features at different levels from multiple layers of the CNN, combining them to form a comprehensive representation that preserves gait details. These frame-level features are then aggregated into set-level features by set pooling operations such as mean-pooling and attention mechanisms. The set-level features are further processed by Horizontal Pyramid Mapping, which splits the feature map into strips at multiple scales. This approach allows the model to capture both global and local features, enhancing the discriminative power of the representation. The experiments demonstrate the model's effectiveness with a limited number of frames and its ability to integrate information from different levels.
Shi et al. [112] propose a novel estimation method called Deep Message Passing on Sets (DMPS), which is designed to handle set-structured data by incorporating relational learning, bridging the gap between learning on graphs and learning on sets. This method begins by constructing a latent graph that represents the relational structure between set elements. This is achieved through a deep kernel learning approach, where each set element is transformed into a feature space and a kernel function is applied to capture similarities between elements. Message passing is adopted to update each set element based on a weighted sum of all elements, leveraging relational information. In particular, a stack of message-passing layers is utilized to capture higher-order dependencies between set elements, but this can result in over-smoothing and vanishing gradients. To address these issues, two modules, the Set-Denoising and Set-Residual blocks, are designed to integrate with DMPS. The Set-Denoising block alleviates over-smoothing by combining the original and updated features of set elements, while the Set-Residual block maintains distinctive features among set elements, preventing feature homogenization.
Guo et al. [24] develop a prototype-oriented optimal transport (POT) approach to improve representation learning for set-structured data. In this framework, a set of learnable global prototypes is maintained; for each set j, a distribution \(Q_j\) over these prototypes is computed, and the set is represented by a vector \(h_j\) indicating the mixture of prototypes relevant to it. To align \(Q_j\) with the empirical distribution \(P_j\) over the set's elements, the model employs an Optimal Transport (OT) distance, which measures the effort required to transform \(P_j\) into \(Q_j\), providing a natural mechanism for training the model to capture the set's statistics. The objective is to minimize this OT distance, encouraging the model to learn both global prototypes and set-specific representations \(h_j\) effectively. POT can be integrated into existing architectures such as summary networks and applied to various tasks, including few-shot classification and meta-generative modeling.
Zhang et al. [147] design an innovative model that optimizes permutation matrices to learn representations of set-structured data. The key component is the Permutation-Optimization module, which rearranges sets by minimizing a cost function via gradient descent. Given an input set X represented as a matrix \(X = [x_1, x_2, \ldots , x_n]^T\), where each \(x_i\) is a feature vector, the algorithm initializes a permutation matrix \(P^{(0)}\) either uniformly or through linear assignment. The total cost function \(c(P)\) is defined as
\[c(P)=\sum _{k<k^{\prime }}\sum _{i,j}P_{ik}\,P_{jk^{\prime }}\,C_{ij},\]
where \(C_{ij}\) is the pairwise ordering cost between elements i and j, and k and \(k^\prime\) are element positions. \(C_{ij}\) measures permutation quality to determine the optimal element ordering. However, the overall complexity of the algorithm is \(\Theta (n^3)\) per iteration, making it impractical for large sets.
Previous set encoding methods, such as DeepSets [140] and Set Transformer [64], implicitly assume that the entire set can be stored in memory and accessed simultaneously, which is unrealistic when processing large sets or streaming data. To overcome this limitation, Bruno et al. [16] define a new property called Mini-Batch Consistency (MBC), which is essential for maintaining consistent set representations across different mini-batches. MBC requires that the encoding of a full set be equivalent to the aggregation of the encodings of its mini-batches. To obtain the set representation, the authors develop the Slot Set Encoder (SSE), where each slot is a learnable vector that interacts with set elements to capture their features. The attention mechanism in SSE computes attention weights using the dot product of the slots and the set elements, followed by a sigmoid activation function, eliminating the batch-dependent normalization that breaks MBC. To capture interactions across set elements, a hierarchical slot set encoder is constructed by combining a stack of SSEs with a final PMA module from Set Transformer. The SSE can process sets in mini-batches, making it suitable for real-world applications with large sets. To overcome the limitations that SSE supports only sigmoid activation and cannot adopt more expressive non-MBC modules, Willette et al. [131] develop the Universal MBC (UMBC) framework, which extends SSE to more activation functions, such as softmax, enabling more expressive set encoders. The authors also propose an efficient training algorithm that approximates the full-set gradient by aggregating gradients from subsets of the set, maintaining constant memory overhead. This approximation provides an unbiased gradient estimate, significantly outperforming biased estimates derived from randomly sampled subsets. Notably, the UMBC framework is capable of universally approximating any continuous permutation-invariant function.
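The MBC property itself is easy to illustrate with a sum-decomposable encoder (which is trivially MBC, unlike attention-based encoders that require the SSE construction): encoding the full set at once gives the same result as aggregating the encodings of its mini-batches.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
phi = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 8))

def encode(chunk):
    """Per-chunk encoding of a sum-decomposable (hence MBC) set encoder."""
    return phi(chunk).sum(dim=0)

full_set = torch.randn(100, 4)

# Encoding the full set at once ...
z_full = encode(full_set)

# ... equals aggregating the encodings of arbitrary mini-batches of the set.
z_stream = sum(encode(chunk) for chunk in torch.split(full_set, 32))

print(torch.allclose(z_full, z_stream, atol=1e-5))  # True
```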
4 Other Methods
This section discusses several non-deep learning methods for set function learning, such as kernel-based and decision tree-based methods.
Kernel-based approaches for learning set functions typically define a distance or similarity measure (kernel) to establish correspondences between sets. This measure is often combined with instance-based machine learning methods, such as Support Vector Machines (SVMs). Nikolentzos et al. [84] utilize the Earth Mover's Distance metric and an SVM with the Pyramid Match Graph Kernel. Buathong et al. [17] propose more efficient kernel methods by leveraging Reproducing Kernel Hilbert Space embeddings. They introduce Double Sum (DS) kernels, which compute the sum of kernel evaluations over all pairs of elements across two sets. However, DS kernels often lack strict positive definiteness, limiting their applicability. To overcome this limitation, the authors develop Deep Embedding kernels, applying a radial kernel in Hilbert space over the canonical distance induced by DS kernels. The proposed kernel methods enhance Gaussian Process models in prediction and optimization tasks with set-valued inputs. However, these kernel-based methods usually suffer from high computational complexity and memory overhead, since they compare all sets to each other.
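A double-sum set kernel can be sketched in a few lines; the RBF base kernel, the normalization by set sizes, and the function names below are our own illustrative choices.

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    """RBF base kernel between two individual elements."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def double_sum_kernel(set_a, set_b, gamma=0.5):
    """DS-style kernel: sum of base-kernel evaluations over all cross-set element
    pairs, here normalized by the set sizes (normalization is a design choice)."""
    total = sum(rbf(a, b, gamma) for a in set_a for b in set_b)
    return total / (len(set_a) * len(set_b))

A = np.random.randn(5, 3)    # a set of five 3-dimensional elements
B = np.random.randn(8, 3)
print(double_sum_kernel(A, B))
```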
Wendler et al. [129] develop novel algorithms for learning Fourier-sparse set functions using non-orthogonal Fourier transforms within a discrete-set signal processing framework [94], which generalizes classical signal processing to set functions. The proposed algorithm, Sparse Set Function Fourier Transform, computes the non-zero Fourier coefficients by utilizing Fourier support, which refers to the set of indices where the Fourier coefficients are non-zero. The Fourier support is determined by iteratively restricting the set function to subsets of its domain and identifying those subsets where the Fourier transform of the restricted function significantly contributes to the original set function. This algorithm requires \(O(nk-k\log k)\) queries and \(O(nk^2)\) operations, where n is the size of the ground set and k is the number of non-zero Fourier coefficients, achieving significant improvements over the naive fast Fourier transform.
Lu et al. [75] propose Set Locality Sensitive Hashing (SLoSH), an efficient algorithm for set retrieval by leveraging Sliced-Wasserstein Embedding (SWE) and Locality-Sensitive Hashing (LSH). The SWE embeds each set into a lower-dimensional space through random linear projections, followed by sorting and calculating the Monge couplings, preserving the properties of the Wasserstein distance. The computational complexity of this embedding process is \(O(Ln(d+\log n))\), where n is the set size and L is the number of projections. Having obtained the vector representation, the LSH, sensitive to the Sliced-Wasserstein distance, is employed to find approximate nearest sets. The authors provide theoretical bounds for the SLoSH, ensuring the computational efficiency of both the embedding and hashing steps.
Feldman et al. [33] focus on the complexity of learning submodular functions on the Boolean hypercube \(\lbrace 0, 1\rbrace ^n\). They prove that any submodular function f can be approximated within \(\epsilon\) in \(\ell _2\) norm by a real-valued decision tree of depth \(O(1/\epsilon ^2)\). The function f is represented as a binary decision tree T with a rank of at most \(4/\epsilon ^2\), ensuring \(\Vert T - f\Vert _2 \le \epsilon\). Leveraging this approximation, the authors develop a Probably Approximately Correct (PAC) learning algorithm for submodular functions. This algorithm runs in time \(\tilde{O}(n^2)\cdot 2^{O(1/\epsilon ^4)}\), where n is the number of variables, significantly improving learning efficiency under the uniform distribution. They also establish an information-theoretic lower bound of \(2^{\Omega (1/{\epsilon }^{2/3})}\) and a computational lower bound of \(n^{\Omega (1/{\epsilon }^{2/3})}\), implying optimality (up to the constant in the power of \(\epsilon\)) of their algorithms. Raskhodnikova et al. [100] introduce a polynomial-time algorithm for learning submodular functions. This method builds on a structural result showing that any submodular function \(f: \lbrace 0,1\rbrace ^n\rightarrow \lbrace 0,1,\ldots ,k\rbrace\) can be represented by pseudo-Boolean \(2k\)-disjunctive normal form (DNF) formula, which extends the traditional DNF formula to handle integer-valued functions, enabling learning submodular functions with techniques similar to those for Boolean functions. The authors propose a PAC learning algorithm, which is a generalization of Mansour’s PAC-learner for k-DNFs. This algorithm transforms the submodular function into a pseudo-Boolean k-DNF, applies random restrictions and utilizes Fourier analysis to identify significant coefficients. The algorithm is efficient, with runtime polynomial in n, \(k^{O(k \log k/\epsilon)}\), \(1/\epsilon\), and \(\log (1/\delta)\), where \(\epsilon\) and \(\delta\) are accuracy and confidence parameters respectively. The authors also establish lower bounds on the complexity of learning submodular functions, demonstrating the method’s optimality.
5 Applications and Relevant Datasets
In this section, we introduce several applications of set function learning methods. These methods have shown great potential in scenarios where data can be naturally represented as sets, and the order of elements is not inherently important. As researchers continue to explore the capabilities of set function learning, the range of applications is expanding across various domains that require processing and reasoning over unordered sets of data.
5.1 Point Cloud Processing
Set function learning has the potential to revolutionize point cloud processing by treating the point cloud as a set of vectors, where each vector represents the features of a point. In point cloud applications, set function learning methods are mainly used for classification, segmentation, and detection tasks. As point cloud data are widely used in various applications, set function learning methods are likely to play an increasingly important role in this domain.
Point cloud classification aims to determine the category of objects represented by point clouds. Set function learning models such as DeepSets [140], PointNet [95], Set Transformer [64], and other methods [3, 16, 41, 58, 78, 80, 83, 92] extract relevant features from the entire point set. This enables accurate identification of objects such as cars, trees, and buildings, which is particularly important in fields such as robotics [6, 142] and autonomous driving [61, 152].
Point cloud segmentation is the process of labeling each point in a point cloud with a specific class, allowing for detailed scene understanding, such as distinguishing between vehicles and people in autonomous driving [40, 81]. Advanced models such as PointNet++ [96], Point Transformer [150], and other methods [24, 69, 131, 141, 148] achieve good performance on this task by considering both local and global features.
Point cloud detection focuses on identifying and localizing objects within a point cloud, providing bounding boxes around detected items. This application leverages set function learning models [64, 66, 80, 92, 103, 112, 144] to propose regions of interest and refine these regions for precise localization. Point cloud detection ensures the safe and efficient operation of autonomous driving [30, 134] and robotic navigation [44, 107] in dynamic environments.
Empirical comparisons show that PointNet++ and Point Transformer generally outperform DeepSets, PointNet, and Set Transformer in point cloud tasks on ModelNet40 and ShapeNet datasets [32, 150]. In particular, Point Transformer achieves the best performance in point cloud classification and segmentation by leveraging self-attention mechanism. SpiderCNN and PointNet++ excel in segmentation tasks by incorporating hierarchical and local feature extraction techniques. Meanwhile, DuMLP-Pin offers competitive performance while significantly reducing computational complexity, demonstrating its efficiency for classification and segmentation.
5.2 Set Anomaly Detection
Set anomaly detection is another crucial application, aiming to identify outliers within a set by leveraging set function learning models such as DeepSets [140], PointNet [95], and other variants [80, 86, 96, 141, 150]. The process begins by extracting feature representations for each element in the set, which are subsequently aggregated to produce a unified set representation. The core of anomaly detection is to compare each element’s feature against this aggregated representation, assigning an anomaly score to each element based on its deviation from the overall set pattern. This is typically accomplished through a subsequent layer that evaluates the extent of deviation. The framework then outputs a probability distribution over the elements, with higher probabilities indicating a greater likelihood of being outliers. Set anomaly detection is vital in various domains, such as detecting unusual behavior in sensor networks [91] and identifying outlier faces in image sets [114].
Experimental results on the CelebA dataset highlight different strengths among models for set anomaly detection [32]. PointNet and DeepSets offer simplicity but struggle to capture complex interdependencies within sets, limiting their performance. Set Transformer improves performance by incorporating attention mechanisms to model relationships among set elements. However, DuMLP-Pin outperforms these methods by achieving the highest accuracy while significantly reducing parameter complexity. This makes DuMLP-Pin a competitive choice for anomaly detection tasks, particularly in resource-constrained environments.
5.3 Recommendation Systems
Set function learning methods [64, 140] enhance recommendation systems by modeling user-item interactions as sets and effectively capturing user preferences regardless of ordering, which facilitates the accurate representation of user profiles. For example, Reciptor [67] and RecipeBowl [38] both employ Set Transformer to capture relationships between input elements, enabling more accurate recipe recommendations. Set function learning methods also integrate contextual information into the set, enabling context-aware recommendations [2, 145] that adapt to different situations for more relevant suggestions. Additionally, they promote diversity [60] and fairness [19] in recommendations by producing sets that cover a wide range of items, improving collaborative filtering [110] through better aggregation of similar user preferences or items. Moreover, these methods require fewer parameters and enable efficient training on large datasets, making them promising for enhancing recommendation system accuracy and relevance.
Empirical results for recommendation task on Recipe1M dataset show that Reciptor and RecipeBowl outperform DeepSets by effectively modeling ingredient relationships and recipe context, with RecipeBowl achieving the best performance [38, 67]. Reciptor excels in cuisine classification and region prediction, while RecipeBowl focuses on context-aware ingredient and recipe recommendations. In contrast, while DeepSets is computationally efficient, it struggles to capture complex dependencies, resulting in weaker performance.
5.4 Set Expansion and Set Retrieval
Set expansion involves identifying new objects that are similar to a given set of objects and retrieving relevant candidates from a large pool. This process is closely related to set retrieval, where the goal is to efficiently retrieve items from a large dataset that match the characteristics of a target set. In the text concept set retrieval task, the goal is to identify and retrieve words that belong to a specific concept or category based on a given set of example words. For example, starting with {apple, orange, pear}, the aim is to retrieve additional related words such as banana and watermelon, which belong to the same “fruit” category. This task can be viewed as set expansion conditioned on a latent semantic concept, where DeepSets [140] and its variants [12, 64, 95] are particularly effective. In computational advertising, set function learning methods [64, 95, 140] improve advertisement targeting by expanding the set of user preferences or behaviors with additional relevant interests, making advertisements more relevant and effective. Experimentally, DeepSets outperforms all the conventional baselines on the COCO dataset for set retrieval tasks [140].
5.5 Time-series Prediction
Set function learning methods [43, 139] in time-series prediction address the challenges posed by irregular, sparse, and asynchronous data by treating each time series as an unordered set of observations. This eliminates the need for data regularization with interpolation, enabling models to operate directly on raw data and capture the inherent information more effectively. SparseSense [1] processes the sparse and irregular data streams generated by batteryless passive wearables to tackle human activity recognition (HAR). DTS-ERA [124] combines evidential reinforced attention with deep temporal sets for detailed behavioral pattern analysis.
Empirically, SEFT-ATTN achieves competitive performance on mortality prediction tasks by effectively handling asynchronous and unaligned data [43]. SparseSense outperforms traditional baselines in sparse data-stream classification by directly learning from unordered observations without interpolation [1]. DTS-ERA demonstrates superior predictive accuracy on 2D, 3D, and mixed Maze Painting data, further showing its generalization ability in behavior analysis [124].
5.6 Multi-label Classification
Multi-label classification aims to assign multiple labels to a single instance, which is complex due to the dependencies between labels. Set function learning methods [35, 97, 103, 113, 140, 146] can explicitly model these dependencies, improving classification performance. For example, in image tagging, labels such as beach and sun are likely to appear simultaneously, and modeling this relationship can lead to more accurate predictions. In particular, submodular function learning methods [25, 28, 35, 76], with the property of diminishing returns, can be used to capture the idea that adding a label to a smaller set of labels is more informative than adding it to a larger set, which is especially useful for modeling the dependencies between labels.
Experimentally, FSPool outperforms DeepSets, PointNet, and Janossy Pooling on the CLEVR dataset by utilizing sorting-based pooling [146]. RepSet consistently achieves superior performance across datasets by effectively modeling set relationships with a bipartite matching mechanism [113]. In addition, Set-JDS and set-RNN both demonstrate competitive accuracy across multiple datasets [97, 103].
5.7 Molecular Property Prediction
Set function learning methods [12, 66] have achieved significant progress in molecular property prediction, capable of handling complex molecular datasets and enhancing prediction accuracy. For example, EMTO-CPA [143] applies DeepSets to the design of high-entropy alloys (HEAs). By treating the composition of alloys as sets of elements, DeepSets can predict the properties of novel HEAs more accurately. This approach facilitates the exploration of a vast compositional space and the discovery of new materials with desirable properties. In drug discovery, EquiVSet [86] is utilized for compound selection in virtual screening by modeling the hierarchical selection process of compounds.
Empirical evaluations on molecular property prediction tasks show that EquiVSet outperforms DeepSets on the PDBBind dataset by effectively capturing complex dependencies in molecular structures [86]. Similarly, Equilibrium Aggregation demonstrates superior representational power by optimizing a potential function over molecular sets [12], achieving better performance than GCN and GIN on MOLPCBA [45].
5.8 Amortized Inference
Set function learning methods have found significant applications in amortized inference, where we train neural networks to approximate posterior distributions, thus replacing traditional iterative inference approaches with efficient forward passes. Set Transformer [64] addresses amortized clustering by efficiently mapping datasets to cluster structures through set attention blocks. To overcome the limitation that Set Transformer assumes a fixed number of clusters, Lee et al. [65] propose Deep Amortized Clustering, which extends Set Transformer by incorporating recursive filtering steps, capable of generating varying number of clusters depending on dataset complexity. Building on this foundation, Pakman et al. [87] apply set-based architectures to approximate posterior sampling for probabilistic clustering models, which has been demonstrated effective in applications like spike sorting for high-dimensional neural data. Wang et al. [126] introduce Neural Clustering Processes, a framework that combines set attention with GNN for flexible and efficient amortized clustering. Beyond clustering, set neural architectures have also been applied to general probabilistic inference. For instance, the Neural Process family employs set neural architectures to efficiently model functional variability and uncertainty across datasets [50]. Additionally, Müller et al. [82] develop Prior-Data Fitted Networks, which train set neural networks on synthetic priors, achieving fast and scalable Bayesian inference for structured data.
Experimentally, these methods consistently achieve state-of-the-art performance in amortized inference [65, 82, 87, 126], significantly outperforming traditional methods. For instance, Set Transformer outperforms variational methods in clustering tasks, achieving the highest accuracy on benchmark Gaussian mixtures and real-world datasets [64].
5.9 Other Applications
In addition to the applications we have previously summarized, set function learning is also applied in other domains, such as human activity recognition. GaitSet [21] regards gait sequences as sets of frames, capturing the invariant features of human gait across different views and enhancing the ability to recognize individuals based on their walking patterns. CytoSet [137] leverages set modeling to handle the unordered and variable-sized nature of single-cell cytometry data. By utilizing permutation-invariant neural networks, CytoSet can predict clinical outcomes directly from the set of cells, enhancing the model’s ability to capture complex biological patterns. Similarly, set function learning models such as UMBC [131] can be employed to process high-resolution tissue images, improving the accuracy of cancer detection. Empirically, GaitSet, CytoSet, and UMBC demonstrate state-of-the-art performance in human activity recognition, clinical outcomes prediction, and cancer detection, respectively.
5.10 Relevant Datasets
In this section, we introduce some datasets that are commonly utilized to evaluate set function learning methods.
5.10.1 Point Cloud Dataset.
There are some datasets commonly used for point cloud processing tasks. The ModelNet40 dataset [132] consists of 12,311 CAD models from 40 categories of man-made objects. The ShapeNet dataset [138] contains 16,881 3D shapes from 16 categories, each annotated with 50 distinct parts. The Stanford 3D semantic parsing dataset [5] includes Matterport 3D scans of 271 rooms across six areas, annotated with 13 semantic labels like chair, table, and floor. The Point Cloud MNIST 2D dataset converts MNIST [63] images into 2D point clouds, comprising 60,000 training and 10,000 testing samples, with each set containing 34–35 points. The Oxford Buildings Dataset [90] contains 5,062 images of 11 Oxford landmarks, with 5 queries per landmark (55 queries in total) for evaluating object retrieval systems.
5.10.2 Image Dataset.
We summarize some image datasets for anomaly detection, set retrieval, and multi-label classification. CelebA dataset [73] contains 202,599 celebrity face images annotated with 40 Boolean attributes, such as “smiling,” “wearing glasses,” and “blonde hair.” The Celebrity Together dataset [151] includes 194,000 images with 546,000 labeled faces, averaging 2.8 faces per image. The MS COCO dataset [70] comprises 123,000 images labeled with per-instance segmentation masks of 80 classes. Each image includes 0 to 18 objects, with most containing 1 to 3 labels.
5.10.3 Recommendation Dataset.
The following datasets are used to evaluate set function learning methods in recommendation systems. Amazon baby registry dataset [36] contains 29,632 baby registries, each listing 5 to 100 products categorized into groups like “toys” and “furniture.” The Recipe1M dataset [77] consists of 1,029,720 cooking recipes with ingredients, instructions, images, and 1,047 semantic categories parsed from titles, covering 507,834 recipes.
5.10.4 Chemical and Biological Dataset.
Here are some datasets used for molecular property and hematocrit level prediction. The Flow-RBC dataset [144] contains 98,240 training and 23,104 test sets, each representing 1,000 red blood cells with volume and hemoglobin content measurements. The PDBBind dataset [72] provides experimental binding data for 10,776 biomolecular complexes, including 8,302 protein–ligand and 2,474 other complexes. The BindingDB dataset is a public database of measured binding affinities, consisting of 52,273 drug targets with drug-like small molecules.
5.10.5 Multi-modal Dataset.
The following datasets can be used for object detection and set property prediction. SHIFT15M [56] contains 15 million images and videos captured in diverse driving environments with annotations for object bounding boxes, instance segmentation masks, and semantic labels, covering vehicles, pedestrians, road signs, and more. CLEVR [52] is a visual question answering dataset with 70,000 training images and 700,000 questions, plus additional validation and test sets. Questions fall into five types: existence, counting, integer comparison, attribute queries, and attribute comparisons. Each scene contains 3D-rendered objects characterized by size, shape, material and color, forming 96 unique combinations.
6 Discussion and Future Directions
In this survey, we have comprehensively reviewed and discussed various techniques for solving set function learning problems, covering both deep learning and traditional learning methods. By investigating a wide range of methodologies, such as DeepSets [140] and Set Transformer [64], it is evident that significant progress has been achieved in learning complex set functions across various domains, from point cloud processing to recommendation systems. However, several challenges remain. A critical one is the lack of theoretical breakthroughs. Balcan et al. [11] introduce the probably mostly approximately correct (PMAC) model, extending the PAC model to real-valued functions, and demonstrate that submodular functions can be PMAC-learned with an approximation factor of \(O(n^{1/2})\) using a polynomial number of samples. However, research on the learnability of general set functions is still lacking. Another major limitation is that most current methods assume that the entire set can be accessed at once, which is impractical for large sets due to memory constraints. Moreover, in streaming data scenarios, it is crucial that set representations can be updated in real time. Additionally, the potential and advantages of set function learning methods in specific fields have not been fully explored. These challenges highlight several open research directions worthy of further investigation.
— Theoretical analysis: Conducting in-depth theoretical analysis is essential for advancing set function learning. This involves analyzing the learnability of various classes of set functions, assessing the expressiveness and limitations of different models, and establishing generalization bounds. Additionally, exploring the impact of set size and element distributions on model performance can reveal crucial factors affecting performance. These theoretical advancements would provide deeper insights into set function models and valuable guidance for designing more efficient and interpretable frameworks.
— Mini-batch consistency: Ensuring stable predictions across mini-batches during training is vital for resource-constrained environments. Developing techniques such as consistency regularization and batch normalization specifically designed for set inputs can mitigate instability arising from variations in set size and composition. Furthermore, investigating the impact of set composition, diversity, and size on mini-batch consistency enables the development of more stable and robust training strategies for set function models.
— Dynamic data handling: In scenarios such as sensor networks, where data streams continuously, it is critical to develop adaptive architectures that can process sets of varying sizes efficiently and handle streaming data. The key idea is to develop online learning algorithms for set functions that incrementally update models without retraining on the entire dataset. Additionally, exploring techniques to manage concept drift, where the data distribution evolves over time, is important for maintaining model performance on set-based data streams.
— Domain-specific enhancements: To further exploit the potential of set function learning in processing set-structured data, it is important to tailor set function models to specific domains such as multi-object detection, document classification, and drug discovery. These adaptations should preserve permutation invariance while accounting for the unique requirements of different data types. By incorporating domain knowledge, optimizing architectures for specific relational patterns, or employing well-designed loss functions, such specialized models can outperform universal frameworks.
— Hybrid approaches: Combining set function learning with other machine learning paradigms can significantly improve applicability and performance. For instance, integrating set function learning with graph neural networks can enhance relational reasoning, while incorporating sequence models enables handling tasks involving both sequential and unordered data. Additionally, exploring the synergy between set function learning and reinforcement learning can unlock new possibilities for complex decision-making in set-based dynamic environments, such as resource allocation and planning.
—
Few-shot and transfer learning: Few-shot and transfer learning techniques are promising for improving the generalization of set function models under limited data and for enabling effective knowledge transfer across related tasks. Meta-learning algorithms designed for set functions can support adaptation to new tasks from only a few examples, while transfer learning with pre-trained set function models can accelerate domain adaptation. Self-supervised learning that exploits the inherent structure of sets can further enhance few-shot performance.
—
Graph prediction: Extending the principles of set function learning to graph structures offers significant potential for predicting complex structures and relationships. Developing set-to-graph architectures that map unordered sets to structured graph outputs can advance applications such as scene understanding and relationship inference. Adapting set function learning models to graph-based tasks can also improve performance in domains requiring hierarchical reasoning, such as molecular property prediction and knowledge graph construction.
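As noted in the mini-batch consistency and dynamic data handling directions above, sum-decomposable encoders are a natural starting point because their aggregation commutes with partitioning the set. The sketch below (again a hedged illustration assuming Python with PyTorch; StreamingSetEncoder and its members are hypothetical names of ours) folds mini-batches of a set into a running sufficient statistic and recovers the same encoding that would be obtained from the whole set at once, without revisiting earlier elements.

import torch
import torch.nn as nn

class StreamingSetEncoder(nn.Module):
    # Sum-decomposable encoder f(X) = rho(sum_x phi(x)) with an
    # accumulator so the set can arrive in chunks or as a stream.
    def __init__(self, in_dim: int, hid_dim: int, out_dim: int):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, hid_dim))
        self.rho = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, out_dim))
        # Running sum of phi over every element seen so far.
        self.register_buffer("state", torch.zeros(hid_dim))

    @torch.no_grad()
    def update(self, chunk: torch.Tensor) -> None:
        # chunk has shape (chunk_size, in_dim); earlier elements
        # never need to be stored or revisited.
        self.state += self.phi(chunk).sum(dim=0)

    def readout(self) -> torch.Tensor:
        return self.rho(self.state)

enc = StreamingSetEncoder(in_dim=3, hid_dim=64, out_dim=8)
x = torch.randn(100, 3)
for chunk in x.split(10):                 # the set arrives as ten mini-batches
    enc.update(chunk)
streamed = enc.readout()
full = enc.rho(enc.phi(x).sum(dim=0))     # encoding of the whole set at once
assert torch.allclose(streamed, full, atol=1e-4)

Attention-based encoders do not decompose this way in general, which is one reason dedicated mini-batch-consistent and streaming set encoders remain an active research direction.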
References
Alireza Abedin, S. Hamid Rezatofighi, Qinfeng Shi, and Damith C. Ranasinghe. 2019. SparseSense: Human activity recognition from highly sparse sensor data-streams using set-based neural networks. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’19). Association for the Advancement of Artificial Intelligence, 5780–5786.
Miika Aittala and Frédo Durand. 2018. Burst image deblurring using permutation invariant convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV’18). 731–747.
Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. 2016. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5297–5307.
Iro Armeni, Ozan Sener, Amir R. Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 2016. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1534–1543.
Eduardo Arnold, Sajjad Mozaffari, and Mehrdad Dianati. 2021. Fast and robust registration of partially overlapping point clouds. IEEE Robot. Autom. Lett. 7, 2 (2021), 1502–1509.
Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR’15).
Jun Bai, Chuantao Yin, Hanhua Hong, Jianfei Zhang, Chen Li, Yanmeng Wang, and Wenge Rong. 2023. Permutation invariant training for paraphrase identification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’23). IEEE, 1–5.
Maria-Florina Balcan and Nicholas J. A. Harvey. 2018. Submodular functions: Learnability, structure, and optimization. SIAM J. Comput. 47, 3 (2018), 703–754.
Sergey Bartunov, Fabian B. Fuchs, and Timothy P. Lillicrap. 2022. Equilibrium aggregation: Encoding sets via optimization. In Uncertainty in Artificial Intelligence. PMLR, 139–149.
Mathieu Blondel, Quentin Berthet, Marco Cuturi, Roy Frostig, Stephan Hoyer, Felipe Llinares-López, Fabian Pedregosa, and Jean-Philippe Vert. 2022. Efficient and modular implicit differentiation. Advances in Neural Information Processing Systems, Vol. 35. A Bradford Book, Cambridge, MA, 5230–5242.
Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. 2017. Geometric deep learning: Going beyond euclidean data. IEEE Sign. Process. Mag. 34, 4 (2017), 18–42.
Bruno Andreis, Jeffrey Willette, Juho Lee, and Sung Ju Hwang. 2021. Mini-batch consistent slot set encoder for scalable set encoding. In Advances in Neural Information Processing Systems, Vol. 34. A Bradford Book, Cambridge, MA, 21365–21374.
Poompol Buathong, David Ginsbourger, and Tipaluck Krityakierne. 2020. Kernels over sets of finite sets using rkhs embeddings, with application to bayesian (combinatorial) optimization. In Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, 2731–2741.
Christian Bueno and Alan Hylton. 2021. On the representation power of set pooling networks. In Advances in Neural Information Processing Systems, Vol. 34. A Bradford Book, Cambridge, MA, 17170–17182.
Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. 1995. A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput. 16, 5 (1995), 1190–1208.
Hanqing Chao, Yiwei He, Junping Zhang, and Jianfeng Feng. 2019. Gaitset: Regarding gait as a set for cross-view gait recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 8126–8133.
Gal Chechik, Varun Sharma, Uri Shalit, and Samy Bengio. 2010. Large scale online learning of image similarity through ranking. J. Mach. Learn. Res. 11 (2010), 1109–1135.
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. 2017. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40, 4 (2017), 834–848.
Dandan Guo, Long Tian, Minghe Zhang, Mingyuan Zhou, and Hongyuan Zha. 2021. Learning prototype-oriented set representations for meta-learning. In International Conference on Learning Representations.
Abir De and Soumen Chakrabarti. 2022. Neural estimation of submodular functions with applications to differentiable subset selection. In Advances in Neural Information Processing Systems, Vol. 35. A Bradford Book, Cambridge, MA, 19537–19552.
Don Dennis, Durmus Alp Emre Acar, Vikram Mandikal, Vinu Sankar Sadasivan, Venkatesh Saligrama, Harsha Vardhan Simhadri, and Prateek Jain. 2019. Shallow rnn: Accurate time-series classification on resource constrained devices. In Advances in Neural Information Processing Systems, Vol. 32. A Bradford Book, Cambridge, MA.
Benjamin Doerr, Carola Doerr, Aneta Neumann, Frank Neumann, and Andrew Sutton. 2020. Optimization of chance-constrained submodular functions. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 1460–1467.
Brian W. Dolhansky and Jeff A. Bilmes. 2016. Deep submodular functions: Definitions and learning. In Advances in Neural Information Processing Systems 29 (2016), 3404–3412.
David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. 2015. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, Vol. 28. A Bradford Book, Cambridge, MA, 2224–2232.
Francis Engelmann, Martin Bokeloh, Alireza Fathi, Bastian Leibe, and Matthias Nießner. 2020. 3d-mpa: Multi-proposal aggregation for 3d semantic instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9031–9040.
Felix A. Faber, Alexander Lindmaa, O. Anatole Von Lilienfeld, and Rickard Armiento. 2016. Machine learning energies of 2 million elpasolite (ABC2D6) crystals. Phys. Rev. Lett. 117, 13 (2016), 135502.
Jiajun Fei, Ziyu Zhu, Wenlei Liu, Zhidong Deng, Mingyang Li, Huanjun Deng, and Shuo Zhang. 2022. Dumlp-pin: A dual-mlp-dot-product permutation-invariant network for set feature extraction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 598–606.
Vitaly Feldman, Pravesh Kothari, and Jan Vondrák. 2013. Representation, approximation and learning of submodular functions using low-rank decision trees. In Proceedings of the Conference on Learning Theory. PMLR, 711–740.
Jennifer A. Gillenwater, Alex Kulesza, Emily Fox, and Ben Taskar. 2014. Expectation-maximization for learning determinantal point processes. In Advances in Neural Information Processing Systems, Vol. 27. A Bradford Book, Cambridge, MA.
Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. In Proceedings of the International Conference on Machine Learning. PMLR, 1263–1272.
Mogan Gim, Donghyeon Park, Michael Spranger, Kana Maruyama, and Jaewoo Kang. 2021. Recipebowl: A cooking recommender for ingredients and recipes using set transformer. IEEE Access 9 (2021), 143623–143633.
Roger Girgis, Florian Golemo, Felipe Codevilla, Martin Weiss, Jim Aldon D’Souza, Samira Ebrahimi Kahou, Felix Heide, and Christopher Pal. 2021. Latent variable sequential set transformers for joint multi-agent motion prediction. In International Conference on Learning Representations.
T. Hackel, N. Savinov, L. Ladicky, J. D. Wegner, K. Schindler, and M. Pollefeys. 2017. SEMANTIC3D.NET: A new large-scale point cloud classification benchmark. ISPRS Ann. Photogram. Remote Sens. Spat. Inf. Sci. 4 (2017), 91–98.
Jason Hartford, Devon Graham, Kevin Leyton-Brown, and Siamak Ravanbakhsh. 2018. Deep models of interactions across sets. In Proceedings of the International Conference on Machine Learning. PMLR, 1909–1918.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
Max Horn, Michael Moor, Christian Bock, Bastian Rieck, and Karsten Borgwardt. 2020. Set functions for time series. In Proceedings of the International Conference on Machine Learning. PMLR, 4353–4363.
Armin Hornung, Kai M. Wurm, Maren Bennewitz, Cyrill Stachniss, and Wolfram Burgard. 2013. OctoMap: An efficient probabilistic 3D mapping framework based on octrees. Auton. Robots 34 (2013), 189–206.
Zicheng Hu, Benjamin S. Glicksberg, and Atul J. Butte. 2019. Robust prediction of clinical outcomes using cytometry data. Bioinformatics 35, 7 (2019), 1197–1203.
Binh-Son Hua, Minh-Khoi Tran, and Sai-Kit Yeung. 2018. Pointwise convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 984–993.
Rishabh Iyer, Ninad Khargoankar, Jeff Bilmes, and Himanshu Asanani. 2021. Submodular combinatorial information measures with applications in machine learning. In Algorithmic Learning Theory. PMLR, 722–754.
Saurav Jha, Dong Gong, Xuesong Wang, Richard E. Turner, and Lina Yao. 2022. The neural process family: Survey, applications and perspectives. arXiv:2209.00517. Retrieved from https://arxiv.org/abs/2209.00517
Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. 2017. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2901–2910.
John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature 596, 7873 (2021), 583–589.
In-Soo Jung, Mario Berges, James H. Garrett Jr, and Barnabas Poczos. 2015. Exploration and evaluation of AR, MPCA and KL anomaly detection techniques to embankment dam piezometer data. Adv. Eng. Inf. 29, 4 (2015), 902–917.
Mateusz Jurewicz and Leon Derczynski. 2022. Set interdependence transformer: Set-to-sequence neural networks for permutation learning and structure prediction. In Proceedings of the International Joint Conference on Artificial Intelligence.
Masanari Kimura, Takuma Nakamura, and Yuki Saito. 2023. SHIFT15M: Fashion-specific dataset for set-to-set matching with several distribution shifts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3508–3513.
Adam R. Kosiorek, Hyunjik Kim, and Danilo J. Rezende. 2020. Conditional set generation with transformers. arXiv:2006.16841. Retrieved from https://arxiv.org/abs/2006.16841
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, Vol. 25. A Bradford Book, Cambridge, MA.
Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. 2019. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12697–12705.
Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. 2019. Set transformer: A framework for attention-based permutation-invariant neural networks. In Proceedings of the International Conference on Machine Learning. PMLR, 3744–3753.
Seanie Lee, Bruno Andreis, Kenji Kawaguchi, Juho Lee, and Sung Ju Hwang. 2022. Set-based meta-interpolation for few-task meta-learning. In Advances in Neural Information Processing Systems, Vol. 35. A Bradford Book, Cambridge, MA, 6775–6788.
Diya Li and Mohammed J. Zaki. 2020. Reciptor: An effective pretrained model for recipe representation learning. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1719–1727.
Jiangnan Li, Yice Zhang, Bin Liang, Kam-Fai Wong, and Ruifeng Xu. 2023. Set learning for generative information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 13043–13052.
Yang Li, Haidong Yi, Christopher Bender, Siyuan Shan, and Junier B. Oliva. 2020. Exchangeable neural ode for set modeling. In Advances in Neural Information Processing Systems, Vol. 33. A Bradford Book, Cambridge, MA, 6936–6946.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision. Springer, 740–755.
Yang Liu and Mirella Lapata. 2019. Hierarchical transformers for multi-document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 5070–5081.
Zhihai Liu, Yan Li, Li Han, Jie Li, Jie Liu, Zhixiong Zhao, Wei Nie, Yuchen Liu, and Renxiao Wang. 2015. PDB-wide collection of binding data: Current status of the PDBbind database. Bioinformatics 31, 3 (2015), 405–412.
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision. 3730–3738.
Yuzhe Lu, Xinran Liu, Andrea Soltoggio, and Soheil Kolouri. 2024. Slosh: Set locality sensitive hashing via sliced-wasserstein embeddings. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2566–2576.
Piyushi Manupriya, Tarun Ram Menta, Sakethanath N. Jagarlapudi, and Vineeth N. Balasubramanian. 2022. Improving attribution methods by learning submodular functions. In Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, 2173–2190.
Javier Marín, Aritro Biswas, Ferda Ofli, Nicholas Hynes, Amaia Salvador, Yusuf Aytar, Ingmar Weber, and Antonio Torralba. 2021. Recipe1m+: A dataset for learning cross-modal embeddings for cooking recipes and food images. IEEE Trans. Pattern Anal. Mach. Intell. 43, 1 (2021), 187–203.
Haggai Maron, Heli Ben-Hamu, Hadar Serviansky, and Yaron Lipman. 2019. Provably powerful graph networks. In Advances in Neural Information Processing Systems, Vol. 32. A Bradford Book, Cambridge, MA.
Haggai Maron, Ethan Fetaya, Nimrod Segol, and Yaron Lipman. 2019. On the universality of invariant networks. In Proceedings of the International Conference on Machine Learning. PMLR, 4363–4371.
Haggai Maron, Or Litany, Gal Chechik, and Ethan Fetaya. 2020. On learning sets of symmetric elements. In Proceedings of the International Conference on Machine Learning. PMLR, 6734–6744.
Andres Milioto, Ignacio Vizzo, Jens Behley, and Cyrill Stachniss. 2019. Rangenet++: Fast and accurate lidar semantic segmentation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’19). IEEE, 4213–4220.
Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, and Frank Hutter. 2022. Transformers can do bayesian inference. In International Conference on Learning Representations.
Ryan L. Murphy, Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. 2018. Janossy pooling: Learning deep permutation-invariant functions for variable-size inputs. arXiv:1811.01900. Retrieved from https://arxiv.org/abs/1811.01900
Giannis Nikolentzos, Polykarpos Meladianos, and Michalis Vazirgiannis. 2017. Matching node embeddings for graph similarity. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
Tae-Hyun Oh, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, and Wojciech Matusik. 2019. Speech2face: Learning the face behind a voice. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7539–7548.
Zijing Ou, Tingyang Xu, Qinliang Su, Yingzhen Li, Peilin Zhao, and Yatao Bian. 2022. Learning neural set functions under the optimal subset oracle. In Advances in Neural Information Processing Systems, Vol. 35. A Bradford Book, Cambridge, MA, 35021–35034.
Ari Pakman, Yueqi Wang, Catalin Mitelut, JinHyung Lee, and Liam Paninski. 2020. Neural clustering processes. In Proceedings of the International Conference on Machine Learning. PMLR, 7455–7465.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning. PMLR, 1310–1318.
James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. 2007. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1–8.
I. Gethzi Ahila Poornima and B. Paramasivan. 2020. Anomaly detection in wireless sensor network using machine learning algorithm. Comput. Commun. 151 (2020), 331–337.
Sergey Prokudin, Christoph Lassner, and Javier Romero. 2019. Efficient learning on point clouds with basis point sets. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4332–4341.
Markus Püschel. 2018. A discrete signal processing framework for set functions. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’18). IEEE, 4359–4363.
Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 652–660.
Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. 2017. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, Vol. 30. A Bradford Book, Cambridge, MA.
Kechen Qin, Cheng Li, Virgil Pavlu, and Javed Aslam. 2019. Adapting RNN sequence prediction model to multi-label set prediction. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 3181–3190.
Akbar Rafiey and Yuichi Yoshida. 2020. Fast and private submodular and k-Submodular functions maximization with matroid constraints. In Proceedings of the International Conference on Machine Learning. PMLR, 7887–7897.
Neelima Rajput and S. K. Verma. 2014. Back propagation feed forward neural network approach for speech recognition. In Proceedings of 3rd International Conference on Reliability, Infocom Technologies and Optimization. IEEE, 1–6.
Sofya Raskhodnikova and Grigory Yaroslavtsev. 2013. Learning pseudo-boolean k-dnf and submodular functions. In Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 1356–1368.
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 779–788.
Hamid Rezatofighi, Tianyu Zhu, Roman Kaskman, Farbod T. Motlagh, Javen Qinfeng Shi, Anton Milan, Daniel Cremers, Laura Leal-Taixé, and Ian Reid. 2021. Learn to predict sets using feed-forward neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 44, 12 (2021), 9011–9025.
S. Hamid Rezatofighi, Vijay Kumar Bg, Anton Milan, Ehsan Abbasnejad, Anthony Dick, and Ian Reid. 2017. Deepsetnet: Predicting sets with deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV’17). IEEE, 5257–5266.
Tal Ridnik, Emanuel Ben-Baruch, Nadav Zamir, Asaf Noy, Itamar Friedman, Matan Protter, and Lihi Zelnik-Manor. 2021. Asymmetric loss for multi-label classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 82–91.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-assisted Intervention (MICCAI’15), Part III 18. Springer, 234–241.
Renato F. Salas-Moreno, Richard A. Newcombe, Hauke Strasdat, Paul H. J. Kelly, and Andrew J. Davison. 2013. Slam++: Simultaneous localisation and mapping at the level of objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1352–1359.
Attilio Sbrana, André Luis Debiaso Rossi, and Murilo Coelho Naldi. 2020. N-BEATS-RNN: Deep learning for time series forecasting. In Proceedings of the 19th IEEE International Conference on Machine Learning and Applications (ICMLA’20). IEEE, 765–768.
Robin Scheibler, Saeid Haghighatshoar, and Martin Vetterli. 2015. A fast Hadamard transform for signals with sublinear sparsity in the transform domain. IEEE Trans. Inf. Theory 61, 4 (2015), 2115–2132.
Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. Autorec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web. 111–112.
Eva Sharma, Guoli Ye, Wenning Wei, Rui Zhao, Yao Tian, Jian Wu, Lei He, Ed Lin, and Yifan Gong. 2020. Adaptation of rnn transducer with text-to-speech technology for keyword spotting. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’20). IEEE, 7484–7488.
Yifeng Shi, Junier Oliva, and Marc Niethammer. 2020. Deep message passing on sets. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 5750–5757.
Konstantinos Skianis, Giannis Nikolentzos, Stratis Limnios, and Michalis Vazirgiannis. 2020. Rep the set: Neural networks for learning set representations. In Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, 1410–1420.
Yi Sun, Xiaogang Wang, and Xiaoou Tang. 2014. Deep learning face representation from predicting 10,000 classes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1891–1898.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, Vol. 27. A Bradford Book, Cambridge, MA.
Mingxing Tan and Quoc Le. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning. PMLR, 6105–6114.
Samuel Thomas, Brian Kingsbury, George Saon, and Hong-Kwang J. Kuo. 2022. Integrating text inputs for training and adapting rnn transducer asr models. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’22). IEEE, 8127–8131.
Aäron Van Den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. 2016. Pixel recurrent neural networks. In Proceedings of the International Conference on Machine Learning. PMLR, 1747–1756.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30. A Bradford Book, Cambridge, MA.
Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2015. Order matters: Sequence to sequence for sets. arXiv:1511.06391. Retrieved from https://arxiv.org/abs/1511.06391
Edward Wagstaff, Fabian Fuchs, Martin Engelcke, Ingmar Posner, and Michael A. Osborne. 2019. On the limitations of representing functions on sets. In Proceedings of the International Conference on Machine Learning. PMLR, 6487–6494.
Edward Wagstaff, Fabian B. Fuchs, Martin Engelcke, Michael A. Osborne, and Ingmar Posner. 2022. Universal approximation of functions on sets. J. Mach. Learn. Res. 23, 151 (2022), 1–56.
Dingrong Wang, Deep Shankar Pandey, Krishna Prasad Neupane, Zhiwei Yu, Ervine Zheng, Zhi Zheng, and Qi Yu. 2023. Deep temporal sets with evidential reinforced attentions for unique behavioral pattern discovery. In Proceedings of the International Conference on Machine Learning. PMLR, 36205–36223.
Peihao Wang, Shenghao Yang, Shu Li, Zhangyang Wang, and Pan Li. 2023. Polynomial width is sufficient for set representation with high-dimensional features. In Proceedings of the 12th International Conference on Learning Representations.
Yueqi Wang, Yoonho Lee, Pallab Basu, Juho Lee, Yee Whye Teh, Liam Paninski, and Ari Pakman. 2020. Amortized probabilistic detection of communities in graphs. arXiv:2010.15727. Retrieved from https://arxiv.org/abs/2010.15727
Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. 2019. Dynamic graph cnn for learning on point clouds. ACM Trans. Graphics 38, 5 (2019), 1–12.
Kai Wei, Yuzong Liu, Katrin Kirchhoff, and Jeff Bilmes. 2014. Unsupervised submodular subset selection for speech data. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’14). IEEE, 4107–4111.
Chris Wendler, Andisheh Amrollahi, Bastian Seifert, Andreas Krause, and Markus Püschel. 2021. Learning set functions that are sparse in non-orthogonal Fourier bases. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 10283–10292.
Chris Wendler, Markus Püschel, and Dan Alistarh. 2019. Powerset convolutional neural networks. In Advances in Neural Information Processing Systems, Vol. 32. A Bradford Book, Cambridge, MA.
Jeffrey Willette, Seanie Lee, Bruno Andreis, Kenji Kawaguchi, Juho Lee, and Sung Ju Hwang. 2023. Scalable set encoding with universal mini-batch consistency and unbiased full set gradient approximation. In Proceedings of the International Conference on Machine Learning. PMLR, 37008–37041.
Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 2015. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1912–1920.
Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. 2018. Spidercnn: Deep learning on point sets with parameterized convolutional filters. In Proceedings of the European Conference on Computer Vision (ECCV’18). 87–102.
Bin Yang, Wenjie Luo, and Raquel Urtasun. 2018. Pixor: Real-time 3d object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7652–7660.
Vacit Oguz Yazici, Abel Gonzalez-Garcia, Arnau Ramisa, Bartlomiej Twardowski, and Joost van de Weijer. 2020. Orderless recurrent models for multi-label classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13440–13449.
Chih-Kuan Yeh, Wei-Chieh Wu, Wei-Jen Ko, and Yu-Chiang Frank Wang. 2017. Learning deep latent space for multi-label classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
Haidong Yi and Natalie Stanley. 2021. CytoSet: Predicting clinical outcomes via set-modeling of cytometry data. In Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. 1–8.
Li Yi, Vladimir G. Kim, Duygu Ceylan, I.-Chao Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang, Alla Sheffer, and Leonidas Guibas. 2016. A scalable active framework for region annotation in 3d shape collections. ACM Trans. Graphics 35, 6 (2016), 1–12.
Le Yu, Zihang Liu, Tongyu Zhu, Leilei Sun, Bowen Du, and Weifeng Lv. 2023. Predicting temporal sets with simplified fully connected networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 4835–4844.
Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R. Salakhutdinov, and Alexander J. Smola. 2017. Deep sets. In Advances in Neural Information Processing Systems, Vol. 30. A Bradford Book, Cambridge, MA.
David W. Zhang, Gertjan J. Burghouts, and Cees GM Snoek. 2020. Set prediction without imposing structure as conditional density estimation. In International Conference on Learning Representations.
Fengzhuo Zhang, Boyi Liu, Kaixin Wang, Vincent Tan, Zhuoran Yang, and Zhaoran Wang. 2022. Relational reasoning via set transformers: Provable efficiency and applications to MARL. In Advances in Neural Information Processing Systems, Vol. 35. A Bradford Book, Cambridge, MA, 35825–35838.
Jie Zhang, Chen Cai, George Kim, Yusu Wang, and Wei Chen. 2022. Composition design of high-entropy alloys with deep sets learning. npj Comput. Mater. 8, 1 (2022), 89.
Lily Zhang, Veronica Tozzo, John Higgins, and Rajesh Ranganath. 2022. Set norm and equivariant skip connections: Putting the deep in deep sets. In Proceedings of the International Conference on Machine Learning. PMLR, 26559–26574.
Yan Zhang, Jonathon Hare, and Adam Prugel-Bennett. 2019. Deep set prediction networks. In Advances in Neural Information Processing Systems, Vol. 32. A Bradford Book, Cambridge, MA.
Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. 2019. FSPool: Learning set representations with featurewise sort pooling. In Proceedings of the International Conference on Learning Representations.
Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. 2019. Learning representations of sets through optimized permutations. In Proceedings of the International Conference on Learning Representations.
Yan Zhang, David W. Zhang, Simon Lacoste-Julien, Gertjan J. Burghouts, and Cees G. M. Snoek. 2021. Multiset-equivariant set prediction with approximate implicit differentiation. In Proceedings of the International Conference on Learning Representations.
Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip H. S. Torr, and Vladlen Koltun. 2021. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 16259–16268.
Yujie Zhong, Relja Arandjelovic, and Andrew Zisserman. 2018. Compact deep aggregation for set retrieval. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops.
Yin Zhou and Oncel Tuzel. 2018. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4490–4499.
Aaron Zweig and Joan Bruna. 2022. Exponential separations in symmetric neural networks. In Advances in Neural Information Processing Systems, Vol. 35. A Bradford Book, Cambridge, MA, 33134–33145.