Set function learning has emerged as a crucial area in machine learning, addressing the challenge of modeling functions that take sets as inputs. Unlike traditional machine learning that involves fixed-size input vectors where the order of features matters, set function learning demands methods that are invariant to permutations of the input set, presenting a unique and complex problem. This survey provides a comprehensive overview of current developments in set function learning, covering foundational theories, key methodologies, and diverse applications. We categorize and discuss existing approaches, focusing on deep learning approaches, such as DeepSets and Set Transformer-based methods, as well as other notable alternative methods beyond deep learning, offering a complete view of current models. We also introduce various applications and relevant datasets, such as point cloud processing and multi-label classification, highlighting the significant progress achieved by set function learning methods in these domains. Finally, we conclude by summarizing the current state of set function learning approaches and identifying promising future research directions, aiming to guide and inspire further advancements in this rapidly developing field.
1 Introduction
Set function learning is an emerging and rapidly developing field within machine learning [122], focusing on learning functions defined on set-structured data. In contrast to conventional learning paradigms where the order of input data significantly affects the learning process, set function learning methods are characterized by their invariance to permutations of input elements [64, 140]. This fundamental property makes them particularly effective for tasks involving unordered data. Conventional models, such as convolutional neural networks (CNNs) [59] and recurrent neural networks (RNNs) [89], have achieved significant success in tasks such as time-series analysis and natural language processing [26, 108, 111, 118], where preserving the order of input data is essential for capturing the underlying structure. However, many real-world applications involve learning from inherently unordered sets [3, 95], where conventional methods struggle because they rely heavily on input order. For example, in point cloud analysis for three-dimensional (3D) object recognition and reconstruction, the individual points representing an object’s surface are inherently unordered [96, 133, 152]. Traditional methods often require extensive preprocessing, resulting in inefficiencies or data structure distortion. Similarly, in multi-label classification, where a single instance is associated with multiple labels [105, 135, 136], treating these labels as a set is more appropriate than using traditional approaches like binary relevance. Binary relevance treats each label independently, failing to capture complex interdependencies between labels, while set function learning models can effectively capture the underlying structure of unordered data, leading to more accurate and robust predictions.
There is increasing literature proposing novel methods that are capable of handling set-structured data, opening new avenues for machine learning and set-based learning problems [83, 122]. For instance, DeepSets [140] introduces a framework for learning permutation-invariant functions, ensuring that the outputs remain unchanged regardless of the order of input elements. PointNet [95] revolutionizes point cloud processing by directly dealing with raw point sets without requiring voxelization or other preprocessing steps, simplifying the workflow and preserving data fidelity. Set Transformer [64] leverages attention mechanisms to capture complex dependencies among set elements, enhancing the model’s ability to understand intricate relationships within the data. These pioneering works have shown the significant potential of neural networks to effectively model and learn from set-structured data, leading to substantial advancements in various domains such as point cloud processing [80, 103] and recommendation systems [38, 67].
Despite the promising advancements, set function learning faces several unique challenges. A fundamental challenge is ensuring permutation-invariance [123], as the output of a set function learning model should remain unchanged regardless of the order of set elements. Another critical challenge is scalability [86], as set function learning models should be capable of handling inputs ranging from small to large sets, often with varying sizes across different instances. This variability demands models that are flexible enough to adapt to sets with arbitrary cardinality while maintaining consistent performance. Additionally, the combinatorial nature of sets leads to significant computational challenges. As the size of the ground set increases, the number of possible subsets grows exponentially, making exhaustive approaches infeasible for large-scale problems. The highly non-linear and interdependent relationships between set elements further complicate the learning process [64], requiring models that are expressive enough to capture these complex dependencies without becoming computationally intractable. Balancing these requirements while learning effectively from limited data remains a central challenge. Indeed, addressing these interconnected challenges demands specialized model architectures and innovative learning algorithms.
Given the rapidly growing interest in set function learning, we present a comprehensive survey of this promising area, providing researchers with insights into the state-of-the-art advancements. We review breakthrough papers and recent advancements, covering both theoretical foundations and practical implementations. While Kimura et al. [57] conduct a literature review of permutation-invariant neural networks, their work focuses only on a few typical methods and lacks a discussion of various applications. In contrast, as one of the very first surveys on set function learning, our work serves as a reference for anyone seeking to understand, apply, or advance this field, making several significant contributions. It provides a unified framework for understanding and categorizing diverse approaches to set function learning. The introduction of foundational theories allows interested readers to quickly capture basic concepts and engage in this area. Additionally, the systematic view of the strengths and limitations of different methodologies helps readers select the most appropriate approaches for specific tasks. The extensive discussion of applications across multiple domains underscores the broad impact and potential of set function learning methods, encouraging their adoption in new areas. Furthermore, the introduction of various datasets serves as a valuable resource for set function learning research. Finally, by identifying the challenges and future directions, we offer valuable insights for the research community, potentially inspiring new research ideas and accelerating progress in this significant area.
The rest of this survey is organized as follows: In Section 2, we formally introduce the problem of set function learning and related basic concepts. Section 3 discusses various deep learning methods for solving set function learning problems, while other approaches are covered in Section 4. The reviewed set function learning methods are summarized in Table 1. Section 5 describes various applications of set function learning models across different domains and introduces relevant datasets. Finally, we conclude and discuss future directions in Section 6.
2 Preliminaries
A set function is a function defined on sets and is particularly relevant to machine learning tasks that deal with set-structured data, such as point clouds [95, 96, 133], molecular structures [29, 31], and any other unordered collections of elements [37, 104, 141]. We begin this section by introducing the definition of a set function.
Definition 2.1 (Set Function).
For two sets \(X,Y\), a set function is defined as a mapping from \(2^{X}\) to Y, i.e., \(f:2^{X}\rightarrow Y\), where Y is the response range and can be any set, such as the set of scalars, vectors, sets, and more complex structures.
There are many common set functions and we provide some examples as follows:
Example 2.1 (Point Set Function [95, 150]).
In point cloud classification, a point cloud is a set of points in 3D space, where each point is represented by its coordinates \((x,y,z) \in \mathbb {R}^3\). The objective is to learn a point set function that can predict the label associated with the input point cloud. Formally, this point set function can be formulated as \(h(\lbrace (x_i,y_i,z_i)\rbrace _{i=1}^n)=l\), where \(\lbrace (x_i,y_i,z_i)\rbrace _{i=1}^n\) denotes the set of coordinates representing the point cloud and l is the corresponding label.
Example 2.2 (Product Cost Summarizing Function [64]).
In a recommendation system, the objective is to learn a product cost summarizing function to recommend cost-effective products to users. Formally, this product cost summarizing function can be formulated as \(c(\lbrace \boldsymbol {f}_i\rbrace _{i=1}^n)=\sum _{i=1}^n(f_{i}^1\cdot f_{i}^2)\), where \(\lbrace \boldsymbol {f}_i\rbrace _{i=1}^n \subseteq \mathbb {R}^d\) is the set of products, \(\boldsymbol {f}_i=(f_{i}^1,f_{i}^2,\ldots ,f_{i}^d)\) is a d-dimensional feature vector of product i, and \(f_{i}^1\) and \(f_{i}^2\) represent price and quantity, respectively.
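To make the permutation-invariance of such a summarizing function concrete, the following minimal Python sketch, with illustrative feature vectors of our own, computes the cost of a product set; reordering the products leaves the output unchanged.

```python
# A minimal sketch of Example 2.2: the output depends only on the set
# contents, so any reordering of the products gives the same cost.
def product_cost(products):
    """products: iterable of feature vectors (price, quantity, ...)."""
    return sum(f[0] * f[1] for f in products)

catalog = [(2.5, 4, 0.3), (1.0, 10, 1.2), (7.0, 1, 0.5)]  # illustrative products
assert product_cost(catalog) == product_cost(list(reversed(catalog)))
```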
Example 2.3 (Square Corner Prediction Function [146]).
For object detection in traffic scenes, the objective is to learn a square corner prediction function to predict bounding boxes around objects such as cars. Formally, this square corner prediction function can be formulated as \(g(\lbrace (x_i,y_i)\rbrace _{i=1}^4)=\lbrace (x_i^\prime ,y_i^\prime)\rbrace _{i=1}^4\), where \(\lbrace (x_i,y_i)\rbrace _{i=1}^4\subseteq \mathbb {R}^2\) represents the vertices of a square and \(\lbrace (x_i^\prime ,y_i^\prime)\rbrace _{i=1}^4\subseteq \mathbb {R}^2\) represents four corners of the rotated square.
Having shown some examples of set functions, we introduce supervised set function learning, a branch of set function learning, aiming to learn set functions from labeled training data. The supervised set function learning problem is defined as follows.
Definition 2.2 (Supervised Set Function Learning).
For two sets X and Y, suppose that \(\mathcal {D}\) is an unknown underlying probability distribution over \(2^X\times Y\), from which the training set D is assumed to be sampled, i.e., \(D = \lbrace (x_i,y_i)\rbrace _{i=1}^n\), where each \(x_i\in 2^X\) is an input set and \(y_i\in Y\) is the corresponding target output label. We begin by choosing a hypothesis space \(\mathcal {H}\subseteq \lbrace h:2^X \rightarrow Y\rbrace\). The goal is to find a function \(h\in \mathcal {H}\) that maps input sets \(x_i\) to outputs \(y_i\), such that the expected loss \(L_\mathcal {D}(h)\stackrel{\mathrm{def}}{=}\mathbb {E}_{(x,y)\sim \mathcal {D}}[\ell (h,(x,y))]\) is minimized, where \(\ell :\mathcal {H}\times Z\rightarrow \mathbb {R}^+\) measures the difference between the prediction \(h(x)\) and the ground truth y, and \(Z=2^X\times Y\).
Supervised set function learning has applications in various domains such as computer vision (e.g., object detection [145] and scene understanding [95]), natural language processing (e.g., text summarization [10] and relation extraction [71]), and bioinformatics (e.g., protein prediction [51] and drug discovery [53]). Some example tasks of supervised set function learning are given below and visualized in Figure 1.
Fig. 1. Visualization of examples.
Example 2.4 (Point Cloud Classification [95, 96]).
Let X be the set of all 3D points in the space and Y be the set of object categories such as sphere and cube. Each input set \(x_i \in 2^X\) represents a point cloud and the corresponding label \(y_i \in Y\) represents the object category of the point cloud. The goal is to find a function \(h \in \mathcal {H}\) that classifies each point cloud \(x_i\) into its correct category \(y_i\).
Example 2.5 (Predicting Total Cost of a Set of Products [122]).
Let X be the set of all possible products and Y be the set of possible total costs. Each input set \(x_i \in 2^X\) represents a set of products, where each product is characterized by features such as price, weight, and quantity. The corresponding label \(y_i \in Y\) represents the total cost of the set of products. The goal is to find a function \(h \in \mathcal {H}\) that accurately predicts the total cost \(y_i\) for each set of products \(x_i\).
Example 2.6 (Predicting Corners of the Rotated Square [145, 146]).
Let X be the set of all 2D points in a plane and Y be the set of possible sets of four points. Each input set \(x_i \in 2^X\) represents the vertices of a square rotated by an angle \(\theta\) around the origin. The corresponding label \(y_i \in Y\) represents the four corners of the square after rotation. The goal is to find a function \(h \in \mathcal {H}\) that predicts the four corners \(y_i\) for each set of vertices \(x_i\) given the rotation angle \(\theta\).
With a growing literature focusing on designing novel set function learning methods, we summarize three issues that should be taken into account when designing new set function learning methods, including permutation-invariance, theoretical expressive power, and scalability.
(1) Permutation-invariance: Permutation-invariance is a fundamental requirement [141] for set function learning. In tasks such as point cloud processing [95] and molecular property prediction [51], the output of models should remain consistent regardless of the order of the input set elements [53, 96, 104]. This property, known as permutation-invariance, is essential for any learning method that handles set-structured data. To show the significance of permutation-invariance, consider the scenario of using conventional CNNs for point cloud classification. In this case, the input point set should be transformed into an ordered vector before being fed into the network. However, if the points are permuted differently, then the extracted features after convolution and pooling will reflect the new order of points, potentially resulting in different classification outcomes. This variability is undesirable, since the output label should be invariant for classifying the same set of points regardless of their order. This example underscores the necessity of a permutation-invariant hypothesis space in set function learning methods. To define permutation-invariance for functions on matrices, we first introduce the permutation matrix. A permutation matrix is a square binary matrix with exactly one entry of 1 in each row and each column, and 0 elsewhere, representing a permutation of set elements. For example, when applied by left multiplication, the permutation matrix \([(0,1,0),(0,0,1),(1,0,0)]\) moves the second element to the first position, the third element to the second position, and the first element to the third position. Given an input \(\mathcal {X} \in \mathbb {R}^{N \times m}\) consisting of N m-dimensional vectors, a permutation matrix \(\Pi\) belongs to the set of all permutation matrices \(\Pi _N\). Using n to represent the dimension of the output vectors, we can formally define permutation-invariance as
Definition 2.3 (Permutation-invariance).
For each \(\mathcal {X}\in \mathbb {R}^{N \times m}\) and \(\Pi \in \Pi _N\), if \(f(\Pi \mathcal {X})=f(\mathcal {X})\) always holds, then the function \(f:\mathbb {R}^{N \times m} \rightarrow \mathbb {R}^{N \times n}\) is permutation-invariant.
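The following small numerical check, a sketch of our own using the permutation matrix from the example above, illustrates Definition 2.3: a sum-pooling function satisfies \(f(\Pi \mathcal {X})=f(\mathcal {X})\), while a flattening-based function does not.

```python
import numpy as np

X = np.random.randn(3, 4)                      # N = 3 elements, m = 4 features
Pi = np.array([[0, 1, 0],
               [0, 0, 1],
               [1, 0, 0]])                     # a permutation matrix in Pi_N

f_invariant = lambda X: X.sum(axis=0)          # sum-pooling over elements
f_sensitive = lambda X: X.reshape(-1)          # flattening depends on the order

assert np.allclose(f_invariant(Pi @ X), f_invariant(X))      # f(PiX) = f(X)
assert not np.allclose(f_sensitive(Pi @ X), f_sensitive(X))  # order-sensitive
```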
To preserve permutation-invariance, various techniques are employed; we summarize three key strategies as follows.
— Sorting is a straightforward technique employed in various learning models [123], where input set elements are sorted into a canonical ordering before being fed into the model. This mechanism essentially restricts the hypothesis space to inherently permutation-invariant functions, i.e., \(f(\mathrm{sort}(X))\), where X is the input set. However, this restriction may exclude some complex relationships that depend on the original data ordering or structure, biasing the hypothesis space towards functions that work well with a particular sorting and limiting the generalization ability of models.
— Augmenting the training data with various permutations of the input sets is a commonly used technique [121] (a minimal sketch of this strategy follows the list). The basic idea is to create multiple reordered versions of each input set and include all these versions in the training data. While this strategy keeps the original hypothesis space unchanged, it encourages the model to find approximately permutation-invariant functions within this space. This approach is more flexible than sorting and can be easily combined with existing learning methods such as RNNs [9]. However, augmenting significantly increases the data size and it is computationally infeasible to generate all possible permutations.
— Aggregating features of set elements through symmetric functions is an effective technique [12]. The key idea is to employ a permutation-invariant function to aggregate feature vectors from all elements in the input set into a unified set-level representation. This approach explicitly builds permutation-invariance into the model architecture and restricts the hypothesis space to permutation-invariant functions of the form \(f(g(\phi (x_1), \phi (x_2),\ldots , \phi (x_n)))\), where \(\lbrace x_i\rbrace _{i=1}^n\) is the input set, \(\phi\) is the encoder, g is a symmetric aggregation function such as sum (DeepSets [140]) and max (PointNet [95]), and f is a task-specific function. In fact, more complex encoders and aggregators, such as attention mechanisms [124], can expand the hypothesis space, potentially capturing higher-order element relationships.
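The sketch below illustrates the augmentation strategy referenced above; the dataset format and the number of permuted copies are illustrative assumptions rather than a prescribed recipe.

```python
import random

def augment_with_permutations(dataset, copies_per_example=3, seed=0):
    """dataset: list of (elements, label) pairs, where elements is a list."""
    rng = random.Random(seed)
    augmented = []
    for elements, label in dataset:
        augmented.append((list(elements), label))
        for _ in range(copies_per_example):        # a few random reorderings,
            shuffled = list(elements)              # not all n! permutations
            rng.shuffle(shuffled)
            augmented.append((shuffled, label))
    return augmented

data = [([(0.1, 0.2), (0.3, 0.4), (0.5, 0.6)], 1)]
print(len(augment_with_permutations(data)))        # 1 original + 3 permuted copies
```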
(2) Theoretical expressive power: Expressive power in set function learning refers to the capacity of models to represent and approximate set functions. Set function learning methods should have sufficient expressive power with theoretical guarantees to capture complex relationships between set elements and set-level features [122, 123]. For example, point cloud processing often requires capturing high-level geometric structures and patterns [95]. Insufficient expressive power can lead to underfitting and poor performance in complex tasks.
(3) Scalability: It is vital for set function learning methods to handle input sets of varying sizes and run in polynomial time [64], ensuring their practical applicability [86]. For example, in point cloud processing [96], the number of points representing an object can vary with resolution and sampling method, sometimes reaching millions. Scalable methods can also adapt to dynamic applications with growing data over time [98, 127], allowing for incremental learning and updating the model with newly available data instead of retraining from scratch.
3 Deep Learning Methods
Deep learning methods have become pivotal in addressing various learning problems [62]. The fundamental neural networks, such as CNNs [59] and RNNs [119], have achieved remarkable success in multiple tasks, such as image segmentation [23, 106], object detection [101, 102] and speech synthesis [85, 117]. However, these models implicitly incorporate regularity assumptions in their neural structures, making them less adaptable to irregular data domains such as sets, which lack a fixed ordering [140]. To handle set function learning tasks, several works extend CNNs and RNNs to set function learning, while there is also growing literature developing novel deep learning methods specialized for processing set-structured data. In this section, we introduce various deep learning methods designed for set function learning problems, categorizing them into multiple groups: CNN-based methods (Section 3.1), RNN-based methods (Section 3.2), FNN-based methods (Section 3.3), DeepSets-based methods (Section 3.4), PointNet-based methods (Section 3.5), Set Transformer–based methods (Section 3.6), Deep Set Prediction Network–based methods (Section 3.7), Deep Submodular Function–based methods (Section 3.8), and other deep learning methods (Section 3.9) that cannot be classified into the above groups.
3.1 CNN-based Methods
CNNs are highly efficient architectures due to their ability to leverage local connectivity and shared weights [47], leading to breakthroughs in a wide variety of tasks such as image processing [116]. To make full use of such advantages, some research extends CNNs to set-based learning problems. In this section, we introduce several CNN-based set function learning methods and divide them into two categories: one extends the convolution operation to set-structured data, and the other integrates CNNs with symmetric aggregations. The structure of this section is illustrated in Figure 2(a).
Fig. 2. (a) The structure of Section 3.1. (b) The structure of Section 3.4. (c) The structure of Section 3.5. (d) The structure of Section 3.6. (e) The structure of Section 3.7. (f) The structure of Section 3.8.
3.1.1 Extending Convolution Operation to Set.
The first strategy for CNNs to learn set functions is to extend the convolution operation to sets. Wendler et al. [130] propose a novel class of CNNs for set functions by introducing powerset convolution and pooling operations. Powerset convolution is designed to be shift equivariant, meaning it commutes with specified shifts on the powerset domain. The shift is defined as modifying a set function \(s(A)\) by removing a subset Q from argument A, denoted as \(T_Qs=(s_{A\setminus Q})_{A\subseteq N}\), where N is the ground set. The convolution is given in Reference [93] as \((h\ast s)_A=\sum _{Q\subseteq N}h_Qs_{A\setminus Q}\), where the filter h is a set function. The powerset convolution layer is constructed by conducting powerset convolutions on multiple channels, summarizing the feature maps as in Reference [15], with both input and output being sets of set functions. Each output set function is derived by convolving the input set functions with corresponding filters, summing the results, and applying a non-linear transformation. The powerset pooling layer reduces complexity by aggregating elements into a smaller ground set, mapping the original set function to a new one on this reduced set, and can be implemented in various ways, such as combining elements as in Reference [109] and using a simple max rule. Powerset CNNs consist of multiple powerset convolution and pooling layers, the number of which can be adjusted according to specific tasks. However, the complexity analysis [74] shows that powerset CNNs are impractical for large ground sets, as the domain of a set function (the powerset) grows exponentially with the size of the ground set. Xu et al. [133] develop SpiderCNN, a novel CNN designed specifically for processing point clouds. The core component of SpiderCNN is the SpiderConv layer, which replaces conventional convolutional layers to enable convolution on point sets. Given a function F defined on a point set \(P\subseteq \mathbb {R}^n\) and a filter \(g:\mathbb {R}^n\rightarrow \mathbb {R}\) within a sphere centered at the origin with radius \(r\in \mathbb {R}\), the SpiderConv can be formulated as
\((F\ast g)(p)=\sum _{q\in P,\, \Vert q-p\Vert \le r}F(q)\,g(p-q),\)
where \(p,q\) are points. Conventional convolution becomes a special case of SpiderConv when \(P=\mathbb {Z}^2\) is a regular grid. In SpiderConv, the filter g belongs to a parameterized filter family \(\lbrace g_w\rbrace\), which is piecewise differentiable for w and can be efficiently optimized using stochastic gradient descent (SGD). \(\lbrace g_w\rbrace\) is defined as the product of a step function and a Taylor polynomial, enabling capturing local geodesic information and ensuring expressiveness. SpiderCNN inherits the advantages of CNNs, making it effective at extracting deep features and achieving good performance in segmentation tasks.
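As a rough illustration of the operation above, the sketch below sums \(F(q)\,g(p-q)\) over the neighbors of a query point within radius r; the fixed radial function stands in for the learnable filter family \(\lbrace g_w\rbrace\), so this is a simplified sketch under our own assumptions rather than the authors' implementation.

```python
import numpy as np

def spider_conv(points, features, p, r, g):
    """points: (n, 3) array; features: (n,) array giving F(q); p: query point."""
    out = 0.0
    for q, f_q in zip(points, features):
        if np.linalg.norm(q - p) <= r:         # neighbors within the sphere
            out += f_q * g(p - q)              # F(q) * g(p - q)
    return out

g = lambda d: max(0.0, 1.0 - np.linalg.norm(d))   # stand-in for a filter g_w
pts = np.random.rand(100, 3)
feats = np.ones(100)
print(spider_conv(pts, feats, pts[0], r=0.2, g=g))
```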
3.1.2 Combining CNN with Symmetric Aggregation.
The second strategy is combining CNNs with symmetric aggregations such as mean and max pooling, where the final aggregated feature depends only on the set contents and not their order. To deal with burst image deblurring, Aittala et al. [3] propose a U-Net-inspired [106] framework with symmetric pooling operations. In this framework, each image of the input set is processed individually through identical neural networks with tied weights, producing feature vectors. These feature vectors are then combined through symmetric operations such as mean and max pooling. Eventually, the pooled features are processed through further neural network layers, outputting an estimate of the sharp image. In addition, an intermediate pooling layer is introduced, followed by \(1\times 1\) convolutions that fuse global features into local ones. This layer allows the concatenation of the pooled global state back to the local features, enabling information exchange between the set entities. Zhong et al. [151] design a CNN-based architecture called SetNet, which aggregates face descriptors into a compact descriptor. This framework is developed to enhance the efficiency and accuracy of retrieving a set of images that match a given query containing the descriptions of images for multiple identities. SetNet utilizes ResNet-50 [42] to extract features from each image, generating individual descriptors. These descriptors are aggregated into a fixed-length set-level vector using NetVLAD [4], which also helps reduce memory usage and runtime.
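A minimal sketch of this shared-encoder-plus-symmetric-pooling pattern follows; the tiny CNN and the choice of max pooling are our own illustrative stand-ins for the U-Net and NetVLAD components of the cited models.

```python
import torch
import torch.nn as nn

class SharedCNNSetEncoder(nn.Module):
    def __init__(self, channels=3, feat=32):
        super().__init__()
        # every image in the set goes through the same (tied-weight) CNN
        self.cnn = nn.Sequential(
            nn.Conv2d(channels, feat, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, images):                # images: (batch, set_size, C, H, W)
        b, n, c, h, w = images.shape
        feats = self.cnn(images.view(b * n, c, h, w)).view(b, n, -1)
        return feats.max(dim=1).values        # symmetric aggregation over the set

enc = SharedCNNSetEncoder()
burst = torch.randn(2, 4, 3, 32, 32)
assert torch.allclose(enc(burst), enc(burst[:, [2, 0, 3, 1]]))   # order-free
```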
3.2 RNN-based Methods
RNNs are efficient in recognizing patterns when handling sequentially structured data, such as time series [26, 108], speech [34, 88], and text [111, 118]. The connections in RNNs form directed cycles, enabling them to maintain a hidden state that captures temporal dependencies in input sequences [149]. This characteristic makes RNNs particularly suitable for sequential tasks. In this section, we introduce some works that extend RNNs to set function learning.
Qin et al. [97] present set-RNN, an adaptation of RNN for dealing with multi-label text classification, where the target output is a label set. Previous approaches tackling such tasks either transform the set into a predefined sequence or connect sequence probability with set probability, but these methods lack solid theoretical foundations and perform poorly in practice. The authors propose a novel training objective that maximizes set probability defined as the sum of probabilities across all sequence permutations of the set. During training, a variant of beam search is employed to approximate set probability by identifying the top K highest probability sequences. The same approximation technique is used during prediction to find the label set with the highest probability. This novel objective enhances the flexibility of set-RNN to search the best label orders, enabling it to efficiently tackle multi-label classification tasks. Inspired by Reference [97], Li et al. [68] develop a new approach called set learning, which optimizes the set probability by considering multiple permutations of structured objects. This method is applied to generative information extraction (IE) tasks, where the input is a text \(X=[x_1,x_2,\ldots ]\) and the output is a set of structured objects \(S=\lbrace s_1,s_2,\ldots \rbrace\), with each structured object consisting of several spans from X. Set learning introduces a new method to calculate the set probability, formulated as
\(P(S\mid X)=\sum _{\pi _z(Y)\in \Pi (Y)}P\big (\pi _z(Y)\mid X\big), \qquad (1)\)
where \(\Pi (Y)\) denotes all possible permutations of Y and \(\pi _z(Y)\) is a specific permutation in \(\Pi (Y)\). The set Y has the same size as S, containing all elements of S flattened into sub-sequences. Based on Seq2Seq learning [115], set learning optimizes the set probability through Equation (1) and reduces the calculation cost through permutation sampling, achieving good performance on IE tasks. However, as the size of the training data increases, the benefits of permutation sampling diminish, while the runtime significantly increases.
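The toy sketch below illustrates Equation (1): the probability of a label set is accumulated over (sampled) permutations of that set. The dummy scoring function stands in for a trained sequence model, so the numbers are purely illustrative.

```python
import itertools
import math
import random

def set_log_prob(label_set, seq_log_prob, num_samples=None):
    perms = list(itertools.permutations(label_set))
    if num_samples is not None and num_samples < len(perms):
        perms = random.sample(perms, num_samples)      # permutation sampling
    logs = [seq_log_prob(p) for p in perms]
    m = max(logs)                                      # log-sum-exp over orderings
    return m + math.log(sum(math.exp(l - m) for l in logs))

# stand-in for an RNN/Seq2Seq decoder's sequence log-probability
seq_log_prob = lambda seq: -sum(0.1 * i * len(w) for i, w in enumerate(seq))
print(set_log_prob(("sports", "politics", "tech"), seq_log_prob, num_samples=4))
```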
3.3 FNN-based Methods
Feedforward neural network (FNN), also known as Multi-Layer Perceptron (MLP), is a fundamental architecture where information flows in one direction, from the input layer through hidden layers to the output layer [13]. It is widely used for various machine learning tasks such as classification [48] and pattern recognition [99]. In this section, we discuss several methods handling set structured data based on FNNs.
Rezatofighi et al. [103] introduce an innovative deep FNN-based approach, Deep Perm-Set Network (DPSN), to address set prediction problems, where the outputs are sets with arbitrary permutation and cardinality. DPSN models the set distributions by defining discrete distributions for set cardinality and permutation variables, as well as a joint distribution over set elements given a fixed cardinality. In the scenario where the permutation is fixed during training, the output set elements are kept in a consistent order, which is suitable for tasks such as multi-label classification. The network predicts the cardinality and the state (i.e., existence scores) of each set element by optimizing a loss function that combines cardinality loss (e.g., the negative logarithm of a categorical distribution) with state loss (e.g., binary cross-entropy). The fixed permutation simplifies the learning process by eliminating the need to handle varying orderings of set elements. In the scenario of learning the distribution over permutations, the model addresses tasks where element order varies during training, such as object detection. DPSN approximates the marginalization over all possible permutations, sampling significant permutations and dynamically determining the best assignment (permutation) for each training instance using the Hungarian algorithm. The learning process optimizes the posterior distribution over the network parameters by jointly considering cardinality, permutation, and state losses. In the scenario where order does not matter, the permutation of output set elements is assumed to be uniformly distributed, applicable to tasks where the specific order of set elements is not important. The network optimizes the cardinality and state losses without considering element order, dynamically determining the assignment between network outputs and ground truth annotations during each SGD iteration. While DPSN is shown to be effective in experiments, its scalability is limited due to the exponential increase in the number of possible permutations as the set size grows.
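The assignment step described above can be sketched with the Hungarian algorithm as follows; the squared-error cost is our illustrative choice and not necessarily the loss used by DPSN.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_assignment_loss(pred, target):
    """pred, target: (n, d) arrays of set elements."""
    cost = ((pred[:, None, :] - target[None, :, :]) ** 2).sum(-1)   # (n, n) costs
    rows, cols = linear_sum_assignment(cost)       # optimal permutation (Hungarian)
    return cost[rows, cols].mean(), cols

pred, target = np.random.rand(4, 2), np.random.rand(4, 2)
loss, assignment = best_assignment_loss(pred, target)
print(loss, assignment)
```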
Yu et al. [139] design a simple framework built solely on Simplified Fully Connected Networks (SFCNs) for temporal set prediction of user behaviors. This framework first adopts an element embedding layer to learn the representations of set elements. Subsequently, the set representation is computed through the newly designed permutation-invariant functions and an SFCN is applied to capture temporal dependencies among sets. To enable interactions between elements within each set, the newly designed permutation-equivariant function is employed to establish relationships between elements. Following this, another SFCN is used to uncover implicit correlations across multiple embedding channels. Finally, the user representations are aggregated adaptively by average-pooling and the probability of each behavior’s occurrence in the next period-set is calculated using an adaptive fusing module. Notably, this work is the first to show that a simple architecture can effectively deal with temporal set prediction tasks.
3.4 DeepSets-based Methods
Designing novel deep learning methods to tackle set function learning problems has been an active research topic, since Zaheer et al. [140] propose the foundational framework known as DeepSets. This pioneering work establishes key design principles for deep permutation-invariant neural networks and outlines the essential components, such as permutation-equivariant feature extraction and permutation-invariant set pooling. In this section, we introduce the basic concepts of DeepSets and the subsequent advancements. This section is organized as in Figure 2(b).
3.4.1 DeepSets.
To handle learning tasks over set-structured data, Zaheer et al. [140] construct a permutation-invariant model. For a countable set X and a set Y, the function \(f:X\rightarrow Y\) is a valid set function, i.e., invariant to the permutation of elements in X, if and only if it can be decomposed into the form: \(\rho (\sum _{x\in X}\phi (x))\), where \(\phi\) and \(\rho\) are appropriate transformations. As for an uncountable set X with fixed size M, any continuous function f defined on X, i.e., \(f:\mathbb {R}^{d\times M}\rightarrow Y\), is permutation-invariant if and only if f can be approximated arbitrarily closely by a function of the form \(\rho (\sum _{x\in X}\phi (x))\). Therefore, any set function f can be represented in this formulation:
\(f(X)=\rho \Big (\sum _{x\in X}\phi (x)\Big). \qquad (2)\)
Based on this analysis, DeepSets is developed, capable of approximating any permutation-invariant function on the ground set X by using universal function approximators, such as neural networks, for the transformations \(\phi\) and \(\rho\). The model contains two main operations: (1) Each element \(x_m\) of the ground set X is transformed into a representation \(\phi (x_m)\) through the neural network \(\phi\). (2) The representations \(\phi (x_m)\) are summed to produce a single vector, which is then processed through network \(\rho\). The key idea is to aggregate all representations via summation and then apply nonlinear transformations through networks. In particular, the intermediate layers within DeepSets, such as \(\phi\), often exhibit permutation-equivariance, meaning that the processing of each element is independent of the order in which the elements are presented. This property ensures that the order of the elements does not affect their individual processing. Similarly to Definition 2.3, we formally define permutation-equivariance as follows.
Definition 3.1 (Permutation-equivariance).
For each \(\mathcal {X} \in \mathbb {R}^{N \times m}\) and \(\Pi \in \Pi _N\), if \(g(\Pi \mathcal {X})=\Pi g(\mathcal {X})\) always holds, then the function \(g:\mathbb {R}^{N \times m} \rightarrow \mathbb {R}^{N \times n}\) is permutation-equivariant.
The authors propose a novel formulation of permutation-equivariant functions, which can be represented as a neural network layer whose standard form is \(g_{\Theta }(\boldsymbol {x})=\sigma (\Theta \boldsymbol {x})\), where \(\Theta \in \mathbb {R}^{M\times M}\) is the weight matrix and \(\sigma :\mathbb {R}\rightarrow \mathbb {R}\) is a nonlinear function such as the sigmoid function. It is proved that \(g_{\Theta }:\mathbb {R}^M\rightarrow \mathbb {R}^M\) is permutation-equivariant if and only if all diagonal elements of \(\Theta\) are equal and all off-diagonal elements are tied together, i.e.,
\(\Theta =\lambda \boldsymbol {I}+\gamma (\boldsymbol {11}^\top), \qquad (3)\)
where \(\lambda ,\gamma \in \mathbb {R}\), \(\boldsymbol {1}=[1,\ldots ,1]^\top \in \mathbb {R}^M\), and \(\boldsymbol {I}\in \mathbb {R}^{M\times M}\) is an identity matrix. Therefore, the neural network \(g_{\Theta }(\boldsymbol {x})=\sigma (\Theta \boldsymbol {x})\) is permutation-equivariant if \(\Theta =\lambda \boldsymbol {I}+\gamma (\boldsymbol {11}^\top)\), i.e., \(g(\boldsymbol {x})=\sigma ((\lambda \boldsymbol {I}+\gamma (\boldsymbol {11}^\top)) \boldsymbol {x})\). The layer has several other variations when specifying the operations and parameters. In summary, the permutation-equivariant property of the intermediate layers in DeepSets ensures that each element is treated consistently regardless of its position in the set. The final symmetric aggregation combines these equivariant features in a permutation-invariant manner, ensuring that the order of elements does not affect the output.
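A minimal PyTorch sketch of the two building blocks discussed in this subsection follows: the permutation-equivariant layer of Equation (3) and the sum-decomposition of Equation (2). Layer widths and the sigmoid nonlinearity are illustrative choices, not those of the original paper.

```python
import torch
import torch.nn as nn

class EquivariantLayer(nn.Module):
    """g(x) = sigma((lambda*I + gamma*11^T) x) for a set of M scalars."""
    def __init__(self):
        super().__init__()
        self.lam = nn.Parameter(torch.tensor(1.0))    # shared diagonal weight
        self.gam = nn.Parameter(torch.tensor(0.1))    # tied off-diagonal weight

    def forward(self, x):                             # x: (M,)
        return torch.sigmoid(self.lam * x + self.gam * x.sum())

class DeepSets(nn.Module):
    """rho(sum_x phi(x)) with phi and rho as small MLPs."""
    def __init__(self, in_dim=3, hidden=64, out_dim=1):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, x):                             # x: (batch, set_size, in_dim)
        return self.rho(self.phi(x).sum(dim=1))       # sum-pooling, then rho

x, perm = torch.randn(5), torch.randperm(5)
layer, model = EquivariantLayer(), DeepSets()
assert torch.allclose(layer(x)[perm], layer(x[perm]), atol=1e-5)   # equivariance
xs = torch.randn(2, 10, 3)
assert torch.allclose(model(xs), model(xs[:, torch.randperm(10), :]), atol=1e-5)
```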
3.4.2 Generalization of DeepSets.
There are some works generalizing DeepSets. Maron et al. [80] focus on a principled approach to learning from unordered set elements, particularly when the elements themselves exhibit inherent symmetries. The authors propose the Deep Sets for Symmetric elements (DSS) framework, which generalizes DeepSets to accommodate additional symmetries of elements. The core innovation is the introduction of DSS layers, incorporating multiple linear layers L that are equivariant to the permutations of the set and the inherent symmetries of the set elements, such as translational symmetry in images and rotational symmetry in 3D shapes. This symmetry is represented by a group H that operates on the elements. Concretely, a DSS layer applies a transformation to each set element while also considering the aggregated information from the entire set. Based on Equation (3), the DSS layer for a set \(\lbrace x_1,\ldots ,x_n\rbrace \subseteq \mathbb {R}^d\) with symmetry group H and feature dimension d is defined by
\(L(x_1,\ldots ,x_n)_i=L_1^H(x_i)+L_2^H\Big (\sum _{j\ne i}x_j\Big),\)
which generalizes DeepSets by applying linear H-equivariant functions \(L_1^H, L_2^H\). The authors prove that DSS networks are universal approximators, provided that the individual element-wise networks are universal for the symmetry group H, addressing the issue that restricting a network to be invariant or equivariant may reduce the expressive power [79]. Consequently, DSS layers can represent any function that respects the symmetries of the set elements and the set itself. In summary, the DSS framework extends DeepSets to problems involving symmetric elements, providing a comprehensive and theoretically grounded approach for learning from sets with intrinsic symmetries. Murphy et al. [83] propose Janossy pooling, a novel model for constructing permutation-invariant functions. Janossy pooling provides a universal method by representing a permutation-invariant function as the average of a permutation-sensitive function applied to all possible reorderings of the input sequence. However, the computational cost of summarizing all permutations and backpropagating gradients is prohibitively high. To solve this issue, the authors develop three approximation methods to trade off complexity and generalization: (1) canonical orderings: elements of the input sequence are reordered according to a predefined criterion, reducing the computational cost by avoiding the need to consider all permutations; (2) k-ary dependencies: the permutation-sensitive function is restricted to depend only on subsets of k elements at a time, reducing the number of permutations considered while still capturing important interactions; (3) permutation sampling: during training, permutations are randomly sampled, so that only a small number of orderings need to be considered. These strategies enable Janossy pooling to unify and generalize existing methods, achieving competitive performance on various tasks compared to state-of-the-art techniques. Notably, DeepSets can be seen as a special case of Janossy pooling with 1-ary dependencies, where the function depends on individual elements without considering interactions beyond simple aggregation. In contrast, Janossy pooling allows for k-ary dependencies, capable of capturing higher-order interactions within the data.
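A rough sketch of Janossy pooling with permutation sampling follows; the GRU and the number of sampled permutations are our own illustrative choices for the permutation-sensitive function.

```python
import torch
import torch.nn as nn

class JanossySampled(nn.Module):
    def __init__(self, in_dim=3, hidden=32, num_perms=4):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)  # order-sensitive model
        self.num_perms = num_perms

    def forward(self, x):                      # x: (batch, set_size, in_dim)
        outs = []
        for _ in range(self.num_perms):
            perm = torch.randperm(x.size(1))   # a sampled reordering of the set
            _, h = self.rnn(x[:, perm, :])     # h: (1, batch, hidden)
            outs.append(h.squeeze(0))
        return torch.stack(outs).mean(dim=0)   # average over sampled orderings

model = JanossySampled()
print(model(torch.randn(2, 6, 3)).shape)       # torch.Size([2, 32])
```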
3.4.3 Theoretical Analysis of DeepSets.
There have been several works conducted to analyze the theoretical properties of DeepSets as it is a fundamental set function learning approach. Wagstaff et al. [122] refer to a permutation-invariant function f represented by the formulation of Equation (2) as sum-decomposable, where the combination \((\rho ,\phi)\) of function \(\phi :\mathbb {R}\rightarrow Z\) and function \(\rho :Z\rightarrow \mathbb {R}\) is a sum-decomposition with latent space Z for function f, namely, the function f is sum-decomposable via Z. They analyze the limitations of enforcing permutation-invariance using sum-pooling and derive a necessary condition that a sum-decomposition-based model with universal function representation should satisfy. It is demonstrated that sum-decomposition-based models can represent arbitrary continuous functions defined on a set of size N only if the dimension L of the latent space in which the summation is performed is no less than the set size N. To resolve the open question regarding the representation capabilities of high-dimensional DeepSets posed in Reference [122], Zweig et al. [153] conduct an expressive power analysis of DeepSets. They indicate that DeepSets require an exponentially large width to approximate certain symmetric functions, implying that the dimension L of the latent space grows exponentially with the size N and dimension D of the input set. This analysis demonstrates that DeepSets may be inherently inefficient for representing certain high-dimensional symmetric functions unless it is enhanced with mechanisms enabling interactions between set elements. Wang et al. [125] further reveal the relationship between the latent space dimension L and the expressive power of DeepSets [140], overcoming the limitations of previous works that focus solely on one-dimensional features or complex analytic activations, which are impractical due to the exponential growth of L with N and D. Considering high-dimensional features, i.e., \(D>1\), the bounds of the minimal latent space dimension L are proved to be divided into two categories according to the encoding network \(\phi\): (1) If \(\phi\) applies a linear layer with power mapping, then we can get \(N(D+1)\le L<N^5D^2\). (2) If \(\phi\) applies a linear layer and an exponential activation function, then we can get a tighter bound, \(ND\le L\le N^4D^2\). The proposed bounds imply that it is sufficient to model the latent space of DeepSets with L being \(\mathrm{poly}(N,D)\) for the universal approximation of set functions. It is also demonstrated that continuous mappings \(\phi\) and \(\rho\) are crucial for ensuring universal approximation of DeepSets. Table 2 compares the lower bounds obtained by different studies.
Table 2. The Comparison among Research on Expressiveness Analysis with Latent Space Dimension \(L\)
3.4.4 Proposing Novel Aggregating Methods.
There are some works trying to propose novel aggregating methods for set function learning. Aggregating inputs into a single representation is a common mechanism in set function learning, such as DeepSets, which utilizes sum-pooling to aggregate element-wise embeddings. Inspired by DeepSets, Abedin et al. [1] employ a set-based deep learning approach called SparseSense to handle the sparse data from passive sensors in human activity recognition (HAR) tasks. Unlike traditional methods that require dense data streams or rely on interpolation to estimate missing data points, SparseSense processes the sparse data directly, mitigating large estimation errors and long recognition delays. This method regards sparse sensor data as sets, allowing the model to focus on extracting discriminative features of all activity categories without relying on temporal correlations. The key idea is to apply a shared embedding network to project each set element into a higher-dimensional space, followed by featurewise maximum pooling to aggregate these embeddings into a fixed-size global representation for activity classification. SparseSense extends DeepSets to HAR, demonstrating that set-based neural networks can effectively handle irregular data points and tolerate missing information. Bartunov et al. [12] introduce an optimization-based aggregation method named Equilibrium Aggregation. This method generalizes existing pooling-based approaches, overcoming the limitations of existing techniques such as sum-pooling, which are constrained by their representational power. The Equilibrium Aggregation models the potential function \(F_\theta (x,y)\), which quantifies the discrepancy between each set element x and aggregation result y, as a learnable neural network with parameter \(\theta\). This layer architecture can be integrated into another multi-layer neural network to aggregate sets. The energy-minimization of Equilibrium Aggregation can be formulated as
\(\phi _\theta (X)=\mathop {\mathrm{arg\,min}}_{y}\Big (R_\theta (y)+\sum _{x\in X}F_\theta (x,y)\Big), \qquad (4)\)
where \(R_\theta (y)\) is a regularization term. The aggregation result y is computed by solving Equation (4) with numerical methods such as gradient descent. The neural network framework with Equilibrium Aggregation can be formulated as \(\rho (\phi _\theta (X))\), where \(\rho\) is a neural network. It is theoretically proved that this framework can universally approximate any continuous permutation-invariant function if the output of Equation (4) has the same size as the input set. Equilibrium Aggregation provides a more flexible framework than DeepSets by utilizing a learnable potential function, potentially achieving better performance in tasks that require more detailed data representation. Horn et al. [43] propose a novel framework called Set Functions for Time Series (SeFT) for classifying irregularly sampled time series. SeFT regards time-series data as a set of observations, addressing the issues of irregular sampling and unaligned measurements without requiring imputation. This model employs a set function \(f:S\rightarrow {\mathbb {R}^c}\) derived from Equation (2). Denoting \(s_j\) as a single observation of the time series S, the function f can be formulated as
\(f(S)=g\Big (\frac{1}{|S|}\sum _{s_j\in S}h(s_j)\Big),\)
where \(h:\Omega \rightarrow \mathbb {R}^d\) and \(g:\mathbb {R}^d\rightarrow \mathbb {R}^c\) are both neural networks, with h mapping observations from the domain \(\Omega\) to a d-dimensional latent space, and g further mapping this latent representation to the final c-dimensional classification space. A variant of positional encoding [120] is used for time encoding, employing multiple trigonometric functions at different frequencies to convert the 1-dimensional time t of each observation into a multi-dimensional input. To handle large observation sets and highlight the most relevant data points, a weighted mean aggregation approach based on scaled dot-product attention with multiple heads is designed to weigh different observations. This aggregation method independently calculates each element’s embedding, achieving a runtime and memory complexity of \({O}(n)\). SeFT extends the representation of DeepSets specifically to time series with irregular sampling, where the order of observations is not fixed and might not follow a regular interval.
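Returning to Equation (4), the following schematic sketch obtains the aggregate by a few gradient steps on the energy; the quadratic potential and the small regularizer are stand-ins for the learnable networks of Equilibrium Aggregation, so this illustrates the mechanism rather than reproducing the method.

```python
import torch

def equilibrium_aggregate(X, steps=50, lr=0.1):
    F = lambda x, y: ((x - y) ** 2).sum(-1)      # stand-in potential F_theta(x, y)
    R = lambda y: 0.01 * (y ** 2).sum()          # stand-in regularizer R_theta(y)
    y = torch.zeros(X.size(-1), requires_grad=True)
    opt = torch.optim.SGD([y], lr=lr)
    for _ in range(steps):                       # minimize R(y) + sum_x F(x, y)
        opt.zero_grad()
        energy = R(y) + F(X, y).sum()
        energy.backward()
        opt.step()
    return y.detach()

X = torch.randn(8, 4)                            # a set of 8 elements in R^4
print(equilibrium_aggregate(X))                  # close to the set mean here
```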
3.4.5 Extending DeepSets to Specific Scenarios.
There are multiple works proposing methods that build on the foundational concepts of DeepSets, extending these concepts to specific scenarios. Yi et al. [137] introduce CytoSet, designed to deal with sets of cells and predict the clinical outcome of patients. Since the order of cells’ profiles has no biological relevance in flow and mass cytometry experiments, CytoSet regards the cytometry data as a set and extracts information through a permutation-invariant neural network based on DeepSets. This approach predicts the clinical outcome from the patient sample represented as a set of cells, with each cell characterized by a vector of protein measurements. In the proposed model, several permutation-equivariant blocks, as described in Reference [140], are stacked to transform the representation of each set element. The output of these blocks is processed by max-pooling, which measures the presence of high response cells and produces an embedding vector for the set. This vector is then passed through fully connected layers to predict the clinical outcome. CytoSet extends the concept of DeepSets to clinical cytometry data analysis, generalizing CellCNN [7] and CytoDx [46], and achieving better experimental performance compared to them. Ou et al. [86] present equivariant variational inference for set function learning (EquiVSet) to predict set-valued outputs (subsets) that optimize a certain utility function over a given ground set under the optimal subset (OS) supervision oracle, where the optimal subset provides the maximum utility. They combine an energy-based method with DeepSets to construct an appropriate set mass function that increases monotonically with a set utility function. To enable training models on varying ground sets and overcome the instability caused by the high dimension of sets when directly optimizing likelihood, a scalable training and inference algorithm is proposed by utilizing the maximum likelihood principle in conjunction with mean-field inference as a surrogate. EquiVSet improves on DeepSets by modeling the utility function explicitly and handling more complex tasks involving OS oracles. Wang et al. [124] develop an effective model termed DTS-ERA, which combines the proposed Deep Temporal Sets (DTS) with Evidential Reinforced Attentions (ERA) to uncover the signature behavioral patterns of multimodal data in behavior analysis of children with autism spectrum disorder. DTS-ERA is implemented in the manner of few-shot learning, enabling it to effectively handle situations with limited data. DTS is a multimodal version of DeepSets, capable of capturing complex temporal and spatial relationships in multimodal data. It is composed of a temporal encoder and a spatial encoder, which generate feature representations that maintain temporal dependencies and spatial locality. These feature representations are then concatenated and aggregated through average-pooling to obtain the deep-set encoding. In ERA, DTS is combined with a reinforcement learning agent, where an evidential reward function is designed to learn an epistemic policy, which selects representative embeddings as attention signatures. ERA incorporates evidential learning to estimate uncertainty, allowing the model to distinguish between known and unknown regions effectively, thereby improving the reliability of the predictions.
3.5 PointNet-based Methods
PointNet [95] is another important and pioneering set function learning approach, particularly designed to deal with point clouds, taking the point sets as input and outputting labels. In this section, we introduce the fundamental concepts of PointNet and discuss several relevant works based on it. The structure of this section is outlined in Figure 2(c).
3.5.1 PointNet.
PointNet can directly process point clouds without converting them into regular data structures such as 3D voxel grids, maintaining the inherent properties of point clouds. The components of PointNet are similar to those of DeepSets, with the sum-pooling replaced by max-pooling. For a finite point set X and its element x, the set function \(f:2^X\rightarrow Y\), whose value corresponds to the semantic label of the point set, can be approximated by PointNet as
\(f(X)\approx \rho \Big (\underset{x\in X}{\mathrm{MAX}}\ \phi (x)\Big),\)
where \(\phi\) captures features of each point in X and \(\mathrm{MAX}\) denotes element-wise max-pooling. The aggregated features are passed to \(\rho\) to obtain the output. Both \(\rho\) and \(\phi\) are neural networks or other parameterized models with learnable parameters. This framework is permutation-invariant, because max-pooling yields the same output for any ordering of the points in the point set. It is theoretically proved that PointNet is capable of approximating any continuous set function if the max-pooling layer contains enough neurons.
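A minimal PointNet-style sketch of the formulation above follows; it keeps only the shared point-wise encoder, the max-pooling, and a classifier head, omitting the input and feature transform networks of the full model, with illustrative layer sizes.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, hidden=64, num_classes=10):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))     # shared per point
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_classes))

    def forward(self, points):                  # points: (batch, n_points, 3)
        return self.rho(self.phi(points).max(dim=1).values)    # max-pooling

net = TinyPointNet()
cloud = torch.randn(2, 1024, 3)
perm = torch.randperm(1024)
assert torch.allclose(net(cloud), net(cloud[:, perm, :]))      # order-free output
```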
3.5.2 Theoretical Analysis of PointNet.
The work in Reference [18] explores the expressive power of neural networks that use set pooling mechanisms. The authors introduce and analyze a variety of set pooling architectures, such as sum-pooling (DeepSets), max-pooling (PointNet), and average-pooling (normalized-DeepSets). The theoretical analysis reveals that PointNet cannot generally approximate averages of continuous functions over sets (e.g., center-of-mass), and that DeepSets is strictly more expressive than PointNet in the constant cardinality setting. This finding implies that the choice of set pooling function has a dramatic impact on the expressiveness of these networks. Unexpectedly, it is also proved that any function that can be uniformly approximated by both PointNet and normalized-DeepSets must be constant in the unbounded cardinality setting.
3.5.3 Improving Capabilities of PointNet.
There are several works that enhance PointNet, extending its applicability to more complex scenarios. To overcome the limitation that PointNet cannot learn local structures at various scales, Qi et al. [96] develop a hierarchical neural network named PointNet++, which applies PointNet recursively to nested partitions of the input point set. PointNet++ contains multiple set abstraction levels, including a sampling layer, a grouping layer, and a PointNet layer. The sampling layer chooses points to define local regions’ centroids, around which the grouping layer explores neighboring points to build local regions. Then the PointNet layer encodes local region patterns into feature vectors. In particular, the grouping layer has two implementations: multi-scale grouping and multi-resolution grouping. These methods are capable of adaptively aggregating multi-scale features with respect to corresponding point densities, thereby eliminating the impact of varying point set densities on different regions. Generally, PointNet++ begins by extracting local features that capture fine geometric structures within small neighborhoods through PointNet. These local features are then grouped into larger units and further processed to generate higher-level features. This hierarchical process is repeated iteratively until the comprehensive features of the entire point set are obtained, realizing both robustness and detail capture. While PointNet uses a global max-pooling operation to aggregate features from the entire point set, PointNet++ enhances it by introducing a multi-scale hierarchical learning process, expanding its capabilities to capture detailed local structures and handle varying point densities. In contrast to PointNet, which directly processes a point cloud by considering each point independently and utilizing max-pooling to aggregate global features, Prokudin et al. [92] design a type of residual representation termed basis point sets (BPS), which can encode a point cloud into a fixed-length vector, enabling the use of standard machine learning techniques. To construct the BPS, the point clouds are normalized to fit a ball with radius \(r\in \mathbb {R}\), from which k points are randomly sampled from a uniform distribution to obtain the basis point set. By calculating the minimal distance from each basis point to the nearest point in the point cloud, we obtain a feature vector for every point cloud. These feature vectors can be taken as inputs to learning algorithms. The point cloud classification experiments demonstrate that the framework combining MLP with BPS achieves performance comparable to PointNet, while significantly reducing the number of parameters and computational complexity.
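A small sketch of the BPS encoding described above follows; for simplicity the basis points here are drawn uniformly from a cube rather than the ball used in the original work, and each feature is the distance from a basis point to its nearest cloud point.

```python
import numpy as np

def bps_encode(cloud, k=32, seed=0):
    """cloud: (n, 3) array of points; returns a fixed-length (k,) feature vector."""
    cloud = cloud - cloud.mean(axis=0)
    cloud = cloud / np.linalg.norm(cloud, axis=1).max()       # normalize to unit ball
    rng = np.random.default_rng(seed)                         # fixed seed so the same
    basis = rng.uniform(-1.0, 1.0, size=(k, 3))               # basis is reused per cloud
    dists = np.linalg.norm(cloud[None, :, :] - basis[:, None, :], axis=-1)
    return dists.min(axis=1)                                  # nearest-point distances

print(bps_encode(np.random.rand(500, 3)).shape)               # (32,)
```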
3.6 Set Transformer–based Methods
In this section, we introduce Set Transformer [64], a powerful neural network designed for learning functions on sets, and discuss several set function learning methods built upon it. This section is organized as in Figure 2(d).
3.6.1 Set Transformer.
Set Transformer [64] is an attention-based neural network method capable of modeling interactions across input set elements, which are often overlooked by set pooling methods such as DeepSets [140] and PointNet [95]. Based on Transformer [120], Set Transformer employs permutation-equivariant self-attention to capture pairwise and higher-order interactions between elements. Suppose that \(Q,R\in \mathbb {R}^{n\times d}\) are the query set and value set, respectively, consisting of n d-dimensional vectors. To construct the Set Attention Block (SAB), the authors employ the Multihead Attention Block (MAB), which is a variant of the Transformer’s encoder, with positional encoding and dropout removed. Given matrices \(X,Y\in \mathbb {R}^{n\times d}\), the MAB with parameter \(\omega\) is defined as follows:
\(\mathrm{MAB}(X,Y)=\mathrm{LN}(H+\mathrm{rF}(H))\ \text{ with }\ H=\mathrm{LN}(X+\mathrm{Multihead}(X,Y,Y;\omega)),\)
where \(\mathrm{LN}\) is layer normalization [8] and \(\mathrm{rF}\) is an arbitrary row-wise feed-forward layer. The SAB can be formulated as \(\mathrm{SAB}(X)=\mathrm{MAB}(X,X)\). The higher-order interactions of elements can be captured by stacking multiple SABs. To reduce the high computational cost associated with self-attention, the Induced Set Attention Block (ISAB) is designed based on SAB and inspired by inducing point methods used in sparse Gaussian processes. The ISAB containing m inducing points I, i.e., m trainable d-dimensional vectors \(I\in \mathbb {R}^{m\times d}\), can be formulated as
\(\mathrm{ISAB}_m(X)=\mathrm{MAB}(X,h)\in \mathbb {R}^{n\times d}\ \text{ with }\ h=\mathrm{MAB}(I,X)\in \mathbb {R}^{m\times d},\)
where h is invariant to permutations of X and \(\mathrm{ISAB}_m(X)\) is permutation-equivariant to X. In this way, the computational time is reduced from \(O(n^2)\) in SAB to \(O(mn)\) in ISAB. The Pooling by Multihead Attention (PMA) with k seed vectors \(S\in \mathbb {R}^{k\times d}\), i.e., \(\mathrm{PMA}_k(Z)=\mathrm{MAB}(S,\mathrm{rF}(Z))\), is developed to aggregate the encoded feature set \(Z\in \mathbb {R}^{n\times d}\). This mechanism allows the model to adaptively weigh the importance of different elements in the set, which is particularly useful in scenarios requiring multiple correlated outputs, such as clustering tasks. Generally speaking, in Set Transformer, the input set is encoded by a stack of SABs or ISABs, followed by aggregation using PMA. The aggregated representation is then passed through a feed-forward network to produce the output. It is theoretically proved that Set Transformer is capable of universally approximating any set function.
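The compact sketch below mirrors the MAB and SAB definitions above using PyTorch's built-in multi-head attention; the hidden size, head count, and feed-forward layer are illustrative choices rather than the original configuration, and dropout and positional encoding are omitted as in the Set Transformer.

```python
import torch
import torch.nn as nn

class MAB(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.rff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())   # row-wise FF

    def forward(self, X, Y):
        H = self.ln1(X + self.attn(X, Y, Y, need_weights=False)[0])
        return self.ln2(H + self.rff(H))

class SAB(nn.Module):                            # SAB(X) = MAB(X, X)
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.mab = MAB(dim, heads)

    def forward(self, X):
        return self.mab(X, X)

sab = SAB()
X = torch.randn(2, 10, 64)                       # a batch of sets of 10 elements
print(sab(X).shape)                              # torch.Size([2, 10, 64])
```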
3.6.2 Improving Set Transformer.
Through gradient analysis, Zhang et al. [144] show that DeepSets and Set Transformer are prone to vanishing and exploding gradients when more layers are stacked. They also observe that layer normalization hurts performance, because its invariance reduces representation power and discards information that is potentially useful for prediction. To tackle these issues and make set neural networks deeper, DeepSets++ (DS++) and Set Transformer++ (ST++) are developed by introducing equivariant residual connections (ERC) and set norm. ERC is a refined residual connection adhering to the clean-path principle, avoiding potential gradient issues by maintaining a clean path from input to output. Set norm is a novel normalization layer that standardizes each set over the minimal number of dimensions and transforms features individually. This mechanism preserves most of the mean and variance information, avoiding the invariance issues associated with layer normalization. By integrating ERC and set norm into the encoders of DeepSets and Set Transformer, respectively, the enhanced models DS++ and ST++ are constructed. These improvements enable the models to reach greater depth with comparable performance, effectively addressing the instability of the original architectures.
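As a rough illustration of the set norm idea, the sketch below normalizes each set jointly over its element and feature dimensions and then applies a per-feature affine transform; the exact parameterization in [144] may differ, so this should be read as an assumption-laden approximation rather than the authors' layer.

```python
import torch
import torch.nn as nn

class SetNorm(nn.Module):
    """Normalize each set over all of its elements and features jointly,
    then apply a learnable per-feature scale and shift."""
    def __init__(self, d, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d))
        self.beta = nn.Parameter(torch.zeros(d))
        self.eps = eps

    def forward(self, x):              # x: (batch, n, d), one set per batch entry
        mean = x.mean(dim=(1, 2), keepdim=True)
        var = x.var(dim=(1, 2), keepdim=True, unbiased=False)
        return (x - mean) / torch.sqrt(var + self.eps) * self.gamma + self.beta

print(SetNorm(16)(torch.randn(4, 10, 16)).shape)  # torch.Size([4, 10, 16])
```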
3.6.3 Employing Set Transformer as Encoder.
There are several works employing Set Transformer as an encoder to construct new models for set function learning. Li et al. [67] propose an effective recipe representation learning model named Reciptor, which jointly processes ingredients and cooking instructions. The ingredient set is encoded by a Set Transformer, enhancing the model's ability to capture interdependence among elements. A pretrained skip-instruction model is employed to encode the cooking instructions, generating initial embeddings and providing a context-aware representation of the entire cooking process. These initial embeddings are subsequently processed by a forward long short-term memory network to produce the final instruction embeddings. To further optimize the learned embeddings, the authors utilize a novel knowledge graph-based triplet sampling loss [22], ensuring that semantically related recipes are closer in the latent space. The embeddings are refined by combining this triplet loss with a cosine similarity loss between ingredient and instruction embeddings. Reciptor outperforms baselines on two newly designed downstream classification tasks. Based on Set Transformer, Gim et al. [38] design a set-based cooking recommender called RecipeBowl, which takes a given set of ingredients and cooking tags and outputs corresponding ingredient and recipe choices. Set Transformer is employed as the encoder to build a comprehensive representation of the ingredient set. A two-way decoder maps the representation into two distinct embedding spaces: one for predicting missing ingredients and the other for recommending relevant recipes. The model is trained using a combination of a negative log-likelihood loss based on Euclidean distances and a cosine embedding loss for the recipe prediction task, ensuring that the predicted ingredients and recipes are aligned with their actual counterparts in the embedding space.
Zhang et al. [142] present efficient algorithms to address the challenges of relational reasoning in cooperative Multi-Agent Reinforcement Learning (MARL) with permutation-invariant agents. They leverage Set Transformer to implement complex relational reasoning among agents in MARL. Two algorithms are proposed: a model-free and a model-based offline MARL algorithm. The model-free approach employs transformers to estimate the action-value function, incorporating a pessimistic policy to handle distributional shifts in offline settings. The model-based approach estimates the system dynamics with transformers, also utilizing a pessimistic policy. The key contribution is deriving generalization error bounds for transformers in MARL, demonstrating that these bounds are independent of the number of agents and less sensitive to the depth of the network. Jurewicz et al. [55] develop the Set Interdependence Transformer (SIT), an efficient set encoder for set-to-sequence tasks. The set-to-sequence model is established by combining SIT with a permutation decoder. A Set Transformer serves as the base set encoder, learning permutation-equivariant representations of individual elements and a permutation-invariant representation of the entire set. SIT enhances these representations with an augmented attention mechanism to capture higher-order interdependencies. The permutation decoder uses an improved pointer attention mechanism to select elements, forming coherent output sequences. This approach effectively handles sets of varying cardinalities and generalizes well to unseen set sizes, as shown in experiments.
3.6.4 Extending Set Transformer to Meta-learning.
Lee et al. [66] propose Meta-Interpolation, a universal task augmentation method designed for few-task meta-learning. Meta-Interpolation utilizes Set Transformer to process the embeddings of support and query sets from different tasks and learn a parameterized set function, mapping sets of task embeddings to new embeddings that mix features from different tasks. This process creates new tasks that have unique features drawn from the tasks being interpolated. Bilevel optimization is employed to jointly optimize parameters of the meta-learner and the set function. The upper-level optimization aims to minimize the loss on meta-validation tasks, ensuring that this augmentation strategy improves generalization. The lower-level optimization adapts the meta-learner to augmented tasks, reducing the risk of overfitting to the limited meta-training set. This method theoretically regularizes the meta-learner by enforcing a distribution-dependent regularization, which decreases the Rademacher complexity and thus improves the generalization.
3.6.5 Other Methods Utilizing Attention Mechanisms.
There are multiple works utilizing attention mechanisms to learn set functions, similar to Set Transformer. Girgis et al. [39] develop an encoder–decoder framework, Latent Variable Sequential Set Transformers, termed AutoBots, to deal with the challenging task of predicting the future trajectories of multiple interacting agents. The permutation-equivariant encoder processes sequences of sets representing the agents' states over time, incorporating both temporal and social information through multiple Multi-Head Self-Attention modules. The decoder utilizes multiple matrices of learnable seed parameters, enabling the model to capture the multi-modal nature of future trajectories. This allows for the generation of diverse and socially consistent predictions across the entire scene in a single forward pass. The model achieves state-of-the-art performance, particularly in trajectory predictions that adhere to real-world constraints such as road layouts. Zhao et al. [150] propose Point Transformer, a novel architecture tailored for unordered 3D point sets. Point Transformer mainly consists of SortNet and a local-global attention module. SortNet is a neural network that learns to sort the input point cloud into a specific order based on selected features. Once the points are sorted, the Point Transformer layer applies local attention to aggregate features of each point from its k nearest neighbors, capturing fine-grained details and local geometric structures within small regions. Following the local attention, global attention aggregates features from the entire point cloud or larger regions, complementing the local information and enabling the network to understand the overall structure. The outputs from the local and global attention modules can be combined, either through concatenation or a weighted sum, to form a comprehensive feature representation for each point, which can be used in downstream tasks for learning the underlying shape.
3.7 Deep Set Prediction Network-based Methods
This section introduces the Deep Set Prediction Network (DSPN), an effective method for set prediction problems, and discusses relevant works that build on DSPN to enhance its capabilities. The structure of this section is illustrated in Figure 2(e).
3.7.1 Deep Set Prediction Network.
DSPN [145] is a model designed to predict sets from feature vectors, addressing the issue that previous methods such as RNNs produce discontinuous and inaccurate predictions because they ignore the unordered nature of sets. DSPN employs the same encoder module for both the encoding and decoding processes. Concretely, the encoder \(g_{\text{enc}}\) maps the input set X into a latent representation \(z=g_{\text{enc}}(X)\). The decoder \(g_{\text{dec}}\) predicts a set from this representation, i.e., \(\hat{X}=g_{\text{dec}}(z)\), by applying gradient descent from a learnable initial guess to find a set whose latent representation matches that of the input set. This process can be regarded as a nested optimization. In the inner loop, the predicted set is refined iteratively to minimize the difference between its encoding and the target representation, while in the outer loop, the weights of the model are trained by minimizing the loss between the predicted set and the true set. Formally, the representation loss and decoder are defined as
\[L_{\text{repr}}(\hat{X},z)=\big \Vert g_{\text{enc}}(\hat{X})-z\big \Vert ^2,\qquad g_{\text{dec}}(z)=\arg \min _{\hat{X}}L_{\text{repr}}(\hat{X},z),\qquad (7)\]
where the permutation-invariant \(L_{\text{repr}}\) compares the encoding of \(\hat{X}\) with the latent representation z of X. Since \(g_{\text{enc}}\) is a neural network, gradient descent is applied for T steps to solve the minimization in Equation (7), starting from an initial set \(\hat{X}^{(0)}\). At the same time, the weights of \(g_{\text{enc}}\) are trained to minimize the set loss \(L_{\text{set}}(\hat{X}^{(T)},Y)\), where \(L_{\text{set}}\) can be a Chamfer loss or a pairwise loss, so as to obtain an appropriate representation z. In general set prediction, there is no set encoder, since the input is usually a vector instead of a set; in this case, a term is added to the outer-loop loss to ensure \(g_{\text{enc}}(Y)\approx z\). DSPN shows significant improvements over traditional methods, particularly in providing accurate set predictions without requiring complex postprocessing, opening up new possibilities for set prediction problems.
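A minimal sketch of the DSPN inner loop is given below, assuming a generic differentiable permutation-invariant encoder; the hyperparameters and the toy sum-pooling encoder are illustrative stand-ins, not the configuration used in [145].

```python
import torch

def dspn_decode(encoder, z, init_set, inner_steps=10, lr=1.0):
    """Inner loop of DSPN: refine a predicted set so that its encoding matches z.

    encoder : permutation-invariant set encoder g_enc
    z       : target latent representation, shape (d_latent,)
    init_set: learnable initial guess, shape (n, d_elem)
    """
    pred = init_set.clone().requires_grad_(True)
    for _ in range(inner_steps):
        repr_loss = ((encoder(pred) - z) ** 2).sum()          # L_repr
        (grad,) = torch.autograd.grad(repr_loss, pred, create_graph=True)
        pred = pred - lr * grad                               # gradient step on the set
    return pred  # the outer loop compares pred to the target set with a set loss

# Minimal usage with a sum-pooling encoder (an illustrative stand-in for g_enc).
encoder = lambda s: torch.tanh(s @ torch.ones(3, 8)).sum(dim=0)
target_z = torch.randn(8)
init = torch.zeros(5, 3)
predicted_set = dspn_decode(encoder, target_z, init)
print(predicted_set.shape)  # torch.Size([5, 3])
```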
3.7.2 Improving DSPN.
There are several works that improve DSPN and extend its application to specific scenarios. Zhang et al. [146] design a differentiable set pooling method called FSPool, which consists of two operations, sorting and weighted summation. Instead of treating set elements as whole units, FSPool sorts each feature independently across the set elements. After sorting, a weighted sum is computed, with the weights determined by a learnable calibrator function. FSPool can handle sets of different sizes using a continuous representation of the weights. In experiments involving both bounding box and state prediction, the authors combine DSPN and other models, such as an MLP, with FSPool, max-pooling, and sum-pooling. The results indicate that simply replacing the pooling function in an existing model with FSPool leads to better results and faster convergence. By replacing the gradient descent updates of DSPN with a transformer that provides more expressive and efficient updates, Kosiorek et al. [58] propose the Transformer Set Prediction Network (TSPN), where an MLP is utilized to predict the number of points from the input embedding and decide the size of the initial predicted set. TSPN initializes the predicted set with a random set of points sampled from a learned distribution, enhancing flexibility. The transformer is employed to iteratively update set elements, leveraging the self-attention mechanism to model dependencies between elements and output the predicted set. Compared to DSPN, TSPN achieves greater expressiveness at lower computational cost. Zhang et al. [141] develop a framework called Deep Energy-based Set Prediction (DESP), which treats set prediction as a problem of conditional density estimation rather than optimization with set-specific losses. This method utilizes deep energy-based models to capture the distribution of sets given some input features. The energy function \(E_\theta (x,Y)\) assigns a scalar energy to a pair of input features x and a set Y, where a lower energy indicates a higher likelihood of the set. Given the input, the probability of a set is \(P_\theta (Y|x)=\frac{1}{Z(x;\theta)}\exp (-E_\theta (x,Y))\), where \(Z(x;\theta)\) is a partition function. In the proposed framework, two permutation-invariant energy functions \(E_{DS}(x,Y)\) and \(E_{SE}(x,Y)\) are derived from DeepSets and DSPN, respectively. These energy functions can be used to formulate deep energy-based models, which can be trained by minimizing the negative log-likelihood, allowing for the approximation of the true data distribution without requiring explicit pairwise comparison between predicted and ground truth sets. DESP utilizes a stochastically augmented prediction algorithm, which helps explore multiple modes to generate diverse outputs. The final predicted set is determined as the set with the lowest energy among all captured sets. DESP extends the capabilities of DSPN to more effectively handle the inherent complexity and stochasticity of real-world tasks.
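To illustrate the core idea of FSPool, the sketch below implements a fixed-size variant that sorts each feature independently across the set and applies a learned weighted sum; the original method instead uses a continuous piecewise-linear calibrator to handle variable set sizes, which we omit here.

```python
import torch
import torch.nn as nn

class SimpleFSPool(nn.Module):
    """Featurewise sort pooling for fixed-size sets: sort every feature
    independently across the set, then take a learned weighted sum.
    (The original FSPool uses a continuous calibrator for variable sizes.)"""
    def __init__(self, n_elements, d_features):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(n_elements, d_features))

    def forward(self, x):                       # x: (batch, n_elements, d_features)
        sorted_x, _ = torch.sort(x, dim=1, descending=True)  # per-feature sort
        return (sorted_x * self.weights).sum(dim=1)          # (batch, d_features)

pooled = SimpleFSPool(10, 16)(torch.randn(4, 10, 16))
print(pooled.shape)  # torch.Size([4, 16])
```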
Zhang et al. [148] define a relaxation of common set equivariance, the multiset equivariance, which does not require equal elements in a multiset to remain equal after transformation. This property is crucial for handling multisets with duplicate elements, enabling models to process them with different strategies. Additionally, the authors propose exclusive multiset equivariance, which describes models that are multiset equivariant but not set equivariant, aiming to tackle the issue that set-equivariant functions cannot represent certain functions on multisets. It is proved that DSPN satisfies the exclusive multiset equivariance when selecting the appropriate set encoder. To reduce memory and computational requirements, the implicit DSPN (iDSPN) is developed by employing approximate implicit differentiation to replace the gradient descent of DSPN. This method avoids storing intermediate gradient steps by directly computing the gradient at the optimal point, making the optimization process more efficient. iDSPN shows superior performance compared to traditional set-equivariant models, especially in handling multisets and large-scale set prediction tasks.
3.8 Deep Submodular Function-based Methods
Submodular functions form an important subclass of set functions, and learning submodular functions is therefore an important area within set function learning. In this section, we introduce Deep Submodular Functions (DSF), which focus on learning submodular functions, and discuss related research that improves the capabilities of DSF. The structure of this section is outlined in Figure 2(f). We begin by introducing the definition of a submodular function.
Definition 3.2 (Submodular Function).
Given a ground set V and a set function \(f: 2^V \rightarrow \mathbb {R}\), f is submodular if for any two sets \(A,B\subseteq V\), it holds that \(f(A)+f(B) \ge f(A\cup B)+f(A\cap B)\). In particular, if \(f(A) + f(B) = f(A\cup B) + f(A\cap B)\) holds for all \(A,B\subseteq V\), then f is modular.
Submodular functions are extensively used in machine learning [28, 49, 64, 140] and have several important properties: (1) Diminishing returns: For any two sets \(A,B\) such that \(A \subseteq B \subseteq V\) and any element \(s\notin B\), a submodular function f satisfies \(f(A\cup \lbrace s\rbrace)-f(A) \ge f(B\cup \lbrace s\rbrace)-f(B)\), which means the incremental gain of adding an element to a set decreases as the set becomes larger. (2) Natural concavity: Submodular functions are viewed as the discrete analog of concave functions, because the property of diminishing returns is akin to the definition of concavity. (3) Modularity: This property implies additivity, meaning that a modular function f satisfies \(f(A\cup B)=f(A)+f(B)\) for disjoint sets A and B. Modularity simplifies machine learning tasks by ensuring linearity [14, 27], particularly in feature selection [28], where it facilitates the computation of the relevance of feature sets. (4) Monotonicity: The value of a monotone submodular function does not decrease when additional elements are added to a set.
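The diminishing-returns property can be checked concretely on a coverage function, a classic monotone submodular function; the toy ground set below is purely illustrative.

```python
# Coverage functions are a classic example of monotone submodular functions:
# f(A) = |union of the sets indexed by A|.
ground = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6},
    "d": {6, 7},
}

def coverage(A):
    covered = set()
    for key in A:
        covered |= ground[key]
    return len(covered)

A = {"a"}
B = {"a", "b", "c"}          # A is a subset of B
s = "d"

gain_A = coverage(A | {s}) - coverage(A)   # marginal gain of s w.r.t. the smaller set
gain_B = coverage(B | {s}) - coverage(B)   # marginal gain of s w.r.t. the larger set
print(gain_A, gain_B)        # 2 1  -> gain_A >= gain_B (diminishing returns)
```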
3.8.1 Deep Submodular Functions.
DSFs [28] are a special class of submodular functions that strictly generalizes many existing submodular functions and inherits some properties from them. For example, DSF can represent decomposable submodular functions, which can be expressed as sums of concave functions composed with modular functions. A notable subclass of these functions is the feature-based submodular function [128], which can be formulated as \(f(X)=\sum _{u\in U}w_u\phi _u(m_u(X))\), where \(\phi _u\) denotes a non-decreasing univariate normalized concave function, \(m_u\) represents a feature-specific modular function, and \(w_u\) is a feature weight, all of which are non-negative. To overcome the limitation that features themselves cannot interact in feature-based submodular functions, an additional layer of nested concave functions is employed, i.e., \(f(X)=\sum _{s\in S}\omega _s\phi _s(\sum _{u\in U}w_{s,u}\phi _u(m_u(X)))\), where S denotes a set of meta-features, \(\omega _s\) represents a meta-feature weight, and \(\phi _s\) is a non-decreasing concave function. The term \(w_{s,u}\) is the weight of feature u under meta-feature s. By recursively applying such layers, we can derive the DSF. Consider a series of disjoint sets \(V^{(0)},V^{(1)},V^{(2)},\ldots ,V^{(K)}\), where \(V^{(0)}\) is the ground set, \(V^{(1)}\) is the feature set, \(V^{(2)}\) is the meta-feature set, \(V^{(3)}\) is the meta-meta-feature set, and so on up to \(V^{(K)}\), with each set representing a layer. Denoting the size of \(V^{(i)}\) as \(d^i=|V^{(i)}|\), we can employ a matrix \(w^{(i)}\in \mathbb {R}_+^{d^i\times d^{i-1}}, i\in \lbrace 1,2,\ldots ,K\rbrace\) to connect two consecutive layers. The element at row \(v^i\) and column \(v^{i-1}\) of \(w^{(i)}\) is \(w_{v^i}^{i}(v^{i-1})\), so that each row \(w_{v^i}^{i}:V^{(i-1)}\rightarrow \mathbb {R}_+\) defines a modular function over \(V^{(i-1)}\); the matrix thus contains \(d^i\) such modular functions. Moreover, given non-negative non-decreasing concave functions \(\phi _{v^i}:\mathbb {R}_+\rightarrow \mathbb {R}_+\) and any set \(A\subseteq V^{(0)}\), a K-layer DSF \(f:2^{V^{(0)}}\rightarrow \mathbb {R}_+\) can be formulated as
\[f(A)=\phi _{v^{K}}\Bigg (\sum _{v^{K-1}\in V^{(K-1)}}w_{v^{K}}^{K}(v^{K-1})\,\phi _{v^{K-1}}\Bigg (\cdots \sum _{v^{1}\in V^{(1)}}w_{v^{2}}^{2}(v^{1})\,\phi _{v^{1}}\Big (\sum _{a\in A}w_{v^{1}}^{1}(a)\Big)\cdots \Bigg)\Bigg),\]
where the innermost sum evaluates the first-layer modular functions \(w_{v^{1}}^{1}\) on A.
Having shown the definition of DSF, it can be seen that a DSF is composed of multiple layers, with each layer taking a non-negative linear combination of the previous layer's outputs followed by the application of a concave function. This hierarchical structure enables DSF to capture complex interactions within the data. The layered structure of DSF shares similarities with deep neural networks (DNNs), allowing DNN learning techniques to be extended to DSFs. The authors utilize a max-margin learning approach tailored to maintain submodularity when training DSFs. This learning process adjusts the parameters of the DSF to maximize a margin-based objective, ensuring that the learned function assigns high values to desired subsets while penalizing undesired ones.
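As an illustration of this layered construction, the sketch below evaluates a two-layer DSF with square-root concave functions and random non-negative weights; all sizes and the choice of concave function are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_ground, n_features, n_meta = 6, 4, 2
W1 = rng.random((n_features, n_ground))   # non-negative modular weights (layer 1)
W2 = rng.random((n_meta, n_features))     # non-negative meta-feature weights (layer 2)
w_out = rng.random(n_meta)                # final non-negative mixture weights

phi = np.sqrt                             # non-decreasing concave function

def dsf(A):
    """Two-layer deep submodular function evaluated on a subset A of {0,...,5}."""
    indicator = np.zeros(n_ground)
    indicator[list(A)] = 1.0
    layer1 = phi(W1 @ indicator)          # concave over modular functions
    layer2 = phi(W2 @ layer1)             # concave over non-negative combinations
    return float(w_out @ layer2)

# Diminishing returns still holds for the composed function.
print(dsf({0, 1, 5}) - dsf({0, 1}), ">=", dsf({0, 1, 2, 3, 5}) - dsf({0, 1, 2, 3}))
```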
3.8.2 Extending DSF to Specific Scenarios.
There are some works that extend DSF to specific scenarios and improve its capabilities. Ghadimi et al. [35] propose a novel model called the deep submodular network (DSN), which combines the principles of deep learning with submodular optimization for multi-document summarization. DSN is similar to DSF, with the key difference that DSN employs both modular and submodular functions to construct network blocks, whereas DSF only utilizes modular functions. Consequently, DSN generalizes DSF, making it applicable in a wider range of scenarios. DSN is trained with the L-BFGS-B algorithm [20], which is memory-efficient and suitable for maintaining non-negative weights. Manupriya et al. [76] design a novel approach called Submodular Ensembled Attribution for Neural Networks (SEA-NN), which aims to interpret the contribution of each input feature to a neural network's output, particularly in image-based tasks. The core component of SEA-NN is a submodular score function learned by a DSF, which combines several existing gradient-based attribution methods, such as Integrated Gradients and Smooth Integrated Gradients, to offset the biases of the individual methods. It is trained using heatmaps generated by baseline attribution methods, so as to increase the scores of features that are highly relevant and specific. The learned scoring function re-evaluates the importance of input features by assessing the marginal gain of each feature, reducing the attribution scores of redundant features that may be present in the raw attribution maps. SEA-NN is model-agnostic and can be applied to various scenarios.
3.8.3 Addressing the Limitation of DSF.
DSF models submodular functions as nested compositions of concave functions over modular functions, but it does not provide a method for selecting these concave functions, complicating its practical application. To bridge this gap, De et al. [25] introduce a novel family of neural networks, FLEXSUBNET, to estimate both monotone and non-monotone submodular functions. FLEXSUBNET models submodular functions by recursively applying concave functions to modular functions and allows these concave functions to be learned from data, enhancing expressiveness. The core of FLEXSUBNET is a simple recursive chain, a restricted topology in which each node of the chain shares the same learnable concave function. Depending on the monotonicity of the target function, two scenarios are considered: (1) A monotone submodular function is learned through a recursive model that, at each step, computes a linear combination of a previously computed submodular function and a modular function, which is then passed through a learnable concave function to produce a composed submodular function. (2) A non-monotone submodular function is also learned through a recursive model, where a non-monotone concave function is applied to a modular function. The model can be trained using (set, value) pairs or (perimeter-set, high-value-subset) pairs, with applications in subset selection tasks where high-value subsets need to be extracted from larger sets.
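The recursive-chain idea can be sketched as follows, with a fixed mixing coefficient and a fixed square-root concave function standing in for the learned components of FLEXSUBNET; this is an illustrative approximation of the monotone case, not the authors' architecture.

```python
import numpy as np

def flexsubnet_like(A, modular_weights, lam=0.5, phi=np.sqrt):
    """Recursive chain sketch: at each step, combine the previous submodular value
    with a new non-negative modular function and pass through a concave function."""
    f = 0.0
    for w in modular_weights:            # one modular function per recursion step
        m = float(w[list(A)].sum()) if A else 0.0
        f = phi(lam * f + (1.0 - lam) * m)
    return f

rng = np.random.default_rng(1)
weights = rng.random((3, 6))             # three steps over a ground set of size 6
print(flexsubnet_like({0, 2, 4}, weights))
```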
3.9 Other Deep Learning Methods
In this section, we introduce other deep learning methods for set function learning, such as GaitSet and RepSet, which do not fall into the preceding categories.
Skianis et al. [113] propose RepSet, a novel permutation-invariant neural network designed to address learning problems over sets of vectors. This model generates several hidden sets, with each containing a set of d-dimensional vectors. The correspondence between the input set and these hidden sets is established using a bipartite matching algorithm. These hidden sets can be updated through backpropagation during training to obtain the representation, which is then passed to a fully-connected layer to compute the output. In addition, ApproxRepSet, a relaxed version of RepSet that leverages fast matrix computations, is designed to handle large sets efficiently.
Li et al. [69] design an ordinary differential equation (ODE)-based method called Exchangeable Neural ODE (ExNODE), capable of capturing the interdependencies between set elements. ExNODE can be applied to both generative and discriminative tasks. For generative tasks, the method employs continuous normalizing flows to model the distribution of sets and generate new samples. In set classification tasks, the embedding vector v of input x is \(v=\mathrm{MaxPool}(\mathrm{ExNODE}\:\mathrm{Solve}(\phi (x)))\), where the linear function \(\phi\) expands the feature dimensions of each input set element and the ExNODE learns the feature representations, which are aggregated by max-pooling. This model achieves fewer parameters and greater efficiency in point cloud classification and set likelihood estimation tasks.
Chao et al. [21] develop a gait recognition network called GaitSet, which regards a gait as a set of independent frames containing gait silhouettes. In this framework, a CNN extracts features from each frame of the gait set independently, capturing detailed spatial information. A Multilayer Global Pipeline extracts features at different levels from multiple layers of the CNN, combining them to form a comprehensive representation that preserves gait details. These frame-level features are then aggregated into set-level features by set pooling operations such as mean-pooling and attention mechanisms. The set-level features are further processed by Horizontal Pyramid Mapping, which splits the feature map into strips at multiple scales. This approach allows the model to capture both global and local features, enhancing the discriminative power of the representation. The experiments demonstrate the model's effectiveness with a limited number of frames and its ability to integrate information from different levels.
Shi et al. [112] propose a novel estimation method called Deep Message Passing on Sets (DMPS), which is designed to handle set-structured data by incorporating relational learning, bridging the gap between learning on graphs and learning on sets. This method begins by constructing a latent graph that represents the relational structure between set elements. This is achieved through a deep kernel learning approach, where each set element is transformed into a feature space and a kernel function is applied to capture similarities between elements. Message passing is adopted to update each set element based on a weighted sum of all elements, leveraging relational information. In particular, a stack of message-passing layers is utilized to capture higher-order dependencies between set elements, but this can result in over-smoothing and vanishing gradients. To address these issues, two modules, the Set-Denoising and Set-Residual blocks, are designed to integrate with DMPS. The Set-Denoising block alleviates over-smoothing by combining the original and updated features of set elements, while the Set-Residual block maintains distinctive features among set elements, preventing feature homogenization.
Guo et al. [24] develop a prototype-oriented optimal transport (POT) approach to improve representation learning for set-structured data. In this framework, a set of learnable global prototypes is maintained; for each set j, a distribution \(Q_j\) over these prototypes is computed, and the set is represented by a vector \(h_j\) indicating the mixture of prototypes relevant to it. To align \(Q_j\) with the empirical distribution \(P_j\) over the set's elements, the model employs an Optimal Transport (OT) distance, which measures the effort required to transform \(P_j\) into \(Q_j\), providing a natural mechanism for training the model to capture the set's statistics. The objective is to minimize this OT distance, encouraging the model to learn both global prototypes and set-specific representations \(h_j\) effectively. POT can be integrated into existing architectures such as summary networks and applied to various tasks, including few-shot classification and meta-generative modeling.
Zhang et al. [147] design an innovative model that optimizes permutation matrices to learn representations of set-structured data. The key component is the Permutation-Optimization module, which rearranges sets by minimizing a cost function via gradient descent. Given an input set X represented as a matrix \(X = [x_1, x_2, \ldots , x_n]^T\), where each \(x_i\) is a feature vector, the algorithm initializes a permutation matrix \(P^{(0)}\) either uniformly or through linear assignment. The total cost function \(c(P)\) is defined as
\[c(P)=\sum _{k<k^{\prime }}\sum _{i,j}P_{ik}\,P_{jk^{\prime }}\,C_{ij},\]
where \(C_{ij}\) is the pairwise ordering cost between elements i and j, and k and \(k^\prime\) are element positions. \(C_{ij}\) measures permutation quality to determine the optimal element ordering. However, the overall complexity of the algorithm is \(\Theta (n^3)\) per iteration, making it impractical for large sets.
Previous set encoding methods, such as DeepSets [140] and Set Transformer [64], implicitly assume that the entire set can be stored in memory and accessed simultaneously, which is unrealistic when processing large sets or streaming data. To overcome this limitation, Bruno et al. [16] define a new property called Mini-Batch Consistency (MBC), which is essential for maintaining consistent set representations across different mini-batches. MBC requires that the encoding of a full set be equivalent to the aggregation of the encodings of its mini-batches. To obtain the set representation, the authors develop the Slot Set Encoder (SSE), where each slot is a learnable vector that interacts with set elements to capture their features. The attention mechanism in SSE computes attention weights using the dot product of the slots and the set elements, followed by a sigmoid activation function, eliminating the batch-dependent normalization that breaks MBC. To capture interactions across set elements, a hierarchical slot set encoder is constructed by combining a stack of SSEs with a final PMA module from Set Transformer. The SSE can process sets in mini-batches, making it suitable for real-world applications with large sets. To overcome the limitations that SSE supports only sigmoid activation and cannot adopt more expressive non-MBC modules, Willette et al. [131] develop the Universal MBC (UMBC) framework, which extends SSE to more activation functions, such as softmax, enabling more expressive set encoders. The authors also propose an efficient training algorithm that approximates the full-set gradient by aggregating gradients from subsets of the set, maintaining constant memory overhead. This approximation provides an unbiased gradient estimate, significantly outperforming biased estimates derived from randomly sampled subsets. Notably, the UMBC framework is capable of universally approximating any continuous permutation-invariant function.
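The MBC property itself is easy to illustrate with a sum-decomposable encoder (which is trivially MBC, unlike attention-based encoders that require the SSE construction): encoding the full set at once gives the same result as aggregating the encodings of its mini-batches.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
phi = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 8))

def encode(chunk):
    """Per-chunk encoding of a sum-decomposable (hence MBC) set encoder."""
    return phi(chunk).sum(dim=0)

full_set = torch.randn(100, 4)

# Encoding the full set at once ...
z_full = encode(full_set)

# ... equals aggregating the encodings of arbitrary mini-batches of the set.
z_stream = sum(encode(chunk) for chunk in torch.split(full_set, 32))

print(torch.allclose(z_full, z_stream, atol=1e-5))  # True
```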
4 Other Methods
This section discusses several non-deep learning methods for set function learning, such as kernel-based and decision tree-based methods.
Kernel-based approaches for learning set functions typically define a distance or similarity measure (kernel) to establish correspondences between sets. This measure is often combined with instance-based machine learning methods, such as Support Vector Machines (SVMs). Nikolentzos et al. [84] utilize the Earth Mover's Distance metric and an SVM with the Pyramid Match Graph Kernel. Buathong et al. [17] propose more efficient kernel methods by leveraging Reproducing Kernel Hilbert Space embeddings. They introduce Double Sum (DS) kernels, which compute the sum of kernel evaluations over all pairs of elements across two sets. However, DS kernels often lack strict positive definiteness, limiting their applicability. To overcome this limitation, the authors develop Deep Embedding kernels, applying a radial kernel in Hilbert space over the canonical distance induced by DS kernels. The proposed kernel methods enhance Gaussian Process models in prediction and optimization tasks with set-valued inputs. However, these kernel-based methods usually suffer from high computational complexity and memory overhead, since they compare all sets to each other.
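A double-sum set kernel can be sketched in a few lines; the RBF base kernel, the normalization by set sizes, and the function names below are our own illustrative choices.

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    """RBF base kernel between two individual elements."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def double_sum_kernel(set_a, set_b, gamma=0.5):
    """DS-style kernel: sum of base-kernel evaluations over all cross-set element
    pairs, here normalized by the set sizes (normalization is a design choice)."""
    total = sum(rbf(a, b, gamma) for a in set_a for b in set_b)
    return total / (len(set_a) * len(set_b))

A = np.random.randn(5, 3)    # a set of five 3-dimensional elements
B = np.random.randn(8, 3)
print(double_sum_kernel(A, B))
```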
Wendler et al. [129] develop novel algorithms for learning Fourier-sparse set functions using non-orthogonal Fourier transforms within a discrete-set signal processing framework [94], which generalizes classical signal processing to set functions. The proposed algorithm, Sparse Set Function Fourier Transform, computes the non-zero Fourier coefficients by utilizing Fourier support, which refers to the set of indices where the Fourier coefficients are non-zero. The Fourier support is determined by iteratively restricting the set function to subsets of its domain and identifying those subsets where the Fourier transform of the restricted function significantly contributes to the original set function. This algorithm requires \(O(nk-k\log k)\) queries and \(O(nk^2)\) operations, where n is the size of the ground set and k is the number of non-zero Fourier coefficients, achieving significant improvements over the naive fast Fourier transform.
Lu et al. [75] propose Set Locality Sensitive Hashing (SLoSH), an efficient algorithm for set retrieval by leveraging Sliced-Wasserstein Embedding (SWE) and Locality-Sensitive Hashing (LSH). The SWE embeds each set into a lower-dimensional space through random linear projections, followed by sorting and calculating the Monge couplings, preserving the properties of the Wasserstein distance. The computational complexity of this embedding process is \(O(Ln(d+\log n))\), where n is the set size and L is the number of projections. Having obtained the vector representation, the LSH, sensitive to the Sliced-Wasserstein distance, is employed to find approximate nearest sets. The authors provide theoretical bounds for the SLoSH, ensuring the computational efficiency of both the embedding and hashing steps.
Feldman et al. [33] focus on the complexity of learning submodular functions on the Boolean hypercube \(\lbrace 0, 1\rbrace ^n\). They prove that any submodular function f can be approximated within \(\epsilon\) in \(\ell _2\) norm by a real-valued decision tree of depth \(O(1/\epsilon ^2)\). The function f is represented as a binary decision tree T with a rank of at most \(4/\epsilon ^2\), ensuring \(\Vert T - f\Vert _2 \le \epsilon\). Leveraging this approximation, the authors develop a Probably Approximately Correct (PAC) learning algorithm for submodular functions. This algorithm runs in time \(\tilde{O}(n^2)\cdot 2^{O(1/\epsilon ^4)}\), where n is the number of variables, significantly improving learning efficiency under the uniform distribution. They also establish an information-theoretic lower bound of \(2^{\Omega (1/{\epsilon }^{2/3})}\) and a computational lower bound of \(n^{\Omega (1/{\epsilon }^{2/3})}\), implying optimality (up to the constant in the power of \(\epsilon\)) of their algorithms. Raskhodnikova et al. [100] introduce a polynomial-time algorithm for learning submodular functions. This method builds on a structural result showing that any submodular function \(f: \lbrace 0,1\rbrace ^n\rightarrow \lbrace 0,1,\ldots ,k\rbrace\) can be represented by pseudo-Boolean \(2k\)-disjunctive normal form (DNF) formula, which extends the traditional DNF formula to handle integer-valued functions, enabling learning submodular functions with techniques similar to those for Boolean functions. The authors propose a PAC learning algorithm, which is a generalization of Mansour’s PAC-learner for k-DNFs. This algorithm transforms the submodular function into a pseudo-Boolean k-DNF, applies random restrictions and utilizes Fourier analysis to identify significant coefficients. The algorithm is efficient, with runtime polynomial in n, \(k^{O(k \log k/\epsilon)}\), \(1/\epsilon\), and \(\log (1/\delta)\), where \(\epsilon\) and \(\delta\) are accuracy and confidence parameters respectively. The authors also establish lower bounds on the complexity of learning submodular functions, demonstrating the method’s optimality.
5 Applications and Relevant Datasets
In this section, we introduce several applications of set function learning methods. These methods have shown great potential in scenarios where data can be naturally represented as sets, and the order of elements is not inherently important. As researchers continue to explore the capabilities of set function learning, the range of applications is expanding across various domains that require processing and reasoning over unordered sets of data.
5.1 Point Cloud Processing
Set function learning has the potential to revolutionize point cloud processing by treating the point cloud as a set of vectors, where each vector represents the features of a point. In point cloud applications, set function learning methods are mainly used for classification, segmentation, and detection tasks. As point cloud data are widely used in various applications, set function learning methods are likely to play an increasingly important role in this domain.
Point cloud classification aims to determine the category of objects represented by point clouds. Set function learning models such as DeepSets [140], PointNet [95], Set Transformer [64], and other methods [3, 16, 41, 58, 78, 80, 83, 92] extract relevant features from the entire point set. This enables accurate identification of objects such as cars, trees, and buildings, which is particularly important in fields such as robotics [6, 142] and autonomous driving [61, 152].
Point cloud segmentation is the process of labeling each point in a point cloud with a specific class, allowing for detailed scene understanding, such as distinguishing between vehicles and people in autonomous driving [40, 81]. Advanced models such as PointNet++ [96], Point Transformer [150], and other methods [24, 69, 131, 141, 148] achieve good performance on this task by considering both local and global features.
Point cloud detection focuses on identifying and localizing objects within a point cloud, providing bounding boxes around detected items. This application leverages set function learning models [64, 66, 80, 92, 103, 112, 144] to propose regions of interest and refine these regions for precise localization. Point cloud detection ensures the safe and efficient operation of autonomous driving [30, 134] and robotic navigation [44, 107] in dynamic environments.
Empirical comparisons show that PointNet++ and Point Transformer generally outperform DeepSets, PointNet, and Set Transformer in point cloud tasks on ModelNet40 and ShapeNet datasets [32, 150]. In particular, Point Transformer achieves the best performance in point cloud classification and segmentation by leveraging self-attention mechanism. SpiderCNN and PointNet++ excel in segmentation tasks by incorporating hierarchical and local feature extraction techniques. Meanwhile, DuMLP-Pin offers competitive performance while significantly reducing computational complexity, demonstrating its efficiency for classification and segmentation.
5.2 Set Anomaly Detection
Set anomaly detection is another crucial application, aiming to identify outliers within a set by leveraging set function learning models such as DeepSets [140], PointNet [95], and other variants [80, 86, 96, 141, 150]. The process begins by extracting feature representations for each element in the set, which are subsequently aggregated to produce a unified set representation. The core of anomaly detection is to compare each element’s feature against this aggregated representation, assigning an anomaly score to each element based on its deviation from the overall set pattern. This is typically accomplished through a subsequent layer that evaluates the extent of deviation. The framework then outputs a probability distribution over the elements, with higher probabilities indicating a greater likelihood of being outliers. Set anomaly detection is vital in various domains, such as detecting unusual behavior in sensor networks [91] and identifying outlier faces in image sets [114].
Experimental results on the CelebA dataset highlight different strengths among models for set anomaly detection [32]. PointNet and DeepSets offer simplicity but struggle to capture complex interdependencies within sets, limiting their performance. Set Transformer improves performance by incorporating attention mechanisms to model relationships among set elements. However, DuMLP-Pin outperforms these methods by achieving the highest accuracy while significantly reducing parameter complexity. This makes DuMLP-Pin a competitive choice for anomaly detection tasks, particularly in resource-constrained environments.
5.3 Recommendation Systems
Set function learning methods [64, 140] enhance recommendation systems by modeling user-item interactions as sets and effectively capturing user preferences regardless of ordering, which facilitates the accurate representation of user profiles. For example, Reciptor [67] and RecipeBowl [38] both employ Set Transformer to capture relationships between input elements, enabling more accurate recipe recommendations. Set function learning methods also integrate contextual information into the set, enabling context-aware recommendations [2, 145] that adapt to different situations for more relevant suggestions. Additionally, they promote diversity [60] and fairness [19] in recommendations by producing sets that cover a wide range of items, improving collaborative filtering [110] through better aggregation of similar user preferences or items. Moreover, these methods require fewer parameters and enable efficient training on large datasets, making them promising for enhancing recommendation system accuracy and relevance.
Empirical results for recommendation task on Recipe1M dataset show that Reciptor and RecipeBowl outperform DeepSets by effectively modeling ingredient relationships and recipe context, with RecipeBowl achieving the best performance [38, 67]. Reciptor excels in cuisine classification and region prediction, while RecipeBowl focuses on context-aware ingredient and recipe recommendations. In contrast, while DeepSets is computationally efficient, it struggles to capture complex dependencies, resulting in weaker performance.
5.4 Set Expansion and Set Retrieval
Set expansion involves identifying new objects that are similar to a given set of objects and retrieving relevant candidates from a large pool. This process is closely related to set retrieval, where the goal is to efficiently retrieve items from a large dataset that match the characteristics of a target set. In the text concept set retrieval task, the goal is to identify and retrieve words that belong to a specific concept or category based on a given set of example words. For example, starting with {apple, orange, pear}, the aim is to retrieve additional related words such as banana and watermelon, which belong to the same “fruit” category. This task can be viewed as set expansion conditioned on a latent semantic concept, where DeepSets [140] and its variants [12, 64, 95] are particularly effective. In computational advertising, set function learning methods [64, 95, 140] improve advertisement targeting by expanding the set of user preferences or behaviors with additional relevant interests, making advertisements more relevant and effective. Experimentally, DeepSets outperforms all the conventional baselines on the COCO dataset for set retrieval tasks [140].
5.5 Time-series Prediction
Set function learning methods [43, 139] in time-series prediction address the challenges posed by irregular, sparse, and asynchronous data by treating each time series as an unordered set of observations. This eliminates the need for data regularization with interpolation, enabling models to operate directly on raw data and capture the inherent information more effectively. SparseSense [1] processes the sparse and irregular data streams generated by batteryless passive wearables to tackle human activity recognition (HAR). DTS-ERA [124] combines evidential reinforced attention with deep temporal sets for detailed behavioral pattern analysis.
Empirically, SEFT-ATTN achieves competitive performance on mortality prediction tasks by effectively handling asynchronous and unaligned data [43]. SparseSense outperforms traditional baselines in sparse data-stream classification by directly learning from unordered observations without interpolation [1]. DTS-ERA demonstrates superior predictive accuracy on 2D, 3D, and mixed Maze Painting data, further showing its generalization ability in behavior analysis [124].
5.6 Multi-label Classification
Multi-label classification aims to assign multiple labels to a single instance, which is complex due to the dependencies between labels. Set function learning methods [35, 97, 103, 113, 140, 146] can explicitly model these dependencies, improving classification performance. For example, in image tagging, labels such as beach and sun are likely to appear simultaneously, and modeling this relationship can lead to more accurate predictions. In particular, submodular function learning methods [25, 28, 35, 76], with the property of diminishing returns, can be used to capture the idea that adding a label to a smaller set of labels is more informative than adding it to a larger set, which is especially useful for modeling the dependencies between labels.
Experimentally, FSPool outperforms DeepSets, PointNet, and Janossy Pooling on the CLEVR dataset by utilizing sorting-based pooling [146]. RepSet consistently achieves superior performance across datasets by effectively modeling set relationships with a bipartite matching mechanism [113]. In addition, Set-JDS and set-RNN both demonstrate competitive accuracy across multiple datasets [97, 103].
5.7 Molecular Property Prediction
Set function learning methods [12, 66] have achieved significant progress in molecular property prediction, capable of handling complex molecular datasets and enhancing prediction accuracy. For example, EMTO-CPA [143] applies DeepSets to the design of high-entropy alloys (HEAs). By treating the composition of alloys as sets of elements, DeepSets can predict the properties of novel HEAs more accurately. This approach facilitates the exploration of a vast compositional space and the discovery of new materials with desirable properties. In drug discovery, EquiVSet [86] is utilized for compound selection in virtual screening by modeling the hierarchical selection process of compounds.
Empirical evaluations on molecular property prediction tasks show that EquiVSet outperforms DeepSets on the PDBBind dataset by effectively capturing complex dependencies in molecular structures [86]. Similarly, Equilibrium Aggregation demonstrates superior representational power by optimizing a potential function over molecular sets [12], achieving better performance than GCN and GIN on MOLPCBA [45].
5.8 Amortized Inference
Set function learning methods have found significant applications in amortized inference, where we train neural networks to approximate posterior distributions, thus replacing traditional iterative inference approaches with efficient forward passes. Set Transformer [64] addresses amortized clustering by efficiently mapping datasets to cluster structures through set attention blocks. To overcome the limitation that Set Transformer assumes a fixed number of clusters, Lee et al. [65] propose Deep Amortized Clustering, which extends Set Transformer by incorporating recursive filtering steps, capable of generating varying number of clusters depending on dataset complexity. Building on this foundation, Pakman et al. [87] apply set-based architectures to approximate posterior sampling for probabilistic clustering models, which has been demonstrated effective in applications like spike sorting for high-dimensional neural data. Wang et al. [126] introduce Neural Clustering Processes, a framework that combines set attention with GNN for flexible and efficient amortized clustering. Beyond clustering, set neural architectures have also been applied to general probabilistic inference. For instance, the Neural Process family employs set neural architectures to efficiently model functional variability and uncertainty across datasets [50]. Additionally, Müller et al. [82] develop Prior-Data Fitted Networks, which train set neural networks on synthetic priors, achieving fast and scalable Bayesian inference for structured data.
Experimentally, these methods consistently achieve state-of-the-art performance in amortized inference [65, 82, 87, 126], significantly outperforming traditional methods. For instance, Set Transformer outperforms variational methods in clustering tasks, achieving the highest accuracy on benchmark Gaussian mixtures and real-world datasets [64].
5.9 Other Applications
In addition to the applications we have previously summarized, set function learning is also applied in other domains, such as human activity recognition. GaitSet [21] regards gait sequences as sets of frames, capturing the invariant features of human gait across different views and enhancing the ability to recognize individuals based on their walking patterns. CytoSet [137] leverages set modeling to handle the unordered and variable-sized nature of single-cell cytometry data. By utilizing permutation-invariant neural networks, CytoSet can predict clinical outcomes directly from the set of cells, enhancing the model’s ability to capture complex biological patterns. Similarly, set function learning models such as UMBC [131] can be employed to process high-resolution tissue images, improving the accuracy of cancer detection. Empirically, GaitSet, CytoSet, and UMBC demonstrate state-of-the-art performance in human activity recognition, clinical outcomes prediction, and cancer detection, respectively.
5.10 Relevant Datasets
In this section, we introduce some datasets that are commonly utilized to evaluate set function learning methods.
5.10.1 Point Cloud Dataset.
There are some datasets commonly used for point cloud processing tasks. The ModelNet40 dataset [132] consists of 12,311 CAD models from 40 categories of man-made objects. The ShapeNet dataset [138] contains 16,881 3D shapes from 16 categories, each annotated with 50 distinct parts. The Stanford 3D semantic parsing dataset [5] includes Matterport 3D scans of 271 rooms across six areas, annotated with 13 semantic labels like chair, table, and floor. The Point Cloud MNIST 2D dataset converts MNIST [63] images into 2D point clouds, comprising 60,000 training and 10,000 testing samples, with each set containing 34–35 points. The Oxford Buildings Dataset [90] contains 5,062 images of 11 Oxford landmarks, with 5 queries per landmark (55 queries in total) for evaluating object retrieval systems.
5.10.2 Image Dataset.
We summarize some image datasets for anomaly detection, set retrieval, and multi-label classification. CelebA dataset [73] contains 202,599 celebrity face images annotated with 40 Boolean attributes, such as “smiling,” “wearing glasses,” and “blonde hair.” The Celebrity Together dataset [151] includes 194,000 images with 546,000 labeled faces, averaging 2.8 faces per image. The MS COCO dataset [70] comprises 123,000 images labeled with per-instance segmentation masks of 80 classes. Each image includes 0 to 18 objects, with most containing 1 to 3 labels.
5.10.3 Recommendation Dataset.
The following datasets are used to evaluate set function learning methods in recommendation systems. Amazon baby registry dataset [36] contains 29,632 baby registries, each listing 5 to 100 products categorized into groups like “toys” and “furniture.” The Recipe1M dataset [77] consists of 1,029,720 cooking recipes with ingredients, instructions, images, and 1,047 semantic categories parsed from titles, covering 507,834 recipes.
5.10.4 Chemical and Biological Dataset.
Here are some datasets used for molecular property and hematocrit level prediction. The Flow-RBC dataset [144] contains 98,240 training and 23,104 test sets, each representing 1,000 red blood cells with volume and hemoglobin content measurements. The PDBBind dataset [72] provides experimental binding data for 10,776 biomolecular complexes, including 8,302 protein–ligand and 2,474 other complexes. The BindingDB dataset is a public database of measured binding affinities, consisting of 52,273 drug targets with drug-like small molecules.
5.10.5 Multi-modal Dataset.
The following datasets can be used for object detection and set property prediction. SHIFT15M [56] contains 15 million images and videos captured in diverse driving environments with annotations for object bounding boxes, instance segmentation masks, and semantic labels, covering vehicles, pedestrians, road signs, and more. CLEVR [52] is a visual question answering dataset with 70,000 training images and 700,000 questions, plus additional validation and test sets. Questions fall into five types: existence, counting, integer comparison, attribute queries, and attribute comparisons. Each scene contains 3D-rendered objects characterized by size, shape, material and color, forming 96 unique combinations.
6 Discussion and Future Directions
In this survey, we have comprehensively reviewed and discussed various techniques for solving set function learning problems, covering both deep learning and traditional learning methods. By investigating a wide range of methodologies, such as DeepSets [140] and Set Transformer [64], it is evident that significant progress has been achieved in learning complex set functions across various domains, from point cloud processing to recommendation systems. However, several challenges remain. A critical one is the lack of theoretical breakthroughs. Balcan et al. [11] introduce the probably mostly approximately correct (PMAC) model, extending the PAC model to real-valued functions, and demonstrate that submodular functions can be PMAC-learned with an approximation factor of \(O(n^{1/2})\) using a polynomial number of samples. However, research on the learnability of general set functions is still lacking. Another major limitation is that most current methods assume that the entire set can be accessed at once, which is impractical for large sets due to memory constraints. Moreover, in streaming data scenarios, it is crucial that set representations can be updated in real time. Additionally, the potential and advantages of set function learning methods in specific fields have not been fully explored. These challenges highlight several open research directions worthy of further investigation.
— Theoretical analysis: Conducting in-depth theoretical analysis is essential for advancing set function learning. This involves analyzing the learnability of various classes of set functions, assessing the expressiveness and limitations of different models, and establishing generalization bounds. Additionally, exploring the impact of set size and element distributions on model performance can reveal crucial factors affecting performance. These theoretical advancements would provide deeper insights into set function models and valuable guidance for designing more efficient and interpretable frameworks.
— Mini-batch consistency: Ensuring stable predictions across mini-batches during training is vital for resource-constrained environments. Developing techniques such as consistency regularization and batch normalization specifically designed for set inputs can mitigate instability arising from variations in set size and composition. Furthermore, investigating the impact of set composition, diversity, and size on mini-batch consistency enables the development of more stable and robust training strategies for set function models.
— Dynamic data handling: In scenarios such as sensor networks, where data streams continuously, it is critical to develop adaptive architectures that can process sets of varying sizes efficiently and handle streaming data. The key idea is to develop online learning algorithms for set functions that incrementally update models without retraining on the entire dataset. Additionally, exploring techniques to manage concept drift, where the data distribution evolves over time, is important for maintaining model performance on set-based data streams.
— Domain-specific enhancements: To further exploit the potential of set function learning in processing set-structured data, it is important to tailor set function models to specific domains such as multi-object detection, document classification, and drug discovery. These adaptations should preserve permutation invariance while accounting for the unique requirements of different data types. By incorporating domain knowledge, optimizing architectures for specific relational patterns, or employing well-designed loss functions, such specialized models can outperform universal frameworks.
— Hybrid approaches: Combining set function learning with other machine learning paradigms can significantly improve applicability and performance. For instance, integrating set function learning with graph neural networks can enhance relational reasoning, while incorporating sequence models enables handling tasks involving both sequential and unordered data. Additionally, exploring the synergy between set function learning and reinforcement learning can unlock new possibilities for complex decision-making in set-based dynamic environments, such as resource allocation and planning.
—
Few-shot and transfer learning: Few-shot and transfer learning techniques are promising for improving the generalization of set function models under limited data and for enabling effective knowledge transfer across related tasks. Meta-learning algorithms designed for set functions can support adaptation to new tasks from only a few examples, while transfer learning with pre-trained set function models can accelerate domain adaptation. Self-supervised learning that exploits the inherent structure of sets can further enhance few-shot performance.
—
Graph prediction: Extending the principles of set function learning to graph structures offers significant potential for predicting complex structures and relationships. Developing set-to-graph architectures that map unordered sets to structured graph outputs can advance applications such as scene understanding and relationship inference. Adapting set function learning models to graph-based tasks can also improve performance in domains requiring hierarchical reasoning, such as molecular property prediction and knowledge graph construction.
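As noted in the mini-batch consistency and dynamic data handling directions above, sum-decomposable encoders are a natural starting point because their aggregation commutes with partitioning the set. The sketch below (again a hedged illustration assuming Python with PyTorch; StreamingSetEncoder and its members are hypothetical names of ours) folds mini-batches of a set into a running sufficient statistic and recovers the same encoding that would be obtained from the whole set at once, without revisiting earlier elements.

import torch
import torch.nn as nn

class StreamingSetEncoder(nn.Module):
    # Sum-decomposable encoder f(X) = rho(sum_x phi(x)) with an
    # accumulator so the set can arrive in chunks or as a stream.
    def __init__(self, in_dim: int, hid_dim: int, out_dim: int):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, hid_dim))
        self.rho = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, out_dim))
        # Running sum of phi over every element seen so far.
        self.register_buffer("state", torch.zeros(hid_dim))

    @torch.no_grad()
    def update(self, chunk: torch.Tensor) -> None:
        # chunk has shape (chunk_size, in_dim); earlier elements
        # never need to be stored or revisited.
        self.state += self.phi(chunk).sum(dim=0)

    def readout(self) -> torch.Tensor:
        return self.rho(self.state)

enc = StreamingSetEncoder(in_dim=3, hid_dim=64, out_dim=8)
x = torch.randn(100, 3)
for chunk in x.split(10):                 # the set arrives as ten mini-batches
    enc.update(chunk)
streamed = enc.readout()
full = enc.rho(enc.phi(x).sum(dim=0))     # encoding of the whole set at once
assert torch.allclose(streamed, full, atol=1e-4)

Attention-based encoders do not decompose this way in general, which is one reason dedicated mini-batch-consistent and streaming set encoders remain an active research direction.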
References
Alireza Abedin, S. Hamid Rezatofighi, Qinfeng Shi, and Damith C. Ranasinghe. 2019. SparseSense: Human activity recognition from highly sparse sensor data-streams using set-based neural networks. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’19). Association for the Advancement of Artificial Intelligence, 5780–5786.
Miika Aittala and Frédo Durand. 2018. Burst image deblurring using permutation invariant convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV’18). 731–747.
Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. 2016. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5297–5307.
Iro Armeni, Ozan Sener, Amir R. Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 2016. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1534–1543.
Eduardo Arnold, Sajjad Mozaffari, and Mehrdad Dianati. 2021. Fast and robust registration of partially overlapping point clouds. IEEE Robot. Autom. Lett. 7, 2 (2021), 1502–1509.
Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR’15).
Jun Bai, Chuantao Yin, Hanhua Hong, Jianfei Zhang, Chen Li, Yanmeng Wang, and Wenge Rong. 2023. Permutation invariant training for paraphrase identification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’23). IEEE, 1–5.
Maria-Florina Balcan and Nicholas J. A. Harvey. 2018. Submodular functions: Learnability, structure, and optimization. SIAM J. Comput. 47, 3 (2018), 703–754.
Sergey Bartunov, Fabian B. Fuchs, and Timothy P. Lillicrap. 2022. Equilibrium aggregation: Encoding sets via optimization. In Uncertainty in Artificial Intelligence. PMLR, 139–149.
Mathieu Blondel, Quentin Berthet, Marco Cuturi, Roy Frostig, Stephan Hoyer, Felipe Llinares-López, Fabian Pedregosa, and Jean-Philippe Vert. 2022. Efficient and modular implicit differentiation. Advances in Neural Information Processing Systems, Vol. 35. A Bradford Book, Cambridge, MA, 5230–5242.
Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. 2017. Geometric deep learning: Going beyond euclidean data. IEEE Sign. Process. Mag. 34, 4 (2017), 18–42.
Bruno Andreis, Jeffrey Willette, Juho Lee, and Sung Ju Hwang. 2021. Mini-batch consistent slot set encoder for scalable set encoding. In Advances in Neural Information Processing Systems, Vol. 34. A Bradford Book, Cambridge, MA, 21365–21374.
Poompol Buathong, David Ginsbourger, and Tipaluck Krityakierne. 2020. Kernels over sets of finite sets using rkhs embeddings, with application to bayesian (combinatorial) optimization. In Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, 2731–2741.
Christian Bueno and Alan Hylton. 2021. On the representation power of set pooling networks. In Advances in Neural Information Processing Systems, Vol. 34. A Bradford Book, Cambridge, MA, 17170–17182.
Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. 1995. A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput. 16, 5 (1995), 1190–1208.
Hanqing Chao, Yiwei He, Junping Zhang, and Jianfeng Feng. 2019. Gaitset: Regarding gait as a set for cross-view gait recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 8126–8133.
Gal Chechik, Varun Sharma, Uri Shalit, and Samy Bengio. 2010. Large scale online learning of image similarity through ranking. J. Mach. Learn. Res. 11 (2010), 1109–1135.
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. 2017. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40, 4 (2017), 834–848.
Dandan Guo, Long Tian, Minghe Zhang, Mingyuan Zhou, and Hongyuan Zha. 2021. Learning prototype-oriented set representations for meta-learning. In International Conference on Learning Representations.
Abir De and Soumen Chakrabarti. 2022. Neural estimation of submodular functions with applications to differentiable subset selection. In Advances in Neural Information Processing Systems, Vol. 35. A Bradford Book, Cambridge, MA, 19537–19552.
Don Dennis, Durmus Alp Emre Acar, Vikram Mandikal, Vinu Sankar Sadasivan, Venkatesh Saligrama, Harsha Vardhan Simhadri, and Prateek Jain. 2019. Shallow rnn: Accurate time-series classification on resource constrained devices. In Advances in Neural Information Processing Systems, Vol. 32. A Bradford Book, Cambridge, MA.
Benjamin Doerr, Carola Doerr, Aneta Neumann, Frank Neumann, and Andrew Sutton. 2020. Optimization of chance-constrained submodular functions. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 1460–1467.
Brian W. Dolhansky and Jeff A. Bilmes. 2016. Deep submodular functions: Definitions and learning. In Advances in Neural Information Processing Systems 29 (2016), 3404–3412.
David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. 2015. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, Vol. 28. A Bradford Book, Cambridge, MA, 2224–2232.
Francis Engelmann, Martin Bokeloh, Alireza Fathi, Bastian Leibe, and Matthias Nießner. 2020. 3d-mpa: Multi-proposal aggregation for 3d semantic instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9031–9040.
Felix A. Faber, Alexander Lindmaa, O. Anatole Von Lilienfeld, and Rickard Armiento. 2016. Machine learning energies of 2 million elpasolite (ABC2D6) crystals. Phys. Rev. Lett. 117, 13 (2016), 135502.
Jiajun Fei, Ziyu Zhu, Wenlei Liu, Zhidong Deng, Mingyang Li, Huanjun Deng, and Shuo Zhang. 2022. Dumlp-pin: A dual-mlp-dot-product permutation-invariant network for set feature extraction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 598–606.
Vitaly Feldman, Pravesh Kothari, and Jan Vondrák. 2013. Representation, approximation and learning of submodular functions using low-rank decision trees. In Proceedings of the Conference on Learning Theory. PMLR, 711–740.
Jennifer A. Gillenwater, Alex Kulesza, Emily Fox, and Ben Taskar. 2014. Expectation-maximization for learning determinantal point processes. In Advances in Neural Information Processing Systems, Vol. 27. A Bradford Book, Cambridge, MA.
Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. In Proceedings of the International Conference on Machine Learning. PMLR, 1263–1272.
Mogan Gim, Donghyeon Park, Michael Spranger, Kana Maruyama, and Jaewoo Kang. 2021. Recipebowl: A cooking recommender for ingredients and recipes using set transformer. IEEE Access 9 (2021), 143623–143633.
Roger Girgis, Florian Golemo, Felipe Codevilla, Martin Weiss, Jim Aldon D’Souza, Samira Ebrahimi Kahou, Felix Heide, and Christopher Pal. 2021. Latent variable sequential set transformers for joint multi-agent motion prediction. In International Conference on Learning Representations.
T. Hackel, N. Savinov, L. Ladicky, J. D. Wegner, K. Schindler, and M. Pollefeys. 2017. SEMANTIC3D.NET: A new large-scale point cloud classification benchmark. ISPRS Ann. Photogram. Remote Sens. Spat. Inf. Sci. 4 (2017), 91–98.
Jason Hartford, Devon Graham, Kevin Leyton-Brown, and Siamak Ravanbakhsh. 2018. Deep models of interactions across sets. In Proceedings of the International Conference on Machine Learning. PMLR, 1909–1918.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
Max Horn, Michael Moor, Christian Bock, Bastian Rieck, and Karsten Borgwardt. 2020. Set functions for time series. In Proceedings of the International Conference on Machine Learning. PMLR, 4353–4363.
Armin Hornung, Kai M. Wurm, Maren Bennewitz, Cyrill Stachniss, and Wolfram Burgard. 2013. OctoMap: An efficient probabilistic 3D mapping framework based on octrees. Auton. Robots 34 (2013), 189–206.
Zicheng Hu, Benjamin S. Glicksberg, and Atul J. Butte. 2019. Robust prediction of clinical outcomes using cytometry data. Bioinformatics 35, 7 (2019), 1197–1203.
Binh-Son Hua, Minh-Khoi Tran, and Sai-Kit Yeung. 2018. Pointwise convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 984–993.
Rishabh Iyer, Ninad Khargoankar, Jeff Bilmes, and Himanshu Asanani. 2021. Submodular combinatorial information measures with applications in machine learning. In Algorithmic Learning Theory. PMLR, 722–754.
Saurav Jha, Dong Gong, Xuesong Wang, Richard E. Turner, and Lina Yao. 2022. The neural process family: Survey, applications and perspectives. arXiv:2209.00517. Retrieved from https://arxiv.org/abs/2209.00517
Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. 2017. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2901–2910.
John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature 596, 7873 (2021), 583–589.
In-Soo Jung, Mario Berges, James H. Garrett Jr, and Barnabas Poczos. 2015. Exploration and evaluation of AR, MPCA and KL anomaly detection techniques to embankment dam piezometer data. Adv. Eng. Inf. 29, 4 (2015), 902–917.
Mateusz Jurewicz and Leon Derczynski. 2022. Set interdependence transformer: Set-to-sequence neural networks for permutation learning and structure prediction. In Proceedings of the International Joint Conference on Artificial Intelligence.
Masanari Kimura, Takuma Nakamura, and Yuki Saito. 2023. SHIFT15M: Fashion-specific dataset for set-to-set matching with several distribution shifts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3508–3513.
Adam R. Kosiorek, Hyunjik Kim, and Danilo J. Rezende. 2020. Conditional set generation with transformers. arXiv:2006.16841. Retrieved from https://arxiv.org/abs/2006.16841
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, Vol. 25. A Bradford Book, Cambridge, MA.
Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. 2019. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12697–12705.
Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. 2019. Set transformer: A framework for attention-based permutation-invariant neural networks. In Proceedings of the International Conference on Machine Learning. PMLR, 3744–3753.
Seanie Lee, Bruno Andreis, Kenji Kawaguchi, Juho Lee, and Sung Ju Hwang. 2022. Set-based meta-interpolation for few-task meta-learning. In Advances in Neural Information Processing Systems, Vol. 35. A Bradford Book, Cambridge, MA, 6775–6788.
Diya Li and Mohammed J. Zaki. 2020. Reciptor: An effective pretrained model for recipe representation learning. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1719–1727.
Jiangnan Li, Yice Zhang, Bin Liang, Kam-Fai Wong, and Ruifeng Xu. 2023. Set learning for generative information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 13043–13052.
Yang Li, Haidong Yi, Christopher Bender, Siyuan Shan, and Junier B. Oliva. 2020. Exchangeable neural ode for set modeling. In Advances in Neural Information Processing Systems, Vol. 33. A Bradford Book, Cambridge, MA, 6936–6946.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision. Springer, 740–755.
Yang Liu and Mirella Lapata. 2019. Hierarchical transformers for multi-document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 5070–5081.
Zhihai Liu, Yan Li, Li Han, Jie Li, Jie Liu, Zhixiong Zhao, Wei Nie, Yuchen Liu, and Renxiao Wang. 2015. PDB-wide collection of binding data: Current status of the PDBbind database. Bioinformatics 31, 3 (2015), 405–412.
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision. 3730–3738.
Yuzhe Lu, Xinran Liu, Andrea Soltoggio, and Soheil Kolouri. 2024. Slosh: Set locality sensitive hashing via sliced-wasserstein embeddings. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2566–2576.
Piyushi Manupriya, Tarun Ram Menta, Sakethanath N. Jagarlapudi, and Vineeth N. Balasubramanian. 2022. Improving attribution methods by learning submodular functions. In Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, 2173–2190.
Javier Marín, Aritro Biswas, Ferda Ofli, Nicholas Hynes, Amaia Salvador, Yusuf Aytar, Ingmar Weber, and Antonio Torralba. 2021. Recipe1m+: A dataset for learning cross-modal embeddings for cooking recipes and food images. IEEE Trans. Pattern Anal. Mach. Intell. 43, 1 (2021), 187–203.
Haggai Maron, Heli Ben-Hamu, Hadar Serviansky, and Yaron Lipman. 2019. Provably powerful graph networks. In Advances in Neural Information Processing Systems, Vol. 32. A Bradford Book, Cambridge, MA.
Haggai Maron, Ethan Fetaya, Nimrod Segol, and Yaron Lipman. 2019. On the universality of invariant networks. In Proceedings of the International Conference on Machine Learning. PMLR, 4363–4371.
Haggai Maron, Or Litany, Gal Chechik, and Ethan Fetaya. 2020. On learning sets of symmetric elements. In Proceedings of the International Conference on Machine Learning. PMLR, 6734–6744.
Andres Milioto, Ignacio Vizzo, Jens Behley, and Cyrill Stachniss. 2019. Rangenet++: Fast and accurate lidar semantic segmentation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’19). IEEE, 4213–4220.
Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, and Frank Hutter. 2022. Transformers can do bayesian inference. In International Conference on Learning Representations.
Ryan L. Murphy, Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. 2018. Janossy pooling: Learning deep permutation-invariant functions for variable-size inputs. arXiv:1811.01900. Retrieved from https://arxiv.org/abs/1811.01900
Giannis Nikolentzos, Polykarpos Meladianos, and Michalis Vazirgiannis. 2017. Matching node embeddings for graph similarity. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
Tae-Hyun Oh, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, and Wojciech Matusik. 2019. Speech2face: Learning the face behind a voice. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7539–7548.
Zijing Ou, Tingyang Xu, Qinliang Su, Yingzhen Li, Peilin Zhao, and Yatao Bian. 2022. Learning neural set functions under the optimal subset oracle. In Advances in Neural Information Processing Systems, Vol. 35. A Bradford Book, Cambridge, MA, 35021–35034.
Ari Pakman, Yueqi Wang, Catalin Mitelut, JinHyung Lee, and Liam Paninski. 2020. Neural clustering processes. In Proceedings of the International Conference on Machine Learning. PMLR, 7455–7465.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning. PMLR, 1310–1318.
James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. 2007. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1–8.
I. Gethzi Ahila Poornima and B. Paramasivan. 2020. Anomaly detection in wireless sensor network using machine learning algorithm. Comput. Commun. 151 (2020), 331–337.
Sergey Prokudin, Christoph Lassner, and Javier Romero. 2019. Efficient learning on point clouds with basis point sets. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4332–4341.
Markus Püschel. 2018. A discrete signal processing framework for set functions. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’18). IEEE, 4359–4363.
Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 652–660.
Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. 2017. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, Vol. 30. A Bradford Book, Cambridge, MA.
Kechen Qin, Cheng Li, Virgil Pavlu, and Javed Aslam. 2019. Adapting RNN sequence prediction model to multi-label set prediction. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 3181–3190.
Akbar Rafiey and Yuichi Yoshida. 2020. Fast and private submodular and k-Submodular functions maximization with matroid constraints. In Proceedings of the International Conference on Machine Learning. PMLR, 7887–7897.
Neelima Rajput and S. K. Verma. 2014. Back propagation feed forward neural network approach for speech recognition. In Proceedings of 3rd International Conference on Reliability, Infocom Technologies and Optimization. IEEE, 1–6.
Sofya Raskhodnikova and Grigory Yaroslavtsev. 2013. Learning pseudo-boolean k-dnf and submodular functions. In Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 1356–1368.
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 779–788.
Hamid Rezatofighi, Tianyu Zhu, Roman Kaskman, Farbod T. Motlagh, Javen Qinfeng Shi, Anton Milan, Daniel Cremers, Laura Leal-Taixé, and Ian Reid. 2021. Learn to predict sets using feed-forward neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 44, 12 (2021), 9011–9025.
S. Hamid Rezatofighi, Vijay Kumar Bg, Anton Milan, Ehsan Abbasnejad, Anthony Dick, and Ian Reid. 2017. Deepsetnet: Predicting sets with deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV’17). IEEE, 5257–5266.
Tal Ridnik, Emanuel Ben-Baruch, Nadav Zamir, Asaf Noy, Itamar Friedman, Matan Protter, and Lihi Zelnik-Manor. 2021. Asymmetric loss for multi-label classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 82–91.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-assisted Intervention (MICCAI’15), Part III 18. Springer, 234–241.
Renato F. Salas-Moreno, Richard A. Newcombe, Hauke Strasdat, Paul H. J. Kelly, and Andrew J. Davison. 2013. Slam++: Simultaneous localisation and mapping at the level of objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1352–1359.
Attilio Sbrana, André Luis Debiaso Rossi, and Murilo Coelho Naldi. 2020. N-BEATS-RNN: Deep learning for time series forecasting. In Proceedings of the 19th IEEE International Conference on Machine Learning and Applications (ICMLA’20). IEEE, 765–768.
Robin Scheibler, Saeid Haghighatshoar, and Martin Vetterli. 2015. A fast Hadamard transform for signals with sublinear sparsity in the transform domain. IEEE Trans. Inf. Theory 61, 4 (2015), 2115–2132.
Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. Autorec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web. 111–112.
Eva Sharma, Guoli Ye, Wenning Wei, Rui Zhao, Yao Tian, Jian Wu, Lei He, Ed Lin, and Yifan Gong. 2020. Adaptation of rnn transducer with text-to-speech technology for keyword spotting. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’20). IEEE, 7484–7488.
Yifeng Shi, Junier Oliva, and Marc Niethammer. 2020. Deep message passing on sets. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 5750–5757.
Konstantinos Skianis, Giannis Nikolentzos, Stratis Limnios, and Michalis Vazirgiannis. 2020. Rep the set: Neural networks for learning set representations. In Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, 1410–1420.
Yi Sun, Xiaogang Wang, and Xiaoou Tang. 2014. Deep learning face representation from predicting 10,000 classes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1891–1898.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, Vol. 27. A Bradford Book, Cambridge, MA.
Mingxing Tan and Quoc Le. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning. PMLR, 6105–6114.
Samuel Thomas, Brian Kingsbury, George Saon, and Hong-Kwang J. Kuo. 2022. Integrating text inputs for training and adapting rnn transducer asr models. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’22). IEEE, 8127–8131.
Aäron Van Den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. 2016. Pixel recurrent neural networks. In Proceedings of the International Conference on Machine Learning. PMLR, 1747–1756.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30. A Bradford Book, Cambridge, MA.
Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2015. Order matters: Sequence to sequence for sets. arXiv:1511.06391. Retrieved from https://arxiv.org/abs/1511.06391
Edward Wagstaff, Fabian Fuchs, Martin Engelcke, Ingmar Posner, and Michael A. Osborne. 2019. On the limitations of representing functions on sets. In Proceedings of the International Conference on Machine Learning. PMLR, 6487–6494.
Edward Wagstaff, Fabian B. Fuchs, Martin Engelcke, Michael A. Osborne, and Ingmar Posner. 2022. Universal approximation of functions on sets. J. Mach. Learn. Res. 23, 151 (2022), 1–56.
Dingrong Wang, Deep Shankar Pandey, Krishna Prasad Neupane, Zhiwei Yu, Ervine Zheng, Zhi Zheng, and Qi Yu. 2023. Deep temporal sets with evidential reinforced attentions for unique behavioral pattern discovery. In Proceedings of the International Conference on Machine Learning. PMLR, 36205–36223.
Peihao Wang, Shenghao Yang, Shu Li, Zhangyang Wang, and Pan Li. 2023. Polynomial width is sufficient for set representation with high-dimensional features. In Proceedings of the 12th International Conference on Learning Representations.
Yueqi Wang, Yoonho Lee, Pallab Basu, Juho Lee, Yee Whye Teh, Liam Paninski, and Ari Pakman. 2020. Amortized probabilistic detection of communities in graphs. arXiv:2010.15727. Retrieved from https://arxiv.org/abs/2010.15727
Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. 2019. Dynamic graph cnn for learning on point clouds. ACM Trans. Graphics 38, 5 (2019), 1–12.
Kai Wei, Yuzong Liu, Katrin Kirchhoff, and Jeff Bilmes. 2014. Unsupervised submodular subset selection for speech data. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’14). IEEE, 4107–4111.
Chris Wendler, Andisheh Amrollahi, Bastian Seifert, Andreas Krause, and Markus Püschel. 2021. Learning set functions that are sparse in non-orthogonal Fourier bases. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 10283–10292.
Chris Wendler, Markus Püschel, and Dan Alistarh. 2019. Powerset convolutional neural networks. In Advances in Neural Information Processing Systems, Vol. 32. A Bradford Book, Cambridge, MA.
Jeffrey Willette, Seanie Lee, Bruno Andreis, Kenji Kawaguchi, Juho Lee, and Sung Ju Hwang. 2023. Scalable set encoding with universal mini-batch consistency and unbiased full set gradient approximation. In Proceedings of the International Conference on Machine Learning. PMLR, 37008–37041.
Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 2015. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1912–1920.
Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. 2018. Spidercnn: Deep learning on point sets with parameterized convolutional filters. In Proceedings of the European Conference on Computer Vision (ECCV’18). 87–102.
Bin Yang, Wenjie Luo, and Raquel Urtasun. 2018. Pixor: Real-time 3d object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7652–7660.
Vacit Oguz Yazici, Abel Gonzalez-Garcia, Arnau Ramisa, Bartlomiej Twardowski, and Joost van de Weijer. 2020. Orderless recurrent models for multi-label classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13440–13449.
Chih-Kuan Yeh, Wei-Chieh Wu, Wei-Jen Ko, and Yu-Chiang Frank Wang. 2017. Learning deep latent space for multi-label classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
Haidong Yi and Natalie Stanley. 2021. CytoSet: Predicting clinical outcomes via set-modeling of cytometry data. In Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. 1–8.
Li Yi, Vladimir G. Kim, Duygu Ceylan, I.-Chao Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang, Alla Sheffer, and Leonidas Guibas. 2016. A scalable active framework for region annotation in 3d shape collections. ACM Trans. Graphics 35, 6 (2016), 1–12.
Le Yu, Zihang Liu, Tongyu Zhu, Leilei Sun, Bowen Du, and Weifeng Lv. 2023. Predicting temporal sets with simplified fully connected networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 4835–4844.
Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R. Salakhutdinov, and Alexander J. Smola. 2017. Deep sets. In Advances in Neural Information Processing Systems, Vol. 30. A Bradford Book, Cambridge, MA.
David W. Zhang, Gertjan J. Burghouts, and Cees GM Snoek. 2020. Set prediction without imposing structure as conditional density estimation. In International Conference on Learning Representations.
Fengzhuo Zhang, Boyi Liu, Kaixin Wang, Vincent Tan, Zhuoran Yang, and Zhaoran Wang. 2022. Relational reasoning via set transformers: Provable efficiency and applications to MARL. In Advances in Neural Information Processing Systems, Vol. 35. A Bradford Book, Cambridge, MA, 35825–35838.
Jie Zhang, Chen Cai, George Kim, Yusu Wang, and Wei Chen. 2022. Composition design of high-entropy alloys with deep sets learning. npj Comput. Mater. 8, 1 (2022), 89.
Lily Zhang, Veronica Tozzo, John Higgins, and Rajesh Ranganath. 2022. Set norm and equivariant skip connections: Putting the deep in deep sets. In Proceedings of the International Conference on Machine Learning. PMLR, 26559–26574.
Yan Zhang, Jonathon Hare, and Adam Prugel-Bennett. 2019. Deep set prediction networks. In Advances in Neural Information Processing Systems, Vol. 32. A Bradford Book, Cambridge, MA.
Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. 2019. FSPool: Learning set representations with featurewise sort pooling. In Proceedings of the International Conference on Learning Representations.
Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. 2019. Learning representations of sets through optimized permutations. In Proceedings of the International Conference on Learning Representations.
Yan Zhang, David W. Zhang, Simon Lacoste-Julien, Gertjan J. Burghouts, and Cees G. M. Snoek. 2021. Multiset-equivariant set prediction with approximate implicit differentiation. In Proceedings of the International Conference on Learning Representations.
Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip H. S. Torr, and Vladlen Koltun. 2021. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 16259–16268.
Yujie Zhong, Relja Arandjelovic, and Andrew Zisserman. 2018. Compact deep aggregation for set retrieval. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops.
Yin Zhou and Oncel Tuzel. 2018. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4490–4499.
Aaron Zweig and Joan Bruna. 2022. Exponential separations in symmetric neural networks. In Advances in Neural Information Processing Systems, Vol. 35. A Bradford Book, Cambridge, MA, 33134–33145.