[NS][Lab_Seminar_240607]Unbiased Scene Graph Generation in Videos.pptx
1. Unbiased Scene Graph
Generation in Videos
Tien-Bach-Thanh Do
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/06/07
Sayak Nag et al.
CVPR 2023
2. 2
Introduction
Figure 1. (a) Long-tailed distribution of the predicate classes in Action Genome. (b) Visual relationship or
predicate classification performance of two SOTA dynamic SGG methods, namely STTran and TRACE, falls off
significantly for the tail classes
3. 3
Introduction
Figure 2. Noisy scene graph annotations in Action Genome increase the uncertainty of predicted scene graphs
4. 4
Introduction
Figure 3. Occlusion and motion blur caused by moving objects in video renders off-the-self object detectors
such as FasterRCNN ineffective in producing consistent object classification
5. 5
Method
Figure 4. The object detector generates initial object proposals for each RGB frame in a video. The proposals are then passed to the
OSPU, where they are first linked into sequences based on the object detector’s confidence scores. These sequences are processed
with a transformer encoder to generate temporally consistent object embeddings for improved object classification. The proposals
and semantic information of each subject-object pair are passed to the PEG to generate a spatio-temporal representation of their
relationships. Modeled as a spatio-temporal transformer, the PEG’s encoder learns the spatial context of the relationships and its
decoder learns their temporal dependencies. Due to the long-tail nature of the relationship/predicate classes, a Memory Bank in
conjunction with the MDU is used during training to debias the PEG, enabling the production ofmore generalizable predicate
embeddings. Finally, a K-component GMM head classifies the PEG embeddings and models the uncertainty associated with each
predicate class for a given subject-object pair
6. 6
Background
Scene graphs
● Nodes represent objects (pretrain Faster R-CNN with ResNet-101 as object detector => detect objects in
each video frame)
○ Extract bounding boxes, classes, and visual features
● Edges represent relationships between objects
○ Analyze pairs of objects to predict relationships (spatial, semantic, or action-based relationships)
○ Based spatio-temporal transformer [1] to analyze the features of objects pairs and their relative
positions across frames
○ => Generate embeddings that encode the likelihood and type of relationship between object pairs
○ => Embeddings help construct the edges by predicting the relationships
○ Type of relationship eg. holding, near,...
○ Directional information: subject -> object
● Graph constructed by combining node and edge to represent the entire scene at each frame
● Integrate scene graphs over consecutive frames
○ Match nodes and edges across frames to maintain consistent object and relationship over time
○ Continuously update graph as new frames arised
[1] Spatial-Temporal Transformer for Dynamic Scene Graph Generation, ICCV 2021
7. 7
Background
● Gt = {St, Rt, Ot} of each frame It in a video V = {I1, I2, …, IT}
● St = {s1
t, s2
t, … , sN(t)
t}
● Ot = {o1
t, o2
t, … , oN(t)
t}
8. 8
Method
Object Detection and Temporal Consistency
● Use pretrain, obtain set of objects Ot = {oi
t}i=1
N(t) in each frame where oi
t = {bi
t, vi
t, coi
t}, with b is bounding
box, v is RoIAligned proposal feature of o and c is predict class
● Apply Object Sequence Processing Unit (OSPU), utilize transformer encoder, set of sequences:
● Apply muli-head self-attention to learn long-term temporal dependencies
● Input X, single attention head A:
● Multi-head attention
● Normalize and pass through FFN, output
where is fixed positional encodings
9. 9
Method
Predicate Embedding Generator (PEG)
● Information of each subject-object pair to generate an embedding - summarize the relationship between
nodes
● Each pair (i, j)
where vi, vj are subject and object features, uij is feature map of union box computed by RoIAlign, si, sj are
semantic glove embeddings of subject and object class from final object logits, fv, fu are FFN, fbox is bounding
box to feature map projection
● Output of spatial encoder
● Input of temporal dencoder
● Output of temporal decoder
● Final set of predicate embeddings
11. 11
Method
Uncertainty Attenuated Predicate Classification
● Address the noisy annotations in SGG data, model the predicate classification head as K component
GMM
● Given sample embedding, mean, variance, mixture weights
15. 15
Experiment
Dataset
● Action Genome dataset - largest benchmark dataset for video SGG
● 234,253 annotated frames with 476,229 bounding boxes for 35 object classes (without person)
● Total of 1,715,568 annotated predicate instances for 26 relationship classes
● Standard metrics Recall@K (R@K) and mean-Recall@K for K = [10,20,50]
20. 20
Conclusion
● The difficulty in generating dynamic scene graphs from videos can be attributed to several factors ranging
from imbalance predicate class distribution, video dynamics, temporal fluctuation of prediction
● Existing methods on dynamic SGG focused only on achieving high recall values, which are known to be
biased towards head classes
● Propose TEMPURA for dynamic SGG, can compensate for those biases
● Outperform SOTA in terms of mean recall metric, showing its efficacy in long-term unbiased visual
relationship learning from videos