Unbiased Scene Graph
Generation in Videos
Tien-Bach-Thanh Do
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/06/07
Sayak Nag et al.
CVPR 2023
2
Introduction
Figure 1. (a) Long-tailed distribution of the predicate classes in Action Genome. (b) Visual relationship or
predicate classification performance of two SOTA dynamic SGG methods, namely STTran and TRACE, falls off
significantly for the tail classes
3
Introduction
Figure 2. Noisy scene graph annotations in Action Genome increase the uncertainty of predicted scene graphs
4
Introduction
Figure 3. Occlusion and motion blur caused by moving objects in video render off-the-shelf object detectors
such as Faster R-CNN ineffective at producing consistent object classification
5
Method
Figure 4. The object detector generates initial object proposals for each RGB frame in a video. The proposals are then passed to the
OSPU, where they are first linked into sequences based on the object detector’s confidence scores. These sequences are processed
with a transformer encoder to generate temporally consistent object embeddings for improved object classification. The proposals
and semantic information of each subject-object pair are passed to the PEG to generate a spatio-temporal representation of their
relationships. Modeled as a spatio-temporal transformer, the PEG’s encoder learns the spatial context of the relationships and its
decoder learns their temporal dependencies. Due to the long-tail nature of the relationship/predicate classes, a Memory Bank in
conjunction with the MDU is used during training to debias the PEG, enabling the production of more generalizable predicate
embeddings. Finally, a K-component GMM head classifies the PEG embeddings and models the uncertainty associated with each
predicate class for a given subject-object pair
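The caption above describes the end-to-end pipeline; the schematic Python sketch below traces that data flow. All module names (detector, ospu, peg, mdu, memory_bank, gmm_head) are placeholders for the components named in the caption, not the authors' implementation.

```python
# Schematic data flow of the pipeline described in Figure 4. The module names
# are placeholders for the components named in the caption, not real code.
def forward_video(frames, detector, ospu, peg, gmm_head,
                  memory_bank=None, mdu=None, training=False):
    proposals = [detector(frame) for frame in frames]        # per-frame object proposals
    sequences = ospu.link(proposals)                         # link proposals into object sequences
    obj_embeddings = ospu.encode(sequences)                  # temporally consistent object embeddings
    pred_embeddings = peg(proposals, obj_embeddings)         # spatio-temporal predicate embeddings
    if training and mdu is not None:
        pred_embeddings = mdu(pred_embeddings, memory_bank)  # memory-guided debiasing (training only)
    return gmm_head(pred_embeddings)                         # predicate classes + uncertainty
```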
6
Background
Scene graphs
● Nodes represent objects (a Faster R-CNN with a ResNet-101 backbone, pretrained as the object detector, detects objects in each video frame)
○ Extract bounding boxes, classes, and visual features
● Edges represent relationships between objects
○ Analyze pairs of objects to predict their relationships (spatial, semantic, or action-based)
○ Based on the spatio-temporal transformer [1], which analyzes the features of object pairs and their relative positions across frames
○ => Generates embeddings that encode the likelihood and type of relationship between object pairs
○ => These embeddings construct the edges by predicting the relationships
○ Relationship types, e.g., holding, near, ...
○ Directional information: subject -> object
● The graph is constructed by combining nodes and edges to represent the entire scene at each frame (see the sketch after this list)
● Integrate scene graphs over consecutive frames
○ Match nodes and edges across frames to maintain consistent objects and relationships over time
○ Continuously update the graph as new frames arrive
[1] Spatial-Temporal Transformer for Dynamic Scene Graph Generation, ICCV 2021
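As a concrete illustration of the per-frame graph described above, here is a minimal Python sketch that assembles detector outputs into nodes and predicted relationships into directed edges. The class and field names (Node, Edge, SceneGraph) are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of a per-frame scene graph, assuming detector outputs
# (bounding box, class, feature) and a relationship predictor for object pairs.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    obj_id: int          # index of the detected object in the frame
    obj_class: str       # predicted object class, e.g. "person", "cup"
    bbox: tuple          # (x1, y1, x2, y2) bounding box
    feature: list        # RoIAligned visual feature vector

@dataclass
class Edge:
    subject_id: int      # directional: subject -> object
    object_id: int
    predicate: str       # relationship type, e.g. "holding", "near"
    score: float         # predicted likelihood of the relationship

@dataclass
class SceneGraph:
    frame_idx: int
    nodes: List[Node] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)

def build_frame_graph(frame_idx, detections, relation_predictions):
    """Combine detections (nodes) and predicted relationships (edges)."""
    g = SceneGraph(frame_idx=frame_idx)
    g.nodes = [Node(i, d["class"], d["bbox"], d["feature"])
               for i, d in enumerate(detections)]
    g.edges = [Edge(r["subj"], r["obj"], r["predicate"], r["score"])
               for r in relation_predictions]
    return g
```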
7
Background
● G^t = {S^t, R^t, O^t} for each frame I_t in a video V = {I_1, I_2, …, I_T}
● S^t = {s_1^t, s_2^t, …, s_N(t)^t}
● O^t = {o_1^t, o_2^t, …, o_N(t)^t}
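Written out in LaTeX, the per-frame notation above reads:

```latex
% Per-frame scene graph notation for a video V = {I_1, ..., I_T}
\[
  G^t = \{S^t, R^t, O^t\}, \qquad
  S^t = \{s_1^t, s_2^t, \dots, s_{N(t)}^t\}, \qquad
  O^t = \{o_1^t, o_2^t, \dots, o_{N(t)}^t\},
\]
% S^t, R^t, O^t: subjects, relationships (predicates), and objects in frame I_t;
% N(t): number of detections in that frame.
```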
8
Method
Object Detection and Temporal Consistency
● Use the pretrained detector to obtain the set of objects O^t = {o_i^t}_{i=1}^{N(t)} in each frame, where o_i^t = {b_i^t, v_i^t, c_{o_i}^t}, with b the bounding box, v the RoIAligned proposal feature of o, and c the predicted class
● Apply the Object Sequence Processing Unit (OSPU), built on a transformer encoder, to the set of object sequences
● Apply multi-head self-attention to learn long-term temporal dependencies
● Given input X, a single attention head A applies scaled dot-product self-attention
● Multi-head attention concatenates multiple such heads
● The result is normalized and passed through an FFN to produce the output, with fixed positional encodings added to the input (see the sketch below)
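Since the attention equations on this slide did not survive extraction, the following is a minimal PyTorch sketch of what an OSPU-style encoder block computes, assuming standard scaled dot-product multi-head self-attention over a linked object sequence with fixed sinusoidal positional encodings; the dimensions and module names are assumptions.

```python
# Sketch of an OSPU-style transformer encoder block over an object sequence,
# assuming standard multi-head self-attention (Q = K = V = X + positional encodings).
import math
import torch
import torch.nn as nn

def sinusoidal_positions(seq_len, d_model):
    """Fixed (non-learned) positional encodings, as in the original Transformer."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class ObjectSequenceEncoder(nn.Module):
    def __init__(self, d_model=512, n_heads=8, ff_dim=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, ff_dim), nn.ReLU(),
                                 nn.Linear(ff_dim, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        x = x + sinusoidal_positions(x.size(1), x.size(2)).to(x.device)
        attn_out, _ = self.attn(x, x, x)       # self-attention over the object sequence
        x = self.norm1(x + attn_out)           # residual + normalization
        return self.norm2(x + self.ffn(x))     # FFN -> temporally consistent embeddings

# Usage: a sequence of 10 RoIAligned features for the same object across frames
seq = torch.randn(1, 10, 512)
out = ObjectSequenceEncoder()(seq)             # (1, 10, 512)
```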
9
Method
Predicate Embedding Generator (PEG)
● Uses the information of each subject-object pair to generate an embedding that summarizes the relationship between the two nodes
● For each pair (i, j), the representation is built from v_i, v_j (the subject and object features), u_ij (the feature map of their union box computed by RoIAlign), and s_i, s_j (semantic GloVe embeddings of the subject and object classes derived from the final object logits); f_v and f_u are FFNs, and f_box projects the bounding boxes to a feature map (see the sketch below)
● The spatial encoder outputs spatially contextualized pair representations
● These serve as the input of the temporal decoder
● The temporal decoder outputs temporally contextualized representations
● The decoder output gives the final set of predicate embeddings
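A hedged sketch of one plausible reading of the pair representation above: f_v projects the subject/object features, f_box turns the pair's box masks into a feature map that is fused with the union-box RoIAlign map before f_u compresses it, and the result is concatenated with the GloVe embeddings s_i, s_j. The fusion order and all dimensions are assumptions, not the paper's published code.

```python
# Assumed sketch of a PEG-style pair representation for a subject-object pair (i, j).
import torch
import torch.nn as nn

class PairEmbedding(nn.Module):
    def __init__(self, vis_dim=2048, sem_dim=200, d_model=512):
        super().__init__()
        self.f_v = nn.Linear(vis_dim, d_model)          # subject/object feature FFN
        self.f_u = nn.Linear(256 * 7 * 7, d_model)      # union-box feature map FFN
        self.f_box = nn.Conv2d(2, 256, kernel_size=7, padding=3)  # box masks -> feature map
        self.out = nn.Linear(2 * d_model + d_model + 2 * sem_dim, d_model)

    def forward(self, v_i, v_j, u_ij, box_masks, s_i, s_j):
        # v_i, v_j: (B, vis_dim); u_ij: (B, 256, 7, 7); box_masks: (B, 2, 7, 7)
        u = self.f_u((u_ij + self.f_box(box_masks)).flatten(1))
        x = torch.cat([self.f_v(v_i), self.f_v(v_j), u, s_i, s_j], dim=-1)
        return self.out(x)                               # predicate embedding for pair (i, j)
```

The resulting per-pair embeddings would then pass through the spatial encoder (context within a frame) and the temporal decoder (dependencies across frames), which are omitted here.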
10
Method
Memory guided Debiasing
● Long-tailed bias in SGG dataset
11
Method
Uncertainty Attenuated Predicate Classification
● To address the noisy annotations in SGG data, the predicate classification head is modeled as a K-component GMM
● For a given sample embedding, the head yields the means, variances, and mixture weights of the components (see the sketch below)
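One way to realize such a K-component GMM head is sketched below: for every predicate class it predicts K mixture weights, means, and variances, uses the mixture mean as the class logit, and exposes the mixture variance as an uncertainty estimate. This is an assumed realization for illustration, not the paper's exact formulation.

```python
# Assumed sketch of a K-component GMM classification head with per-class uncertainty.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMMHead(nn.Module):
    def __init__(self, d_model=512, n_classes=26, k=4):
        super().__init__()
        self.k, self.n_classes = k, n_classes
        self.pi = nn.Linear(d_model, n_classes * k)       # mixture weights
        self.mu = nn.Linear(d_model, n_classes * k)       # component means
        self.log_var = nn.Linear(d_model, n_classes * k)  # component (log) variances

    def forward(self, z):                                 # z: (B, d_model) predicate embedding
        B = z.size(0)
        pi = F.softmax(self.pi(z).view(B, self.n_classes, self.k), dim=-1)
        mu = self.mu(z).view(B, self.n_classes, self.k)
        var = self.log_var(z).view(B, self.n_classes, self.k).exp()
        logits = (pi * mu).sum(-1)                        # mixture mean as class logit
        # mixture variance per class: E[var] + E[mu^2] - (E[mu])^2
        uncertainty = (pi * (var + mu.pow(2))).sum(-1) - logits.pow(2)
        return logits, uncertainty
```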
12
Method
Uncertainty Attenuated Predicate Classification
13
Method
Uncertainty Attenuated Predicate Classification
14
Experiment
Training
● The OSPU and the GMM head start firing from the first epoch itself
● Training loss (see the hedged sketch below)
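Since the loss equation is not reproduced here, the sketch below assumes the total objective sums an object-classification cross-entropy (OSPU branch) and an uncertainty-attenuated predicate loss driven by the GMM head's predicted variance, in the spirit of common aleatoric-uncertainty losses; the exact terms and weights are assumptions.

```python
# Assumed sketch of a combined training objective: object cross-entropy plus an
# uncertainty-attenuated predicate term scaled by the GMM head's predicted variance.
import torch
import torch.nn.functional as F

def total_loss(obj_logits, obj_labels, pred_logits, pred_uncertainty, pred_labels,
               lambda_pred=1.0):
    l_obj = F.cross_entropy(obj_logits, obj_labels)
    nll = F.cross_entropy(pred_logits, pred_labels, reduction="none")
    # variance of the ground-truth class, used to attenuate noisy samples
    sigma2 = pred_uncertainty.gather(1, pred_labels.unsqueeze(1)).squeeze(1).clamp(min=1e-6)
    l_pred = (nll / sigma2 + sigma2.log()).mean()
    return l_obj + lambda_pred * l_pred
```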
15
Experiment
Dataset
● Action Genome dataset - the largest benchmark dataset for video SGG
● 234,253 annotated frames with 476,229 bounding boxes for 35 object classes (excluding person)
● A total of 1,715,568 annotated predicate instances for 26 relationship classes
● Standard metrics: Recall@K (R@K) and mean-Recall@K (mR@K) for K = [10, 20, 50] (see the sketch after this list)
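For reference, a minimal sketch of how R@K and mR@K can be computed over predicted relationship triplets; the triplet-matching criterion (e.g., box IoU thresholds) is simplified away here.

```python
# Minimal sketch of Recall@K and mean-Recall@K for relationship triplets.
from collections import defaultdict

def recall_at_k(gt_triplets, ranked_predictions, k):
    """gt_triplets: set of (subj, predicate, obj); ranked_predictions: list sorted by score."""
    hits = sum(1 for t in ranked_predictions[:k] if t in gt_triplets)
    return hits / max(len(gt_triplets), 1)

def mean_recall_at_k(per_image_gt, per_image_preds, k, predicate_classes):
    """Average recall per predicate class first, then mean over classes (mR@K)."""
    per_class = defaultdict(list)
    for gts, preds in zip(per_image_gt, per_image_preds):
        topk = preds[:k]
        for p in predicate_classes:
            gts_p = {t for t in gts if t[1] == p}
            if gts_p:
                per_class[p].append(sum(1 for t in topk if t in gts_p) / len(gts_p))
    return sum(sum(v) / len(v) for v in per_class.values()) / max(len(per_class), 1)
```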
16
Experiment
Comparison
17
Experiment
Comparison
18
Experiment
Ablation Studies
19
Experiment
Ablation Studies
20
Conclusion
● The difficulty of generating dynamic scene graphs from videos can be attributed to several factors, ranging from the imbalanced predicate class distribution to video dynamics and temporal fluctuations of predictions
● Existing methods for dynamic SGG focus only on achieving high recall values, which are known to be biased towards the head classes
● TEMPURA is proposed for dynamic SGG and can compensate for these biases
● It outperforms the SOTA in terms of the mean-recall metric, showing its efficacy in long-term unbiased visual relationship learning from videos
