DANCE: Deep Learning-Assisted Analysis of ProteiN Sequences Using Chaos Enhanced Kaleidoscopic Images

Taslim Murad, Prakash Chourasia, Sarwan Ali, Murray Patterson*
*Corresponding Author
Department of Computer Science, Georgia State University
Atlanta, GA, USA
{tmurad2,pchourasia1,sali85}@student.gsu.edu, mpatterson30@gsu.edu
Abstract

Cancer, a complex disease characterized by uncontrolled cell growth, requires accurate identification of the cancer type to determine suitable treatment strategies. T cell receptors (TCRs), crucial proteins in the immune system, play a key role in recognizing antigens, including those associated with cancer. Recent advancements in sequencing technologies have facilitated comprehensive profiling of TCR repertoires, uncovering TCRs with potent anti-cancer activity and enabling TCR-based immunotherapies. However, analyzing these intricate biomolecules necessitates efficient representations that capture their structural and functional information. T-cell protein sequences pose unique challenges because they are relatively short compared to other biomolecules. Traditional vector-based embedding methods may lose information when representing these sequences. An image-based representation is therefore a preferred choice for efficient embeddings, as it preserves essential details and enables comprehensive analysis of T-cell protein sequences. In this paper, we propose to generate images from protein sequences using the idea of Chaos Game Representation (CGR), designing the images with a kaleidoscopic approach. This method, Deep Learning-Assisted Analysis of ProteiN Sequences Using Chaos Enhanced Kaleidoscopic Images (DANCE), provides a unique way to visualize protein sequences by recursively applying chaos game rules around a central seed point. The resulting kaleidoscopic images exhibit symmetrical patterns that offer a visually captivating representation of the protein sequences. To investigate the effectiveness of this approach, we classify TCR protein sequences by their respective target cancer cells, as TCRs are known for their immune response against cancer. Prior to classification, the TCR sequences are converted into images using the DANCE method. We employ deep-learning vision models to classify the generated images and to obtain insights into the relationship between the visual patterns observed in the kaleidoscopic images and the underlying protein properties. By combining CGR-based image generation with deep learning classification, this study opens novel possibilities in the protein analysis domain.

Index Terms:
Chaos Game Representation, Molecular Sequence Analysis, Supervised Analysis

I Introduction

Understanding and effectively analyzing T cell receptors (TCRs), crucial proteins involved in recognizing antigens associated with cancer, holds immense importance in cancer research and treatment [1]. Recent advancements in sequencing technologies have enabled comprehensive profiling of TCR repertoires, unveiling TCRs with potent anti-cancer activity and paving the way for TCR-based immunotherapies [2]. However, the analysis of TCR protein sequences presents unique challenges. Compared to other biomolecules, TCR sequences are relatively shorter [3], making traditional vector-based embedding methods less suitable due to the potential loss of critical information.

Traditional embedding methods have been widely used for representing protein sequences [4, 5], aiming to capture their structural [6] and functional characteristics [7]. These methods typically involve transforming the protein sequences into fixed-length vectors that encode relevant sequence information [8]. Common approaches include one-hot encoding [9], frequency-based encoding [10], and position-specific scoring matrices [5]. While these methods have provided valuable insights into protein analysis, they also come with certain drawbacks. One of the problems with these methods is that the important local and long-range interactions within the sequence may be overlooked [11]. Another challenge is the dimensionality of the embedding space [12]. Protein sequences can be quite long, resulting in high-dimensional vectors. Furthermore, traditional embedding methods may struggle to capture fine-grained details and subtle variations in protein sequences [13]. They often treat each amino acid as independent, disregarding the context and spatial arrangements that are crucial for understanding protein structure and function [10].

Considering the drawbacks of traditional embedding methods, there is a need for a more advanced and efficient representation-learning approach that can overcome these limitations. Image-based representations, such as the Chaos Game Representation (CGR) [14] approach utilized in this study, offer a promising alternative by preserving sequential information, capturing spatial relationships, and enabling a more comprehensive analysis of protein sequences. Using the image-based representation also opens up the whole domain of deep learning for vision to be applied directly on the protein-based images, which is not possible in the case of traditional vector embeddings as deep learning methods do not perform well on tabular data [15].

I-A Chaos Game Representation (CGR)

The CGR works by applying recursive chaos game rules to the protein sequences to generate images [15]. In this method, a central seed point is established, and successive iterations are performed using a set of predefined rules. With each iteration, the seed point is displaced based on the specific amino acid encountered in the sequence. The resulting movement generates patterns that unfold into symmetrical and visually captivating kaleidoscopic images [16]. We choose kaleidoscopic image generation with CGR because of its ability to produce images that exhibit symmetrical patterns. While other CGR methods exist, such as n-flakes [15], the kaleidoscopic approach offers a unique aesthetic appeal that enhances the visualization of protein sequences. See Figure 1 for an example of a kaleidoscopic shape image generated using chaos game representation.

Figure 1: A kaleidoscopic shape image generated using chaos game representation for a sample sequence “ACQRSTAGTACGT”.

The kaleidoscopic shape images generated through CGR provide a visually engaging representation of the underlying protein sequences. The symmetrical patterns created by the recursive chaos game rules reflect the inherent symmetries and repetitive motifs within the protein sequences. This can facilitate the identification of structural and functional patterns that may be important for understanding protein properties. Furthermore, kaleidoscopic images offer an intuitive and visually accessible representation that can aid in the interpretation and analysis of protein sequences. The symmetrical nature of the patterns can help highlight and emphasize important features or regions within the sequence, allowing for a more intuitive understanding of the sequence’s structural and functional characteristics. By utilizing the kaleidoscopic approach, this study harnesses the unique visual properties of the generated images to provide a novel and aesthetically appealing representation of protein sequences. This visual representation can enhance the exploration and analysis of protein data, potentially leading to new insights and discoveries in the field of bioinformatics.

Deep learning has emerged as a powerful tool for image classification tasks [17]. In this paper, we leverage deep learning techniques to perform classification on the generated chaos images. We design and train deep learning models, such as convolutional neural networks (CNNs), to learn the intricate patterns and features present in the chaos images. By training these models on the training set and evaluating their performance on the validation set, we aim to achieve an accurate and reliable classification of the protein sequences based on their visual representations.

The combination of chaos image generation and deep learning classification opens up new avenues for protein analysis and bioinformatics research [15]. The application of deep learning models to classify the chaos images allows us to explore the relationship between the visual patterns observed in the kaleidoscopic images and the assigned labels. This classification can potentially uncover meaningful associations between specific visual patterns and protein characteristics, such as functional domains, secondary structures, or evolutionary relationships.

This paper makes several key contributions to the field of protein analysis and classification using the Chaos Game Representation (CGR) approach. Our contributions can be summarized as follows:

  1. Introducing the use of CGR for generating kaleidoscopic images of protein sequences: We showcase the application of CGR in visualizing protein sequences by recursively applying chaos game rules. Our proposed method, called Deep Learning-Assisted Analysis of ProteiN Sequences Using Chaos Enhanced Kaleidoscopic Images (DANCE), generates visually captivating kaleidoscopic shape images that capture the structural and functional characteristics of proteins.

  2. Demonstrating the effectiveness of DANCE images for protein sequence classification: We explore the utilization of DANCE images as visual representations for protein sequence classification. By employing deep learning image classifiers on the DANCE images, we demonstrate their efficacy in accurately categorizing protein sequences based on the visual patterns.

  3. Investigating the relationship between visual patterns in DANCE images and protein properties: We analyze the relationship between the visual patterns observed in the DANCE images and the underlying protein properties. This exploration provides insights into how the kaleidoscopic shape reflects structural motifs, protein domains, secondary structures, and other relevant features.

  4. Bridging the gap between visual representations and protein classification: This paper addresses the gap in existing research by integrating CGR-based DANCE images with deep learning techniques for protein sequence classification. We demonstrate the synergy between visual representations and computational models, enhancing our understanding of protein sequences comprehensively and intuitively.

The remainder of this paper is organized as follows. Section II provides a comprehensive review of related work in protein visualization and classification techniques. Section III outlines the methodology employed in this study, including the CGR-based chaos image generation process, and the deep learning classification framework. Section IV presents the experimental results and performance evaluation of the proposed approach. Section V discusses the findings and implications of the study, emphasizing the significance of the contributions. Finally, Section VI concludes the paper.

II Related Work

In this section, we review the existing research and techniques in the fields of protein visualization, chaos game representation, and deep learning-based classification. We discuss the advancements, limitations, and gaps in current approaches.

Sparse encoding [18] uses a one-hot binary vector of length 20 to represent each amino acid in a protein sequence. However, this approach suffers from inefficiency and redundancy due to its high-dimensional and sparse nature. Amino Acid Composition [19] offers an alternative protein representation by considering the local compositions of amino acids and their twins. However, it does not consider the sequence order, limiting its effectiveness. Physicochemical Properties [20] incorporate the molecular components’ physicochemical properties to predict protein structure and function. However, the challenge lies in determining effective encoding for unknown physicochemical properties involved in protein folding. It is important to note that these feature engineering-based methods are domain-specific and may lack generalizability across different data types.

The structural-based encoding methods include Quantitative Structure-Activity Relationship (QSAR) [21] and General Structure encoding [22]. QSAR utilizes chemical properties to describe the amino acids in a sequence, but it focuses solely on the molecules rather than encoding the entire residue. However, QSAR may be susceptible to false correlations resulting from experimental errors in biological data. On the other hand, General Structure encoding maps structural information (e.g., residue depth, 3D shape, secondary structure) of the protein sequence into a numerical representation. However, its performance is limited by the availability of known protein structures.

Protein visualization techniques have played a crucial role in understanding protein structure and function [23]. Traditional methods, such as ribbon diagrams [24] and space-filling models [25], provide valuable insights into the three-dimensional (3D) structure of proteins. However, these techniques often struggle to capture the intricate details of protein sequences and their relationships [26].

The Chaos Game Representation (CGR) has emerged as a powerful tool for visualizing DNA and RNA sequences [27]. By recursively applying chaos game rules to generate fractal-like patterns, CGR enables the visualization of sequence properties and motifs [28]. However, its application in protein sequence analysis remains relatively unexplored.

Deep learning techniques have revolutionized various domains, including image classification [29] and natural language processing [30]. In recent years, deep learning has also been applied to protein sequence classification tasks [31]. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have shown promising results in extracting meaningful features from protein sequences and achieving high classification accuracy.

Despite the advancements in protein visualization [32], CGR [15], and deep learning-based classification [33], there exists a significant gap in the literature regarding the application of CGR to generate the kaleidoscopic shape of protein sequences. Most existing research focuses on either 3D protein structure visualization or DNA/RNA sequence analysis using CGR [28, 15]. The potential of kaleidoscopic representations for capturing complex patterns and relationships within protein sequences remains largely unexplored.

III Proposed Approach

Our proposed approach, DANCE, combines the Chaos Game Representation (CGR) with advanced deep learning techniques to classify protein sequences effectively. This innovative method harnesses the power of visual representation and neural networks to capture complex patterns in protein sequences, aiming for improved accuracy and robustness in classification tasks.

The Chaos Game Representation (CGR) is a method originally designed for visualizing sequences in a two-dimensional space. In the context of protein sequences, CGR converts a linear sequence of amino acids into a 2D image, where a set of predefined rules maps each amino acid to specific coordinates in the image. The final output of this mapping process is a 2D image in which the spatial distribution of pixels represents the sequence of amino acids in the protein. This image captures both the sequence order and the amino acid composition, offering a rich visual representation of the protein sequence. Our proposed approach comprises several steps, which we now discuss one by one.

III-A Assign Numerical Coordinates to Amino Acids

The first step is to assign fixed x-axis and y-axis coordinate values to each of the 20 possible amino acids in protein sequences. The assignment could be random; the only criterion is that each amino acid receives a unique pair of coordinates. This uniqueness ensures that each amino acid can be distinctly represented and identified in the CGR image, avoiding any ambiguity or overlap between different amino acids. The coordinate assignment is crucial for the CGR process because it determines how the amino acids are represented in the final 2D image, whose compositional and sequential characteristics are then analyzed by deep learning models for classification. The x- and y-axis values assigned to each amino acid are given in Table I.

TABLE I: Amino acids with corresponding x- and y-axis values.
Amino Acid x-axis y-axis Amino Acid x-axis y-axis
A 0.5 0.5 M 0.5 0.0
C 1.0 0.5 N 0.25 0.5
D 0.5 1.0 P 1.0 0.0
E 0.0 0.5 Q 0.0 1.0
F 1.0 1.0 R 0.5 0.25
G 0.25 0.25 S 0.75 0.5
H 0.75 0.25 T 0.5 0.75
I 0.75 0.75 V 0.0 0.0
K 0.25 0.75 W 1.0 0.25
L 0.75 0.0 Y 1.0 0.75

III-B Recursively Generating DANCE Images

The pseudocode to generate the kaleidoscope shape images is given in Algorithm 1. The method takes a protein sequence as input, along with the recursion depth, the initial position of the central seed point, the initial angle of rotation, and the scale factor for the replication. It calls itself recursively, updating the coordinate values, adding line segments to the plot, and reducing the depth. When depth ≤ 0, the stopping criterion is met: the algorithm terminates, and the resulting plot is the final DANCE image for the given protein sequence. The variables depth, initial position (pos), angle, and scale are hyperparameters whose values are tuned using a standard validation set approach. The values selected for depth, pos, angle, and scale are 4, (0, 0), 0, and 10, respectively. Once the kaleidoscope shape image is generated, it is used as input for deep learning-based classifiers, which leverage the visual patterns created by the CGR method to extract features for classification. Figure 1 illustrates a sample kaleidoscope shape image generated using this method, showcasing the intricate patterns that result from the recursive plotting of amino acid coordinates.

Algorithm 1 Generate Kaleidoscope (DANCE)
Input: protein sequence seq, recursion depth depth, seed position pos, rotation angle angle, scale factor scale
Output: DANCE (kaleidoscope shape) image of seq
GenKaleidoscope(seq, depth, pos, angle, scale)
  if depth ≤ 0 then
    return
  end if
  x, y ← pos
  dx ← scale · cos(angle)
  dy ← scale · sin(angle)
  for AminoAcid in seq do
    x, y ← x + dx, y + dy
    cx, cy ← CoordinateRule(AminoAcid) {from Table I}
    plt.plot([x, cx], [y, cy], color=color)
    plt.plot([x, cx], [y, -cy], color=color)
    plt.plot([-x, cx], [-y, cy], color=color)
    plt.plot([-x, cx], [-y, -cy], color=color)
    GenKaleidoscope(seq, depth - 1, (x, y), angle, scale)
    GenKaleidoscope(seq, depth - 1, (x, -y), angle, scale)
    GenKaleidoscope(seq, depth - 1, (-x, y), angle, scale)
    GenKaleidoscope(seq, depth - 1, (-x, -y), angle, scale)
    depth ← depth - 1
  end for
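A literal Python transcription of Algorithm 1 may make the recursion easier to follow. The sketch below is an illustration under stated assumptions rather than the released implementation: the names gen_kaleidoscope, dance_segments, and COORDS are hypothetical, the coordinates are copied from Table I, and the line segments are collected in a list instead of being drawn one by one with plt.plot. The decrement of depth inside the loop is kept exactly as written in the pseudocode.

```python
import math

# Amino-acid coordinates copied from Table I.
COORDS = {
    "A": (0.5, 0.5), "C": (1.0, 0.5), "D": (0.5, 1.0), "E": (0.0, 0.5),
    "F": (1.0, 1.0), "G": (0.25, 0.25), "H": (0.75, 0.25), "I": (0.75, 0.75),
    "K": (0.25, 0.75), "L": (0.75, 0.0), "M": (0.5, 0.0), "N": (0.25, 0.5),
    "P": (1.0, 0.0), "Q": (0.0, 1.0), "R": (0.5, 0.25), "S": (0.75, 0.5),
    "T": (0.5, 0.75), "V": (0.0, 0.0), "W": (1.0, 0.25), "Y": (1.0, 0.75),
}

# The four mirror images of a point used for the recursive calls.
REFLECTIONS = ((1, 1), (1, -1), (-1, 1), (-1, -1))


def gen_kaleidoscope(seq, depth, pos, angle, scale, segments):
    """Append the segments of Algorithm 1 to `segments` (hypothetical helper)."""
    if depth <= 0:
        return
    x, y = pos
    dx = scale * math.cos(angle)
    dy = scale * math.sin(angle)
    for amino_acid in seq:
        x, y = x + dx, y + dy
        cx, cy = COORDS[amino_acid]
        # plt.plot([x0, x1], [y0, y1]) draws a segment from (x0, y0) to (x1, y1).
        segments.append(((x, y), (cx, cy)))
        segments.append(((x, y), (cx, -cy)))
        segments.append(((-x, -y), (cx, cy)))
        segments.append(((-x, -y), (cx, -cy)))
        for sx, sy in REFLECTIONS:
            gen_kaleidoscope(seq, depth - 1, (sx * x, sy * y), angle, scale, segments)
        # As in the pseudocode, later residues recurse with a smaller depth.
        depth -= 1


def dance_segments(seq, depth=4, pos=(0.0, 0.0), angle=0.0, scale=10.0):
    """Return all segments for `seq`, using the hyperparameter values from the text."""
    segments = []
    gen_kaleidoscope(seq, depth, pos, angle, scale, segments)
    return segments


segs = dance_segments("ACQRSTAGTACGT")  # sample sequence from Figure 1
print(len(segs), "segments")
```

Collecting the segments first keeps the recursion free of plotting side effects; the list could then be rendered in a single call, for instance with matplotlib's LineCollection, and saved at the input resolution expected by the classifier.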

IV Experimental Setup

This section presents details regarding the dataset used and the evaluation metrics employed in the experiments. The experiments were performed on a computer system equipped with an Intel(R) Core i5 processor, 32 GB of memory, and a 64-bit Windows 10 operating system. The models were implemented using the Python programming language. For the sake of reproducibility, we have made our preprocessed data and code available online (the preprocessed data and code can be accessed in the published version of this work).

For assessing the effectiveness of the deep learning models, we measure several performance metrics, including average accuracy, precision, recall, F1 (weighted), F1 (macro), ROC-AUC, and training runtime. In the case of multi-class classification, we adopt the one-vs-rest approach to utilize binary classification-based evaluation metrics. This approach enables us to evaluate the model’s performance across multiple classes. By using these metrics, we ensure a thorough evaluation of our deep learning models, addressing various aspects of performance from accuracy and error rates to computational efficiency. This comprehensive assessment helps in fine-tuning the models and making informed decisions about their deployment and application.
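The per-class (one-vs-rest) metrics and their averages can be sketched in a few lines of plain Python. This is a minimal illustration, not the evaluation code used in the experiments: the function names and the toy labels are hypothetical, support-weighted averaging of precision and recall is an assumption, and ROC-AUC and runtime are omitted. One property worth noting is that support-weighted recall always equals accuracy, which is consistent with the identical Acc. and Recall columns in Table III.

```python
from collections import Counter


def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a class is never predicted or never present."""
    return numerator / denominator if denominator else 0.0


def evaluate(y_true, y_pred):
    """Compute accuracy plus one-vs-rest precision, recall, and F1 averages."""
    labels = sorted(set(y_true))
    support = Counter(y_true)
    n = len(y_true)
    prec, rec, f1 = {}, {}, {}
    for c in labels:
        # Treat class c as positive and all other classes as negative.
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec[c] = safe_div(tp, tp + fp)
        rec[c] = safe_div(tp, tp + fn)
        f1[c] = safe_div(2 * prec[c] * rec[c], prec[c] + rec[c])

    def weighted(metric):
        # Average weighted by the number of true samples per class.
        return sum(support[c] / n * metric[c] for c in labels)

    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / n,
        "precision_weighted": weighted(prec),
        "recall_weighted": weighted(rec),
        "f1_weighted": weighted(f1),
        "f1_macro": sum(f1.values()) / len(labels),
    }


# Toy labels for illustration only; these are not results from the paper.
y_true = ["HeadNeck", "HeadNeck", "Ovarian", "Pancreatic",
          "Retroperitoneal", "Retroperitoneal"]
y_pred = ["HeadNeck", "Retroperitoneal", "Ovarian", "HeadNeck",
          "Retroperitoneal", "Retroperitoneal"]
scores = evaluate(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

The gap between the weighted and macro F1 in this toy example comes from the class that is never predicted correctly: macro averaging counts it as one quarter of the score, while weighted averaging counts it by its share of the samples.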

IV-A Dataset Statistics

The TCR sequence data used in this study was obtained from TCRdb, a comprehensive database for T-cell receptor sequences known for its powerful search function [34]. TCRdb contains a vast collection of over 277 million sequences derived from more than 8265 TCR-Seq samples, encompassing various tissues, clinical conditions, and cell types. In this study, our focus was on identifying and extracting data related to the five most prevalent types of cancer based on their incidence rates. To ensure a representative subset of the data while preserving the distribution of target labels (cancer types), we employed the Stratified ShuffleSplit method. Through this approach, we randomly extracted a total of 14205 TCR sequences for four different types of cancers. The dataset’s statistics used for experimentation are presented in Table II.

TABLE II: Dataset Statistics for the T-cell receptor protein sequences.
Target Label (Cancer Type) |Sequences|
HeadNeck 5230
Ovarian 583
Pancreatic 2887
Retroperitoneal 5505
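To make the effect of stratified sampling concrete, the sketch below builds a label list from the counts in Table II and splits it so that each cancer type keeps its share in both parts. It is a plain-Python illustration with a hypothetical helper name (stratified_split) and an arbitrary seed, not the scikit-learn routine used in the study; the 20% held-out fraction mirrors the train-test split described in Section IV-C.

```python
import random

# Class sizes copied from Table II.
CLASS_COUNTS = {
    "HeadNeck": 5230,
    "Ovarian": 583,
    "Pancreatic": 2887,
    "Retroperitoneal": 5505,
}


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so each class contributes the same fraction to the test set."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


labels = [name for name, count in CLASS_COUNTS.items() for _ in range(count)]
train_idx, test_idx = stratified_split(labels)

total = len(labels)
for name, count in CLASS_COUNTS.items():
    overall = count / total
    in_test = sum(labels[i] == name for i in test_idx) / len(test_idx)
    print(f"{name:16s} overall={overall:.3f} test={in_test:.3f}")
```

The printed shares also show how imbalanced the data is: the largest class (Retroperitoneal) accounts for roughly 39% of the sequences, which is a useful reference point when reading accuracy figures.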

IV-B Feature Engineering Baselines

In addition to the Chaos method [15], which serves as the state-of-the-art (SOTA) image-based approach for comparison, we incorporate two numerical feature vector-based sequence embedding methods (OHE and WDGRL) and one kernel-based method (Efficient Kernel) as baselines. The following subsections describe these baselines.

IV-B1 One Hot Encoding (OHE) [9]

OHE (One-Hot Encoding) is an algorithm used to transform a sequence into a numerical representation. It creates a binary feature vector for each character in the sequence, and these binary vectors are then concatenated to represent the entire sequence. While OHE is a simple and intuitive method, the resulting vectors tend to be highly sparse, leading to challenges related to the curse of dimensionality.
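As a small illustration of the encoding, the sketch below one-hot encodes each residue over the 20-letter amino acid alphabet, concatenates the per-residue vectors, and zero-pads shorter sequences to a common length. The function name, the alphabet ordering, and the two example sequences are hypothetical choices made for this sketch.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids, alphabetical order
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(seq, max_len):
    """Concatenate one 20-dimensional binary vector per residue, zero-padded to max_len."""
    vec = [0] * (max_len * len(AMINO_ACIDS))
    for pos, aa in enumerate(seq[:max_len]):
        vec[pos * len(AMINO_ACIDS) + INDEX[aa]] = 1
    return vec


# Hypothetical short sequences, used only to show the shapes involved.
seqs = ["CASSL", "CAS"]
max_len = max(len(s) for s in seqs)
vectors = [one_hot_encode(s, max_len) for s in seqs]

for s, v in zip(seqs, vectors):
    print(s, "-> length", len(v), "with", sum(v), "non-zero entries")
```

Each vector has 20 × max_len entries but only one non-zero entry per residue, which is the sparsity the text refers to.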

IV-B2 Wasserstein Distance Guided Representation Learning (WDGRL) [35]

This is an unsupervised domain adaptation technique that aims to transform high-dimensional vectors into low-dimensional representations. This approach utilizes neural networks to determine the Wasserstein distance (WD) between the encoded distributions of the source and target data. By optimizing the feature extractor network and minimizing the estimated WD, WDGRL obtains effective representations of the input data features. WDGRL operates on the feature vectors generated by the OHE method.

IV-B3 Efficient Kernel [36]

The authors of [36] propose a kernel-based method for molecular sequence classification, addressing challenges in detecting diseases using molecular data. The approach involves creating a kernel matrix using normalized pairwise k-mer distances, optimized via the Sinkhorn-Knopp algorithm, followed by kernel PCA to reduce dimensionality. We use this method with the logistic regression classifier (a commonly used classifier in the literature) as a baseline for cancer prediction.

IV-C Classification Models

To classify TCRs with respect to their cancer type, we employ two types of deep learning (DL) models: vision models and tabular models.

The vision models consist of a set of DL classifiers that are applicable to image data, and they are used to classify the TCR images generated by our proposed approach and by the Chaos baseline. This set has 4 custom convolutional neural network (CNN) models along with 2 pre-trained classifiers. The custom classifiers are known as 1-Layer CNN, 2-Layer CNN, 3-Layer CNN, and 4-Layer CNN. Their names indicate the number of hidden Block layers present in them. For instance, the 4-Layer CNN has 4 Block layers, where a Block layer consists of a Convolution layer followed by a ReLU activation function and a Max-Pool layer with a kernel size of 5x5 and a stride of 2x2. In each of the custom models, the final part comprises 2 fully connected layers with the ReLU activation function and a Softmax classification layer. These custom CNN classifiers illustrate the impact of increasing the number of layers on classification performance. Moreover, the impact of transfer learning is observed by using pre-trained models for the TCR classification task. We employ two pre-trained models, VGG-19 [37] and RESNET-50 [38], as both are very popular image classifiers. Furthermore, an 80-20% train-test split based on stratified sampling is used for training the vision models. This sampling technique preserves the proportions between the classes. The input images are of size 380 × 380. The training hyper-parameters, chosen after tuning the models, are a learning rate of 0.003, a batch size of 64, 10 epochs, and the ADAM optimizer. Additionally, the negative log-likelihood (NLL) loss [39] is used as the training loss, because NLL computed on the predicted class log-probabilities is the cross-entropy loss for multi-class problems.
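A short calculation shows how quickly the stacked Max-Pool layers shrink the 380 × 380 input. The sketch below applies the standard pooling output-size formula, floor((n - k) / s) + 1, once per Block layer. It rests on assumptions that the text does not state: the convolution in each block is taken to preserve the spatial size (padded convolution), and the pooling is taken to use no padding. The function names are hypothetical.

```python
def pool_output_size(size, kernel=5, stride=2):
    """Spatial size after max pooling without padding: floor((n - k) / s) + 1."""
    return (size - kernel) // stride + 1


def feature_map_sizes(input_size=380, num_blocks=4):
    """Side length of the feature map at the input and after each Block layer.

    Assumes each convolution keeps the spatial size unchanged, so only the
    5x5 / stride-2 pooling reduces it.
    """
    sizes = [input_size]
    for _ in range(num_blocks):
        sizes.append(pool_output_size(sizes[-1]))
    return sizes


print(feature_map_sizes())  # [380, 188, 92, 44, 20]
```

Under these assumptions, the 4-Layer CNN passes a 20 × 20 feature map to its fully connected layers, compared with 188 × 188 for the 1-Layer CNN, which gives a sense of how differently sized the classifier heads of the four custom models are.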

The tabular CNN classifiers take vector data as input, and these models are applied to the vectors generated by the feature-engineering-based baselines (OHE and WDGRL). The Tab CNN set contains the 3-Layer Tab CNN and 4-Layer Tab CNN models. Their names indicate the number of hidden linear layers in them; for example, the 4-Layer Tab CNN model has 4 hidden fully connected layers. In both models, the hidden layers are followed by a final linear classification layer. Their training hyper-parameters are a learning rate of 0.003, a batch size of 64, 10 epochs, the ADAM optimizer, and the NLL loss function. They also follow the 80-20% train-test split. Moreover, the WDGRL technique generates vectors of dimension 10, while OHE uses a zero-padding strategy to make its vectors the same length.

V Results and Discussion

This section deals with the classification results of TCRs based on their cancer activity type using various DL classifiers. The results are summarized in Table III.

V-A Comparison with feature-engineering-based baselines

The results illustrate that the feature-engineering-based baselines (OHE and WDGRL) with the tabular CNN models achieve lower performance than our image-based method (DANCE) for all the evaluation metrics except the training runtime. We can also observe that DANCE outperforms the Efficient Kernel method for all evaluation metrics except training time, showing that the image-based approach captures the underlying sequence patterns more effectively than kernel-based embeddings. This suggests that transforming sequences into an image format allows deep learning models to better leverage spatial relationships and local dependencies within the data, leading to superior predictive performance. Additionally, DANCE’s ability to outperform kernel-based methods highlights the advantage of using convolutional architectures for sequence classification, as they excel at recognizing complex structures in visual representations, which are often missed by traditional vector-based or kernel methods.

V-B Comparison with image-based baseline

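We can observe that our method (DANCE) outperforms the image-based baseline (Chaos) for nearly all evaluation metrics across the models. The exceptions in Table III are the weighted F1 score of the 1-Layer CNN (0.335 for Chaos vs. 0.312 for DANCE) and the training time of the pre-trained RESNET-50 (7.600 vs. 8.152 hours). This indicates that the images generated by DANCE are more informative in terms of classification performance than the images created by Chaos.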

Moreover, DANCE achieves its highest performance for almost all evaluation metrics with the 3-Layer CNN model, while the 1-Layer CNN model also yields the best values for accuracy, recall, and ROC AUC. Increasing the number of layers to 3 improves most of the metrics, whereas using more than 3 layers decreases performance. One possible reason is the vanishing gradient issue: as our dataset is not large, a higher number of layers in the model may cause the gradient to vanish, hindering the learning capacity of the model.

Furthermore, we investigated transfer learning for TCR classification using the pre-trained RESNET-50 and VGG-19 models. The results show that DANCE images paired with our custom CNN models perform better than the same images paired with the pre-trained models (for example, an accuracy of 0.478 with the 3-layer CNN versus 0.459 with RESNET-50 and 0.430 with VGG-19), while also requiring less training time. A likely reason is that RESNET-50 and VGG-19 were originally trained on different types of image data, so they do not generalize well to the DANCE-based images.

TABLE III: The TCR classification results for different models and algorithms. The best values are shown in bold.

| DL Model | Method | Acc. ↑ | Prec. ↑ | Recall ↑ | F1 (Weighted) ↑ | F1 (Macro) ↑ | ROC AUC ↑ | Train Time (hrs.) ↓ |
|---|---|---|---|---|---|---|---|---|
| - | Efficient Kernel [36] | 0.386 | 0.149 | 0.386 | 0.215 | 0.139 | 0.500 | 1.207 |
| 3-Layer Tab CNN | OHE [9] | 0.388 | 0.291 | 0.388 | 0.321 | 0.211 | 0.491 | 0.249 |
| 3-Layer Tab CNN | WDGRL [35] | 0.436 | 0.339 | 0.436 | 0.358 | 0.236 | 0.510 | **0.070** |
| 4-Layer Tab CNN | OHE [9] | 0.371 | 0.286 | 0.371 | 0.288 | 0.192 | 0.489 | 0.330 |
| 4-Layer Tab CNN | WDGRL [35] | 0.435 | 0.384 | 0.435 | 0.355 | 0.236 | 0.500 | 0.074 |
| 1-Layer CNN | Chaos | 0.343 | 0.330 | 0.343 | 0.335 | 0.246 | 0.498 | 4.983 |
| 1-Layer CNN | DANCE (Ours) | **0.478** | 0.440 | **0.478** | 0.312 | 0.278 | **0.635** | 3.099 |
| 2-Layer CNN | Chaos | 0.381 | 0.285 | 0.381 | 0.215 | 0.140 | 0.499 | 5.183 |
| 2-Layer CNN | DANCE (Ours) | 0.460 | 0.407 | 0.460 | 0.394 | 0.264 | 0.544 | 3.101 |
| 3-Layer CNN | Chaos | 0.379 | 0.143 | 0.379 | 0.208 | 0.137 | 0.500 | 6.156 |
| 3-Layer CNN | DANCE (Ours) | **0.478** | **0.451** | **0.478** | **0.430** | **0.299** | 0.559 | 3.186 |
| 4-Layer CNN | Chaos | 0.381 | 0.145 | 0.381 | 0.210 | 0.138 | 0.500 | 5.566 |
| 4-Layer CNN | DANCE (Ours) | 0.457 | 0.341 | 0.457 | 0.385 | 0.255 | 0.542 | 3.105 |
| PreTrained RESNET50 | Chaos | 0.379 | 0.143 | 0.379 | 0.208 | 0.137 | 0.489 | 7.600 |
| PreTrained RESNET50 | DANCE (Ours) | 0.459 | 0.343 | 0.459 | 0.393 | 0.261 | 0.501 | 8.152 |
| PreTrained VGG-19 | Chaos | 0.379 | 0.143 | 0.379 | 0.208 | 0.137 | 0.488 | 16.420 |
| PreTrained VGG-19 | DANCE (Ours) | 0.430 | 0.320 | 0.430 | 0.366 | 0.243 | 0.500 | 15.643 |
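To make the metric columns of Table III easier to interpret, the following minimal sketch computes accuracy, support-weighted precision, recall, and F1, and macro F1 from predicted labels. It assumes support-weighted averaging for precision and recall, which this section does not state explicitly; the function names and the toy labels "A", "B", "C" are hypothetical.

```python
def safe_div(numerator, denominator):
    """Division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_summary(y_true, y_pred):
    """Accuracy plus support-weighted and macro-averaged scores (assumed averaging)."""
    n = len(y_true)
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n
    w_prec = w_rec = w_f1 = macro_f1 = 0.0
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        prec = safe_div(tp, tp + fp)
        rec = safe_div(tp, tp + fn)
        f1 = safe_div(2 * prec * rec, prec + rec)
        weight = (tp + fn) / n  # class support as a fraction of all samples
        w_prec += weight * prec
        w_rec += weight * rec
        w_f1 += weight * f1
        macro_f1 += f1 / len(labels)
    return {
        "accuracy": accuracy,
        "precision_weighted": w_prec,
        "recall_weighted": w_rec,
        "f1_weighted": w_f1,
        "f1_macro": macro_f1,
    }


# Toy example: a constant predictor that always outputs the majority class "A".
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A"] * len(y_true)
summary = classification_summary(y_true, y_pred)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Support-weighted recall always equals accuracy, which would explain why the Acc. and Recall columns coincide in every row of the table. For a constant majority-class predictor, weighted precision equals the squared accuracy; several Chaos rows (for example, accuracy 0.379 with precision 0.143 and ROC AUC 0.500) are numerically consistent with that pattern, although the paper does not state that those models collapsed to a single class.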

VI Conclusion

In conclusion, this study presents the DANCE (Deep Learning-Assisted Analysis of ProteiN Sequences Using Chaos Enhanced Kaleidoscopic Images) approach, which combines Chaos Game Representation (CGR) with deep learning classification to address the challenges in analyzing T-cell protein sequences. By generating kaleidoscopic images using CGR, DANCE offers a visual representation that preserves essential details and captures the structural and functional characteristics of protein sequences. Deep learning models trained on DANCE images demonstrate the effectiveness of this representation for protein sequence classification. Additionally, the study investigates the relationship between the visual patterns observed in DANCE images and protein properties, providing insights into structural motifs, protein domains, secondary structures, and other relevant features. By bridging the gap between visual representations and protein classification, this research contributes to the field of protein analysis and bioinformatics, offering new possibilities for a comprehensive and intuitive understanding of protein sequences. Future work includes evaluating DANCE on other biological datasets, such as coronavirus spike sequences and Zika virus sequences. Using more advanced deep learning models, such as Transformers for image classification, is another promising extension.

References

  • [1] N. Li, J. Yuan, W. Tian, L. Meng, and Y. Liu, “T-cell receptor repertoire analysis for the diagnosis and treatment of solid tumor: a methodology and clinical applications,” Cancer Communications, vol. 40, no. 10, pp. 473–483, 2020.
  • [2] S. H. Gohil, J. B. Iorgulescu, D. A. Braun, D. B. Keskin, and K. J. Livak, “Applying high-dimensional single-cell technologies to the analysis of cancer immunotherapy,” Nature Reviews Clinical Oncology, vol. 18, no. 4, pp. 244–256, 2021.
  • [3] X. Hou, M. Wang, C. Lu, Q. Xie, G. Cui, J. Chen, Y. Du, Y. Dai, and H. Diao, “Analysis of the repertoire features of TCR beta chain CDR3 in human by high-throughput sequencing,” Cellular Physiology and Biochemistry, vol. 39, no. 2, pp. 651–667, 2016.
  • [4] S. Ali and M. Patterson, “Spike2Vec: An efficient and scalable embedding approach for COVID-19 spike sequences,” in IEEE International Conference on Big Data (Big Data), 2021, pp. 1533–1540.
  • [5] S. Ali, B. Bello, P. Chourasia, R. T. Punathil, Y. Zhou, and M. Patterson, “PWM2Vec: An efficient embedding approach for viral host specification from coronavirus spike sequences,” Biology, vol. 11, no. 3, p. 418, 2022.
  • [6] C. Chen, Y. Zha, D. Zhu, K. Ning, and X. Cui, “Hydrogen bonds meet self-attention: all you need for protein structure embedding,” in 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2021, pp. 12–17.
  • [7] Z. Du, Y. He, J. Li, and V. N. Uversky, “DeepAdd: protein function prediction from k-mer embedding and additional features,” Computational Biology and Chemistry, vol. 89, p. 107379, 2020.
  • [8] Z. Tayebi, S. Ali, and M. Patterson, “Robust representation and efficient feature selection allows for effective clustering of SARS-CoV-2 variants,” Algorithms, vol. 14, no. 12, p. 348, 2021.
  • [9] K. Kuzmin et al., “Machine learning methods accurately predict host specificity of coronaviruses based on spike sequences alone,” Biochemical and Biophysical Research Communications, vol. 533, no. 3, pp. 553–558, 2020.
  • [10] S. Ali, B. Sahoo, N. Ullah, A. Zelikovskiy, M. Patterson, and I. Khan, “A k-mer based approach for SARS-CoV-2 variant identification,” in International Symposium on Bioinformatics Research and Applications, 2021, pp. 153–164.
  • [11] L. Wu, C. Yin, J. Zhu, Z. Wu, L. He, Y. Xia, S. Xie, T. Qin, and T.-Y. Liu, “SPRoBERTa: protein embedding learning with local fragment modeling,” Briefings in Bioinformatics, vol. 23, no. 6, 2022.
  • [12] W. Yeung, Z. Zhou, L. Mathew, N. Gravel, R. Taujale, B. O’Boyle, M. Salcedo, A. Venkat, W. Lanzilotta, S. Li et al., “Tree visualizations of protein sequence embedding space enable improved functional clustering of diverse protein superfamilies,” Briefings in Bioinformatics, vol. 24, no. 1, p. bbac619, 2023.
  • [13] J. Ingraham, V. Garg, R. Barzilay, and T. Jaakkola, “Generative models for graph-based protein design,” Advances in neural information processing systems, vol. 32, 2019.
  • [14] H. J. Jeffrey, “Chaos game representation of gene structure,” Nucleic acids research, vol. 18, no. 8, pp. 2163–2170, 1990.
  • [15] H. F. Löchel, D. Eger, T. Sperlea, and D. Heider, “Deep learning on chaos game representation for proteins,” Bioinformatics, vol. 36, no. 1, pp. 272–279, 2020.
  • [16] A. S. Nair, V. V. Nair, K. Arun, K. Kant, and A. Dey, “Bio-sequence signatures using chaos game representation,” Bioinformatics: applications in life and environmental sciences, pp. 62–76, 2009.
  • [17] S. Li, W. Song, L. Fang, Y. Chen, P. Ghamisi, and J. A. Benediktsson, “Deep learning for hyperspectral image classification: An overview,” IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 9, pp. 6690–6709, 2019.
  • [18] J. D. Hirst and M. J. Sternberg, “Prediction of structural and functional features of protein and nucleic acid sequences by artificial neural networks,” Biochemistry, vol. 31, no. 32, pp. 7211–7218, 1992.
  • [19] S. Matsuda, J.-P. Vert, H. Saigo, N. Ueda, H. Toh, and T. Akutsu, “A novel representation of protein sequences for prediction of subcellular location using support vector machines,” Protein Science, vol. 14, no. 11, pp. 2804–2813, 2005.
  • [20] C. M. Deber, C. Wang, L.-P. Liu, A. S. Prior, S. Agrawal, B. L. Muskat, and A. J. Cuticchia, “TM Finder: a prediction program for transmembrane protein segments using a combination of hydrophobicity and nonpolar phase helicity scales,” Protein Science, vol. 10, no. 1, pp. 212–219, 2001.
  • [21] A. Cherkasov et al., “QSAR modeling: where have you been? where are you going to?” Journal of medicinal chemistry, vol. 57, no. 12, pp. 4977–5010, 2014.
  • [22] J. Cui, Q. Liu, D. Puett, and Y. Xu, “Computational prediction of human proteins that can be secreted into the bloodstream,” Bioinformatics, vol. 24, no. 20, pp. 2370–2375, 2008.
  • [23] Z. Cournia, T. W. Allen, I. Andricioaei, B. Antonny, D. Baum, G. Brannigan, N.-V. Buchete, J. T. Deckman, L. Delemotte, C. Del Val et al., “Membrane protein structure, function, and dynamics: a perspective from experiments and theory,” The Journal of membrane biology, vol. 248, pp. 611–640, 2015.
  • [24] P. E. Bourne, E. J. Draizen, and C. Mura, “The curse of the protein ribbon diagram,” PLoS biology, vol. 20, no. 12, p. e3001901, 2022.
  • [25] N. Matthews, R. Easdon, A. Kitao, S. Hayward, and S. Laycock, “High quality rendering of protein dynamics in space filling mode,” Journal of Molecular Graphics and Modelling, vol. 78, pp. 158–167, 2017.
  • [26] T. Itoh, C. Muelder, K.-L. Ma, and J. Sese, “A hybrid space-filling and force-directed layout method for visualizing multiple-category graphs,” in 2009 IEEE Pacific Visualization Symposium. IEEE, 2009, pp. 121–128.
  • [27] H. F. Löchel and D. Heider, “Chaos game representation and its applications in bioinformatics,” Computational and Structural Biotechnology Journal, vol. 19, pp. 6263–6271, 2021.
  • [28] A. Thomas, “Three dimensional chaos game representation of protein sequences,” arXiv preprint arXiv:2303.09683, 2023.
  • [29] C. Affonso, A. L. D. Rossi, F. H. A. Vieira, A. C. P. de Leon Ferreira et al., “Deep learning for biological image classification,” Expert systems with applications, vol. 85, pp. 114–122, 2017.
  • [30] D. W. Otter, J. R. Medina, and J. K. Kalita, “A survey of the usages of deep learning for natural language processing,” IEEE transactions on neural networks and learning systems, vol. 32, no. 2, pp. 604–624, 2020.
  • [31] C. Ao, S. Jiao, Y. Wang, L. Yu, and Q. Zou, “Biological sequence classification: A review on data and general methods,” Research, vol. 2022, p. 0011, 2022.
  • [32] N. Colaert, K. Helsens, L. Martens, J. Vandekerckhove, and K. Gevaert, “Improved visualization of protein consensus sequences by iceLogo,” Nature methods, vol. 6, no. 11, pp. 786–787, 2009.
  • [33] A. W. Senior, R. Evans, J. Jumper, J. Kirkpatrick, L. Sifre, T. Green, C. Qin, A. Žídek, A. W. Nelson, A. Bridgland et al., “Improved protein structure prediction using potentials from deep learning,” Nature, vol. 577, no. 7792, pp. 706–710, 2020.
  • [34] S.-Y. Chen, T. Yue, Q. Lei, and A.-Y. Guo, “TCRdb: a comprehensive database for T-cell receptor sequences with powerful search function,” Nucleic Acids Research, vol. 49, no. D1, pp. D468–D474, 2021.
  • [35] J. Shen, Y. Qu, W. Zhang, and Y. Yu, “Wasserstein distance guided representation learning for domain adaptation,” in AAAI conference on artificial intelligence, 2018.
  • [36] S. Ali, T. E. Ali, T. Murad, H. Mansoor, and M. Patterson, “Molecular sequence classification using efficient kernel based embedding,” Information Sciences, vol. 679, p. 121100, 2024.
  • [37] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015.
  • [38] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [39] Yao et al., “Negative log likelihood ratio loss for deep neural network classification,” in Proceedings of the Future Technologies Conference. Springer, 2019, pp. 276–282.