NGS Data Analysis

Uploaded by

lucylit0666

NGS (Next-Generation Sequencing) generates massive amounts of raw data,
requiring systematic analysis to ensure accuracy and reliability. The initial
steps include handling FASTQ files, performing a quality check, and
applying pre-processing steps to prepare the data for downstream
analysis.

1. FASTQ Files

What are FASTQ Files?

• FASTQ is a standard file format for storing raw sequence data
generated from NGS platforms (e.g., Illumina, Oxford Nanopore).
• It combines both nucleotide sequence data and quality scores in a
single file.

Structure of a FASTQ File:

Each sequence entry in a FASTQ file consists of 4 lines:

1. Sequence Identifier: Starts with @ followed by a unique sequence
identifier.
2. Sequence: The actual nucleotide sequence (A, T, G, C, N).
3. Plus (+) Line: A + symbol, optionally followed by the same
sequence identifier.
4. Quality Scores: ASCII-encoded quality scores, one character for each
base in the sequence, so this line has the same length as line 2.

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGT
+
!''*((((***+))%%%++)(%%%%).1***-+*
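
The four-line structure above can be read with a few lines of Python. The following is a minimal sketch, not a replacement for a dedicated parser: it assumes unwrapped (exactly four lines per record) input and the Phred+33 ASCII offset used by current Illumina output. The names FastqRecord, read_fastq, phred_scores and error_probability are illustrative, not taken from any library.

```python
from io import StringIO
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # header line without the leading '@'
    sequence: str    # nucleotide string
    quality: str     # ASCII-encoded quality string, same length as sequence


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly 4 lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode an ASCII quality string into integer Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Convert a Phred score Q into an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Small in-memory example (hypothetical read, not real data).
example = StringIO("@read1\nGATTTGGGG\n+\n!''*((((*\n")
records = list(read_fastq(example))
scores = phred_scores(records[0].quality)
print(records[0].identifier, scores)
```

With this encoding, a score of Q30 corresponds to an estimated error probability of 1 in 1,000 for that base.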

Key Tools for Handling FASTQ Files:

• FastQC: Quality control checks.
• seqtk: Lightweight toolkit for FASTQ file manipulation.
• fastp: FASTQ pre-processing tool.

2. Quality Check (QC)

Why is Quality Check Important?

• Assesses the accuracy of the raw sequencing data before it is used.
• Identifies poor-quality reads, adapter contamination, and other
sequencing artifacts.
• Prevents downstream errors in alignment, variant calling, or
assembly.

Key Metrics in Quality Control:

1. Per-base Sequence Quality: Quality scores across each nucleotide
position.
2. Per-sequence Quality Scores: Overall quality distribution of all
reads.
3. Adapter Content: Detects adapter sequences that may still be
present in reads.
4. GC Content: Checks that the GC distribution of the reads matches
what is expected for the organism; deviations can indicate
contamination or bias.
5. Read Length Distribution: Consistency in read lengths across
samples.
6. Duplicated Reads: Measures how often identical sequences occur;
high levels can indicate PCR duplicates.
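
Several of these metrics are simple summaries over the reads. The sketch below shows how three of them could be computed by hand, assuming Phred+33 quality strings and exact-sequence matching for duplicates; the function names are illustrative and this is not how FastQC itself is implemented.

```python
from collections import Counter


def per_base_mean_quality(qualities: list[str], offset: int = 33) -> list[float]:
    """Mean Phred score at each read position (reads may differ in length)."""
    totals: list[int] = []
    counts: list[int] = []
    for qual in qualities:
        for i, ch in enumerate(qual):
            if i == len(totals):
                # First read that reaches this position: open a new column.
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def gc_fraction(sequence: str) -> float:
    """Fraction of G and C bases in a read (N counts as non-GC)."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def duplicate_fraction(sequences: list[str]) -> float:
    """Fraction of reads whose exact sequence already occurred earlier."""
    if not sequences:
        return 0.0
    distinct = len(Counter(sequences))
    return 1 - distinct / len(sequences)


# Hypothetical toy reads: 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
reads = ["GGCC", "ATAT", "GGCC", "ACGT"]
quals = ["IIII", "II55", "IIII", "I5I5"]
print(per_base_mean_quality(quals))
print([gc_fraction(r) for r in reads])
print(duplicate_fraction(reads))
```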

Quality Control Tools:

• FastQC: Comprehensive quality assessment.
• MultiQC: Aggregates multiple FastQC reports.
• Trim Galore!: Wraps Cutadapt and FastQC to combine adapter and
quality trimming with QC.

Example Output from FastQC:

FastQC flags each analysis module with a colour:

• Green: Pass (good quality).
• Orange: Warning.
• Red: Fail (poor quality; inspect and intervene if needed).

3. Pre-processing

What is Pre-processing?

Pre-processing involves cleaning and preparing raw sequencing data for
downstream analysis. It includes:

1. Adapter Trimming
2. Quality Filtering
3. Read Trimming and Cropping
4. Removal of Contaminants
5. De-duplication

Key Steps in Pre-processing:

1. Adapter Trimming:

• Adapters are short sequences added during library preparation.
• Residual adapter sequences can interfere with alignment and
analysis.
• Tools:
o Cutadapt
o Trimmomatic
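
To make the idea concrete, here is a minimal sketch of 3' adapter removal by exact string matching. It is a simplification: tools such as Cutadapt tolerate mismatches, which this sketch does not. The function name trim_adapter, the min_overlap parameter and the adapter sequence used in the example are illustrative assumptions.

```python
def trim_adapter(sequence: str, quality: str, adapter: str,
                 min_overlap: int = 3) -> tuple[str, str]:
    """Remove a 3' adapter by exact matching; quality is trimmed in step."""
    # Case 1: the whole adapter occurs inside the read; cut from there on.
    pos = sequence.find(adapter)
    if pos != -1:
        return sequence[:pos], quality[:pos]
    # Case 2: the read ends with a prefix of the adapter (longest first).
    longest = min(len(adapter) - 1, len(sequence))
    for overlap in range(longest, min_overlap - 1, -1):
        if sequence.endswith(adapter[:overlap]):
            cut = len(sequence) - overlap
            return sequence[:cut], quality[:cut]
    # No adapter found: return the read unchanged.
    return sequence, quality


# Illustrative adapter sequence; use the one from your library prep kit.
adapter = "AGATCGGAAGAGC"
full = trim_adapter("ACGTACGT" + adapter + "TT", "I" * 23, adapter)
partial = trim_adapter("ACGTACGTAGATC", "I" * 13, adapter)
clean = trim_adapter("ACGTACGT", "I" * 8, adapter)
print(full, partial, clean)
```

The min_overlap guard avoids trimming one or two bases that match the start of the adapter purely by chance.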

2. Quality Filtering:

• Removes reads with poor quality scores.
• Filters based on:
o Minimum Phred Score (e.g., Q30)
o Minimum read length (e.g., >50 bp)
• Tools:
o FASTP
o PRINSEQ
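
A simple version of these two filters is sketched below. It assumes Phred+33 encoding, uses the mean quality of the read as the score criterion, and treats the length threshold as inclusive; real tools apply their own, often more refined, criteria. The names mean_phred and passes_filter are illustrative.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a read (0.0 for an empty quality string)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_filter(sequence: str, quality: str,
                  min_mean_q: float = 30.0, min_length: int = 50) -> bool:
    """Keep a read only if it is long enough and its mean quality is high enough."""
    return len(sequence) >= min_length and mean_phred(quality) >= min_mean_q


# Hypothetical reads: 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
candidates = [
    ("A" * 60, "I" * 60),  # long and high quality: kept
    ("A" * 40, "I" * 40),  # too short: dropped
    ("A" * 60, "5" * 60),  # mean quality Q20: dropped
]
kept = [(seq, qual) for seq, qual in candidates if passes_filter(seq, qual)]
print(len(kept))
```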

3. Read Trimming and Cropping:

• Trims poor-quality bases from the ends of reads.
• Crops reads to a specific length if required.
• Tools:
o Sickle
o Trim Galore!
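
Both operations can be sketched in a few lines. The example below trims bases from each end until it meets one at or above a quality threshold, then crops to a fixed length. It assumes Phred+33 encoding and a per-base cutoff; dedicated trimmers typically use sliding windows or running sums instead. The function names and thresholds are illustrative.

```python
def trim_low_quality_ends(sequence: str, quality: str,
                          min_q: int = 20, offset: int = 33) -> tuple[str, str]:
    """Strip bases with Phred score below min_q from both ends of a read."""
    start, end = 0, len(sequence)
    while start < end and ord(quality[start]) - offset < min_q:
        start += 1
    while end > start and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[start:end], quality[start:end]


def crop(sequence: str, quality: str, max_length: int) -> tuple[str, str]:
    """Keep at most the first max_length bases of a read."""
    return sequence[:max_length], quality[:max_length]


# Hypothetical read: '#' encodes Q2, '5' encodes Q20, 'I' encodes Q40.
trimmed = trim_low_quality_ends("ACGTACGT", "#5IIII5#")
cropped = crop(*trimmed, max_length=4)
print(trimmed, cropped)
```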

4. Removal of Contaminants:

• Identifies and removes reads originating from non-target sources
(e.g., host genomes, bacterial contamination).
• Tools:
o Bowtie2
o Kraken2
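
As a rough illustration of the principle, the sketch below flags a read as a likely contaminant when a large share of its k-mers also occur in a contaminant reference. This is a deliberately simplified stand-in: Bowtie2 works by aligning reads to a reference, and Kraken2 uses a far more elaborate k-mer classification scheme. The function names, the value of k, and the threshold are assumptions chosen for the example.

```python
def kmers(sequence: str, k: int) -> set[str]:
    """All distinct substrings of length k (empty if the sequence is shorter)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def contaminant_fraction(read: str, contaminant_kmers: set[str], k: int) -> float:
    """Share of the read's distinct k-mers that occur in the contaminant set."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0
    return len(read_kmers & contaminant_kmers) / len(read_kmers)


def is_contaminant(read: str, contaminant_kmers: set[str],
                   k: int = 5, threshold: float = 0.5) -> bool:
    """Flag a read when at least `threshold` of its k-mers match the reference."""
    return contaminant_fraction(read, contaminant_kmers, k) >= threshold


# Hypothetical contaminant reference and reads (toy sequences, not real data).
reference = "ACGTACGTTAGCTAGCTAGG"
reference_kmers = kmers(reference, 5)
reads = ["GTACGTTAGC", "TTTTTTTTTT"]
flags = [is_contaminant(r, reference_kmers) for r in reads]
print(flags)
```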

5. De-duplication:

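• PCR duplicates arise from library amplification and should be
removed to prevent bias.
• The tools below work on aligned reads (BAM files), so this step is
usually run after alignment.
• Tools:
o Picard (MarkDuplicates)
o Samtools markdup (replaces the obsolete rmdup)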
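
The basic idea of de-duplication can be illustrated directly on raw reads by keeping only the first occurrence of each exact sequence. Note that this is a simplified stand-in and not what the tools above do: Picard and Samtools identify duplicates from alignment positions rather than from sequence identity. The function name deduplicate_by_sequence is illustrative.

```python
def deduplicate_by_sequence(
    records: list[tuple[str, str, str]]
) -> list[tuple[str, str, str]]:
    """Keep the first (identifier, sequence, quality) record per exact sequence."""
    seen: set[str] = set()
    unique: list[tuple[str, str, str]] = []
    for identifier, sequence, quality in records:
        if sequence in seen:
            continue  # an identical sequence was already kept
        seen.add(sequence)
        unique.append((identifier, sequence, quality))
    return unique


# Hypothetical reads: read3 repeats the sequence of read1.
records = [
    ("read1", "ACGTACGT", "IIIIIIII"),
    ("read2", "TTGACCAA", "IIIIIIII"),
    ("read3", "ACGTACGT", "55555555"),
]
unique_records = deduplicate_by_sequence(records)
print([r[0] for r in unique_records])
```

Keeping the first occurrence is an arbitrary choice made for brevity; one could instead keep the copy with the highest mean quality.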

4. Workflow Summary:

Step                     Purpose                    Tools
1. Quality Check (QC)    Assess raw data quality    FASTQC, MultiQC
2. Adapter Trimming      Remove adapter sequences   Cutadapt, Trimmomatic
3. Quality Filtering     Remove low-quality reads   FASTP, PRINSEQ
4. Read Trimming         Remove low-quality bases   Sickle, Trim Galore!
5. Contaminant Removal   Filter unwanted reads      Bowtie2, Kraken2
6. De-duplication        Remove PCR duplicates      Picard, Samtools

Final Output After Pre-processing:

• Cleaned FASTQ Files: High-quality reads, free from adapters and
contaminants.
• Quality Metrics Report: Documents that the data meets downstream
analysis requirements.

Key Takeaways:

1. FASTQ Files: Store raw sequencing reads and quality scores.
2. Quality Check: Detects sequencing errors and biases using tools like
FASTQC.
3. Pre-processing: Improves data quality by trimming adapters,
filtering low-quality reads, and removing contaminants.
4. Tools: Essential tools include FASTQC, Cutadapt, Trimmomatic,
Bowtie2, and Picard.
5. Next Steps After Pre-processing: Alignment, variant calling,
transcriptome assembly, or metagenomic analysis.
