July 18, 2022

Improving usability of low-coverage, low-quality DNA sequenced reads

By: David Bearden

DNA sequencing involves replicating an unknown strand with nucleotide-specific fluorescent signaling. A machine learning model assigns each nucleotide read a quality score (Phred Q score). Multiple reads for each nucleotide position are then aggregated by quality-score-weighted voting to determine the true underlying nucleotide sequence, a process known as contig alignment. However, Q scores are inherently biased because they account for neither the sequencing machine nor the DNA sample preparation. Using probabilistic modeling and quality score calibration, contig alignment of low-coverage (coverage of less than four) and low-quality reads for mutated sequences of mScarlet (a red fluorescent protein, RFP) was improved and represented in a more practical form to facilitate further analysis.
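
As a rough illustration of quality-score-weighted voting, the sketch below converts a Phred score to an error probability using the standard definition Q = -10 * log10(P_error) and picks a consensus base for one position. The function names and the choice to weight each call by its raw Q score are assumptions made for illustration, not the weighting used in this project.

```python
from collections import defaultdict


def phred_to_error_prob(q):
    """Standard Phred definition: Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)


def weighted_vote(calls):
    """Pick a consensus base for one position.

    calls: list of (base, phred_q) pairs, one per read covering the position.
    Each call votes for its base with a weight equal to its Q score
    (an illustrative assumption; other weightings are possible).
    Returns (consensus_base, {base: total_weight}).
    """
    weights = defaultdict(float)
    for base, q in calls:
        weights[base] += q
    consensus = max(weights, key=weights.get)
    return consensus, dict(weights)


# Three reads cover one position: two weaker "A" calls and one strong "G" call.
consensus, totals = weighted_vote([("A", 30), ("A", 12), ("G", 35)])
print(consensus, totals)
```

With only three reads, a single high-quality dissenting call comes close to overturning the majority, which is why low-coverage positions are sensitive to biased Q scores.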

An ad hoc coding framework was implemented in Python to handle the dataset of 355,104 assembly matrices (mutated reads aligned to the non-mutated mScarlet sequence). Nucleotide-specific error patterns, specifically compositional probability and conditional sequencing error rates, were then extrapolated from high-coverage RFP (coverage of 5 or more) and from the non-mutated blue fluorescent protein (BFP) that was linked to RFP for analysis purposes.
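
One way to estimate conditional sequencing error rates is to count, for each known reference base, how often each base was actually called. The sketch below assumes the input is a list of (reference_base, called_base) pairs taken from positions where the true base is known, such as the non-mutated BFP. The function name and input layout are hypothetical, and the exact conditioning used in this project is not described here.

```python
from collections import Counter, defaultdict


def conditional_error_rates(pairs):
    """Estimate P(called base | reference base) from observed pairs.

    pairs: iterable of (reference_base, called_base) tuples.
    Returns a nested dict: rates[reference_base][called_base] -> probability.
    """
    counts = defaultdict(Counter)
    for ref, called in pairs:
        counts[ref][called] += 1

    rates = {}
    for ref, called_counts in counts.items():
        total = sum(called_counts.values())
        rates[ref] = {base: n / total for base, n in called_counts.items()}
    return rates


# Toy data: reference "A" is read as "G" once in four observations.
pairs = [("A", "A"), ("A", "A"), ("A", "G"), ("A", "A"), ("C", "C")]
rates = conditional_error_rates(pairs)
print(rates["A"])
```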

Further feature engineering extracted useful representations of the data, and quality scores were calibrated based on empirical probability, before the predictors were fed into a logistic regression model.
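
A minimal sketch of empirical quality score calibration follows: group base calls by their reported Q score, measure the observed error rate in each group against a known reference, and convert that rate back to the Phred scale. The function name, the input layout, and the add-one smoothing are assumptions for illustration. The logistic regression step is not shown.

```python
import math
from collections import defaultdict


def calibrate_q_scores(observations):
    """Map each reported Q score to an empirically observed Q score.

    observations: iterable of (reported_q, is_error) pairs, where is_error
    is True when the call disagreed with the known reference base.
    Returns {reported_q: empirical_q}.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for q, is_error in observations:
        totals[q] += 1
        errors[q] += int(is_error)

    table = {}
    for q in totals:
        # Add-one smoothing (an assumption) keeps the error rate above zero
        # so that log10 is always defined.
        p_err = (errors[q] + 1) / (totals[q] + 2)
        table[q] = -10 * math.log10(p_err)
    return table


# Toy data: 98 calls reported at Q30, of which 9 were wrong.
observations = [(30, True)] * 9 + [(30, False)] * 89
print(calibrate_q_scores(observations))
```

In this toy data the calls reported at Q30 behave like roughly Q10 calls, the kind of gap between reported and observed quality that calibration is meant to correct.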

Previously inconclusive mutated sections of low-coverage reads, which comprised 70% of the dataset’s assembly matrices, were either resolved or highlighted as inconclusive, facilitating further analysis of the performance changes resulting from mutated sections of mScarlet.
