Four different bases can be strung together in a mind-boggling number of combinations. Each group of three bases (a codon) encodes one of twenty amino acids, which can themselves be combined to form various proteins. My project indirectly supports my lab, Neurotoolbox, in its effort to improve the fluorescent proteins used to resolve neurons in the brain spatially and temporally. The lab uses two notable types of proteins. One can be used to activate a neuron by shining light of a specific wavelength on it. The other fluoresces when its neuron is activated. Both of these proteins have numerous applications in neuroscience and in identifying nerve tracts.
My project within this lab is to support the effort to improve the biological capabilities and optimize the performance of these proteins. My principal investigator, Yiyang Gong, provided me with a MATLAB dataset containing all the reads of a mutation-induced sequence of one of the aforementioned fluorescent proteins. There are over 350,000 different mutated sequences, each with its own coverage (the number of reads, or "voters") and quality scores. The original sequence is known; the difficulty lies in distinguishing true mutations from spurious ones. About 55% of the dataset has very low coverage (1 or 2 reads), 15% has moderate coverage (3 reads), and the remaining 30% has high coverage (4 to 20 reads).
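To make the coverage breakdown concrete, here is a minimal Python sketch that sorts read counts into the three tiers described above and reports the fraction in each. The function names and the plain list of coverage values are assumptions for illustration; the actual MATLAB dataset may be laid out quite differently.

```python
from collections import Counter

TIERS = ("low", "moderate", "high")


def coverage_tier(n_reads: int) -> str:
    """Map a read count to the tier used in the text (hypothetical helper)."""
    if n_reads < 1:
        raise ValueError("coverage must be at least 1 read")
    if n_reads <= 2:
        return "low"        # 1 or 2 reads
    if n_reads == 3:
        return "moderate"   # exactly 3 reads
    return "high"           # 4 or more reads


def tier_fractions(coverages):
    """Return the fraction of sequences that fall into each coverage tier."""
    counts = Counter(coverage_tier(c) for c in coverages)
    total = sum(counts.values())
    return {tier: counts.get(tier, 0) / total for tier in TIERS}


# Toy example with made-up coverage values, not the real dataset.
print(tier_fractions([1, 2, 3, 4]))
```

On the real data, a summary like this would be one way to check the tier percentages quoted above.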
When there are few voters and an inconclusive quality score, which mutation is the true one? What if two reads both have a perfect quality score yet disagree? These are the questions I have to answer, particularly when coverage is moderate to low (3 reads or fewer), which is the case for 70% of the dataset. Using Python data analysis, probabilistic modeling, and machine learning, I need to clean the dataset and create a library that associates each barcode (a tag attached to the end of each mutated sequence) with its respective single-nucleotide polymorphisms (SNPs). The mutations will later be processed to determine which sets of mutations improve the performance of the fluorescent protein, which will be my next project once this one is complete.
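One possible way to frame the "few voters" question probabilistically is sketched below. It is not the method actually used in the project. It assumes the quality scores are Phred-scaled (error probability 10^(-Q/10)), that reads are independent, and that an erroneous read is equally likely to show any of the three other bases. All function names are hypothetical.

```python
import math

BASES = "ACGT"


def phred_to_error(q: float) -> float:
    """Convert a Phred-scaled quality score to an error probability.

    The result is clamped so that the logarithms below stay finite; 0.75 is
    the error rate of a completely uninformative read.
    """
    p = 10 ** (-q / 10)
    return min(max(p, 1e-12), 0.75)


def base_posteriors(calls, quals, prior=None):
    """Posterior probability of each base at one position, given the reads.

    calls: the base reported by each read at this position, e.g. "AAG"
    quals: one (assumed Phred-scaled) quality score per read
    prior: optional dict of strictly positive prior probabilities per base,
           e.g. weighted toward the known original sequence
    """
    if prior is None:
        prior = {b: 0.25 for b in BASES}
    log_post = {}
    for b in BASES:
        lp = math.log(prior[b])
        for call, q in zip(calls, quals):
            p = phred_to_error(q)
            # The read agrees with b: it was read correctly.
            # Otherwise: it is an error landing on one of the 3 other bases.
            lp += math.log(1 - p) if call == b else math.log(p / 3)
        log_post[b] = lp
    # Normalize in log space for numerical stability.
    top = max(log_post.values())
    weights = {b: math.exp(lp - top) for b, lp in log_post.items()}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}


# Two high-quality reads that disagree: the evidence alone cannot decide.
tie = base_posteriors("AG", [40, 40])

# A prior that favors the known original base (here assumed to be "A").
ref_prior = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
resolved = base_posteriors("AG", [40, 40], prior=ref_prior)
```

Under these assumptions, two equally confident reads that disagree split the posterior evenly between their two bases, so the reads alone cannot settle the call. A prior weighted toward the known original sequence is one way to break the tie. The 0.97 weight above is an arbitrary illustrative value.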