Deep learning of tissue-specific gene expression from DNA sequences
Polanc, Uroš (Author), Curk, Tomaž (Mentor), Zrimec, Jan (Comentor)

PDF - Presentation file (8,50 MB)
MD5: 035B570BAC7D4EA48E81EA111C5E1E91

Abstract
Predicting tissue-specific gene expression is a crucial task in understanding the complex regulatory mechanisms governing gene expression. In this research, we employed three distinct models, two convolutional neural networks (CNNs) and DNABERT, to predict tissue-specific gene expression. For the genome, we opted for the publicly available Arabidopsis thaliana genome. Our approach involved systematically testing various methodologies, encompassing diverse transcript filtering techniques and an array of input sequences. The integration of multiple models and comprehensive input variations represents a significant step towards enhancing our understanding of tissue-specific gene expression prediction and furthering advancements in bioinformatics and computational biology. Our findings demonstrate the significance of both sequence data and additional CDS features in predicting gene expression. Combining these features showed only a marginal performance increase. DNABERT struggled with sequence-only inputs but performed comparably to the CNN models when augmented with CDS features. The Washburn model exhibited the most pronounced tissue-specific performance (R-squared ≈ 0.40), followed by DNABERT (R-squared ≈ 0.34) and Zrimec (R-squared ≈ 0.31). The models faced challenges in predicting both lowly and highly expressed genes but excelled in predicting mid-expressed genes. Additionally, predicting tissue-specific expression closely resembled predicting transcript mean expression, showing a consistent performance ordering across tissues. We analyzed kernel activations to showcase the model's pattern recognition skills. We cross-referenced these patterns with databases, finding around 650 matches. We used sequence occlusion to pinpoint important areas within the sequences. Our results highlighted the importance of the promoter near the TSS and the 5'UTR near the CDS in shaping model performance, especially with shorter occlusions. Additionally, all genomic regions except the terminator proved relevant when their entire regions were occluded. In conclusion, we have demonstrated the model's capability to predict tissue-specific gene expression and underscored the significance of non-coding genomic regions. While research in this field is ongoing, we hope that our findings contribute to the understanding of tissue-specific gene expression.
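
The sequence occlusion analysis mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the function names `occlusion_scores` and `toy_predict`, the window size, and the use of `N` as the masking symbol are all assumptions made for illustration, and the toy predictor merely stands in for a trained CNN or DNABERT model.

```python
from typing import Callable, List


def occlusion_scores(
    seq: str,
    predict: Callable[[str], float],
    window: int = 5,
    step: int = 1,
    mask: str = "N",
) -> List[float]:
    """For each window start, return the drop in the prediction
    (baseline minus occluded) when that window is masked out."""
    baseline = predict(seq)
    scores = []
    for start in range(0, len(seq) - window + 1, step):
        # Replace the window with the masking symbol, keeping the length fixed.
        occluded = seq[:start] + mask * window + seq[start + window:]
        scores.append(baseline - predict(occluded))
    return scores


def toy_predict(seq: str) -> float:
    """Hypothetical stand-in for a trained expression model:
    it simply counts occurrences of the motif TATA."""
    return float(seq.count("TATA"))


# Windows that overlap the motif produce a positive score.
example_scores = occlusion_scores("GGGGTATAGGGG", toy_predict, window=4)
print(example_scores)
```

Positions with large scores are those whose masking changes the prediction most. In this sketch, shorter windows localize an important region more precisely, while longer windows test whether a whole region matters at all.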

Language: English
Keywords: bioinformatics, convolutional neural network, DNA, DNABERT, gene expression, machine learning, sequence motifs, mRNA, predictive models, regulatory mechanisms, tissue-specific gene expression, tissue-specificity
Work type: Master's thesis/paper
Typology: 2.09 - Master's Thesis
Organization: FRI - Faculty of Computer and Information Science
Year: 2023
PID: 20.500.12556/RUL-152353
COBISS.SI-ID: 177600003
Publication date in RUL: 22.11.2023
Views: 906
Downloads: 69

Secondary language

Language: Slovenian
Title: Globoko učenje tkivno specifičnega izražanja genov iz zaporedij DNA
Abstract (translated from Slovenian):
Predicting tissue-specific gene expression is key to understanding the complex regulatory mechanisms that govern gene expression. In this work, we used three different models to explore the prediction of tissue-specific gene expression: two convolutional neural networks (CNNs) and DNABERT. For the genome, we chose the publicly available Arabidopsis thaliana. Our procedure comprised systematic testing of various methods, covering different transcript filtering techniques and diverse input sequences. The integration of multiple models and input variations represents an important step towards a better understanding of tissue-specific gene expression prediction and contributes to progress in bioinformatics and computational biology. Our results point to the importance of both the input sequences and additional features of the coding region (CDS) in predicting gene expression. Combining these inputs yielded only a moderate improvement in performance. DNABERT struggled with sequence-only input but achieved results comparable to the CNN models when CDS features were added. The Washburn model showed the most pronounced tissue-specific performance (R-squared approximately 0.40), followed by the DNABERT model (R-squared approximately 0.34) and the Zrimec model (R-squared approximately 0.31). The models faced challenges in predicting lowly and highly expressed genes but performed well on moderately expressed genes. The score for predicting tissue-specific gene expression is similar to the score for predicting the mean value across all samples of a transcript. Additionally, we showed that both CNN models ranked the tissues in a comparable order. To demonstrate the model's pattern recognition abilities, we analyzed the activations of the convolutional kernels. We compared these patterns with references in databases and found approximately 650 matches. To determine the important regions within the sequences more precisely, we used sequence occlusion. Our results emphasized the importance of the promoter near the TSS and the 5'UTR near the CDS in shaping model performance, especially with shorter occlusions. When entire regions were occluded, all genomic regions except the terminator proved important. We have thus shown that the model is able to predict tissue-specific gene expression, and we have emphasized the importance of non-coding genomic regions. Although research in this area is ongoing, we hope that our findings contribute to advancing the understanding of tissue-specific gene expression.

Keywords (translated from Slovenian): bioinformatics, DNA, DNABERT, gene expression, convolutional neural network, mRNA, predictive models, regulatory mechanisms, sequence motifs, machine learning, tissue specificity, tissue-specific gene expression