Sestavljanje genoma iz odčitkov zaporedja

Rezar, Matija

Sestavljanje genoma iz odčitkov zaporedja
ID Rezar, Matija (Author), ID Brodnik, Andrej (Mentor) More about this mentor... This link opens in a new window

PDF - Presentation file, Download (627,89 KB)
MD5: 1EB2A3AD527159D711331FC155924676
PID: 20.500.12556/rul/94f8282b-f72e-41da-b980-399b9ac0bc91

Abstract

V grobem lahko postopek sestavljanja genoma opišemo kot iskanje Eulerjevih obhodov v de Bruijnovih grafih. V nalogi povzamemo teorijo podatkovnih struktur za predstavitev (de Bruijnovih) grafov in opišemo nekaj implementacij. Usmerjen graf G(V,E) lahko predstavimo kot množico urejenih parov vi → vj ∈ E. De Buijnovi grafi so definirani tako, da je mogopče sosede vozlišča izračunati iz oznake vozlišča, kar pomeni, da se lahko po grafu sprehajamo s pomočjo poizvedb o pripadnosti možici sosednosti. Množica povezav je lahko shranjena v slovarju. Slovar je lahko determinističen, kot sta na primer drevo ali kazalo FM, ali osnovan na verjetnosti, kot sta na primer razpršilna tabela ali Bloomov filter. V delu predstavljamo kBWT, novo deterministično podatkovno strukturo za predstavitev de Bruijnovih grafov, ki uporablja skoraj optimalnih n · σ + o(n) bitov pomnilnika, kjer je n število k-merov v grafu in σ velikost abecede. Podatke o sosednjosti vozlišča lahko v njej najdemo v Θ(σ · k) časa. Strukturo primerjamo z obstoječo podatkovno strukturo ogrodja GATB, ki je osnovana na Bloomovih filtrih. Preizkusi kažejo, da je deterministični kBWT pričakovano počasnejši od podatkovne strukture ogrodja GATB. Pokazalo pa se je, da kBWT bolj učinkovito izrablja predpomnilnik, vendar to ni nadomestilo porabe procesorskega časa.

Language:	English
Keywords:	genom, sestavljanje genoma, teorija grafov, de Bruijnov graf, Eulerjev obhod
Work type:	Master's thesis/paper
Organization:	FRI - Faculty of Computer and Information Science
Year:	2016
PID:	20.500.12556/RUL-91246
Publication date in RUL:	27.03.2017
Views:	2111
Downloads:	453
Metadata:
:	Copy citation
Share:

Secondary language

Abstract:
Language:	Slovenian
Title:	Genome assembly from sequence reads
Assembling genomes can be roughly described as finding Eulerian tours in de Bruijn graphs. We present the theory behind (de Bruijn) graph data structures and describe some of the implementations. A directed graph G(V,E) can be represented as a set of its edges in the form of ordered pairs vi → vj ∈ E. De Bruijn graphs are defined in a way that allows all possible neighbors of a node to be calculated from the given node’s label, which means that, given the adjacency set, we can navigate the graph by testing set membership. The edge set can be stored as a dictionary. The dictionary can be either a deterministic data structure, like a tree or an FM-index, or a probabilistic data structure, like a Bloom filter. In this thesis we present kBWT, a new space efficient deterministic data structure for storing a de Bruijn graph, which uses near-optimal n · σ + o(n) bits of memory, where n is the number of k-grams in the graph and σ is the size of the alphabet. It can retrieve neighborhood information for a given node in Θ(σ · k) time. We also compare it to an existing data structure found in the GATB framework, which is based on Bloom filters and therefore probabilistic. Benchmarks of the deterministic kBWT show it is slower in practice, compared to GATB’s data structure. Testing showed kBWT had better cache efficiency, which did not make up for the number of processor cycles used for executing the algorithm.
Keywords:	genome, genome assembly, graph theory, de Bruijn graph, Eulerian cycle

Similar works from RUL:
Similar works from other Slovenian collections:

Secondary language

Similar documents