University of Connecticut

Events Calendar

Doctoral Dissertation Oral Defense of Marius Nicolae

Thursday, March 31, 2016
2:00pm – 3:00pm

Storrs Campus
Family Studies Building (FSB), Room 18

Thesis: "Data Structures and Algorithms for the Identification of Biological Patterns"

Field of Study: Computer Science and Engineering

Abstract: This thesis studies the following problems: 1. Planted Motif Search. Discovering patterns in biological sequences is a crucial process that has resulted in the determination of open reading frames, gene promoter elements, intron/exon splicing sites, SH RNAs, etc. We study the (l, d) motif search problem or Planted Motif Search (PMS). PMS receives as input n strings and two integers l and d. It returns all sequences M of length l that occur in each input string, where each occurrence differ from M in at most d positions. Another formulation is quorum PMS (qPMS), where M appears in at least q% of the strings. We developed qPMS9, an efficient parallel exact qPMS algorithm for DNA and protein datasets.

2. Suffix Array Construction. The suffix array is a data structure that finds numerous applications in string processing problems for both linguistic texts and biological data. The suffix array consists of the sorted suffixes of a string. There are several linear time suffix array construction algorithms known in the literature. However, one of the fastest algorithms in practice has a worst case run time of O(n 2 ). We developed an efficient algorithm called RadixSA that has a worst case run time of O(n log n) and is one of the fastest algorithms to date. RadixSA introduces an idea that may find independent applications as a speedup technique for other algorithms.

3. Pattern Matching with Mismatches. We consider several variants of the pattern matching with mismatches problem. Given a text T = t 1 t 2 · · · t n and a pattern P = p 1 p 2 · · · p m , we investigate the following problems: 1) Pattern matching with mismatches: for every alignment i, 1 ≤ i ≤ n − m + 1 output the distance between P and t i t i+1 · · · t i+m−1 , and 2) Pattern matching with k mismatches: output those alignments i where the distance is at most k. The distance metric used is the Hamming distance. Variants of these problems allow for wild cards in the text or the pattern. For these problems we offer novel deterministic, randomized and approximation algorithms. Source code relevant to these results is available at https://github.com/mariusmni/.

Contact:

Sanguthevar Rajasekaran, rajasek@engr.uconn.edu

Graduate School - Theses and Dissertation Defense (primary), UConn Master Calendar

Control Panel