RegExBiS
DFG | GZ: RE 1712/13-1
● Projekt-Nr.: 541107382 ● AOBJ: 709741
RegEx Querying for Biological Sequences
DNA sequencing, the process of determining the order of nucleotides in the DNA, is a cornerstone of such diverse fields as molecular biology, metagenomics, virology, and proteomics. It provides an angle to identify insertion and deletions in genes, their links with phenotypes and diseases, and potential modes of action for drug development.
Driven by these use cases, sequencing technology has seen enormous improvements in the last decade: A single machine for next-generation sequencing (NGS) collects 400 billion base pairs per day at a low cost, which leads to huge nucleotide and protein databases. The analysis of this data is enabled by queries that define a specific order of nucleotides or amino acids to be identified in a large collection of sequences stored in a sequence database.
Traditional analysis techniques employ queries derived from sequencing reads or known protein sequences, in which the queries are formulated as plain sequences of nucleotides or amino acids. To answer such sequence queries efficiently, state-of-the-art strategies for approximate search on massive databases avoid exploring the whole database. Instead, they filter sequences based on compact representations, typically using k-mers. This way, the search is limited to a small fraction of the database, while the error introduced through the approximate search can be bound formally.
This project aims to tackle the complex challenge of efficiently searching large sequence databases with more sophisticated queries. Applications such as querying for members of protein families like in the PROSITE database and finding transcription factor binding sites require searching for rich patterns of protein or nucleotide sequences, which are described by regular expression (regex) queries.
While implementations for regex querying are available, they are largely of a heuristic nature, like PHI-Blast, and take seconds for one query on a small database which does not scale well. The Reg Ex Querying of Biological Sequences project sets out to lay the foundations for efficient regex querying in large-scale sequence databases by answering the following research questions: * Which sequence representations enable effective filtering of a database for regex queries? * How to give formal error bounds for approximate searches for regex queries? * How to maintain the structures used for the search in the presence of an evolving database and query workload?
To address these questions, we adopt an interdisciplinary approach. We draw on recent advances in filter structures for sequence databases in bioinformatics, incorporate results on the decomposition of regex queries as developed for distributed complex event recognition, and integrate ideas from the database community on soft clustering. As such, the scientific results of the project can be expected to resonate also in the broader areas of omics data management and general-purpose data engineering.