SFB 1404 FONDA
Former Members:
- Evelin Aasna (Student Assistant)
- Tristan Aretz (Student Assistant)
- Felix Droop (Student Assistant)
- Manuel Zschäbitz (Student Assistant)
DFG
GZ: SFB 1404/1 | SFB 1404/2 2024
● Projekt-Nr.: 414984028 ● AOBJ: 675303
Foundations for Large-Scale Scientific Data Analysis Workflows
Essentially all scientific disciplines are generating an ever-increasing amount of data. To derive scientific discoveries, these data sets are analyzed by complex data analysis workflows (DAWs), which are series of discrete analysis programs arranged in (often non-linear) pipelines. Because they usually deal with very large data sets, these DAWs must be executed on distributed and/or parallel computational infrastructures, ranging from multi-core servers and mid-sized clusters to high-performance computing (HPC) infrastructures. Traditionally, DAWs are optimized for speed, which leads to solutions that are hard to reproduce and share, and that are tightly bound to exactly one type of input. They are optimized for exactly the computational infrastructure available at the time of development, which requires scientists to grapple with heterogeneous low-level programming concepts.
The CRC FONDA – “Foundations of workflows for large-scale scientific data analysis” – will investigate methods for increasing productivity in the development, execution, and maintenance of DAWs for large scientific data sets. Our long-term goal is to develop methods and tools that achieve substantial reductions in development time and development cost of DAWs.
DAW runtime in distributed infrastructures is often dominated by the time required for data access and data exchange (DADE), which in turn depends on the data being analyzed, the tasks being executed, and the infrastructure on which a DAW runs. Changes in any of these aspects can quickly lead to deteriorating runtimes when a DAW is not adapted properly. Subproject A2 investigates methods that can adapt a given DAW to new input data or a different infrastructure with the goal of keeping runtime low.
A2 is an interdisciplinary project; it will develop its research using DAWs for large-scale genome data analysis, which are typically I/O-heavy and thus particularly depend on proper DADE operations. It will cooperate closely with subproject A6 by also testing its newly developed methods on DAWs for finding structural genomic variations, and it will use the hardware abstractions developed in B1. It will be carried out by Prof. Reinert, an expert in data structures and algorithms for genomic data, and Prof. Leser, an expert in optimization of UDF-heavy DAWs.
FU Researchers: Prof. Knut Reinert (Principal Investigator), Dr. Somayeh Mohammadi (Scientist)
The FONDA I project constitutes a collaborative effort of a consortium composed of:
HUB: Humboldt-Universität zu Berlin (Speaker)
Charité: Charité - Universitätsmedizin Berlin, Berlin Institute of Health (BIH), Bernstein Center for Computational Neuroscience (BCCN)
FUB: Freie Universität Berlin
HHI: Fraunhofer Heinrich-Hertz-Institut, Berlin
MDC: Max Delbrück Center for Molecular Medicine, Berlin
TUB: Technische Universität Berlin
UO: Universität Osnabrück
UP/HPI: Universität Potsdam, Hasso-Plattner-Institut for Digital Engineering
ZIB: Zuse-Institut Berlin
DAWs in genomics can benefit enormously from workflow optimization, as very often multiple ways exist to solve the same scientific problem while exhibiting different properties in terms of resource usage. Genomic DAWs can be rewritten such that results for the same input remain provably identical, for instance by replacing tools for tasks having an exact solution or by distributing computation for embarrassingly parallel problems, or such that the same underlying problem is solved, but the precise solutions vary, for instance by replacing tools for tasks allowing only a heuristic solution. The former is the classical setting in plan optimization for database queries, while the latter option emerges from the fact that many problems in bioinformatics cannot be solved precisely due to the high complexity of the problem in combination with the sizes of typical inputs (e.g., genome assembly, read mapping) or because solutions are probabilistic in nature (e.g., variant calling, molecule identification in mass spectrometry). In the first phase of FONDA, we optimized genomics workflows regarding their data access patterns and thus efficacy on a changing cluster infrastructure in two application domains (RNA-Seq and metagenomics). Our new methods compile physical DAWs from an abstract description optimized for a target infrastructure based on characteristics of the DAW and of the infrastructures; examples of operations the compiler can choose from are selections among equivalent tools or partitioning of data following a scatter–gather pattern. The goal of our algorithms is the automatic reduction of runtime (wall clock) in the face of changing infrastructures, thus supporting FONDA’s goals of improved portability and adaptability.
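The scatter–gather partitioning mentioned above can be illustrated with a minimal Python sketch. It is not the FONDA compiler: the helper names (`scatter`, `count_gc`, `scatter_gather`) and the toy per-chunk task are assumptions made purely for illustration. The input is split into chunks, each chunk is processed independently, and the partial results are merged so that the outcome is identical to a serial run.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter(items, n_parts):
    """Split items into at most n_parts contiguous chunks of near-equal size."""
    n_parts = max(1, min(n_parts, len(items)))
    size, rest = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < rest else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_gc(reads):
    """Hypothetical per-chunk task: count G and C bases in a list of reads."""
    return sum(read.count("G") + read.count("C") for read in reads)


def scatter_gather(reads, n_parts=4):
    """Run count_gc on each chunk in parallel, then merge the partial counts."""
    chunks = scatter(reads, n_parts)
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        partial_counts = list(pool.map(count_gc, chunks))
    return sum(partial_counts)  # gather step


# Toy input, invented for this example.
reads = ["ACGT", "GGCC", "ATAT", "CGCG", "TTTT"]
total_gc = scatter_gather(reads, n_parts=3)
```

Because counting is an embarrassingly parallel task with an exact solution, the partitioned result equals the serial one, which corresponds to the first kind of rewriting described above (results remain identical).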
Recent years, however, have shown that science must put much more focus on another optimization goal: reduction of energy consumption. Overall reduction in energy consumption is pivotal to reducing carbon emissions (a closely related yet not identical goal), and thus to fighting climate change. It also helps to reduce the monetary costs of large-scale computations, which are becoming more and more critical for research organizations and universities. However, energy-aware workflow optimization is a topic that has received little attention so far, especially in bioinformatics. We are unaware of any efforts in this field, despite the high energy consumption it causes; globally, probably tens of thousands of ‘omics workflows are executed at every moment in time over large data sets generated from high-throughput technologies such as sequencing.
To fill this gap, our subproject will lay the groundwork for energy-reducing DAW optimization in bioinformatics. We will first provide an extensive characterization of the energy-consumption behavior of common tools and workflows and then devise a new instrumentation module for workflow engines to enable them to report energy-consumption-related metrics at the task, node, and workflow levels. Based on this instrumentation, we will develop methods that estimate the energy consumption of a given DAW for a given input on a given infrastructure to implement energy-aware reporting and monitoring. Next, we will build on the algorithms developed in the first phase of FONDA to develop novel methods for energy-aware DAW optimization, also embracing techniques from multi-objective optimization to explore the trade-off between runtime and energy consumption. The respective properties of different, automatically selected rewritings of a workflow will be graphically presented to the workflow user to allow workflow steering based on individual preferences or following institutional goals. Compared to phase I, we extend the scope of considered problems by also including variant calling in DNA and RNA and quantitative mass-spectrometry-based proteomics. A2 contributes to improving the environmental sustainability and adaptability of DAW executions.
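The trade-off between runtime and energy consumption can be made concrete with a small Python sketch of Pareto filtering, one common technique from multi-objective optimization. It is not the method developed in A2; the `Rewriting` type, the candidate names, and all numbers are invented for illustration. A rewriting is kept only if no other rewriting is at least as good in both objectives and strictly better in one.

```python
from typing import NamedTuple


class Rewriting(NamedTuple):
    """Hypothetical summary of one workflow rewriting (values are made up)."""
    name: str
    runtime_s: float   # estimated wall-clock runtime in seconds
    energy_kj: float   # estimated energy consumption in kilojoules


def dominates(a, b):
    """True if a is no worse than b in both objectives and better in one."""
    no_worse = a.runtime_s <= b.runtime_s and a.energy_kj <= b.energy_kj
    better = a.runtime_s < b.runtime_s or a.energy_kj < b.energy_kj
    return no_worse and better


def pareto_front(candidates):
    """Return the candidates that are not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Illustrative candidates only; not measurements from any real workflow.
candidates = [
    Rewriting("baseline", 100.0, 50.0),
    Rewriting("scatter-8", 40.0, 70.0),
    Rewriting("alt-tool", 90.0, 35.0),
    Rewriting("scatter-2", 110.0, 60.0),
]
front = pareto_front(candidates)
```

In this toy data set, only the fastest and the most energy-efficient candidates remain on the front; a user could then choose between them according to individual preferences or institutional goals, as described above.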
FU Researchers: Prof. Knut Reinert (Principal Investigator), Dr. Somayeh Mohammadi (Scientist)
T9: FONDA Reproducibility Badging
In the open research community, concerns surrounding computational reproducibility have become increasingly prominent, emphasizing the critical importance of transparency and replicability of digital artefacts produced in scientific projects. These concerns are particularly heightened when it comes to reproducing complex data analysis workflows (DAWs), especially in the context of large-scale, data-intensive DAWs executed across multiple centers where data access is decentralized. Here a multitude of variables, including the need for a clear specification of the parametric space of executed tasks, navigating privacy regulations in a multi-site setting, and managing resource consumption, including CPUs and accelerators, memory, I/O, and network access, collectively influence the reproducibility of research processes.
The primary research focus of our CRC in phase II is on enhancing the sustainability, usability, and multi-site capabilities of DAWs and DAW engines. A critical goal in achieving these three key features is the validation of the modifications made to existing DAW components by independent auditors. These auditors will carry out internal reproducibility tests for research artefacts that have been submitted for publication in a conference or journal in order to ensure their reproducibility. A badging system specially developed for the FONDA project will be used for this purpose.
This work holds significant importance, not only in the context of publishing novel computational methods developed within our CRC but also in assisting DAW engineers across various subprojects in optimizing resource allocation in terms of time and finances. Furthermore, it provides a practical validation of a specific DAW conceptualization in the early stages of DAW development. It allows auditors to assess whether the required artefacts are fully specified, facilitating replication by non-domain experts. In contrast to T6, the primary goal of our team is not to create an easy-to-install reference stack, but to reproduce experiments in isolation in support of to-be-published results.
On this basis, we aim to develop a comprehensive badging system that effectively characterizes different levels of DAW reproducibility. The badges will provide clear distinctions among varying degrees of reproducibility by taking the distributed and infrastructure-dependent execution of data-intensive DAWs into account. Subsequently, we will focus on the practical application of this system by systematically reproducing research findings of DAWs or individual DAW components developed within this CRC. As part of this effort, we will assign previously defined badges based on the reproducibility of these research outcomes. This comprehensive assessment will involve repeating experiments using the same data within the same execution infrastructure, employing different datasets within the same environment, and testing different datasets across diverse execution infrastructures. Lastly, guided by the insights gained from these reproducibility efforts, we aim to implement a prototype of a reproducibility service. This service shall automate the execution of reproducibility tests as much as possible, reducing the overhead of conducting such tests and thereby increasing the likelihood that researchers are willing to adopt best practices to develop fully reproducible DAWs, ultimately fostering more robust and transparent data analysis workflows.
Project Leader: Prof. Knut Reinert
The FONDA II project constitutes a collaborative effort of a consortium composed of:
HUB: Humboldt-Universität zu Berlin (Speaker) - Institute for Computer Science, Institute for Physics, Institute for Geography
BAM: Bundesanstalt für Materialforschung und -prüfung, Berlin
Charité: Charité - Universitätsmedizin Berlin
FUB: Freie Universität Berlin - Institute for Bioinformatics
GFZ: Deutsches Geoforschungszentrum Potsdam
HPI: Hasso-Plattner Institute at the Universität Potsdam
MDC: Max Delbrück Center for Molecular Medicine, Berlin
TUB: Technische Universität Berlin - Faculty for Electrical Engineering and Computer Science
TUD: Technische Universität Darmstadt - Faculty for Electrical Engineering and Information Technology
UP: Universität Potsdam - Faculty for Informatics and Computational Science, Faculty for Digital Engineering
ZIB: Zuse-Institut Berlin