A MapReduce Input Format for Analyzing Big High-Energy Physics Data Stored in ROOT Framework Files


Huge scientific data sets, such as the petabytes of data generated by the Large Hadron Collider (LHC) experiments at CERN, are nowadays analyzed by grid computing infrastructures using a hierarchical filtering approach to reduce the amount of data. In practice, this means that an individual scientist has no access to the underlying raw data and, furthermore, the accessible data is often outdated, as filtering and distribution of data only take place every few months. A viable alternative for analyzing huge scientific data sets may be cloud computing, which promises to make a “private computing grid” available to everyone via the Internet. Together with Google’s MapReduce paradigm for efficient processing of huge data sets, it is a promising candidate for scientific computation on a large scale. This thesis investigates the applicability of the MapReduce paradigm, in particular as implemented by Apache Hadoop, for analyzing LHC data. We modify a typical LHC data analysis task so that it can be executed within the Hadoop framework. The main challenge is to split the binary input data, which is stored in ROOT data files, so that calculations are performed efficiently at those locations where the data is physically stored. For Hadoop, this is achieved by developing a ROOT-specific input format. Services of the Amazon Elastic Compute Cloud (EC2) are used to deploy large compute clusters to evaluate the solution and to explore the applicability of the cloud computing paradigm to LHC data analysis. We compare the performance of our solution with a parallelization of the analysis using the PROOF framework, a standard tool specialized in parallelized LHC data analysis. Our results show that the Hadoop-based solution can compete with the performance of PROOF. Additionally, we demonstrate that it scales well on clusters built from several hundred compute nodes inside the EC2 cloud.
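The splitting problem named in the abstract can be illustrated with a minimal sketch. It is not the input format developed in the thesis (which targets Hadoop); it only shows the general idea of grouping indivisible byte ranges of a binary file into splits that each sit mostly inside one storage block, so that a task can be scheduled on a node holding that block. The cluster list, the `Split` type and the function names are hypothetical, and how the byte ranges would be obtained from a ROOT file is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Split:
    start: int   # byte offset of the first cluster in the split
    length: int  # total number of bytes covered by the split
    block: int   # index of the storage block holding most of those bytes


def dominant_block(start: int, length: int, block_size: int) -> int:
    """Return the index of the block that holds most bytes of [start, start + length)."""
    end = start + length
    first = start // block_size
    last = (end - 1) // block_size
    best, best_bytes = first, -1
    for b in range(first, last + 1):
        overlap = min(end, (b + 1) * block_size) - max(start, b * block_size)
        if overlap > best_bytes:  # ties go to the earlier block
            best, best_bytes = b, overlap
    return best


def make_splits(clusters: List[Tuple[int, int]], block_size: int) -> List[Split]:
    """Merge contiguous clusters (offset, length) that share a dominant block.

    Assumption: clusters are sorted by offset and must not be cut in half,
    because a partial cluster could not be decoded on its own.
    """
    splits: List[Split] = []
    for start, length in clusters:
        block = dominant_block(start, length, block_size)
        prev = splits[-1] if splits else None
        if prev and prev.block == block and prev.start + prev.length == start:
            prev.length += length  # extend the current split
        else:
            splits.append(Split(start, length, block))
    return splits


# Hypothetical layout: four clusters in a file stored in blocks of 100 bytes.
example = make_splits([(0, 40), (40, 50), (90, 50), (140, 20)], block_size=100)
```

In this example the third cluster straddles the block boundary at byte 100 but lies mostly in the second block, so it starts a new split that is assigned to that block. A real input format would additionally map each block index to the hosts that store it.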
Document Type: 
Master's Theses
Göttingen, Germany
Institute of Computer Science, Georg-August-Universität Göttingen
2020 © Software Engineering For Distributed Systems Group
