	SECTION I.
Introduction

The modern detectors used in high energy physics (HEP) experiments are complex instruments designed to register collisions of elementary particles at extremely high energies. Only a small fraction of such collisions results in interesting, new phenomena. Therefore, in order to maximize the probability of a discovery in particle physics, a collision rate in the MHz range is needed. The data that correspond to a single collision of particles, referred to as an event, are acquired from millions of readout channels. Several stages of filtering are performed in order to select only the interesting events, which are then sent to persistent storage. The data acquisition systems used in HEP experiments [1]–[6] acquire event fragments from numerous sources after a first selection step that is realized entirely in dedicated hardware. Further filtering steps are implemented in software running on a set of computing farms. In the very first software-selection step, the data are usually distributed statically between the filtering nodes because of the still high event rate (on the order of 100 kHz). In systems with only one stage of software filtering, this is also the final stage, where the event reconstruction has to be done; in this case the static distribution strongly determines the design of the system. The processing power of the participating computing nodes and farms has to be easily measurable, so that the distribution scheme can be prepared precisely. Consequently, it is difficult to introduce heterogeneity into this group of data acquisition systems. Moreover, static data distribution decreases fault tolerance and introduces additional single points of failure. The main goal of our research is to increase the system's overall fault tolerance through dynamic load balancing. The proposed method aims to balance the load in heterogeneous systems, as well as in homogeneous systems where the imbalance may be caused by faults.
Furthermore, our studies include developing a scalable load balancing protocol along with a distributed asynchronous load assignment policy.
A. Case study

As a case study we consider the Data Acquisition (DAQ) system of the Compact Muon Solenoid (CMS) experiment at CERN's new Large Hadron Collider, shown in Figure 1. CMS is a multi-purpose detector for studying proton-proton and heavy-ion collisions at the TeV scale [1]. CMS is designed to collect data at the LHC bunch crossing frequency of 40 MHz. The first-level trigger pre-selects events with interesting signatures, reducing the incoming data rate to a maximum of 100 kHz. The DAQ system acquires event fragments from about 500 sources and combines them into full events. Each data source delivers event fragments of 2 kB average size at a rate of 100 kHz. Event fragments are transported by a non-blocking network (based on Myrinet technology) to the surface and statically distributed (usually in round-robin fashion) amongst several autonomous processing units called DAQ Slices. A DAQ Slice is a sub-farm organized around a Terascale Force10 switch, where parallelization is achieved through the SPMD (Single Program, Multiple Data) technique. In the first event-building stage, event fragments are received by a DAQ Slice through a distributed readout consisting of computing nodes called Readout Units (RU), and assembled into super-fragments inside the RUs. Subsequently, in the second stage, in each of the DAQ Slices an Event Manager (EVM) node assigns super-fragments to Builder Units (BU) that construct the whole event. The complete events are then delivered to Filter Units (FU) that run the High Level Trigger selection algorithm (BU and FU are hosted on the same node). Events accepted for storage are transmitted to Storage Manager (SM) nodes connected to a Storage Area Network.
Figure 1. Schematic view of the CMS DAQ System [1].


Currently, when one DAQ Slice becomes less efficient, e.g. because of a fault such as a failing computing node, it slows down the other DAQ Slices. Moreover, there are several potential single points of failure, such as the EVM, SM and RU nodes. We propose a load balancing algorithm that balances the load between DAQ Slices dynamically (load balancing inside a DAQ Slice is provided by the EVM) and, as a result, enhances the overall fault tolerance of the system by removing single points of failure.
SECTION II.
Related Work

There are several strategies for balancing the incoming load in high energy physics data acquisition systems. In the ATLAS experiment at CERN [2], similarly to CMS, data are acquired at a rate of 100 kHz. First, the incoming events are pre-filtered on the basis of partial event information, so that the initial event rate is reduced to 3 kHz. The selection decision is passed to a central node called the DataFlow Manager (DFM), which supervises the event reconstruction process. For each accepted event the DFM allocates an event-building node according to pull requests received from those nodes. In this way, demand-driven load balancing is obtained. Note that distributing the incoming load on an event-by-event basis is possible only because an additional filtering step has been introduced that drastically reduces the incoming event rate. Moreover, the central-agent policy increases the number of single points of failure rather than decreasing it, which runs counter to our objective.

The data acquisition system of the DZERO experiment [5] at Fermilab (similar solutions are also used in the ZEUS and CDF experiments) handles an incoming data rate of 1 kHz (also after a pre-filtering stage). The event building and filtering are supervised by a single process called the Routing Master (RM) running on a single-board computer. The RM chooses the destination using a table containing the number of free buffers on each farm node. First, the set of least loaded nodes is identified, and then the destination node is chosen among them in a round-robin manner. After an event is assigned to a farm node, the corresponding table entry is decremented. Farm nodes update the table entries periodically through messages carrying the number of available buffers. Potentially, this solution could be adopted for higher data-taking rates, provided that the assignment decision is made and distributed for bunches of events, which is possible since the load balancer is aware of the space available in the farm nodes' buffers. However, the RM is an additional single point of failure.
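The RM selection rule described above can be sketched as follows. This is an illustrative reconstruction, not DZERO code; the structure name, the field names, and the initial cursor convention are our assumptions.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical sketch of the Routing Master's destination choice: among
// the nodes with the most free buffers, pick the next one in round-robin
// order, then decrement the chosen node's table entry.
struct RoutingMaster {
    std::vector<int> free_buffers;  // table: free buffer count per farm node
    int rr_cursor;                  // index of the last chosen node (-1 = none yet)

    // Chooses the destination farm node for the next event.
    int choose_destination() {
        int n = static_cast<int>(free_buffers.size());
        int max_free =
            *std::max_element(free_buffers.begin(), free_buffers.end());
        for (int i = 0; i < n; ++i) {
            int node = (rr_cursor + 1 + i) % n;   // scan after the last pick
            if (free_buffers[node] == max_free) { // node is in the least loaded set
                rr_cursor = node;
                --free_buffers[node];             // one buffer now reserved
                return node;
            }
        }
        return rr_cursor;  // not reached: a maximal entry always exists
    }
};
```

Periodic update messages from the farm nodes would simply overwrite entries of `free_buffers`, correcting any drift between the decremented table and the real buffer occupancy.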

Thome et al. [7] provide a compact overview and comparison of load balancing strategies used in scientific SPMD systems. Applying one of the presented algorithms to our system would be difficult, however, because the proposed methods are designed to balance the workload between individual computing nodes and not, as in our case, between computing sub-farms. Nonetheless, an interesting analysis of load balancing features is provided. For us, the most important outcome of this analysis concerns the approach to gathering internal load indices and to workload redistribution. It has been shown that global, collective load balancing leads to the best results. The global strategy implies that the load data should be gathered from the whole system at once. The collective strategy, in turn, implies that load balancing should lead to an exact workload redistribution across the whole system. Using these strategies provides the fastest reaction to imbalance in the system. Another interesting finding is that algorithms using these strategies obtained almost identical results for distributed and centralized load balancers.
SECTION III.
Proposed Method
A. Load metric

Each of the around 500 data sources delivers a new event fragment every 10 μs that needs to be assigned to a DAQ Slice. Sending a single message between computing nodes takes about 10 to 100 μs, depending on the network. Therefore, the workload has to be calculated and exchanged in advance for blocks of events, rather than for single events. Initially, n events will be allocated to each DAQ Slice. Then, once the first DAQ Slice becomes under-loaded (a DAQ Slice is considered under-loaded if it has fewer than n/2 events left to process), the workload of each DAQ Slice will be estimated by checking the number of events that still have to be processed (data ownership as the load index [8]). In the final step, in order to achieve an exact load distribution, each DAQ Slice will request a number of events that, together with the number of already owned, unprocessed events, is equal to the initial n events. This way, the most loaded DAQ Slice (the one that owns the most data) will be assigned the least load, and vice versa. A detailed analysis of the load metric and communication pattern has been conducted in [9].
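The request step above amounts to a simple complement against the initial allocation; the following sketch (with illustrative values, not CMS parameters) shows how it yields an exact redistribution:

```cpp
#include <vector>

// Sketch of the per-cycle load request: each DAQ Slice requests n minus
// the number of unprocessed events it still owns, so the most loaded
// Slice asks for the least new load and every Slice returns to holding
// exactly n events.
std::vector<int> load_requests(int n, const std::vector<int>& unprocessed) {
    std::vector<int> requests;
    for (int owned : unprocessed)
        requests.push_back(n - owned);  // complement against the allocation
    return requests;
}
```

For example, with n = 1000 and three Slices still owning 800, 200 and 0 events, the requests are 200, 800 and 1000 events respectively, inverting the observed load.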
B. Load Balancer

The load assignment algorithm runs in every data source and determines the destination DAQ Slice asynchronously, in each source, for each block of events (a distributed, asynchronous load balancer [10]). The algorithm chooses the destination based on a set of counters that indicate how many events were allocated to each DAQ Slice. An event fragment is assigned to the DAQ Slice with the currently largest counter, and that counter is then decremented. In parallel, the data sources receive requests for new blocks of events from the DAQ Slices. When all counters reach 0, the values from these requests are used to create a new set of counters. An event fragment may be assigned to a DAQ Slice if, and only if, the requests from all DAQ Slices have been received. Moreover, load measurements taken at the same time are marked with the same unique number, so that they are not mixed with another set of measurements. Given the precondition that the data sources deliver event fragments in the same sequence, and that the proposed load assignment algorithm is fully deterministic, it is guaranteed that all fragments of a single event are always assigned to the same DAQ Slice.
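The counter-based assignment can be sketched as below (the structure and member names are illustrative, not taken from the CMS code). Since every data source starts each cycle from an identical counter set and applies the same deterministic rule, all sources compute the same destination for a given event without communicating with each other:

```cpp
#include <vector>

// Minimal sketch of the per-data-source assignment step: one counter per
// DAQ Slice holds the number of events still allotted to it; a fragment
// goes to the Slice with the largest counter, which is then decremented.
// Ties break toward the lowest Slice index, keeping the rule deterministic.
struct SourceBalancer {
    std::vector<int> counters;  // events still allotted to each DAQ Slice

    // Assigns the next event fragment; returns the destination Slice index.
    int assign() {
        int best = 0;
        for (int s = 1; s < static_cast<int>(counters.size()); ++s)
            if (counters[s] > counters[best]) best = s;
        --counters[best];
        return best;
    }

    // True once the current block of events is fully assigned.
    bool exhausted() const {
        for (int c : counters)
            if (c > 0) return false;
        return true;
    }

    // Installs the new counter set; valid only after requests from *all*
    // DAQ Slices (tagged with the same cycle number) have been received.
    void refill(const std::vector<int>& requests) { counters = requests; }
};
```

A source that has exhausted its counters must wait until `refill` can be called with a complete request set, which is exactly the "if, and only if" precondition stated above.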
C. Initial workload

A DAQ Slice can always accept, regardless of its efficiency, as many events as fit into its readout buffers. Therefore, the maximal initial number of events n has to be equal to the readout buffer size divided by the expected event fragment size (in the real system some buffer reserve must also be provided because of the variable fragment size). This way, delays due to waiting for free space in those buffers can be avoided. Furthermore, n has to be large enough that, in the time needed to distribute n/2 events, the load measurement can be triggered in all DAQ Slices and load requests can be sent from each DAQ Slice to all data sources. This is a necessary condition for avoiding a reduction of the data-taking rate. On the other hand, it is desirable to keep n as small as possible, both to achieve more precise load balancing and to minimize the amount of data lost in case of a readout node failure.
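The sizing rule can be written as a one-line computation; the buffer size, fragment size and reserve fraction below are assumed example values, not CMS parameters:

```cpp
// Illustrative computation of the maximal initial block size n:
// the readout buffer capacity divided by the expected fragment size,
// minus a reserve that absorbs the fragment-size variance.
long initial_block_size(long buffer_bytes, long avg_fragment_bytes,
                        double reserve_fraction) {
    long n = buffer_bytes / avg_fragment_bytes;    // events that fit exactly
    long reserve = static_cast<long>(n * reserve_fraction);
    return n - reserve;                            // safe initial allocation
}
```

For instance, a 64 MB readout buffer with 2 kB average fragments and a 10% reserve yields n = 32768 − 3276 = 29492 events, which would then be traded off against the lower bound imposed by the load balancing round-trip time.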

In the case of heterogeneous systems, the initial workload should be assigned in such a way that the time needed by each DAQ Slice to process its load is identical (this problem can be addressed using divisible load theory [11]). This way, all DAQ Slices will also be classified as under-loaded at the same time, which is critical for achieving good scalability (details can be found in [9]).
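In the simplest divisible-load setting, equal processing times follow from allocating events proportionally to each Slice's processing rate; the sketch below assumes such per-Slice rates are known or measured:

```cpp
#include <vector>

// Proportional initial allocation for a heterogeneous system: each DAQ
// Slice receives a share of the total initial workload proportional to
// its processing rate, so all Slices drain their allocation (and become
// under-loaded) at roughly the same time.
std::vector<long> initial_allocation(long total_events,
                                     const std::vector<double>& rates) {
    double rate_sum = 0.0;
    for (double r : rates) rate_sum += r;
    std::vector<long> alloc;
    for (double r : rates)  // share proportional to rate; time = alloc/rate
        alloc.push_back(static_cast<long>(total_events * r / rate_sum));
    return alloc;
}
```

With this allocation the processing time alloc[i] / rates[i] is the same constant total_events / rate_sum for every Slice, which is the equal-finish-time condition stated above.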
D. Fault tolerance

With the proposed load balancing algorithm, a loss of capacity in one DAQ Slice caused by a software failure, a failing network connection or a failing computing node does not slow down the other DAQ Slices. The system is even able to tolerate the extreme case in which a critical node like the SM fails. The damaged DAQ Slice will still be able to accept events, because it was only allocated the number of events that fit into its readout buffers. Then, when the readout buffers fill up, the damaged DAQ Slice will stop requesting new events during the load balancing cycles, and the whole load will be redirected to the other DAQ Slices. The problem is more complex when it comes to the failure of a readout node (RU). In this case, the data sources corresponding to the damaged RU will detect the failure using a heartbeat mechanism and discard the event fragments assigned to this node (whose readout buffer is no longer available). The damaged DAQ Slice will not be able to process incomplete events, which means that at some point it will stop requesting new events. The maximum number of events that can be lost in this way is the initial number of events n. The most complex case occurs when one of the EVM nodes fails, because it not only supervises the whole DAQ Slice but also acts as a readout node. Moreover, the EVM node is responsible for triggering the load balancing cycles, measuring the workload, and requesting new blocks of events for its DAQ Slice. The data sources corresponding to the EVM will behave as if an ordinary readout node had failed (they will start to discard fragments assigned to this node). When the load balancing cycle is triggered, the EVM responsible for notifying the damaged EVM will detect the fault and send, along with its request for new events, a message indicating that the damaged DAQ Slice should be excluded from data taking (this feature is not yet fully implemented).
In the standard system, each of the above mentioned node failures would cause buffers in the data sources to fill up and therefore would result in stopping the data acquisition.
SECTION IV.
Preliminary Results and Conclusions

We prepared a prototype that has been studied in the CMS DAQ test environment. The algorithm has been implemented in C for the Myrinet network driver and in C++ for the applications running on the computing farm. The goal of our experiments was to verify whether the fault tolerance of the system has been enhanced through dynamic load balancing. We obtained positive results concerning the improved performance of the system in the case of a decrease in the processing power of one or more DAQ Slices. We used a 4-DAQ-Slice setup with 1 EVM, 3 RUs and 4 BUs per DAQ Slice. The system was tuned so that the throughput limitation would come from the event filter farm and the maximum possible data-taking rate would be 50 kHz (12.5 kHz per DAQ Slice, which is the nominal speed for the production system). The initial data acquisition rate was set to the maximum (50 kHz).
Figure 2. System response to fault occurrence in one DAQ Slice


As shown in Figure 2, we studied the response of the system to the occurrence of a fault in a particular DAQ Slice. To simulate this, after a certain period of time the BUs were killed one after another. It can easily be noticed that the capacity loss in the standard system was significantly greater. Furthermore, the loss of the entire processing power of one DAQ Slice stopped the data acquisition. In the system running with the load balancing algorithm, on the other hand, data were distributed proportionally to the efficiency of the DAQ Slices, and as a result the malfunctioning DAQ Slice was effectively excluded. This also demonstrates that introducing heterogeneity into the system is now possible. Moreover, we studied the case when a critical computing node fails. We performed a series of experiments in which we simulated a breakdown of an SM node, and then of an RU node. In both cases positive results were likewise obtained. The incoming load was distributed only between the operational DAQ Slices, which means that a breakdown of an SM or RU node is no longer critical for the system, and, as a result, that these single points of failure have been removed. These experiments were afterwards successfully repeated, even when faults were introduced in more than one DAQ Slice. The overhead of the discussed load balancing algorithm has been measured in the CMS DAQ test environment for a system of 8 data sources and 8 DAQ Slices (8 inputs and 8 outputs), each of them consisting of 1 EVM, 1 RU and 2 BUs. The throughput was measured for event sizes in the range from 128 B to 10 kB, running the DAQ system at maximum speed. A comparison of the results obtained for the standard system and the system running with the proposed load balancing algorithm leads us to the conclusion that for event fragment sizes above 2 kB (the operating range of the CMS DAQ System) the overhead due to the algorithm is less than 1% and hence negligible.

During the last technical stop of the LHC, the proposed algorithm has been tested on the full-scale production DAQ system of the CMS experiment. Preliminary measurements show that the algorithm scales well and meets the design specifications of the CMS DAQ System.
ACKNOWLEDGMENT

I would like to thank my supervisors, Professor Stanislaw Kozielski and Dr Hannes Sakulin, for all their help and support during my PhD studies, as well as for stimulating discussions.

