subscribe to arXiv mailings

Graph Neural Network Training Systems: A Performance Comparison of Full-Graph and Mini-Batch

Authors: Saurabh Bajaj, Hui Guan, Marco Serafini

Abstract: Graph Neural Networks (GNNs) have gained significant attention in recent years due to their ability to learn representations of graph structured data. Two common methods for training GNNs are mini-batch training and full-graph training. Since these two methods require different training pipelines and systems optimizations, two separate categories of GNN training systems emerged, each tailored for… ▽ More Graph Neural Networks (GNNs) have gained significant attention in recent years due to their ability to learn representations of graph structured data. Two common methods for training GNNs are mini-batch training and full-graph training. Since these two methods require different training pipelines and systems optimizations, two separate categories of GNN training systems emerged, each tailored for one method. Works that introduce systems belonging to a particular category predominantly compare them with other systems within the same category, offering limited or no comparison with systems from the other category. Some prior work also justifies its focus on one specific training method by arguing that it achieves higher accuracy than the alternative. The literature, however, has incomplete and contradictory evidence in this regard. In this paper, we provide a comprehensive empirical comparison of full-graph and mini-batch GNN training systems to get a clearer picture of the state of the art in the field. We find that the mini-batch training systems we consider consistently converge faster than the full-graph training ones across multiple datasets, GNN models, and system configurations, with speedups between 2.4x - 15.2x. We also find that both training techniques converge to similar accuracy values, so comparing systems across the two categories in terms of time-to-accuracy is a sound approach. △ Less

Submitted 8 June, 2024; v1 submitted 1 June, 2024; originally announced June 2024.

Comments: 12 pages, 1 appendix, 8 Figures, 16 Tables, Graph Neural Network, Graph Neural Networks, Full-graph training, Mini-batch training, full-batch training, distributed training, performance, epoch time, time to accuracy, accuracy

arXiv:2403.01050 [pdf, other]

doi 10.1109/PACT58117.2023.00026

GraphMini: Accelerating Graph Pattern Matching Using Auxiliary Graphs

Authors: Juelin Liu, Sandeep Polisetty, Hui Guan, Marco Serafini

Abstract: Graph pattern matching is a fundamental problem encountered by many common graph mining tasks and the basic building block of several graph mining systems. This paper explores for the first time how to proactively prune graphs to speed up graph pattern matching by leveraging the structure of the query pattern and the input graph. We propose building auxiliary graphs, which are different pruned… ▽ More Graph pattern matching is a fundamental problem encountered by many common graph mining tasks and the basic building block of several graph mining systems. This paper explores for the first time how to proactively prune graphs to speed up graph pattern matching by leveraging the structure of the query pattern and the input graph. We propose building auxiliary graphs, which are different pruned versions of the graph, during query execution. This requires careful balancing between the upfront cost of building and managing auxiliary graphs and the gains of faster set operations. To this end, we propose GraphMini, a new system that uses query compilation and a new cost model to minimize the cost of building and maintaining auxiliary graphs and maximize gains. Our evaluation shows that using GraphMini can achieve one order of magnitude speedup compared to state-of-the-art subgraph enumeration systems on commonly used benchmarks. △ Less

Submitted 1 March, 2024; originally announced March 2024.

arXiv:2312.15405 [pdf, other]

Enhancing Computation Pushdown for Cloud OLAP Databases

Authors: Yifei Yang, Xiangyao Yu, Marco Serafini, Ashraf Aboulnaga, Michael Stonebraker

Abstract: Network is a major bottleneck in modern cloud databases that adopt a storage-disaggregation architecture. Computation pushdown is a promising solution to tackle this issue, which offloads some computation tasks to the storage layer to reduce network traffic. Existing cloud OLAP systems statically decide whether to push down computation during the query optimization phase and do not consider the st… ▽ More Network is a major bottleneck in modern cloud databases that adopt a storage-disaggregation architecture. Computation pushdown is a promising solution to tackle this issue, which offloads some computation tasks to the storage layer to reduce network traffic. Existing cloud OLAP systems statically decide whether to push down computation during the query optimization phase and do not consider the storage layer's computational capacity and load. Besides, there is a lack of a general principle that determines which operators are amenable for pushdown. Existing systems design and implement pushdown features empirically, which ends up picking a limited set of pushdown operators respectively. In this paper, we first design Adaptive pushdown as a new mechanism to avoid throttling the storage-layer computation during pushdown, which pushes the request back to the computation layer at runtime if the storage-layer computational resource is insufficient. Moreover, we derive a general principle to identify pushdown-amenable computational tasks, by summarizing common patterns of pushdown capabilities in existing systems. We propose two new pushdown operators, namely, selection bitmap and distributed data shuffle. Evaluation results on TPC-H show that Adaptive pushdown can achieve up to 1.9x speedup over both No pushdown and Eager pushdown baselines, and the new pushdown operators can further accelerate query execution by up to 3.0x. △ Less

Submitted 29 December, 2023; v1 submitted 23 December, 2023; originally announced December 2023.

Comments: 13 pages, 15 figures

arXiv:2303.13775 [pdf, other]

GSplit: Scaling Graph Neural Network Training on Large Graphs via Split-Parallelism

Authors: Sandeep Polisetty, Juelin Liu, Kobi Falus, Yi Ren Fung, Seung-Hwan Lim, Hui Guan, Marco Serafini

Abstract: Graph neural networks (GNNs), an emerging class of machine learning models for graphs, have gained popularity for their superior performance in various graph analytical tasks. Mini-batch training is commonly used to train GNNs on large graphs, and data parallelism is the standard approach to scale mini-batch training across multiple GPUs. One of the major performance costs in GNN training is the l… ▽ More Graph neural networks (GNNs), an emerging class of machine learning models for graphs, have gained popularity for their superior performance in various graph analytical tasks. Mini-batch training is commonly used to train GNNs on large graphs, and data parallelism is the standard approach to scale mini-batch training across multiple GPUs. One of the major performance costs in GNN training is the loading of input features, which prevents GPUs from being fully utilized. In this paper, we argue that this problem is exacerbated by redundancies that are inherent to the data parallel approach. To address this issue, we introduce a hybrid parallel mini-batch training paradigm called split parallelism. Split parallelism avoids redundant data loads and splits the sampling and training of each mini-batch across multiple GPUs online, at each iteration, using a lightweight splitting algorithm. We implement split parallelism in GSplit and show that it outperforms state-of-the-art mini-batch training systems like DGL, Quiver, and $P^3$. △ Less

Submitted 27 June, 2024; v1 submitted 23 March, 2023; originally announced March 2023.

arXiv:2212.10387 [pdf, other]

Tuning the Tail Latency of Distributed Queries Using Replication

Authors: Nathan Ng, Hung Le, Marco Serafini

Abstract: Querying graph data with low latency is an important requirement in application domains such as social networks and knowledge graphs. Graph queries perform multiple hops between vertices. When data is partitioned and stored across multiple servers, queries executing at one server often need to hop to vertices stored by another server. Such distributed traversals represent a performance bottleneck… ▽ More Querying graph data with low latency is an important requirement in application domains such as social networks and knowledge graphs. Graph queries perform multiple hops between vertices. When data is partitioned and stored across multiple servers, queries executing at one server often need to hop to vertices stored by another server. Such distributed traversals represent a performance bottleneck for low-latency queries. To reduce query latency, one can replicate remote data to make distributed traversals unnecessary, but replication is expensive and should be minimized. In this paper, we introduce the problem of finding data replication schemes that satisfy arbitrary user-defined query latency constraints with minimal replication cost. We propose a novel workload model to express data access causality, propose a family of heuristics, and introduce non-trivial sufficient conditions for their correctness. Our evaluation on two representative benchmarks show that our algorithms enable fine-tuning query latency with data replication and can find sweet spots in the latency/replication design space. △ Less

Submitted 20 December, 2022; originally announced December 2022.

Comments: An earlier version of this paper was submitted in April 2022. Previous versions are available at https://marcoserafini.github.io/projects/latency-replication/index.html

arXiv:2105.02315 [pdf, other]

doi 10.1145/3469379.3469387

Scalable Graph Neural Network Training: The Case for Sampling

Authors: Marco Serafini, Hui Guan

Abstract: Graph Neural Networks (GNNs) are a new and increasingly popular family of deep neural network architectures to perform learning on graphs. Training them efficiently is challenging due to the irregular nature of graph data. The problem becomes even more challenging when scaling to large graphs that exceed the capacity of single devices. Standard approaches to distributed DNN training, such as data… ▽ More Graph Neural Networks (GNNs) are a new and increasingly popular family of deep neural network architectures to perform learning on graphs. Training them efficiently is challenging due to the irregular nature of graph data. The problem becomes even more challenging when scaling to large graphs that exceed the capacity of single devices. Standard approaches to distributed DNN training, such as data and model parallelism, do not directly apply to GNNs. Instead, two different approaches have emerged in the literature: whole-graph and sample-based training. In this paper, we review and compare the two approaches. Scalability is challenging with both approaches, but we make a case that research should focus on sample-based training since it is a more promising approach. Finally, we review recent systems supporting sample-based training. △ Less

Submitted 5 May, 2021; originally announced May 2021.

Comments: 9 pages, 2 figures

Journal ref: ACM SIGOPS Operating Systems Review, Volume 55, Issue 1, July 2021, pp 68-76

arXiv:2009.06693 [pdf, other]

Accelerating Graph Sampling for Graph Machine Learning using GPUs

Authors: Abhinav Jangda, Sandeep Polisetty, Arjun Guha, Marco Serafini

Abstract: Representation learning algorithms automatically learn the features of data. Several representation learning algorithms for graph data, such as DeepWalk, node2vec, and GraphSAGE, sample the graph to produce mini-batches that are suitable for training a DNN. However, sampling time can be a significant fraction of training time, and existing systems do not efficiently parallelize sampling. Samplin… ▽ More Representation learning algorithms automatically learn the features of data. Several representation learning algorithms for graph data, such as DeepWalk, node2vec, and GraphSAGE, sample the graph to produce mini-batches that are suitable for training a DNN. However, sampling time can be a significant fraction of training time, and existing systems do not efficiently parallelize sampling. Sampling is an embarrassingly parallel problem and may appear to lend itself to GPU acceleration, but the irregularity of graphs makes it hard to use GPU resources effectively. This paper presents NextDoor, a system designed to effectively perform graph sampling on GPUs. NextDoor employs a new approach to graph sampling that we call transit-parallelism, which allows load balancing and caching of edges. NextDoor provides end-users with a high-level abstraction for writing a variety of graph sampling algorithms. We implement several graph sampling applications, and show that NextDoor runs them orders of magnitude faster than existing systems. △ Less

Submitted 10 May, 2021; v1 submitted 14 September, 2020; originally announced September 2020.

Comments: Published in EuroSys 2021

arXiv:2003.03604 [pdf, other]

Aion: Better Late than Never in Event-Time Streams

Authors: Sérgio Esteves, Gianmarco De Francisci Morales, Rodrigo Rodrigues, Marco Serafini, Luís Veiga

Abstract: Processing data streams in near real-time is an increasingly important task. In the case of event-timestamped data, the stream processing system must promptly handle late events that arrive after the corresponding window has been processed. To enable this late processing, the window state must be maintained for a long period of time. However, current systems maintain this state in memory, which ei… ▽ More Processing data streams in near real-time is an increasingly important task. In the case of event-timestamped data, the stream processing system must promptly handle late events that arrive after the corresponding window has been processed. To enable this late processing, the window state must be maintained for a long period of time. However, current systems maintain this state in memory, which either imposes a maximum period of tolerated lateness, or causes the system to degrade performance or even crash when the system memory runs out. In this paper, we propose AION, a comprehensive solution for handling late events in an efficient manner, implemented on top of Flink. In designing AION, we go beyond a naive solution that transfers state between memory and persistent storage on demand. In particular, we introduce a proactive caching scheme, where we leverage the semantics of stream processing to anticipate the need for bringing data to memory. Furthermore, we propose a predictive cleanup scheme to permanently discard window state based on the likelihood of receiving more late events, to prevent storage consumption from growing without bounds. Our evaluation shows that AION is capable of maintaining sustainable levels of memory utilization while still preserving high throughput, low latency, and low staleness. △ Less

Submitted 22 April, 2020; v1 submitted 7 March, 2020; originally announced March 2020.

Comments: 14 pages, 28 figures

arXiv:2002.05837 [pdf, other]

PushdownDB: Accelerating a DBMS using S3 Computation

Authors: Xiangyao Yu, Matt Youill, Matthew Woicik, Abdurrahman Ghanem, Marco Serafini, Ashraf Aboulnaga, Michael Stonebraker

Abstract: This paper studies the effectiveness of pushing parts of DBMS analytics queries into the Simple Storage Service (S3) engine of Amazon Web Services (AWS), using a recently released capability called S3 Select. We show that some DBMS primitives (filter, projection, aggregation) can always be cost-effectively moved into S3. Other more complex operations (join, top-K, group-by) require reimplementatio… ▽ More This paper studies the effectiveness of pushing parts of DBMS analytics queries into the Simple Storage Service (S3) engine of Amazon Web Services (AWS), using a recently released capability called S3 Select. We show that some DBMS primitives (filter, projection, aggregation) can always be cost-effectively moved into S3. Other more complex operations (join, top-K, group-by) require reimplementation to take advantage of S3 Select and are often candidates for pushdown. We demonstrate these capabilities through experimentation using a new DBMS that we developed, PushdownDB. Experimentation with a collection of queries including TPC-H queries shows that PushdownDB is on average 30% cheaper and 6.7X faster than a baseline that does not use S3 Select. △ Less

Submitted 13 February, 2020; originally announced February 2020.

arXiv:1910.05773 [pdf, other]

doi 10.14778/3384345.3384351

LiveGraph: A Transactional Graph Storage System with Purely Sequential Adjacency List Scans

Authors: Xiaowei Zhu, Guanyu Feng, Marco Serafini, Xiaosong Ma, Jiping Yu, Lei Xie, Ashraf Aboulnaga, Wenguang Chen

Abstract: The specific characteristics of graph workloads make it hard to design a one-size-fits-all graph storage system. Systems that support transactional updates use data structures with poor data locality, which limits the efficiency of analytical workloads or even simple edge scans. Other systems run graph analytics workloads efficiently, but cannot properly support transactions. This paper presents… ▽ More The specific characteristics of graph workloads make it hard to design a one-size-fits-all graph storage system. Systems that support transactional updates use data structures with poor data locality, which limits the efficiency of analytical workloads or even simple edge scans. Other systems run graph analytics workloads efficiently, but cannot properly support transactions. This paper presents LiveGraph, a graph storage system that outperforms both the best graph transactional systems and the best systems for real-time graph analytics on fresh data. LiveGraph does that by ensuring that adjacency list scans, a key operation in graph workloads, are purely sequential: they never require random accesses even in presence of concurrent transactions. This is achieved by combining a novel graph-aware data structure, the Transactional Edge Log (TEL), together with a concurrency control mechanism that leverages TEL's data layout. Our evaluation shows that LiveGraph significantly outperforms state-of-the-art (graph) database solutions on both transactional and real-time analytical workloads. △ Less

Submitted 29 August, 2020; v1 submitted 13 October, 2019; originally announced October 2019.

arXiv:1804.01942 [pdf, other]

Scaling Out Acid Applications with Operation Partitioning

Authors: Habib Saissi, Marco Serafini, Neeraj Suri

Abstract: OLTP applications with high workloads that cannot be served by a single server need to scale out to multiple servers. Typically, scaling out entails assigning a different partition of the application state to each server. But data partitioning is at odds with preserving the strong consistency guarantees of ACID transactions, a fundamental building block of many OLTP applications. The more we scale… ▽ More OLTP applications with high workloads that cannot be served by a single server need to scale out to multiple servers. Typically, scaling out entails assigning a different partition of the application state to each server. But data partitioning is at odds with preserving the strong consistency guarantees of ACID transactions, a fundamental building block of many OLTP applications. The more we scale out and spread data across multiple servers, the more frequent distributed transactions accessing data at different servers will be. With a large number of servers, the high cost of distributed transactions makes scaling out ineffective or even detrimental. In this paper we propose Operation Partitioning, a novel paradigm to scale out OLTP applications that require ACID guarantees. Operation Partitioning indirectly partitions data across servers by partitioning the application's operations through static analysis. This partitioning of operations yields to a lock-free Conveyor Belt protocol for distributed coordination, which can scale out unmodified applications running on top of unmodified database management systems. We implement the protocol in a system called Elia and use it to scale out two applications, TPC-W and RUBiS. Our experiments show that Elia can increase maximum throughput by up to 4.2x and reduce latency by up to 58.6x compared to MySQL Cluster while at the same time providing a stronger isolation guarantee (serializability instead of read committed). △ Less

Submitted 5 April, 2018; originally announced April 2018.

arXiv:1705.09073 [pdf, other]

Load Balancing for Skewed Streams on Heterogeneous Cluster

Authors: Muhammad Anis Uddin Nasir, Hiroshi Horii, Marco Serafini, Nicolas Kourtellis, Rudy Raymond, Sarunas Girdzijauskas, Takayuki Osogami

Abstract: Streaming applications frequently encounter skewed workloads and execute on heterogeneous clusters. Optimal resource utilization in such adverse conditions becomes a challenge, as it requires inferring the resource capacities and input distribution at run time. In this paper, we tackle the aforementioned challenges by modeling them as a load balancing problem. We propose a novel partitioning strat… ▽ More Streaming applications frequently encounter skewed workloads and execute on heterogeneous clusters. Optimal resource utilization in such adverse conditions becomes a challenge, as it requires inferring the resource capacities and input distribution at run time. In this paper, we tackle the aforementioned challenges by modeling them as a load balancing problem. We propose a novel partitioning strategy called Consistent Grouping (CG), which enables each processing element instance (PEI) to process the workload according to its capacity. The main idea behind CG is the notion of small, equal-sized virtual workers at the sources, which are assigned to workers based on their capacities. We provide a theoretical analysis of the proposed algorithm and show via extensive empirical evaluation that our proposed scheme outperforms the state-of-the-art approaches, like key grouping. In particular, CG achieves 3.44x better performance in terms of latency compared to key grouping. △ Less

Submitted 1 October, 2017; v1 submitted 25 May, 2017; originally announced May 2017.

Comments: 12 pages, under submission

arXiv:1510.07623 [pdf, other]

Partial Key Grouping: Load-Balanced Partitioning of Distributed Streams

Authors: Muhammad Anis Uddin Nasir, Gianmarco De Francisci Morales, David Garcia-Soriano, Nicolas Kourtellis, Marco Serafini

Abstract: We study the problem of load balancing in distributed stream processing engines, which is exacerbated in the presence of skew. We introduce Partial Key Grouping (PKG), a new stream partitioning scheme that adapts the classical "power of two choices" to a distributed streaming setting by leveraging two novel techniques: key splitting and local load estimation. In so doing, it achieves better load b… ▽ More We study the problem of load balancing in distributed stream processing engines, which is exacerbated in the presence of skew. We introduce Partial Key Grouping (PKG), a new stream partitioning scheme that adapts the classical "power of two choices" to a distributed streaming setting by leveraging two novel techniques: key splitting and local load estimation. In so doing, it achieves better load balancing than key grouping while being more scalable than shuffle grouping. We test PKG on several large datasets, both real-world and synthetic. Compared to standard hashing, PKG reduces the load imbalance by up to several orders of magnitude, and often achieves nearly-perfect load balance. This result translates into an improvement of up to 175% in throughput and up to 45% in latency when deployed on a real Storm cluster. PKG has been integrated in Apache Storm v0.10. △ Less

Submitted 26 October, 2015; originally announced October 2015.

Comments: 14 pages. arXiv admin note: substantial text overlap with arXiv:1504.00788

arXiv:1510.05714 [pdf, other]

When Two Choices Are not Enough: Balancing at Scale in Distributed Stream Processing

Authors: Muhammad Anis Uddin Nasir, Gianmarco De Francisci Morales, Nicolas Kourtellis, Marco Serafini

Abstract: Carefully balancing load in distributed stream processing systems has a fundamental impact on execution latency and throughput. Load balancing is challenging because real-world workloads are skewed: some tuples in the stream are associated to keys which are significantly more frequent than others. Skew is remarkably more problematic in large deployments: more workers implies fewer keys per worker,… ▽ More Carefully balancing load in distributed stream processing systems has a fundamental impact on execution latency and throughput. Load balancing is challenging because real-world workloads are skewed: some tuples in the stream are associated to keys which are significantly more frequent than others. Skew is remarkably more problematic in large deployments: more workers implies fewer keys per worker, so it becomes harder to "average out" the cost of hot keys with cold keys. We propose a novel load balancing technique that uses a heaving hitter algorithm to efficiently identify the hottest keys in the stream. These hot keys are assigned to $d \geq 2$ choices to ensure a balanced load, where $d$ is tuned automatically to minimize the memory and computation cost of operator replication. The technique works online and does not require the use of routing tables. Our extensive evaluation shows that our technique can balance real-world workloads on large deployments, and improve throughput and latency by $\mathbf{150\%}$ and $\mathbf{60\%}$ respectively over the previous state-of-the-art when deployed on Apache Storm. △ Less

Submitted 27 January, 2016; v1 submitted 19 October, 2015; originally announced October 2015.

Comments: 12 pages, 14 Figures, this paper is accepted and will be published at ICDE 2016

arXiv:1510.04233 [pdf, other]

Arabesque: A System for Distributed Graph Mining - Extended version

Authors: Carlos H. C. Teixeira, Alexandre J. Fonseca, Marco Serafini, Georgos Siganos, Mohammed J. Zaki, Ashraf Aboulnaga

Abstract: Distributed data processing platforms such as MapReduce and Pregel have substantially simplified the design and deployment of certain classes of distributed graph analytics algorithms. However, these platforms do not represent a good match for distributed graph mining problems, as for example finding frequent subgraphs in a graph. Given an input graph, these problems require exploring a very large… ▽ More Distributed data processing platforms such as MapReduce and Pregel have substantially simplified the design and deployment of certain classes of distributed graph analytics algorithms. However, these platforms do not represent a good match for distributed graph mining problems, as for example finding frequent subgraphs in a graph. Given an input graph, these problems require exploring a very large number of subgraphs and finding patterns that match some "interestingness" criteria desired by the user. These algorithms are very important for areas such as social net- works, semantic web, and bioinformatics. In this paper, we present Arabesque, the first distributed data processing platform for implementing graph mining algorithms. Arabesque automates the process of exploring a very large number of subgraphs. It defines a high-level filter-process computational model that simplifies the development of scalable graph mining algorithms: Arabesque explores subgraphs and passes them to the application, which must simply compute outputs and decide whether the subgraph should be further extended. We use Arabesque's API to produce distributed solutions to three fundamental graph mining problems: frequent subgraph mining, counting motifs, and finding cliques. Our implementations require a handful of lines of code, scale to trillions of subgraphs, and represent in some cases the first available distributed solutions. △ Less

Submitted 14 October, 2015; originally announced October 2015.

Comments: A short version of this report appeared in the Proceedings of the 25th ACM Symp. on Operating Systems Principles (SOSP), 2015

Report number: QCRI-TR-2015-005

arXiv:1504.00788 [pdf, other]

doi 10.1109/ICDE.2015.7113279

The Power of Both Choices: Practical Load Balancing for Distributed Stream Processing Engines

Authors: Muhammad Anis Uddin Nasir, Gianmarco De Francisci Morales, David García-Soriano, Nicolas Kourtellis, Marco Serafini

Abstract: We study the problem of load balancing in distributed stream processing engines, which is exacerbated in the presence of skew. We introduce Partial Key Grouping (PKG), a new stream partitioning scheme that adapts the classical "power of two choices" to a distributed streaming setting by leveraging two novel techniques: key splitting and local load estimation. In so doing, it achieves better load b… ▽ More We study the problem of load balancing in distributed stream processing engines, which is exacerbated in the presence of skew. We introduce Partial Key Grouping (PKG), a new stream partitioning scheme that adapts the classical "power of two choices" to a distributed streaming setting by leveraging two novel techniques: key splitting and local load estimation. In so doing, it achieves better load balancing than key grouping while being more scalable than shuffle grouping. We test PKG on several large datasets, both real-world and synthetic. Compared to standard hashing, PKG reduces the load imbalance by up to several orders of magnitude, and often achieves nearly-perfect load balance. This result translates into an improvement of up to 60% in throughput and up to 45% in latency when deployed on a real Storm cluster. △ Less

Submitted 3 April, 2015; originally announced April 2015.

Comments: 31st IEEE International Conference on Data Engineering (ICDE), 2015

arXiv:1308.2979 [pdf, other]

On Barriers and the Gap between Active and Passive Replication (Full Version)

Authors: Flavio P. Junqueira, Marco Serafini

Abstract: Active replication is commonly built on top of the atomic broadcast primitive. Passive replication, which has been recently used in the popular ZooKeeper coordination system, can be naturally built on top of the primary-order atomic broadcast primitive. Passive replication differs from active replication in that it requires processes to cross a barrier before they become primaries and start broadc… ▽ More Active replication is commonly built on top of the atomic broadcast primitive. Passive replication, which has been recently used in the popular ZooKeeper coordination system, can be naturally built on top of the primary-order atomic broadcast primitive. Passive replication differs from active replication in that it requires processes to cross a barrier before they become primaries and start broadcasting messages. In this paper, we propose a barrier function tau that explains and encapsulates the differences between existing primary-order atomic broadcast algorithms, namely semi-passive replication and Zookeeper atomic broadcast (Zab), as well as the differences between Paxos and Zab. We also show that implementing primary-order atomic broadcast on top of a generic consensus primitive and tau inherently results in higher time complexity than atomic broadcast, as witnessed by existing algorithms. We overcome this problem by presenting an alternative, primary-order atomic broadcast implementation that builds on top of a generic consensus primitive and uses consensus itself to form a barrier. This algorithm is modular and matches the time complexity of existing tau-based algorithms. △ Less

Submitted 12 October, 2015; v1 submitted 13 August, 2013; originally announced August 2013.

Comments: A shorter version of this work (without appendices) appears in the proceedings of the 27th International Symposium on Distributed Computing (DISC) 2013 conference

Showing 1–17 of 17 results for author: Serafini, M