Modern-day end-to-end data analytics pipelines are centered around transforming various kinds of collection-based data types, including tensors, tables, graphs, and streams. Current data analytics frameworks are typically designed around only one or two of these data types. Examples of such frameworks include Apache Calcite, which leverages relational streams; Halide, which supports streaming stencil computations; and Lara, which offers unified support for operations on bags and matrices.
In the CDA project, the goal is to be able to express and optimise operations over more than two types of collections. To this end, we propose Arc, an intermediate representation for data analytics. Arc is implemented as a dialect of MLIR (Multi-Level IR), a highly modular and hackable compiler framework built on top of LLVM. This talk will introduce the design ideas behind the latest installment of Arc, and will give an overview of MLIR and its role in achieving our goal.
This talk presents a unified model for streaming and batch data processing. We start from the history of large-scale data processing systems at Google and their evolution towards higher levels of abstraction, culminating in the Beam programming model. We delve into the fundamentals of out-of-order, unbounded data stream processing and show how Beam's powerful abstractions for reasoning about time greatly simplify this complex task. Beam provides a model that allows developers to focus on the four important questions that must be answered by any stream processing pipeline: what results are being computed; where in event time they are computed; when in processing time they are materialized; and how refinements of successive results relate to each other.
We discuss how these key questions and abstractions enable important properties of correctness, expressive power, composability, flexibility, and modularity. Furthermore, because these questions are cleanly separated from runtime characteristics, Beam programs are portable across multiple runtime environments, both proprietary, such as Google Cloud Dataflow, and open source, such as Apache Flink, Spark, and Samza. We close with a presentation of the execution model and show how the Google Cloud Dataflow service optimizes and orchestrates the distributed execution and autoscaling of a Beam pipeline.
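The what/where/when/how questions map directly onto transform, windowing, triggering, and accumulation choices in Beam's SDKs. The following Beam Java snippet is a minimal sketch (not taken from the talk) that sums per-team scores over one-minute event-time windows; the class name, the sample (team, points) data, and the particular window, trigger, and lateness settings are illustrative assumptions.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Duration;
import org.joda.time.Instant;

// Hypothetical example pipeline illustrating Beam's four questions.
public class TeamScoreSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // A tiny bounded input standing in for an unbounded stream of
    // (team, points) events; the Instants are the event times.
    PCollection<KV<String, Integer>> scores = p.apply(Create.timestamped(
        TimestampedValue.of(KV.of("blue", 3), new Instant(0L)),
        TimestampedValue.of(KV.of("red", 5), new Instant(30_000L)),
        TimestampedValue.of(KV.of("blue", 2), new Instant(70_000L))));

    PCollection<KV<String, Integer>> totals = scores
        // Where in event time: fixed one-minute windows.
        .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(1)))
            // When in processing time: an on-time pane at the watermark,
            // speculative early panes, and an update for each late element.
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(30)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardMinutes(5))
            // How refinements relate: later panes accumulate earlier ones.
            .accumulatingFiredPanes())
        // What is being computed: a per-team sum of points.
        .apply(Sum.integersPerKey());

    // In a real pipeline, totals would be written to a sink.
    p.run().waitUntilFinish();
  }
}
```

Because the what/where/when/how choices above are expressed purely in the model, the same pipeline can be handed unchanged to any supported runner (Dataflow, Flink, Spark, and so on) simply by selecting different pipeline options.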
This talk will report on some experiences in applying two different stateless model checking (SMC) tools in two case studies. The first applied Nidhugg, an SMC tool for C/pthreads programs, to the code of Tree RCU, the hierarchical Read-Copy-Update synchronization mechanism used in the Linux kernel, a low-level and quite complex concurrent program. The second applied Concuerror, an SMC tool for Erlang programs, to test and verify, during their design phase by engineers at VMware, chain repair methods for CORFU, a distributed shared log that aims to be scalable and reliable in the presence of failures and asynchrony.
Besides the results of these two case studies, we will present some experiences and lessons learned from engaging in testing projects of this size and complexity.