
Why Apache Spark is often considered as slow?
The question about why Apache Spark is "slow" is one of the most often questions I'm hearing from junior engineers and peoples I'm mentoring. While that is partially true, it should be clarified. TLDR – OSS Spark is a multi-purpose engine that is designed to handle different kinds of workloads. Under the hood of Spark is using a data-centric code generation but also it has some vectorization as well as option to fallbak to a pure Volcano-mode. Because of that Spark can be considred as a hybrid engine, that can benefit from all the approaches. But because of it's multi-purpose nature it will be almost always slower compared to pure vectorized engines like Trino on OLAP workloads on top of columnar data, except rare cases of big amount of nulls or deep branching in the query. In this blogpost I'm trying to explain the statement above.