Benchmarking a Spark library with JMH

In this blog post, I provide an end-to-end example of integrating JMH benchmarks for an Apache Spark-based library into an SBT build. I cover aspects that are often poorly documented, such as setting up a Spark session and datasets as shared resources and configuring driver memory for standalone Spark under JMH. As a bonus, I also demonstrate how to integrate benchmark results into the library’s documentation using the Typelevel Laika static site generator. Many of these steps may seem obvious to experienced JVM developers, but I personally spent several hours figuring out how to configure them correctly. As far as I can tell, the combination of Apache Spark and JMH is underexplored, and my online searches did not yield much guidance. My hope is that this short post will save someone a few hours of effort.

September 2, 2025 · 8 min · Sem Sinchenko