Graph Embeddings at scale with Spark and GraphFrames

In this blog post, I will recount my experience working on the addition of the graph embeddings API to the GraphFrames library. I will start with a top-level overview of the vertex representation learning task and the existing approaches. Then, I will focus on generating embeddings from random walks. I will explain the intuition behind and the implementation details of Random Walks in GraphFrames, as well as the chosen trade-offs and limitations. Given a sequence of graph vertices, I will explain the possible ways to learn vertex representations from it. First, I will analyze the common approach of using the word2vec model and its limitations, especially the limitations of the Apache Spark ML Word2vec implementation. I will present the results of my analysis of alternative approaches and their respective advantages and disadvantages. Finally, I will explain the intuition behind the Hash2vec approach that I chose as a starting point for GraphFrames, as well as the details of its implementation in Scala and Spark. Finally, I will present an experimental comparison of Hash2vec and Word2vec embeddings and share my thoughts about future directions.

December 12, 2025 · 40 min · Sem Sinchenko

Benchmarking Spark libraray with JMH

In this blog post, I will provide an end-to-end example of how to integrate JMH benchmarks for an Apache Spark-based library into an SBT build. I will cover aspects that are often poorly documented, such as setting up a Spark Session and datasets as shared resources, configuring driver memory for standalone Spark with JMH, and more. Additionally, as a bonus, I will demonstrate how to integrate benchmark results into the library’s documentation using the Typelevel Laika static site generator. While many of these steps may seem obvious to experienced JVM developers, I personally spent several hours figuring out how to configure them correctly. From what I have observed, the subject of Apache Spark with JMH is underexplored, and online searches did not yield much guidance. Thus, my hope is that this short post will help someone someday and save them a few hours of effort.

September 2, 2025 · 8 min · Sem Sinchenko