Welcome to my Blog!

👋👋👋 Welcome to my personal blog!

I'm a Russian-born data engineer who has been living in Belgrade, Serbia, since 2022.

My blog mostly covers three topics:

  1. My exploration of the modern data stack and selected engineering topics such as Python/JVM development, Apache Spark, etc.
  2. Stories about my contributions to OSS projects
  3. Reports from my personal bikepacking tours

For any questions related to this blog, you can contact me via email: ssinchenko@apache.org

Graph Embeddings at scale with Spark and GraphFrames

In this blog post, I will recount my experience working on the addition of the graph embeddings API to the GraphFrames library. I will start with a top-level overview of the vertex representation learning task and the existing approaches. Then, I will focus on generating embeddings from random walks: I will explain the intuition behind and the implementation details of random walks in GraphFrames, as well as the chosen trade-offs and limitations. Given the sequences of graph vertices produced by such walks, I will explain the possible ways to learn vertex representations from them. First, I will analyze the common approach of using the word2vec model and its limitations, especially the limitations of the Apache Spark ML Word2vec implementation. I will then present the results of my analysis of alternative approaches and their respective advantages and disadvantages. Next, I will explain the intuition behind the Hash2vec approach that I chose as a starting point for GraphFrames, as well as the details of its implementation in Scala and Spark. Finally, I will present an experimental comparison of Hash2vec and Word2vec embeddings and share my thoughts about future directions.
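
A rough sketch of the word2vec route mentioned above (not the GraphFrames API itself): random-walk sequences are treated as "sentences" of vertex ids and fed to Spark ML's Word2Vec. The column names, toy walks, and parameters here are my own assumptions for illustration.

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("walks2vec").getOrCreate()
import spark.implicits._

// Hypothetical random walks: each row is a sequence of vertex ids treated as a "sentence"
val walks = Seq(
  Seq("a", "b", "c", "d"),
  Seq("b", "c", "e", "a")
).toDF("walk")

// Train Word2Vec on the walks, exactly as one would on tokenized text
val model = new Word2Vec()
  .setInputCol("walk")
  .setOutputCol("embedding")
  .setVectorSize(64)
  .setWindowSize(5)
  .fit(walks)

// One dense vector per vertex id
model.getVectors.show(truncate = false)
```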

December 12, 2025 · 40 min · Sem Sinchenko

Graphs, Algorithms, and My First Impression of DataFusion

I don't think anyone besides me has considered using Apache DataFusion to write graph algorithms. I still don't fully understand DataFusion's place in the world of graphs, but I'd like to share my initial experience with it. Spoiler alert: It's surprisingly good! In this post, I will explain the weakly connected components problem and its close relationship to a common problem in modern data warehouses (DWHs): identity resolution. I will also describe an algorithm for connected components in a MapReduce paradigm. I consider this to be the algorithm that strikes the best balance between performance and simplicity. Finally, I'll present my DataFusion-based implementation of the algorithm and the results of toy benchmarks on a graph containing four million nodes and 129 million edges. As always, keep in mind that this post is very opinionated!
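
The post itself uses DataFusion; purely to illustrate the idea of computing connected components with dataframe operations, here is a naive min-label propagation written with Spark/Scala DataFrames. It is not the MapReduce algorithm from the post (and is much less efficient), and the toy edge list is mine.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("naive-cc").getOrCreate()
import spark.implicits._

// Undirected edges; every vertex starts in its own component, labelled by its own id
val edges = Seq((1L, 2L), (2L, 3L), (4L, 5L)).toDF("src", "dst")
val symmetric = edges.union(edges.select($"dst".as("src"), $"src".as("dst")))
var labels = symmetric.select($"src".as("id")).distinct().withColumn("component", $"id")

// Each round, every vertex takes the minimum component id among itself and its neighbours.
// A real implementation iterates until no label changes; a few rounds suffice for this toy graph.
for (_ <- 1 to 3) {
  val fromNeighbours = symmetric
    .join(labels, symmetric("dst") === labels("id"))
    .select(symmetric("src").as("id"), $"component")
  labels = labels
    .union(fromNeighbours)
    .groupBy("id")
    .agg(min($"component").as("component"))
}

labels.orderBy("id").show()
```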

November 25, 2025 · 29 min · Sem Sinchenko

Benchmarking a Spark library with JMH

In this blog post, I will provide an end-to-end example of how to integrate JMH benchmarks for an Apache Spark-based library into an SBT build. I will cover aspects that are often poorly documented, such as setting up a Spark Session and datasets as shared resources, configuring driver memory for standalone Spark with JMH, and more. Additionally, as a bonus, I will demonstrate how to integrate benchmark results into the library’s documentation using the Typelevel Laika static site generator. While many of these steps may seem obvious to experienced JVM developers, I personally spent several hours figuring out how to configure them correctly. From what I have observed, the subject of Apache Spark with JMH is underexplored, and online searches did not yield much guidance. Thus, my hope is that this short post will help someone someday and save them a few hours of effort.
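
To give a flavour of the shared-resources setup mentioned above, here is a minimal sketch of a JMH state class holding a local SparkSession and a dataset. The class, names, and sizes are my own illustration, not code from the post.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.openjdk.jmh.annotations._

// One SparkSession and one dataset shared by all benchmark invocations within a trial
@State(Scope.Benchmark)
class SparkState {
  var spark: SparkSession = _
  var df: DataFrame = _

  @Setup(Level.Trial)
  def setup(): Unit = {
    spark = SparkSession.builder()
      .master("local[*]")
      .appName("jmh-benchmarks")
      .getOrCreate()
    df = spark.range(1000000L).toDF("id")
  }

  @TearDown(Level.Trial)
  def tearDown(): Unit = spark.stop()
}

class MyLibraryBench {
  @Benchmark
  def countIds(state: SparkState): Long = state.df.count()
}
```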

September 2, 2025 · 8 min · Sem Sinchenko

Dreaming of Graphs in the Open Lakehouse

While Open Lakehouse platforms now natively support tables, geospatial data, vectors, and more, property graphs are still missing. In the age of AI and growing interest in Graph RAG, graphs are becoming especially relevant: there's a need to deliver Knowledge Graphs to RAG systems, with standards, ETL, and frameworks for different scenarios. There's a young project, Apache GraphAr (incubating), that aims to define a storage standard. For processing, good tooling already exists. GraphFrames is like Spark for Iceberg: batch-oriented and scalable on distributed clusters; Kuzu is like DuckDB for Iceberg: fast, in-memory, and in-process; Apache HugeGraph is like ClickHouse or Doris for graphs: a standalone server for queries. I'm also currently working on graphframes-rs to bring Apache DataFusion and its ecosystem into this picture. All the pieces seem to be here; it just remains to put them together. More thoughts in the full post.

June 26, 2025 · 13 min · Sem Sinchenko

Why is Apache Spark often considered slow?

The question of why Apache Spark is "slow" is one of the questions I hear most often from junior engineers and people I mentor. While the claim is partially true, it needs clarification. TLDR: OSS Spark is a multi-purpose engine designed to handle different kinds of workloads. Under the hood, Spark uses data-centric code generation, but it also has some vectorization as well as the option to fall back to a pure Volcano mode. Because of that, Spark can be considered a hybrid engine that can benefit from all these approaches. But because of its multi-purpose nature, it will almost always be slower than pure vectorized engines like Trino on OLAP workloads over columnar data, except in rare cases such as a large share of nulls or deep branching in the query. In this blog post, I try to explain the statement above.
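
A quick way to see the code-generation side of this for yourself: Spark can print the Java code produced by whole-stage codegen for a query. The query below is an arbitrary example of mine.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("codegen-demo").getOrCreate()

val df = spark.range(0, 1000000)
  .withColumn("bucket", col("id") % 10)
  .groupBy("bucket")
  .agg(sum("id").as("total"))

// Prints the physical plan together with the generated Java code for each whole-stage-codegen subtree
df.explain("codegen")
```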

June 18, 2025 · 21 min · Sem Sinchenko

Apache DataFusion Comet and the story of my first contribution to it

In this blog post, I will provide a brief high-level overview of projects designed to accelerate Apache Spark through native physical execution, including Databricks Photon, Apache DataFusion Comet, and Apache Gluten (incubating). I will explain the problems these projects aim to solve and their approaches. The main focus will be on the Comet project, particularly its internal architecture. Additionally, I will share my personal experience of making my first significant contribution to the project. This will include not only a description of the problem I solved and my solution, but also insights into the overall contribution experience and the pull request review process.
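
As a hint of how such accelerators attach to Spark, this is roughly how Comet is enabled through Spark's plugin mechanism. The config keys below are my recollection of the Comet documentation and may differ between versions; the Comet jar also has to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Assumed config keys; check the Comet documentation for your Spark/Comet versions
val spark = SparkSession.builder()
  .appName("comet-demo")
  .config("spark.plugins", "org.apache.spark.CometPlugin")
  .config("spark.comet.enabled", "true")
  .config("spark.comet.exec.enabled", "true")
  .getOrCreate()

// Operators picked up by Comet should show up as Comet* nodes in the physical plan
spark.range(0, 1000000).selectExpr("sum(id) as total").explain()
```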

November 22, 2024 · 25 min · Sem Sinchenko

Generating H2O benchmark data using Rust and PyArrow

Preface: I would like to express my gratitude to Matthew Powers for testing my project and providing feedback, and to Steve Russo for offering a valuable review of my code and drawing my attention to avoiding the use of unwrap. Prior to his review, some parts of the code looked like this:

    let distr_k = Uniform::<i64>::try_from(1..=k).unwrap();
    let distr_nk = Uniform::<i64>::try_from(1..=(n / k)).unwrap();
    let distr_5 = Uniform::<i64>::try_from(1..=5).unwrap();
    let distr_15 = Uniform::<i64>::try_from(1..=15).unwrap();
    let distr_float = Uniform::<f64>::try_from(0.0..=100.0).unwrap();
    let distr_nas = Uniform::<i64>::try_from(0..=100).unwrap();

...

October 30, 2024 · 17 min · Sem Sinchenko

Why I think that Hive metastore is still unbeatable even by modern solutions like Unity or Polaris

Open table formats like Apache Iceberg and Delta are evolving rapidly today. Developers worldwide are creating both open-source and proprietary custom formats for specific tasks such as data streaming, graph data, and embeddings. Additionally, we have numerous legacy and highly specific data sources, such as logs in custom formats or collections of old Excel files. This diversity is precisely why I believe that extensibility, or the ability to implement custom input and output formats, is crucial. Unfortunately, this feature, which is present in Hive Metastore, is missing in modern data catalogs like Unity or Polaris.

October 22, 2024 · 7 min · Sem Sinchenko

Spark-Connect: I'm starting to love it!

Summary: This blog post is a detailed story about how I ported a popular data quality framework, AWS Deequ, to Spark-Connect. Deequ is a very cool, reliable, and scalable framework that allows one to compute a lot of metrics, checks, and anomaly detection suites on data using an Apache Spark cluster. But the Deequ core is a Scala library that uses a lot of low-level Apache Spark APIs for better performance, so it cannot be run directly in any Spark-Connect environment. To solve this problem, I defined protobuf messages for all the main structures of Deequ, like Check, Analyzer, AnomalyDetectionStrategy, etc., wrote a helper object that can re-create Deequ structures from the corresponding protobuf, and finally made a Spark-Connect native plugin that can process Deequ-specific messages, construct DQ suites from them, compute the report, and return the result to the Spark-Connect client. I tested my solution with PySpark Connect 3.5.1, but it should work with any of the existing Spark-Connect clients (Spark-Connect Java/Scala, Spark-Connect Go, Spark-Connect Rust, Spark-Connect C#, etc.). ...
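
For readers who have not used Deequ before, here is what a classic (non-Connect) verification looks like; these Check and VerificationSuite structures are the kind of objects that had to be expressed as protobuf messages for the plugin. The sample data and constraints are arbitrary.

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("deequ-demo").getOrCreate()
import spark.implicits._

val data = Seq((1, "thing-a", 10.0), (2, "thing-b", 20.0), (3, "thing-c", 30.0))
  .toDF("id", "name", "price")

// A classic Deequ run on a plain Spark cluster: declare checks, run them, inspect the status
val result = VerificationSuite()
  .onData(data)
  .addCheck(
    Check(CheckLevel.Error, "basic checks")
      .isComplete("id")
      .isUnique("id")
      .isNonNegative("price")
  )
  .run()

println(s"Overall status: ${result.status}")
```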

July 6, 2024 · 31 min · Sem Sinchenko

Unitycatalog: the first look

Databricks recently open-sourced Unitycatalog, a unified data catalog that aims to provide a single source of truth for data discovery, governance, and access control across multiple systems. In this blog post, we take a first look at Unitycatalog and dive into the source code to explain which features from the announcement are actually present. We explore how Unitycatalog addresses the challenges of managing data in a complex data landscape and discuss its potential impact on simplifying data governance and improving data accessibility for organizations.

June 17, 2024 · 8 min · Sem Sinchenko