Welcome to my Blog!

👋👋👋 Welcome to my personal blog!

I’m a Russian-born data engineer living in Belgrade, Serbia since 2022.

My blog mostly covers three topics:

  1. My exploration of the modern data stack and selected engineering topics such as Python/JVM development, Apache Spark, etc.
  2. Stories about my contributions to OSS projects
  3. Reports from my personal bikepacking tours

For any questions related to this blog, you can contact me via email: ssinchenko@apache.org

Org Mode in the AI Era: Organize Your Life in Plain Text, Then Automate It

Yes, I really ended up running headless Emacs in Docker Compose. Yes, it actually works. And yes, in this blog post I will explain why I think this setup makes much more sense than it sounds. I will start with the format question and why Org Mode still looks like the strongest plain-text foundation for combining notes, TODOs, scheduling, digests, and second-brain workflows. Then, I will explain why Emacs was chosen not just as an editor, but as the backend runtime for reliable Org processing and as the orchestrator for LLM-based automation. Finally, I will describe the two roles AI plays here. It is part of the system itself, but only inside carefully bounded workflows, and it is also the reason I could realistically build this kind of strange personal infrastructure in the first place. This project is not a product, not a SaaS, and not a generic framework. It is software for one user, built around one user’s workflows, from open-source pieces like Org Mode, gptel, and Elfeed. What changed in the AI era is that this kind of narrow, deeply personal software became much more realistic to build.

April 15, 2026 · 31 min · Sem Sinchenko

Reviving GraphFrames: Notes from Maintaining a 10-Year-Old OSS Project

About a year ago, I got involved in reviving and maintaining GraphFrames, a 10-year-old OSS project with a lot of history and not enough active maintenance. I was doing it neither for money nor to sell anything, but out of an admittedly old-fashioned belief in free software and in the idea that it is worth spending time on software that is genuinely useful to others. This post is a reflection on what that experience has actually looked like: a constant tension between building new features and doing unglamorous maintenance, inherited code and forgotten assumptions everywhere, and the persistent fear of breaking backward compatibility in a library that had long ago been wired into important production processes.

April 8, 2026 · 30 min · Sem Sinchenko

Why I (Still) Use Aider in 2026: Code Ownership, OpenSpec, and the Vibecoding Hype

In this blog post, I will share my thoughts on the current hype around "agentic coding" and why I still use Aider for human-in-the-loop pair programming in 2026. I will start with a top-level overview of the "vibecoding" trend and why generating thousands of lines of code with autonomous agents creates a massive code ownership crisis. Then, I will focus on the reality of maintaining existing OSS projects and why blind AI pull requests are a technical debt trap. Finally, I will explain why the industry needs a standardized approach to define boundaries for LLMs, and how reviewing structured formats like OpenSpec can be a better alternative to reviewing raw AI-generated code.

March 10, 2026 · 13 min · Sem Sinchenko

Graph Embeddings at Scale with Spark and GraphFrames

In this blog post, I will recount my experience working on the addition of the graph embeddings API to the GraphFrames library. I will start with a top-level overview of the vertex representation learning task and the existing approaches. Then, I will focus on generating embeddings from random walks. I will explain the intuition behind and the implementation details of Random Walks in GraphFrames, as well as the chosen trade-offs and limitations. Given a sequence of graph vertices, I will explain the possible ways to learn vertex representations from it. First, I will analyze the common approach of using the word2vec model and its limitations, especially the limitations of the Apache Spark ML Word2vec implementation. I will present the results of my analysis of alternative approaches and their respective advantages and disadvantages. Next, I will explain the intuition behind the Hash2vec approach that I chose as a starting point for GraphFrames, as well as the details of its implementation in Scala and Spark. Finally, I will present an experimental comparison of Hash2vec and Word2vec embeddings and share my thoughts about future directions.
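The random-walk stage of this pipeline is easy to convey on a single machine. Below is a minimal Python sketch of uniform random walks over an undirected edge list: the resulting vertex sequences are what a word2vec- or Hash2vec-style model would then consume. The function name and parameters are illustrative only, not the GraphFrames API or its distributed implementation.

```python
import random
from collections import defaultdict

def random_walks(edges, walks_per_vertex=2, walk_length=5, seed=42):
    """Generate fixed-length uniform random walks over an undirected graph.

    `edges` is a list of (src, dst) pairs. This is a toy single-machine
    sketch of the idea, not the scalable GraphFrames version.
    """
    rng = random.Random(seed)
    adj = defaultdict(list)
    for src, dst in edges:
        adj[src].append(dst)
        adj[dst].append(src)
    walks = []
    for start in adj:
        for _ in range(walks_per_vertex):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:
                    break  # dead end: emit a shorter walk
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

walks = random_walks([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])
```

Each walk is then treated like a "sentence" of vertex IDs, so any sequence-embedding model can be applied to it.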

December 12, 2025 · 40 min · Sem Sinchenko

Graphs, Algorithms, and My First Impression of DataFusion

I don’t think anyone besides me has considered using Apache DataFusion to write graph algorithms. I still don’t fully understand DataFusion’s place in the world of graphs, but I’d like to share my initial experience with it. Spoiler alert: It’s surprisingly good! In this post, I will explain the weakly connected components problem and its close relationship to the common problem of modern data warehouses (DWHs), namely, identity resolution. I will also describe an algorithm for connected components in a MapReduce paradigm. I consider this to be the algorithm that strikes the best balance between performance and simplicity. Finally, I’ll present my DataFusion-based implementation of the algorithm and the results of toy benchmarks on a graph containing four million nodes and 129 million edges. As always, keep in mind that this post is very opinionated!
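To convey the flavor of the problem, here is a toy Python sketch of weakly connected components via min-label propagation: every vertex repeatedly adopts the smallest label among itself and its neighbors until nothing changes, so each component ends up labeled by its smallest vertex ID. This is an illustration of the idea only, not the MapReduce algorithm or the DataFusion implementation discussed in the post.

```python
def connected_components(edges):
    """Weakly connected components by iterative min-label propagation.

    `edges` is a list of (src, dst) pairs with comparable vertex IDs.
    Toy single-machine sketch: each pass sweeps all edges and pushes
    the smaller endpoint label to the larger one until a fixed point.
    """
    label = {}
    for src, dst in edges:
        label.setdefault(src, src)  # every vertex starts in its own component
        label.setdefault(dst, dst)
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            lo = min(label[src], label[dst])
            if label[src] != lo:
                label[src] = lo
                changed = True
            if label[dst] != lo:
                label[dst] = lo
                changed = True
    return label
```

For identity resolution, the vertices would be user identifiers and the edges observed matches; the final labels are the resolved identities.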

November 25, 2025 · 29 min · Sem Sinchenko

Benchmarking a Spark library with JMH

In this blog post, I will provide an end-to-end example of how to integrate JMH benchmarks for an Apache Spark-based library into an SBT build. I will cover aspects that are often poorly documented, such as setting up a Spark Session and datasets as shared resources, configuring driver memory for standalone Spark with JMH, and more. Additionally, as a bonus, I will demonstrate how to integrate benchmark results into the library’s documentation using the Typelevel Laika static site generator. While many of these steps may seem obvious to experienced JVM developers, I personally spent several hours figuring out how to configure them correctly. From what I have observed, the subject of Apache Spark with JMH is underexplored, and online searches did not yield much guidance. Thus, my hope is that this short post will help someone someday and save them a few hours of effort.

September 2, 2025 · 8 min · Sem Sinchenko

Dreaming of Graphs in the Open Lakehouse

While Open Lakehouse platforms now natively support tables, geospatial data, vectors, and more, property graphs are still missing. In the age of AI and growing interest in Graph RAG, graphs are becoming especially relevant: there's a need to deliver Knowledge Graphs to RAG systems, with standards, ETL, and frameworks for different scenarios. There's a young project, Apache GraphAr (incubating), that aims to define a storage standard. For processing, good tooling already exists. GraphFrames is like Spark for Iceberg: batch and scalable on distributed clusters. Kuzu is like DuckDB for Iceberg: fast, in-memory, and in-process. Apache HugeGraph is like ClickHouse or Doris for graphs: a standalone server for queries. I'm also currently working on graphframes-rs to bring Apache DataFusion and its ecosystem into this picture. All the pieces seem to be here; it just remains to put them together. More thoughts in the full post.

June 26, 2025 · 13 min · Sem Sinchenko

Why Is Apache Spark Often Considered Slow?

The question of why Apache Spark is "slow" is one of the questions I hear most often from junior engineers and people I mentor. While the claim is partially true, it needs clarification. TL;DR: OSS Spark is a multi-purpose engine designed to handle different kinds of workloads. Under the hood, Spark uses data-centric code generation, but it also has some vectorization as well as the option to fall back to a pure Volcano mode. Because of that, Spark can be considered a hybrid engine that can benefit from all these approaches. But because of its multi-purpose nature, it will almost always be slower than pure vectorized engines like Trino on OLAP workloads over columnar data, except in rare cases such as a large share of nulls or deep branching in the query. In this blog post, I try to explain the statement above.
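As a tiny illustration of what "data-centric code generation" means, the sketch below fuses a chain of per-row expressions into one generated function instead of interpreting them operator by operator. This loosely mimics, in Python, what Spark's whole-stage codegen does by emitting a single Java method for a pipeline of operators; it is a toy for intuition, not Spark code.

```python
def compile_fused(expr_templates):
    """Fuse a pipeline of per-row expression templates into one function.

    Each template contains `{0}` as a placeholder for its input, so the
    whole pipeline collapses into a single nested expression evaluated
    in one generated loop, with no per-operator dispatch per row.
    Toy illustration of data-centric code generation.
    """
    body = "x"
    for template in expr_templates:
        body = template.format(body)  # nest each operator around the previous one
    src = f"def fused(rows):\n    return [{body} for x in rows]"
    namespace = {}
    exec(src, namespace)  # compile the generated source into a function
    return namespace["fused"]

# A pipeline equivalent to `(x * 2) + 1`, fused into one pass over the rows.
fused = compile_fused(["({0}) * 2", "({0}) + 1"])
```

A vectorized engine would instead evaluate each operator over a whole column batch at a time; the post contrasts these two models and where each wins.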

June 18, 2025 · 21 min · Sem Sinchenko

Apache DataFusion Comet and the Story of My First Contribution to It

In this blog post, I will provide a brief high-level overview of projects designed to accelerate Apache Spark via native physical execution, including Databricks Photon, Apache DataFusion Comet, and Apache Gluten (incubating). I will explain the problems these projects aim to solve and their approaches. The main focus will be on the Comet project, particularly its internal architecture. Additionally, I will share my personal experience of making my first significant contribution to the project. This will include not only a description of the problem I solved and my solution but also insights into the overall contribution experience and the pull request review process.

November 22, 2024 · 25 min · Sem Sinchenko

Generating H2O benchmark data using Rust and PyArrow

Preface

I would like to express my gratitude to Matthew Powers for testing my project and providing feedback, and to Steve Russo for offering a valuable review of my code and drawing my attention to avoiding the use of unwrap. Prior to his review, some parts of the code looked like this:

    let distr_k = Uniform::<i64>::try_from(1..=k).unwrap();
    let distr_nk = Uniform::<i64>::try_from(1..=(n / k)).unwrap();
    let distr_5 = Uniform::<i64>::try_from(1..=5).unwrap();
    let distr_15 = Uniform::<i64>::try_from(1..=15).unwrap();
    let distr_float = Uniform::<f64>::try_from(0.0..=100.0).unwrap();
    let distr_nas = Uniform::<i64>::try_from(0..=100).unwrap();

...

October 30, 2024 · 17 min · Sem Sinchenko