Sem Sinchenko

Computing ML Feature Store in PySpark

In this blog post, I will share my experience in building an ML Feature Store using PySpark. I will demonstrate how one can utilize case-when expressions to generate multiple aggregations with minimal data shuffling across the cluster. This approach is significantly more efficient than the naive method of using a combination of groupBy and pivot for generating aggregations (or features in ML terms).

Extending Spark Connect

This blog post presents a detailed step-by-step guide on how to create a Spark Connect protocol extension in Java and call it from PySpark. It also covers how to define all the necessary proto3 messages for it. By the end of this guide you will have a way to interact with the Spark JVM from PySpark almost as you can with py4j in the non-Connect version.

Supporting multiple Apache Spark versions with Maven

I recently had the opportunity to work on an open source project that implements a custom Apache Spark data source and associated logic for working with graph data. The code was written to work with Apache Spark 3.2.2. I am committed to extending support to multiple versions of Spark. In this blog post I want to show how the structure of such a project can be organized using Maven profiles.

How Databricks Runtime 14.x destroyed 3rd-party PySpark packages compatibility

In this post, I want to discuss the groundbreaking changes in the latest LTS release of the Databricks runtime. This release introduced Spark Connect as the default way to work with shared clusters. I will give a brief introduction to the topic of internal JVM calls and Spark Connect, provide examples of 3rd-party OSS projects broken in 14.3, and try to understand the reasons for such a move by Databricks.

PySpark column lineage

In this post, I will show you how to use information from the Spark plan to track data lineage at the column level. This approach also works with the recently introduced Spark Connect.

How to estimate a PySpark DF size?

Sometimes an important question is how much memory a DataFrame uses, and there is no easy answer if you are working with PySpark. You can collect a data sample and run a local memory profiler. You can estimate the size of the data at the source (for example, in a parquet file). But we will go another way and analyze the Spark logical plan from PySpark. With the Scala Spark API, we can work with resolved or unresolved logical plans and the physical plan via a special API. From the PySpark API, however, only the string representation is available, so that is what we will work with.

Cycling Eastern Serbia

I would like to tell you about my bicycle trip through Eastern Serbia. This part of the world is beautiful, but there is a big problem: a lack of information in English. So I will try to fill this gap. The route I will describe starts in Belgrade, goes along the Danube River, through Djerdap National Park to the border with Serbia, and returns to Belgrade through Kucaj-Beljanica National Park.

Using Pyenv with NixOS

The problem: Recently I decided to switch from Ubuntu to NixOS. Do not ask me why, it was mostly just for fun. One of the main ideas behind NixOS is the separation of dependencies: each new package is installed into a separate sandbox with its own scope of dependencies. By design this should make the system significantly more stable, but sometimes there are problems. I faced one such problem with pyenv, a tool for simplifying Python version management. ...

Generating docstrings with GPT

Generating Python docstrings with GPT and Emacs. Motivation: There is an open source library that I maintain, and I recently committed to creating docstrings for all of its public functions and methods. I heard that recent Large Language Models (LLMs) are good enough at annotating text and documenting code, so I decided to try one of the OpenAI models to solve this problem. In this post I will use Emacs plugins and extensions to generate docstrings, but most of the advice about which prompt works better is generic and may be used with other code editors and IDEs. ...

Working With File System from PySpark

Motivation: All of us work with the file system in our jobs. Almost every pipeline or application has some kind of file-based configuration, typically JSON or YAML files. For data pipelines, it is also sometimes important to be able to write results or state in a human-readable format, or to serialize some artifacts, like a matplotlib plot, into bytes and write them to disk. ...