PySpark column lineage

In this post, I will show you how to use information from the spark plan to track data lineage at the column level. This approach will also works with recently introduced SparkConnect.

January 10, 2024 · 14 min · Sem Sinchenko

How to estimate a PySpark DF size?

Sometimes it is an important question, how much memory does our DataFrame use? And there is no easy answer if you are working with PySpark. You can try to collect the data sample and run local memory profiler. You can estimate the size of the data in the source (for example, in parquet file). But we will go another way and try to analyze the logical plan of Spark from PySpark. In case when we are working with Scala Spark API we are able to work with resolved or unresolved logical plans and physical plan via a special API. But from PySpark API only string representation is available and we will work with it.

November 23, 2023 · 6 min · Sem Sinchenko

Cycling Eastern Serbia

I would like to tell you about my bicycle trip through Eastern Serbia. This part of the world is beautiful, but there is a big problem with lack of information in English. So I will try to fill this gap. The route I will describe starts in Belgrade, goes along the Danube River, through Djerdap National Park to the border with Serbia, and returns to Belgrade through Kucaj-Beljanica National Park.

October 12, 2023 · 19 min · Sem Sinchenko

Using Pyenv with NixOS

The problem Recently I decided to switch from Ubuntu to NixOS. Do not ask me why, it was just for fun mostly. One of the main ideas behind NixOS is to separation of dependencies: each new package is installed into separate sandbox with own scope of dependencies. By design it should make system significantly more stable but sometimes there are problems. One of such problems I faced with pyenv – a tool for simplifying python versions management....

September 29, 2023 · 2 min · Sem Sinchenko

Generating docstrings with GPT

Generating Python docstrings with GPT and Emacs Motivation There is an open source library in which I'm a maintainer. And recently I committed to creating docstrings for all the public functions and methods. I heard that recent Large Language Models (LLM) are good enough in the annotation of texts and documenting of code so I decided to try to use one of OpenAI models to solve this problem. In this post I will use Emacs plugins and extensions to generate docstrings but most advises about which prompt is better to use are generic and may be used with different code editors and IDE's....

April 6, 2023 · 5 min · Sem Sinchenko

Working With File System from PySpark

Working with File System from PySpark Motivation Any of us is working with File System in our work. Almost every pipeline or application has some kind of file-based configuration. Typically json or yaml files are used. Also for data pipelines, it is sometimes important to be able to write results or state them in a human-readable format. Or serialize some artifacts, like matplotlib plot, into bytes and write them to the disk....

March 30, 2023 · 10 min · Sem Sinchenko