Apache DataFusion Comet and the story of my first contribution to it

In this blog post, I will provide a brief high-level overview of projects designed to accelerate Apache Spark through native physical execution, including Databricks Photon, Apache DataFusion Comet, and Apache Gluten (incubating). I will explain the problems these projects aim to solve and the approaches they take. The main focus will be on the Comet project, particularly its internal architecture. Additionally, I will share my personal experience of making my first significant contribution to the project. This will include not only a description of the problem I solved and my solution but also insights into the overall contribution experience and the pull request review process.

November 22, 2024 · 25 min · Sem Sinchenko

Generating H2O benchmark data using Rust and PyArrow

Preface I would like to express my gratitude to Matthew Powers for testing my project and providing feedback, and to Steve Russo for offering a valuable review of my code and drawing my attention to avoiding the use of unwrap. Prior to his review, some parts of the code looked like this:

    let distr_k = Uniform::<i64>::try_from(1..=k).unwrap();
    let distr_nk = Uniform::<i64>::try_from(1..=(n / k)).unwrap();
    let distr_5 = Uniform::<i64>::try_from(1..=5).unwrap();
    let distr_15 = Uniform::<i64>::try_from(1....

October 30, 2024 · 17 min · Sem Sinchenko

Why I think that Hive Metastore is still unbeatable even by modern solutions like Unity or Polaris

Open table formats like Apache Iceberg and Delta are evolving rapidly today. Developers worldwide are creating both open-source and proprietary custom formats for specific tasks such as data streaming, graph data, and embeddings. Additionally, we have numerous legacy and highly specific data sources, such as logs in custom formats or collections of old Excel files. This diversity is precisely why I believe that extensibility, or the ability to implement custom input and output formats, is crucial. Unfortunately, this feature, which is present in Hive Metastore, is missing in modern data catalogs like Unity or Polaris.

October 22, 2024 · 7 min · Sem Sinchenko

Spark-Connect: I'm starting to love it!

Summary This blog post is a detailed story about how I ported a popular data quality framework, AWS Deequ, to Spark-Connect. Deequ is a very cool, reliable and scalable framework that allows computing a lot of metrics, checks and anomaly detection suites on the data using an Apache Spark cluster. But the Deequ core is a Scala library that uses a lot of low-level Apache Spark APIs for better performance, so it cannot run directly in any Spark-Connect environment....

July 6, 2024 · 31 min · Sem Sinchenko

Unitycatalog: the first look

Databricks recently open-sourced Unitycatalog, a unified data catalog that aims to provide a single source of truth for data discovery, governance, and access control across multiple systems. In this blog post, we take a first look at Unitycatalog and dive into the source code to explain which features from the announcement are actually present. We explore how Unitycatalog addresses the challenges of managing data in a complex data landscape and discuss its potential impact on simplifying data governance and improving data accessibility for organizations.

June 17, 2024 · 8 min · Sem Sinchenko

Effective asOfJoin in PySpark for Feature Store

Leveraging Time-Based Feature Stores for Efficient Data Science Workflows In our previous post, we briefly touched upon the concept of ML feature stores and their significance in streamlining machine learning workflows. Today, we'll explore a specific type of feature store known as a time-based feature store, which plays a crucial role in handling temporal data and enabling efficient feature retrieval for data science tasks. In this post we'll see how a feature-lookup problem can be solved effectively in PySpark using domain knowledge and an understanding of how Apache Spark works with partitions and columnar data formats....
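
The post digs into a more efficient, partition-aware implementation; as a baseline, here is a minimal sketch of what a naive as-of join (point-in-time feature lookup) looks like in plain PySpark. All table and column names are illustrative, not taken from the post:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Feature values, each valid from its feature_ts onward (ISO-8601
    # strings compare correctly as plain strings, keeping the sketch short).
    features = spark.createDataFrame(
        [("u1", "2024-01-01", 0.1), ("u1", "2024-02-01", 0.2)],
        ["user_id", "feature_ts", "score"],
    )
    # Events for which we need the feature value as of the event time.
    lookups = spark.createDataFrame(
        [("u1", "2024-01-15"), ("u1", "2024-03-01")],
        ["user_id", "event_ts"],
    )

    # Naive as-of join: keep only feature rows not newer than the event,
    # then take the latest surviving row per event with a window.
    w = Window.partitionBy("user_id", "event_ts").orderBy(F.col("feature_ts").desc())
    asof = (
        lookups.join(features, "user_id")
        .where(F.col("feature_ts") <= F.col("event_ts"))
        .withColumn("rn", F.row_number().over(w))
        .where(F.col("rn") == 1)
        .drop("rn")
    )
    asof.show()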

April 14, 2024 · 14 min · Sem Sinchenko

Computing ML Feature Store in PySpark

In this blog post, I will share my experience in building an ML Feature Store using PySpark. I will demonstrate how one can utilize case-when expressions to generate multiple aggregations with minimal data shuffling across the cluster. This approach is significantly more efficient than the naive method of using a combination of groupBy and pivot for generating aggregations (or features in ML terms).
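
As a minimal sketch of the idea (column names are made up for illustration): every feature becomes a conditional aggregation, so a single groupBy computes all of them in one shuffle, instead of the extra passes a pivot would require.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Toy transaction data; in ML terms, every aggregate below is a feature.
    df = spark.createDataFrame(
        [(1, "food", 10.0), (1, "fuel", 40.0), (2, "food", 7.5)],
        ["customer_id", "category", "amount"],
    )

    # One shuffle for all features: each case-when selects the rows that
    # belong to a feature, and F.sum aggregates them per customer.
    features = df.groupBy("customer_id").agg(
        *[
            F.sum(
                F.when(F.col("category") == c, F.col("amount")).otherwise(F.lit(0.0))
            ).alias(f"amount_{c}")
            for c in ["food", "fuel"]
        ]
    )
    features.show()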

April 7, 2024 · 8 min · Sem Sinchenko

Extending Spark Connect

This blog post presents a very detailed step-by-step guide on how to create a Spark Connect protocol extension in Java and call it from PySpark. It will also cover how to define all the necessary proto3 messages for it. At the end of this guide you will have a way to interact with the Spark JVM from PySpark almost like you can with py4j in a non-Connect version.
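
To give a flavor of the client side, here is a rough sketch of sending a custom command from PySpark. It assumes a matching plugin is already registered on the server; my_extension_pb2 and CustomCommand are hypothetical names generated from your own .proto file, and spark.client / execute_command refer to the Spark Connect client internals of recent PySpark versions:

    import pyspark.sql.connect.proto as pb2

    # Hypothetical module generated by protoc from our extension's .proto file.
    import my_extension_pb2


    def send_custom_command(spark, payload: str) -> None:
        # Spark Connect's Command message reserves a google.protobuf.Any
        # field named `extension` for third-party plugins: pack our custom
        # message into it.
        command = pb2.Command()
        command.extension.Pack(my_extension_pb2.CustomCommand(payload=payload))
        # Ship the command to the server-side plugin via the underlying client.
        spark.client.execute_command(command)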

March 4, 2024 · 22 min · Sem Sinchenko

Supporting multiple Apache Spark versions with Maven

I recently had the opportunity to work on an open source project that implements a custom Apache Spark data source and associated logic for working with graph data. The code was written to work with Apache Spark 3.2.2, and I am committed to extending support to multiple versions of Spark. In this blog post I want to show how the structure of such a project can be organized using Maven profiles.
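
As a sketch of what this looks like (version numbers are assumptions, not the project's actual pom.xml), the pom declares one profile per supported Spark line, each pinning its own properties:

    <profiles>
      <profile>
        <!-- Default build target when no -P flag is passed -->
        <id>spark-3.2</id>
        <activation>
          <activeByDefault>true</activeByDefault>
        </activation>
        <properties>
          <spark.version>3.2.2</spark.version>
          <scala.binary.version>2.12</scala.binary.version>
        </properties>
      </profile>
      <profile>
        <id>spark-3.5</id>
        <properties>
          <spark.version>3.5.1</spark.version>
          <scala.binary.version>2.12</scala.binary.version>
        </properties>
      </profile>
    </profiles>

Dependencies then reference spark-sql_${scala.binary.version} and ${spark.version}, and a concrete Spark line is selected at build time with, e.g., mvn -Pspark-3.5 package.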

February 25, 2024 · 7 min · Sem Sinchenko

How Databricks Runtime 14.x destroyed compatibility with third-party PySpark packages

In this post, I want to discuss the breaking changes in the latest LTS release of the Databricks runtime. This release introduced Spark Connect as the default way to work with shared clusters. I will give a brief introduction to the topic of internal JVM calls and Spark Connect, provide examples of third-party OSS projects broken in 14.3, and try to understand the reasons for such a move by Databricks.

February 22, 2024 · 10 min · Sem Sinchenko