Apache DataFusion Comet and the story of my first contribution to it

In this blog post, I will provide a brief high-level overview of projects designed to accelerate Apache Spark through native physical execution, including Databricks Photon, Apache DataFusion Comet, and Apache Gluten (incubating). I will explain the problems these projects aim to solve and the approaches they take. The main focus will be on the Comet project, particularly its internal architecture. Additionally, I will share my personal experience of making my first significant contribution to the project. This will include not only a description of the problem I solved and my solution but also insights into the overall contribution experience and the pull request review process.

November 22, 2024 · 25 min · Sem Sinchenko

Generating H2O benchmark data using Rust and PyArrow

Preface I would like to express my gratitude to Matthew Powers for testing my project and providing feedback, and to Steve Russo for offering a valuable review of my code and drawing my attention to avoiding the use of unwrap. Prior to his review, some parts of the code looked like this:

    let distr_k = Uniform::<i64>::try_from(1..=k).unwrap();
    let distr_nk = Uniform::<i64>::try_from(1..=(n / k)).unwrap();
    let distr_5 = Uniform::<i64>::try_from(1..=5).unwrap();
    let distr_15 = Uniform::<i64>::try_from(1....

October 30, 2024 · 17 min · Sem Sinchenko

Why I think that Hive Metastore is still unbeatable even by modern solutions like Unity or Polaris

Open table formats like Apache Iceberg and Delta are evolving rapidly today. Developers worldwide are creating both open-source and proprietary custom formats for specific tasks such as data streaming, graph data, and embeddings. Additionally, we have numerous legacy and highly specific data sources, such as logs in custom formats or collections of old Excel files. This diversity is precisely why I believe that extensibility, or the ability to implement custom input and output formats, is crucial. Unfortunately, this feature, which is present in Hive Metastore, is missing in modern data catalogs like Unity or Polaris.

October 22, 2024 · 7 min · Sem Sinchenko

Spark-Connect: I'm starting to love it!

Summary This blog post is a detailed story about how I ported a popular data quality framework, AWS Deequ, to Spark-Connect. Deequ is a very cool, reliable and scalable framework that allows computing a lot of metrics, checks and anomaly detection suites on the data using an Apache Spark cluster. But the Deequ core is a Scala library that uses a lot of low-level Apache Spark APIs for better performance, so it cannot run directly in any Spark-Connect environment....

July 6, 2024 · 31 min · Sem Sinchenko

Unitycatalog: the first look

Databricks recently open-sourced Unitycatalog, a unified data catalog that aims to provide a single source of truth for data discovery, governance, and access control across multiple systems. In this blog post, we take a first look at Unitycatalog and dive into the source code to explain which features from the announcement are actually present. We explore how Unitycatalog addresses the challenges of managing data in a complex data landscape and discuss its potential impact on simplifying data governance and improving data accessibility for organizations.

June 17, 2024 · 8 min · Sem Sinchenko

Effective asOfJoin in PySpark for Feature Store

Leveraging Time-Based Feature Stores for Efficient Data Science Workflows In our previous post, we briefly touched upon the concept of ML feature stores and their significance in streamlining machine learning workflows. Today, we'll explore a specific type of feature store known as a time-based feature store, which plays a crucial role in handling temporal data and enabling efficient feature retrieval for data science tasks. In this post we'll see how a feature-lookup problem can be solved effectively in PySpark using domain knowledge and an understanding of how Apache Spark works with partitions and columnar data formats....
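
The post digs into a more efficient, partition-aware implementation; as a baseline, here is a minimal sketch of what a naive as-of join (point-in-time feature lookup) looks like in plain PySpark. All table and column names are illustrative, not taken from the post:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Feature values, each valid from its feature_ts onward (ISO-8601
    # strings compare correctly as plain strings, keeping the sketch short).
    features = spark.createDataFrame(
        [("u1", "2024-01-01", 0.1), ("u1", "2024-02-01", 0.2)],
        ["user_id", "feature_ts", "score"],
    )
    # Events for which we need the feature value as of the event time.
    lookups = spark.createDataFrame(
        [("u1", "2024-01-15"), ("u1", "2024-03-01")],
        ["user_id", "event_ts"],
    )

    # Naive as-of join: keep only feature rows not newer than the event,
    # then take the latest surviving row per event with a window.
    w = Window.partitionBy("user_id", "event_ts").orderBy(F.col("feature_ts").desc())
    asof = (
        lookups.join(features, "user_id")
        .where(F.col("feature_ts") <= F.col("event_ts"))
        .withColumn("rn", F.row_number().over(w))
        .where(F.col("rn") == 1)
        .drop("rn")
    )
    asof.show()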

April 14, 2024 · 14 min · Sem Sinchenko

Computing ML Feature Store in PySpark

In this blog post, I will share my experience in building an ML Feature Store using PySpark. I will demonstrate how one can utilize case-when expressions to generate multiple aggregations with minimal data shuffling across the cluster. This approach is significantly more efficient than the naive method of using a combination of groupBy and pivot for generating aggregations (or features in ML terms).
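
As a minimal sketch of the idea (column names are made up for illustration): every feature becomes a conditional aggregation, so a single groupBy computes all of them in one shuffle, instead of the extra passes a pivot would require.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Toy transaction data; in ML terms, every aggregate below is a feature.
    df = spark.createDataFrame(
        [(1, "food", 10.0), (1, "fuel", 40.0), (2, "food", 7.5)],
        ["customer_id", "category", "amount"],
    )

    # One shuffle for all features: each case-when selects the rows that
    # belong to a feature, and F.sum aggregates them per customer.
    features = df.groupBy("customer_id").agg(
        *[
            F.sum(
                F.when(F.col("category") == c, F.col("amount")).otherwise(F.lit(0.0))
            ).alias(f"amount_{c}")
            for c in ["food", "fuel"]
        ]
    )
    features.show()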

April 7, 2024 · 8 min · Sem Sinchenko

Extending Spark Connect

This blog post presents a very detailed step-by-step guide on how to create a Spark Connect protocol extension in Java and call it from PySpark. It will also cover how to define all the necessary proto3 messages for it. At the end of this guide you will have a way to interact with the Spark JVM from PySpark almost like you can with py4j in a non-Connect version.
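
To give a flavor of the client side, here is a rough sketch of sending a custom command from PySpark. It assumes a matching plugin is already registered on the server; my_extension_pb2 and CustomCommand are hypothetical names generated from your own .proto file, and spark.client / execute_command refer to the Spark Connect client internals of recent PySpark versions:

    import pyspark.sql.connect.proto as pb2

    # Hypothetical module generated by protoc from our extension's .proto file.
    import my_extension_pb2


    def send_custom_command(spark, payload: str) -> None:
        # Spark Connect's Command message reserves a google.protobuf.Any
        # field named `extension` for third-party plugins: pack our custom
        # message into it.
        command = pb2.Command()
        command.extension.Pack(my_extension_pb2.CustomCommand(payload=payload))
        # Ship the command to the server-side plugin via the underlying client.
        spark.client.execute_command(command)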

March 4, 2024 · 22 min · Sem Sinchenko

Supporting multiple Apache Spark versions with Maven

I recently had the opportunity to work on an open source project that implements a custom Apache Spark data source and associated logic for working with graph data. The code was written to work with Apache Spark 3.2.2, and I am committed to extending support to multiple versions of Spark. In this blog post I want to show how the structure of such a project can be organized using Maven profiles.
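
As a sketch of what this looks like (version numbers are assumptions, not the project's actual pom.xml), the pom declares one profile per supported Spark line, each pinning its own properties:

    <profiles>
      <profile>
        <!-- Default build target when no -P flag is passed -->
        <id>spark-3.2</id>
        <activation>
          <activeByDefault>true</activeByDefault>
        </activation>
        <properties>
          <spark.version>3.2.2</spark.version>
          <scala.binary.version>2.12</scala.binary.version>
        </properties>
      </profile>
      <profile>
        <id>spark-3.5</id>
        <properties>
          <spark.version>3.5.1</spark.version>
          <scala.binary.version>2.12</scala.binary.version>
        </properties>
      </profile>
    </profiles>

Dependencies then reference spark-sql_${scala.binary.version} and ${spark.version}, and a concrete Spark line is selected at build time with, e.g., mvn -Pspark-3.5 package.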

February 25, 2024 · 7 min · Sem Sinchenko

How Databricks Runtime 14.x destroyed compatibility with third-party PySpark packages

In this post, I want to discuss the breaking changes in the latest LTS release of the Databricks runtime. This release introduced Spark Connect as the default way to work with shared clusters. I will give a brief introduction to the topic of internal JVM calls and Spark Connect, provide examples of third-party OSS projects broken in 14.3, and try to understand the reasons for such a move by Databricks.

February 22, 2024 · 10 min · Sem Sinchenko