Dreaming of Graphs in the Open Lakehouse

While Open Lakehouse platforms now natively support tables, geospatial data, vectors, and more, property graphs are still missing. In the age of AI and growing interest in Graph RAG, graphs are becoming especially relevant – there’s a need to deliver Knowledge Graphs to RAG systems, with standards, ETL, and frameworks for different scenarios. There’s a young project, Apache GraphAr (incubating), that aims to define a storage standard. For processing, there is a good tooling already. GraphFrames is like Spark for Iceberg – batch and scalable on distributed clusters; Kuzu is like DuckDB for Iceberg – fast, in-memory, and in-process; Apache HugeGraph is like ClickHouse or Doris for graphs – a standalone server for queries. I’m currently working also on graphframes-rs to bring Apache DataFusion and its ecosystem into this picture. All the pieces seem to be here—it just remains to put them together. More thoughts in the full post.

June 26, 2025 · 13 min · Sem Sinchenko