Preface
There is no rocket science in this blog post, just some examples and information from the official Maven documentation. That said, when I started working on this topic, it was not easy to find an example on the Internet; even ChatGPT did not give me a 100% working solution for a complex Maven project. From this point of view, I think this post might be useful for someone who wants to dive into a similar topic from scratch.
Introduction
I have always liked two topics: Apache Spark and graphs. Recently I had a nice chance to work on an open source project that combines both. It is the GraphAr project, a novel way and format to store network data in data lakes. You can think of it as a Delta Lake for big graphs, because the general idea is quite close: there is a metadata file, and the underlying data (vertices and edges) is stored in parquet (or orc, or csv) files. The project is quite young, so initially only Apache Spark 3.2.2 was supported. I had been looking for a new OSS project to contribute to for a long time, so I committed to extending the support to at least two versions: 3.2.2 and 3.3.4.
Initially, the project had a structure like this:
| graphar
|- spark
|-- src
|--- main
|---- scala/com/alibaba/graphar
|----- datasources
|------ GarDataSource.scala
|------ ...
|----- commonCode
|------ CommonClass.scala
|------ AnotherCommonClass.scala
|-- pom.xml
The main problem: Spark Datasources API
An implementation of tools for working with the GraphAr format in Spark contains two main parts:
- An implementation of helpers and methods for working with metadata
- An implementation of datasources that allows users to write df.write.format("com.alibaba.graphar.datasources.GarDataSource").save("...") (see the sketch right after this list)
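For context, here is roughly how an end user interacts with such a datasource. This is only a sketch: the paths are placeholders, and the real datasource may require additional options that are omitted here.

import org.apache.spark.sql.SparkSession

// Sketch only: paths are placeholders and the real datasource may require
// extra options (file type, chunk size, etc.) that are not shown here.
val spark = SparkSession.builder().master("local[*]").appName("gar-example").getOrCreate()

val vertices = spark.read.parquet("/tmp/vertices-input")

// Write through the GraphAr datasource...
vertices.write
  .format("com.alibaba.graphar.datasources.GarDataSource")
  .save("/tmp/graphar/vertex_chunks")

// ...and read the result back through the same datasource.
val readBack = spark.read
  .format("com.alibaba.graphar.datasources.GarDataSource")
  .load("/tmp/graphar/vertex_chunks")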
While the first part is mostly generic and relies on Spark's @Stable APIs, the second one is quite tricky and calls into the parquet, orc and csv ReaderFactory / OutputWriter implementations. Because Spark keeps moving its internal datasources from v1 to v2, this second part is the biggest problem. Even switching from 3.2.x to 3.3.x breaks everything.
The obvious solution: reflection
The first and quite obvious thought is, of course, to use the Reflection API. It is a relatively low-level JVM API that allows you to look up classes and call methods at runtime. For example, let's imagine we have a class with a static method staticMethod(s: String) => Int in Spark 3.3.x but staticMethod(b: Boolean) => Int in Spark 3.2.
val myClass = Class.forName("com.mypackage.MyClass")

val result = spark.version match {
  case s if s.startsWith("3.3") =>
    // Spark 3.3.x: staticMethod(s: String): Int
    myClass.getMethod("staticMethod", classOf[String]).invoke(null, "true").asInstanceOf[Int]
  case _ =>
    // Spark 3.2.x: staticMethod(b: Boolean): Int
    myClass.getMethod("staticMethod", classOf[Boolean]).invoke(null, java.lang.Boolean.TRUE).asInstanceOf[Int]
}
The first problem is that such code is very hard to read and maintain. You lose all the capabilities of modern IDEs like Emacs that show you inline errors and suggestions. You also lose the advantages of a compiled language, because if you make a typo in the name of the class or the method, you will only find out at runtime. By the way, Spark itself uses the Reflection API to support both Hadoop 2 and Hadoop 3.
The second problem is that reflection can help you resolve simple cases where the API of some library has changed. But reflection cannot help you if, for example, you need to extend a class or implement an interface that changed from one version of the library to another. That was exactly the case with com.alibaba.graphar.datasources.
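To make that limitation concrete, here is a deliberately simplified, hypothetical sketch (these are not real Spark traits):

// Hypothetical illustration only: suppose the interface you must implement
// looks like this in version A of a library...
trait ReaderFactoryV1 {
  def createReader(partitionId: Int): Iterator[String]
}

// ...and like this in version B, where the abstract method changed its signature.
trait ReaderFactoryV2 {
  def createReader(partitionId: Int, schema: Seq[String]): Iterator[String]
}

// An implementation has to pick one shape at compile time; bytecode compiled
// against version A fails on version B (NoSuchMethodError / AbstractMethodError),
// and no reflection at the call site can fix that.
class GarReaderFactory extends ReaderFactoryV1 {
  override def createReader(partitionId: Int): Iterator[String] = Iterator.empty
}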
The right way to do it: Maven Profiles
Even though it is a slightly old-school tool and was never aimed specifically at Scala projects, Apache Maven is still a very popular and very reliable build system for any JVM project. Most importantly, Maven provides the Reactor, a mechanism for building multi-module projects with complex inter-dependencies.
Splitting GraphAr into a common part and a datasources part
The first thing I needed to do was to split the monolithic GraphAr Spark project into two parts:
- A common part that contains the code relying on the @Stable API of Spark
- A datasources subproject that contains the overrides of rapidly changing Spark internal classes
Because the project was expected to support multiple Spark versions in the future, I chose the following structure:
| graphar
|- spark
|-- graphar
|--- src/main/scala/com/alibaba/graphar/...
|--- pom.xml
|-- datasources-32
|--- src/main/scala/com/alibaba/graphar/...
|--- pom.xml
|-- datasources-33
|--- src/main/scala/com/alibaba/graphar/...
|--- pom.xml
|-- pom.xml
Here graphar contains the common code, and there are two versions of the datasources submodule: one for Spark 3.2.x specific code and one for Spark 3.3.x specific code.
Top-level pom.xml file
The top-level pom.xml is quite simple and mostly defines the profiles. We will use one profile for Spark 3.2.x and one for Spark 3.3.x:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.alibaba</groupId>
<artifactId>graphar</artifactId>
<version>${graphar.version}</version>
<packaging>pom</packaging>
<profiles>
<profile>
<id>datasources-32</id>
<properties>
<sbt.project.name>graphar</sbt.project.name>
...
<spark.version>3.2.2</spark.version>
...
<graphar.version>0.1.0-SNAPSHOT</graphar.version>
</properties>
<modules>
<module>graphar</module>
<module>datasources-32</module>
</modules>
<activation>
<activeByDefault>true</activeByDefault>
</activation>
</profile>
<profile>
<id>datasources-33</id>
<properties>
<sbt.project.name>graphar</sbt.project.name>
...
<spark.version>3.3.4</spark.version>
...
<graphar.version>0.1.0-SNAPSHOT</graphar.version>
</properties>
<modules>
<module>graphar</module>
<module>datasources-33</module>
</modules>
</profile>
</profiles>
<build>
<plugins>
...
</plugins>
</build>
</project>
What is important here is that Maven profiles do not allow you to override dependencies or other complex things. But they do allow you to create or override properties, and you can use a property, for example the Spark version, to drive dependency resolution further down!
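In other words, the trick condenses to this pair of snippets (taken from the poms shown above and below):

<!-- In a profile of the top-level pom.xml: -->
<properties>
  <spark.version>3.3.4</spark.version>
</properties>

<!-- In a submodule's dependencies: the version is resolved from the active profile -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_${scala.binary.version}</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
</dependency>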
To use a Reactor build, the top-level module should always use the pom packaging type.
From this point on, you can run any Maven command against a specific profile like this:
mvn clean package -P datasources-32
mvn clean package -P datasources-33
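As a side note, the Reactor also lets you build a single submodule together with its in-project dependencies via Maven's -pl / -am flags, e.g. building only the common module (the graphar directory) and whatever it needs:

mvn clean package -P datasources-33 -pl graphar -am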
A small note about IDE integration
For smooth integration with a language server (like Metals) you need to specify which profile should be used by default. You can mark a default profile in the top-level pom.xml in the following way (inside a profile tag):
<activation>
<activeByDefault>true</activeByDefault>
</activation>
datasources submodule pom.xml
It is important to have the scala-maven-plugin inside the pom.xml of every submodule that contains Scala code! Otherwise, even if the Reactor chooses the right compilation order, there will be errors, because plugins are not pushed down from the parent module to submodules!
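For reference, a minimal declaration of that plugin in a submodule's plugins section might look like this (the version below is just an example):

<!-- Minimal sketch; pin the plugin version your project actually uses. -->
<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
  <version>4.8.1</version>
  <executions>
    <execution>
      <goals>
        <goal>compile</goal>
        <goal>testCompile</goal>
      </goals>
    </execution>
  </executions>
</plugin>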
Any submodule in a multi-module project should contain its own pom.xml that declares the parent project. The cool thing is that inside a submodule pom you can refer to properties defined in the parent pom! Let's look at the GraphAr submodule pom for datasources:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>com.alibaba</groupId>
<artifactId>graphar</artifactId>
<version>${graphar.version}</version>
</parent>
<groupId>com.alibaba</groupId>
<artifactId>graphar-datasources</artifactId>
<version>${graphar.version}</version>
<packaging>jar</packaging>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
</dependencies>
<build>
<plugins>
...
</plugins>
</build>
</project>
As you can see, we define the spark-core and spark-sql dependencies using the parent module's properties spark.version and scala.binary.version!
<scope>provided</scope>
here means that the dependency's classes should not be included in the output JAR file; they are expected to be present on the classpath at runtime (the Spark runtime itself provides them when the job is submitted).
Commons submodule pom.xml
In our case, commons depends on the datasources implementation. But because the package and class names in both datasources-32 and datasources-33 are the same, we do not need to specify the dependency per profile. It is enough to specify it only once:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>com.alibaba</groupId>
<artifactId>graphar</artifactId>
<version>${graphar.version}</version>
</parent>
<groupId>com.alibaba</groupId>
<artifactId>graphar-commons</artifactId>
<version>${graphar.version}</version>
<packaging>jar</packaging>
<dependencies>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>graphar-datasources</artifactId>
<version>${graphar.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
...
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_${scala.binary.version}</artifactId>
<version>3.1.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
<scope>provided</scope>
</dependency>
...
</dependencies>
<build>
<plugins>
...
</plugins>
</build>
</project>
As you can see, in this case the Reactor will resolve the following dependency as an in-project one:
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>graphar-datasources</artifactId>
<version>${graphar.version}</version>
</dependency>
It is important to have scalatest in all the submodules that contain tests!
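If you run the Scala tests through Maven, the plugins section of such a submodule also needs a test runner wired in; a minimal sketch with the scalatest-maven-plugin (version and configuration here are illustrative) could look like this:

<!-- Example only: version and configuration are illustrative. -->
<plugin>
  <groupId>org.scalatest</groupId>
  <artifactId>scalatest-maven-plugin</artifactId>
  <version>2.2.0</version>
  <executions>
    <execution>
      <id>test</id>
      <goals>
        <goal>test</goal>
      </goals>
    </execution>
  </executions>
</plugin>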
What a Reactor build looks like
Let's look at the output of, for example, mvn clean package -P datasources-33:
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] graphar [pom]
[INFO] graphar-datasources [jar]
[INFO] graphar-commons [jar]
[INFO]
In this case, Maven figured out that graphar is just the top-level pom module and that datasources should be compiled first because commons depends on it.
You can use the GraphAr Spark implementation as a source of inspiration for your own Spark-related projects. I hope you found this post useful!