PySpark column lineage
In this post, I will show you how to use information from the Spark plan to track data lineage at the column level. This approach also works with the recently introduced Spark Connect.
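To make the starting point concrete, here is a minimal sketch of how to print the plan text that a column-lineage parser would consume. The DataFrame and column names are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input standing in for a real table.
people = spark.createDataFrame([(1, "Alice", 34)], ["id", "name", "age"])
result = people.select(F.col("name"), (F.col("age") + 1).alias("age_plus_one"))

# The analyzed plan labels every attribute with a unique ID (e.g. "age#2L"),
# which is what lets us follow an output column back to its source columns.
# explain() is also available over Spark Connect.
result.explain(mode="extended")
```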
How much memory does our DataFrame use? It is sometimes an important question, and there is no easy answer when working with PySpark. You can collect a sample of the data and run a local memory profiler, or you can estimate the size of the data at the source (for example, in a Parquet file). But we will take another route and analyze Spark's logical plan from PySpark. With the Scala Spark API we can access resolved and unresolved logical plans and the physical plan through a dedicated API, but from the PySpark API only their string representation is available, so that is what we will work with.
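As a rough illustration of where that string representation comes from, here is a minimal sketch, assuming Spark 3.x and a locally created DataFrame; note that `_jdf` is an internal JVM handle and is not available over Spark Connect:

```python
# The optimized logical plan carries a sizeInBytes estimate, and
# explain(mode="cost") prints it as part of the plan text.
df = spark.range(1_000_000).selectExpr("id", "id * 2 AS doubled")
df.explain(mode="cost")  # look for "Statistics(sizeInBytes=...)" in the output

# Without Spark Connect, the underlying JVM objects are reachable via _jdf,
# so the same estimate can be read directly instead of parsing text:
size_in_bytes = df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes()
print(size_in_bytes)
```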
Working with the File System from PySpark. Motivation: all of us work with the file system in our day-to-day work. Almost every pipeline or application has some kind of file-based configuration, typically JSON or YAML files. For data pipelines it is also sometimes important to write results or state in a human-readable format, or to serialize artifacts, such as a matplotlib plot, into bytes and write them to disk....