Apache Spark

Common Snippets

Initialize Spark Session

import os
import sys

# Homebrew install locations; adjust both paths to match your machine.
SPARK_HOME = "/opt/homebrew/Cellar/apache-spark/3.5.3/libexec"
JAVA_HOME = "/opt/homebrew/opt/openjdk@17"

os.environ["SPARK_HOME"] = SPARK_HOME
os.environ["JAVA_HOME"] = JAVA_HOME

# Make the PySpark and Py4J libraries bundled with Spark importable.
sys.path.extend([
    f"{SPARK_HOME}/python/lib/py4j-0.10.9.7-src.zip",
    f"{SPARK_HOME}/python/lib/pyspark.zip",
])

from pyspark.sql import SparkSession

# Run locally, using all available cores.
spark = (
    SparkSession.builder
    .master("local[*]")
    .getOrCreate()
)

Using findspark

Create Session

Read CSV

Drop null values

Target Encoding

Window Functions and Coalesce

Conditional column values
