Beginning Apache Spark 3 Pdf ((new)) -

Run with:

df = spark.read.parquet("sales.parquet") df.filter("amount > 1000").groupBy("region").count().show() You can register DataFrames as temporary views and run SQL: beginning apache spark 3 pdf

# Read df = spark.read.option("header", "true").csv("path/to/file.csv") df.write.parquet("output.parquet") 4.2 Common Transformations | Operation | Example | |------------------|-------------------------------------------| | Select columns | df.select("name", "age") | | Filter rows | df.filter(df.age > 21) | | Add column | df.withColumn("new", df.value * 2) | | Group and aggregate | df.groupBy("dept").avg("salary") | | Join | df1.join(df2, "id", "inner") | 4.3 Handling Missing Data df.dropna(how="any", subset=["important_col"]) df.fillna("age": 0, "name": "unknown") 4.4 User‑Defined Functions (UDFs) When built‑in functions are insufficient: Run with: df = spark

from pyspark.sql import SparkSession spark = SparkSession.builder .appName("MyApp") .config("spark.sql.adaptive.enabled", "true") .getOrCreate() 3.1 RDD – The Original Foundation RDDs (Resilient Distributed Datasets) are low‑level, immutable, partitioned collections. They provide fault tolerance via lineage. However, they are not recommended for new projects because they lack optimization. squared_udf = udf(squared, IntegerType()) df

squared_udf = udf(squared, IntegerType()) df.withColumn("squared_val", squared_udf(df.value))

spark.stop()

Example:

Discover more from Set The Tape

Subscribe now to keep reading and get access to the full archive.

Continue reading