Spark Tips

Basics

Spark is an open-source framework for processing big data on clusters. Version 2.3.1 has already been installed on your course VM. We will be using Spark primarily through PySpark, Spark’s Python API. You can find PySpark’s complete documentation here; the parts most relevant to this course are:

For this course, instead of running Spark on a cluster consisting of many machines, we are simply going to run Spark standalone within your VM. That said, it should be relatively straightforward to run the same PySpark code, with little to no modification, on a much more powerful cluster; Spark would automatically take care of parallelization for you.
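In practice, the only part of the code tied to where Spark runs is the master URL used when the session is created. A minimal sketch (the application name here is hypothetical):

from pyspark.sql import SparkSession

# Standalone mode inside the VM: use all available local cores.
spark = (SparkSession.builder
         .appName('db-beers-example')   # hypothetical application name
         .master('local[*]')
         .getOrCreate())

# On a real cluster you would typically omit .master(...) and let the launch
# mechanism (e.g. spark-submit --master ...) supply it; the rest of the code
# stays the same.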

Working with the PySpark shell

You can start an interactive PySpark shell using the command pyspark in your VM. You might see a bunch of harmless warning messages on startup. Then you will be presented with a “>>>” prompt; you are now inside the pyspark shell. Inside this shell, you can issue Python commands that invoke Spark functionality. For example, the following commands load our familiar bars and serves tables into DataFrames and show their contents:

bars = spark.read.csv('/opt/dbcourse/examples/db-beers/data/Bar.dat').toDF('name', 'address')
serves = spark.read.csv('/opt/dbcourse/examples/db-beers/data/Serves.dat').toDF('bar', 'beer', 'price')
bars.show()
serves.show()
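Note that spark.read.csv treats every column as a string unless told otherwise. If you want to check, or would rather have proper numeric types, here is an optional sketch (not required for the examples below):

bars.printSchema()     # all columns come back as strings by default
serves.printSchema()

# Optional: let Spark infer column types (e.g. price as a double) while reading.
serves_typed = spark.read.csv('/opt/dbcourse/examples/db-beers/data/Serves.dat',
                              inferSchema=True).toDF('bar', 'beer', 'price')
serves_typed.printSchema()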

You can then run some queries:

from pyspark.sql import functions as F
serves.groupBy('beer').agg(F.avg('price')).show()
serves.join(bars, bars.name == serves.bar).groupBy('bar').agg(F.avg('price')).sort('avg(price)', ascending=False).show()
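If you are more comfortable with SQL, you can register the DataFrames as temporary views and issue queries through spark.sql instead. A sketch equivalent to the second query above:

bars.createOrReplaceTempView('bars')
serves.createOrReplaceTempView('serves')
spark.sql('''
    SELECT bar, AVG(price) AS avg_price
    FROM serves, bars
    WHERE bars.name = serves.bar
    GROUP BY bar
    ORDER BY avg_price DESC
''').show()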

To exit from the pyspark shell, type:

exit()

Working with Spark in a Python program

On your VM, you will find an example program /opt/dbcourse/assignments/hw4-x2/spark.py, with standard boilerplate code that should get you started. Just run it from the VM command line as you would any other Python program.
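For reference, a minimal standalone PySpark program looks roughly like the following. This is only a sketch of typical boilerplate (with a hypothetical application name); the actual spark.py on your VM may differ in its details.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Unlike the pyspark shell, a standalone program must create its own session.
spark = SparkSession.builder.appName('hw4-example').getOrCreate()

serves = spark.read.csv('/opt/dbcourse/examples/db-beers/data/Serves.dat')\
    .toDF('bar', 'beer', 'price')

# Example query: average price per beer.
serves.groupBy('beer').agg(F.avg('price')).show()

spark.stop()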