Notes on Spark 2.0 Performance

14 Jun 2016 spark

This post is a collection of notes I took during a Meetup about Spark 2.0 performance. The main takeaway: don’t use the RDD API directly unless you have to. The DataFrame and Dataset abstractions offer better performance because Spark can inspect and optimize queries expressed through them, which it cannot do for the opaque functions passed to RDD operations. This holds in Spark 1.6 as well as 2.0.
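To make the "optimizations they can perform" point concrete, here is a toy model in plain Python. It is not Spark code and not from the talk: `Source`, `Query`, `rdd_style` and `dataframe_style` are hypothetical names invented for this sketch. It only illustrates the general idea that a query described as data can be analyzed (here, to read fewer columns), while an arbitrary function cannot. Spark's real optimizer does considerably more than this.

```python
from dataclasses import dataclass


class Source:
    """Hypothetical columnar data source that counts how many cells it reads."""

    def __init__(self, rows):
        self.rows = rows
        self.cells_read = 0

    def scan(self, columns=None):
        # With no column list, every column of every row must be read.
        for row in self.rows:
            cols = list(row) if columns is None else columns
            self.cells_read += len(cols)
            yield {c: row[c] for c in cols}


def rdd_style(source, predicate, mapper):
    """RDD-like: predicate and mapper are opaque functions, so the engine
    cannot tell which columns they touch and has to scan everything."""
    return [mapper(row) for row in source.scan() if predicate(row)]


@dataclass(frozen=True)
class Query:
    """DataFrame-like: the query is plain data that the engine can inspect."""
    select: tuple
    filter_column: str
    min_value: int


def dataframe_style(source, query):
    """Reads only the columns the query mentions (a toy form of column pruning)."""
    needed = list(dict.fromkeys(query.select + (query.filter_column,)))
    return [
        tuple(row[c] for c in query.select)
        for row in source.scan(columns=needed)
        if row[query.filter_column] >= query.min_value
    ]


def make_rows():
    # Three columns per row; "bio" stands in for a wide column nobody asks for.
    return [
        {"name": "a", "age": 31, "bio": "x" * 100},
        {"name": "b", "age": 17, "bio": "y" * 100},
        {"name": "c", "age": 45, "bio": "z" * 100},
    ]
```

Running both styles over `make_rows()` to select the names of everyone aged 18 or older gives the same answer, but the RDD-like version reads all three columns of every row while the DataFrame-like version skips `bio`. In this toy the saving is small; the point is only that the second form gives the engine something to reason about.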

Here is the full presentation: https://tinyurl.com/markus-spark-2-0