Spark Release 0.9.0

Spark 0.9.0 is a major release that adds significant new features. It moves Spark to Scala 2.10, simplifies high availability, and updates numerous components of the project. This release includes a first version of GraphX, a powerful new framework for graph processing that comes with a library of standard algorithms. In addition, Spark Streaming is now out of alpha, and includes significant optimizations and simplified high availability deployment.

You can download Spark 0.9.0 as either a source package (5 MB tgz) or a prebuilt package for Hadoop 1 / CDH3, CDH4, or Hadoop 2 / CDH5 / HDP2 (160 MB tgz). Release signatures and checksums are available at the official Apache download site.

Scala 2.10 Support

Spark now runs on Scala 2.10, letting users benefit from the language and library improvements in this version.

Configuration System

The new SparkConf class is now the preferred way to configure advanced settings on your SparkContext, though the previous Java system property method still works. SparkConf is especially useful in tests to make sure properties don’t stay set across tests.
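
For example, a SparkContext can now be configured as follows (a minimal sketch using local mode; the specific settings shown are arbitrary):

    import org.apache.spark.{SparkConf, SparkContext}

    // Build configuration programmatically instead of via Java system properties.
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("ConfExample")
      .set("spark.executor.memory", "1g")

    val sc = new SparkContext(conf)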

Spark Streaming Improvements

Spark Streaming is now out of alpha, and comes with simplified high availability and several optimizations.

  • When running on a Spark standalone cluster in its high availability mode, you can submit a Spark Streaming driver application to the cluster and have it recover automatically if either the driver or the cluster master crashes.
  • Windowed operators have been sped up by 30-50%.
  • Spark Streaming’s input source plugins (e.g. for Twitter, Kafka and Flume) are now separate Maven modules, making it easier to pull in only the dependencies you need.
  • A new StreamingListener interface has been added for monitoring statistics about the streaming computation.
  • A few aspects of the API have been improved:
    • The DStream and PairDStream classes have been moved from org.apache.spark.streaming to org.apache.spark.streaming.dstream to keep them consistent with org.apache.spark.rdd.RDD.
    • DStream.foreach has been renamed to foreachRDD to make it explicit that it operates on every RDD, not every element.
    • StreamingContext.awaitTermination() allows you to wait for context shutdown and catch any exception that occurs in the streaming computation.
    • StreamingContext.stop() now allows stopping a StreamingContext without stopping the underlying SparkContext (see the sketch after this list).
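
To illustrate the new lifecycle methods, here is a minimal sketch (the socket source, port, and batch interval are arbitrary example choices):

    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sc = new SparkContext("local[2]", "StreamingExample")
    val ssc = new StreamingContext(sc, Seconds(1))

    // foreachRDD (formerly DStream.foreach) runs a function on each generated RDD.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.foreachRDD(rdd => println("Records in batch: " + rdd.count()))

    ssc.start()
    ssc.awaitTermination()  // blocks, and rethrows any error from the computation
    // Elsewhere, ssc.stop(false) would stop the streaming computation while
    // leaving the underlying SparkContext usable.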

GraphX Alpha

GraphX is a new framework for graph processing that uses recent advances in graph-parallel computation. It lets you build a graph within a Spark program using the standard Spark operators, then process it with new graph operators that are optimized for distributed computation. It includes basic transformations, a Pregel API for iterative computation, and a standard library of graph loaders and analytics algorithms. By offering these features within the Spark engine, GraphX can significantly speed up processing pipelines compared to workflows that use different engines.

GraphX features in this release include:

  • Building graphs from RDDs of vertices and edges using the standard Spark operators
  • Basic transformations for modifying graphs and extracting subgraphs
  • A Pregel API for iterative graph-parallel computation
  • A standard library of graph loaders and analytics algorithms

GraphX is still marked as alpha in this first release, but we recommend that new users use it instead of the more limited Bagel API.
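
To give a flavor of the API, here is a minimal sketch that builds a small graph and runs PageRank from the algorithm library (the vertex and edge data are invented for the example, and the alpha API may still change):

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.{Edge, Graph}
    import org.apache.spark.graphx.lib.PageRank

    val sc = new SparkContext("local", "GraphXExample")

    // Vertices are (id, attribute) pairs; edges carry an attribute as well.
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
    val graph = Graph(vertices, edges)

    // Run ten iterations of PageRank and print each vertex's rank.
    PageRank.run(graph, 10).vertices.collect().foreach(println)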

MLlib Improvements

  • Spark’s machine learning library (MLlib) is now available in Python, where it operates on NumPy data (currently requires Python 2.7 and NumPy 1.7)
  • A new algorithm has been added for Naive Bayes classification (see the sketch after this list)
  • Alternating Least Squares models can now be used to predict ratings for multiple items in parallel
  • MLlib’s documentation was expanded to include more examples in Scala, Java and Python
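
As a small illustration of the new classifier, here is a minimal Scala sketch (the toy feature vectors are invented for the example):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.regression.LabeledPoint

    val sc = new SparkContext("local", "NaiveBayesExample")

    // LabeledPoint pairs a class label with a feature array.
    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Array(1.0, 0.0)),
      LabeledPoint(0.0, Array(2.0, 0.0)),
      LabeledPoint(1.0, Array(0.0, 1.0)),
      LabeledPoint(1.0, Array(0.0, 2.0))))

    val model = NaiveBayes.train(training, 1.0)  // 1.0 is the smoothing parameter
    println(model.predict(Array(1.0, 0.0)))      // should predict class 0.0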

Python Changes

  • Python users can now use MLlib (requires Python 2.7 and NumPy 1.7)
  • PySpark now shows the call sites of running jobs in the Spark application UI (http://<driver>:4040), making it easy to see which part of your code is running
  • IPython integration has been updated to work with newer versions

Packaging

  • Spark’s scripts have been organized into “bin” and “sbin” directories to make it easier to separate admin scripts from user ones and install Spark on standard Linux paths.
  • Log configuration has been improved so that Spark finds a default log4j.properties file if you don’t specify one.
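
If you do want custom logging, a minimal conf/log4j.properties along these lines is enough to get started (a sketch; the pattern and levels shown are just one reasonable choice):

    # Send everything at INFO and above to the console.
    log4j.rootCategory=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n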

Core Engine

  • Spark’s standalone mode now supports submitting a driver program to run on the cluster instead of on the machine from which it was submitted. You can access this functionality through the org.apache.spark.deploy.Client class.
  • Large reduce operations now automatically spill data to disk if it does not fit in memory.
  • Users of standalone mode can now set a default cap on the number of cores an application will use when the application itself doesn’t specify a limit. Previously, such applications took all available cores on the cluster.
  • spark-shell now supports the -i option to run a script on startup.
  • New histogram and countApproxDistinct operators have been added for working with numerical data (see the sketch after this list).
  • YARN mode now supports distributing extra files with the application, and several bugs have been fixed.
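
For instance, a brief sketch of the new numerical operators, assuming sc is an existing SparkContext as in spark-shell (the bucket count and accuracy are arbitrary):

    import org.apache.spark.SparkContext._  // implicit conversions for numeric RDDs

    val nums = sc.parallelize(1 to 1000).map(_.toDouble)

    // histogram(n) computes n evenly spaced buckets over the data's range and
    // returns the bucket boundaries along with the count in each bucket.
    val (buckets, counts) = nums.histogram(10)

    // countApproxDistinct estimates the number of distinct elements to within
    // the given relative accuracy, trading precision for speed and memory.
    val distinct = nums.countApproxDistinct(0.05)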

Compatibility

This release is compatible with the previous APIs in stable components, but several language versions and script locations have changed.

  • Scala programs now need to use Scala 2.10 instead of 2.9.
  • Scripts such as spark-shell and pyspark have been moved into the bin folder, while administrative scripts to start and stop standalone clusters have been moved into sbin.
  • Spark Streaming’s API has changed: external input sources have moved into separate modules, the DStream and PairDStream classes have been moved to the org.apache.spark.streaming.dstream package, and DStream.foreach has been renamed to foreachRDD. We expect the current API to be stable now that Spark Streaming is out of alpha.
  • While the old method of configuring Spark through Java system properties still works, we recommend that users update to the new SparkConf class, which is easier to inspect and use.

We expect all of the current APIs and script locations in Spark 0.9 to remain stable when we release Spark 1.0. We wanted to make these updates early to give users a chance to switch to the new API.

Contributors

The following developers contributed to this release:

  • Andrew Ash – documentation improvements
  • Pierre Borckmans – documentation fix
  • Russell Cardullo – Graphite sink for metrics
  • Evan Chan – local:// URI feature
  • Vadim Chekan – bug fix
  • Lian Cheng – refactoring and code clean-up in several locations, bug fixes
  • Ewen Cheslack-Postava – Spark EC2 and PySpark improvements
  • Mosharaf Chowdhury – optimized broadcast
  • Dan Crankshaw – GraphX contributions
  • Haider Haidi – documentation fix
  • Frank Dai – Naive Bayes classifier in MLlib, documentation improvements
  • Tathagata Das – new operators, fixes, and improvements to Spark Streaming (lead)
  • Ankur Dave – GraphX contributions
  • Henry Davidge – warning for large tasks
  • Aaron Davidson – shuffle file consolidation, H/A mode for standalone scheduler, various improvements and fixes
  • Kyle Ellrott – GraphX contributions
  • Hossein Falaki – new statistical operators, Scala and Python examples in MLlib
  • Harvey Feng – hadoop file optimizations and YARN integration
  • Ali Ghodsi – support for SIMR
  • Joseph E. Gonzalez – GraphX contributions
  • Thomas Graves – fixes and improvements for YARN support (lead)
  • Rong Gu – documentation fix
  • Stephen Haberman – bug fixes
  • Walker Hamilton – bug fix
  • Mark Hamstra – scheduler improvements and fixes, build fixes
  • Damien Hardy – Debian build fix
  • Nathan Howell – sbt upgrade
  • Grace Huang – improvements to metrics code
  • Shane Huang – separation of admin and user scripts
  • Prabeesh K – MQTT integration for Spark Streaming and code fix
  • Holden Karau – sbt build improvements and Java API extensions
  • KarthikTunga – bug fix
  • Grega Kespret – bug fix
  • Marek Kolodziej – optimized random number generator
  • Jey Kottalam – EC2 script improvements
  • Du Li – bug fixes
  • Haoyuan Li – Tachyon support in EC2
  • LiGuoqiang – fixes to build and YARN integration
  • Raymond Liu – build improvement and various fixes for YARN support
  • George Loentiev – Maven build fixes
  • Akihiro Matsukawa – GraphX contributions
  • David McCauley – improvements to json endpoint
  • Mike – bug fixes
  • Fabrizio (Misto) Milo – bug fix
  • Mridul Muralidharan – speculation improvements, several bug fixes
  • Tor Myklebust – Python MLlib bindings, instrumentation for task serialization
  • Sundeep Narravula – bug fix
  • Binh Nguyen – Java API improvements and version upgrades
  • Adam Novak – bug fix
  • Andrew Or – external sorting
  • Kay Ousterhout – several bug fixes and improvements to Spark scheduler
  • Sean Owen – style fixes
  • Nick Pentreath – ALS implicit feedback algorithm
  • Pillis – Vector.random() method
  • Imran Rashid – bug fix
  • Ahir Reddy – support for SIMR
  • Luca Rosellini – script loading for Scala shell
  • Josh Rosen – fixes, clean-up, and extensions to the Scala and Java APIs
  • Henry Saputra – style improvements and clean-up
  • Andre Schumacher – Python improvements and bug fixes
  • Jerry Shao – multi-user support, various fixes and improvements
  • Prashant Sharma – Scala 2.10 support, configuration system, several smaller fixes
  • Shiyun – style fix
  • Wangda Tan – UI improvement and bug fixes
  • Matthew Taylor – bug fix
  • Jyun-Fan Tsai – documentation fix
  • Takuya Ueshin – bug fix
  • Shivaram Venkataraman – sbt build optimization, EC2 improvements, Java and Python API
  • Jianping J Wang – GraphX contributions
  • Martin Weindel – build fix
  • Patrick Wendell – standalone driver submission, various fixes, release manager
  • Neal Wiggins – bug fix
  • Andrew Xia – bug fixes and code cleanup
  • Reynold Xin – GraphX contributions, task killing, various fixes, improvements and optimizations
  • Dong Yan – bug fix
  • Haitao Yao – bug fix
  • Xusen Yin – bug fix
  • Fengdong Yu – documentation fixes
  • Matei Zaharia – new configuration system, Python MLlib bindings, scheduler improvements, various fixes and optimizations
  • Wu Zeming – bug fix
  • Nan Zhu – documentation improvements

Thanks to everyone who contributed!

