Spark Release 0.9.0
Spark 0.9.0 is a major release that adds significant new features. It updates Spark to Scala 2.10, simplifies high availability, and improves numerous components of the project. This release includes a first version of GraphX, a powerful new framework for graph processing that comes with a library of standard algorithms. In addition, Spark Streaming is now out of alpha, and includes significant optimizations and simplified high availability deployment.
You can download Spark 0.9.0 as either a source package (5 MB tgz) or a prebuilt package for Hadoop 1 / CDH3, CDH4, or Hadoop 2 / CDH5 / HDP2 (160 MB tgz). Release signatures and checksums are available at the official Apache download site.
Scala 2.10 Support
Spark now runs on Scala 2.10, letting users benefit from the language and library improvements in this version.
Configuration System
The new SparkConf class is now the preferred way to configure advanced settings on your SparkContext, though the previous Java system property method still works. SparkConf is especially useful in tests, where it ensures that properties set for one test don't leak into the next.
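For example, a context can now be configured and created as follows (a minimal sketch; the master URL, application name, and memory setting are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Build up the configuration explicitly instead of via Java system properties.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("ConfExample")
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
```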
Spark Streaming Improvements
Spark Streaming is now out of alpha, and comes with simplified high availability and several optimizations.
- When running on a Spark standalone cluster with high availability mode enabled, you can submit a Spark Streaming driver application to the cluster and have it automatically recovered if either the driver or the cluster master crashes.
- Windowed operators have been sped up by 30-50%.
- Spark Streaming’s input source plugins (e.g. for Twitter, Kafka and Flume) are now separate Maven modules, making it easier to pull in only the dependencies you need.
- A new StreamingListener interface has been added for monitoring statistics about the streaming computation.
- A few aspects of the API have been improved:
  - DStream and PairDStreamFunctions classes have been moved from org.apache.spark.streaming to org.apache.spark.streaming.dstream to keep them consistent with org.apache.spark.rdd.RDD.
  - DStream.foreach has been renamed to foreachRDD to make it explicit that it runs on every RDD rather than on every element (see the sketch after this list).
  - StreamingContext.awaitTermination() lets you wait for context shutdown and catch any exception that occurs in the streaming computation.
  - StreamingContext.stop() now allows stopping the StreamingContext without stopping the underlying SparkContext.
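The sketch below shows the renamed foreachRDD and the new awaitTermination() call together; the local master and the socket source are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingExample")
val ssc = new StreamingContext(conf, Seconds(1))

// A placeholder text source; any input DStream works the same way.
val lines = ssc.socketTextStream("localhost", 9999)

// foreachRDD (formerly DStream.foreach) runs once per generated RDD.
lines.foreachRDD(rdd => println("Records in this batch: " + rdd.count()))

ssc.start()
// Blocks until the context shuts down, rethrowing any exception
// that occurred in the streaming computation.
ssc.awaitTermination()
```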
GraphX Alpha
GraphX is a new framework for graph processing that uses recent advances in graph-parallel computation. It lets you build a graph within a Spark program using the standard Spark operators, then process it with new graph operators that are optimized for distributed computation. It includes basic transformations, a Pregel API for iterative computation, and a standard library of graph loaders and analytics algorithms. By offering these features within the Spark engine, GraphX can significantly speed up processing pipelines compared to workflows that use different engines.
GraphX is still marked as alpha in this first release, but we recommend that new users try it instead of the more limited Bagel API; the sketch below gives a flavor of the new API.
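The following sketch builds a small property graph from plain RDDs and runs PageRank from the standard algorithm library (a minimal sketch against the alpha API; the three-node cycle is made-up example data):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx._

val sc = new SparkContext("local[2]", "GraphXExample")

// A made-up three-node cycle: vertex IDs with string attributes, plus edges.
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))

// Build the graph with standard Spark operators, then run PageRank
// until the ranks converge within the given tolerance.
val graph = Graph(vertices, edges)
val ranks = graph.pageRank(0.0001).vertices
ranks.collect().foreach(println)
```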
MLlib Improvements
- Spark’s machine learning library (MLlib) is now available in Python, where it operates on NumPy data (currently requires Python 2.7 and NumPy 1.7)
- A new algorithm has been added for Naive Bayes classification (see the sketch after this list)
- Alternating Least Squares models can now be used to predict ratings for multiple items in parallel
- MLlib’s documentation was expanded to include more examples in Scala, Java and Python
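As an illustration of the new classifier, here is a minimal Naive Bayes sketch against the 0.9 Scala API, in which LabeledPoint carries its features as an Array[Double] (the toy training data is made up):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint

val sc = new SparkContext("local[2]", "NaiveBayesExample")

// Two made-up training examples: label 0.0 fires feature 0, label 1.0 fires feature 1.
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Array(1.0, 0.0)),
  LabeledPoint(1.0, Array(0.0, 1.0))))

val model = NaiveBayes.train(training)
println(model.predict(Array(1.0, 0.0)))  // should print 0.0
```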
Python Changes
- Python users can now use MLlib (requires Python 2.7 and NumPy 1.7)
- PySpark now shows the call sites of running jobs in the Spark application UI (at http://<driver>:4040), making it easy to see which part of your code is running
- IPython integration has been updated to work with newer versions
Packaging
- Spark’s scripts have been organized into “bin” and “sbin” directories to make it easier to separate admin scripts from user ones and install Spark on standard Linux paths.
- Log configuration has been improved so that Spark finds a default log4j.properties file if you don’t specify one.
Core Engine
- Spark’s standalone mode now supports submitting a driver program to run on the cluster instead of on the machine that submits it. You can access this functionality through the org.apache.spark.deploy.Client class.
- Large reduce operations now automatically spill data to disk if it does not fit in memory.
- Users of standalone mode can now set a default limit on the number of cores an application will use when the application itself doesn’t request a specific number. Previously, such applications took all available cores on the cluster.
- spark-shell now supports the -i option to run a script on startup.
- New histogram and countDistinctApprox operators have been added for working with numerical data (see the sketch after this list).
- YARN mode now supports distributing extra files with the application, and several bugs have been fixed.
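As an example of the new numerical operators, here is a minimal histogram sketch (the sample values and bucket count are made up):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // implicit operations on RDDs of doubles

val sc = new SparkContext("local[2]", "HistogramExample")
val data = sc.parallelize(Seq(1.0, 2.5, 3.0, 4.5, 9.0))

// Split the data's range into 4 evenly spaced buckets and count the values
// in each; returns the bucket boundaries and the per-bucket counts.
val (buckets, counts) = data.histogram(4)
println(buckets.mkString(", "))
println(counts.mkString(", "))
```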
Compatibility
This release is compatible with the previous APIs in stable components, but several language versions and script locations have changed.
- Scala programs now need to use Scala 2.10 instead of 2.9.
- Scripts such as spark-shell and pyspark have been moved into the bin folder, while administrative scripts to start and stop standalone clusters have been moved into sbin.
- Spark Streaming’s API has been changed to move external input sources into separate modules; DStream and PairDStreamFunctions have been moved to the org.apache.spark.streaming.dstream package, and DStream.foreach has been renamed to foreachRDD. We expect the current API to be stable now that Spark Streaming is out of alpha.
- While the old method of configuring Spark through Java system properties still works, we recommend that users update to the new SparkConf class, which is easier to inspect and use.
We expect all of the current APIs and script locations in Spark 0.9 to remain stable when we release Spark 1.0. We wanted to make these updates early to give users a chance to switch to the new API.
Contributors
The following developers contributed to this release:
- Andrew Ash – documentation improvements
- Pierre Borckmans – documentation fix
- Russell Cardullo – Graphite sink for metrics
- Evan Chan – local:// URI feature
- Vadim Chekan – bug fix
- Lian Cheng – refactoring and code clean-up in several locations, bug fixes
- Ewen Cheslack-Postava – Spark EC2 and PySpark improvements
- Mosharaf Chowdhury – optimized broadcast
- Dan Crankshaw – GraphX contributions
- Haider Haidi – documentation fix
- Frank Dai – Naive Bayes classifier in MLlib, documentation improvements
- Tathagata Das – new operators, fixes, and improvements to Spark Streaming (lead)
- Ankur Dave – GraphX contributions
- Henry Davidge – warning for large tasks
- Aaron Davidson – shuffle file consolidation, H/A mode for standalone scheduler, various improvements and fixes
- Kyle Ellrott – GraphX contributions
- Hossein Falaki – new statistical operators, Scala and Python examples in MLlib
- Harvey Feng – hadoop file optimizations and YARN integration
- Ali Ghodsi – support for SIMR
- Joseph E. Gonzalez – GraphX contributions
- Thomas Graves – fixes and improvements for YARN support (lead)
- Rong Gu – documentation fix
- Stephen Haberman – bug fixes
- Walker Hamilton – bug fix
- Mark Hamstra – scheduler improvements and fixes, build fixes
- Damien Hardy – Debian build fix
- Nathan Howell – sbt upgrade
- Grace Huang – improvements to metrics code
- Shane Huang – separation of admin and user scripts
- Prabeesh K – MQTT integration for Spark Streaming and code fix
- Holden Karau – sbt build improvements and Java API extensions
- KarthikTunga – bug fix
- Grega Kespret – bug fix
- Marek Kolodziej – optimized random number generator
- Jey Kottalam – EC2 script improvements
- Du Li – bug fixes
- Haoyuan Li – Tachyon support in EC2
- LiGuoqiang – fixes to build and YARN integration
- Raymond Liu – build improvement and various fixes for YARN support
- George Loentiev – Maven build fixes
- Akihiro Matsukawa – GraphX contributions
- David McCauley – improvements to JSON endpoint
- Mike – bug fixes
- Fabrizio (Misto) Milo – bug fix
- Mridul Muralidharan – speculation improvements, several bug fixes
- Tor Myklebust – Python MLlib bindings, instrumentation for task serialization
- Sundeep Narravula – bug fix
- Binh Nguyen – Java API improvements and version upgrades
- Adam Novak – bug fix
- Andrew Or – external sorting
- Kay Ousterhout – several bug fixes and improvements to Spark scheduler
- Sean Owen – style fixes
- Nick Pentreath – ALS implicit feedback algorithm
- Pillis – Vector.random() method
- Imran Rashid – bug fix
- Ahir Reddy – support for SIMR
- Luca Rosellini – script loading for Scala shell
- Josh Rosen – fixes, clean-up, and extensions to the Scala and Java APIs
- Henry Saputra – style improvements and clean-up
- Andre Schumacher – Python improvements and bug fixes
- Jerry Shao – multi-user support, various fixes and improvements
- Prashant Sharma – Scala 2.10 support, configuration system, several smaller fixes
- Shiyun – style fix
- Wangda Tan – UI improvement and bug fixes
- Matthew Taylor – bug fix
- Jyun-Fan Tsai – documentation fix
- Takuya Ueshin – bug fix
- Shivaram Venkataraman – sbt build optimization, EC2 improvements, Java and Python API
- Jianping J Wang – GraphX contributions
- Martin Weindel – build fix
- Patrick Wendell – standalone driver submission, various fixes, release manager
- Neal Wiggins – bug fix
- Andrew Xia – bug fixes and code cleanup
- Reynold Xin – GraphX contributions, task killing, various fixes, improvements and optimizations
- Dong Yan – bug fix
- Haitao Yao – bug fix
- Xusen Yin – bug fix
- Fengdong Yu – documentation fixes
- Matei Zaharia – new configuration system, Python MLlib bindings, scheduler improvements, various fixes and optimizations
- Wu Zeming – bug fix
- Nan Zhu – documentation improvements
Thanks to everyone who contributed!