Apache Spark 2.3.0 is the fourth release in the 2.x line. This release adds support for Continuous Processing in Structured Streaming along with a brand new Kubernetes Scheduler backend. Other major updates include the new DataSource and Structured Streaming v2 APIs, and a number of PySpark performance enhancements. In addition, this release continues to focus on usability, stability, and polish while resolving around 1400 tickets.
To download Apache Spark 2.3.0, visit the downloads page. You can consult JIRA for the detailed changes. We have curated a list of high-level changes here, grouped by major modules.
spark.sql.orc.impl to native.
64KB JVM bytecode limit on the Java method and Java compiler constant pool limit
INSERT OVERWRITE DIRECTORY to directly write data into the filesystem from a query
2.10
0.8.0 and Netty to 4.1.17
Programming guides: Spark RDD Programming Guide and Spark SQL, DataFrames and Datasets Guide.
Programming guide: Structured Streaming Programming Guide.
ClusteringEvaluator for tuning clustering algorithms, supporting Cosine silhouette and squared Euclidean silhouette metrics (Scala/Java/Python)
FeatureHasher transformer (Scala/Java/Python)
OneHotEncoderEstimator (Scala/Java/Python)
QuantileDiscretizer (Scala/Java)
Bucketizer (Scala/Java/Python)
CrossValidator and TrainValidationSplit can collect all models when fitting (Scala/Java). This allows you to inspect or save all fitted models.
CrossValidator, TrainValidationSplit, OneVsRest support a parallelism Param for fitting multiple sub-models in parallel Spark jobs
featureSubsetStrategy Param to GBTClassifier and GBTRegressor. Using this to subsample features can significantly improve training speed; this option has been a key strength of xgboost.
Word2Vec learning rate scaling with num iterations. The new learning rate is set to match the original Word2Vec C code and should give better results from training.
JSON support for Matrix parameters (This fixed a bug for ML persistence with LogisticRegressionModel when using bounds on coefficients.)
Bucketizer.transform incorrectly drops row containing NaN. When Param handleInvalid was set to “skip,” Bucketizer would drop a row with a valid value in the input column if another (irrelevant) column had a NaN value.
StringIndexerModel to throw an incorrect “Unseen label” exception when handleInvalid was set to “error.” This could happen for filtered data, due to predicate push-down, causing errors even after invalid rows had already been filtered from the input dataset.
CrossValidator
TrainValidationSplit
Imputer should train using a single pass over the data
OnlineLDAOptimizer avoids collecting statistics to the driver for each mini-batch.
Programming guide: Machine Learning Library (MLlib) Guide.
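The parallelism Param mentioned above for CrossValidator, TrainValidationSplit and OneVsRest can be pictured with a small plain-Python sketch. This is not Spark code: `fit_shrunken_mean` and `fit_all` are hypothetical names invented for illustration, and a thread pool stands in for parallel Spark jobs. It only shows the idea that independent sub-models can be fitted concurrently and should give the same results as a serial run.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def fit_shrunken_mean(values, shrinkage):
    """Hypothetical stand-in for fitting one sub-model: a mean shrunk toward zero."""
    return mean(values) * (1.0 - shrinkage)


def fit_all(values, shrinkages, parallelism=1):
    """Fit one sub-model per candidate shrinkage, using up to `parallelism` threads."""
    if parallelism <= 1:
        # Serial path: fit the candidates one after another.
        return [fit_shrunken_mean(values, s) for s in shrinkages]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # Executor.map yields results in input order, so the output lines up
        # with `shrinkages` regardless of which fit finishes first.
        return list(pool.map(lambda s: fit_shrunken_mean(values, s), shrinkages))


training = [2.0, 4.0, 6.0]
grid = [0.0, 0.25, 0.5]

serial = fit_all(training, grid, parallelism=1)
parallel = fit_all(training, grid, parallelism=3)
print(serial == parallel)
```

Because each fit is independent, raising the level of parallelism should change only the running time, not the fitted results.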
The main focus of SparkR in the 2.3.0 release was improving the stability of UDFs and adding several new SparkR wrappers around existing APIs:
withWatermark, trigger, partitionBy and stream-stream joins
Programming guide: SparkR (R on Spark).
StackOverflowErrors
Programming guide: GraphX Programming Guide.
register* for UDFs in SQLContext and Catalog in PySpark
OneHotEncoder has been deprecated and will be removed in 3.0. It has been replaced by the new OneHotEncoderEstimator. Note that OneHotEncoderEstimator will be renamed to OneHotEncoder in 3.0 (but OneHotEncoderEstimator will be kept as an alias).
NULL in the prior versions)
elt() returns an output as binary. Otherwise, it returns a string. In prior versions, it always returned a string regardless of input types.
functions.concat() returns an output as binary. Otherwise, it returns a string. In prior versions, it always returned a string regardless of input types.
double type as the common type for double type and date type. Now it finds the correct common type for such conflicts. For details, see the migration guide.
percentile_approx function previously accepted numeric type input and output double type results. Now it supports date type, timestamp type and numeric types as input types. The result type has also changed to match the input type, which is more reasonable for percentiles.
_corrupt_record by default). Instead, you can cache or save the parsed results and then send the same query.
na.fill() or fillna also accepts boolean and replaces nulls with booleans. In prior Spark versions, PySpark just ignored it and returned the original Dataset/DataFrame.
0.19.2 or higher is required for using Pandas-related functionality, such as toPandas, createDataFrame from Pandas DataFrame, etc.
df.replace does not allow value to be omitted when to_replace is not a dictionary. Previously, value could be omitted in the other cases and defaulted to None, which is counter-intuitive and error-prone.
LogisticRegressionTrainingSummary to a BinaryLogisticRegressionTrainingSummary. Users should instead use the model.binarySummary method. See [SPARK-17139] for more detail (note this is an @Experimental API). This does not affect the Python summary method, which will still work correctly for both multinomial and binary cases.
BinaryClassificationMetrics.pr(): first point (0.0, 1.0) is misleading and has been replaced by (0.0, p) where precision p matches the lowest recall point.
RFormula without an intercept now outputs the reference category when encoding string terms, in order to match native R behavior. This may change results from model training.
OneVsRest is now set to 1 (i.e. serial). In 2.2 and earlier versions, the level of parallelism was set to the default threadpool size in Scala. This may change performance.
0.13.2. This included an important bug fix in strong Wolfe line search for L-BFGS.
OptimizeMetadataOnlyQuery
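The change to the first point of BinaryClassificationMetrics.pr() listed above can be illustrated with a plain-Python sketch. This is not the MLlib implementation: `pr_curve` is a hypothetical helper written for illustration, and it uses every distinct score as a threshold. It only shows how prepending (0.0, p), with p taken from the lowest-recall point, differs from a fixed (0.0, 1.0).

```python
def pr_curve(scores_and_labels):
    """Return (recall, precision) points, highest threshold first,
    preceded by (0.0, p) where p is the precision at the lowest-recall point."""
    total_pos = sum(1 for _, label in scores_and_labels if label == 1)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")

    # One threshold per distinct score, from strictest to most lenient.
    thresholds = sorted({score for score, _ in scores_and_labels}, reverse=True)

    points = []
    for t in thresholds:
        # Labels of all examples predicted positive at this threshold.
        predicted = [label for score, label in scores_and_labels if score >= t]
        true_pos = sum(predicted)
        points.append((true_pos / total_pos, true_pos / len(predicted)))

    # Recall never decreases as the threshold drops, so points[0] has the lowest recall.
    first_precision = points[0][1]
    return [(0.0, first_precision)] + points


# Two examples share the top score, one positive and one negative.
data = [(0.9, 1), (0.9, 0), (0.6, 1), (0.3, 0)]
curve = pr_curve(data)
print(curve[0])
```

In this toy data the strictest threshold already admits a false positive, so a first point of (0.0, 1.0) would overstate precision, while (0.0, 0.5) is consistent with the next point on the curve.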
Last but not least, this release would not have been possible without the following contributors: ALeksander Eskilson, Adrian Ionescu, Ajay Saini, Ala Luszczak, Albert Jang, Alberto Rodriguez De Lema, Alex Mikhailau, Alexander Istomin, Anderson Osagie, Andrea Zito, Andrew Ash, Andrew Korzhuev, Andrew Ray, Anirudh Ramanathan, Anton Okolnychyi, Arman Yazdani, Armin Braun, Arseniy Tashoyan, Arthur Rand, Atallah Hezbor, Attila Zsolt Piros, Ayush Singh, Bago Amirbekian, Ben Barnard, Bo Meng, Bo Xu, Bogdan Raducanu, Brad Kaiser, Bravo Zhang, Bruce Robbins, Bruce Xu, Bryan Cutler, Burak Yavuz, Carson Wang, Chang Chen, Charles Chen, Cheng Wang, Chenjun Zou, Chenzhao Guo, Chetan Khatri, Chie Hayashida, Chin Han Yu, Chunsheng Ji, Corey Woodfield, Daniel Li, Daniel Van Der Ende, Devaraj K, Dhruve Ashar, Dilip Biswal, Dmitry Parfenchik, Donghui Xu, Dongjoon Hyun, Eren Avsarogullari, Eric Vandenberg, Erik LaBianca, Eyal Farago, Favio Vazquez, Felix Cheung, Feng Liu, Feng Zhu, Fernando Pereira, Fokko Driesprong, Gabor Somogyi, Gene Pang, Gera Shegalov, German Schiavon, Glen Takahashi, Greg Owen, Grzegorz Slowikowski, Guilherme Berger, Guillaume Dardelet, Guo Xiao Long, He Qiao, Henry Robinson, Herman Van Hovell, Hideaki Tanaka, Holden Karau, Huang Tengfei, Huaxin Gao, Hyukjin Kwon, Ilya Matiach, Imran Rashid, Iurii Antykhovych, Ivan Sadikov, Jacek Laskowski, JackYangzg, Jakub Dubovsky, Jakub Nowacki, James Thompson, Jan Vrsovsky, Jane Wang, Jannik Arndt, Jason Taaffe, Jeff Zhang, Jen-Ming Chung, Jia Li, Jia-Xuan Liu, Jin Xing, Jinhua Fu, Jirka Kremser, Joachim Hereth, John Compitello, John Lee, John O’Leary, Jorge Machado, Jose Torres, Joseph K. 
Bradley, Josh Rosen, Juliusz Sompolski, Kalvin Chau, Kazuaki Ishizaki, Kent Yao, Kento NOZAWA, Kevin Yu, Kirby Linvill, Kohki Nishio, Kousuke Saruta, Kris Mok, Krishna Pandey, Kyle Kelley, Li Jin, Li Yichao, Li Yuanjian, Liang-Chi Hsieh, Lijia Liu, Liu Shaohui, Liu Xian, Liyun Zhang, Louis Lyu, Lubo Zhang, Luca Canali, Maciej Brynski, Maciej Szymkiewicz, Madhukara Phatak, Mahmut CAVDAR, Marcelo Vanzin, Marco Gaido, Marcos P, Marcos P. Sanchez, Mark Petruska, Maryann Xue, Masha Basmanova, Miao Wang, Michael Allman, Michael Armbrust, Michael Gummelt, Michael Mior, Michael Patterson, Michael Styles, Michal Senkyr, Mikhail Sveshnikov, Min Shen, Ming Jiang, Mingjie Tang, Mridul Muralidharan, Nan Zhu, Nathan Kronenfeld, Neil Alexander McQuarrie, Ngone51, Nicholas Chammas, Nick Pentreath, Ohad Raviv, Oleg Danilov, Onur Satici, PJ Fanning, Parth Gandhi, Patrick Woody, Paul Mackles, Peng Meng, Peng Xiao, Pengcheng Liu, Peter Szalai, Pralabh Kumar, Prashant Sharma, Rekha Joshi, Remis Haroon, Reynold Xin, Reza Safi, Riccardo Corbella, Rishabh Bhardwaj, Robert Kruszewski, Ron Hu, Ruben Berenguel Montoro, Ruben Janssen, Rui Zha, Rui Zhan, Ruifeng Zheng, Russell Spitzer, Ryan Blue, Sahil Takiar, Saisai Shao, Sameer Agarwal, Sandor Murakozi, Sanket Chintapalli, Santiago Saavedra, Sathiya Kumar, Sean Owen, Sergei Lebedev, Sergey Serebryakov, Sergey Zhemzhitsky, Seth Hendrickson, Shane Jarvie, Shashwat Anand, Shintaro Murakami, Shivaram Venkataraman, Shixiong Zhu, Shuangshuang Wang, Sid Murching, Sital Kedia, Soonmok Kwon, Srinivasa Reddy Vundela, Stavros Kontopoulos, Steve Loughran, Steven Rand, Sujith, Sujith Jay Nair, Sumedh Wale, Sunitha Kambhampati, Suresh Thalamati, Susan X. 
Huynh, Takeshi Yamamuro, Takuya UESHIN, Tathagata Das, Tejas Patil, Teng Peng, Thomas Graves, Tim Van Wassenhove, Travis Hegner, Tristan Stevens, Tucker Beck, Valeriy Avanesov, Vinitha Gankidi, Vinod KC, Wang Gengliang, Wayne Zhang, Weichen Xu, Wenchen Fan, Wieland Hoffmann, Wil Selwood, Wing Yew Poon, Xiang Gao, Xianjin YE, Xianyang Liu, Xiao Li, Xiaochen Ouyang, Xiaofeng Lin, Xiaokai Zhao, Xiayun Sun, Xin Lu, Xin Ren, Xingbo Jiang, Yan Facai (Yan Fa Cai), Yan Kit Li, Yanbo Liang, Yash Sharma, Yinan Li, Yong Tang, Youngbin Kim, Yuanjian Li, Yucai Yu, Yuhai Cen, Yuhao Yang, Yuming Wang, Yuval Itzchakov, Zhan Zhang, Zhang A Peng, Zhaokun Liu, Zheng RuiFeng, Zhenhua Wang, Zuo Tingbing, brandonJY, caneGuy, cxzl25, djvulee, eatoncys, heary-cao, ho3rexqj, lizhaoch, maclockard, neoremind, peay, shaofei007, wangjiaochun, zenglinxi0615