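For a walkthrough of what is new in Apache Spark 1.4.0, see the article "Apache Spark 1.4.0 New Features Explained" (《Apache Spark 1.4.0新特性详解》).
Apache Spark 1.4.0 was officially released on June 11, 2015 (US time). Highlights include Python 3 support, an R API, window functions, ORC support, statistical analysis functions for DataFrames, a better UI for visualizing and inspecting execution, and the graduation of the machine learning pipelines API from alpha to a stable API.
Apache Spark 1.4.0 is the fifth release on the 1.x line. It formally adds the R API to Spark, improves the usability of the core engine, and extends MLlib and Spark Streaming. Spark 1.4 includes contributions from more than 210 contributors at 70 organizations, with over 1,000 patches.
The announcement email reads: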
Hi All,

I'm happy to announce the availability of Spark 1.4.0! Spark 1.4.0 is
the fifth release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 210 developers and more
than 1,000 commits!

A huge thanks go to all of the individuals and organizations involved
in development and testing of this release.

Visit the release notes [1] to read about the new features, or
download [2] the release today.

For errata in the contributions or release notes, please e-mail me
*directly* (not on-list).

Thanks to everyone who helped work on this release!

[1] http://spark.apache.org/releases/spark-release-1-4-0.html
[2] http://spark.apache.org/downloads.html
SparkR
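Spark 1.4 is the first release to include SparkR, an R binding for Spark built on Spark's new DataFrame API. SparkR lets R users analyze large datasets on a Spark cluster and use Spark SQL directly. See the SparkR (R on Spark) programming guide for details.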
Spark Core
In Spark Core, this release mainly brings improvements to operations, performance, and compatibility. The main updates are:
SPARK-6942: Visualization for Spark DAGs and operational monitoring
SPARK-4897: Python 3 support
SPARK-3644: A REST API for application information
SPARK-4550: Serialized shuffle outputs for improved performance
SPARK-7081: Initial performance improvements in project Tungsten
SPARK-3074: External spilling for Python groupByKey operations
SPARK-3674: YARN support for Spark EC2 and SPARK-5342: Security for long running YARN applications
SPARK-2691: Docker support in Mesos and SPARK-6338: Cluster mode in Mesos
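As a rough illustration of the REST API for application information (SPARK-3644), the sketch below lists the applications known to a running driver UI. It assumes the `/api/v1/applications` endpoint on the UI port (4040 by default) and reads only the `id` and `name` fields of each entry. The helper names `applications_url` and `list_applications`, and the injectable `fetch` parameter, are hypothetical and exist only to keep the example testable without a live cluster.

```python
import json
from urllib.request import urlopen


def applications_url(host, port=4040):
    """Build the URL of the application listing (assumed endpoint path)."""
    return "http://{}:{}/api/v1/applications".format(host, port)


def list_applications(host, port=4040, fetch=None):
    """Return (id, name) pairs for the applications reported by the UI.

    `fetch` is a hypothetical hook: a callable that takes a URL and returns
    the response body as text. By default it performs a real HTTP GET.
    """
    if fetch is None:
        def fetch(url):
            with urlopen(url) as response:
                return response.read().decode("utf-8")

    applications = json.loads(fetch(applications_url(host, port)))
    return [(app.get("id"), app.get("name")) for app in applications]
```

Calling `list_applications("localhost")` against a running application would return one pair per application. For a history server, the port would typically be 18080 rather than 4040.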
DataFrame API and Spark SQL
The DataFrame API sees major extensions in Spark 1.4 (see this link for a full list) with a focus on analytic and mathematical functions. Spark SQL introduces new operational utilities along with support for ORCFile.
SPARK-2883: Support for ORCFile format
SPARK-2213: Sort-merge joins to optimize very large joins
SPARK-5100: Dedicated UI for the SQL JDBC server
SPARK-6829: Mathematical functions in DataFrames
SPARK-8299: Improved error message reporting for DataFrame and SQL
SPARK-1442: Window functions in Spark SQL and DataFrames
SPARK-6231 / SPARK-7059: Improved API support for self joins
SPARK-5947: Partitioning support in Spark’s data source API
SPARK-7320: Rollup and cube functions
SPARK-6117: Summary and descriptive statistics
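For readers new to window functions (SPARK-1442), the following plain-Python sketch shows what a SQL-style `RANK()` over a partition computes. It is not Spark code and does not use the Spark API. The function name `rank_within_partition` and the row layout (a list of dicts) are assumptions made for illustration.

```python
from collections import defaultdict


def rank_within_partition(rows, partition_key, order_key):
    """Attach a RANK()-style value to every row.

    Rows are grouped by `partition_key` and ordered by `order_key`, largest
    first. Tied rows share a rank, and the next distinct value skips ahead
    (1, 1, 3), which is the usual RANK() behavior.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)

    ranked = []
    for part in partitions.values():
        ordered = sorted(part, key=lambda r: r[order_key], reverse=True)
        for position, row in enumerate(ordered, start=1):
            # A row ties with the previous one when the ordering values match.
            tied = position > 1 and row[order_key] == ordered[position - 2][order_key]
            rank = ranked[-1]["rank"] if tied else position
            ranked.append(dict(row, rank=rank))
    return ranked
```

Unlike a group-by aggregation, the result keeps one output row per input row. Each row simply gains a value computed from the other rows in its partition.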
Spark ML/MLlib
Spark’s ML pipelines API graduates from alpha in this release, with new transformers and improved Python coverage. MLlib also adds several new algorithms.
SPARK-5884: A variety of feature transformers for ML pipelines
SPARK-7381: Python API for ML pipelines
SPARK-5854: Personalized PageRank for GraphX
SPARK-6113: Stabilize DecisionTree and ensembles APIs
SPARK-7262: Binary LogisticRegression with L1/L2 (elastic net)
SPARK-7015: OneVsRest multiclass to binary reduction
SPARK-4588: Add API for feature attributes
SPARK-1406: PMML model evaluation support via MLlib
SPARK-5995: Make ML Prediction Developer APIs public
SPARK-3066: Support recommendAll in matrix factorization model
SPARK-4894: Bernoulli naive Bayes
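To make the "L1/L2 (elastic net)" wording in SPARK-7262 concrete, here is a small plain-Python sketch of an elastic net penalty. It assumes the common formulation in which a mixing parameter blends the L1 norm with half the squared L2 norm and a regularization strength scales the sum. The parameter names `reg_param` and `elastic_net_param` are chosen for illustration, and nothing here calls Spark.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the loss under the assumed elastic net formulation.

    elastic_net_param = 1.0 gives a pure L1 (lasso) penalty,
    elastic_net_param = 0.0 gives a pure L2 (ridge) penalty,
    and values in between mix the two.
    """
    l1_norm = sum(abs(w) for w in weights)
    l2_norm_squared = sum(w * w for w in weights)
    mixed = (elastic_net_param * l1_norm
             + (1.0 - elastic_net_param) / 2.0 * l2_norm_squared)
    return reg_param * mixed
```

The L1 term tends to drive some weights exactly to zero, while the L2 term shrinks all weights smoothly. Mixing the two is a trade-off between sparsity and stability.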
Spark Streaming
Spark Streaming adds visual instrumentation graphs and significantly improved debugging information in the UI. It also enhances support for both Kafka and Kinesis.
SPARK-7602: Visualization and monitoring in the streaming UI including batch drill down (SPARK-6796, SPARK-6862)
SPARK-7621: Better error reporting for Kafka
SPARK-2808: Support for Kafka 0.8.2.1 and Kafka with Scala 2.11
SPARK-5946: Python API for Kafka direct mode
SPARK-7111: Input rate tracking for Kafka
SPARK-5960: Support for transferring AWS credentials to Kinesis
SPARK-7056: A pluggable interface for write ahead logs
Known Issues
This release has a few known issues, which will be addressed in Spark 1.4.1:
Python sortBy()/sortByKey() can hang if a single partition is larger than worker memory (SPARK-8202)
Unintended behavior change of JSON schema inference (SPARK-8093)
Some ML pipeline components do not correctly implement copy (SPARK-8151)
Spark-ec2 branch pointer is wrong (SPARK-8310)
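Copyright of original articles belongs to 过往记忆大数据 (过往记忆). Reproduction without permission is prohibited.
Link to this article: "Apache Spark 1.4.0 Officially Released" (https://www.iteblog.com/archives/1390.html)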