文章目录
Apache Spark 2.4 与昨天正式发布,Apache Spark 2.4 版本是 2.x 系列的第五个版本。
Apache Spark 2.4 为我们带来了众多的主要功能和增强功能,主要如下:
- 新的调度模型(Barrier Scheduling),使用户能够将分布式深度学习训练恰当地嵌入到 Spark 的 stage 中,以简化分布式训练工作流程。
- 添加了35个高阶函数,用于在 Spark SQL 中操作数组/map。
- 新增一个新的基于 Databricks 的 spark-avro 模块的原生 AVRO 数据源。
- PySpark 还为教学和可调试性的所有操作引入了热切的评估模式(eager evaluation mode)。
- Spark on K8S 支持 PySpark 和 R ,支持客户端模式(client-mode)。
- Structured Streaming 的各种增强功能。 例如,连续处理(continuous processing)中的有状态操作符。
- 内置数据源的各种性能改进。 例如,Parquet 嵌套模式修剪(schema pruning)。
- 支持 Scala 2.12。
具体的介绍可以参见《即将发布的 Apache Spark 2.4 都有哪些新功能》 文章的介绍。以下是各个模块解决的 ISSUE 列表。
Core and Spark SQL
- 主要功能
- Barrier Execution Mode: [SPARK-24374] 调度器中支持 Barrier Execution Mode,使得 Spark 可以更好地和 机器学习框架进行整合;
- Scala 2.12 Support: [SPARK-14220] 添加对 Scala 2.12 的支持。 现在,您可以使用 Scala 2.12 构建 Spark ,并使用 Scala 2.12 中编写 Spark 应用程序。
- Higher-order functions: [SPARK-23899] 添加许多新的内置函数,包括高阶函数,可以更轻松地处理复杂数据类型。
- Built-in Avro data source: [SPARK-24768] Inline Spark-Avro package with logical type support, better performance and usability.
- API
- [SPARK-24035] SQL syntax for Pivot
- [SPARK-24940] Coalesce and Repartition Hint for SQL Queries
- [SPARK-19602] Support column resolution of fully qualified column name
- [SPARK-21274] Implement EXCEPT ALL and INTERSECT ALL
- Performance and stability
- [SPARK-16406] Reference resolution for large number of columns should be faster
- [SPARK-23486] Cache the function name from the external catalog for lookupFunctions
- [SPARK-23803] Support Bucket Pruning
- [SPARK-24802] Optimization Rule Exclusion
- [SPARK-4502] Nested schema pruning for Parquet tables
- [SPARK-24296] Support replicating blocks larger than 2 GB
- [SPARK-24307] Support sending messages over 2GB from memory
- [SPARK-23243] Shuffle+Repartition on an RDD could lead to incorrect answers
- [SPARK-25181] Limited the size of BlockManager master and slave thread pools, lowering memory overhead when networking is slow
- Connectors
- [SPARK-23972] Update Parquet from 1.8.2 to 1.10.0
- [SPARK-25419] Parquet predicate pushdown improvement
- [SPARK-23456] Native ORC reader is on by default
- [SPARK-22279] Use native ORC reader to read Hive serde tables by default
- [SPARK-21783] Turn on ORC filter push-down by default
- [SPARK-24959] Speed up count() for JSON and CSV
- [SPARK-24244] Parsing only required columns to the CSV parser
- [SPARK-23786] CSV schema validation - column names are not checked
- [SPARK-24423] Option query for specifying the query to read from JDBC
- [SPARK-22814] Support Date/Timestamp in JDBC partition column
- [SPARK-24771] Update Avro from 1.7.7 to 1.8
- Kubernetes Scheduler Backend
- [SPARK-23984] PySpark bindings for K8S
- [SPARK-24433] R bindings for K8S
- [SPARK-23146] Support client mode for Kubernetes cluster backend
- [SPARK-23529] Support for mounting K8S volumes
- PySpark
- [SPARK-24215] Implement eager evaluation for DataFrame APIs
- [SPARK-22274] User-defined aggregation functions with Pandas UDF
- [SPARK-22239] User-defined window functions with Pandas UDF
- [SPARK-24396] Add Structured Streaming ForeachWriter for Python
- [SPARK-23874] Upgrade Apache Arrow to 0.10.0
- [SPARK-25004] Add spark.executor.pyspark.memory limit
- [SPARK-23030] Use Arrow stream format for creating from and collecting Pandas DataFrames
- [SPARK-24624] Support mixture of Python UDF and Scalar Pandas UDF
- Other notable changes
- [SPARK-24596] Non-cascading Cache Invalidation
- [SPARK-23880] Do not trigger any job for caching data
- [SPARK-23510] Support Hive 2.2 and Hive 2.3 metastore
- [SPARK-23711] Add fallback generator for UnsafeProjection
- [SPARK-24626] Parallelize location size calculation in Analyze Table command
Programming guides: Spark RDD Programming Guide and Spark SQL, DataFrames and Datasets Guide.
Structured Streaming
- Major features
- [SPARK-24565] Exposed the output rows of each microbatch as a DataFrame using foreachBatch (Python, Scala, and Java)
- [SPARK-24396] Added Python API for foreach and ForeachWriter
- [SPARK-25005] Support “kafka.isolation.level” to read only committed records from Kafka topics that are written using a transactional producer.
- Other notable changes
- [SPARK-24662] Support the LIMIT operator for streams in Append or Complete mode
- [SPARK-24763] Remove redundant key data from value in streaming aggregation
- [SPARK-24156] Faster generation of output results and/or state cleanup with stateful operations (mapGroupsWithState, stream-stream join, streaming aggregation, streaming dropDuplicates) when there is no data in the input stream.
- [SPARK-24730] Support for choosing either the min or max watermark when there are multiple input streams in a query
- [SPARK-25399] Fixed a bug where reusing execution threads from continuous processing for microbatch streaming can result in a correctness issue
- [SPARK-18057] Upgraded Kafka client version from 0.10.0.1 to 2.0.0
Programming guide: Structured Streaming Programming Guide.
MLlib
- Major features
- [SPARK-22666] Spark datasource for image format
- Other notable changes
- [SPARK-22119] Add cosine distance measure to KMeans/BisectingKMeans/Clustering evaluator
- [SPARK-10697] Lift Calculation in Association Rule mining
- [SPARK-14682] Provide evaluateEachIteration method or equivalent for spark.ml GBTs
- [SPARK-7132] Add fit with validation set to spark.ml GBT
- [SPARK-15784] Add Power Iteration Clustering to spark.ml
- [SPARK-15064] Locale support in StopWordsRemover
- [SPARK-21741] Python API for DataFrame-based multivariate summarizer
- [SPARK-21898] Feature parity for KolmogorovSmirnovTest in MLlib
- [SPARK-10884] Support prediction on single instance for regression and classification related models
- [SPARK-23783] Add new generic export trait for ML pipelines
- [SPARK-11239] PMML export for ML linear regression
Programming guide: Machine Learning Library (MLlib) Guide.
SparkR
- Major features
- [SPARK-25393] Adding new function from_csv()
- [SPARK-21291] add R partitionBy API in DataFrame
- [SPARK-25007] Add array_intersect/array_except/array_union/shuffle to SparkR
- [SPARK-25234] avoid integer overflow in parallelize
- [SPARK-25117] Add EXCEPT ALL and INTERSECT ALL support in R
- [SPARK-24537] Add array_remove / array_zip / map_from_arrays / array_distinct
- [SPARK-24187] Add array_join function to SparkR
- [SPARK-24331] Adding arrays_overlap, array_repeat, map_entries to SparkR
- [SPARK-24198] Adding slice function to SparkR
- [SPARK-24197] Adding array_sort function to SparkR
- [SPARK-24185] add flatten function to SparkR
- [SPARK-24069] Add array_min / array_max functions
- [SPARK-24054] Add array_position function / element_at functions
- [SPARK-23770] Add repartitionByRange API in SparkR
Programming guide: SparkR (R on Spark).
GraphX
- Major features
- [SPARK-25268] run Parallel Personalized PageRank throws serialization Exception
Programming guide: GraphX Programming Guide.
Deprecations
- MLlib
- [SPARK-23451] Deprecate KMeans computeCost
- [SPARK-25345] Deprecate readImages APIs from ImageSchema
Changes of behavior
- Spark Core
- [SPARK-25088] Rest Server default & doc updates
- Spark SQL
- [SPARK-23549] Cast to timestamp when comparing timestamp with date
- [SPARK-24324] Pandas Grouped Map UDF should assign result columns by name
- [SPARK-23425] load data for hdfs file path with wildcard usage is not working properly
- [SPARK-23173] from_json can produce nulls for fields which are marked as non-nullable
- [SPARK-24966] Implement precedence rules for set operations
- [SPARK-25708] HAVING without GROUP BY should be global aggregate
- [SPARK-24341] Correctly handle multi-value IN subquery
- [SPARK-19724] Create a managed table with an existed default location should throw an exception
Please read the Migration Guide for all the behavior changes
Known Issues
- [SPARK-25271] CTAS with Hive parquet tables should leverage native parquet source
- [SPARK-24935] Problem with Executing Hive UDAF’s from Spark 2.2 Onwards
- [SPARK-25879] Schema pruning fails when a nested field and top level field are selected
- [SPARK-25906] spark-shell cannot handle
-i
option correctly - [SPARK-25921] Python worker reuse causes Barrier tasks to run without BarrierTaskContext
- [SPARK-25918] LOAD DATA LOCAL INPATH should handle a relative path
原创文章版权归过往记忆大数据(过往记忆)所有,未经许可不得转载。
本文链接: 【Apache Spark 2.4.0 正式发布】(https://www.iteblog.com/archives/2445.html)