基于社区开发者们的观察,绝大多数的Spark应用程序的瓶颈不在于I/O或者网络,而在于CPU和内存。基于这个事实,开发者们发起了Tungsten项目,而Spark 1.5是Tungsten项目的第一阶段。Tungsten项目主要集中在三个方面,于此来提高Spark应用程序的内存和CPU的效率,使得性能能够接近硬件的限制。
Tungsten项目的三个阶段
内存管理和二进制处理(Memory Management and Binary Processing)
1、避免使用非transient的Java对象(它们以二进制格式存储),这样可以减少GC的开销。
2、通过使用基于内存的密集数据格式,这样可以减少内存的使用情况。
3、更好的内存计算(字节的大小),而不是依赖启发式。
4、对于知道数据类型的操作(比如DataFrame和SQL),我们可以直接对二进制格式进行操作,这样我们就不需要进行系列化和反系列化的操作。
缓存感知计算(Cache-aware Computation)
对aggregations, joins和shuffle操作进行快速排序和hash操作。
代码生成(Code Generation)
1、更快的表达式求值和DataFrame/SQL操作。
2、快速系列化
大多数的Tungsten项目是在DataFrame中进行的,这样使得我们可以获取到关于Application的更多语义。如果有需要,社区还将对Spark的RDD API进行一些改造。下面是Spark 1.5进行的一些关于Tungsten项目的Issues
SPARK-7076 Binary processing compact tuple representation SPARK-7077 Binary processing hash table for aggregation SPARK-7080 Binary processing based aggregate operator SPARK-7081 Faster sort-based shuffle path using binary processing cache-aware sort SPARK-7082 Binary processing external sort-merge join SPARK-7083 Binary processing dimensional join SPARK-7184 Investigate turning codegen on by default SPARK-7190 UTF8String backed by binary data SPARK-7251 Perform sequential scan when iterating over entries in BytesToBytesMap SPARK-7288 Suppress compiler warnings due to use of sun.misc.Unsafe SPARK-7293 Report memory used in aggregations and joins SPARK-7311 Enable in-memory serialized map-side shuffle to work with SQL serializers SPARK-7375 Avoid defensive copying in SQL exchange operator when sort-based shuffle buffers data in serialized form SPARK-7440 Remove physical Distinct operator in favor of Aggregate SPARK-7450 Use UNSAFE.getLong() to speed up BitSetMethods#anySet() SPARK-7517 Rename unsafe module to managedmemory SPARK-7691 Use type-specific row accessor functions in CatalystTypeConverters' toScala functions SPARK-7698 Implement buffer pooling / re-use in ExecutorMemoryManager when using HeapAllocator SPARK-7812 Speed up SQL code generation SPARK-7813 Push code generation into expression definition SPARK-7814 Turn code generation on by default SPARK-7815 Enable UTF8String to work against memory address directly SPARK-7887 Remove EvaluatedType from SQL Expression SPARK-7956 Use Janino to compile SQL expression SPARK-8117 Push codegen into Expression SPARK-8149 Break ExpressionEvaluationSuite down to multiple files SPARK-8154 Remove Term/Code type aliases in code generation SPARK-9700 Pick default page size more intelligently SPARK-9548 BytesToBytesMap could have a destructive iterator SPARK-9228 Combine unsafe and codegen into a single option SPARK-9363 SortMergeJoin operator should support UnsafeRow SPARK-9693 Reserve a page in all unsafe operators to avoid starving an operator SPARK-9703 EnsureRequirements should not add unnecessary shuffles when only ordering requirements are unsatisfied SPARK-8160 Tungsten style external aggregation SPARK-9736 JoinedRow.anyNull should delegate to the underlying rows SPARK-7165 Sort-merge Join for left/right outer joins SPARK-8159 Improve expression function coverage (Spark 1.5) SPARK-8189 Use 100ns precision for TimestampType SPARK-8190 ExpressionEvalHelper.checkEvaluation should also run the optimizer version SPARK-8286 Rewrite UTF8String in Java and move it into unsafe package. SPARK-8301 Improve UTF8String substring/startsWith/endsWith/contains performance SPARK-8303 Move DateUtils into unsafe package SPARK-8305 Improve codegen SPARK-8307 Improve timestamp from parquet SPARK-8317 Do not push sort into shuffle in Exchange operator SPARK-8319 Update logic related to key ordering in shuffle dependencies SPARK-8354 Fix off-by-factor-of-8 error when allocating scratch space in UnsafeFixedWidthAggregationMap SPARK-8446 Add helper functions for testing physical SparkPlan operators SPARK-8498 Fix NullPointerException in error-handling path in UnsafeShuffleWriter SPARK-8579 Support arbitrary object in UnsafeRow SPARK-8713 Support codegen for not thread-safe expressions SPARK-8850 Turn unsafe mode on by default SPARK-8866 Use 1 microsecond (us) precision for TimestampType SPARK-8876 Remove InternalRow type alias in expressions package SPARK-8879 Remove EmptyRow class SPARK-9022 UnsafeProject SPARK-9023 UnsafeExchange SPARK-9024 Unsafe HashJoin SPARK-9050 Remove out-of-date code in Exchange that was obsoleted by SPARK-8317 SPARK-9054 Rename RowOrdering to InterpretedOrdering and use newOrdering to build orderings SPARK-9143 Add planner rule for automatically inserting Unsafe < -> Safe row format converters SPARK-9247 Use BytesToBytesMap in unsafe broadcast join SPARK-9258 Remove all semi join physical operator SPARK-9266 Prevent "managed memory leak detected" exception from masking original exception SPARK-9285 Remove InternalRow's inheritance from Row SPARK-9329 Bring UnsafeRow up to feature parity with other InternalRow implementations SPARK-9331 Indent generated code properly when dumping them in debug code or exception mode SPARK-9334 Remove UnsafeRowConverter in favor of UnsafeProjection SPARK-9336 Remove all extra JoinedRows SPARK-9373 Support StructType in Tungsten style Projection SPARK-9389 Support ArrayType in Tungsten SPARK-9394 CodeFormatter should handle parentheses SPARK-9411 Make page size configurable SPARK-9412 Support records larger than a page size SPARK-9413 Support MapType in Tungsten SPARK-9418 Use sort-merge join as the default shuffle join Improvement RESOLVED Reynold Xin SPARK-9421 Fix null-handling bug in UnsafeRow.getDouble, getFloat(), and get(ordinal, dataType) SPARK-9448 GenerateUnsafeProjection should not share expressions across instances SPARK-9450 [INVALID] HashedRelation.get() could return an Iterator[Row] instead of Seq[Row] SPARK-9457 Sorting improvements SPARK-9464 Add property-based tests for UTF8String SPARK-9738 remove FromUnsafe and add its codegen version to GenerateSafe SPARK-9751 Audit operators to make sure they can support UnsafeRows SPARK-9784 Exchange.isUnsafe should check whether codegen and unsafe are enabled SPARK-9785 HashPartitioning compatibility should consider expression ordering SPARK-9815 Rename PlatformDependent.UNSAFE -> Platform
更多关于Tungsten项目的进展,请关注SPARK-7075和SPARK-9697。
本博客文章除特别声明,全部都是原创!原创文章版权归过往记忆大数据(过往记忆)所有,未经许可不得转载。
本文链接: 【Spark Tungsten项目的三阶段】(https://www.iteblog.com/archives/1502.html)