Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, and Python, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph computation, and Spark Streaming.
Downloading
You can get Spark from the downloads page of the Spark project. This document is based on Spark 1.1.0. The downloads page contains Spark packages pre-built for several popular HDFS versions. If you would rather build Spark from source yourself, see building Spark with Maven.
Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). Running Spark on a single machine is easy: all you need is a Java installation, with the PATH and JAVA_HOME environment variables set accordingly.
Spark runs on Java 6+ and Python 2.6+. Since Spark 1.1.0 uses Scala 2.10, you will need a compatible Scala version (2.10.x).
Running the Examples and the Shell
Spark ships with a number of sample programs, located in the examples/src/main directory, with Scala, Java, and Python versions. To run one of the Java or Scala examples, invoke bin/run-example from the top-level Spark directory:
./bin/run-example SparkPi 10
You can also run Spark interactively through the Scala shell, which is a great way to learn the framework:
./bin/spark-shell --master local[2]
The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start with local when testing Spark programs. For a full list of options, run the shell with the --help option:
./bin/spark-shell --help
Usage: ./bin/spark-shell [options]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor.
  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.
  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 512M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.
  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).
  --help, -h                  Show this help message and exit
  --verbose, -v               Print additional debug output

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).
  --supervise                 If given, restarts the driver on failure.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 YARN-only:
  --executor-cores NUM        Number of cores per executor (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
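Once the shell starts, a SparkContext is already available as the variable sc. As a minimal sketch of a first session (assuming the shell was started from the top-level Spark directory, so that README.md resolves; the file name is only an example), you could count the lines of the README:

scala> val textFile = sc.textFile("README.md")   // build an RDD from a local text file
scala> textFile.count()                          // total number of lines in the file
scala> textFile.filter(line => line.contains("Spark")).count()   // lines mentioning Spark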
Spark also provides a Python API. To run Spark interactively in a Python interpreter, use bin/pyspark:
./bin/pyspark --master local[2]
Python versions of the examples are also provided; they can be run with the following command:
./bin/spark-submit examples/src/main/python/pi.py 10
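Beyond the bundled examples, spark-submit is also how you launch your own applications. Below is a minimal sketch of a standalone Scala application against the Spark 1.1.0 API; the object name SimpleApp, the jar name, and the README.md input are illustrative, not part of the original examples:

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    // Leave the master unset here so spark-submit's --master option can supply it
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)

    // Count the lines of the input file that mention Spark
    val numSpark = sc.textFile("README.md").filter(_.contains("Spark")).count()
    println("Lines with Spark: " + numSpark)

    sc.stop()
  }
}

After packaging it into a jar (for example with sbt package), it could be launched locally with something like:

./bin/spark-submit --class SimpleApp --master local[2] simple-app.jar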
Launching on a Cluster
The Spark cluster mode overview explains the key concepts of running on a cluster. Spark can run both by itself and on top of several existing cluster managers. It currently provides the following options for deployment (a brief configuration sketch follows the list):
1. Amazon EC2: the EC2 scripts let you launch a cluster in about 5 minutes
2. Standalone Deploy Mode: the simplest way to deploy a Spark cluster
3. Apache Mesos
4. Hadoop YARN
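Whichever manager you use, the application code stays the same; only the master URL passed to --master (or set in SparkConf) changes. As a hedged sketch, with a hypothetical standalone master running on master-host, a driver could also hard-code the master instead of relying on spark-submit:

import org.apache.spark.{SparkConf, SparkContext}

// spark://host:port addresses a standalone master; a Mesos master would use mesos://host:port
val conf = new SparkConf()
  .setAppName("Cluster Example")
  .setMaster("spark://master-host:7077")  // hypothetical host; 7077 is the standalone default port
val sc = new SparkContext(conf)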