Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs. It also provides a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph computation, and Spark Streaming.
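As a quick taste of those high-level APIs, here is a minimal Scala sketch, assuming the Spark 1.x SparkContext style; the application name and input path are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

object QuickCount {
  def main(args: Array[String]): Unit = {
    // Run locally with two worker threads; the app name is arbitrary.
    val conf = new SparkConf().setAppName("QuickCount").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Load a text file as an RDD and count the lines mentioning Spark.
    val lines = sc.textFile("README.md") // hypothetical input file
    println(lines.filter(_.contains("Spark")).count())

    sc.stop()
  }
}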
Downloading
You can get Spark from the downloads page of the Spark project website. This document is based on Spark 1.1.0. The downloads page contains Spark packages pre-built for several popular HDFS versions. If you would rather build Spark from source yourself, see building Spark with Maven.
Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). Running Spark locally on one machine is easy: all you need is to have Java installed, with the PATH and JAVA_HOME environment variables set accordingly.
Spark runs on Java 6+ and Python 2.6+. Spark 1.1.0 uses Scala 2.10, so you will need to use a compatible Scala version (2.10.x).
Running the Examples and Shell
Spark comes with several sample programs in the examples/src/main directory, with Scala, Java and Python versions. To run one of the Java or Scala examples, use bin/run-example in the top-level Spark directory:
./bin/run-example SparkPi 10
You can also run Spark interactively through a Scala shell, which is a great way to learn the framework:
./bin/spark-shell --master local[2]
The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing. For a full list of options, run with the --help option:
./bin/spark-shell --help
Usage: ./bin/spark-shell [options]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor.
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 512M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--help, -h Show this help message and exit
--verbose, -v Print additional debug output
Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1).
--supervise If given, restarts the driver on failure.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
YARN-only:
--executor-cores NUM Number of cores per executor (Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
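To make the shell session concrete, here is a minimal sequence you might type at the scala> prompt; sc is the SparkContext that the shell creates for you, and the data is arbitrary:

scala> val data = sc.parallelize(1 to 100)
scala> data.reduce(_ + _)
res0: Int = 5050

Because local[2] was requested above, both steps run in two threads on the local machine.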
Spark also provides a Python API. To run Spark interactively in a Python interpreter, use bin/pyspark:
./bin/pyspark --master local[2]
Example programs are also provided in Python, and can be run with the following command:
./bin/spark-submit examples/src/main/python/pi.py 10
Running Spark on a Cluster
The Spark cluster mode overview page explains the key concepts of running on a cluster. Spark can run both by itself and over several existing cluster managers. It currently provides the following options for deployment (a code sketch follows the list below):
1. Amazon EC2: our EC2 scripts let you launch a cluster in about 5 minutes
2. Standalone Deploy Mode: the simplest way to deploy a Spark cluster
3. Apache Mesos
4. Hadoop YARN
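Whichever manager you pick, an application selects it through the same master URL mechanism shown for spark-shell above. A minimal sketch in the Spark 1.x SparkConf style, with hypothetical host names:

import org.apache.spark.{SparkConf, SparkContext}

// The master URL selects where the application runs, e.g.:
//   local[N]            - locally with N threads
//   spark://host:7077   - a standalone cluster (hypothetical host)
//   mesos://host:5050   - a Mesos cluster (hypothetical host)
//   yarn-client         - a YARN cluster, in client deploy mode
val conf = new SparkConf()
  .setAppName("ClusterExample")
  .setMaster("spark://master-host:7077") // hypothetical standalone master
val sc = new SparkContext(conf)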


