SparkContext, SparkConf, and Spark-Shell
Spark documentation: http://spark.apache.org/docs/2.2.0/rdd-programming-guide.html
Spark download: https://archive.apache.org/dist/spark/
The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster.
To create a SparkContext you first need to build a SparkConf object that contains information about your application.
In other words, the first step of a Spark program is to create a SparkContext, which tells Spark how to connect to the cluster.
Before creating the SparkContext, you build a SparkConf that holds the application's configuration.
```scala
val conf = new SparkConf().setAppName(appName).setMaster(master)
```
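For context, here is a minimal self-contained sketch of the same pattern; the application name, master URL, and the small count job are illustrative placeholders, not values from this article:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextDemo {
  def main(args: Array[String]): Unit = {
    // Application name and master URL are placeholders for illustration.
    val conf = new SparkConf()
      .setAppName("spark-context-demo")
      .setMaster("local[2]")

    // The SparkContext is built from the SparkConf.
    val sc = new SparkContext(conf)

    // Tiny sanity check: distribute a local collection and count it.
    val count = sc.parallelize(1 to 100).count()
    println(s"count = $count")

    // Stop the context when the application is done.
    sc.stop()
  }
}
```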
Spark-Shell options explained
Option | Description |
---|---|
--master MASTER_URL | spark://host:port, mesos://host:port, yarn, or local. |
--name NAME | A name for your application. |
--jars JARS | Comma-separated list of local jars to include on the driver and executor classpaths. |
--driver-memory MEM | Memory for the driver (e.g. 1000M, 2G) (Default: 1024M). |
--executor-memory MEM | Memory per executor (e.g. 1000M, 2G) (Default: 1G). |
--executor-cores NUM | Number of cores per executor (Default: 1 in YARN mode, or all available cores on the worker in standalone mode). |
--driver-cores NUM | Number of cores used by the driver, only in cluster mode. |
--queue QUEUE_NAME | The YARN queue to submit to (Default: "default"). |
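Putting a few of these options together, a hypothetical spark-shell launch might look like the following; the name, memory sizes, core counts, and queue are placeholder values:

```bash
# Hypothetical spark-shell launch on YARN combining several of the options above.
spark-shell \
  --master yarn \
  --name my-shell-session \
  --driver-memory 2G \
  --executor-memory 2G \
  --executor-cores 2 \
  --queue default
```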
Log output printed during spark-shell startup:
```
Spark context Web UI available at http://ip:4040
Spark context available as 'sc' (master = local[2], app id = local-1577503784325).
Spark session available as 'spark'.
```
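Because the shell has already created these objects, they can be used directly at the scala> prompt; a small sketch (the results in the comments are what these inputs would produce):

```scala
// Both objects already exist inside spark-shell; do not create a new SparkContext.
val rdd = sc.parallelize(1 to 100)
rdd.sum()                  // => 5050.0

// The SparkSession created by the shell is available as `spark`.
spark.range(10).count()    // => 10
```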
The master URL passed at startup can take the following forms:
Master URL | Meaning |
---|---|
local | Run Spark locally with one worker thread (i.e. no parallelism at all). |
local[K] | Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine). |
local[K,F] | Run Spark locally with K worker threads and F maxFailures (see spark.task.maxFailures for an explanation of this variable). |
local[*] | Run Spark locally with as many worker threads as logical cores on your machine. |
local[*,F] | Run Spark locally with as many worker threads as logical cores on your machine and F maxFailures. |
spark://HOST:PORT | Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default. |
spark://HOST1:PORT1,HOST2:PORT2 | Connect to the given Spark standalone cluster with standby masters with Zookeeper. The list must have all the master hosts in the high availability cluster set up with Zookeeper. The port must be whichever each master is configured to use, which is 7077 by default. |
mesos://HOST:PORT | Connect to the given Mesos cluster. The port must be whichever one your Mesos master is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://… To submit with --deploy-mode cluster, the HOST:PORT should be configured to connect to the MesosClusterDispatcher. |
yarn | Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable. |
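A brief sketch of how a master URL is passed programmatically through SparkConf; the standalone host name below is a placeholder:

```scala
import org.apache.spark.SparkConf

// Local mode with 2 worker threads, handy for development and tests.
val localConf = new SparkConf()
  .setAppName("master-url-demo")
  .setMaster("local[2]")

// Standalone cluster; "master-host" is a placeholder, 7077 is the default master port.
val standaloneConf = new SparkConf()
  .setAppName("master-url-demo")
  .setMaster("spark://master-host:7077")
```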
Information about the running application can be viewed in the Spark Web UI (the http://ip:4040 address shown in the startup log).
In Spark 2.x, Spark's jars are kept as separate files (under the jars directory), whereas in 1.x they were bundled into a single assembly jar. This is a tuning point:
when submitting with Spark on YARN, these jars have to be uploaded, and the resource overhead of doing so is very large.
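A common way to reduce this overhead (sketched here; the HDFS path is a placeholder you would choose) is to upload the jars to HDFS once and point spark.yarn.jars at them, so they are not shipped from the client on every submission:

```bash
# Upload the Spark jars to HDFS once; the target path is a placeholder.
hdfs dfs -mkdir -p /spark/jars
hdfs dfs -put $SPARK_HOME/jars/*.jar /spark/jars/

# Point Spark at the uploaded jars (spark.yarn.archive is the zipped alternative),
# so they are not re-uploaded on every submission.
spark-shell --master yarn --conf "spark.yarn.jars=hdfs:///spark/jars/*"
```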
A SparkContext that you create yourself inside spark-shell cannot be used, because the shell has already created one and only one SparkContext may be active per JVM; use the pre-built `sc` instead.
1. Where spark.app.name ("Spark shell") comes from ===> under the hood, spark-shell also uses spark-submit to submit the job.
2. Where spark.app.id (local-1577503784325) comes from ===> it is generated when the SparkContext starts (local-<timestamp> in local mode).
3. java.io.tmpdir points to the system /tmp directory.
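These values can also be checked from inside spark-shell itself; a small sketch (the application id in the comments is just the example from the log above):

```scala
// Read the properties from the SparkContext that spark-shell already created.
sc.getConf.get("spark.app.name")       // e.g. "Spark shell"
sc.getConf.get("spark.app.id")         // e.g. "local-1577503784325"

// Equivalent convenience accessors on SparkContext itself.
sc.appName
sc.applicationId

// Default JVM temporary directory used for scratch files.
System.getProperty("java.io.tmpdir")   // typically "/tmp" on Linux
```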