I installed Spark 3.x on my system and submitted a spark-submit command in YARN cluster mode from my Unix terminal, and I got the following error.
(base) vijee@vijee-Lenovo-IdeaPad-S510p:~/spark-3.0.1-bin-hadoop2.7$ bin/spark-submit --master yarn --deploy-mode cluster --py-files /home/vijee/Python/PythonScriptOnYARN.py
It throws the error below:
Error: Missing application resource
Here is the full output of my command:
(base) vijee@vijee-Lenovo-IdeaPad-S510p:~/spark-3.0.1-bin-hadoop2.7$ bin/spark-submit --master yarn --deploy-mode "cluster" --driver-memory 1G --executor-cores 2 --num-executors 1 --executor-memory 2g --py-files /home/vijee/Python/PythonScriptOnYARN.py
Error: Missing application resource.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/vijee/spark-3.0.1-bin-hadoop2.7/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/vijee/hadoop-2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn,
k8s://https://host:port, or local (Default: local[*]).
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories are given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while
.
.
.
Below are my configuration settings:
cat spark-env.sh
export HADOOP_HOME=/home/vijee/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
export SPARK_DIST_CLASSPATH=/home/vijee/hadoop-2.7.7/etc/hadoop:/home/vijee/hadoop-2.7.7/share/hadoop/common/lib/*:/home/vijee/hadoop-2.7.7/share/hadoop/common/*:/home/vijee/hadoop-2.7.7/share/hadoop/hdfs:/home/vijee/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/home/vijee/hadoop-2.7.7/share/hadoop/hdfs/*:/home/vijee/hadoop-2.7.7/share/hadoop/yarn:/home/vijee/hadoop-2.7.7/share/hadoop/yarn/lib/*:/home/vijee/hadoop-2.7.7/share/hadoop/yarn/*:/home/vijee/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/home/vijee/hadoop-2.7.7/share/hadoop/mapreduce/*:/home/vijee/hadoop-2.7.7/contrib/capacity-scheduler/*.jar
cat .bashrc
export HADOOP_HOME=/home/vijee/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/home/vijee/spark-3.0.1-bin-hadoop2.7
export PATH="$PATH:/home/vijee/spark-3.0.1-bin-hadoop2.7/bin"
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
export PYSPARK_PYTHON=/home/vijee/anaconda3/bin/python3
export PYSPARK_DRIVER_PYTHON=/home/vijee/anaconda3/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
cat spark-defaults.conf
spark.master yarn
spark.driver.memory 512m
spark.yarn.am.memory 512m
spark.executor.memory 512m
# Configure history server
spark.eventLog.enabled true
spark.eventLog.dir hdfs://localhost:9000/sparklogs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs://localhost:9000/sparklogs
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
spark.yarn.security.tokens.hive.enabled true
cat PythonScriptOnYARN.py
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql import Row
SS = SparkSession.builder.master("yarn").appName("ProjectYARN").enableHiveSupport().getOrCreate()
sc = SS.sparkContext
rdd1 = sc.textFile("file:///home/vijee/Python/car_data.csv")
rdd2 = rdd1.filter(lambda b: "id" not in b)
rdd3 = rdd2.map(lambda a: a.split(","))
rdd4 = rdd3.map(lambda c: Row(id=int(c[0]),CarBrand=c[1],Price=int(c[2]),Caryear=int(c[3]),Color=c[4]))
df11 = SS.createDataFrame(rdd4)
df11.createOrReplaceTempView("table1")
SS.sql("select * from table1 where CarBrand='GMC'").show()
SS.stop()
Can anyone tell me where I am going wrong and how to fix this?
Remove `--py-files`. That option is for shipping additional modules to the cluster, not for specifying the script you want to run. The application script itself is passed as the positional argument, which is why spark-submit reports "Missing application resource":
bin/spark-submit --master yarn --deploy-mode cluster /home/vijee/Python/PythonScriptOnYARN.py
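If your script does depend on extra Python modules, that is what `--py-files` is for: list the dependency files there and still pass the main script as the last positional argument. A hedged sketch, assuming a hypothetical helper module `helpers.py` that `PythonScriptOnYARN.py` imports:

```shell
# helpers.py is a hypothetical dependency module imported by the main script.
# --py-files distributes it to the executors; the positional argument at the
# end is the application resource that spark-submit was complaining about.
bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --py-files /home/vijee/Python/helpers.py \
  /home/vijee/Python/PythonScriptOnYARN.py
```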