Spark SQL

I use the following tools to build Spark jobs:

- gradle - build environment
- scala

Create a new gradle project:

```shell
$ gradle init --type scala-library
```

Add the Spark dependencies to build.gradle:
```groovy
dependencies {
    // We use spark-core_2.10, i.e. Spark built against the 2.10 line of Scala
    compile "org.scala-lang:scala-library:2.10.6"
    // Apache Spark core
    compile "org.apache.spark:spark-core_2.10:1.6.1"
    // Apache Spark SQL
    compile "org.apache.spark:spark-sql_2.10:1.6.1"
    // Apache Spark Hive
    compile "org.apache.spark:spark-hive_2.10:1.6.1"
    // XML data source for Spark DataFrame
    // https://github.com/databricks/spark-xml
    compile "com.databricks:spark-xml_2.10:0.3.2"
}
```
- The spark-sql library is mandatory.
- The spark-hive library is required when you use window functions.
- The spark-xml library is helpful when working with XML files: it automatically converts XML records to DataFrame rows.
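To show how these three libraries fit together, here is a minimal sketch of a job that reads XML with spark-xml and applies a window function, which in Spark 1.6 requires HiveContext from spark-hive. The file name `books.xml`, the `rowTag` value, and the `genre`/`price` column names are assumptions for this sketch, not part of the build above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

object XmlWindowExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("xml-window-example"))
    // HiveContext (not plain SQLContext) is what enables window functions in Spark 1.6
    val sqlContext = new HiveContext(sc)

    // spark-xml maps every <book> element to one DataFrame row;
    // the file and tag names are assumed for illustration
    val books = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .load("books.xml")

    // Rank books by price within each genre - the part that needs spark-hive
    val byGenre = Window.partitionBy("genre").orderBy("price")
    books.withColumn("rank", row_number().over(byGenre)).show()

    sc.stop()
  }
}
```

The code must be submitted through spark-submit rather than run directly, since it expects the master URL and the Spark runtime to come from the submitting environment.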
Add a gradle setting for faster Scala compilation:

```groovy
tasks.withType(ScalaCompile) {
    // Use the Zinc incremental compiler instead of the slower Ant-based one
    scalaCompileOptions.useAnt = false
}
```
Define a gradle task that executes the job defined as a Scala application:

```groovy
task runSparkSQL(type: Exec, dependsOn: jar) {
    workingDir SPARK_HOME
    commandLine "${SPARK_HOME}/${SPARK_SUBMIT}",
                "--class", "sia.sql.App",
                "--jars", "${SPARK_JARS}",
                "--packages", "${SPARK_PACKAGES}",
                "${jar.archivePath}"
}
```
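The task submits the built jar's main class, `sia.sql.App`. A minimal skeleton of such a class could look as follows; the application name and the placeholder query are my assumptions, only the package and object name come from the task definition above.

```scala
package sia.sql

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object App {
  def main(args: Array[String]): Unit = {
    // Master URL and deployment settings come from spark-submit, not from the code
    val conf = new SparkConf().setAppName("spark-sql-job")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)

    // Placeholder query - replace with the job's real logic
    sqlContext.sql("SHOW TABLES").show()

    sc.stop()
  }
}
```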
Add the following settings to gradle.properties:

```properties
SPARK_HOME=.../spark-1.6.1
SPARK_SUBMIT=bin/spark-submit
SPARK_MASTER=spark://localhost:7077
SPARK_JARS=/usr/local/hadoop/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar,/Users/zedar/dev/hadoopdev/spark/extra-lib/spark-hive_2.10-1.5.1.jar
SPARK_PACKAGES=com.databricks:spark-xml_2.10:0.3.2
```
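With these properties in place, building the jar and submitting it to Spark collapses into a single command (assuming the Spark installation pointed to by SPARK_HOME is available):

```shell
$ gradle runSparkSQL
```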