Spark SQL

I use the following tools to build Spark jobs:

  • gradle - build environment

  • scala - implementation language

Create a new gradle project

$ gradle init --type scala-library

Add Spark dependencies to build.gradle

dependencies {
  // spark-core_2.10 targets the 2.10 line of Scala
  compile "org.scala-lang:scala-library:2.10.6"
  // Apache Spark core
  compile "org.apache.spark:spark-core_2.10:1.6.1"
  // Apache Spark SQL
  compile "org.apache.spark:spark-sql_2.10:1.6.1"
  // Apache Spark Hive
  compile "org.apache.spark:spark-hive_2.10:1.6.1"
  // XML data source for Spark DataFrame
  // https://github.com/databricks/spark-xml
  compile "com.databricks:spark-xml_2.10:0.3.2"
}
  • spark-sql library is mandatory.

  • spark-hive library is required when you use window functions.

  • spark-xml library is helpful when working with XML files. It automatically converts XML records to DataFrame rows.
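The dependencies above can be tied together in a minimal job class. This is only a sketch: the package and class name match the --class argument used in the gradle task below (sia.sql.App), while the input path and the rowTag value are assumptions for illustration.

```scala
package sia.sql

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object App {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("spark-sql-example")
    val sc = new SparkContext(conf)
    // HiveContext (from spark-hive) is needed for window functions
    val sqlContext = new HiveContext(sc)

    // spark-xml: each <record> element becomes one DataFrame row
    // (path and rowTag are hypothetical)
    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")
      .load("data/input.xml")

    df.registerTempTable("records")
    sqlContext.sql("SELECT COUNT(*) FROM records").show()

    sc.stop()
  }
}
```

The DataFrame schema is inferred from the XML structure, so nested elements show up as struct columns.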

Add a gradle setting for faster Scala compilation

tasks.withType(ScalaCompile) {
  // switch from the Ant-based compiler to the faster Zinc-based one
  scalaCompileOptions.useAnt = false
}

Define a gradle task that executes the job, packaged as a Scala application.

task runSparkSQL(type: Exec, dependsOn: jar) {
  workingDir SPARK_HOME
  commandLine "${SPARK_HOME}/${SPARK_SUBMIT}",
              "--class", "sia.sql.App",
              "--jars", "${SPARK_JARS}",
              "--packages", "${SPARK_PACKAGES}",
              "${jar.archivePath}"
}

In gradle.properties add the following settings:

SPARK_HOME=.../spark-1.6.1
SPARK_SUBMIT=bin/spark-submit
SPARK_MASTER=spark://localhost:7077
SPARK_JARS=/usr/local/hadoop/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar,/Users/zedar/dev/hadoopdev/spark/extra-lib/spark-hive_2.10-1.5.1.jar
SPARK_PACKAGES=com.databricks:spark-xml_2.10:0.3.2
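With these properties in place, the job can be submitted from the project root; dependsOn: jar makes sure the jar is rebuilt first:

```shell
$ gradle runSparkSQL
```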
