Running a Spark job from cron

Unix crontab schedules tasks to run at specified times. In simple cases we can use it for scheduling Spark jobs. The use case: Flume collects events from a clickstream and appends them to files in HDFS. Hive/Presto has an external table pointing at the HDFS location where the events are stored. Flume has to roll files often, because only then do new events become visible in the Hive/Presto table. The result is a large number of small files in HDFS. We should have a job that merges them automatically, and we can use crontab to run it at a defined point in time.
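The merge job itself is not shown in this post, so below is a minimal sketch of what it could look like, assuming the events are plain text files (e.g. JSON lines) and that the input and output paths are passed as arguments; the class name `MergeClickstreamFiles` and the target of 8 output files are illustrative choices, not part of the original setup.

    import org.apache.spark.sql.SparkSession

    object MergeClickstreamFiles {
      def main(args: Array[String]): Unit = {
        // Hypothetical arguments: where Flume writes the small files,
        // and where the merged output should land.
        val inputPath  = args(0)
        val outputPath = args(1)

        val spark = SparkSession.builder()
          .appName("merge-clickstream-files")
          .getOrCreate()

        // Read the many small files and rewrite them as a few larger ones.
        // coalesce(8) is an arbitrary target; pick it from the total input
        // size (files close to the HDFS block size work well).
        spark.read.textFile(inputPath)
          .coalesce(8)
          .write
          .text(outputPath)

        spark.stop()
      }
    }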

  1. Put the jar with the Spark job on the server.

  2. Define a crontab task. Here we schedule it to run every 5 minutes (`*/5`):

        $ crontab -e
        */5 * * * * SPARK SUBMIT CMD
    Columns in crontab: minute, hour, day of month, month, day of week. `*` means any value. `*/n` means every `n` units: `*/5` in the minute field runs the job every 5 minutes (at minutes 0, 5, 10, ...), not once at minute `n`.
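
    A few example schedules (`/path/to/merge-job.sh` is a placeholder command, not part of the original post; the trailing `#` comments are harmless because cron runs each line through the shell):

        # minute hour day-of-month month day-of-week  command
        30 2 * * *     /path/to/merge-job.sh   # every day at 02:30
        */5 * * * *    /path/to/merge-job.sh   # every 5 minutes
        0 */2 * * 1-5  /path/to/merge-job.sh   # every 2 hours, Monday to Friday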

The Spark command to run in place of `SPARK SUBMIT CMD` in the crontab entry above:

    . $HOME/.bash_profile; $SPARK_HOME/bin/spark-submit --class YOUR_JOB_CLASS --master spark://HOST:9077 --jars EXTRA_JARS JAR_WITH_JOB JOB_PARAMETERS

Note that the `.` before `$HOME/.bash_profile` is important. It is a synonym for the `source` command and loads the shell variables (such as `$SPARK_HOME`) that cron's minimal environment does not provide.
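
In practice it is often more robust to wrap the whole command in a small script and point crontab at the script instead; the script name, install path, and log file below are placeholders, not part of the original post:

    #!/bin/sh
    # run-merge-job.sh - called from crontab, e.g.:
    #   */5 * * * * /opt/jobs/run-merge-job.sh >> /var/log/merge-job.log 2>&1

    # cron starts with a minimal environment, so load SPARK_HOME etc. first
    . "$HOME/.bash_profile"

    "$SPARK_HOME/bin/spark-submit" \
      --class YOUR_JOB_CLASS \
      --master spark://HOST:9077 \
      --jars EXTRA_JARS \
      JAR_WITH_JOB JOB_PARAMETERS

Redirecting stdout and stderr to a log file, as in the crontab line above, also makes it much easier to see why a scheduled run failed.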
