HDFS compression vs. Spark processing time

Processing conditions

| Setting | Value |
| --- | --- |
| Number of cores per Spark job | 48 (`spark.cores.max`) |
| Number of Spark worker instances per node | 2 (`SPARK_WORKER_INSTANCES`) |
| Memory per worker | 1 GB |
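
A minimal sketch of how these values could be applied on a standalone cluster: `spark.cores.max` is a per-job property, while the worker count and memory are standard `spark-env.sh` variables set on each node (the application name below is illustrative).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Per-job setting: cap the total number of cores the application may claim.
val conf = new SparkConf()
  .setAppName("hdfs-compression-benchmark")
  .set("spark.cores.max", "48")

val sc = new SparkContext(conf)

// Per-node settings live in conf/spark-env.sh on every worker host:
//   SPARK_WORKER_INSTANCES=2   # two worker daemons per node
//   SPARK_WORKER_MEMORY=1g     # 1 GB of memory per worker
```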

Input file size

| Compression method | File size |
| --- | --- |
| GZIP (non-splittable) | 186.4 MB |
| BZIP2 (splittable) | 135.8 MB |
| LZO (indexed, splittable) | 271.3 MB |
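
Splittability is the property this benchmark exercises: a non-splittable gzip file is always read as a single partition, while bzip2 and indexed LZO can be split across HDFS blocks. A rough way to confirm this, assuming hypothetical input paths and the hadoop-lzo library on the classpath:

```scala
import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("splittability-check"))

// gzip and bzip2 are decoded transparently by textFile.
val gz  = sc.textFile("/data/input.gz")   // hypothetical path
val bz2 = sc.textFile("/data/input.bz2")  // hypothetical path

// Indexed LZO needs the input format from the hadoop-lzo package.
val lzo = sc.newAPIHadoopFile("/data/input.lzo",
  classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])

// A non-splittable gzip file yields one partition; bzip2 and indexed LZO
// yield roughly one partition per HDFS block once the file spans several blocks.
println(s"gzip=${gz.partitions.length} bzip2=${bz2.partitions.length} lzo=${lzo.partitions.length}")
```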

Algorithm

Records are extracted from text files with the record delimiter set to two newline characters. Every record is processed with a few regular expressions (Java regexps). Records are grouped by key, and each group is sorted so that the time between consecutive records can be calculated correctly. The output is stored in HDFS as a text file.
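
A minimal Scala/Spark sketch of that pipeline, assuming a hypothetical record layout of `<key> <epoch-millis> ...` (the actual regexps and field layout are not given here); the double-newline record delimiter is set through `textinputformat.record.delimiter`:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object CompressionBenchmark {

  // Hypothetical record pattern: "<key> <epoch-millis> ..." -- stand-in for the real regexps.
  private val RecordPattern = """(?s)^(\S+)\s+(\d+).*""".r

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hdfs-compression-benchmark"))

    // Use a double newline as the record delimiter instead of the default single newline.
    val hadoopConf = new Configuration(sc.hadoopConfiguration)
    hadoopConf.set("textinputformat.record.delimiter", "\n\n")

    val records = sc
      .newAPIHadoopFile(args(0), classOf[TextInputFormat],
        classOf[LongWritable], classOf[Text], hadoopConf)
      .map { case (_, text) => text.toString }

    // Extract (key, timestamp) with a regexp, group by key, sort each group by time
    // and compute the gaps between consecutive records.
    val gapsPerKey = records
      .collect { case RecordPattern(key, ts) => (key, ts.toLong) }
      .groupByKey()
      .mapValues { times =>
        val sorted = times.toSeq.sorted
        sorted.zip(sorted.tail).map { case (earlier, later) => later - earlier }
      }

    // Store the result in HDFS as plain text.
    gapsPerKey
      .map { case (key, gaps) => s"$key\t${gaps.mkString(",")}" }
      .saveAsTextFile(args(1))

    sc.stop()
  }
}
```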

Environment

The Spark job was executed through a Job Server, a non-blocking application built on the Ratpack framework.

Results for HDFS block size: 128MB (default)

| Compression method | Processing time |
| --- | --- |
| GZIP | 58 s |
| BZIP2 | 1.4 min |
| LZO | 25 s |

Results for HDFS block size: 32MB

| Compression method | Processing time |
| --- | --- |
| GZIP | 60 s |
| BZIP2 | 24 s |
| LZO | 8 s |

Results for HDFS block size: 16MB

| Compression method | Processing time |
| --- | --- |
| GZIP | 58 s |
| BZIP2 | 15 s |
| LZO | 5 s |

Results for HDFS block size: 8MB

| Compression method | Processing time |
| --- | --- |
| GZIP | 60 s |
| BZIP2 | 9 s |
| LZO | 4 s |
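
The block-size variants above correspond to re-writing the same input files with a different `dfs.blocksize`, which is a client-side setting and only affects files written with that configuration. A hedged sketch using the HDFS client API (the paths and the 8 MB size are illustrative):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Re-upload the input with an 8 MB block size. dfs.blocksize is read on the
// client side, so it applies to files created through this configuration.
val conf = new Configuration()
conf.set("dfs.blocksize", (8 * 1024 * 1024).toString)

val fs = FileSystem.get(conf)
FileUtil.copy(
  fs, new Path("/data/input.bz2"),      // hypothetical source path
  fs, new Path("/data-8mb/input.bz2"),  // hypothetical destination path
  false,                                // do not delete the source
  conf)
```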
