HDFS compression vs. Spark processing time
Processing conditions
| Parameter | Value |
|---|---|
| Number of cores per Spark job | 48 (spark.cores.max) |
| Number of Spark worker instances per node | 2 (SPARK_WORKER_INSTANCES) |
| Memory per worker | 1GB |
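For reference, a minimal sketch of the job-side configuration under these conditions, assuming a standalone cluster (the application name is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: spark.cores.max caps the total number of cores the job may take
// from the standalone cluster. SPARK_WORKER_INSTANCES=2 and the 1GB per-worker
// memory are set on the nodes themselves (conf/spark-env.sh), not in the job.
val conf = new SparkConf()
  .setAppName("hdfs-compression-benchmark")
  .set("spark.cores.max", "48")

val sc = new SparkContext(conf)
```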
Input file size
| Compression method | File size |
|---|---|
| GZIP (non-splittable) | 186.4MB |
| BZIP2 (splittable) | 135.8MB |
| LZO (indexed, splittable) | 271.3MB |
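Splittability is what ultimately drives the results below: a non-splittable GZIP file is read as a single partition regardless of the HDFS block size, while BZIP2 and indexed LZO files yield roughly one partition per block. A quick way to check this, reusing the sc from the configuration sketch above (file paths are illustrative; the LZO case assumes the hadoop-lzo input format is on the classpath):

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import com.hadoop.mapreduce.LzoTextInputFormat   // from hadoop-lzo

// GZIP is not splittable: one partition no matter how many HDFS blocks the file spans.
val gzip  = sc.textFile("hdfs:///benchmark/input.gz")
// BZIP2 is splittable: roughly one partition per HDFS block.
val bzip2 = sc.textFile("hdfs:///benchmark/input.bz2")
// Indexed LZO needs the hadoop-lzo input format so it can be split on its index.
val lzo   = sc.newAPIHadoopFile("hdfs:///benchmark/input.lzo",
  classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])

println(s"gzip=${gzip.partitions.length} bzip2=${bzip2.partitions.length} lzo=${lzo.partitions.length}")
```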
Algorithm
Records are extracted from the text files with the record delimiter set to two newlines (records are separated by a blank line). Each record is processed with a few regular expressions (Java regex). Records are then grouped by key, and each group is sorted so that the time between consecutive records can be calculated correctly. The output is stored in HDFS as a text file.
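A sketch of such a job, again reusing sc from above and assuming a hypothetical record layout of "key timestamp payload" (the actual regular expressions and field layout used in the benchmark are not shown here; paths are illustrative):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Hypothetical record layout: "<key> <epoch-millis> <payload>".
// (?s) lets the payload span the single newlines inside a record.
val Record = """(?s)(\S+)\s+(\d+)\s+(.*)""".r

// Records are separated by a blank line, so override the default line delimiter.
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "\n\n")

val records = sc
  .newAPIHadoopFile("hdfs:///benchmark/input.bz2",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hadoopConf)
  .map { case (_, text) => text.toString }

val deltas = records
  .collect { case Record(key, ts, _) => (key, ts.toLong) }  // regex extraction (Java regex under the hood)
  .groupByKey()                                             // group records by key
  .mapValues { timestamps =>
    val sorted = timestamps.toSeq.sorted                    // sort so deltas are between consecutive records
    sorted.zip(sorted.tail).map { case (a, b) => b - a }
  }

deltas.saveAsTextFile("hdfs:///benchmark/output")           // plain-text output in HDFS
```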
Environment
The Spark job was executed through a Job Server, a non-blocking application built on the Ratpack framework.
Results for HDFS block size: 128MB (default)
| Compression method | Processing time |
|---|---|
| GZIP | 58s |
| BZIP2 | 1.4m |
| LZO | 25s |
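The remaining runs differ only in the HDFS block size of the input files. The post does not show how the files were re-uploaded, but one way to do it is to set the client-side block size before copying the file into HDFS, for example (paths are illustrative):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// The block size of an HDFS file is fixed when the file is written, so each
// input is re-uploaded with an explicit client-side block size (32MB here).
val conf = new Configuration()
conf.setLong("dfs.blocksize", 32L * 1024 * 1024)   // "dfs.block.size" on older Hadoop releases

val fs = FileSystem.get(conf)
fs.copyFromLocalFile(new Path("file:///tmp/input.bz2"), new Path("/benchmark/input.bz2"))
```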
Results for HDFS block size: 32MB
| Compression method | Processing time |
|---|---|
| GZIP | 60s |
| BZIP2 | 24s |
| LZO | 8s |
Results for HDFS block size: 16MB
| Compression method | Processing time |
|---|---|
| GZIP | 58s |
| BZIP2 | 15s |
| LZO | 5s |
Results for HDFS block size: 8MB
| Compression method | Processing time |
|---|---|
| GZIP | 60s |
| BZIP2 | 9s |
| LZO | 4s |