HDFS compression vs. Spark processing time
Processing conditions
| Parameter | Value |
|---|---|
| Number of cores per Spark job | 48 (spark.cores.max) |
| Number of Spark worker instances per node | 2 (SPARK_WORKER_INSTANCES) |
| Memory per worker | 1GB |
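For reference, a minimal sketch of the job-side configuration under these conditions, assuming a standalone cluster (the application name is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: spark.cores.max caps the total number of cores the job may take
// from the standalone cluster. SPARK_WORKER_INSTANCES=2 and the 1GB per-worker
// memory are set on the nodes themselves (conf/spark-env.sh), not in the job.
val conf = new SparkConf()
  .setAppName("hdfs-compression-benchmark")
  .set("spark.cores.max", "48")

val sc = new SparkContext(conf)
```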
Input file size
| Compression method | File size |
|---|---|
| GZIP (non-splittable) | 186.4MB |
| BZIP2 (splittable) | 135.8MB |
| LZO (indexed, splittable) | 271.3MB |
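Splittability is what ultimately drives the results below: a non-splittable GZIP file is read as a single partition regardless of the HDFS block size, while BZIP2 and indexed LZO files yield roughly one partition per block. A quick way to check this, reusing the sc from the configuration sketch above (file paths are illustrative; the LZO case assumes the hadoop-lzo input format is on the classpath):

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import com.hadoop.mapreduce.LzoTextInputFormat   // from hadoop-lzo

// GZIP is not splittable: one partition no matter how many HDFS blocks the file spans.
val gzip  = sc.textFile("hdfs:///benchmark/input.gz")
// BZIP2 is splittable: roughly one partition per HDFS block.
val bzip2 = sc.textFile("hdfs:///benchmark/input.bz2")
// Indexed LZO needs the hadoop-lzo input format so it can be split on its index.
val lzo   = sc.newAPIHadoopFile("hdfs:///benchmark/input.lzo",
  classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])

println(s"gzip=${gzip.partitions.length} bzip2=${bzip2.partitions.length} lzo=${lzo.partitions.length}")
```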
Algorithm
Records are extracted from the text files with the record delimiter set to two newlines (records are separated by a blank line). Each record is processed with a few regular expressions (Java regex). Records are then grouped by key, and each group is sorted so that the time between consecutive records can be calculated correctly. The output is stored in HDFS as a text file.
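A sketch of such a job, again reusing sc from above and assuming a hypothetical record layout of "key timestamp payload" (the actual regular expressions and field layout used in the benchmark are not shown here; paths are illustrative):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Hypothetical record layout: "<key> <epoch-millis> <payload>".
// (?s) lets the payload span the single newlines inside a record.
val Record = """(?s)(\S+)\s+(\d+)\s+(.*)""".r

// Records are separated by a blank line, so override the default line delimiter.
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "\n\n")

val records = sc
  .newAPIHadoopFile("hdfs:///benchmark/input.bz2",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hadoopConf)
  .map { case (_, text) => text.toString }

val deltas = records
  .collect { case Record(key, ts, _) => (key, ts.toLong) }  // regex extraction (Java regex under the hood)
  .groupByKey()                                             // group records by key
  .mapValues { timestamps =>
    val sorted = timestamps.toSeq.sorted                    // sort so deltas are between consecutive records
    sorted.zip(sorted.tail).map { case (a, b) => b - a }
  }

deltas.saveAsTextFile("hdfs:///benchmark/output")           // plain-text output in HDFS
```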
Environment
The Spark job was executed through a Job Server, a non-blocking application built on the Ratpack framework.
Results for HDFS block size: 128MB (default)
| Compression method | Processing time |
|---|---|
| GZIP | 58s |
| BZIP2 | 1.4m |
| LZO | 25s |
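The remaining runs differ only in the HDFS block size of the input files. The post does not show how the files were re-uploaded, but one way to do it is to set the client-side block size before copying the file into HDFS, for example (paths are illustrative):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// The block size of an HDFS file is fixed when the file is written, so each
// input is re-uploaded with an explicit client-side block size (32MB here).
val conf = new Configuration()
conf.setLong("dfs.blocksize", 32L * 1024 * 1024)   // "dfs.block.size" on older Hadoop releases

val fs = FileSystem.get(conf)
fs.copyFromLocalFile(new Path("file:///tmp/input.bz2"), new Path("/benchmark/input.bz2"))
```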
Results for HDFS block size: 32MB
| Compression method | Processing time |
|---|---|
| GZIP | 60s |
| BZIP2 | 24s |
| LZO | 8s |
Results for HDFS block size: 16MB
| Compression method | Processing time |
|---|---|
| GZIP | 58s |
| BZIP2 | 15s |
| LZO | 5s |
Results for HDFS block size: 8MB
| Compression method | Processing time |
|---|---|
| GZIP | 60s |
| BZIP2 | 9s |
| LZO | 4s |