Add LZO compression codecs to Apache Hadoop and Spark

LZO is a compression format for files stored in Hadoop's HDFS. It offers a valuable combination of speed and compression ratio. Thanks to hadoop-lzo, .lzo files can also be made splittable once they are indexed.

  • Install the lzo library and the lzop tool [OSX].

$ brew install lzo lzop
  • Find where the headers and libraries are installed

$ brew list lzo

The output should look as follows:

/usr/local/Cellar/lzo/2.06/include/lzo/ (13 files)
/usr/local/Cellar/lzo/2.06/lib/liblzo2.2.dylib
/usr/local/Cellar/lzo/2.06/lib/ (2 other files)
/usr/local/Cellar/lzo/2.06/share/doc/ (7 files)
  • Clone the hadoop-lzo repository.

$ git clone https://github.com/twitter/hadoop-lzo
$ cd hadoop-lzo
  • Build the project (Maven required).

$ C_INCLUDE_PATH=/usr/local/Cellar/lzo/2.06/include/lzo/ LIBRARY_PATH=/usr/local/Cellar/lzo/2.06/lib/ mvn clean install
  • Copy the libraries into the Hadoop installation directory. We assume that HADOOP_INSTALL points to the Hadoop installation folder (for example /usr/local/hadoop).

$ cp target/hadoop-lzo-0.4.20-SNAPSHOT.jar $HADOOP_INSTALL/lib
$ mkdir -p $HADOOP_INSTALL/lib/native/lzo
$ cp -r target/native/* $HADOOP_INSTALL/lib/native/lzo
  • Add the hadoop-lzo jar and the native libraries to Hadoop's classpath and library path. Do this either in ~/.bash_profile or in $HADOOP_INSTALL/etc/hadoop/hadoop-env.sh

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_INSTALL/lib/native/osx:$HADOOP_INSTALL/lib/native/lzo"
  • Add the LZO compression codecs to Hadoop's $HADOOP_INSTALL/etc/hadoop/core-site.xml

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
  • Add the LZO dependencies to the Apache Spark configuration in $SPARK_INSTALL/conf/spark-env.sh

export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_INSTALL/lib/native/osx:$HADOOP_INSTALL/lib/native/lzo
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar
  • Add the LZO compression codec to the Hadoop Configuration instance that you pass to the SparkContext on the driver, as in the sketch below

conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec");
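
For illustration, here is a minimal Scala sketch of a driver that registers the codec and writes an RDD as LZO-compressed text files. The application name and output path are placeholders, and it assumes the hadoop-lzo jar and native libraries are visible to the driver and executors as configured above.

import org.apache.spark.{SparkConf, SparkContext}
import com.hadoop.compression.lzo.LzopCodec

val sc = new SparkContext(new SparkConf().setAppName("lzo-example"))

// Register the LZOP codec on the Hadoop Configuration used by this SparkContext
sc.hadoopConfiguration.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")

// Write each partition of the RDD as an .lzo file
val lines = sc.parallelize(Seq("first line", "second line"))
lines.saveAsTextFile("output-lzo", classOf[LzopCodec])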
  • Convert a file (for example a .bz2 file) to the LZO format and import the new file into HDFS

$ bzip2 -d --stdout file.bz2 | lzop -o file.lzo
$ hdfs dfs -put file.lzo input
  • Index LZO-compressed files directly in HDFS

$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer input/file.lzo

or index all LZO files in the input folder

$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer input

or index the LZO files with a MapReduce job

$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer input
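
With the index in place, the file can be read from Spark with splittable input splits. The following is a rough Scala sketch, assuming the classpath and library path configured above; LzoTextInputFormat ships with the hadoop-lzo jar.

import org.apache.hadoop.io.{LongWritable, Text}
import com.hadoop.mapreduce.LzoTextInputFormat

// Each index entry marks a compressed-block boundary, so a large .lzo file is read as multiple splits
val records = sc.newAPIHadoopFile(
  "input/file.lzo",
  classOf[LzoTextInputFormat],
  classOf[LongWritable],
  classOf[Text])

// Copy the Text values to plain strings (Hadoop reuses Writable instances)
val lines = records.map { case (_, value) => value.toString }
println(lines.count())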
