Add LZO compression codecs to Apache Hadoop and Spark

LZO is a splittable compression format for files stored in Hadoop's HDFS. It offers a valuable combination of speed and compression ratio. Thanks to hadoop-lzo, .lzo files can be made splittable too.

- Install the lzo and lzop packages [OSX]:
$ brew install lzo lzop
- Find where the headers and libraries are installed:
$ brew list lzo
The output should look as follows:
/usr/local/Cellar/lzo/2.06/include/lzo/ (13 files)
/usr/local/Cellar/lzo/2.06/lib/liblzo2.2.dylib
/usr/local/Cellar/lzo/2.06/lib/ (2 other files)
/usr/local/Cellar/lzo/2.06/share/doc/ (7 files)
- Clone the hadoop-lzo repository:
$ git clone https://github.com/twitter/hadoop-lzo
$ cd hadoop-lzo
- Build the project (maven required):
$ C_INCLUDE_PATH=/usr/local/Cellar/lzo/2.06/include/lzo/ LIBRARY_PATH=/usr/local/Cellar/lzo/2.06/lib/ mvn clean install
- Copy the libraries into the Hadoop installation directory. We assume that HADOOP_INSTALL points to the Hadoop installation folder (for example /usr/local/hadoop):
$ cp target/hadoop-lzo-0.4.20-SNAPSHOT.jar $HADOOP_INSTALL/lib
$ mkdir -p $HADOOP_INSTALL/lib/native/lzo
$ cp -r target/native/* $HADOOP_INSTALL/lib/native/lzo
- Add the hadoop-lzo jar and native libraries to Hadoop's classpath and library path. Do it either in ~/.bash_profile or in $HADOOP_INSTALL/etc/hadoop/hadoop-env.sh:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_INSTALL/lib/native/osx:$HADOOP_INSTALL/lib/native/lzo"
- Add the lzo compression codecs to Hadoop's $HADOOP_INSTALL/etc/hadoop/core-site.xml:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.BZip2Codec, com.hadoop.compression.lzo.LzoCodec, com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
- Add the lzo dependencies to the Apache Spark configuration in $SPARK_INSTALL/conf/spark-env.sh:
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_INSTALL/lib/native/osx:$HADOOP_INSTALL/lib/native/lzo
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar
- Add the lzo compression codec to the Hadoop Configuration instance that you pass to the SparkContext (driver) instance:
conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec");
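For context, here is a minimal sketch of a Spark driver that reads an indexed .lzo file through hadoop-lzo's LzoTextInputFormat; the class name, application name and the input/file.lzo path (matching the upload in the following steps) are illustrative assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import com.hadoop.mapreduce.LzoTextInputFormat;

public class LzoReadExample {                                     // illustrative class name
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("lzo-read-example"));

        // Hadoop configuration handed to the input format; register the LZO codec here.
        Configuration conf = new Configuration();
        conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec");

        // LzoTextInputFormat (from hadoop-lzo) uses the .lzo.index file, if present,
        // to split the compressed file across tasks instead of reading it in one piece.
        JavaPairRDD<LongWritable, Text> lines = sc.newAPIHadoopFile(
                "input/file.lzo",                                 // assumed HDFS path (see the steps below)
                LzoTextInputFormat.class, LongWritable.class, Text.class, conf);

        System.out.println("Number of lines: " + lines.count());
        sc.stop();
    }
}

With only the codec registered, sc.textFile would still read the whole .lzo file as a single split; going through LzoTextInputFormat is what makes the index created below pay off.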
- Convert a file (for example bz2) to the lzo format and import the new file into Hadoop's HDFS:
$ bzip2 --decompress --stdout file.bz2 | lzop -o file.lzo
$ hdfs dfs -put file.lzo input
- Index the lzo compressed files directly in HDFS. Indexing creates a file.lzo.index file next to the archive, which lets the LZO input formats split the file across multiple tasks (see the sketch after these commands):
$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer input/file.lzo
or index all lzo files in the input folder:
$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer input
or index lzo files with a MapReduce job:
$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer input
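Once the index files exist, MapReduce jobs benefit the same way as Spark. A minimal, map-only sketch that reads the indexed .lzo files from the input folder and writes them back out as plain text; the class name, job name and output folder are illustrative assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.hadoop.mapreduce.LzoTextInputFormat;

public class LzoPassThrough {                                     // illustrative class name
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "lzo-pass-through");
        job.setJarByClass(LzoPassThrough.class);

        // Reads .lzo input; with a .lzo.index next to each file, every file is split across mappers.
        job.setInputFormatClass(LzoTextInputFormat.class);
        job.setMapperClass(Mapper.class);      // identity mapper
        job.setNumReduceTasks(0);              // map-only job
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("input"));
        FileOutputFormat.setOutputPath(job, new Path("output"));  // assumed output folder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}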