Creating an Apache Hive Table with MongoDB: Setting Up the Environment

The following software is required for the chapter.

  • Hadoop 2.0.0 CDH 4.6
  • Hive 0.10.0 CDH 4.6
  • MongoDB Java Driver 2.11.3
  • MongoDB Storage Handler for Hive 0.0.3
  • MongoDB 2.6.3
  • Eclipse IDE for Java EE Developers
  • Java 7

Later versions of the listed software may also be used. Download the MongoDB storage handler for Hive from https://github.com/yc-huang/Hive-mongo and extract the zip file to a directory. The JAR files for the storage handler are in the Hive-mongo-master/release directory. Use the hive-mongo-0.0.3-jar-with-dependencies.jar file, which has all the required dependencies included. Download the MongoDB Java driver JAR file mongo-java-driver-2.11.3.jar, or a later version, from http://central.maven.org/maven2/org/mongodb/mongo-java-driver/.

Complete the following steps to set up the environment:

  1. We have used Oracle Linux 6.5, installed on Oracle VirtualBox 4.3, but a different Linux distribution may be used instead. Oracle Linux is based on Red Hat Linux, one of the most commonly used Linux distributions. Create a directory for MongoDB and the other software and set its permissions.

mkdir /mongodb

chmod -R 777 /mongodb

cd /mongodb
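The effect of the two commands above can be verified with stat, which prints a directory's permission bits. The following sketch uses a throwaway /tmp path so it is safe to run anywhere; substitute /mongodb on the actual machine.

```shell
# Stand-in for the /mongodb directory used in this chapter.
mkdir -p /tmp/mongodb-demo
chmod -R 777 /tmp/mongodb-demo
# Print the octal permission bits; 777 = read/write/execute for all users.
stat -c '%a' /tmp/mongodb-demo
```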

  2. Download Java 7 and extract the file to the /mongodb directory.

tar zxvf jdk-7u55-linux-i586.gz

  3. Download Hadoop 2.0.0 and extract the tar.gz file to a directory.

wget http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.6.0.tar.gz

tar -xvf hadoop-2.0.0-cdh4.6.0.tar.gz

  4. Create symlinks for the Hadoop bin and conf directories.

ln -s /mongodb/hadoop-2.0.0-cdh4.6.0/bin /mongodb/hadoop-2.0.0-cdh4.6.0/share/hadoop/mapreduce2/bin

ln -s /mongodb/hadoop-2.0.0-cdh4.6.0/etc/hadoop /mongodb/hadoop-2.0.0-cdh4.6.0/share/hadoop/mapreduce2/conf
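The pattern behind these symlinks can be illustrated with throwaway /tmp paths (stand-ins for the chapter's /mongodb install tree): the link created under the MapReduce tree resolves back to the real bin directory.

```shell
# /tmp stand-ins for the real Hadoop directories.
mkdir -p /tmp/hadoop-demo/real/bin /tmp/hadoop-demo/share/mapreduce2
# -sfn: symbolic, force-replace any existing link, treat the link name as a link.
ln -sfn /tmp/hadoop-demo/real/bin /tmp/hadoop-demo/share/mapreduce2/bin
readlink /tmp/hadoop-demo/share/mapreduce2/bin   # prints /tmp/hadoop-demo/real/bin
```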

  5. Configure Hadoop in the core-site.xml and hdfs-site.xml configuration files. In core-site.xml, which is listed below, set the fs.defaultFS and hadoop.tmp.dir properties.

<?xml version="1.0" encoding="UTF-8"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>

<name>fs.defaultFS</name>

<value>hdfs://10.0.2.15:8020</value>

</property>

<property>

<name>hadoop.tmp.dir</name>

<value>file:///var/lib/hadoop-0.20/cache</value>

</property>

</configuration>

  6. Create the directory specified in the hadoop.tmp.dir property and set its permissions to global (777).

mkdir -p /var/lib/hadoop-0.20/cache

chmod -R 777 /var/lib/hadoop-0.20/cache

  7. In the hdfs-site.xml configuration file, which is listed below, set the dfs.permissions.superusergroup, dfs.namenode.name.dir, dfs.replication, and dfs.permissions properties.

<?xml version="1.0" encoding="UTF-8"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>

<name>dfs.permissions.superusergroup</name>

<value>hadoop</value>

</property>

<property>

<name>dfs.namenode.name.dir</name>

<value>file:///data/1/dfs/nn</value>

</property>

<property>

<name>dfs.replication</name>

<value>1</value>

</property>

<property>

<name>dfs.permissions</name>

<value>false</value>

</property>

</configuration>

  8. Create the NameNode storage directory and set its permissions.

mkdir -p /data/1/dfs/nn

chmod -R 777 /data/1/dfs/nn

  9. Download and install Hive 0.10.0 CDH 4.6.

wget http://archive.cloudera.com/cdh4/cdh/4/hive-0.10.0-cdh4.6.0.tar.gz

tar -xvf hive-0.10.0-cdh4.6.0.tar.gz

  10. Create the hive-site.xml file from the template.

cd /mongodb/hive-0.10.0-cdh4.6.0/conf

cp hive-default.xml.template hive-site.xml

  11. By default, Hive uses the embedded metastore. We shall use the remote metastore, for which the hive.metastore.uris property must be set to the remote metastore URI. Also set the hive.metastore.warehouse.dir property to the Hive storage directory, the directory in which Hive databases and tables are stored. The hive-site.xml configuration file is listed:

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>

<name>hive.metastore.warehouse.dir</name>

<value>hdfs://10.0.2.15:8020/user/hive/warehouse</value>

</property>

<property>

<name>hive.metastore.uris</name>

<value>thrift://localhost:10000</value>

</property>

</configuration>

  12. Create the HDFS directory specified in the hive.metastore.warehouse.dir property and set its permissions.

hadoop dfs -mkdir hdfs://10.0.2.15:8020/user/hive/warehouse

hadoop dfs -chmod -R g+w hdfs://10.0.2.15:8020/user/hive/warehouse

  13. Download and extract the MongoDB 2.6.3 file.

curl -O http://downloads.mongodb.org/linux/mongodb-linux-i686-2.6.3.tgz

tar -zxvf mongodb-linux-i686-2.6.3.tgz

  14. Copy the mongo-java-driver-2.11.3.jar and hive-mongo-0.0.3-jar-with-dependencies.jar files to the /mongodb/hive-0.10.0-cdh4.6.0/lib directory. Set the environment variables for Hadoop, Hive, Java, and MongoDB in the bash shell.

vi ~/.bashrc

export HADOOP_PREFIX=/mongodb/hadoop-2.0.0-cdh4.6.0

export HADOOP_CONF=$HADOOP_PREFIX/etc/hadoop

export MONGO_HOME=/mongodb/mongodb-linux-i686-2.6.3

export HIVE_HOME=/mongodb/hive-0.10.0-cdh4.6.0

export HIVE_CONF=$HIVE_HOME/conf

export JAVA_HOME=/mongodb/jdk1.7.0_55

export HADOOP_MAPRED_HOME=/mongodb/hadoop-2.0.0-cdh4.6.0/bin

export HADOOP_HOME=/mongodb/hadoop-2.0.0-cdh4.6.0/share/hadoop/mapreduce2

export HADOOP_CLASSPATH=$HADOOP_HOME/*:$HADOOP_HOME/lib/*:$HIVE_HOME/lib/*:/mongodb/mongo-java-driver-2.11.3.jar:/mongodb/hive-mongo-0.0.3-jar-with-dependencies.jar:$HIVE_CONF

export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_MAPRED_HOME:$HIVE_HOME/bin:$MONGO_HOME/bin
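A quick sanity check for the exports above: print each variable so a typo or an empty value is easy to spot. The paths are the chapter's install locations; adjust them if the software was extracted elsewhere.

```shell
# Re-declare the variables from ~/.bashrc (same values as above).
export HADOOP_PREFIX=/mongodb/hadoop-2.0.0-cdh4.6.0
export HADOOP_HOME=$HADOOP_PREFIX/share/hadoop/mapreduce2
export HIVE_HOME=/mongodb/hive-0.10.0-cdh4.6.0
export MONGO_HOME=/mongodb/mongodb-linux-i686-2.6.3
export JAVA_HOME=/mongodb/jdk1.7.0_55
# Echo each one; an empty right-hand side means the export did not take effect.
for v in HADOOP_PREFIX HADOOP_HOME HIVE_HOME MONGO_HOME JAVA_HOME; do
  eval "echo $v=\$$v"
done
```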

  15. Format the NameNode and start HDFS, which comprises the NameNode and the DataNode. The namenode and datanode commands run in the foreground, so run each in a separate terminal.

hadoop namenode -format

hadoop namenode

hadoop datanode

  16. Create a directory (and set its permissions) in HDFS in which to put the Hive directory, to make Hive available in the runtime classpath.

hdfs dfs -mkdir hdfs://localhost:8020/mongodb

hdfs dfs -chmod -R g+w hdfs://localhost:8020/mongodb

  17. Put the Hive directory in HDFS.

hdfs dfs -put /mongodb/hive-0.10.0-cdh4.6.0 hdfs://localhost:8020/mongodb

Source: Deepak Vohra (2015), Pro MongoDB Development, Apress, 1st edition.
