Skip to content

Running Shark on a Cluster

pwendell edited this page Nov 13, 2012 · 53 revisions

This guide describes how to get Shark up and running on a cluster. If you are interested in using Shark on Amazon EC2, see page Running Shark on EC2 to use the set of EC2 scripts to launch a pre-configured cluster in a few mins.

Dependencies

NOTE: Shark is a drop-in tool that can be used on top of existing Hive warehouses. It requires zero modification to an existing Hive deployment/metastore.

Running Shark on a cluster requires the following external components:

  • Scala 2.9.2
  • Spark 0.6
  • The Shark-specific Hive JAR (based on Hive 0.9), included in the Shark binary distribution
  • A HDFS cluster: setup not included in this guide.

Note that unlike earlier version of Spark and Shark, running the latest version on a cluster NO LONGER requires Apache Mesos.

Scala

If you don't have Scala 2.9.2 installed on your system, you can download it by:

$ wget http://www.scala-lang.org/downloads/distrib/files/scala-2.9.2.tgz
$ tar xvfz scala-2.9.2.tgz

Spark

We are using Spark's standalone deployment mode to run Shark on a cluster. You can (click on this click to find more information|http://www.spark-project.org/docs/0.6.0/spark-standalone.html).

Download Spark:

$ wget http://github.com/downloads/mesos/spark/spark-0.6.0-prebuilt.tar.gz
$ tar xvfz spark-0.6.0-prebuilt.tar.gz

Edit spark-0.6.0/conf/slaves to add the hostname of each slave, one per line.

Edit spark-0.6.0/conf/spark-env.sh to set SCALA_HOME and SPARK_WORKER_MEMORY

export SCALA_HOME=/path/to/scala-2.9.2
export SPARK_WORKER_MEMORY=16g

SPARK_WORKER_MEMORY is the maximum amount of memory that Spark can use on each node. Increasing this allows more data to be cached, but be sure to leave memory (e.g. 1 GB) for the OS and any other services that the node may be running.

Shark

Download the binary distribution of Shark 0.2, which also includes the patched Hive 0.9:

$ wget https://github.com/downloads/amplab/shark/shark-0.2-bin.tgz
$ tar xvfz shark-0.2-bin.tgz

Now edit shark-0.2/conf/shark-env.sh to set the HIVE_HOME, SCALA_HOME and MASTER environmental variables:

export HADOOP_HOME=/path/to/hadoop
export HIVE_HOME=/path/to/hive-0.9.0-bin
export MASTER=spark://<MASTER_IP>:7077
export SPARK_HOME=/path/to/spark
export SPARK_MEM=16g

source $SPARK_HOME/conf/spark-env.sh

The last line is there to avoid setting SCALA_HOME in two places. Make sure SPARK_MEM is not larger than SPARK_WORKER_MEMORY set in the previous section.

Copy the Spark and Shark directories to slaves. We assume that the user on the master can SSH to the slaves. For example:

$ while read slave_host; do
$   rsync -Pav spark-0.6.0 shark-0.2 $slave_host
$ done < /path/to/spark/conf/slaves

Launch the cluster by running the Spark cluster launch scripts:

$ cd spark-0.6.0
$ ./bin/start_all.sh

Testing

You can now launch Shark with the command

$ ./bin/shark-withinfo

More detailed information on Spark standalone scripts and options is also available.

To verify that Shark is running, you can try the following example, which creates a table with sample data:

CREATE TABLE src(key INT, value STRING);
LOAD DATA LOCAL INPATH '${env:HIVE_HOME}/examples/files/kv1.txt' INTO TABLE src;
SELECT COUNT(1) FROM src;    
CREATE TABLE src_cached AS SELECT * FROM SRC;
SELECT COUNT(1) FROM src_cached;

See the Shark User Guide for more details on using Shark.