This repository has been archived by the owner on Nov 16, 2019. It is now read-only.
GetStarted_yarn
Andy Feng edited this page May 31, 2016
- Clone CaffeOnSpark code.
git clone https://github.com/yahoo/CaffeOnSpark.git --recursive
export CAFFE_ON_SPARK=$(pwd)/CaffeOnSpark
- Install Caffe prerequisites per http://caffe.berkeleyvision.org/installation.html
- Create a CaffeOnSpark/caffe-public/Makefile.config
pushd ${CAFFE_ON_SPARK}/caffe-public/
cp Makefile.config.example Makefile.config
echo "INCLUDE_DIRS += ${JAVA_HOME}/include" >> Makefile.config
popd
Uncomment settings as needed:
CPU_ONLY := 1 #if you have CPU only
USE_CUDNN := 1 #if you want to use CUDNN
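Uncommenting these settings by hand is straightforward, but a scripted version can help on repeated setups. The sketch below assumes the settings appear in Makefile.config as commented lines of the form `# NAME := 1`, as they do in Caffe's Makefile.config.example. The function name `uncomment_setting` is hypothetical and not part of CaffeOnSpark.

```shell
# Hypothetical helper: uncomment a "# NAME := value" line in a Makefile.config.
# Usage: uncomment_setting FILE NAME
uncomment_setting() {
  local file="$1" name="$2" tmp rc
  tmp=$(mktemp) || return 1
  # Strip the leading "#" (and any spaces after it) only on the line that
  # assigns NAME. A temp file avoids the differing `sed -i` syntax of GNU and BSD.
  sed -E "s/^#[[:space:]]*(${name}[[:space:]]*:=)/\1/" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  rc=$?
  rm -f "$tmp"
  return "$rc"
}
```

For example, `uncomment_setting ${CAFFE_ON_SPARK}/caffe-public/Makefile.config CPU_ONLY` would enable the CPU-only build and leave every other line untouched.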
- Build CaffeOnSpark
pushd ${CAFFE_ON_SPARK}
make build
popd
export LD_LIBRARY_PATH=${CAFFE_ON_SPARK}/caffe-public/distribute/lib:${CAFFE_ON_SPARK}/caffe-distri/distribute/lib
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-7.0/lib64:/usr/local/mkl/lib/intel64/
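The CUDA and MKL paths above are specific to one machine layout, so it may be worth confirming that every entry of LD_LIBRARY_PATH exists before submitting jobs. The following is a minimal sketch; `missing_lib_dirs` is a hypothetical helper name, not something the CaffeOnSpark scripts provide.

```shell
# Hypothetical helper: print every entry of a colon-separated path list that
# is not an existing directory. Returns 0 if all entries exist, 1 otherwise.
missing_lib_dirs() {
  local path_list="$1" dir failed=0
  local IFS=':'
  # With IFS set to ":", the unquoted expansion splits the list on colons.
  for dir in $path_list; do
    # Skip empty entries produced by leading, trailing, or doubled colons.
    [ -z "$dir" ] && continue
    if [ ! -d "$dir" ]; then
      echo "missing: $dir"
      failed=1
    fi
  done
  return "$failed"
}
```

Calling `missing_lib_dirs "${LD_LIBRARY_PATH}"` after the exports lists any directory that does not exist, for example a CUDA path on a CPU-only machine.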
- Install Apache Hadoop 2.6 per http://hadoop.apache.org/releases.html, and install Apache Spark 1.6.0 per the instructions at http://spark.apache.org/downloads.html.
${CAFFE_ON_SPARK}/scripts/local-setup-hadoop.sh
export HADOOP_HOME=$(pwd)/hadoop-2.6.4
cp ${CAFFE_ON_SPARK}/scripts/*.xml ${HADOOP_HOME}/etc/hadoop
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop
${CAFFE_ON_SPARK}/scripts/local-setup-spark.sh
export SPARK_HOME=$(pwd)/spark-1.6.0-bin-hadoop2.6
export PATH=${HADOOP_HOME}/bin:${SPARK_HOME}/bin:${PATH}
If you cannot ssh to localhost without a passphrase, execute the following commands:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
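Later steps depend on the environment variables exported so far, and a missing one tends to surface as a confusing path error. A small check like the one below can catch that early. This is only a sketch, and `require_env` is a hypothetical name introduced for illustration.

```shell
# Hypothetical helper: report which of the named variables are unset or empty.
# Returns 0 if all are set, 1 otherwise.
require_env() {
  local name value failed=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name. eval is used for
    # portability, so pass only trusted variable names.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "unset: $name"
      failed=1
    fi
  done
  return "$failed"
}
```

For this guide, a reasonable call would be `require_env CAFFE_ON_SPARK JAVA_HOME HADOOP_HOME YARN_CONF_DIR SPARK_HOME`.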
- Start YARN cluster
${HADOOP_HOME}/bin/hdfs namenode -format
${HADOOP_HOME}/sbin/start-dfs.sh
${HADOOP_HOME}/sbin/start-yarn.sh
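To confirm that the cluster actually came up, you can inspect the running Java processes with the JDK's `jps` tool. The sketch below assumes a single-node setup in which the start scripts launch NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager; adjust the list if your configuration differs. `missing_daemons` is a hypothetical helper name.

```shell
# Hypothetical helper: read `jps` output on stdin and print each expected
# Hadoop daemon that is not listed. Returns 0 if all are present, 1 otherwise.
missing_daemons() {
  local listing daemon failed=0
  listing=$(cat)
  for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # jps prints "<pid> <ClassName>"; anchoring both ends keeps "NameNode"
    # from matching "SecondaryNameNode".
    if ! printf '%s\n' "$listing" | grep -Eq "^[0-9]+ ${daemon}\$"; then
      echo "not running: $daemon"
      failed=1
    fi
  done
  return "$failed"
}
```

Run it as `jps | missing_daemons`; no output and a zero exit status mean all five daemons were found.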
- Install the MNIST and CIFAR-10 datasets into HDFS
hadoop fs -mkdir -p /projects/machine_learning/image_dataset
${CAFFE_ON_SPARK}/scripts/setup-mnist.sh
hadoop fs -put -f ${CAFFE_ON_SPARK}/data/mnist_*_lmdb hdfs:/projects/machine_learning/image_dataset/
${CAFFE_ON_SPARK}/scripts/setup-cifar10.sh
hadoop fs -put -f ${CAFFE_ON_SPARK}/data/cifar10_*_lmdb hdfs:/projects/machine_learning/image_dataset/
Set solver_mode in data/lenet_memory_solver.prototxt and data/cifar10_quick_solver.prototxt to match your hardware:
solver_mode: CPU #GPU if you use GPU nodes
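If you switch between CPU and GPU nodes often, the mode change can be scripted instead of edited by hand. The sketch below assumes each solver file contains a single line beginning with `solver_mode:`; the function name `set_solver_mode` is hypothetical.

```shell
# Hypothetical helper: rewrite the value on the solver_mode line of a solver
# prototxt. Usage: set_solver_mode FILE CPU|GPU
set_solver_mode() {
  local file="$1" mode="$2" tmp rc
  case "$mode" in
    CPU|GPU) ;;
    *) echo "mode must be CPU or GPU" >&2; return 2 ;;
  esac
  tmp=$(mktemp) || return 1
  # Replace only the value after "solver_mode:"; a trailing comment is kept.
  sed -E "s/^([[:space:]]*solver_mode:[[:space:]]*)[A-Z]+/\1${mode}/" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  rc=$?
  rm -f "$tmp"
  return "$rc"
}
```

For example, `set_solver_mode ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt GPU` switches the MNIST solver to GPU mode.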
- Train a DNN using CaffeOnSpark with 2 Spark executors over an Ethernet connection. If you have an InfiniBand interface, use "-connection infiniband" instead.
export SPARK_WORKER_INSTANCES=2
export DEVICES=1
hadoop fs -rm -f hdfs:///mnist.model
hadoop fs -rm -r -f hdfs:///mnist_features_result
spark-submit --master yarn --deploy-mode cluster \
--num-executors ${SPARK_WORKER_INSTANCES} \
--files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
--class com.yahoo.ml.caffe.CaffeOnSpark \
${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
-train \
-features accuracy,loss -label label \
-conf lenet_memory_solver.prototxt \
-devices ${DEVICES} \
-connection ethernet \
-model hdfs:///mnist.model \
-output hdfs:///mnist_features_result
hadoop fs -ls hdfs:///mnist.model
hadoop fs -cat hdfs:///mnist_features_result/*
The training will produce a model and various snapshots.
-rw-r--r-- 3 root supergroup 1725052 2016-02-20 00:57 /mnist_lenet.model
-rw-r--r-- 3 root supergroup 1725052 2016-02-20 00:57 /mnist_lenet_iter_10000.caffemodel
-rw-r--r-- 3 root supergroup 1724462 2016-02-20 00:57 /mnist_lenet_iter_10000.solverstate
-rw-r--r-- 3 root supergroup 1725052 2016-02-20 00:56 /mnist_lenet_iter_5000.caffemodel
-rw-r--r-- 3 root supergroup 1724461 2016-02-20 00:56 /mnist_lenet_iter_5000.solverstate
The feature result file should look like:
{"SampleID":"00009597","accuracy":[1.0],"loss":[0.028171852],"label":[2.0]}
{"SampleID":"00009598","accuracy":[1.0],"loss":[0.028171852],"label":[6.0]}
{"SampleID":"00009599","accuracy":[1.0],"loss":[0.028171852],"label":[1.0]}
{"SampleID":"00009600","accuracy":[0.97],"loss":[0.0677709],"label":[5.0]}
{"SampleID":"00009601","accuracy":[0.97],"loss":[0.0677709],"label":[0.0]}
{"SampleID":"00009602","accuracy":[0.97],"loss":[0.0677709],"label":[1.0]}
{"SampleID":"00009603","accuracy":[0.97],"loss":[0.0677709],"label":[2.0]}
{"SampleID":"00009604","accuracy":[0.97],"loss":[0.0677709],"label":[3.0]}
{"SampleID":"00009605","accuracy":[0.97],"loss":[0.0677709],"label":[4.0]}
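Because the feature output is one JSON object per line, a quick summary can be computed with standard tools. The sketch below averages the first value of the `accuracy` and `loss` arrays across all records using awk. It assumes the record layout shown above, and `summarize_features` is a hypothetical helper name. Note that it is a plain mean over records, which is only a rough indicator since values repeat within a batch.

```shell
# Hypothetical helper: average the "accuracy" and "loss" fields of
# JSON-lines feature output read from stdin.
summarize_features() {
  awk '
    # Return the text after "key":[ up to (not including) the closing bracket.
    function field(line, key,    pat, skip) {
      pat = "\"" key "\":\\[[^]]*"
      if (match(line, pat) == 0) return ""
      skip = length(key) + 4   # two quotes, the colon, and the open bracket
      return substr(line, RSTART + skip, RLENGTH - skip)
    }
    {
      a = field($0, "accuracy"); l = field($0, "loss")
      if (a == "" || l == "") next   # ignore lines lacking either field
      acc += a; loss += l; n++
    }
    END {
      if (n == 0) { print "no samples"; exit 1 }
      printf "samples=%d mean_accuracy=%.4f mean_loss=%.4f\n", n, acc / n, loss / n
    }
  '
}
```

A typical invocation would be `hadoop fs -cat hdfs:///mnist_features_result/* | summarize_features`.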
You can run similar steps for the CIFAR-10 dataset.
export SPARK_WORKER_INSTANCES=2
export DEVICES=1
hadoop fs -rm -f hdfs:///cifar10.model.h5
hadoop fs -rm -r -f hdfs:///cifar10_features_result
spark-submit --master yarn --deploy-mode cluster \
--num-executors ${SPARK_WORKER_INSTANCES} \
--files ${CAFFE_ON_SPARK}/data/cifar10_quick_solver.prototxt,${CAFFE_ON_SPARK}/data/cifar10_quick_train_test.prototxt,${CAFFE_ON_SPARK}/data/mean.binaryproto \
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
--class com.yahoo.ml.caffe.CaffeOnSpark \
${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
-train \
-features accuracy,loss -label label \
-conf cifar10_quick_solver.prototxt \
-devices ${DEVICES} \
-connection ethernet \
-model hdfs:///cifar10.model.h5 \
-output hdfs:///cifar10_features_result
hadoop fs -ls hdfs:///cifar10.model.h5
hadoop fs -cat hdfs:///cifar10_features_result/*
- Access CaffeOnSpark from Python
See "Get started with Python on CaffeOnSpark".
- Shutdown YARN cluster
${HADOOP_HOME}/sbin/stop-yarn.sh
${HADOOP_HOME}/sbin/stop-dfs.sh
rm -rf /tmp/hadoop-${USER}
rm -rf ${HADOOP_HOME}/logs