To start Hadoop, YARN, Spark, and the Spark History Server:
To stop Hadoop, YARN, Spark, and the Spark History Server:
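The exact commands for these two steps are not listed here. As a rough sketch only, assuming a standard installation where the Hadoop sbin scripts (start-dfs.sh, start-yarn.sh and their stop counterparts) are on PATH and SPARK_HOME points at the Spark installation, the services could be started and stopped like this. The run_if_present, start_cluster and stop_cluster names are hypothetical, and /opt/spark is only a placeholder default. The sketch also assumes Spark runs on YARN, in which case the history server is the only Spark daemon to manage.

```shell
# Assumption: SPARK_HOME is set; /opt/spark is only a placeholder default.
SPARK_SBIN="${SPARK_HOME:-/opt/spark}/sbin"

# Hypothetical helper: run a command only if it can be found,
# otherwise report that it was skipped.
run_if_present() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "skipped: $1 not found"
  fi
}

# Start HDFS, then YARN, then the Spark History Server.
start_cluster() {
  run_if_present start-dfs.sh
  run_if_present start-yarn.sh
  run_if_present "$SPARK_SBIN/start-history-server.sh"
}

# Stop the services in the reverse order.
stop_cluster() {
  run_if_present "$SPARK_SBIN/stop-history-server.sh"
  run_if_present stop-yarn.sh
  run_if_present stop-dfs.sh
}
```

After sourcing these definitions, start_cluster and stop_cluster can be called from the shell. Scripts that are not installed are reported as skipped instead of failing.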
To create the conda environment:
conda env create -f pyspark_conda_env.tar.gz
To activate the conda environment:
conda activate pyspark_conda_env
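Note that conda env create -f normally expects an environment YAML file. If pyspark_conda_env.tar.gz is instead an archive produced by conda-pack (an assumption based only on the file name), it would be unpacked and activated in place. A minimal sketch under that assumption follows; the unpack_env helper and the target directory name are hypothetical.

```shell
# Hypothetical helper: unpack a conda-pack archive into a target directory.
unpack_env() {
  archive="$1"
  target="$2"
  if [ ! -f "$archive" ]; then
    echo "archive not found: $archive"
    return 1
  fi
  mkdir -p "$target" && tar -xzf "$archive" -C "$target"
}

# Usage (the target directory name is an assumption):
#   unpack_env pyspark_conda_env.tar.gz pyspark_conda_env
#   . pyspark_conda_env/bin/activate
#   conda-unpack   # fixes path prefixes inside the unpacked environment
```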
To run Query 1 with the DataFrame API:
python code/
To run Query 1 with the SQL API:
python code/
To run Query 2 with the DataFrame API:
python code/
To run Query 2 with the RDD API:
python code/
To change the number of executors, change the value of NUM_EXECUTORS.
To run Query 3:
python code/
To run Query 4:
python code/
To turn off HDFS safe mode:
hdfs dfsadmin -safemode leave
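Before forcing the NameNode out of safe mode, it can help to check whether safe mode is actually on. A minimal sketch, assuming the hdfs CLI is on PATH; the leave_safemode_if_on function name is hypothetical.

```shell
# Hypothetical helper: leave safe mode only when the NameNode reports it is ON.
leave_safemode_if_on() {
  # `hdfs dfsadmin -safemode get` reports the state, e.g. "Safe mode is ON".
  status=$(hdfs dfsadmin -safemode get 2>/dev/null)
  case "$status" in
    *"is ON"*) hdfs dfsadmin -safemode leave ;;
    *) echo "nothing to do: ${status:-safe mode status unavailable}" ;;
  esac
}
```

Calling leave_safemode_if_on leaves safe mode only when needed and otherwise prints the reported status.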
To print an HDFS status report:
hdfs dfsadmin -report
To see size information about the HDFS filesystem:
hdfs dfs -du -h /user
To remove the Spark staging directory and event logs:
hdfs dfs -rm -r -skipTrash /user/user/.sparkStaging
hdfs dfs -rm -r -skipTrash /spark.eventLog/*
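Because -skipTrash deletes permanently, a guarded variant may be safer for the staging directory. The sketch below assumes the hdfs CLI is on PATH and uses hdfs dfs -test -e to check that a path exists first; the hdfs_rm_if_exists helper name is hypothetical.

```shell
# Hypothetical helper: permanently delete an HDFS path only if it exists.
hdfs_rm_if_exists() {
  if hdfs dfs -test -e "$1"; then
    hdfs dfs -rm -r -skipTrash "$1"
  else
    echo "not present: $1"
  fi
}

# Usage:
#   hdfs_rm_if_exists /user/user/.sparkStaging
#
# For the event log glob, quoting the pattern keeps the local shell from
# expanding it, so that HDFS resolves the wildcard itself:
#   hdfs dfs -rm -r -skipTrash '/spark.eventLog/*'
```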