
Very Important: Please terminate the cluster as soon as you are done. Otherwise, you will continue to be charged!

Once the cluster is ready

The cluster status will show as "Waiting", and the console will display the master public DNS along with an "SSH" link.

If you click "SSH", the console gives instructions on how to ssh into the master node.

Copying files to the Hadoop file system (HDFS)

Before you can work with any data (input files), it needs to be loaded into HDFS.

For example, for a file "file.txt" (or directory "data"):

hdfs dfs -put file.txt 

hdfs dfs -put data
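To confirm an upload worked, you can list your HDFS home directory and peek at a file. A quick sketch using the standard hdfs commands (file.txt is the example file from above):

```shell
# List your HDFS home directory to confirm the upload landed
hdfs dfs -ls

# Inspect the beginning of an uploaded file
hdfs dfs -cat file.txt | head
```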

Running Spark

You can run Spark from the command line with:

spark-submit <python_script>

spark-submit <python_script> <data files or other options>

Note that any data files should already be on HDFS.
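As a concrete illustration, you can run one of the stock example scripts that ships with Spark. The pi.py script is used here only as an example; my_script.py and input.txt below are placeholders for your own script and data:

```shell
# Run a stock Spark example that ships with the EMR Spark install
# (pi.py is just an illustration; substitute your own script)
spark-submit /usr/lib/spark/examples/src/main/python/pi.py

# spark-submit options go before the script; script arguments go after it
# (my_script.py and input.txt are placeholders, and input.txt must already be on HDFS)
spark-submit --num-executors 4 my_script.py input.txt
```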



For examples 2 and 3, you need to load the /usr/lib/spark/data directory into HDFS. So run:

hdfs dfs -put /usr/lib/spark/data

    1) The program is in /usr/lib/spark/examples/src/main/python. You can try it on itself. To do that, load the directory into HDFS:

    hdfs dfs -put /usr/lib/spark/examples/src/main/python/

    2) The program is in /usr/lib/spark/examples/src/main/python/mllib

    3) The program is in /usr/lib/spark/examples/src/main/python/mllib


Jupyter access (formerly IPython Notebook)

Once the cluster is up:

The Jupyter notebook server is available on port 8880.

Before you can connect to it, you need to set up an SSH tunnel. In a terminal (on Mac and Linux), type:
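A typical SSH tunnel for port 8880 might look like the following sketch; the key file and master hostname are placeholders you would take from the console's SSH instructions:

```shell
# Forward local port 8880 to port 8880 on the EMR master node
# (<your-key.pem> and <master-public-DNS> are placeholders from the
# console's SSH instructions; hadoop is the default EMR user)
ssh -i <your-key.pem> -N -L 8880:localhost:8880 hadoop@<master-public-DNS>
```

With the tunnel running, open http://localhost:8880 in your browser.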

Files relevant to the class

1) Large files for word count. You can clone this repository on EMR.

    The data directory in:

    has text files from the Gutenberg collection of books.

2) IPython notebook with Spark.

      The following is an EMR version of a notebook from CS 109:


     The original notebook is:

     The above is from the GitHub repository

