Stern School of Business, New York University

Running Hadoop, Hive and Mahout at the Stern Center for Research Computing

Hadoop is a distributed processing framework that runs on Linux, so to use it you need to be reasonably familiar with Unix/Linux commands.
The login system for the Hadoop cluster is bigdata.stern.nyu.edu. It has a large storage area (/bigtemp) for temporary file storage, since most users' home directories are limited in space (< 1 GB). The /bigtemp area can be used as a staging area to load your data into the Hadoop cluster.
You should store your data in /bigtemp/yournetid, and then you can copy (put) the data into the Hadoop file system, where you can manipulate it using Hadoop.
For example, at the bigdata command prompt type
mkdir /bigtemp/yournetid
You can then use sftp, scp or wget to move your data into /bigtemp/yournetid.
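For example (the file name and URL here are made up for illustration), from your own machine you could copy a file to the staging area with
scp mydata.csv yournetid@bigdata.stern.nyu.edu:/bigtemp/yournetid/
or, once logged in to bigdata, fetch a file from the web with
cd /bigtemp/yournetid
wget http://example.com/mydata.csv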
Once the data is in /bigtemp/yournetid, you can use Hadoop commands to move it into the cluster.
To access Hadoop, log in with
ssh yournetid@bigdata.stern.nyu.edu
Typing
hadoop fs -mkdir test
should create a directory "test" in /user/yournetid (your default folder in the Hadoop file system).
Type
hadoop fs -lsr
and you will get a recursive list of all of your files in Hadoop.
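Note: on newer Hadoop releases -lsr is deprecated; if it complains, the equivalent recursive listing is
hadoop fs -ls -R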
Typing
hive
will enter the Hive command-line environment, and
mahout <options>
will run a Mahout job (as sketched below).
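As a quick sketch of the Hive side (the table name, column names and tab-delimited layout are assumptions for illustration, and 'yourproject/yourfilename' is the file loaded into HDFS as described under "Important things to remember" below), you can run HiveQL statements one at a time from the bigdata prompt with hive -e:
hive -e "CREATE TABLE mytable (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';"
hive -e "LOAD DATA INPATH 'yourproject/yourfilename' INTO TABLE mytable;"
hive -e "SELECT COUNT(*) FROM mytable;"
Note that LOAD DATA INPATH moves (rather than copies) the HDFS file into Hive's warehouse folder. For Mahout, typing
mahout
with no options should print the list of job names available on your installation; each is then run as mahout jobname with job-specific options (typically --input and --output pointing at HDFS folders).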
Important things to remember.
Hadoop keeps all of its files in its own file system, called HDFS (the Hadoop Distributed File System). You need to move your files from Linux into HDFS with the
hadoop fs -put /mylocalpath/mylocalfile myhadoopfilename
command. That will copy the file at
/mylocalpath/mylocalfile
to
myhadoopfilename
in HDFS (i.e. /user/yournetid/myhadoopfilename).
If your files are in /bigtemp/yournetid,
the command would look like this:
hadoop fs -put /bigtemp/yournetid/yourfilename yourfilename
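You can check that the copy worked with
hadoop fs -ls
which should now list yourfilename under /user/yournetid.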
One thing to watch out for: many Hadoop commands (and Hive and Mahout) work on a directory/folder of files as opposed to a single file, so you often have to create a folder in Hadoop and put your file(s) in that folder.
In this case, you would first create the folder.
hadoop fs -mkdir yourproject
and then
hadoop fs -put /bigtemp/yournetid/yourfilename yourproject/
This will create the file /user/yournetid/yourproject/yourfilename in HDFS.
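When a job has produced results, the reverse of -put copies them back out of HDFS to Linux (the output folder name below is illustrative; many jobs write a folder of part-* files):
hadoop fs -get yourproject/output /bigtemp/yournetid/
You can also peek at a result file directly with hadoop fs -cat.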
To experiment, you might download the single-node version of Hadoop and run it locally to get used to where it stores files.
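Apache Hadoop releases and a single-node setup guide are available from hadoop.apache.org.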