Hadoop – Big Data Processing
The Stern Hadoop cluster has been retired and replaced by the NYU HPC PEEL Hadoop cluster. However, much of the information below is still relevant to using the PEEL cluster. For more information on PEEL, see the NYU HPC website.
Overview of Hadoop
Stern has a server named bigdata.stern.nyu.edu that is used to access the tools for processing very large datasets. The bigdata Stern server is a Linux server with Hadoop, Hive, Datameer, Tableau, Mahout, SAS, R, Python, Matlab, MySQL, and STATA, and it is accessible using your Stern credentials.
Hadoop is an open-source software framework, written in Java, that enables distributed storage and processing of large datasets across many nodes. What makes Hadoop unique is that it brings the processing to where the data resides, unlike a typical relational database, where the location of the data and the processing are independent of each other.
Stern Research Computing has a small Hadoop cluster: approximately 15 processing nodes with about 50 cores and about 6 TB of disk.
On this page are a few short, interesting, and informative videos to get you started with the basics of Hadoop and big data processing.
What is Hadoop?
How Hadoop Works
An important component of Hadoop is MapReduce, a programming model that lets you write programs that process large amounts of unstructured data in parallel across the Hadoop nodes.
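The classic illustration of MapReduce is word counting. The sketch below runs on a single machine purely to show the shape of the model (the function names and sample data are illustrative, not Hadoop APIs); on a real Hadoop cluster, the framework distributes the map and reduce tasks across the nodes and handles the shuffle between them.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["Hadoop stores data", "Hadoop processes data in parallel"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["hadoop"])  # each document mentions Hadoop once, so 2
```

Because each map call depends only on its own input split, and each reduce call only on one key's grouped values, both phases can run in parallel on many nodes, which is what makes the model scale.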