Stern School of Business, New York University

Large-Scale Distributed Processing for I/O-Intensive Jobs

The Center has retired its Hadoop cluster, which was used for large I/O-intensive workloads. Users should instead use the NYU HPC cluster.

This infrastructure was based on the open-source Apache Hadoop framework. In this environment, a data file is split across many systems, and a program is sent to every node holding a piece of the file, rather than bringing the data to a single server for processing. Google, Yahoo, and Facebook, among others, use this approach for much of their back-end data processing, in some cases running a single job on tens of thousands of computers.

Some Hadoop applications of interest to Stern researchers are Mahout, a distributed data-mining toolkit; HBase, a distributed database; Hive, a data warehouse; and Pig, a high-level language for running Hadoop jobs. See http://hadoop.apache.org/ for an overview of Hadoop. The university has a much larger Hadoop cluster; see NYU HPC.
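To make the data-locality model described above concrete, the following is a minimal sketch of the canonical word-count job written against the Hadoop MapReduce Java API. This is the standard introductory example from the Hadoop tutorial, not code specific to the Center's cluster; input and output paths are supplied on the command line. The mapper runs on the nodes that store each block of the input file, and the reducer aggregates the partial counts.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The mapper is shipped to the nodes that hold blocks of the input
  // file, so the computation moves to the data rather than the reverse.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for each token in this line of the local block.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // The reducer sums the per-node partial counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // combine locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A job like this is packaged as a jar and submitted with, for example, "hadoop jar wordcount.jar WordCount /input/path /output/path"; the framework handles splitting the input, scheduling mappers near the data, and shuffling intermediate results to the reducers.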