Introduction to Cloud Computing - UMIACS

Data-Intensive Text Processing with MapReduce (Bonus session)
Tutorial at the 2009 North American Chapter of the Association for Computational Linguistics - Human Language Technologies Conference (NAACL HLT 2009)

Jimmy Lin
The iSchool
University of Maryland

Chris Dyer
Department of Linguistics
University of Maryland

Sunday, May 31, 2009

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Agenda
Hadoop nuts and bolts
Hello World Hadoop example (distributed word count)
Running Hadoop in standalone mode
Running Hadoop on EC2
Open-source Hadoop ecosystem
Exercises and office hours

Hadoop nuts and bolts
Source: http://davidzinger.wordpress.com/2007/05/page/2/

Hadoop Zen
Don't get frustrated (take a deep breath)
  Remember this when you experience those W$*#T@F! moments
This is bleeding edge technology:
  Lots of bugs
  Stability issues
  Even lost data
  To upgrade or not to upgrade (damned either way)?
  Poor documentation (or none)
But Hadoop is the path to data nirvana?

Cloud9
Library used for teaching cloud computing courses at Maryland

Demos, sample code, etc.
  Computing conditional probabilities
  Pairs vs. stripes
  Complex data types
  Boilerplate code for working with various IR collections
Dog food for research
Open source, anonymous svn access

[Figure: Hadoop cluster architecture, showing clients, a master node running the JobTracker, and slave nodes each running a TaskTracker.]

From Theory to Practice

1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from cluster
[Figure: the steps above drawn as arrows between "You" and "Hadoop Cluster".]

Data Types in Hadoop
Writable: Defines a de/serialization protocol. Every data type in Hadoop is a Writable.
WritableComparable: Defines a sort order. All keys must be of this type (but not values).
IntWritable, LongWritable, Text: Concrete classes for different data types.

Complex Data Types in Hadoop
How do you implement complex data types?

The easiest way:
  Encode it as Text, e.g., (a, b) = a:b
  Use regular expressions to parse and extract data
  Works, but pretty hack-ish
The hard way:
  Define a custom implementation of WritableComparable
  Must implement: readFields, write, compareTo
  Computationally efficient, but slow for rapid prototyping
Alternatives:
  Cloud9 offers two other choices: Tuple and JSON
  Plus, a number of frequently-used data types

[Figure: MapReduce data flow. Input file (on HDFS) -> InputSplit -> RecordReader -> Mapper -> Partitioner -> Reducer -> RecordWriter -> Output file (on HDFS), with InputFormat on the input side and OutputFormat on the output side.]
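As a rough illustration of how records might move through the stages named above (RecordReader, Mapper, Partitioner, Reducer), here is a minimal sketch in plain Python. It does not call any Hadoop API: the shuffle is simulated in memory, and every function name (record_reader, mapper, partitioner, reducer, run_job) is a hypothetical stand-in for the corresponding Hadoop component. The job counts adjacent word pairs and uses the text encoding mentioned earlier, (a, b) = a:b, parsed back with a regular expression.

```python
import re
import zlib
from collections import defaultdict

# Text-encoded pair key "a:b" (the "easiest way" for complex keys).
PAIR_PATTERN = re.compile(r"^([^:]+):([^:]+)$")


def encode_pair(a, b):
    """Encode (a, b) as the string "a:b"; breaks if an item contains ':'."""
    if ":" in a or ":" in b:
        raise ValueError("items must not contain the ':' separator")
    return f"{a}:{b}"


def decode_pair(key):
    """Parse "a:b" back into the tuple (a, b) with a regular expression."""
    match = PAIR_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not a text-encoded pair: {key!r}")
    return match.group(1), match.group(2)


def record_reader(lines):
    """Stand-in for a RecordReader: yields (offset, line) records.

    The offset assumes one trailing newline character per line.
    """
    offset = 0
    for line in lines:
        yield offset, line
        offset += len(line) + 1


def mapper(offset, line):
    """Emit ("left:right", 1) for every pair of adjacent words."""
    words = re.findall(r"[a-z]+", line.lower())  # tokens never contain ':'
    for left, right in zip(words, words[1:]):
        yield encode_pair(left, right), 1


def partitioner(key, num_reducers):
    """Pick a reducer with a deterministic hash of the key, mod the count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    """Sum all counts that arrived for one key."""
    yield key, sum(values)


def run_job(lines, num_reducers=2):
    """Simulate map, shuffle (group by key per partition), and reduce."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for offset, line in record_reader(lines):
        for key, value in mapper(offset, line):
            partitions[partitioner(key, num_reducers)][key].append(value)

    output = {}
    for partition in partitions:
        for key in sorted(partition):  # each reducer sees its keys in sorted order
            for out_key, out_value in reducer(key, partition[key]):
                output[decode_pair(out_key)] = out_value
    return output


counts = run_job(["the quick brown fox", "the quick red fox"])
print(counts[("the", "quick")])  # 2
```

The result does not depend on the number of simulated reducers, since every occurrence of a key is routed to the same partition. The check inside encode_pair shows why the text encoding is fragile: any item containing the separator would need escaping, which a custom WritableComparable avoids.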

What version should I use?

Hello World Hadoop example

Hadoop in standalone mode

Hadoop in EC2

From Theory to Practice
1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from cluster
[Figure: the steps above drawn as arrows between "You" and "Hadoop Cluster".]

On Amazon: With EC2
0. Allocate Hadoop cluster
1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from cluster
7. Clean up!
[Figure: the steps above drawn as arrows between "You" and "Your Hadoop Cluster" on EC2.]
Uh oh. Where did the data go?

On Amazon: EC2 and S3
Copy from S3 to HDFS

[Figure: S3 (Persistent Store) alongside EC2 (The Cloud), which hosts Your Hadoop Cluster.]
Copy from HDFS to S3

Open-source Hadoop ecosystem
Hadoop/HDFS
Hadoop streaming
HDFS/FUSE
EC2/S3/EBS
EMR
Pig
HBase
Hypertable
Hive
Mahout
Cassandra
Dryad
CUDA
CELL
Beware of toys!

Exercises

Questions? Comments?
Thanks to the organizations who support our work:
