Hadoop 0.20.1 API: refactoring the InvertedLine example from Cloudera Training, removing deprecated classes (JobConf, others)


I’ve been learning Hadoop for the past 15 days and have found lots of example source-code. The basic training offered by Cloudera, as well as the Yahoo developer’s tutorial, uses the 0.18 API and describes the Inverted Line Index example. The input of this example is a list of one or more text files containing books, and the output is an index of the words appearing in each of the files, in the format <“word”, “fileName@offset”>, where the word is found on a given line of the given fileName at the given byte offset. Although the example works without a problem, I’ve read documentation about the Pig application noting that the majority of its warnings are caused by the API change, and I’m particularly in favour of clean code without warnings whenever possible. So I started dissecting the API and was able to re-implement the examples using Hadoop 0.20.1. Furthermore, the MRUnit tests must also be refactored in order to make use of the new API.

Both the Yahoo Hadoop Tutorial and the Cloudera Basic Training documentation “Writing MapReduce Programs” give the example of the InvertedIndex application. I used the Cloudera VMWare implementation and source-code as a starting point.

The first major change was the introduction of the “mapreduce” package, containing new Mapper and Reducer base classes, which were interfaces in the previous API’s “mapred” package.

import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class MyMapper extends Mapper<K1, V1, K2, V2> {
    ...
    ...
}

class MyReducer extends Reducer<K2, V2, K3, V3> {
    ...
    ...
}

Also, note that these classes use Java generics and, therefore, the “map()” and “reduce()” methods must follow the type parameters given in your class declaration. Both methods replace the old Reporter and OutputCollector parameters with a single Context object, a nested class of Mapper and Reducer respectively.

import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class MyMapper extends Mapper<K1, V1, K2, V2> {
    ...
    protected void map(K1 key, V1 value, Context context) {
       ...
    }
}

class MyReducer extends Reducer<K2, V2, K3, V3> {
    ...
    protected void reduce(K2 key, Iterable<V2> values, Context context) {
       ...
    }
}
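
For comparison, here is roughly what the same skeleton looked like in the old 0.18-style “mapred” API, with the explicit OutputCollector and Reporter parameters (a sketch reconstructed from the deprecated interfaces, not code taken from the training material):

// Old 0.18-style "mapred" API -- sketch for comparison only
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

class OldStyleMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // in the new API this becomes context.write(...)
        output.collect(new Text("word"), new Text("fileName@offset"));
    }
}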

Consider K and V as placeholders for Writable classes from the Hadoop API; the concrete types must match the ones declared on the class. For instance, I initially used the wrong collection type for the values in the reducer, and the reduce method was never called: with the wrong method signature it does not override the base method, so the default identity reducer runs instead. It is therefore important to verify that you are using the same Iterable class for the values, as the sketch below illustrates.
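
Under the new API, a minimal sketch of this pitfall looks like the following (SignatureCheckReducer is just an illustrative name): adding @Override makes the compiler reject a signature that would otherwise silently fail to override reduce().

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

class SignatureCheckReducer extends Reducer<Text, Text, Text, Text> {

    // Correct: matches Reducer.reduce(KEYIN, Iterable<VALUEIN>, Context)
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(key, value);
        }
    }

    // Wrong: Iterator instead of Iterable. Without @Override this compiles as an
    // unrelated overload that the framework never calls; with @Override it fails to compile.
    // protected void reduce(Text key, java.util.Iterator<Text> values, Context context) { ... }
}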

The mapper class just needs the imports from the new package. The new imported classes are shown in the mapper and reducer code below.

Mapper Class

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)

package index;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * LineIndexMapper Maps each observed word in a line to a (filename@offset) string.
 */
public class LineIndexMapper extends Mapper<LongWritable, Text, Text, Text> {

    public LineIndexMapper() {
    }

    /**
     * Google's search Stopwords
     */
    private static Set<String> googleStopwords;

    static {
        googleStopwords = new HashSet<String>();
        googleStopwords.add("I"); googleStopwords.add("a"); 
        googleStopwords.add("about"); googleStopwords.add("an"); 
        googleStopwords.add("are"); googleStopwords.add("as");
        googleStopwords.add("at"); googleStopwords.add("be"); 
        googleStopwords.add("by"); googleStopwords.add("com"); 
        googleStopwords.add("de"); googleStopwords.add("en");
        googleStopwords.add("for"); googleStopwords.add("from"); 
        googleStopwords.add("how"); googleStopwords.add("in"); 
        googleStopwords.add("is"); googleStopwords.add("it");
        googleStopwords.add("la"); googleStopwords.add("of"); 
        googleStopwords.add("on"); googleStopwords.add("or"); 
        googleStopwords.add("that"); googleStopwords.add("the");
        googleStopwords.add("this"); googleStopwords.add("to"); 
        googleStopwords.add("was"); googleStopwords.add("what"); 
        googleStopwords.add("when"); googleStopwords.add("where");
        googleStopwords.add("who"); googleStopwords.add("will"); 
        googleStopwords.add("with"); googleStopwords.add("and"); 
        googleStopwords.add("the"); googleStopwords.add("www");
    }

    /**
     * @param key is the byte offset of the current line in the file
     * @param value is the line from the file
     * @param context is used to write the <key, value> pairs and to retrieve
     *        information about the job (like the current filename)
     *
     *     POST-CONDITION: Output <"word", "filename@offset"> pairs
     */
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Compile all the words using regex
        Pattern p = Pattern.compile("\\w+");
        Matcher m = p.matcher(value.toString());

        // Get the name of the file from the inputsplit in the context
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();

        // build the values and write <k,v> pairs through the context
        StringBuilder valueBuilder = new StringBuilder();
        while (m.find()) {
            String matchedKey = m.group().toLowerCase();
            // skip tokens whose first character is not a letter (or is a digit), that are stopwords, or that contain an underscore
            if (!Character.isLetter(matchedKey.charAt(0)) || Character.isDigit(matchedKey.charAt(0))
                    || googleStopwords.contains(matchedKey) || matchedKey.contains("_")) {
                continue;
            }
            valueBuilder.append(fileName);
            valueBuilder.append("@");
            valueBuilder.append(key.get());
            // emit the partial <k,v>
            context.write(new Text(matchedKey), new Text(valueBuilder.toString()));
            valueBuilder.setLength(0);
        }
    }
}

Reducer Class

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)

package index;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * LineIndexReducer Takes a list of filename@offset entries for a single word and concatenates them into a list.
 */
public class LineIndexReducer extends Reducer<Text, Text, Text, Text> {

    public LineIndexReducer() {
    }

    /**
     * @param key is the key of the mapper
     * @param values are all the values aggregated during the mapping phase
     * @param context contains the context of the job run
     *
     *      PRE-CONDITION: receive a list of <"word", "filename@offset"> pairs
     *        <"marcello", ["a.txt@3345", "b.txt@344", "c.txt@785"]>
     *
     *      POST-CONDITION: emit the output a single key-value where all the file names
     *        are separated by a comma ",".
     *        <"marcello", "a.txt@3345,b.txt@344,c.txt@785">
     */
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        StringBuilder valueBuilder = new StringBuilder();

        for (Text val : values) {
            valueBuilder.append(val);
            valueBuilder.append(",");
        }
        //write the key and the adjusted value (removing the last comma)
        context.write(key, new Text(valueBuilder.substring(0, valueBuilder.length() - 1)));
        valueBuilder.setLength(0);
    }
}

These are the changes necessary for the Mapper and Reducer classes; with the new API there is no need to extend the old MapReduceBase class anymore. In order to unit test these classes, changes to MRUnit are also necessary: its drivers gained a new “mapreduce” package with the same counterparts.

Instead of the mrunit.MapDriver, use the mrunit.mapreduce.MapDriver; the same goes for the ReduceDriver. The rest of the code stays the same.

// old API
import org.apache.hadoop.mrunit.MapDriver;

// new API
import org.apache.hadoop.mrunit.mapreduce.MapDriver;

JUnit’s MapperTest

Some changes are also required in the MRUnit API classes, following the same pattern as the main API: the addition of the “mapreduce” package and new implementing classes.

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)
package index;

import static org.apache.hadoop.mrunit.testutil.ExtendedAssert.assertListEquals;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import junit.framework.TestCase;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mock.MockInputSplit;
import org.apache.hadoop.mrunit.types.Pair;
import org.junit.Before;
import org.junit.Test;

/**
 * Test cases for the inverted index mapper.
 */
public class MapperTest extends TestCase {

    private Mapper<LongWritable, Text, Text, Text> mapper;
    private MapDriver<LongWritable, Text, Text, Text> driver;

    /** We expect pathname@offset for the key from each of these */
    private final Text EXPECTED_OFFSET = new Text(MockInputSplit.getMockPath().toString() + "@0");

    @Before
    public void setUp() {
        mapper = new LineIndexMapper();
        driver = new MapDriver<LongWritable, Text, Text, Text>(mapper);
    }

    @Test
    public void testEmpty() {
        List<Pair<Text, Text>> out = null;

        try {
            out = driver.withInput(new LongWritable(0), new Text("")).run();
        } catch (IOException ioe) {
            fail();
        }

        List<Pair<Text, Text>> expected = new ArrayList<Pair<Text, Text>>();

        assertListEquals(expected, out);
    }

    @Test
    public void testOneWord() {
        List<Pair<Text, Text>> out = null;

        try {
            out = driver.withInput(new LongWritable(0), new Text("foo")).run();
        } catch (IOException ioe) {
            fail();
        }

        List<Pair<Text, Text>> expected = new ArrayList<Pair<Text, Text>>();
        expected.add(new Pair<Text, Text>(new Text("foo"), EXPECTED_OFFSET));

        assertListEquals(expected, out);
    }

    @Test
    public void testMultiWords() {
        List<Pair<Text, Text>> out = null;

        try {
            out = driver.withInput(new LongWritable(0), new Text("foo bar baz!!!! ????")).run();
        } catch (IOException ioe) {
            fail();
        }

        List<Pair<Text, Text>> expected = new ArrayList<Pair<Text, Text>>();
        expected.add(new Pair<Text, Text>(new Text("foo"), EXPECTED_OFFSET));
        expected.add(new Pair<Text, Text>(new Text("bar"), EXPECTED_OFFSET));
        expected.add(new Pair<Text, Text>(new Text("baz"), EXPECTED_OFFSET));

        assertListEquals(expected, out);
    }
}

JUnit’s ReducerTest

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)

package index;

import static org.apache.hadoop.mrunit.testutil.ExtendedAssert.assertListEquals;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import junit.framework.TestCase;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.apache.hadoop.mrunit.types.Pair;
import org.junit.Before;
import org.junit.Test;

/**
 * Test cases for the inverted index reducer.
 */
public class ReducerTest extends TestCase {

    private Reducer<Text, Text, Text, Text> reducer;
    private ReduceDriver<Text, Text, Text, Text> driver;

    @Before
    public void setUp() {
        reducer = new LineIndexReducer();
        driver = new ReduceDriver<Text, Text, Text, Text>(reducer);
    }

    @Test
    public void testOneOffset() {
        List<Pair<Text, Text>> out = null;

        try {
            out = driver.withInputKey(new Text("word")).withInputValue(new Text("offset")).run();
        } catch (IOException ioe) {
            fail();
        }

        List<Pair<Text, Text>> expected = new ArrayList<Pair<Text, Text>>();
        expected.add(new Pair<Text, Text>(new Text("word"), new Text("offset")));

        assertListEquals(expected, out);
    }

    @Test
    public void testMultiOffset() {
        List<Pair<Text, Text>> out = null;

        try {
            out = driver.withInputKey(new Text("word")).withInputValue(new Text("offset1")).withInputValue(
                    new Text("offset2")).run();
        } catch (IOException ioe) {
            fail();
        }

        List<Pair<Text, Text>> expected = new ArrayList<Pair<Text, Text>>();
        expected.add(new Pair<Text, Text>(new Text("word"), new Text("offset1,offset2")));

        assertListEquals(expected, out);
    }
}

You can test them by running “ant test” in the source-code directory, as usual, to confirm that the implementation is correct:

training@training-vm:~/git/exercises/shakespeare$ ant test
Buildfile: build.xml

compile:
[javac] Compiling 4 source files to /home/training/git/exercises/shakespeare/bin

test:
[junit] Running index.AllTests
[junit] Testsuite: index.AllTests
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 0.418 sec
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 0.418 sec
[junit]

BUILD SUCCESSFUL
Total time: 2 seconds

Replacing JobConf and other deprecated classes

Other changes related to the API concern the configuration of job execution. The class “JobConf” was deprecated, but most of the tutorials have not been updated. So, here’s the updated version of the main example driver using the Configuration and Job classes. Note that the job is configured and executed with the default Configuration, the class responsible for configuring the execution of the tasks. Once again, replacing the classes located in the package “mapred” is important, since the new classes are located in the package “mapreduce”. The following code shows the new classes imported and how they are used throughout the driver.
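
For reference, the deprecated JobConf style that most tutorials still show looked roughly like this (a sketch of an old “mapred” driver for comparison; OldLineIndexer and the commented-out old mapper/reducer classes are hypothetical names):

// Deprecated 0.18-style driver using JobConf -- sketch for comparison only
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class OldLineIndexer {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(OldLineIndexer.class);
        conf.setJobName("Line Indexer 1");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        // the old-API mapper and reducer from the "mapred" package would be set here
        // conf.setMapperClass(OldLineIndexMapper.class);
        // conf.setReducerClass(OldLineIndexReducer.class);

        FileInputFormat.addInputPath(conf, new Path("input"));
        FileOutputFormat.setOutputPath(conf, new Path("output"));

        JobClient.runJob(conf);
    }
}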

InvertedIndex driver

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)
package index;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * LineIndexer Creates an inverted index over all the words in a document corpus, mapping each observed word to a list
 * of filename@offset locations where it occurs.
 */
public class LineIndexer extends Configured implements Tool {

    // where to put the data in hdfs when we're done
    private static final String OUTPUT_PATH = "output";

    // where to read the data from.
    private static final String INPUT_PATH = "input";

    public int run(String[] args) throws Exception {

        Configuration conf = getConf();
        Job job = new Job(conf, "Line Indexer 1");

        job.setJarByClass(LineIndexer.class);
        job.setMapperClass(LineIndexMapper.class);
        job.setReducerClass(LineIndexReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new LineIndexer(), args);
        System.exit(res);
    }
}

After updating, make sure to generate a new jar, remove anything under the directory “output” (since the program does not clean it up), and execute the new version.
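
On the training VM, the stale output directory can be removed from HDFS with the fs shell, for example (shown only as a reminder; adjust the path if your setup differs):

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -rmr output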

training@training-vm:~/git/exercises/shakespeare$ ant jar
Buildfile: build.xml

compile:
[javac] Compiling 4 source files to /home/training/git/exercises/shakespeare/bin

jar:
[jar] Building jar: /home/training/git/exercises/shakespeare/indexer.jar

BUILD SUCCESSFUL
Total time: 1 second

I have added two ASCII books to the input directory: the complete works of Leonardo da Vinci and the first volume of “The Outline of Science”.

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -ls input
Found 3 items
-rw-r--r--   1 training supergroup    5342761 2009-12-30 11:57 /user/training/input/all-shakespeare
-rw-r--r--   1 training supergroup    1427769 2010-01-04 17:42 /user/training/input/leornardo-davinci-all.txt
-rw-r--r--   1 training supergroup     674762 2010-01-04 17:42 /user/training/input/the-outline-of-science-vol1.txt

The execution and output of running this example are shown below.

training@training-vm:~/git/exercises/shakespeare$ hadoop jar indexer.jar index.LineIndexer
10/01/04 21:11:55 INFO input.FileInputFormat: Total input paths to process : 3
10/01/04 21:11:56 INFO mapred.JobClient: Running job: job_200912301017_0017
10/01/04 21:11:57 INFO mapred.JobClient:  map 0% reduce 0%
10/01/04 21:12:07 INFO mapred.JobClient:  map 33% reduce 0%
10/01/04 21:12:10 INFO mapred.JobClient:  map 58% reduce 0%
10/01/04 21:12:13 INFO mapred.JobClient:  map 63% reduce 0%
10/01/04 21:12:16 INFO mapred.JobClient:  map 100% reduce 11%
10/01/04 21:12:28 INFO mapred.JobClient:  map 100% reduce 77%
10/01/04 21:12:34 INFO mapred.JobClient:  map 100% reduce 100%
10/01/04 21:12:36 INFO mapred.JobClient: Job complete: job_200912301017_0017
10/01/04 21:12:36 INFO mapred.JobClient: Counters: 17
10/01/04 21:12:36 INFO mapred.JobClient:   Job Counters
10/01/04 21:12:36 INFO mapred.JobClient:     Launched reduce tasks=1
10/01/04 21:12:36 INFO mapred.JobClient:     Launched map tasks=3
10/01/04 21:12:36 INFO mapred.JobClient:     Data-local map tasks=3
10/01/04 21:12:36 INFO mapred.JobClient:   FileSystemCounters
10/01/04 21:12:36 INFO mapred.JobClient:     FILE_BYTES_READ=58068623
10/01/04 21:12:36 INFO mapred.JobClient:     HDFS_BYTES_READ=7445292
10/01/04 21:12:36 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=92132872
10/01/04 21:12:36 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=26638259
10/01/04 21:12:36 INFO mapred.JobClient:   Map-Reduce Framework
10/01/04 21:12:36 INFO mapred.JobClient:     Reduce input groups=0
10/01/04 21:12:36 INFO mapred.JobClient:     Combine output records=0
10/01/04 21:12:36 INFO mapred.JobClient:     Map input records=220255
10/01/04 21:12:36 INFO mapred.JobClient:     Reduce shuffle bytes=34064153
10/01/04 21:12:36 INFO mapred.JobClient:     Reduce output records=0
10/01/04 21:12:36 INFO mapred.JobClient:     Spilled Records=2762272
10/01/04 21:12:36 INFO mapred.JobClient:     Map output bytes=32068217
10/01/04 21:12:36 INFO mapred.JobClient:     Combine input records=0
10/01/04 21:12:36 INFO mapred.JobClient:     Map output records=997959
10/01/04 21:12:36 INFO mapred.JobClient:     Reduce input records=997959

The index entry for the word “abandoned” is an example of one present in all of the books:

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -cat output/part-r-00000 | less
 ...
 ...
abandoned       leornardo-davinci-all.txt@1257995,leornardo-davinci-all.txt@652992,all-shakespeare@4657862,all-shakespeare@738818,the-outline-of-science-vol1.txt@642211,the-outline-of-science-vol1.txt@606442,the-outline-of-science-vol1.txt@641585
...
...
  1. February 18, 2010 at 9:26 pm

    Hi,
    I have a small question on how jobs are submitted. In your example you have used the command
    ‘hadoop jar indexer.jar index.LineIndexer’ to run the job.

    Is there a way in which I can submit jobs to the job tracker from a Java class, rather than running it through hadoop?

    Thanks.

  2. m
    July 15, 2010 at 4:04 pm

    Thanks for posting this.

    • July 15, 2010 at 4:24 pm

      No problem! I thought I could share that with the community…

  3. JArod
    July 19, 2010 at 1:22 pm

    Tried on CDH3 with wordcount example, and noticed that the reducer failed to add up values. Any ideas?

    • August 4, 2010 at 12:54 am

      I have no idea why the example does not work… I don’t know if there is a problem in the implementation, but I decided to setup a new environment using only Hadoop on a Ubuntu box. The process is described at http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster) and I could run not only the word count, but other MapReduce apps I developed in CDH2 on top of the training artifacts on the VM (see /home/training/workspace/ projects). The only missing components on a fresh Hadoop install were the libs for Unit Testing: HADOOP/contrib/mrunit/hadoop-0.20.1+133-mrunit.jar and HADOOP/lib/junit-4.5.jar.

      You can download my implementations at http://code.google.com/p/programming-artifacts/source/browse/#svn/trunk/workspaces/hacking/hadoop-training/shakespeare/stub-src/src/index. You can try them on a CDH3 by verifying if the Unit testing libs are available.

      1. SCP any missing files from the old CDH2 to CDH3, or any other hadoop installation.
      [training@training-vm:/usr/lib/hadoop-0.20]$ scp contrib/mrunit/hadoop-0.20.1+133-mrunit.jar mdesales@cu064:/u1/hadoop/hadoop-0.20.2/contrib/mrunit/
      [training@training-vm:/usr/lib/hadoop-0.20]$ scp lib/junit-4.5.jar mdesales@cu064:/u1/hadoop/hadoop-0.20.2/lib/

      2. svn co from http://code.google.com/p/programming-artifacts/
      svn co https://programming-artifacts.googlecode.com/svn/trunk/ programming-artifacts --username YOUR_GOOGLE_USER

      3. cd /u1/hadoop/apps/programming-artifacts/workspaces/hacking/hadoop-training/shakespeare

      4. ant

      5. Before moving on, just make sure you have copied the text books to the HDFS.
      [mdesales@cu064 shakespeare]$ /u1/hadoop/hadoop-0.20.2/bin/hadoop dfs -copyFromLocal /u1/hadoop/local-data/ascii-books/ input

      6. For example, you can run my implementation of the TF-IDF algorithm.
      [mdesales@cu064 shakespeare]$ /u1/hadoop/hadoop-0.20.2/bin/hadoop jar indexer.jar index.WordsInCorpusTFIDF

      [mdesales@cu064 shakespeare]$ /u1/hadoop/hadoop-0.20.2/bin/hadoop jar indexer.jar index.WordsInCorpusTFIDF
      10/08/04 01:25:35 INFO input.FileInputFormat: Total input paths to process : 3
      10/08/04 01:25:35 INFO mapred.JobClient: Running job: job_201008032330_0009
      10/08/04 01:25:36 INFO mapred.JobClient: map 0% reduce 0%
      10/08/04 01:25:44 INFO mapred.JobClient: map 66% reduce 0%
      10/08/04 01:25:47 INFO mapred.JobClient: map 100% reduce 0%
      10/08/04 01:25:54 INFO mapred.JobClient: map 100% reduce 22%
      10/08/04 01:26:04 INFO mapred.JobClient: map 100% reduce 100%
      10/08/04 01:26:06 INFO mapred.JobClient: Job complete: job_201008032330_0009
      10/08/04 01:26:06 INFO mapred.JobClient: Counters: 17
      10/08/04 01:26:06 INFO mapred.JobClient: Job Counters
      10/08/04 01:26:06 INFO mapred.JobClient: Launched reduce tasks=1
      10/08/04 01:26:06 INFO mapred.JobClient: Launched map tasks=3
      10/08/04 01:26:06 INFO mapred.JobClient: Data-local map tasks=3
      10/08/04 01:26:06 INFO mapred.JobClient: FileSystemCounters
      10/08/04 01:26:06 INFO mapred.JobClient: FILE_BYTES_READ=17321980
      10/08/04 01:26:06 INFO mapred.JobClient: HDFS_BYTES_READ=3639512
      10/08/04 01:26:06 INFO mapred.JobClient: FILE_BYTES_WRITTEN=34644068
      10/08/04 01:26:06 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2047700
      10/08/04 01:26:06 INFO mapred.JobClient: Map-Reduce Framework
      10/08/04 01:26:06 INFO mapred.JobClient: Reduce input groups=53095
      10/08/04 01:26:06 INFO mapred.JobClient: Combine output records=0
      10/08/04 01:26:06 INFO mapred.JobClient: Map input records=77934
      10/08/04 01:26:06 INFO mapred.JobClient: Reduce shuffle bytes=14156791
      10/08/04 01:26:06 INFO mapred.JobClient: Reduce output records=53095
      10/08/04 01:26:06 INFO mapred.JobClient: Spilled Records=831020
      10/08/04 01:26:06 INFO mapred.JobClient: Map output bytes=16490954
      10/08/04 01:26:06 INFO mapred.JobClient: Combine input records=0
      10/08/04 01:26:06 INFO mapred.JobClient: Map output records=415510
      10/08/04 01:26:06 INFO mapred.JobClient: Reduce input records=415510
      10/08/04 01:26:06 INFO input.FileInputFormat: Total input paths to process : 1
      10/08/04 01:26:06 INFO mapred.JobClient: Running job: job_201008032330_0010
      10/08/04 01:26:07 INFO mapred.JobClient: map 0% reduce 0%
      10/08/04 01:26:17 INFO mapred.JobClient: map 100% reduce 0%
      10/08/04 01:26:30 INFO mapred.JobClient: map 100% reduce 100%
      10/08/04 01:26:32 INFO mapred.JobClient: Job complete: job_201008032330_0010
      10/08/04 01:26:32 INFO mapred.JobClient: Counters: 17
      10/08/04 01:26:32 INFO mapred.JobClient: Job Counters
      10/08/04 01:26:32 INFO mapred.JobClient: Launched reduce tasks=1
      10/08/04 01:26:32 INFO mapred.JobClient: Launched map tasks=1
      10/08/04 01:26:32 INFO mapred.JobClient: Data-local map tasks=1
      10/08/04 01:26:32 INFO mapred.JobClient: FileSystemCounters
      10/08/04 01:26:32 INFO mapred.JobClient: FILE_BYTES_READ=2153896
      10/08/04 01:26:32 INFO mapred.JobClient: HDFS_BYTES_READ=2047700
      10/08/04 01:26:32 INFO mapred.JobClient: FILE_BYTES_WRITTEN=4307824
      10/08/04 01:26:32 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2410104
      10/08/04 01:26:32 INFO mapred.JobClient: Map-Reduce Framework
      10/08/04 01:26:32 INFO mapred.JobClient: Reduce input groups=3
      10/08/04 01:26:32 INFO mapred.JobClient: Combine output records=0
      10/08/04 01:26:32 INFO mapred.JobClient: Map input records=53095
      10/08/04 01:26:32 INFO mapred.JobClient: Reduce shuffle bytes=0
      10/08/04 01:26:32 INFO mapred.JobClient: Reduce output records=53095
      10/08/04 01:26:32 INFO mapred.JobClient: Spilled Records=106190
      10/08/04 01:26:32 INFO mapred.JobClient: Map output bytes=2047700
      10/08/04 01:26:32 INFO mapred.JobClient: Combine input records=0
      10/08/04 01:26:32 INFO mapred.JobClient: Map output records=53095
      10/08/04 01:26:32 INFO mapred.JobClient: Reduce input records=53095
      10/08/04 01:26:32 INFO input.FileInputFormat: Total input paths to process : 1
      10/08/04 01:26:32 INFO mapred.JobClient: Running job: job_201008032330_0011
      10/08/04 01:26:33 INFO mapred.JobClient: map 0% reduce 0%
      10/08/04 01:26:44 INFO mapred.JobClient: map 100% reduce 0%
      10/08/04 01:26:57 INFO mapred.JobClient: map 100% reduce 100%
      10/08/04 01:26:59 INFO mapred.JobClient: Job complete: job_201008032330_0011
      10/08/04 01:26:59 INFO mapred.JobClient: Counters: 17
      10/08/04 01:26:59 INFO mapred.JobClient: Job Counters
      10/08/04 01:26:59 INFO mapred.JobClient: Launched reduce tasks=1
      10/08/04 01:26:59 INFO mapred.JobClient: Launched map tasks=1
      10/08/04 01:26:59 INFO mapred.JobClient: Data-local map tasks=1
      10/08/04 01:26:59 INFO mapred.JobClient: FileSystemCounters
      10/08/04 01:26:59 INFO mapred.JobClient: FILE_BYTES_READ=2516300
      10/08/04 01:26:59 INFO mapred.JobClient: HDFS_BYTES_READ=2410104
      10/08/04 01:26:59 INFO mapred.JobClient: FILE_BYTES_WRITTEN=5032632
      10/08/04 01:26:59 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=3521291
      10/08/04 01:26:59 INFO mapred.JobClient: Map-Reduce Framework
      10/08/04 01:26:59 INFO mapred.JobClient: Reduce input groups=38637
      10/08/04 01:26:59 INFO mapred.JobClient: Combine output records=0
      10/08/04 01:26:59 INFO mapred.JobClient: Map input records=53095
      10/08/04 01:26:59 INFO mapred.JobClient: Reduce shuffle bytes=0
      10/08/04 01:26:59 INFO mapred.JobClient: Reduce output records=53095
      10/08/04 01:26:59 INFO mapred.JobClient: Spilled Records=106190
      10/08/04 01:26:59 INFO mapred.JobClient: Map output bytes=2410104
      10/08/04 01:26:59 INFO mapred.JobClient: Combine input records=0
      10/08/04 01:26:59 INFO mapred.JobClient: Map output records=53095
      10/08/04 01:26:59 INFO mapred.JobClient: Reduce input records=53095

      7. You can see the results in the HDFS directory “3-tf-idf”.
      [mdesales@cu064 shakespeare]$ /u1/hadoop/hadoop-0.20.2/bin/hadoop dfs -ls 3-tf-idf
      Found 2 items
      drwxr-xr-x - mdesales supergroup 0 2010-08-04 01:26 /user/mdesales/3-tf-idf/_logs
      -rw-r--r-- 1 mdesales supergroup 3521291 2010-08-04 01:26 /user/mdesales/3-tf-idf/part-r-00000
      [mdesales@cu064 shakespeare]$ /u1/hadoop/hadoop-0.20.2/bin/hadoop dfs -cat 3-tf-idf/part-r-00000

      8. Run cat on the output of the execution to see the result, which is emitted as sorted values.
      [mdesales@cu064 shakespeare]$ /u1/hadoop/hadoop-0.20.2/bin/hadoop dfs -cat 3-tf-idf/part-r-00000

      a1@joyce-james-book.txt [1/3 , 1/195261 , 0.00000244]
      aa@the-notebooks-of-leonardo-da-vinci.txt [1/3 , 1/149599 , 0.00000319]
      aaron@joyce-james-book.txt [1/3 , 2/195261 , 0.00000489]
      ab@the-notebooks-of-leonardo-da-vinci.txt [1/3 , 3/149599 , 0.00000957]
      abacho@the-notebooks-of-leonardo-da-vinci.txt [1/3 , 1/149599 , 0.00000319]
      aback@joyce-james-book.txt [1/3 , 1/195261 , 0.00000244]
      abacus@the-notebooks-of-leonardo-da-vinci.txt [1/3 , 3/149599 , 0.00000957]
      abaft@joyce-james-book.txt [1/3 , 1/195261 , 0.00000244]
      abandon@the-notebooks-of-leonardo-da-vinci.txt [2/3 , 6/149599 , 0.00000706]
      abandon@joyce-james-book.txt [2/3 , 1/195261 , 0.0000009]
      abandoned@the-outline-of-science-vol1.txt [3/3 , 3/70650 , 0.00004246]
      abandoned@the-notebooks-of-leonardo-da-vinci.txt [3/3 , 2/149599 , 0.00001337]
      abandoned@joyce-james-book.txt [3/3 , 7/195261 , 0.00003585]
      abandoning@the-notebooks-of-leonardo-da-vinci.txt [2/3 , 2/149599 , 0.00000235]
      abandoning@joyce-james-book.txt [2/3 , 1/195261 , 0.0000009]
      abandonment@the-outline-of-science-vol1.txt [2/3 , 1/70650 , 0.00000249]
      abandonment@joyce-james-book.txt [2/3 , 1/195261 , 0.0000009]
      abandons@the-notebooks-of-leonardo-da-vinci.txt [1/3 , 2/149599 , 0.00000638]
      abasement@joyce-james-book.txt [1/3 , 2/195261 , 0.00000489]
      abatement@joyce-james-book.txt [1/3 , 1/195261 , 0.00000244]
      abattoir@joyce-james-book.txt [1/3 , 1/195261 , 0.00000244]
      abbas@joyce-james-book.txt [1/3 , 2/195261 , 0.00000489]
      abbate@the-notebooks-of-leonardo-da-vinci.txt [1/3 , 1/149599 , 0.00000319]

      Any other questions, let me know! Sorry for the delay… I hope this can help you and others…

      Marcello

      • Gk
        December 22, 2011 at 1:20 am

        Hi ,

        I tried out this program on Hadoop 0.20.203rc1 on an Ubuntu VM. The code is the same; however, I am facing a weird issue where the output file is zero bytes at the end of the MapReduce job, which runs successfully.

        11/12/22 03:09:00 INFO input.FileInputFormat: Total input paths to process : 3
        11/12/22 03:09:00 INFO mapred.JobClient: Running job: job_201112180234_0054
        11/12/22 03:09:01 INFO mapred.JobClient: map 0% reduce 0%
        11/12/22 03:09:21 INFO mapred.JobClient: map 66% reduce 0%
        11/12/22 03:09:39 INFO mapred.JobClient: map 100% reduce 0%
        11/12/22 03:09:42 INFO mapred.JobClient: map 100% reduce 22%
        11/12/22 03:09:51 INFO mapred.JobClient: map 100% reduce 100%
        11/12/22 03:09:56 INFO mapred.JobClient: Job complete: job_201112180234_0054
        11/12/22 03:09:56 INFO mapred.JobClient: Counters: 24
        11/12/22 03:09:56 INFO mapred.JobClient: Job Counters
        11/12/22 03:09:56 INFO mapred.JobClient: Launched reduce tasks=1
        11/12/22 03:09:56 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=48203
        11/12/22 03:09:56 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
        11/12/22 03:09:56 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
        11/12/22 03:09:56 INFO mapred.JobClient: Launched map tasks=3
        11/12/22 03:09:56 INFO mapred.JobClient: Data-local map tasks=3
        11/12/22 03:09:56 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=28753
        11/12/22 03:09:56 INFO mapred.JobClient: File Output Format Counters
        11/12/22 03:09:56 INFO mapred.JobClient: Bytes Written=0
        11/12/22 03:09:56 INFO mapred.JobClient: FileSystemCounters
        11/12/22 03:09:56 INFO mapred.JobClient: FILE_BYTES_READ=6
        11/12/22 03:09:56 INFO mapred.JobClient: HDFS_BYTES_READ=1112294
        11/12/22 03:09:56 INFO mapred.JobClient: FILE_BYTES_WRITTEN=85473
        11/12/22 03:09:56 INFO mapred.JobClient: File Input Format Counters
        11/12/22 03:09:56 INFO mapred.JobClient: Bytes Read=1111938

        The Bytes Written and HDFS_BYTES_WRITTEN counters seem to be zero, so the problem seems to be with the file output format.

        Thanks
        Gk
