Hadoop 0.20.1 API: refactoring the InvertedLine example from Cloudera Training, removing deprecated classes (JobConf, others)
I’ve been learning Hadoop for the past 15 days and have found plenty of source-code examples. The basic training offered by Cloudera uses the 0.18 API, as does the Yahoo developer’s tutorial that describes the Inverted Line Index example. The input of this example is a list of one or more text files containing books, and the output is an index of the words appearing in each file, where each entry gives the fileName and the byte offset of the line on which the word is found. Although the example works without a problem, I’ve read in the documentation about the Pig application that the majority of its warnings are caused by the API change. I’m particularly in favour of clean code without warnings, whenever possible. So I started dissecting the API and re-implemented the examples using Hadoop 0.20.1. Furthermore, the MRUnit tests must also be refactored in order to make use of the new API.
Both the Yahoo Hadoop Tutorial and the Cloudera Basic Training documentation “Writing MapReduce Programs” give the example of the InvertedIndex application. I used the Cloudera VMWare implementation and source-code as a starting point.
The first major change is the new “mapreduce” package, which contains the new Mapper and Reducer classes; in the previous API, in the package “mapred”, they were interfaces.
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
class MyMapper extends Mapper {
...
...
}
class MyReducer extends Reducer {
...
...
}
Also, note that these classes use Java generics, so the “map()” and “reduce()” methods must use the type parameters declared by your implementation. Both methods replaced the reporter and collector with a Context class, which is a nested (inner) class of each of the Mapper and Reducer classes.
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
class MyMapper extends Mapper {
...
protected void map(K key, V value, Mapper.Context context) {
...
}
}
class MyReducer extends Reducer {
...
protected void reduce(K key, Iterable<V> values, Context context) {
...
}
}
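Consider K and V as generic Writable classes from the Hadoop API; the types you declare on the class must be the ones used in the method signatures. For instance, I once had a different Iterable type in the reducer signature, and with the wrong method signature the reduce method was never called. A method whose signature differs does not override Reducer.reduce(), so the default implementation, which writes every value through unchanged, runs instead. So it is important to verify that the values parameter is an Iterable of the same value class you declared.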
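The pitfall can be reproduced without Hadoop. The sketch below uses a hypothetical stand-in class, BaseReducer, whose default reduce() passes the values through unchanged, in the spirit of Hadoop's Reducer; none of these class names come from the Hadoop API. One easy way to get the signature wrong when porting is to keep the Iterator parameter of the old “mapred” API. That only adds an overload, so the framework-style call still reaches the default implementation. Adding @Override to the intended method turns the mistake into a compile error.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class OverridePitfall {

    /** Hypothetical stand-in for a framework base class with a pass-through reduce(). */
    static class BaseReducer {
        protected void reduce(String key, Iterable<String> values, List<String> out) {
            for (String v : values) {
                out.add(key + "=" + v);
            }
        }

        /** Plays the role of the framework: it always calls the Iterable version. */
        final List<String> run(String key, List<String> values) {
            List<String> out = new ArrayList<String>();
            reduce(key, values, out);
            return out;
        }
    }

    /** Wrong: Iterator instead of Iterable. This is an overload that run() never calls. */
    static class WrongReducer extends BaseReducer {
        protected void reduce(String key, Iterator<String> values, List<String> out) {
            out.add(key + "=joined");
        }
    }

    /** Right: same parameter types as the base method, checked by @Override. */
    static class RightReducer extends BaseReducer {
        @Override
        protected void reduce(String key, Iterable<String> values, List<String> out) {
            StringBuilder sb = new StringBuilder();
            for (String v : values) {
                if (sb.length() > 0) {
                    sb.append(",");
                }
                sb.append(v);
            }
            out.add(key + "=" + sb);
        }
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("a.txt@1", "b.txt@2");
        // The wrong reducer silently falls back to the pass-through behaviour.
        System.out.println(new WrongReducer().run("word", values));
        // The right reducer emits a single joined value.
        System.out.println(new RightReducer().run("word", values));
    }
}
```

If WrongReducer.reduce() carried an @Override annotation, the compiler would reject it, which is the cheapest way to catch this kind of mismatch.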
The mapper class just needs the new API from the new package. The newly imported classes are highlighted in the mapper and reducer code.
Mapper Class
// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)
package index;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
/**
* LineIndexMapper Maps each observed word in a line to a (filename@offset) string.
*/
public class LineIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
public LineIndexMapper() {
}
/**
* Google's search Stopwords
*/
private static Set<String> googleStopwords;
static {
googleStopwords = new HashSet<String>();
// words are lowercased before the lookup, so the stopword must be lowercase too
googleStopwords.add("i"); googleStopwords.add("a");
googleStopwords.add("about"); googleStopwords.add("an");
googleStopwords.add("are"); googleStopwords.add("as");
googleStopwords.add("at"); googleStopwords.add("be");
googleStopwords.add("by"); googleStopwords.add("com");
googleStopwords.add("de"); googleStopwords.add("en");
googleStopwords.add("for"); googleStopwords.add("from");
googleStopwords.add("how"); googleStopwords.add("in");
googleStopwords.add("is"); googleStopwords.add("it");
googleStopwords.add("la"); googleStopwords.add("of");
googleStopwords.add("on"); googleStopwords.add("or");
googleStopwords.add("that"); googleStopwords.add("the");
googleStopwords.add("this"); googleStopwords.add("to");
googleStopwords.add("was"); googleStopwords.add("what");
googleStopwords.add("when"); googleStopwords.add("where");
googleStopwords.add("who"); googleStopwords.add("will");
googleStopwords.add("with"); googleStopwords.add("and");
googleStopwords.add("the"); googleStopwords.add("www");
}
/**
* @param key is the byte offset of the current line in the file;
* @param value is the line from the file
* @param context has the method "write()" to output the key,value pair and gives access to
* information about the job (like the input split with the current filename)
*
* POST-CONDITION: Output <"word", "filename@offset"> pairs
*/
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// Compile all the words using regex
Pattern p = Pattern.compile("\\w+");
Matcher m = p.matcher(value.toString());
// Get the name of the file from the inputsplit in the context
String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
// build the values and write <k,v> pairs through the context
StringBuilder valueBuilder = new StringBuilder();
while (m.find()) {
String matchedKey = m.group().toLowerCase();
// remove names starting with non letters, digits, considered stopwords or containing other chars
if (!Character.isLetter(matchedKey.charAt(0)) || Character.isDigit(matchedKey.charAt(0))
|| googleStopwords.contains(matchedKey) || matchedKey.contains("_")) {
continue;
}
valueBuilder.append(fileName);
valueBuilder.append("@");
valueBuilder.append(key.get());
// emit the partial <k,v>
context.write(new Text(matchedKey), new Text(valueBuilder.toString()));
valueBuilder.setLength(0);
}
}
}
Reducer Class
// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)
package index;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
/**
* LineIndexReducer Takes a list of filename@offset entries for a single word and concatenates them into a list.
*/
public class LineIndexReducer extends Reducer<Text, Text, Text, Text> {
public LineIndexReducer() {
}
/**
* @param key is the key of the mapper
* @param values are all the values aggregated during the mapping phase
* @param context contains the context of the job run
*
* PRE-CONDITION: receive a list of <"word", "filename@offset"> pairs
* <"marcello", ["a.txt@3345", "b.txt@344", "c.txt@785"]>
*
* POST-CONDITION: emit the output a single key-value where all the file names
* are separated by a comma ",".
* <"marcello", "a.txt@3345,b.txt@344,c.txt@785">
*/
@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
StringBuilder valueBuilder = new StringBuilder();
for (Text val : values) {
valueBuilder.append(val);
valueBuilder.append(",");
}
//write the key and the adjusted value (removing the last comma)
context.write(key, new Text(valueBuilder.substring(0, valueBuilder.length() - 1)));
valueBuilder.setLength(0);
}
}
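To see the data flow end to end without a cluster, the map and reduce logic above can be mimicked in plain Java. This is only a sketch: the class LocalInvertedIndex and its methods are hypothetical, a TreeMap stands in for Hadoop's shuffle and sort, the stopword set is cut down to three entries for brevity, and offsets are computed assuming ASCII text with a one-byte newline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LocalInvertedIndex {
    private static final Pattern WORD = Pattern.compile("\\w+");
    // Shortened stopword list, for illustration only.
    private static final Set<String> STOPWORDS =
            new HashSet<String>(Arrays.asList("a", "the", "of"));

    /** "Map" step: add a fileName@offset posting for every accepted word of the line. */
    static void map(String fileName, long offset, String line, Map<String, List<String>> groups) {
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            String word = m.group().toLowerCase();
            if (!Character.isLetter(word.charAt(0)) || STOPWORDS.contains(word) || word.contains("_")) {
                continue;
            }
            List<String> postings = groups.get(word);
            if (postings == null) {
                postings = new ArrayList<String>();
                groups.put(word, postings);
            }
            postings.add(fileName + "@" + offset);
        }
    }

    /** "Reduce" step: join all postings of one word with commas. */
    static String reduce(List<String> postings) {
        StringBuilder sb = new StringBuilder();
        for (String p : postings) {
            if (sb.length() > 0) {
                sb.append(",");
            }
            sb.append(p);
        }
        return sb.toString();
    }

    /** Runs both steps and returns word -> comma-separated postings, sorted by word. */
    static Map<String, String> index(String fileName, String[] lines) {
        Map<String, List<String>> groups = new TreeMap<String, List<String>>();
        long offset = 0;
        for (String line : lines) {
            map(fileName, offset, line, groups);
            offset += line.length() + 1; // start of the next line (ASCII, one-byte newline assumed)
        }
        Map<String, String> result = new TreeMap<String, String>();
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        String[] lines = { "The quality of mercy", "Mercy is a virtue" };
        for (Map.Entry<String, String> e : index("a.txt", lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In this sketch the postings of a word appear in input order; in a real job the order of the values handed to the reducer is not guaranteed.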
These are the changes necessary for the Mapper and Reducer classes; they now extend the new base classes directly, without the need to extend MapReduceBase. In order to unit test these classes, changes to the MRUnit tests are also necessary. MRUnit's drivers also gained a new “mapreduce” package with counterparts of the same names.
Instead of the mrunit.MapDriver, use the mrunit.mapreduce.MapDriver. The same goes for the ReduceDriver. The rest of the code stays the same.
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
JUnit’s MapperTest
The test classes follow the same pattern as the main API: import the driver from the “mapreduce” package and declare the mapper with the new Mapper class.
// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)
package index;
import static org.apache.hadoop.mrunit.testutil.ExtendedAssert.assertListEquals;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import junit.framework.TestCase;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mock.MockInputSplit;
import org.apache.hadoop.mrunit.types.Pair;
import org.junit.Before;
import org.junit.Test;
/**
* Test cases for the inverted index mapper.
*/
public class MapperTest extends TestCase {
private Mapper<LongWritable, Text, Text, Text> mapper;
private MapDriver<LongWritable, Text, Text, Text> driver;
/** We expect pathname@offset for the value from each of these */
private final Text EXPECTED_OFFSET = new Text(MockInputSplit.getMockPath().toString() + "@0");
@Before
public void setUp() {
mapper = new LineIndexMapper();
driver = new MapDriver<LongWritable, Text, Text, Text>(mapper);
}
@Test
public void testEmpty() {
List<Pair<Text, Text>> out = null;
try {
out = driver.withInput(new LongWritable(0), new Text("")).run();
} catch (IOException ioe) {
fail();
}
List<Pair<Text, Text>> expected = new ArrayList<Pair<Text, Text>>();
assertListEquals(expected, out);
}
@Test
public void testOneWord() {
List<Pair<Text, Text>> out = null;
try {
out = driver.withInput(new LongWritable(0), new Text("foo")).run();
} catch (IOException ioe) {
fail();
}
List<Pair<Text, Text>> expected = new ArrayList<Pair<Text, Text>>();
expected.add(new Pair<Text, Text>(new Text("foo"), EXPECTED_OFFSET));
assertListEquals(expected, out);
}
@Test
public void testMultiWords() {
List<Pair<Text, Text>> out = null;
try {
out = driver.withInput(new LongWritable(0), new Text("foo bar baz!!!! ????")).run();
} catch (IOException ioe) {
fail();
}
List<Pair<Text, Text>> expected = new ArrayList<Pair<Text, Text>>();
expected.add(new Pair<Text, Text>(new Text("foo"), EXPECTED_OFFSET));
expected.add(new Pair<Text, Text>(new Text("bar"), EXPECTED_OFFSET));
expected.add(new Pair<Text, Text>(new Text("baz"), EXPECTED_OFFSET));
assertListEquals(expected, out);
}
}
JUnit’s ReducerTest
// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)
package index;
import static org.apache.hadoop.mrunit.testutil.ExtendedAssert.assertListEquals;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import junit.framework.TestCase;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.apache.hadoop.mrunit.types.Pair;
import org.junit.Before;
import org.junit.Test;
/**
* Test cases for the inverted index reducer.
*/
public class ReducerTest extends TestCase {
private Reducer<Text, Text, Text, Text> reducer;
private ReduceDriver<Text, Text, Text, Text> driver;
@Before
public void setUp() {
reducer = new LineIndexReducer();
driver = new ReduceDriver<Text, Text, Text, Text>(reducer);
}
@Test
public void testOneOffset() {
List<Pair<Text, Text>> out = null;
try {
out = driver.withInputKey(new Text("word")).withInputValue(new Text("offset")).run();
} catch (IOException ioe) {
fail();
}
List<Pair<Text, Text>> expected = new ArrayList<Pair<Text, Text>>();
expected.add(new Pair<Text, Text>(new Text("word"), new Text("offset")));
assertListEquals(expected, out);
}
@Test
public void testMultiOffset() {
List<Pair<Text, Text>> out = null;
try {
out = driver.withInputKey(new Text("word")).withInputValue(new Text("offset1")).withInputValue(
new Text("offset2")).run();
} catch (IOException ioe) {
fail();
}
List<Pair<Text, Text>> expected = new ArrayList<Pair<Text, Text>>();
expected.add(new Pair<Text, Text>(new Text("word"), new Text("offset1,offset2")));
assertListEquals(expected, out);
}
}
You can test them using the command “ant test” in the source-code directory as usual to confirm that the implementation is correct:
training@training-vm:~/git/exercises/shakespeare$ ant test
Buildfile: build.xml
compile:
[javac] Compiling 4 source files to /home/training/git/exercises/shakespeare/bin
test:
[junit] Running index.AllTests
[junit] Testsuite: index.AllTests
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 0.418 sec
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 0.418 sec
[junit]
BUILD SUCCESSFUL
Total time: 2 seconds
Replacing JobConf and other deprecated classes
Other changes related to the API concern the configuration of job execution. The class “JobConf” was deprecated, but most of the tutorials have not been updated. So, here’s the updated version of the main example driver using the Configuration and Job classes. Note that the job is created from the default Configuration; the Job class is the one responsible for configuring and submitting the execution of the tasks. Once again, replacing the classes from the package “mapred” is important, since the new classes are located in the package “mapreduce”. The following code highlights the new classes imported and how they are used throughout the Driver.
InvertedIndex driver
// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)
package index;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
/**
* LineIndexer Creates an inverted index over all the words in a document corpus, mapping each observed word to a list
* of filename@offset locations where it occurs.
*/
public class LineIndexer extends Configured implements Tool {
// where to put the data in hdfs when we're done
private static final String OUTPUT_PATH = "output";
// where to read the data from.
private static final String INPUT_PATH = "input";
public int run(String[] args) throws Exception {
Configuration conf = getConf();
Job job = new Job(conf, "Line Indexer 1");
job.setJarByClass(LineIndexer.class);
job.setMapperClass(LineIndexMapper.class);
job.setReducerClass(LineIndexReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new LineIndexer(), args);
System.exit(res);
}
}
After updating, make sure to generate a new jar, remove anything under the directory “output” (since the program does not clean that up), and execute the new version.
training@training-vm:~/git/exercises/shakespeare$ ant jar
Buildfile: build.xml
compile:
[javac] Compiling 4 source files to /home/training/git/exercises/shakespeare/bin
jar:
[jar] Building jar: /home/training/git/exercises/shakespeare/indexer.jar
BUILD SUCCESSFUL
Total time: 1 second
I have added 2 ASCII books in the input directory: the works from Leonardo Da Vinci and the first volume of the book “The outline of science”.
training@training-vm:~/git/exercises/shakespeare$ hadoop fs -ls input
Found 3 items
-rw-r--r-- 1 training supergroup 5342761 2009-12-30 11:57 /user/training/input/all-shakespeare
-rw-r--r-- 1 training supergroup 1427769 2010-01-04 17:42 /user/training/input/leornardo-davinci-all.txt
-rw-r--r-- 1 training supergroup 674762 2010-01-04 17:42 /user/training/input/the-outline-of-science-vol1.txt
The execution of this example and its output are shown below.
training@training-vm:~/git/exercises/shakespeare$ hadoop jar indexer.jar index.LineIndexer
10/01/04 21:11:55 INFO input.FileInputFormat: Total input paths to process : 3
10/01/04 21:11:56 INFO mapred.JobClient: Running job: job_200912301017_0017
10/01/04 21:11:57 INFO mapred.JobClient: map 0% reduce 0%
10/01/04 21:12:07 INFO mapred.JobClient: map 33% reduce 0%
10/01/04 21:12:10 INFO mapred.JobClient: map 58% reduce 0%
10/01/04 21:12:13 INFO mapred.JobClient: map 63% reduce 0%
10/01/04 21:12:16 INFO mapred.JobClient: map 100% reduce 11%
10/01/04 21:12:28 INFO mapred.JobClient: map 100% reduce 77%
10/01/04 21:12:34 INFO mapred.JobClient: map 100% reduce 100%
10/01/04 21:12:36 INFO mapred.JobClient: Job complete: job_200912301017_0017
10/01/04 21:12:36 INFO mapred.JobClient: Counters: 17
10/01/04 21:12:36 INFO mapred.JobClient: Job Counters
10/01/04 21:12:36 INFO mapred.JobClient: Launched reduce tasks=1
10/01/04 21:12:36 INFO mapred.JobClient: Launched map tasks=3
10/01/04 21:12:36 INFO mapred.JobClient: Data-local map tasks=3
10/01/04 21:12:36 INFO mapred.JobClient: FileSystemCounters
10/01/04 21:12:36 INFO mapred.JobClient: FILE_BYTES_READ=58068623
10/01/04 21:12:36 INFO mapred.JobClient: HDFS_BYTES_READ=7445292
10/01/04 21:12:36 INFO mapred.JobClient: FILE_BYTES_WRITTEN=92132872
10/01/04 21:12:36 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=26638259
10/01/04 21:12:36 INFO mapred.JobClient: Map-Reduce Framework
10/01/04 21:12:36 INFO mapred.JobClient: Reduce input groups=0
10/01/04 21:12:36 INFO mapred.JobClient: Combine output records=0
10/01/04 21:12:36 INFO mapred.JobClient: Map input records=220255
10/01/04 21:12:36 INFO mapred.JobClient: Reduce shuffle bytes=34064153
10/01/04 21:12:36 INFO mapred.JobClient: Reduce output records=0
10/01/04 21:12:36 INFO mapred.JobClient: Spilled Records=2762272
10/01/04 21:12:36 INFO mapred.JobClient: Map output bytes=32068217
10/01/04 21:12:36 INFO mapred.JobClient: Combine input records=0
10/01/04 21:12:36 INFO mapred.JobClient: Map output records=997959
10/01/04 21:12:36 INFO mapred.JobClient: Reduce input records=997959
The index entry for the word “abandoned” is an example of one present in all of the books:
training@training-vm:~/git/exercises/shakespeare$ hadoop fs -cat output/part-r-00000 | less
...
...
abandoned leornardo-davinci-all.txt@1257995,leornardo-davinci-all.txt@652992,all-shakespeare@4657862,all-shakespeare@738818,the-outline-of-science-vol1.txt@642211,the-outline-of-science-vol1.txt@606442,the-outline-of-science-vol1.txt@641585
...
...
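An index entry like the one above can be turned back into structured data with a few lines of Java. The following is a minimal sketch, not part of the original exercise: the class IndexEntryParser and its Posting type are hypothetical, and it assumes the word is separated from its postings by whitespace (the default text output writes a tab between key and value) and that each posting has the form fileName@offset.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexEntryParser {

    /** One occurrence of a word: the file name and the byte offset of its line. */
    static final class Posting {
        final String fileName;
        final long offset;

        Posting(String fileName, long offset) {
            this.fileName = fileName;
            this.offset = offset;
        }

        @Override
        public String toString() {
            return fileName + " at byte " + offset;
        }
    }

    /** Parses "word<whitespace>file@offset,file@offset,..." into a list of postings. */
    static List<Posting> parsePostings(String line) {
        List<Posting> postings = new ArrayList<Posting>();
        String[] keyAndValue = line.trim().split("\\s+", 2);
        if (keyAndValue.length < 2) {
            return postings; // a word with no postings, or an empty line
        }
        for (String entry : keyAndValue[1].split(",")) {
            // Use the last '@' so that file names containing '@' still parse.
            int at = entry.lastIndexOf('@');
            if (at < 0) {
                continue; // skip malformed entries instead of failing
            }
            String fileName = entry.substring(0, at);
            long offset = Long.parseLong(entry.substring(at + 1).trim());
            postings.add(new Posting(fileName, offset));
        }
        return postings;
    }

    public static void main(String[] args) {
        String line = "abandoned\tall-shakespeare@4657862,all-shakespeare@738818";
        for (Posting p : parsePostings(line)) {
            System.out.println(p);
        }
    }
}
```

With the offsets parsed as numbers, a reader could seek directly to the matching line in the original file.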
Hi,
I have a small question on how jobs are submitted. In your example you have used command
‘hadoop jar indexer.jar index.LineIndexer’ to run the job.
Is there a way I can submit jobs to the job tracker from a Java class, rather than running it through hadoop?
Thanks.
Maybe you’re looking for submitting jobs to queues. Take a look at http://hadoop.apache.org/common/docs/r0.20.1/mapred_tutorial.html, on the topic “Submitting Jobs to Queues”. Sorry for the long delay to reply.
Thanks for posting this.
No problem! I thought I could share that with the community…
Tried on CDH3 with wordcount example, and noticed that the reducer failed to add up values. Any ideas?
I have no idea why the example does not work… I don’t know if there is a problem in the implementation, but I decided to set up a new environment using only Hadoop on an Ubuntu box. The process is described at http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster) and I could run not only the word count, but other MapReduce apps I developed in CDH2 on top of the training artifacts on the VM (see /home/training/workspace/ projects). The only missing components on a fresh Hadoop install were the libs for Unit Testing: HADOOP/contrib/mrunit/hadoop-0.20.1+133-mrunit.jar and HADOOP/lib/junit-4.5.jar.
You can download my implementations at http://code.google.com/p/programming-artifacts/source/browse/#svn/trunk/workspaces/hacking/hadoop-training/shakespeare/stub-src/src/index. You can try them on a CDH3 by verifying if the Unit testing libs are available.
1. SCP any missing files from the old CDH2 to CDH3, or any other hadoop installation.
[training@training-vm:/usr/lib/hadoop-0.20]$ scp contrib/mrunit/hadoop-0.20.1+133-mrunit.jar mdesales@cu064:/u1/hadoop/hadoop-0.20.2/contrib/mrunit/
[training@training-vm:/usr/lib/hadoop-0.20]$ scp lib/junit-4.5.jar mdesales@cu064:/u1/hadoop/hadoop-0.20.2/lib/
2. svn co from http://code.google.com/p/programming-artifacts/
svn co https://programming-artifacts.googlecode.com/svn/trunk/ programming-artifacts --username YOUR_GOOGLE_USER
3. cd /u1/hadoop/apps/programming-artifacts/workspaces/hacking/hadoop-training/shakespeare
4. ant
5. Before moving on, just make sure you have copied the text books to the HDFS.
[mdesales@cu064 shakespeare]$ /u1/hadoop/hadoop-0.20.2/bin/hadoop dfs -copyFromLocal /u1/hadoop/local-data/ascii-books/ input
6. For example, you can run my implementation of the TF-IDF algorithm.
[mdesales@cu064 shakespeare]$ /u1/hadoop/hadoop-0.20.2/bin/hadoop jar indexer.jar index.WordsInCorpusTFIDF
10/08/04 01:25:35 INFO input.FileInputFormat: Total input paths to process : 3
10/08/04 01:25:35 INFO mapred.JobClient: Running job: job_201008032330_0009
10/08/04 01:25:36 INFO mapred.JobClient: map 0% reduce 0%
10/08/04 01:25:44 INFO mapred.JobClient: map 66% reduce 0%
10/08/04 01:25:47 INFO mapred.JobClient: map 100% reduce 0%
10/08/04 01:25:54 INFO mapred.JobClient: map 100% reduce 22%
10/08/04 01:26:04 INFO mapred.JobClient: map 100% reduce 100%
10/08/04 01:26:06 INFO mapred.JobClient: Job complete: job_201008032330_0009
10/08/04 01:26:06 INFO mapred.JobClient: Counters: 17
10/08/04 01:26:06 INFO mapred.JobClient: Job Counters
10/08/04 01:26:06 INFO mapred.JobClient: Launched reduce tasks=1
10/08/04 01:26:06 INFO mapred.JobClient: Launched map tasks=3
10/08/04 01:26:06 INFO mapred.JobClient: Data-local map tasks=3
10/08/04 01:26:06 INFO mapred.JobClient: FileSystemCounters
10/08/04 01:26:06 INFO mapred.JobClient: FILE_BYTES_READ=17321980
10/08/04 01:26:06 INFO mapred.JobClient: HDFS_BYTES_READ=3639512
10/08/04 01:26:06 INFO mapred.JobClient: FILE_BYTES_WRITTEN=34644068
10/08/04 01:26:06 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2047700
10/08/04 01:26:06 INFO mapred.JobClient: Map-Reduce Framework
10/08/04 01:26:06 INFO mapred.JobClient: Reduce input groups=53095
10/08/04 01:26:06 INFO mapred.JobClient: Combine output records=0
10/08/04 01:26:06 INFO mapred.JobClient: Map input records=77934
10/08/04 01:26:06 INFO mapred.JobClient: Reduce shuffle bytes=14156791
10/08/04 01:26:06 INFO mapred.JobClient: Reduce output records=53095
10/08/04 01:26:06 INFO mapred.JobClient: Spilled Records=831020
10/08/04 01:26:06 INFO mapred.JobClient: Map output bytes=16490954
10/08/04 01:26:06 INFO mapred.JobClient: Combine input records=0
10/08/04 01:26:06 INFO mapred.JobClient: Map output records=415510
10/08/04 01:26:06 INFO mapred.JobClient: Reduce input records=415510
10/08/04 01:26:06 INFO input.FileInputFormat: Total input paths to process : 1
10/08/04 01:26:06 INFO mapred.JobClient: Running job: job_201008032330_0010
10/08/04 01:26:07 INFO mapred.JobClient: map 0% reduce 0%
10/08/04 01:26:17 INFO mapred.JobClient: map 100% reduce 0%
10/08/04 01:26:30 INFO mapred.JobClient: map 100% reduce 100%
10/08/04 01:26:32 INFO mapred.JobClient: Job complete: job_201008032330_0010
10/08/04 01:26:32 INFO mapred.JobClient: Counters: 17
10/08/04 01:26:32 INFO mapred.JobClient: Job Counters
10/08/04 01:26:32 INFO mapred.JobClient: Launched reduce tasks=1
10/08/04 01:26:32 INFO mapred.JobClient: Launched map tasks=1
10/08/04 01:26:32 INFO mapred.JobClient: Data-local map tasks=1
10/08/04 01:26:32 INFO mapred.JobClient: FileSystemCounters
10/08/04 01:26:32 INFO mapred.JobClient: FILE_BYTES_READ=2153896
10/08/04 01:26:32 INFO mapred.JobClient: HDFS_BYTES_READ=2047700
10/08/04 01:26:32 INFO mapred.JobClient: FILE_BYTES_WRITTEN=4307824
10/08/04 01:26:32 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2410104
10/08/04 01:26:32 INFO mapred.JobClient: Map-Reduce Framework
10/08/04 01:26:32 INFO mapred.JobClient: Reduce input groups=3
10/08/04 01:26:32 INFO mapred.JobClient: Combine output records=0
10/08/04 01:26:32 INFO mapred.JobClient: Map input records=53095
10/08/04 01:26:32 INFO mapred.JobClient: Reduce shuffle bytes=0
10/08/04 01:26:32 INFO mapred.JobClient: Reduce output records=53095
10/08/04 01:26:32 INFO mapred.JobClient: Spilled Records=106190
10/08/04 01:26:32 INFO mapred.JobClient: Map output bytes=2047700
10/08/04 01:26:32 INFO mapred.JobClient: Combine input records=0
10/08/04 01:26:32 INFO mapred.JobClient: Map output records=53095
10/08/04 01:26:32 INFO mapred.JobClient: Reduce input records=53095
10/08/04 01:26:32 INFO input.FileInputFormat: Total input paths to process : 1
10/08/04 01:26:32 INFO mapred.JobClient: Running job: job_201008032330_0011
10/08/04 01:26:33 INFO mapred.JobClient: map 0% reduce 0%
10/08/04 01:26:44 INFO mapred.JobClient: map 100% reduce 0%
10/08/04 01:26:57 INFO mapred.JobClient: map 100% reduce 100%
10/08/04 01:26:59 INFO mapred.JobClient: Job complete: job_201008032330_0011
10/08/04 01:26:59 INFO mapred.JobClient: Counters: 17
10/08/04 01:26:59 INFO mapred.JobClient: Job Counters
10/08/04 01:26:59 INFO mapred.JobClient: Launched reduce tasks=1
10/08/04 01:26:59 INFO mapred.JobClient: Launched map tasks=1
10/08/04 01:26:59 INFO mapred.JobClient: Data-local map tasks=1
10/08/04 01:26:59 INFO mapred.JobClient: FileSystemCounters
10/08/04 01:26:59 INFO mapred.JobClient: FILE_BYTES_READ=2516300
10/08/04 01:26:59 INFO mapred.JobClient: HDFS_BYTES_READ=2410104
10/08/04 01:26:59 INFO mapred.JobClient: FILE_BYTES_WRITTEN=5032632
10/08/04 01:26:59 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=3521291
10/08/04 01:26:59 INFO mapred.JobClient: Map-Reduce Framework
10/08/04 01:26:59 INFO mapred.JobClient: Reduce input groups=38637
10/08/04 01:26:59 INFO mapred.JobClient: Combine output records=0
10/08/04 01:26:59 INFO mapred.JobClient: Map input records=53095
10/08/04 01:26:59 INFO mapred.JobClient: Reduce shuffle bytes=0
10/08/04 01:26:59 INFO mapred.JobClient: Reduce output records=53095
10/08/04 01:26:59 INFO mapred.JobClient: Spilled Records=106190
10/08/04 01:26:59 INFO mapred.JobClient: Map output bytes=2410104
10/08/04 01:26:59 INFO mapred.JobClient: Combine input records=0
10/08/04 01:26:59 INFO mapred.JobClient: Map output records=53095
10/08/04 01:26:59 INFO mapred.JobClient: Reduce input records=53095
7. You can see the results in the HDFS directory “3-tf-idf”.
[mdesales@cu064 shakespeare]$ /u1/hadoop/hadoop-0.20.2/bin/hadoop dfs -ls 3-tf-idf
Found 2 items
drwxr-xr-x - mdesales supergroup 0 2010-08-04 01:26 /user/mdesales/3-tf-idf/_logs
-rw-r--r-- 1 mdesales supergroup 3521291 2010-08-04 01:26 /user/mdesales/3-tf-idf/part-r-00000
8. Run cat on the output file to see the results, which are written in sorted order.
[mdesales@cu064 shakespeare]$ /u1/hadoop/hadoop-0.20.2/bin/hadoop dfs -cat 3-tf-idf/part-r-00000
a1@joyce-james-book.txt [1/3 , 1/195261 , 0.00000244]
aa@the-notebooks-of-leonardo-da-vinci.txt [1/3 , 1/149599 , 0.00000319]
aaron@joyce-james-book.txt [1/3 , 2/195261 , 0.00000489]
ab@the-notebooks-of-leonardo-da-vinci.txt [1/3 , 3/149599 , 0.00000957]
abacho@the-notebooks-of-leonardo-da-vinci.txt [1/3 , 1/149599 , 0.00000319]
aback@joyce-james-book.txt [1/3 , 1/195261 , 0.00000244]
abacus@the-notebooks-of-leonardo-da-vinci.txt [1/3 , 3/149599 , 0.00000957]
abaft@joyce-james-book.txt [1/3 , 1/195261 , 0.00000244]
abandon@the-notebooks-of-leonardo-da-vinci.txt [2/3 , 6/149599 , 0.00000706]
abandon@joyce-james-book.txt [2/3 , 1/195261 , 0.0000009]
abandoned@the-outline-of-science-vol1.txt [3/3 , 3/70650 , 0.00004246]
abandoned@the-notebooks-of-leonardo-da-vinci.txt [3/3 , 2/149599 , 0.00001337]
abandoned@joyce-james-book.txt [3/3 , 7/195261 , 0.00003585]
abandoning@the-notebooks-of-leonardo-da-vinci.txt [2/3 , 2/149599 , 0.00000235]
abandoning@joyce-james-book.txt [2/3 , 1/195261 , 0.0000009]
abandonment@the-outline-of-science-vol1.txt [2/3 , 1/70650 , 0.00000249]
abandonment@joyce-james-book.txt [2/3 , 1/195261 , 0.0000009]
abandons@the-notebooks-of-leonardo-da-vinci.txt [1/3 , 2/149599 , 0.00000638]
abasement@joyce-james-book.txt [1/3 , 2/195261 , 0.00000489]
abatement@joyce-james-book.txt [1/3 , 1/195261 , 0.00000244]
abattoir@joyce-james-book.txt [1/3 , 1/195261 , 0.00000244]
abbas@joyce-james-book.txt [1/3 , 2/195261 , 0.00000489]
abbate@the-notebooks-of-leonardo-da-vinci.txt [1/3 , 1/149599 , 0.00000319]
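The three numbers in each bracket of the listing above can be checked by hand. The sketch below is an inference from the printed values, not the code from the repository: it assumes the first fraction is (documents containing the word)/(documents in the corpus), the second is (occurrences in this document)/(words in this document), and the score is the second fraction multiplied by log10 of the inverted first one. The listing also suggests that a word present in every document keeps its plain term frequency as score, so that special case is assumed here too. The class and method names are hypothetical.

```java
import java.util.Locale;

public class TfIdfCheck {

    /**
     * Recomputes a score under the assumptions stated above:
     * tf = occurrences / wordsInDoc, idf = log10(totalDocs / docsWithWord),
     * with idf treated as 1 when the word appears in every document.
     */
    static double tfIdf(int docsWithWord, int totalDocs, int occurrences, int wordsInDoc) {
        double tf = (double) occurrences / wordsInDoc;
        double idf = (docsWithWord == totalDocs)
                ? 1.0 // assumption inferred from the "3/3" rows of the listing
                : Math.log10((double) totalDocs / docsWithWord);
        return tf * idf;
    }

    /** Formats a score with eight decimals, independent of the default locale. */
    static String format(double score) {
        return String.format(Locale.ROOT, "%.8f", score);
    }

    public static void main(String[] args) {
        // a1@joyce-james-book.txt [1/3 , 1/195261 , ...]
        System.out.println(format(tfIdf(1, 3, 1, 195261)));
        // abandon@the-notebooks-of-leonardo-da-vinci.txt [2/3 , 6/149599 , ...]
        System.out.println(format(tfIdf(2, 3, 6, 149599)));
        // abandoned@the-outline-of-science-vol1.txt [3/3 , 3/70650 , ...]
        System.out.println(format(tfIdf(3, 3, 3, 70650)));
    }
}
```

Under these assumptions the three calls reproduce the scores printed for those rows; the listing drops trailing zeros (for example 0.0000009), which this sketch does not.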
Any other questions, let me know! Sorry for the delay… I hope this can help you and others…
Marcello
Hi ,
I tried out this program on Hadoop 0.20.203rc1 on an Ubuntu VM. The code is the same; however, I am facing a weird issue where the output file is zero bytes at the end of the MapReduce job, which runs successfully.
11/12/22 03:09:00 INFO input.FileInputFormat: Total input paths to process : 3
11/12/22 03:09:00 INFO mapred.JobClient: Running job: job_201112180234_0054
11/12/22 03:09:01 INFO mapred.JobClient: map 0% reduce 0%
11/12/22 03:09:21 INFO mapred.JobClient: map 66% reduce 0%
11/12/22 03:09:39 INFO mapred.JobClient: map 100% reduce 0%
11/12/22 03:09:42 INFO mapred.JobClient: map 100% reduce 22%
11/12/22 03:09:51 INFO mapred.JobClient: map 100% reduce 100%
11/12/22 03:09:56 INFO mapred.JobClient: Job complete: job_201112180234_0054
11/12/22 03:09:56 INFO mapred.JobClient: Counters: 24
11/12/22 03:09:56 INFO mapred.JobClient: Job Counters
11/12/22 03:09:56 INFO mapred.JobClient: Launched reduce tasks=1
11/12/22 03:09:56 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=48203
11/12/22 03:09:56 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/22 03:09:56 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/12/22 03:09:56 INFO mapred.JobClient: Launched map tasks=3
11/12/22 03:09:56 INFO mapred.JobClient: Data-local map tasks=3
11/12/22 03:09:56 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=28753
11/12/22 03:09:56 INFO mapred.JobClient: File Output Format Counters
11/12/22 03:09:56 INFO mapred.JobClient: Bytes Written=0
11/12/22 03:09:56 INFO mapred.JobClient: FileSystemCounters
11/12/22 03:09:56 INFO mapred.JobClient: FILE_BYTES_READ=6
11/12/22 03:09:56 INFO mapred.JobClient: HDFS_BYTES_READ=1112294
11/12/22 03:09:56 INFO mapred.JobClient: FILE_BYTES_WRITTEN=85473
11/12/22 03:09:56 INFO mapred.JobClient: File Input Format Counters
11/12/22 03:09:56 INFO mapred.JobClient: Bytes Read=1111938
The Bytes Written and HDFS_BYTES_WRITTEN counters seem to be zero, so the problem seems to be in the file output format.
Thanks
Gk