An encrypted message for you… The math behind the RSA… Step-by-step…

So, you’ve come here because you’re curious about my encrypted message 😀 This post shows the output of an application I wrote a few years back, during my undergrad in Computer Science, based on RSA encryption. So, here’s the encrypted message:

1142022-35601-1-64-1-6020454-8000-5985272-1331-2299968-1753099-
1873700-4200047-1227254-343-3740201-752788-3764509-32768-2406104-
2874958-1771561-238328-1-4830182-3014099-8000-856850-1-373248-
900525-2207342-1985915-27-9261-912673-9261-226981-2689412-945762-
64-5509130-4680829-4567886-729-3969407-1873700-9261-3978260-
4869518-1331-8000-5130977-2197-8-4913-74088-1367631-941192-
3957527-6859-2779352-1-68921-3129271-4913-64-3734521-2523631-
3553443-2763573-6190443-5037509-1-5347086-729-6226022-68921-2406104

Information Hiding is a very old technique used to hide secrets 🙂 So, I use RSA to encrypt messages…

#### RSA Public Key Creation Process ####

-> Configuring random prime numbers
        P = 2591
        Q = 2411

-> Calculating public keys

   N = P * Q; N = 6246901
-> FI = (P-1) * (Q-1); FI = 6241900

-> Calculating (E)
           While MCD(n >= 2 , 6241900) != 1     (MCD = GCD, the greatest common divisor)
           MCD(2 , 6241900) = 2
           MCD(3 , 6241900) = 1 Correct!
          E = 3

       Public Key (N,E) = (6246901 , 3)

#### RSA Private Key Creation Process ####

-> Calculating private keys
         Initializing (p1,p2,p3) = (1, 0 , FI(n))
         Initializing (q1,q2,q3) = (0, 1 ,  E  )
         While q3 != 0
             quoc = p3 / q3
             (t1,t2,t3) = (p1,p2,p3) - quoc * (q1,q2,q3)
             After, arrange the values:
             (p1,p2,p3) = (q1,q2,q3)
             (q1,q2,q3) = (t1,t2,t3)

           (3 <> 0) , then:
             quoc = 6241900 / 3 = 2080633
             (t1,t2,t3) = (1,0,6241900) - 2080633 * (0,1,3) = (1,-2080633,1)
             (p1,p2,p3) = (0,1,3)
             (q1,q2,q3) = (1,-2080633,1)

           (1 <> 0) , then:
             quoc = 3 / 1 = 3
             (t1,t2,t3) = (0,1,3) - 3 * (1,-2080633,1) = (-3,6241900,0)
             (p1,p2,p3) = (1,-2080633,1)
             (q1,q2,q3) = (-3,6241900,0)

         q3 is zero (0). Now check the value of p2. If it is negative, add FI to it
         (this represents a negative element of Z(n) by its positive equivalent).

         p2 = -2080633;
          Since p2 is negative, we have:
          D = p2 + FI; D = -2080633 + 6241900 = 4161267

      Private Key (N,D) = (6246901, 4161267);

#### RSA Keys Summary ####

Public Key (N, E) = (6246901, 3)
Private Key (N, D) = (6246901, 4161267)
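
A quick aside from me (not part of the program output): the two exponents above are modular inverses of each other modulo FI, which is exactly what lets the private key undo the public one. A couple of lines of Java are enough to check it:

import java.math.BigInteger;

public class KeyCheck {
    public static void main(String[] args) {
        BigInteger e = BigInteger.valueOf(3);          // public exponent E
        BigInteger d = BigInteger.valueOf(4161267);    // private exponent D
        BigInteger fi = BigInteger.valueOf(6241900);   // FI = (P-1) * (Q-1)

        // E * D mod FI must be 1, otherwise decryption would not undo encryption
        System.out.println(e.multiply(d).mod(fi));     // prints 1
    }
}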

The public key is the one you hand out to the other party; the private key is kept secret 🙂 Ready to see what I want during the Hacker Dojo Day??? Take the 2 RSA keys above and follow the calculations down to the SECRET MESSAGE 😀

#### Decoding the Message ####

-> Setting the private key
(N , D) = (6246901 , 4161267)

-> Decrypting each block
Ascii(x) = x ^ D mod N

Ascii(1142022) = 1142022 ^ 4161267 mod 6246901 = 1771
Ascii(35601) = 35601 ^ 4161267 mod 6246901 = 972
Ascii(1) = 1 ^ 4161267 mod 6246901 = 1
Ascii(64) = 64 ^ 4161267 mod 6246901 = 4
Ascii(1) = 1 ^ 4161267 mod 6246901 = 1
Ascii(6020454) = 6020454 ^ 4161267 mod 6246901 = 99201
Ascii(8000) = 8000 ^ 4161267 mod 6246901 = 20
Ascii(5985272) = 5985272 ^ 4161267 mod 6246901 = 82082
Ascii(1331) = 1331 ^ 4161267 mod 6246901 = 11
Ascii(2299968) = 2299968 ^ 4161267 mod 6246901 = 132
Ascii(1753099) = 1753099 ^ 4161267 mod 6246901 = 200
Ascii(1873700) = 1873700 ^ 4161267 mod 6246901 = 201
Ascii(4200047) = 4200047 ^ 4161267 mod 6246901 = 13218
Ascii(1227254) = 1227254 ^ 4161267 mod 6246901 = 319
Ascii(343) = 343 ^ 4161267 mod 6246901 = 7
Ascii(3740201) = 3740201 ^ 4161267 mod 6246901 = 20820
Ascii(752788) = 752788 ^ 4161267 mod 6246901 = 1215
Ascii(3764509) = 3764509 ^ 4161267 mod 6246901 = 1581
Ascii(32768) = 32768 ^ 4161267 mod 6246901 = 32
Ascii(2406104) = 2406104 ^ 4161267 mod 6246901 = 134
Ascii(2874958) = 2874958 ^ 4161267 mod 6246901 = 17120
Ascii(1771561) = 1771561 ^ 4161267 mod 6246901 = 121
Ascii(238328) = 238328 ^ 4161267 mod 6246901 = 62
Ascii(1) = 1 ^ 4161267 mod 6246901 = 1
Ascii(4830182) = 4830182 ^ 4161267 mod 6246901 = 6205
Ascii(3014099) = 3014099 ^ 4161267 mod 6246901 = 210
Ascii(8000) = 8000 ^ 4161267 mod 6246901 = 20
Ascii(856850) = 856850 ^ 4161267 mod 6246901 = 3132
Ascii(1) = 1 ^ 4161267 mod 6246901 = 1
Ascii(373248) = 373248 ^ 4161267 mod 6246901 = 72
Ascii(900525) = 900525 ^ 4161267 mod 6246901 = 2052
Ascii(2207342) = 2207342 ^ 4161267 mod 6246901 = 1420
Ascii(1985915) = 1985915 ^ 4161267 mod 6246901 = 12001
Ascii(27) = 27 ^ 4161267 mod 6246901 = 3
Ascii(9261) = 9261 ^ 4161267 mod 6246901 = 21
Ascii(912673) = 912673 ^ 4161267 mod 6246901 = 97
Ascii(9261) = 9261 ^ 4161267 mod 6246901 = 21
Ascii(226981) = 226981 ^ 4161267 mod 6246901 = 61
Ascii(2689412) = 2689412 ^ 4161267 mod 6246901 = 3221
Ascii(945762) = 945762 ^ 4161267 mod 6246901 = 620
Ascii(64) = 64 ^ 4161267 mod 6246901 = 4
Ascii(5509130) = 5509130 ^ 4161267 mod 6246901 = 2011
Ascii(4680829) = 4680829 ^ 4161267 mod 6246901 = 32172
Ascii(4567886) = 4567886 ^ 4161267 mod 6246901 = 1971
Ascii(729) = 729 ^ 4161267 mod 6246901 = 9
Ascii(3969407) = 3969407 ^ 4161267 mod 6246901 = 9207
Ascii(1873700) = 1873700 ^ 4161267 mod 6246901 = 201
Ascii(9261) = 9261 ^ 4161267 mod 6246901 = 21
Ascii(3978260) = 3978260 ^ 4161267 mod 6246901 = 41321
Ascii(4869518) = 4869518 ^ 4161267 mod 6246901 = 682
Ascii(1331) = 1331 ^ 4161267 mod 6246901 = 11
Ascii(8000) = 8000 ^ 4161267 mod 6246901 = 20
Ascii(5130977) = 5130977 ^ 4161267 mod 6246901 = 6211
Ascii(2197) = 2197 ^ 4161267 mod 6246901 = 13
Ascii(8) = 8 ^ 4161267 mod 6246901 = 2
Ascii(4913) = 4913 ^ 4161267 mod 6246901 = 17
Ascii(74088) = 74088 ^ 4161267 mod 6246901 = 42
Ascii(1367631) = 1367631 ^ 4161267 mod 6246901 = 111
Ascii(941192) = 941192 ^ 4161267 mod 6246901 = 98
Ascii(3957527) = 3957527 ^ 4161267 mod 6246901 = 132170
Ascii(6859) = 6859 ^ 4161267 mod 6246901 = 19
Ascii(2779352) = 2779352 ^ 4161267 mod 6246901 = 72052
Ascii(1) = 1 ^ 4161267 mod 6246901 = 1
Ascii(68921) = 68921 ^ 4161267 mod 6246901 = 41
Ascii(3129271) = 3129271 ^ 4161267 mod 6246901 = 32140
Ascii(4913) = 4913 ^ 4161267 mod 6246901 = 17
Ascii(64) = 64 ^ 4161267 mod 6246901 = 4
Ascii(3734521) = 3734521 ^ 4161267 mod 6246901 = 1972
Ascii(2523631) = 2523631 ^ 4161267 mod 6246901 = 1013
Ascii(3553443) = 3553443 ^ 4161267 mod 6246901 = 214
Ascii(2763573) = 2763573 ^ 4161267 mod 6246901 = 9154
Ascii(6190443) = 6190443 ^ 4161267 mod 6246901 = 1441
Ascii(5037509) = 5037509 ^ 4161267 mod 6246901 = 32150
Ascii(1) = 1 ^ 4161267 mod 6246901 = 1
Ascii(5347086) = 5347086 ^ 4161267 mod 6246901 = 4814
Ascii(729) = 729 ^ 4161267 mod 6246901 = 9
Ascii(6226022) = 6226022 ^ 4161267 mod 6246901 = 1481
Ascii(68921) = 68921 ^ 4161267 mod 6246901 = 41
Ascii(2406104) = 2406104 ^ 4161267 mod 6246901 = 134

-> Complete message in ASCII
177197214199201208208211132200201132183197208201215158132134171201216216205210203132172205214201200
132197216132216204201132172197199207201214132168211206211132174211198132170197205214132140174197210
132149154144132150148149148141134
-> Original Message
Marcello de Sales: "Breaking the rules"
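
If you want to redo the decryption above yourself, here is a minimal Java sketch of mine (not the original Cryptonline code). It applies Ascii(x) = x ^ D mod N to the first blocks with BigInteger and maps the recovered digits back to characters; the 3-digits-minus-100 mapping is inferred from the output above (‘M’, ASCII 77, shows up as 177):

import java.math.BigInteger;

public class DecryptSketch {
    public static void main(String[] args) {
        BigInteger n = BigInteger.valueOf(6246901);    // public modulus N
        BigInteger d = BigInteger.valueOf(4161267);    // private exponent D

        // The first blocks of the encrypted message above
        long[] blocks = {1142022, 35601, 1, 64, 1, 6020454, 8000, 5985272, 1331, 2299968, 1753099, 1873700};

        StringBuilder digits = new StringBuilder();
        for (long block : blocks) {
            // Ascii(x) = x ^ D mod N, exactly as in the trace above
            digits.append(BigInteger.valueOf(block).modPow(d, n));
        }

        // Read the digit string in groups of 3 and subtract the 100 offset to get ASCII codes
        StringBuilder text = new StringBuilder();
        for (int i = 0; i + 3 <= digits.length(); i += 3) {
            text.append((char) (Integer.parseInt(digits.substring(i, i + 3)) - 100));
        }
        System.out.println(text);   // prints "Marcello de", the beginning of the hidden message
    }
}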

This is a simple application of an inverse function; I first implemented it during my undergrad in Computer Science. I’m looking for a job, and this is my “hacking” way of doing it :P… Marcello de Sales: “Breaking the rules”. Here’s how I calculated the encrypted message.

-> Original Message
Marcello de Sales: "Breaking the rules"

-> Setting the receiver's public key
(N , E) = (6246901 , 3)

-> Transforming the message to ASCII code
177197214199201208208211132200201132183197208201215158132134171201216216205210203132172205214201200
132197216132216204201132172197199207201214132168211206211132174211198132170197205214132140174197210
132149154144132150148149148141134

-> Configuring randomly selected blocks from the ASCII message
Block(x) = x ^ E mod N

Block(1771) = 1771 ^ 3 mod 6246901 = 1142022
Block(972) = 972 ^ 3 mod 6246901 = 35601
Block(1) = 1 ^ 3 mod 6246901 = 1
Block(4) = 4 ^ 3 mod 6246901 = 64
Block(1) = 1 ^ 3 mod 6246901 = 1
Block(99201) = 99201 ^ 3 mod 6246901 = 6020454
Block(20) = 20 ^ 3 mod 6246901 = 8000
Block(82082) = 82082 ^ 3 mod 6246901 = 5985272
Block(11) = 11 ^ 3 mod 6246901 = 1331
Block(132) = 132 ^ 3 mod 6246901 = 2299968
Block(200) = 200 ^ 3 mod 6246901 = 1753099
Block(201) = 201 ^ 3 mod 6246901 = 1873700
Block(13218) = 13218 ^ 3 mod 6246901 = 4200047
Block(319) = 319 ^ 3 mod 6246901 = 1227254
Block(7) = 7 ^ 3 mod 6246901 = 343
Block(20820) = 20820 ^ 3 mod 6246901 = 3740201
Block(1215) = 1215 ^ 3 mod 6246901 = 752788
Block(1581) = 1581 ^ 3 mod 6246901 = 3764509
Block(32) = 32 ^ 3 mod 6246901 = 32768
Block(134) = 134 ^ 3 mod 6246901 = 2406104
Block(17120) = 17120 ^ 3 mod 6246901 = 2874958
Block(121) = 121 ^ 3 mod 6246901 = 1771561
Block(62) = 62 ^ 3 mod 6246901 = 238328
Block(1) = 1 ^ 3 mod 6246901 = 1
Block(6205) = 6205 ^ 3 mod 6246901 = 4830182
Block(210) = 210 ^ 3 mod 6246901 = 3014099
Block(20) = 20 ^ 3 mod 6246901 = 8000
Block(3132) = 3132 ^ 3 mod 6246901 = 856850
Block(1) = 1 ^ 3 mod 6246901 = 1
Block(72) = 72 ^ 3 mod 6246901 = 373248
Block(2052) = 2052 ^ 3 mod 6246901 = 900525
Block(1420) = 1420 ^ 3 mod 6246901 = 2207342
Block(12001) = 12001 ^ 3 mod 6246901 = 1985915
Block(3) = 3 ^ 3 mod 6246901 = 27
Block(21) = 21 ^ 3 mod 6246901 = 9261
Block(97) = 97 ^ 3 mod 6246901 = 912673
Block(21) = 21 ^ 3 mod 6246901 = 9261
Block(61) = 61 ^ 3 mod 6246901 = 226981
Block(3221) = 3221 ^ 3 mod 6246901 = 2689412
Block(620) = 620 ^ 3 mod 6246901 = 945762
Block(4) = 4 ^ 3 mod 6246901 = 64
Block(2011) = 2011 ^ 3 mod 6246901 = 5509130
Block(32172) = 32172 ^ 3 mod 6246901 = 4680829
Block(1971) = 1971 ^ 3 mod 6246901 = 4567886
Block(9) = 9 ^ 3 mod 6246901 = 729
Block(9207) = 9207 ^ 3 mod 6246901 = 3969407
Block(201) = 201 ^ 3 mod 6246901 = 1873700
Block(21) = 21 ^ 3 mod 6246901 = 9261
Block(41321) = 41321 ^ 3 mod 6246901 = 3978260
Block(682) = 682 ^ 3 mod 6246901 = 4869518
Block(11) = 11 ^ 3 mod 6246901 = 1331
Block(20) = 20 ^ 3 mod 6246901 = 8000
Block(6211) = 6211 ^ 3 mod 6246901 = 5130977
Block(13) = 13 ^ 3 mod 6246901 = 2197
Block(2) = 2 ^ 3 mod 6246901 = 8
Block(17) = 17 ^ 3 mod 6246901 = 4913
Block(42) = 42 ^ 3 mod 6246901 = 74088
Block(111) = 111 ^ 3 mod 6246901 = 1367631
Block(98) = 98 ^ 3 mod 6246901 = 941192
Block(132170) = 132170 ^ 3 mod 6246901 = 3957527
Block(19) = 19 ^ 3 mod 6246901 = 6859
Block(72052) = 72052 ^ 3 mod 6246901 = 2779352
Block(1) = 1 ^ 3 mod 6246901 = 1
Block(41) = 41 ^ 3 mod 6246901 = 68921
Block(32140) = 32140 ^ 3 mod 6246901 = 3129271
Block(17) = 17 ^ 3 mod 6246901 = 4913
Block(4) = 4 ^ 3 mod 6246901 = 64
Block(1972) = 1972 ^ 3 mod 6246901 = 3734521
Block(1013) = 1013 ^ 3 mod 6246901 = 2523631
Block(214) = 214 ^ 3 mod 6246901 = 3553443
Block(9154) = 9154 ^ 3 mod 6246901 = 2763573
Block(1441) = 1441 ^ 3 mod 6246901 = 6190443
Block(32150) = 32150 ^ 3 mod 6246901 = 5037509
Block(1) = 1 ^ 3 mod 6246901 = 1
Block(4814) = 4814 ^ 3 mod 6246901 = 5347086
Block(9) = 9 ^ 3 mod 6246901 = 729
Block(1481) = 1481 ^ 3 mod 6246901 = 6226022
Block(41) = 41 ^ 3 mod 6246901 = 68921
Block(134) = 134 ^ 3 mod 6246901 = 2406104

-> Encrypted Message
1142022-35601-1-64-1-6020454-8000-5985272-1331-2299968-1753099-1873700-4200047-1227254-343-3740201-752788-3764509-32768-2406104-2874958-1771561-238328-1-4830182-3014099-8000-856850-1-373248-900525-2207342-1985915-27-9261-912673-9261-226981-2689412-945762-64-5509130-4680829-4567886-729-3969407-1873700-9261-3978260-4869518-1331-8000-5130977-2197-8-4913-74088-1367631-941192-3957527-6859-2779352-1-68921-3129271-4913-64-3734521-2523631-3553443-2763573-6190443-5037509-1-5347086-729-6226022-68921-2406104

RSA Algorithm Explained: a step-by-step process

The art of information hiding, or Cryptography, is one of my favorite applications of Mathematics in computer science. Information hiding dates from thousands of years ago, in “non-standard hieroglyphs carved into monuments from Egypt‘s Old Kingdom (ca 4500+ years ago)” [History of Cryptography]. It has different real-world applications, ranging from credit-card security to online transactions on protected websites (those with the security lock). There are different approaches to information hiding, and the one used in the applications cited is the one first described by Rivest, Shamir and Adleman: RSA. A great summary of public key cryptography is given by Dr. Duke O’Connor in his website post “RSA is 30, and counting”, about the 30th anniversary of the paper that gave origin to RSA:

R. Rivest, A. Shamir, L. Adleman.  A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Communications of the ACM, Vol. 21 (2), pp.120–126. 1978.

The way RSA is used on the Internet is summarized in the following video:

If you want to teach your children :), take a look at this other video.

RSA Algorithm on a T-Shirt

Dr. O’Connor summarizes two versions of the Mathematical foundation of the algorithm: one with the first flaw and the corrected version. The “fixed” version of the algorithm is the one I studied in school in Abstract Algebra, where the students were split into groups of three to implement their own crypto systems and exchange keys and information in order to be tested. Of course, different assumptions were made for the exercise. This post is just the output of the application I had developed in Java, which I called Cryptonline. I decided to use my free time to develop a Groovy on Grails online forum application where users can only see the messages if they log in and use their public/private keys. The intention of the application is purely educational. Read Dr. O’Connor’s post for background information before moving on (in case you don’t have the background on it :D). As the T-Shirt says, it is just an algorithm.

Suppose you are transferring an important message over an insecure communication channel, such as mailing a letter to someone, or simply giving someone your bank account number for a transfer.

Bank Transfer between you and the bank.

As you can see, a reliable communication channel is one where your message is encrypted and no one between the sender and the receiver can read it, because the message is in an arrangement that only you and the bank are capable of reading. The only way to do so is to create keys that can lock and unlock the hidden message. The RSA algorithm is based on the idea of Private and Public keys, where both are used to encode and decode the message. Suppose the string “John Smith” is the text to be encoded. Encryption is the process of hiding the information in an encrypted version, and Decryption is the process of decoding the encrypted text back to its original format.

Private Key Use

Of course, the idea is to have a key that you can distribute, so that only the people you want to read the message can read it.

Private and Public Keys

This is the basic idea of the algorithm. There are different variations of it depending on the application. Whenever you visit a website with the lock enabled, it means that any information exchanged between your computer and the visited website’s computer is encrypted.

By now you should have the background on this subject (considering you have gone to Dr. O’Connor’s website and read the Mathematical background). The listing below is the actual output from Cryptonline, the application I developed back in 2000 at school. It is divided into 3 steps: Key Creation, Text Encryption and Text Decryption.

Public Key Creation

Let’s start from the Public Key.

-> Configuring random prime numbers
        P = 1277
        Q = 2311
-> Calculating public keys
          N = P * Q; N = 2951147
-> FI = (P-1) * (Q-1); FI = 2947560

-> Calculating (E)
           While MCD(n >= 2 , 2947560) != 1
           MCD(2 , 2947560) = 2
           MCD(3 , 2947560) = 3
           MCD(4 , 2947560) = 4
           MCD(5 , 2947560) = 5
           MCD(6 , 2947560) = 6
           MCD(7 , 2947560) = 7
           MCD(8 , 2947560) = 8
           MCD(9 , 2947560) = 3
           MCD(10 , 2947560) = 10
           MCD(11 , 2947560) = 11
           MCD(12 , 2947560) = 12
           MCD(13 , 2947560) = 1 Correct!
          E = 13

       Public Key (N,E) = (2951147 , 13)
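
A minimal sketch of this key-creation step (my own illustration, not the Cryptonline source): multiply the two primes, compute FI, and scan for the smallest E with gcd(E, FI) = 1, which reproduces E = 13 for these primes:

import java.math.BigInteger;

public class PublicKeySketch {
    public static void main(String[] args) {
        BigInteger p = BigInteger.valueOf(1277);
        BigInteger q = BigInteger.valueOf(2311);

        BigInteger n = p.multiply(q);                                // N = P * Q
        BigInteger fi = p.subtract(BigInteger.ONE)
                         .multiply(q.subtract(BigInteger.ONE));      // FI = (P-1) * (Q-1)

        // Scan e = 2, 3, 4, ... until MCD/GCD(e, FI) == 1, as in the trace above
        BigInteger e = BigInteger.valueOf(2);
        while (!fi.gcd(e).equals(BigInteger.ONE)) {
            e = e.add(BigInteger.ONE);
        }
        System.out.println("Public Key (N,E) = (" + n + " , " + e + ")");   // (2951147 , 13)
    }
}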

Private Key Creation

The calculation of the Private Key is based on the extended Euclidean algorithm applied to the Phi number (FI) and E: it finds D such that E * D ≡ 1 (mod FI).

-> Calculating private keys
         Initializing (p1,p2,p3) = (1, 0 , FI(n))
         Initializing (q1,q2,q3) = (0, 1 ,  E  )
         While q3 != 0
             quoc = p3 / q3
             (t1,t2,t3) = (p1,p2,p3) - quoc * (q1,q2,q3)
             After, arrange the values:
             (p1,p2,p3) = (q1,q2,q3)
             (q1,q2,q3) = (t1,t2,t3)

           (13 <> 0) , then:
             quoc = 2947560 / 13 = 226735
             (t1,t2,t3) = (1,0,2947560) - 226735 * (0,1,13) = (1,-226735,5)
             (p1,p2,p3) = (0,1,13)
             (q1,q2,q3) = (1,-226735,5)

           (5 <> 0) , then:
             quoc = 13 / 5 = 2
             (t1,t2,t3) = (0,1,13) - 2 * (1,-226735,5) = (-2,453471,3)
             (p1,p2,p3) = (1,-226735,5)
             (q1,q2,q3) = (-2,453471,3)

           (3 <> 0) , then:
             quoc = 5 / 3 = 1
             (t1,t2,t3) = (1,-226735,5) - 1 * (-2,453471,3) = (3,-680206,2)
             (p1,p2,p3) = (-2,453471,3)
             (q1,q2,q3) = (3,-680206,2)

           (2 <> 0) , then:
             quoc = 3 / 2 = 1
             (t1,t2,t3) = (-2,453471,3) - 1 * (3,-680206,2) = (-5,1133677,1)
             (p1,p2,p3) = (3,-680206,2)
             (q1,q2,q3) = (-5,1133677,1)

           (1 <> 0) , then:
             quoc = 2 / 1 = 2
             (t1,t2,t3) = (3,-680206,2) - 2 * (-5,1133677,1) = (13,-2947560,0)
             (p1,p2,p3) = (-5,1133677,1)
             (q1,q2,q3) = (13,-2947560,0)

         q3 is zero (0). Now check the value of p2. If it is negative, add FI to it
         (this represents a negative element of Z(n) by its positive equivalent).

         p2 = 1133677;
         Since p2 is positive, we have: D = p2; D = 1133677

      Private Key (N,D) = (2951147, 1133677);
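
The iteration above is the extended Euclidean algorithm. Here is a small Java sketch of mine that mirrors the (p1,p2,p3)/(q1,q2,q3) bookkeeping and reproduces D for both sets of keys in these posts; in practice, BigInteger.modInverse gives the same answer in one call:

public class PrivateKeySketch {

    /** Extended Euclidean algorithm with the (p,q,t) bookkeeping used above:
     *  returns D such that (E * D) mod FI == 1. */
    static long modInverse(long e, long fi) {
        long p1 = 1, p2 = 0, p3 = fi;      // (p1,p2,p3) = (1, 0, FI)
        long q1 = 0, q2 = 1, q3 = e;       // (q1,q2,q3) = (0, 1, E)
        while (q3 != 0) {
            long quoc = p3 / q3;
            long t1 = p1 - quoc * q1, t2 = p2 - quoc * q2, t3 = p3 - quoc * q3;
            p1 = q1; p2 = q2; p3 = q3;     // (p1,p2,p3) = (q1,q2,q3)
            q1 = t1; q2 = t2; q3 = t3;     // (q1,q2,q3) = (t1,t2,t3)
        }
        return p2 < 0 ? p2 + fi : p2;      // a negative p2 is represented by p2 + FI
    }

    public static void main(String[] args) {
        System.out.println(modInverse(13, 2947560));   // prints 1133677, as above
        System.out.println(modInverse(3, 6241900));    // prints 4161267, the first post's D
    }
}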

Using the Private and Public Keys

To summarize, the program outputs the keys used throughout the application.

#### All RSA Information ####

Public Key (N, E) = (2951147, 13)
Private Key (N, D) = (2951147, 1133677)

Of course you cannot give the original prime numbers to anybody, since they are the factors that created the public and private keys. So, the only key you can give away to other people is the Public one. Now, in order to illustrate Encryption and Decryption, my encryption machine uses an ASCII-based character mapping for the Mathematical calculations. Consider my blog’s title as the input: “Marcello de Sales: because solving problems is addicting”. Each character of the string is transformed into its ASCII code plus an offset (in the output below each code appears shifted by 100; ‘M’ is 77 and shows up as 177).

Encryption Process

-> Original Message
Marcello de Sales: because solving problems is addicting

-> Setting the receiver's public key
(N , E) = (2951147 , 13)

-> Transforming the message to ASCII code
177197214199201208208211132200201132183197208201215158132198201199197217215201132215211208218205210203132212214211198208201209215132205215132197200200205199216205210203

-> Configuring randomly selected blocks from the ASCII message
Block(x) = x ^ E mod N

Block(17) = 17 ^ 13 mod 2951147 = 2920887
Block(71) = 71 ^ 13 mod 2951147 = 1483408
Block(972) = 972 ^ 13 mod 2951147 = 363316
Block(1419) = 1419 ^ 13 mod 2951147 = 1419505
Block(920) = 920 ^ 13 mod 2951147 = 213548
Block(1) = 1 ^ 13 mod 2951147 = 1
Block(20) = 20 ^ 13 mod 2951147 = 93651
Block(8) = 8 ^ 13 mod 2951147 = 1394993
Block(2082) = 2082 ^ 13 mod 2951147 = 2878680
Block(1113) = 1113 ^ 13 mod 2951147 = 770001
Block(2200) = 2200 ^ 13 mod 2951147 = 2301917
Block(20113) = 20113 ^ 13 mod 2951147 = 787047
Block(2183) = 2183 ^ 13 mod 2951147 = 424239
Block(19) = 19 ^ 13 mod 2951147 = 1557862
Block(7) = 7 ^ 13 mod 2951147 = 2854397
Block(208) = 208 ^ 13 mod 2951147 = 375871
Block(2012) = 2012 ^ 13 mod 2951147 = 491468
Block(151) = 151 ^ 13 mod 2951147 = 2348470
Block(58) = 58 ^ 13 mod 2951147 = 966721
Block(13219) = 13219 ^ 13 mod 2951147 = 2596853
Block(820) = 820 ^ 13 mod 2951147 = 1336058
Block(11991) = 11991 ^ 13 mod 2951147 = 2624815
Block(97) = 97 ^ 13 mod 2951147 = 1340264
Block(21721) = 21721 ^ 13 mod 2951147 = 1760166
Block(52011) = 52011 ^ 13 mod 2951147 = 1685895
Block(32215) = 32215 ^ 13 mod 2951147 = 1202590
Block(21) = 21 ^ 13 mod 2951147 = 2752293
Block(1208) = 1208 ^ 13 mod 2951147 = 1414540
Block(21820) = 21820 ^ 13 mod 2951147 = 1733373
Block(5) = 5 ^ 13 mod 2951147 = 1879414
Block(21020) = 21020 ^ 13 mod 2951147 = 310870
Block(3132) = 3132 ^ 13 mod 2951147 = 519822
Block(212) = 212 ^ 13 mod 2951147 = 1315135
Block(2142) = 2142 ^ 13 mod 2951147 = 2430603
Block(1119) = 1119 ^ 13 mod 2951147 = 748920
Block(8208) = 8208 ^ 13 mod 2951147 = 2808982
Block(20) = 20 ^ 13 mod 2951147 = 93651
Block(1209) = 1209 ^ 13 mod 2951147 = 906866
Block(215) = 215 ^ 13 mod 2951147 = 396673
Block(13) = 13 ^ 13 mod 2951147 = 2564672
Block(2205) = 2205 ^ 13 mod 2951147 = 337248
Block(2) = 2 ^ 13 mod 2951147 = 8192
Block(151) = 151 ^ 13 mod 2951147 = 2348470
Block(32) = 32 ^ 13 mod 2951147 = 1191513
Block(197200) = 197200 ^ 13 mod 2951147 = 2266852
Block(200) = 200 ^ 13 mod 2951147 = 104075
Block(20) = 20 ^ 13 mod 2951147 = 93651
Block(519) = 519 ^ 13 mod 2951147 = 1459225
Block(9) = 9 ^ 13 mod 2951147 = 1601171
Block(21620) = 21620 ^ 13 mod 2951147 = 2477239
Block(5210) = 5210 ^ 13 mod 2951147 = 1598948
Block(203) = 203 ^ 13 mod 2951147 = 644537

-> Encrypted Message
2920887-1483408-363316-1419505-213548-1-93651-1394993-2878680-770001-2301917-787047-424239-1557862-2854397-375871-491468-2348470-966721-2596853-1336058-2624815-1340264-1760166-1685895-1202590-2752293-1414540-1733373-1879414-310870-519822-1315135-2430603-748920-2808982-93651-906866-396673-2564672-337248-8192-2348470-1191513-2266852-104075-93651-1459225-1601171-2477239-1598948-644537
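
To reproduce a few of the Block() lines above (a sketch of mine doing the same modular exponentiation, not the original program), BigInteger.modPow does the work; each block x is raised to E modulo N:

import java.math.BigInteger;

public class EncryptBlocksSketch {
    public static void main(String[] args) {
        BigInteger n = BigInteger.valueOf(2951147);    // public modulus N
        BigInteger e = BigInteger.valueOf(13);         // public exponent E

        long[] blocks = {17, 71, 972, 1419, 920};      // first blocks from the trace above
        for (long x : blocks) {
            // Block(x) = x ^ E mod N
            System.out.println("Block(" + x + ") = " + BigInteger.valueOf(x).modPow(e, n));
        }
        // Expected, from the trace: 2920887, 1483408, 363316, 1419505, 213548
    }
}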

Decryption Process

Upon receiving the encrypted message, the receiver uses the corresponding private key to decrypt it. The receiver has the same encryption machine, but only the holder of the private key can decipher the message. The program I developed decrypts the message as shown below.

-> Encrypted Message
2920887-1483408-363316-1419505-213548-1-93651-1394993-2878680-770001-2301917-787047-424239-1557862-2854397-375871-491468-2348470-966721-2596853-1336058-2624815-1340264-1760166-1685895-1202590-2752293-1414540-1733373-1879414-310870-519822-1315135-2430603-748920-2808982-93651-906866-396673-2564672-337248-8192-2348470-1191513-2266852-104075-93651-1459225-1601171-2477239-1598948-644537

-> Setting the private key
(N , D) = (2951147 , 1133677)

-> Decrypting each block
Ascii(x) = x ^ D mod N

Ascii(2920887) = 2920887 ^ 1133677 mod 2951147 = 17
Ascii(1483408) = 1483408 ^ 1133677 mod 2951147 = 71
Ascii(363316) = 363316 ^ 1133677 mod 2951147 = 972
Ascii(1419505) = 1419505 ^ 1133677 mod 2951147 = 1419
Ascii(213548) = 213548 ^ 1133677 mod 2951147 = 920
Ascii(1) = 1 ^ 1133677 mod 2951147 = 1
Ascii(93651) = 93651 ^ 1133677 mod 2951147 = 20
Ascii(1394993) = 1394993 ^ 1133677 mod 2951147 = 8
Ascii(2878680) = 2878680 ^ 1133677 mod 2951147 = 2082
Ascii(770001) = 770001 ^ 1133677 mod 2951147 = 1113
Ascii(2301917) = 2301917 ^ 1133677 mod 2951147 = 2200
Ascii(787047) = 787047 ^ 1133677 mod 2951147 = 20113
Ascii(424239) = 424239 ^ 1133677 mod 2951147 = 2183
Ascii(1557862) = 1557862 ^ 1133677 mod 2951147 = 19
Ascii(2854397) = 2854397 ^ 1133677 mod 2951147 = 7
Ascii(375871) = 375871 ^ 1133677 mod 2951147 = 208
Ascii(491468) = 491468 ^ 1133677 mod 2951147 = 2012
Ascii(2348470) = 2348470 ^ 1133677 mod 2951147 = 151
Ascii(966721) = 966721 ^ 1133677 mod 2951147 = 58
Ascii(2596853) = 2596853 ^ 1133677 mod 2951147 = 13219
Ascii(1336058) = 1336058 ^ 1133677 mod 2951147 = 820
Ascii(2624815) = 2624815 ^ 1133677 mod 2951147 = 11991
Ascii(1340264) = 1340264 ^ 1133677 mod 2951147 = 97
Ascii(1760166) = 1760166 ^ 1133677 mod 2951147 = 21721
Ascii(1685895) = 1685895 ^ 1133677 mod 2951147 = 52011
Ascii(1202590) = 1202590 ^ 1133677 mod 2951147 = 32215
Ascii(2752293) = 2752293 ^ 1133677 mod 2951147 = 21
Ascii(1414540) = 1414540 ^ 1133677 mod 2951147 = 1208
Ascii(1733373) = 1733373 ^ 1133677 mod 2951147 = 21820
Ascii(1879414) = 1879414 ^ 1133677 mod 2951147 = 5
Ascii(310870) = 310870 ^ 1133677 mod 2951147 = 21020
Ascii(519822) = 519822 ^ 1133677 mod 2951147 = 3132
Ascii(1315135) = 1315135 ^ 1133677 mod 2951147 = 212
Ascii(2430603) = 2430603 ^ 1133677 mod 2951147 = 2142
Ascii(748920) = 748920 ^ 1133677 mod 2951147 = 1119
Ascii(2808982) = 2808982 ^ 1133677 mod 2951147 = 8208
Ascii(93651) = 93651 ^ 1133677 mod 2951147 = 20
Ascii(906866) = 906866 ^ 1133677 mod 2951147 = 1209
Ascii(396673) = 396673 ^ 1133677 mod 2951147 = 215
Ascii(2564672) = 2564672 ^ 1133677 mod 2951147 = 13
Ascii(337248) = 337248 ^ 1133677 mod 2951147 = 2205
Ascii(8192) = 8192 ^ 1133677 mod 2951147 = 2
Ascii(2348470) = 2348470 ^ 1133677 mod 2951147 = 151
Ascii(1191513) = 1191513 ^ 1133677 mod 2951147 = 32
Ascii(2266852) = 2266852 ^ 1133677 mod 2951147 = 197200
Ascii(104075) = 104075 ^ 1133677 mod 2951147 = 200
Ascii(93651) = 93651 ^ 1133677 mod 2951147 = 20
Ascii(1459225) = 1459225 ^ 1133677 mod 2951147 = 519
Ascii(1601171) = 1601171 ^ 1133677 mod 2951147 = 9
Ascii(2477239) = 2477239 ^ 1133677 mod 2951147 = 21620
Ascii(1598948) = 1598948 ^ 1133677 mod 2951147 = 5210
Ascii(644537) = 644537 ^ 1133677 mod 2951147 = 203

-> Complete message in ASCII
177197214199201208208211132200201132183197208201215158132198201199197217215201132215211208218205210203132212214211198208201209215132205215132197200200205199216205210203

-> Original Message
Marcello de Sales: because solving problems is addicting

This is fun! Now you can talk about anything with your peers :D. There are practical applications in Internet chat rooms; I’ve used Adium on the Mac to chat through a secure communication channel over the Internet :). I keep wondering what the keys are while I’m communicating (when I’m bored…).

TF-IDF in Hadoop Part 3: Documents in Corpus and TFIDF Computation

The previous 2 parts of this post did a small part of the job of calculating the TF-IDF for each “term” in the different documents of the “corpus”. Since the implementation depends on concepts of Information Retrieval, especially for starters in the field, take a look at the book by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. The authors are professors at Stanford and Stuttgart Universities, the book has different exercises on the subject, and I found good material in Chapter 7 showing the basic concepts of the TF-IDF algorithm. As I mentioned before, I had first read about the term 7 years ago when I was writing my BS in Computer Science degree report (Portuguese) on the problem of user profile matching and clustering. Interestingly enough, I started learning Hadoop about 2 weeks ago and I was stoked about it, because my first contact with MapReduce was actually with mongoDB during my MS in Computer Science thesis, when I needed to generate a report over a data-centric collection gathered from an environmental Sensor Network, using mongoDB’s MapReduce API over distributed mongoDB Shards. All in all, it seems the time to put this into practice is now :).

Job 3: Documents in Corpus and TF-IDF computation

In order to summarize the idea of scoring words based on their occurrence in the corpus, I will use graphical and textual examples of the algorithm, to take advantage of this post and make it clear what the exercises really are. Ricky Ho has implemented TF-IDF using Apache PIG and documented exactly the steps of the algorithm in the very nice diagram shown below. Ricky also made a good summary of the terms “term frequency” and “inverse document frequency”, so check out his website if you need it.

The TF-IDF MapReduce Phases by Ricky Ho

This post implements the third round of the implementation, where the document count for each word is obtained from the size of the list that brings together all the documents in which the word occurs, taking the output of the previous phase as input. Let’s take a look at the data format from the previous job and see if it matches the description in the diagram (note that the terms are not sorted; I selected the term “therefore” at random):

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -cat 2-word-counts/part-r-00000 | less
...
therefore@all-shakespeare       652/738781
therefore@leornardo-davinci-all.txt     124/149612
therefore@the-outline-of-science-vol1.txt       36/70650
...

The output shows each term in each document and the number of its occurrences in that document, accompanied by the total number of terms in the document. So, the final Mapper and Reducer were defined as follows:

  • Map:
    • Input: ((term@document), n/N)
    • Re-arrange the mapper to have the word as the key, since we need to count the number of documents where it occurs
    • Output: (term, document=n/N)
  • Reducer:
    • D = total number of documents in corpus. This can be passed by the driver as a constant;
    • d = number of documents in corpus where the term appears. It is a counter over the reduced values for each term;
    • TFIDF = n/N * log(D/d);
    • Output: ((word@document), d/D, (n/N), TFIDF)

This post shows the implementation of the third step, which counts the number of documents in corpus in which each “term” appears and calculates the TF-IDF. I have made some assumptions for the final output to better present the results and, of course, to keep the scope of the example manageable.

  • The first challenge was to keep this step as the last one of the exercise. The number of documents in corpus could be computed by yet another MapReduce phase, as described in Cloudera’s documentation. However, the class slides mention that this last phase can be done without such an additional pass. I remembered from the classes that the JobConf is used to pass parameters to jobs. So, I used the FileSystem class to count the number of documents 😀 in the original input directory, since that is a constant number. I tried using the Context/Configuration classes of the Hadoop 0.20.1 API to pass that number to the last Reducer, but get(key) returned null. So, the only way I could pass the number of documents was through the JobName 🙂 I know, it is a dirty hack, but it works;
  • Since the number of documents in the corpus is small, the chance that a word appears in all documents is much higher than when applying the algorithm for web indexing over thousands or millions of documents. In that case the factor log(totalDocs/docsPerWord) “nulls” the result (log(3/3) = 0), so I simplified the calculation to tfIdf = tf whenever the term occurs in 100% of the documents in corpus (you could implement it as tfIdf = tf * 1 as well);
  • I decided to add more information to the output just for documentation purposes. It shows [word@document, documentsFrequency/documentsCorpus, wordFrequency/totalWordsInDocument, TF-IDF];
  • The final result is formatted with only a few decimal places, just for displaying the values in this exercise. In production, the full precision of these values matters.

Job3, Mapper

package index;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * WordsInCorpusTFIDFMapper implements the Job 3 specification for the TF-IDF algorithm
 * @author Marcello de Sales (marcello.desales@gmail.com)
 */
public class WordsInCorpusTFIDFMapper extends Mapper<LongWritable, Text, Text, Text> {

    public WordsInCorpusTFIDFMapper() {
    }

    /**
     * @param key is the byte offset of the current line in the file;
     * @param value is the line from the file
     * @param context is used to emit the (key, value) pairs and to access job information
     *
     *     PRE-CONDITION: marcello@book.txt  \t  3/1500
     *     POST-CONDITION: marcello, book.txt=3/1500
     */
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] wordAndCounters = value.toString().split("\t");
        String[] wordAndDoc = wordAndCounters[0].split("@");                 // wordAndCounters[1] holds the fraction, e.g. 3/1500
        context.write(new Text(wordAndDoc[0]), new Text(wordAndDoc[1] + "=" + wordAndCounters[1]));
    }
}
Job3, Reducer
package index;

import java.io.IOException;
import java.text.DecimalFormat;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * WordsInCorpusTFIDFReducer calculates, for each key (term), the number of documents in corpus in which it occurs, and the TF-IDF.
 * The total number of documents D is acquired from the job name 🙂 It is a dirty hack, but the only way I could communicate the number
 * from the driver.
 * @author Marcello de Sales (marcello.desales@gmail.com)
 */
public class WordsInCorpusTFIDFReducer extends Reducer<Text, Text, Text, Text> {

    private static final DecimalFormat DF = new DecimalFormat("###.########");

    public WordsInCorpusTFIDFReducer() {
    }

    /**
     * @param key is the key of the mapper
     * @param values are all the values aggregated during the mapping phase
     * @param context contains the context of the job run
     *
     *             PRECONDITION: receive a list of <word, ["doc1=n1/N1", "doc2=n2/N2"]>
     *             POSTCONDITION: <"word@doc1,  [d/D, n/N, TF-IDF]">
     */
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // get the number of documents indirectly from the file-system (stored in the job name on purpose)
        int numberOfDocumentsInCorpus = Integer.parseInt(context.getJobName());
        // number of documents in corpus in which this word appears
        int numberOfDocumentsInCorpusWhereKeyAppears = 0;
        Map<String, String> tempFrequencies = new HashMap<String, String>();
        for (Text val : values) {
            String[] documentAndFrequencies = val.toString().split("=");
            numberOfDocumentsInCorpusWhereKeyAppears++;
            tempFrequencies.put(documentAndFrequencies[0], documentAndFrequencies[1]);
        }
        for (String document : tempFrequencies.keySet()) {
            String[] wordFrequenceAndTotalWords = tempFrequencies.get(document).split("/");

            //Term frequency: quotient of the number of occurrences of the term in the document
            //and the total number of terms in the document
            double tf = Double.valueOf(wordFrequenceAndTotalWords[0])
                    / Double.valueOf(wordFrequenceAndTotalWords[1]);

            //Inverse document frequency: quotient of the number of docs in corpus and the number of docs where the term appears
            double idf = (double) numberOfDocumentsInCorpus / (double) numberOfDocumentsInCorpusWhereKeyAppears;

            //given that log10(1) = 0, when the term appears in every document just keep the term frequency
            double tfIdf = numberOfDocumentsInCorpus == numberOfDocumentsInCorpusWhereKeyAppears ?
                    tf : tf * Math.log10(idf);

            context.write(new Text(key + "@" + document), new Text("[" + numberOfDocumentsInCorpusWhereKeyAppears + "/"
                    + numberOfDocumentsInCorpus + " , " + wordFrequenceAndTotalWords[0] + "/"
                    + wordFrequenceAndTotalWords[1] + " , " + DF.format(tfIdf) + "]"));
        }
    }
}

I have implemented the TestCases for both the Mapper and Reducer classes, but to keep this post short I will skip those. Let’s take a look at the driver, since it captures the total number of documents directly from the filesystem using the built-in Hadoop API. There is definitely no need for another MapReduce phase just for that; as described in Cloudera’s training, the fewer phases the better, since we save resource utilization :). Anyway, let’s go to the Driver implementation.

Job3, Driver
package index;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * WordFrequenceInDocument Creates the index of the words in documents,
 * mapping each of them to their frequency.
 * @author Marcello de Sales (marcello.desales@gmail.com)
 * @version "Hadoop 0.20.1"
 */
public class WordsInCorpusTFIDF extends Configured implements Tool {

    // where to put the data in hdfs when we're done
    private static final String OUTPUT_PATH = "3-tf-idf";

    // where to read the data from.
    private static final String INPUT_PATH = "2-word-counts";

    public int run(String[] args) throws Exception {

        Configuration conf = getConf();
        Job job = new Job(conf, "Word in Corpus, TF-IDF");

        job.setJarByClass(WordsInCorpusTFIDF.class);
        job.setMapperClass(WordsInCorpusTFIDFMapper.class);
        job.setReducerClass(WordsInCorpusTFIDFReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));

        //Getting the number of documents from the original input directory.
        Path inputPath = new Path("input");
        FileSystem fs = inputPath.getFileSystem(conf);
        FileStatus[] stat = fs.listStatus(inputPath);

        //Dirty hack to pass the total number of documents as the job name.
        //The call to context.getConfiguration.get("docsInCorpus") returns null when I tried to pass
        //conf.set("docsInCorpus", String.valueOf(stat.length)) Or even
        //conf.setInt("docsInCorpus", stat.length)
        job.setJobName(String.valueOf(stat.length));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new WordsInCorpusTFIDF(), args);
        System.exit(res);
    }
}

Continuing with the implementation, the only thing left to do is to compile and run the final driver. Note that the input directory is the one containing the partial counts from job 2, that is, “2-word-counts”. The output is the directory reserved for step 3, “3-tf-idf”. Again, the only way I could send the total number of documents in corpus was through the jobName. The Hadoop 0.20.1 API did not pass the configuration values along for me: as documented in the code, I tried using the context reference in the reducer class, but the call context.getConfiguration().get("docsInCorpus") only returned null. I gave up looking for an alternative, and the JobName was the only way I could find :).

I skipped the test step, compiled everything, and ran the driver as follows:

training@training-vm:~/git/exercises/shakespeare$ ant
Buildfile: build.xml

compile:
    [javac] Compiling 11 source files to /home/training/git/exercises/shakespeare/bin

jar:
      [jar] Building jar: /home/training/git/exercises/shakespeare/indexer.jar

BUILD SUCCESSFUL

Then, finally running the calculator of words:

training@training-vm:~/git/exercises/shakespeare$ hadoop jar indexer.jar index.WordsInCorpusTFIDF
10/01/09 21:41:40 INFO input.FileInputFormat: Total input paths to process : 1
10/01/09 21:41:41 INFO mapred.JobClient: Running job: job_200912301017_0115
10/01/09 21:41:42 INFO mapred.JobClient:  map 0% reduce 0%
10/01/09 21:41:51 INFO mapred.JobClient:  map 100% reduce 0%
10/01/09 21:42:00 INFO mapred.JobClient:  map 100% reduce 100%
10/01/09 21:42:02 INFO mapred.JobClient: Job complete: job_200912301017_0115
10/01/09 21:42:02 INFO mapred.JobClient: Counters: 17
10/01/09 21:42:02 INFO mapred.JobClient:   Job Counters
10/01/09 21:42:02 INFO mapred.JobClient:     Launched reduce tasks=1
10/01/09 21:42:02 INFO mapred.JobClient:     Launched map tasks=1
10/01/09 21:42:02 INFO mapred.JobClient:     Data-local map tasks=1
10/01/09 21:42:02 INFO mapred.JobClient:   FileSystemCounters
10/01/09 21:42:02 INFO mapred.JobClient:     FILE_BYTES_READ=2017995
10/01/09 21:42:02 INFO mapred.JobClient:     HDFS_BYTES_READ=1920431
10/01/09 21:42:02 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=4036022
10/01/09 21:42:02 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2943390
10/01/09 21:42:02 INFO mapred.JobClient:   Map-Reduce Framework
10/01/09 21:42:02 INFO mapred.JobClient:     Reduce input groups=0
10/01/09 21:42:02 INFO mapred.JobClient:     Combine output records=0
10/01/09 21:42:02 INFO mapred.JobClient:     Map input records=48779
10/01/09 21:42:02 INFO mapred.JobClient:     Reduce shuffle bytes=2017995
10/01/09 21:42:02 INFO mapred.JobClient:     Reduce output records=0
10/01/09 21:42:02 INFO mapred.JobClient:     Spilled Records=97558
10/01/09 21:42:02 INFO mapred.JobClient:     Map output bytes=1920431
10/01/09 21:42:02 INFO mapred.JobClient:     Combine input records=0
10/01/09 21:42:02 INFO mapred.JobClient:     Map output records=48779
10/01/09 21:42:02 INFO mapred.JobClient:     Reduce input records=48779

Note that the final number of records is the same as in the previous job step, so everything went as expected. Taking a look at the files in the output directory:

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -ls 3-tf-idf
Found 2 items
drwxr-xr-x   - training supergroup          0 2010-01-09 21:41 /user/training/3-tf-idf/_logs
-rw-r--r--   1 training supergroup    2943390 2010-01-09 21:41 /user/training/3-tf-idf/part-r-00000

I decided to take a look at the same word I mentioned above, “therefore”. Here’s the result; this time the output was automatically sorted by Hadoop.

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -cat 3-tf-idf/part-r-00000 | less
...
abook@leornardo-davinci-all.txt [1/3 , 3/149612 , 0.00000957]
aboriginal@the-outline-of-science-vol1.txt      [1/3 , 1/70650 , 0.00000675]
abortive@all-shakespeare        [2/3 , 4/738781 , 0.00000095]
...
therefore@all-shakespeare       [3/3 , 652/738781 , 0.00088253]
therefore@the-outline-of-science-vol1.txt       [3/3 , 36/70650 , 0.00050955]
therefore@leornardo-davinci-all.txt     [3/3 , 124/149612 , 0.00082881]
...

Taking a look at chapter 7 of the book, the following conclusions can be drawn from the output presented (a small numeric check follows the list):

  • The term “therefore” is most relevant in the document “all-shakespeare”, since its occurrence there is proportionally more likely than in the other documents;
  • Terms that do not appear in all documents, such as “abook”, “aboriginal” and “abortive”, have very small relevance for the given corpus of documents.
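
As a quick check of the reducer’s arithmetic (my own sketch of the same formula, not part of the job), the values above can be reproduced directly; “abortive@all-shakespeare” appears in 2 of the 3 documents, while “therefore” appears in all 3:

public class TfIdfCheck {
    public static void main(String[] args) {
        // abortive@all-shakespeare: tf = 4/738781, appears in d = 2 of D = 3 documents
        double tfAbortive = 4.0 / 738781.0;
        System.out.println(tfAbortive * Math.log10(3.0 / 2.0));   // ~0.00000095

        // therefore@all-shakespeare: d = D = 3, so tfIdf = tf (the log10(3/3) = 0 case)
        System.out.println(652.0 / 738781.0);                     // ~0.00088253
    }
}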

What’s Next?

I had so much fun with my first contact with Hadoop that I am going to the Cloudera Training here in the Bay Area in about 10 days. Although the training covers the Hadoop 0.18 API, I decided to use the Hadoop 0.20.1 API because I just wanted to try a “cleaner” API.

As for next steps from this exercise, one could do the categorized classification of the terms per document in a data-centric way (document -> term -> tf-idf), or whatever your need is. Time to go play with Apache PIG and HBase.

TF-IDF in Hadoop Part 2: Word Counts For Docs

January 6, 2010

The TF-IDF algorithm can be implemented in different ways. The Cloudera Hadoop training defines the implementation as a sequence of steps, each one carried out by a different Job. I decided to take the approach of persisting the intermediate data before executing the subsequent steps. This part documents the implementation of Job 2, as the second part of my experiments with Hadoop.

Part 1’s goal is to generate the word frequency for each of the documents in the input path provided, persisted at the “1-word-freq” output directory, as shown below:

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -cat 1-word-freq/part-r-00000 | less
…
therefore@all-shakespeare 652
therefore@leornardo-davinci-all.txt 124
therefore@the-outline-of-science-vol1.txt 36

The definition of Job 2 will take into account the structure of this data in the creation of the Mapper and Reducer classes.

Job 2: Word Counts for Docs

The goal of this job is to count the total number of words in each document, so that each word’s count can be compared with that total. I tried to implement a custom InputFormat, but I couldn’t find examples related to it. As I understand it, the values could be read in the same format they were saved in (Text, IntWritable), but I will keep it simple and use the same default InputFormat as before. Following the same structure as in part one, the specification of the Map and Reduce phases is as follows:

  • Map:
    • Input: ((word@document), n)
    • Re-arrange the mapper to have the key based on each document
    • Output: (document, word=n)
  • Reducer
    • N = totalWordsInDoc = sum of n over all [word=n] values for each document
    • Output: ((word@document), (n/N))

Note that the input format of this mapper is the output format of the previous job. The delimiters “@” and “/” were arbitrarily picked to better represent the intent of the data, so feel free to pick anything you prefer. The reducer just needs to sum the total number of words in each document and pass this value on to the next step, along with each word’s individual count, as the data needed for the next step.

I have learned that the Iterable of values received by the Reducer class can’t be iterated more than once. The loop simply did not execute when a second foreach was attempted, so I implemented it using a temporary map.
Job2, Mapper
package index;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * WordCountsForDocsMapper re-arranges each (word@document, n) entry into a (document, word=n) pair.
 * Hadoop 0.20.1 API
 * @author Marcello de Sales (marcello.desales@gmail.com)
 */
public class WordCountsForDocsMapper extends Mapper<LongWritable, Text, Text, Text> {

    public WordCountsForDocsMapper() {
    }

    /**
     * @param key is the byte offset of the current line in the file;
     * @param value is the line from the file
     * @param context is used to emit the (key, value) pairs
     *
     *     PRE-CONDITION: aa@leornardo-davinci-all.txt    1
     *                    aaron@all-shakespeare   98
     *                    ab@leornardo-davinci-all.txt    3
     *
     *     POST-CONDITION: Output <"all-shakespeare", "aaron=98"> pairs
     */
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] wordAndDocCounter = value.toString().split("\t");
        String[] wordAndDoc = wordAndDocCounter[0].split("@");
        context.write(new Text(wordAndDoc[1]), new Text(wordAndDoc[0] + "=" + wordAndDocCounter[1]));
    }
}
Job2, Mapper Unit Test
I have simplified the unit test: it just verifies that the Mapper generates the format needed by the Reducer.
package index;

import static org.apache.hadoop.mrunit.testutil.ExtendedAssert.assertListEquals;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import junit.framework.TestCase;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.types.Pair;
import org.junit.Before;
import org.junit.Test;

/**
 * Test cases for the word count mapper.
 * Hadoop 0.20.1 API
 * @author Marcello de Sales (marcello.desales@gmail.com)
 */
public class WordCountsForDocsMapperTest extends TestCase {

    private Mapper<LongWritable, Text, Text, Text> mapper;
    private MapDriver<LongWritable, Text, Text, Text> driver;

    @Before
    public void setUp() {
        mapper = new WordCountsForDocsMapper();
        driver = new MapDriver<LongWritable, Text, Text, Text>(mapper);
    }

    @Test
    public void testMultiWords() {
        List<Pair<Text, Text>> out = null;

        try {
            out = driver.withInput(new LongWritable(0), new Text("crazy@all-shakespeare\t25")).run();
        } catch (IOException ioe) {
            fail();
        }

        List<Pair<Text, Text>> expected = new ArrayList<Pair<Text, Text>>();
        expected.add(new Pair<Text, Text>(new Text("all-shakespeare"), new Text("crazy=25")));
        assertListEquals(expected, out);
    }
}
Job 2, Reducer
package index;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * WordCountsForDocsReducer sums the total number of words N in each document and emits ("word@document", n/N) pairs.
 * Hadoop 0.20.1 API
 * @author Marcello de Sales (marcello.desales@gmail.com)
 */
public class WordCountsForDocsReducer extends Reducer<Text, Text, Text, Text> {

    public WordCountsForDocsReducer() {
    }

    /**
     * @param key is the key of the mapper
     * @param values are all the values aggregated during the mapping phase
     * @param context contains the context of the job run
     *
     *        PRE-CONDITION: receive a list of <document, ["word=n", "word-b=x"]>
     *            pairs <"a.txt", ["word1=3", "word2=5", "word3=5"]>
     *
     *       POST-CONDITION: <"word1@a.txt, 3/13">,
     *            <"word2@a.txt, 5/13">
     */
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        int sumOfWordsInDocument = 0;
        Map<String, Integer> tempCounter = new HashMap<String, Integer>();
        for (Text val : values) {
            String[] wordCounter = val.toString().split("=");
            tempCounter.put(wordCounter[0], Integer.valueOf(wordCounter[1]));
            sumOfWordsInDocument += Integer.parseInt(wordCounter[1]);
        }
        for (String wordKey : tempCounter.keySet()) {
            context.write(new Text(wordKey + "@" + key.toString()), new Text(tempCounter.get(wordKey) + "/"
                    + sumOfWordsInDocument));
        }
    }
}
Job 2, Reducer Unit Test
package index;

import static org.apache.hadoop.mrunit.testutil.ExtendedAssert.assertListEquals;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import junit.framework.TestCase;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.apache.hadoop.mrunit.types.Pair;
import org.junit.Before;
import org.junit.Test;

/**
 * Test cases for the reducer of the word counts.
 * Hadoop 0.20.1 API
 * @author Marcello de Sales (marcello.desales@gmail.com)
 */
public class WordCountsForDocsReducerTest extends TestCase {

    private Reducer<Text, Text, Text, Text> reducer;
    private ReduceDriver<Text, Text, Text, Text> driver;

    @Before
    public void setUp() {
        reducer = new WordCountsForDocsReducer();
        driver = new ReduceDriver<Text, Text, Text, Text>(reducer);
    }

    @Test
    public void testMultiWords() {
        List<Pair<Text, Text>> out = null;

        try {
            List<Text> values = new ArrayList<Text>();
            values.add(new Text("car=50"));
            values.add(new Text("hadoop=15"));
            values.add(new Text("algorithms=25"));
            out = driver.withInput(new Text("document"), values).run();

        } catch (IOException ioe) {
            fail();
        }

        List<Pair<Text, Text>> expected = new ArrayList<Pair<Text, Text>>();
        expected.add(new Pair<Text, Text>(new Text("car@document"), new Text("50/90")));
        expected.add(new Pair<Text, Text>(new Text("hadoop@document"), new Text("15/90")));
        expected.add(new Pair<Text, Text>(new Text("algorithms@document"), new Text("25/90")));
        assertListEquals(expected, out);
    }

}

Once again, following our Test-Driven Development approach, let’s test our Mapper and Reducer classes in order to verify the “correctness” of the generated data. The JUnit 4 test suite is updated as follows:

Test Suite
package index;

import junit.framework.Test;
import junit.framework.TestSuite;

/**
 * All tests for the TF-IDF algorithm
 * Hadoop 0.20.1 API
 * @author Marcello de Sales (marcello.desales@gmail.com)
 */
public final class AllTests  {

  private AllTests() { }

  public static Test suite() {
    TestSuite suite = new TestSuite("Tests for the TF-IDF algorithm");

    suite.addTestSuite(WordFreqMapperTest.class);
    suite.addTestSuite(WordFreqReducerTest.class);
    suite.addTestSuite(WordCountsForDocsMapperTest.class);
    suite.addTestSuite(WordCountsForDocsReducerTest.class);

    return suite;
  }
}

Just testing it with the ANT task test, defined in the build.xml artifact.

training@training-vm:~/git/exercises/shakespeare$ ant test
Buildfile: build.xml

compile:
[javac] Compiling 12 source files to /home/training/git/exercises/shakespeare/bin

test:
[junit] Running index.AllTests
[junit] Testsuite: index.AllTests
[junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 0.424 sec
[junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 0.424 sec
[junit]

BUILD SUCCESSFUL
Similar to the previous Part 1, it is safer to proceed to the execution of the Driver with tested classes. Furthermore, the driver includes the definitions of the mapper and reducer classes, as well as the output key and value classes. Also, note that the outputKeyClass and outputValueClass must be the same as the types emitted by the Reducer class!!! Once again, Hadoop complains when they are different 🙂
Job2, Driver
package index;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * WordCountsInDocuments counts the total number of words in each document and
 * produces data with the relative and total number of words for each document.
 * Hadoop 0.20.1 API
 * @author Marcello de Sales (marcello.desales@gmail.com)
 */
public class WordCountsInDocuments extends Configured implements Tool {

    // where to put the data in hdfs when we're done
    private static final String OUTPUT_PATH = "2-word-counts";

    // where to read the data from.
    private static final String INPUT_PATH = "1-word-freq";

    public int run(String[] args) throws Exception {

        Configuration conf = getConf();
        Job job = new Job(conf, "Words Counts");

        job.setJarByClass(WordCountsInDocuments.class);
        job.setMapperClass(WordCountsForDocsMapper.class);
        job.setReducerClass(WordCountsForDocsReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new WordCountsInDocuments(), args);
        System.exit(res);
    }
}

The input data is located in the directory of the first step “1-word-freq”, and the output persisted in the directory “2-word-counts” as listed in the main training directory in the HDFS. If you need to take a look at the ANT build and other classes, go to my personal resources at my Google Code Library Project. Recompile the project and generate the updated Jar with the driver.

training@training-vm:~/git/exercises/shakespeare$ ant
Buildfile: build.xml

compile:
  [javac] Compiling 5 source files to /home/training/git/exercises/shakespeare/bin
  [javac] Note: Some input files use or override a deprecated API.
  [javac] Note: Recompile with -Xlint:deprecation for details.

jar:
  [jar] Building jar: /home/training/git/exercises/shakespeare/indexer.jar

BUILD SUCCESSFUL
Total time: 1 second

Now, executing the driver…

 training@training-vm:~/git/exercises/shakespeare$ hadoop jar indexer.jar index.WordCountsInDocuments
10/01/06 16:28:04 INFO input.FileInputFormat: Total input paths to process : 1
10/01/06 16:28:04 INFO mapred.JobClient: Running job: job_200912301017_0048
10/01/06 16:28:05 INFO mapred.JobClient: map 0% reduce 0%
10/01/06 16:28:12 INFO mapred.JobClient: map 100% reduce 0%
10/01/06 16:28:18 INFO mapred.JobClient: map 100% reduce 100%
10/01/06 16:28:20 INFO mapred.JobClient: Job complete: job_200912301017_0048
10/01/06 16:28:20 INFO mapred.JobClient: Counters: 17
10/01/06 16:28:20 INFO mapred.JobClient: Job Counters
10/01/06 16:28:20 INFO mapred.JobClient: Launched reduce tasks=1
10/01/06 16:28:20 INFO mapred.JobClient: Launched map tasks=1
10/01/06 16:28:20 INFO mapred.JobClient: Data-local map tasks=1
10/01/06 16:28:20 INFO mapred.JobClient: FileSystemCounters
10/01/06 16:28:20 INFO mapred.JobClient: FILE_BYTES_READ=1685803
10/01/06 16:28:20 INFO mapred.JobClient: HDFS_BYTES_READ=1588239
10/01/06 16:28:20 INFO mapred.JobClient: FILE_BYTES_WRITTEN=3371638
10/01/06 16:28:20 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1920431
10/01/06 16:28:20 INFO mapred.JobClient: Map-Reduce Framework
10/01/06 16:28:20 INFO mapred.JobClient: Reduce input groups=0
10/01/06 16:28:20 INFO mapred.JobClient: Combine output records=0
10/01/06 16:28:20 INFO mapred.JobClient: Map input records=48779
10/01/06 16:28:20 INFO mapred.JobClient: Reduce shuffle bytes=1685803
10/01/06 16:28:20 INFO mapred.JobClient: Reduce output records=0
10/01/06 16:28:20 INFO mapred.JobClient: Spilled Records=97558
10/01/06 16:28:20 INFO mapred.JobClient: Map output bytes=1588239
10/01/06 16:28:20 INFO mapred.JobClient: Combine input records=0
10/01/06 16:28:20 INFO mapred.JobClient: Map output records=48779
10/01/06 16:28:20 INFO mapred.JobClient: Reduce input records=48779

Note that the execution shuffles tens of thousands of records (48,779 map output records, about 1.6 MB of shuffle data). Let’s check the result using the hadoop fs -cat command once again and navigate through it. The most important thing to note is that the relation n/N is maintained throughout the results: each word’s count n is paired with the total number of words N of the document it came from.

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -cat 2-word-counts/part-r-00000 | less
....
relished@all-shakespeare 1/738781
therefore@all-shakespeare 652/738781
eastward@all-shakespeare 1/738781
....
irrespective@leornardo-davinci-all.txt 1/149612
ignorance@leornardo-davinci-all.txt 12/149612
drawing@leornardo-davinci-all.txt 174/149612
relief@leornardo-davinci-all.txt 36/149612
...
answer@the-outline-of-science-vol1.txt 25/70650
sleeve@the-outline-of-science-vol1.txt 1/70650
regard@the-outline-of-science-vol1.txt 22/70650

Part 3 will conclude this job by combining two different steps. I’m still following the original basic tutorial from Cloudera, but using the Hadoop 0.20.1 API. Any suggestions for improvements are welcome:

  • How to write data pipes between 2 different jobs? (one rough idea is sketched below)
  • How to write a custom input format?

Those questions might be answered during the Hadoop Training in Sunnyvale on January 19-21, which I’m excited to attend. In the meantime, I can also accept suggestions!!! 😀
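
As for the first question, the simplest approach I can think of (my own sketch, not something from the Cloudera material) is to chain the two jobs inside a single Tool, feeding Job 1’s output directory to Job 2 as its input path. The class name TfIdfPipelineDriver below is hypothetical; it merely reuses the mapper and reducer classes from Parts 1 and 2:

package index;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Hypothetical pipeline driver: runs Job 1 and Job 2 back to back, using the
 * HDFS directory "1-word-freq" as the "pipe" between them.
 */
public class TfIdfPipelineDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();

        // Job 1: word frequency per document ("input" -> "1-word-freq")
        Job wordFreq = new Job(conf, "Word Frequence In Document");
        wordFreq.setJarByClass(TfIdfPipelineDriver.class);
        wordFreq.setMapperClass(WordFrequenceInDocMapper.class);
        wordFreq.setReducerClass(WordFrequenceInDocReducer.class);
        wordFreq.setCombinerClass(WordFrequenceInDocReducer.class);
        wordFreq.setOutputKeyClass(Text.class);
        wordFreq.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(wordFreq, new Path("input"));
        FileOutputFormat.setOutputPath(wordFreq, new Path("1-word-freq"));
        if (!wordFreq.waitForCompletion(true)) {
            return 1; // stop the pipeline if Job 1 fails
        }

        // Job 2: word counts per document ("1-word-freq" -> "2-word-counts")
        Job wordCounts = new Job(conf, "Words Counts");
        wordCounts.setJarByClass(TfIdfPipelineDriver.class);
        wordCounts.setMapperClass(WordCountsForDocsMapper.class);
        wordCounts.setReducerClass(WordCountsForDocsReducer.class);
        wordCounts.setOutputKeyClass(Text.class);
        wordCounts.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(wordCounts, new Path("1-word-freq"));
        FileOutputFormat.setOutputPath(wordCounts, new Path("2-word-counts"));
        return wordCounts.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new TfIdfPipelineDriver(), args));
    }
}

This still materializes the intermediate data in HDFS between the two jobs (which is handy for inspecting each step with hadoop fs -cat), so it is more a convenience wrapper than a true streaming pipe.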

Hadoop 0.20.1 API: refactoring the InvertedLine example from Cloudera Training, removing deprecated classes (JobConf, others)

January 5, 2010 8 comments

I’ve been learning Hadoop for the past 15 days and I have found lots of examples of source code. The basic training offered by Cloudera uses the 0.18 API, as does the Yahoo developer’s tutorial that describes the Inverted Line Index example. The input of this example is a list of one or more text files containing books, and the output is the index of words appearing in each of the files in the format <"word", "fileName@offset">, where the word is found in the given fileName at the given byte offset. Although the example works without a problem, I’ve read documentation about the Pig application where the majority of the warnings are caused by the API change. I’m particularly in favour of clean code without warnings, whenever possible. So, I started dissecting the API and managed to re-implement the examples using Hadoop 0.20.1. Furthermore, the MRUnit tests must also be refactored in order to make use of the new API.

Both the Yahoo Hadoop Tutorial and the Cloudera Basic Training documentation “Writing MapReduce Programs” give the example of the InvertedIndex application. I used the Cloudera VMWare implementation and source-code as a starting point.

The first major change was the inclusion of the mapreduce package, containing the new implementations of the Mapper and Reducer classes, which were interfaces in the previous API’s “mapred” package.

import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class MyMapper extends Mapper {
    ...
    ...
}

class MyReducer extends Reducer {
    ...
    ...
}

Also, note that these classes use Java generics, and therefore the methods “map()” and “reduce()” must follow the type parameters given in your implementation. Both methods replaced the use of the Reporter and OutputCollector with a Context class, which is a nested class of the Mapper and Reducer classes, respectively.

import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class MyMapper extends Mapper {
    ...
    protected void map(K key, V value, Context context) {
       ...
    }
}

class MyReducer extends Reducer {
    ...
    protected void reduce(K key, Iterable<V> values, Context context) {
       ...
    }
}

Consider K and V as the concrete Writable classes from the Hadoop API that your implementation uses; the generic parameters and the method signatures must match exactly. Since the new base classes provide default implementations, a map() or reduce() with a wrong signature is simply never called. For instance, I once had the wrong type for the values in my reducer’s signature, and the reduce method was never called. So, it is important to verify that you’re using the Iterable class for the values.
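
One trick I would suggest (it is not in the Cloudera material) to catch this kind of mismatch at compile time is the @Override annotation: since the 0.20.1 Mapper and Reducer are classes with default pass-through implementations, a method with a wrong signature is just silently ignored, but @Override turns it into a compiler error. A minimal sketch, where SafeMapper is only a hypothetical example class:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SafeMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Compiles fine: the signature matches Mapper.map(KEYIN, VALUEIN, Context)
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(value.toString()), new Text("seen"));
    }

    // Would NOT compile if uncommented: String is not the declared VALUEIN type,
    // so this method overrides nothing and Hadoop would never call it.
    // @Override
    // protected void map(LongWritable key, String value, Context context) { }
}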

The mapper class just needs the new API from the new package. The new imported classes are shown in the mapper and reducer code below.

Mapper Class

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)

package index;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * LineIndexMapper Maps each observed word in a line to a (filename@offset) string.
 */
public class LineIndexMapper extends Mapper<LongWritable, Text, Text, Text> {

    public LineIndexMapper() {
    }

    /**
     * Google's search Stopwords
     */
    private static Set<String> googleStopwords;

    static {
        googleStopwords = new HashSet<String>();
        googleStopwords.add("I"); googleStopwords.add("a"); 
        googleStopwords.add("about"); googleStopwords.add("an"); 
        googleStopwords.add("are"); googleStopwords.add("as");
        googleStopwords.add("at"); googleStopwords.add("be"); 
        googleStopwords.add("by"); googleStopwords.add("com"); 
        googleStopwords.add("de"); googleStopwords.add("en");
        googleStopwords.add("for"); googleStopwords.add("from"); 
        googleStopwords.add("how"); googleStopwords.add("in"); 
        googleStopwords.add("is"); googleStopwords.add("it");
        googleStopwords.add("la"); googleStopwords.add("of"); 
        googleStopwords.add("on"); googleStopwords.add("or"); 
        googleStopwords.add("that"); googleStopwords.add("the");
        googleStopwords.add("this"); googleStopwords.add("to"); 
        googleStopwords.add("was"); googleStopwords.add("what"); 
        googleStopwords.add("when"); googleStopwords.add("where");
        googleStopwords.add("who"); googleStopwords.add("will"); 
        googleStopwords.add("with"); googleStopwords.add("and"); 
        googleStopwords.add("the"); googleStopwords.add("www");
    }

    /**
     * @param key is the byte offset of the current line in the file;
     * @param value is the line from the file
     * @param context has the method "write()" to output the <key,value> pair and gives
     *        access to information about the job (like the current filename)
     *
     *     POST-CONDITION: Output <"word", "filename@offset"> pairs
     */
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Compile all the words using regex
        Pattern p = Pattern.compile("\\w+");
        Matcher m = p.matcher(value.toString());

        // Get the name of the file from the inputsplit in the context
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();

        // build the values and write <k,v> pairs through the context
        StringBuilder valueBuilder = new StringBuilder();
        while (m.find()) {
            String matchedKey = m.group().toLowerCase();
            // remove names starting with non letters, digits, considered stopwords or containing other chars
            if (!Character.isLetter(matchedKey.charAt(0)) || Character.isDigit(matchedKey.charAt(0))
                    || googleStopwords.contains(matchedKey) || matchedKey.contains("_")) {
                continue;
            }
            valueBuilder.append(fileName);
            valueBuilder.append("@");
            valueBuilder.append(key.get());
            // emit the partial <k,v>
            context.write(new Text(matchedKey), new Text(valueBuilder.toString()));
            valueBuilder.setLength(0);
        }
    }
}

Reducer Class

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)

package index;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * LineIndexReducer Takes a list of filename@offset entries for a single word and concatenates them into a list.
 */
public class LineIndexReducer extends Reducer<Text, Text, Text, Text> {

    public LineIndexReducer() {
    }

    /**
     * @param key is the key of the mapper
     * @param values are all the values aggregated during the mapping phase
     * @param context contains the context of the job run
     *
     *      PRE-CONDITION: receive a list of <"word", "filename@offset"> pairs
     *        <"marcello", ["a.txt@3345", "b.txt@344", "c.txt@785"]>
     *
     *      POST-CONDITION: emit the output a single key-value where all the file names
     *        are separated by a comma ",".
     *        <"marcello", "a.txt@3345,b.txt@344,c.txt@785">
     */
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        StringBuilder valueBuilder = new StringBuilder();

        for (Text val : values) {
            valueBuilder.append(val);
            valueBuilder.append(",");
        }
        //write the key and the adjusted value (removing the last comma)
        context.write(key, new Text(valueBuilder.substring(0, valueBuilder.length() - 1)));
        valueBuilder.setLength(0);
    }
}

These are the changes necessary for the Mapper and Reducer classes; they now extend the new base classes directly, with no old “mapred” interfaces left to implement. In order to unit test these classes, changes to MRUnit are also necessary: its drivers gained a new “mapreduce” package with the corresponding counterparts.

Instead of the mrunit.MapDriver, use the mrunit.mapreduce.MapDriver; the same applies to the ReduceDriver for the Reducer class. The rest of the code stays the same.

import org.apache.hadoop.mrunit.MapDriver;           // old MRUnit driver

import org.apache.hadoop.mrunit.mapreduce.MapDriver; // new driver for the 0.20.1 "mapreduce" API

JUnit’s MapperTest

Some changes are also required in the MRUnit API classes, following the same pattern as the main API: the addition of the “mapreduce” package and its new implementing classes.

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)
package index;

import static org.apache.hadoop.mrunit.testutil.ExtendedAssert.assertListEquals;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import junit.framework.TestCase;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mock.MockInputSplit;
import org.apache.hadoop.mrunit.types.Pair;
import org.junit.Before;
import org.junit.Test;

/**
 * Test cases for the inverted index mapper.
 */
public class MapperTest extends TestCase {

    private Mapper<LongWritable, Text, Text, Text> mapper;
    private MapDriver<LongWritable, Text, Text, Text> driver;

    /** We expect pathname@offset for the key from each of these */
    private final Text EXPECTED_OFFSET = new Text(MockInputSplit.getMockPath().toString() + "@0");

    @Before
    public void setUp() {
        mapper = new LineIndexMapper();
        driver = new MapDriver<LongWritable, Text, Text, Text>(mapper);
    }

    @Test
    public void testEmpty() {
        List<Pair<Text, Text>> out = null;

        try {
            out = driver.withInput(new LongWritable(0), new Text("")).run();
        } catch (IOException ioe) {
            fail();
        }

        List<Pair<Text, Text>> expected = new ArrayList<Pair<Text, Text>>();

        assertListEquals(expected, out);
    }

    @Test
    public void testOneWord() {
        List<Pair<Text, Text>> out = null;

        try {
            out = driver.withInput(new LongWritable(0), new Text("foo")).run();
        } catch (IOException ioe) {
            fail();
        }

        List<Pair<Text, Text>> expected = new ArrayList<Pair<Text, Text>>();
        expected.add(new Pair<Text, Text>(new Text("foo"), EXPECTED_OFFSET));

        assertListEquals(expected, out);
    }

    @Test
    public void testMultiWords() {
        List<Pair<Text, Text>> out = null;

        try {
            out = driver.withInput(new LongWritable(0), new Text("foo bar baz!!!! ????")).run();
        } catch (IOException ioe) {
            fail();
        }

        List<Pair<Text, Text>> expected = new ArrayList<Pair<Text, Text>>();
        expected.add(new Pair<Text, Text>(new Text("foo"), EXPECTED_OFFSET));
        expected.add(new Pair<Text, Text>(new Text("bar"), EXPECTED_OFFSET));
        expected.add(new Pair<Text, Text>(new Text("baz"), EXPECTED_OFFSET));

        assertListEquals(expected, out);
    }
}

JUnit’s ReducerTest

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)

package index;

import static org.apache.hadoop.mrunit.testutil.ExtendedAssert.assertListEquals;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import junit.framework.TestCase;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.apache.hadoop.mrunit.types.Pair;
import org.junit.Before;
import org.junit.Test;

/**
 * Test cases for the inverted index reducer.
 */
public class ReducerTest extends TestCase {

    private Reducer<Text, Text, Text, Text> reducer;
    private ReduceDriver<Text, Text, Text, Text> driver;

    @Before
    public void setUp() {
        reducer = new LineIndexReducer();
        driver = new ReduceDriver<Text, Text, Text, Text>(reducer);
    }

    @Test
    public void testOneOffset() {
        List<Pair<Text, Text>> out = null;

        try {
            out = driver.withInputKey(new Text("word")).withInputValue(new Text("offset")).run();
        } catch (IOException ioe) {
            fail();
        }

        List<Pair<Text, Text>> expected = new ArrayList<Pair<Text, Text>>();
        expected.add(new Pair<Text, Text>(new Text("word"), new Text("offset")));

        assertListEquals(expected, out);
    }

    @Test
    public void testMultiOffset() {
        List<Pair<Text, Text>> out = null;

        try {
            out = driver.withInputKey(new Text("word")).withInputValue(new Text("offset1")).withInputValue(
                    new Text("offset2")).run();
        } catch (IOException ioe) {
            fail();
        }

        List<Pair<Text, Text>> expected = new ArrayList<Pair<Text, Text>>();
        expected.add(new Pair<Text, Text>(new Text("word"), new Text("offset1,offset2")));

        assertListEquals(expected, out);
    }
}

You can test them by running the command “ant test” in the source-code directory, as usual, to confirm that the implementation is correct:

training@training-vm:~/git/exercises/shakespeare$ ant test
Buildfile: build.xml

compile:
[javac] Compiling 4 source files to /home/training/git/exercises/shakespeare/bin

test:
[junit] Running index.AllTests
[junit] Testsuite: index.AllTests
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 0.418 sec
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 0.418 sec
[junit]

BUILD SUCCESSFUL
Total time: 2 seconds

Replacing JobConf and other deprecated classes

Other changes related to the API are in the configuration of the job execution. The class “JobConf” was deprecated, but most of the tutorials have not been updated. So, here’s the updated version of the main example driver using the Configuration and Job classes. Note that the job is configured and executed with the default version of the configuration, which is the class responsible for configuring the execution of the tasks. Once again, the replacement of the classes located in the package “mapred” is important, since the new classes are located in the package “mapreduce”. The following code shows the new classes imported and how they are used throughout the Driver.

InvertedIndex driver

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)
package index;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * LineIndexer Creates an inverted index over all the words in a document corpus, mapping each observed word to a list
 * of filename@offset locations where it occurs.
 */
public class LineIndexer extends Configured implements Tool {

    // where to put the data in hdfs when we're done
    private static final String OUTPUT_PATH = "output";

    // where to read the data from.
    private static final String INPUT_PATH = "input";

    public int run(String[] args) throws Exception {

        Configuration conf = getConf();
        Job job = new Job(conf, "Line Indexer 1");

        job.setJarByClass(LineIndexer.class);
        job.setMapperClass(LineIndexMapper.class);
        job.setReducerClass(LineIndexReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new LineIndexer(), args);
        System.exit(res);
    }
}

After updating, make sure to generate a new jar, remove anything under the directory “output” (since the program does not clean that up; “hadoop fs -rmr output” does the trick), and execute the new version.

training@training-vm:~/git/exercises/shakespeare$ ant jar
Buildfile: build.xml

compile:
[javac] Compiling 4 source files to /home/training/git/exercises/shakespeare/bin

jar:
[jar] Building jar: /home/training/git/exercises/shakespeare/indexer.jar

BUILD SUCCESSFUL
Total time: 1 second

I have added 2 ASCII books to the input directory: the works of Leonardo Da Vinci and the first volume of the book “The Outline of Science”.

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -ls input
Found 3 items
-rw-r--r--   1 training supergroup    5342761 2009-12-30 11:57 /user/training/input/all-shakespeare
-rw-r--r--   1 training supergroup    1427769 2010-01-04 17:42 /user/training/input/leornardo-davinci-all.txt
-rw-r--r--   1 training supergroup     674762 2010-01-04 17:42 /user/training/input/the-outline-of-science-vol1.txt

The execution and output of this example are shown as follows.

training@training-vm:~/git/exercises/shakespeare$ hadoop jar indexer.jar index.LineIndexer
10/01/04 21:11:55 INFO input.FileInputFormat: Total input paths to process : 3
10/01/04 21:11:56 INFO mapred.JobClient: Running job: job_200912301017_0017
10/01/04 21:11:57 INFO mapred.JobClient:  map 0% reduce 0%
10/01/04 21:12:07 INFO mapred.JobClient:  map 33% reduce 0%
10/01/04 21:12:10 INFO mapred.JobClient:  map 58% reduce 0%
10/01/04 21:12:13 INFO mapred.JobClient:  map 63% reduce 0%
10/01/04 21:12:16 INFO mapred.JobClient:  map 100% reduce 11%
10/01/04 21:12:28 INFO mapred.JobClient:  map 100% reduce 77%
10/01/04 21:12:34 INFO mapred.JobClient:  map 100% reduce 100%
10/01/04 21:12:36 INFO mapred.JobClient: Job complete: job_200912301017_0017
10/01/04 21:12:36 INFO mapred.JobClient: Counters: 17
10/01/04 21:12:36 INFO mapred.JobClient:   Job Counters
10/01/04 21:12:36 INFO mapred.JobClient:     Launched reduce tasks=1
10/01/04 21:12:36 INFO mapred.JobClient:     Launched map tasks=3
10/01/04 21:12:36 INFO mapred.JobClient:     Data-local map tasks=3
10/01/04 21:12:36 INFO mapred.JobClient:   FileSystemCounters
10/01/04 21:12:36 INFO mapred.JobClient:     FILE_BYTES_READ=58068623
10/01/04 21:12:36 INFO mapred.JobClient:     HDFS_BYTES_READ=7445292
10/01/04 21:12:36 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=92132872
10/01/04 21:12:36 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=26638259
10/01/04 21:12:36 INFO mapred.JobClient:   Map-Reduce Framework
10/01/04 21:12:36 INFO mapred.JobClient:     Reduce input groups=0
10/01/04 21:12:36 INFO mapred.JobClient:     Combine output records=0
10/01/04 21:12:36 INFO mapred.JobClient:     Map input records=220255
10/01/04 21:12:36 INFO mapred.JobClient:     Reduce shuffle bytes=34064153
10/01/04 21:12:36 INFO mapred.JobClient:     Reduce output records=0
10/01/04 21:12:36 INFO mapred.JobClient:     Spilled Records=2762272
10/01/04 21:12:36 INFO mapred.JobClient:     Map output bytes=32068217
10/01/04 21:12:36 INFO mapred.JobClient:     Combine input records=0
10/01/04 21:12:36 INFO mapred.JobClient:     Map output records=997959
10/01/04 21:12:36 INFO mapred.JobClient:     Reduce input records=997959

The index entry for the word “abandoned” is an example of one present in all of the books:

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -cat output/part-r-00000 | less
 ...
 ...
abandoned       leornardo-davinci-all.txt@1257995,leornardo-davinci-all.txt@652992,all-shakespeare@4657862,all-shakespeare@738818,the-outline-of-science-vol1.txt@642211,the-outline-of-science-vol1.txt@606442,the-outline-of-science-vol1.txt@641585
...
...

TF-IDF in Hadoop Part 1: Word Frequency in Doc

December 31, 2009 4 comments

My interest in parallel computing dates back to my undergraduate studies, just one or two years after Google’s paper about efficient data processing was published. From that time on, I wondered how they managed to index “the web”. So I started learning the Hadoop API and HDFS, and explored the implementation of the TF-IDF algorithm as explained by the Cloudera training. I started this implementation after implementing the InvertedIndex example using both the Hadoop 0.18 and 0.20.1 APIs. My experience is split into parts; this post covers Part 1.

This code uses the Hadoop 0.20.1 API.

Seven years passed, and while writing my thesis project I started dealing with the same questions regarding large datasets… How to process them at the database level? I mean, how to process them efficiently with the computational resources you’ve got? Interestingly enough, my first contact with MapReduce processing was with mongoDB’s MapReduce API, used to access data in parallel across the different shards of a database cluster, where the data is stored in different shards depending on different properties of the data. And of course, one of the tools to process the distributed data is a MapReduce API. I learned how to use that API thanks to Cloudera’s Basic Training on MapReduce and HDFS. This first documentation was produced after studying and completing the first exercises of Cloudera’s InvertedIndex example using Hadoop, where I downloaded the VMPlayer image and played with the initial examples, driven by the PDF explaining the exercises. Although the source code works without a problem, it uses the Hadoop 0.18 API, and if you get bothered by the warnings in Eclipse, I have updated and documented the necessary changes to remove them and use the refactored version of InvertedIndex with the Hadoop 0.20.1 API.

I finally found the Cloudera basic introduction training on MapReduce and Hadoop… and let me tell you, they made the nicest introduction to MapReduce I’ve ever seen 🙂 The slides and documentation are very well structured and nice to follow (considering you came from the academic world)… They actually worked closely with Google and the University of Washington to get to that level… I was very pleased to read and understand the concepts… My only need at that time was to use that knowledge on the MapReduce engine from mongoDB… I did a simple application and it proved to be interesting…

So, I’ve been studying the Cloudera basic training in Hadoop, and that was the only way I could learn MapReduce! If you have a good background in Java 5/6, Linux, operating systems, shell scripting, etc., you can definitely move on… If you don’t have experience with Hadoop, I definitely suggest following the basic training from sessions 1 – 5, including the InvertedIndex exercise. You will find the exercises describing the TF-IDF algorithm in one of the PDFs.

The first thing I implemented with Hadoop was the indexing of words in the All-Shakespeare collection. However, I was intrigued, could not resist, and downloaded more e-books from the Gutenberg project (all the Da Vinci books and The Outline of Science Vol. 1). The input directory already includes the Shakespeare collection, but I had to put the new files into the filesystem. You can add the downloaded files to the Hadoop File System by using the “copyFromLocal” command:

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -copyFromLocal the-outline-of-science-vol1.txt input
training@training-vm:~/git/exercises/shakespeare$ hadoop fs -copyFromLocal leornardo-davinci-all.txt input

You can verify if the files were added by listing the contents of the “input” directory.

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -ls input
Found 3 items
-rw-r--r--   1 training supergroup    5342761 2009-12-30 11:57 /user/training/input/all-shakespeare
-rw-r--r--   1 training supergroup    1427769 2010-01-04 17:42 /user/training/input/leornardo-davinci-all.txt
-rw-r--r--   1 training supergroup     674762 2010-01-04 17:42 /user/training/input/the-outline-of-science-vol1.txt

Note that the command “hadoop fs” proxies the usual unix file commands (“-ls”, “-cat”, among others) to its filesystem. Following the suggestion of the documentation, the approach I took to understand the concepts more easily was divide-and-conquer: each of the jobs is executed separately as an exercise, saving the generated reduced values into HDFS.

Job 1: Word Frequency in Doc

As mentioned before, the word frequency phase is designed as a Job whose task is to count the number of occurrences of each word in each of the documents in the input directory. In this case, the specification of the Map and Reduce phases is as follows:

  • Map:
    • Input: (document, each line contents)
    • Output: (word@document, 1)
  • Reducer:
    • n = sum of the values for each key “word@document”
    • Output: (word@document, n)

In order to decrease the payload received by the reducers, I discard very-high-frequency words such as “the”, based on Google’s stopwords list. Also, the result of each job (the intermediate values for the next job) is saved to a regular HDFS file before the next MapReduce pass runs. In general, the strategy is:

  1. Shrink the map output by lower-casing the words, so that equal words are aggregated before the reduce phase;
  2. Skip unnecessary words by checking them against the stopwords dictionary (Google search stopwords);
  3. Use a RegEx to select only words, removing punctuation and other data anomalies.

Job1, Mapper

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)
package index;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * WordFrequenceInDocMapper implements the Job 1 specification for the TF-IDF algorithm
 */
public class WordFrequenceInDocMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    public WordFrequenceInDocMapper() {
    }

    /**
     * Google's search Stopwords
     */
    private static Set<String> googleStopwords;

    static {
        googleStopwords = new HashSet<String>();
        googleStopwords.add("I"); googleStopwords.add("a");
        googleStopwords.add("about"); googleStopwords.add("an");
        googleStopwords.add("are"); googleStopwords.add("as");
        googleStopwords.add("at"); googleStopwords.add("be");
        googleStopwords.add("by"); googleStopwords.add("com");
        googleStopwords.add("de"); googleStopwords.add("en");
        googleStopwords.add("for"); googleStopwords.add("from");
        googleStopwords.add("how"); googleStopwords.add("in");
        googleStopwords.add("is"); googleStopwords.add("it");
        googleStopwords.add("la"); googleStopwords.add("of");
        googleStopwords.add("on"); googleStopwords.add("or");
        googleStopwords.add("that"); googleStopwords.add("the");
        googleStopwords.add("this"); googleStopwords.add("to");
        googleStopwords.add("was"); googleStopwords.add("what");
        googleStopwords.add("when"); googleStopwords.add("where");
        googleStopwords.add("who"); googleStopwords.add("will");
        googleStopwords.add("with"); googleStopwords.add("and");
        googleStopwords.add("the"); googleStopwords.add("www");
    }

    /**
     * @param key is the byte offset of the current line in the file;
     * @param value is the line from the file
     * @param context has the method "write()" to output the <key,value> pair and gives
     *        access to information about the job (like the current filename)
     *
     *     POST-CONDITION: Output <"word@filename", 1> pairs
     */
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Compile all the words using regex
        Pattern p = Pattern.compile("\\w+");
        Matcher m = p.matcher(value.toString());

        // Get the name of the file from the inputsplit in the context
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();

        // build the values and write <k,v> pairs through the context
        StringBuilder valueBuilder = new StringBuilder();
        while (m.find()) {
            String matchedKey = m.group().toLowerCase();
            // remove names starting with non letters, digits, considered stopwords or containing other chars
            if (!Character.isLetter(matchedKey.charAt(0)) || Character.isDigit(matchedKey.charAt(0))
                    || googleStopwords.contains(matchedKey) || matchedKey.contains("_")) {
                continue;
            }
            valueBuilder.append(matchedKey);
            valueBuilder.append("@");
            valueBuilder.append(fileName);
            // emit the partial <k,v>
            context.write(new Text(valueBuilder.toString()), new IntWritable(1));
        }
    }
}

Job1, Mapper Unit Test

Note that the unit tests use the JUnit 4 API. The MRUnit API is also updated to use the Hadoop 0.20.1 API for the Mapper and the respective MapDriver. Generics are used to emulate the actual implementation as well.

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)
package index;

import static org.apache.hadoop.mrunit.testutil.ExtendedAssert.assertListEquals;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import junit.framework.TestCase;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mock.MockInputSplit;
import org.apache.hadoop.mrunit.types.Pair;
import org.junit.Before;
import org.junit.Test;

/**
 * Test cases for the word frequency mapper.
 */
public class WordFreqMapperTest extends TestCase {

    private Mapper<LongWritable, Text, Text, IntWritable> mapper;
    private MapDriver<LongWritable, Text, Text, IntWritable> driver;

    /** We expect pathname@offset for the key from each of these */
    private final Text KEY_SUFIX = new Text("@" + MockInputSplit.getMockPath().toString());

    @Before
    public void setUp() {
        mapper = new WordFrequenceInDocMapper();
        driver = new MapDriver<LongWritable, Text, Text, IntWritable>(mapper);
    }

    @Test
    public void testEmpty() {
        List<Pair<Text, IntWritable>> out = null;

        try {
            out = driver.withInput(new LongWritable(0), new Text("")).run();
        } catch (IOException ioe) {
            fail();
        }

        List<Pair<Text, Text>> expected = new ArrayList<Pair<Text, Text>>();

        assertListEquals(expected, out);
    }

    @Test
    public void testOneWord() {
        List<Pair<Text, IntWritable>> out = null;

        try {
            out = driver.withInput(new LongWritable(0), new Text("foo")).run();
        } catch (IOException ioe) {
            fail();
        }

        List<Pair<Text, IntWritable>> expected = new ArrayList<Pair<Text, IntWritable>>();
        expected.add(new Pair<Text, IntWritable>(new Text("foo" + KEY_SUFIX), new IntWritable(1)));

        assertListEquals(expected, out);
    }

    @Test
    public void testMultiWords() {
        List<Pair<Text, IntWritable>> out = null;

        try {
            out = driver.withInput(new LongWritable(0), new Text("foo bar baz!!!! ????")).run();
        } catch (IOException ioe) {
            fail();
        }

        List<Pair<Text, IntWritable>> expected = new ArrayList<Pair<Text, IntWritable>>();
        expected.add(new Pair<Text, IntWritable>(new Text("foo" + KEY_SUFIX), new IntWritable(1)));
        expected.add(new Pair<Text, IntWritable>(new Text("bar" + KEY_SUFIX), new IntWritable(1)));
        expected.add(new Pair<Text, IntWritable>(new Text("baz" + KEY_SUFIX), new IntWritable(1)));

        assertListEquals(expected, out);
    }
}

Job1, Reducer

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)

package index;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * WordFrequenceInDocReducer sums the partial counts for a given "word@document" key and emits the total number of occurrences of that word in that document.
 */
public class WordFrequenceInDocReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public WordFrequenceInDocReducer() {
    }

    /**
     * @param key is the key of the mapper
     * @param values are all the values aggregated during the mapping phase
     * @param context contains the context of the job run
     *
     *      PRE-CONDITION: receive a list of <"word@filename",[1, 1, 1, ...]> pairs
     *        <"marcello@a.txt", [1, 1]>
     *
     *      POST-CONDITION: emit the output a single key-value where the sum of the occurrences.
     *        <"marcello@a.txt", 2>
     */
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        //write the word@document key and its summed number of occurrences
        context.write(key, new IntWritable(sum));
    }
}

Job1, Reducer Unit Test

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)

package index;

import static org.apache.hadoop.mrunit.testutil.ExtendedAssert.assertListEquals;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import junit.framework.TestCase;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.apache.hadoop.mrunit.types.Pair;
import org.junit.Before;
import org.junit.Test;

/**
 * Test cases for the inverted index reducer.
 */
public class WordFreqReducerTest extends TestCase {

    private Reducer<Text, IntWritable, Text, IntWritable> reducer;
    private ReduceDriver<Text, IntWritable, Text, IntWritable> driver;

    @Before
    public void setUp() {
        reducer = new WordFrequenceInDocReducer();
        driver = new ReduceDriver<Text, IntWritable, Text, IntWritable>(reducer);
    }

    @Test
    public void testOneItem() {
        List<Pair<Text, IntWritable>> out = null;

        try {
            out = driver.withInputKey(new Text("word")).withInputValue(new IntWritable(1)).run();
        } catch (IOException ioe) {
            fail();
        }

        List<Pair<Text, IntWritable>> expected = new ArrayList<Pair<Text, IntWritable>>();
        expected.add(new Pair<Text, IntWritable>(new Text("word"), new IntWritable(1)));

        assertListEquals(expected, out);
    }

    @Test
    public void testMultiWords() {
        List<Pair<Text, IntWritable>> out = null;

        try {
            List<IntWritable> values = new ArrayList<IntWritable>();
            values.add(new IntWritable(2));
            values.add(new IntWritable(5));
            values.add(new IntWritable(8));
            out = driver.withInput(new Text("word1"), values).run();

        } catch (IOException ioe) {
            fail();
        }

        List<Pair<Text, IntWritable>> expected = new ArrayList<Pair<Text, IntWritable>>();
        expected.add(new Pair<Text, IntWritable>(new Text("word1"), new IntWritable(15)));

        assertListEquals(expected, out);
    }
}

Before executing the Hadoop application, make sure that the Mapper and Reducer classes are passing their unit tests. Test-Driven Development helps during the development of the Mappers and Reducers by identifying problems related to incorrectly inherited methods (Generics in particular), where wrong “map” or “reduce” method signatures may lead to entire phases being skipped. Therefore, running the test cases before the actual execution of the driver classes is safer.

training@training-vm:~/git/exercises/shakespeare$ ant test
Buildfile: build.xml

compile:
[javac] Compiling 5 source files to /home/training/git/exercises/shakespeare/bin
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.

test:
[junit] Running index.AllTests
[junit] Testsuite: index.AllTests
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.279 sec
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.279 sec
[junit]

BUILD SUCCESSFUL
Total time: 2 seconds

Then, the execution of the Driver can proceed. It includes the definitions of the mapper and reducer classes, as well as defining the combiner class to be the same as the reducer class. Also, note that the outputKeyClass and outputValueClass are the same as the ones defined by the Reducer class!!! If not, Hadoop will complain! 🙂

Job1, Driver
// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)
package index;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * WordFrequenceInDocument Creates the index of the words in documents,
 * mapping each of them to their frequency.
 */
public class WordFrequenceInDocument extends Configured implements Tool {

    // where to put the data in hdfs when we're done
    private static final String OUTPUT_PATH = "1-word-freq";

    // where to read the data from.
    private static final String INPUT_PATH = "input";

    public int run(String[] args) throws Exception {

        Configuration conf = getConf();
        Job job = new Job(conf, "Word Frequence In Document");

        job.setJarByClass(WordFrequenceInDocument.class);
        job.setMapperClass(WordFrequenceInDocMapper.class);
        job.setReducerClass(WordFrequenceInDocReducer.class);
        job.setCombinerClass(WordFrequenceInDocReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new WordFrequenceInDocument(), args);
        System.exit(res);
    }
}
As specified by the Driver class, the data is read from the books located in the HDFS input directory, and the output is persisted in this first step’s directory, “1-word-freq”. The training virtual machine contains the necessary build scripts to compile the code and generate the jar for the execution of the MapReduce application, as well as to run the unit tests for each of the Mapper and Reducer classes.
training@training-vm:~/git/exercises/shakespeare$ ant
Buildfile: build.xml

compile:
[javac] Compiling 5 source files to /home/training/git/exercises/shakespeare/bin
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.

jar:
[jar] Building jar: /home/training/git/exercises/shakespeare/indexer.jar

BUILD SUCCESSFUL
Total time: 1 second

After making sure that everything is working according to the tests, it is time to execute the main driver.

training@training-vm:~/git/exercises/shakespeare$ hadoop jar indexer.jar index.WordFrequenceInDocument
10/01/05 16:34:54 INFO input.FileInputFormat: Total input paths to process : 3
10/01/05 16:34:54 INFO mapred.JobClient: Running job: job_200912301017_0046
10/01/05 16:34:55 INFO mapred.JobClient:  map 0% reduce 0%
10/01/05 16:35:10 INFO mapred.JobClient:  map 50% reduce 0%
10/01/05 16:35:13 INFO mapred.JobClient:  map 66% reduce 0%
10/01/05 16:35:16 INFO mapred.JobClient:  map 100% reduce 0%
10/01/05 16:35:19 INFO mapred.JobClient:  map 100% reduce 33%
10/01/05 16:35:25 INFO mapred.JobClient:  map 100% reduce 100%
10/01/05 16:35:27 INFO mapred.JobClient: Job complete: job_200912301017_0046
10/01/05 16:35:27 INFO mapred.JobClient: Counters: 17
10/01/05 16:35:27 INFO mapred.JobClient:   Job Counters
10/01/05 16:35:27 INFO mapred.JobClient:     Launched reduce tasks=1
10/01/05 16:35:27 INFO mapred.JobClient:     Launched map tasks=3
10/01/05 16:35:27 INFO mapred.JobClient:     Data-local map tasks=3
10/01/05 16:35:27 INFO mapred.JobClient:   FileSystemCounters
10/01/05 16:35:27 INFO mapred.JobClient:     FILE_BYTES_READ=3129067
10/01/05 16:35:27 INFO mapred.JobClient:     HDFS_BYTES_READ=7445292
10/01/05 16:35:27 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=4901739
10/01/05 16:35:27 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1588239
10/01/05 16:35:27 INFO mapred.JobClient:   Map-Reduce Framework
10/01/05 16:35:27 INFO mapred.JobClient:     Reduce input groups=0
10/01/05 16:35:27 INFO mapred.JobClient:     Combine output records=94108
10/01/05 16:35:27 INFO mapred.JobClient:     Map input records=220255
10/01/05 16:35:27 INFO mapred.JobClient:     Reduce shuffle bytes=1772576
10/01/05 16:35:27 INFO mapred.JobClient:     Reduce output records=0
10/01/05 16:35:27 INFO mapred.JobClient:     Spilled Records=142887
10/01/05 16:35:27 INFO mapred.JobClient:     Map output bytes=27375962
10/01/05 16:35:27 INFO mapred.JobClient:     Combine input records=1004372
10/01/05 16:35:27 INFO mapred.JobClient:     Map output records=959043
10/01/05 16:35:27 INFO mapred.JobClient:     Reduce input records=48779

The execution generates the output shown in the following listing (note that I piped the cat output to the less process so you can navigate through the stream). Searching for the word “therefore” shows its count in the different documents.

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -cat 1-word-freq/part-r-00000 | less
...
therefore@all-shakespeare       652
therefore@leornardo-davinci-all.txt     124
therefore@the-outline-of-science-vol1.txt       36
...

The results produced are the intermediate data needed as input for Job 2, specified in Part 2 of this tutorial. For example, for the word “therefore” in all-shakespeare, Job 1 produces n = 652; Job 2 then divides it by the document’s total word count N = 738781, yielding the 652/738781 entry shown in Part 2.

Finished… 23 Months Gone!!! Got my M.S. degree

December 17, 2009 1 comment

Finally, it is over!!! I got my M.S. degree in Computer Science at SFSU. It was a long, fun, and sometimes tiring journey, but I finally accomplished another dream of mine…

The thesis write up was a big rush… 4 Months, no sleep… Got lots of experience in Data Persistence for Sensor Networks, Cloud-computing techniques in Data Persistence using Key-Value-Pair data model, Database Shards and Partition…

Well, that is all…
