Archive for the ‘java’ Category

Managing and Building version-controlled Maven Repos using Git, Gradle and Nexus Server

I currently work in a VERY OLD code base that uses a “thirdparty” directory as a version-controlled library directory, dating from the era before Maven. Some colleagues decided it was time to adopt Maven for building new components, and that was exciting… However, you can imagine a build system composed of “dino” ANT scripts managing really old stuff, with Maven pom.xml files introduced alongside them.

I personally chose Gradle for the projects I had started and integrated it to publish the generated jars to the “thirdparty” directory…

Gradle for Maven Dependencies

The most up-to-date version of this script is shown below. Note that this approach differs from the one on StackOverflow in that this version DOES NOT generate the Maven metadata files (pom.xml, *.sha1…), but simply copies the generated versioned jars into the “thirdparty” directory (the scm-cached dir).
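The original script is not reproduced here; a minimal sketch of such a copy task, assuming a root-level “thirdparty” directory (the directory name and version value are illustrative, not taken from the original build):

```groovy
// build.gradle (sketch): copy the versioned jar into the scm-cached "thirdparty" dir
apply plugin: 'java'

version = '1.2.3'   // illustrative version

task copyToThirdparty(type: Copy, dependsOn: jar) {
    from jar.outputs.files          // the versioned jar produced by the jar task
    into "${rootDir}/thirdparty"    // assumed location; no pom.xml or checksums generated
}
```

Running `gradle copyToThirdparty` then leaves only the plain jar in the version-controlled directory.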

Then, I used Gradle’s dependency mechanism to consume the same “thirdparty” directory as a dependency source…
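Consuming jars from a plain directory can be done with a flatDir repository; a hedged sketch (the directory name and the dependency coordinates are illustrative):

```groovy
// build.gradle (sketch): resolve dependencies straight from the "thirdparty" dir
repositories {
    flatDir dirs: "${rootDir}/thirdparty"   // assumed location of the scm-cached jars
}

dependencies {
    // flatDir repositories carry no poms, so artifacts are matched by name/version
    compile name: 'some-library', version: '1.2.3'   // illustrative coordinates
}
```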

This week I received a surprising email from a developer complaining about not being able to build one of the components. The problem was simply that the Maven repository server had been COMPLETELY REBUILT and all the dependencies had been wiped out. As a result, developers maintaining different projects needed to upload their artifacts again. That raised the question of where the Maven artifacts should be stored. As the old approach of using the “thirdparty” directory might become a hybrid approach in the future, I thought I could use Git to store the Maven artifacts in a version-controlled directory of my projects, after I stumbled upon the following blog.

I agree with the pros/cons of that approach, and it happens to be very similar to the situation I faced today. So I decided to create something similar using a private Git repository on BitBucket to simulate the same environment.

Reusable Gradle Properties

First, the initial Maven configuration is managed by the following:
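The referenced properties script is not shown here; a minimal sketch of a reusable mavenProperties.gradle supporting a “-Prelease” switch (the group, version, and repo path values are assumptions for illustration):

```groovy
// mavenProperties.gradle (sketch): reusable properties shared across build scripts
ext {
    mvnGroup = 'com.example'            // illustrative group id
    baseVersion = '1.0.0'               // illustrative base version
    // "-Prelease" switch: drop the -SNAPSHOT suffix when building a release
    mvnVersion = project.hasProperty('release') ? baseVersion : "${baseVersion}-SNAPSHOT"
    // local version-controlled Maven repo inside the project checkout (assumed path)
    scmRepoUrl = "file://${rootDir}/repo"
}
```

Other build scripts can pick these up with `apply from: 'mavenProperties.gradle'` and reference the `ext` properties directly.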

This gives me the following capabilities:

  • Using the property “-Prelease” as a switch to drop the “-SNAPSHOT” suffix from the generated jars.
  • Lots of properties saved in the project, making it easier for other build scripts to depend on those properties.
  • It could also provide an option to generate jars into the “thirdparty” directory WITHOUT Maven’s metadata files.

For instance, given a project “Maceio Finance” in a repository on BitBucket, I created the following build.gradle file to build the project, generate the jars, declare dependencies on the local scm repository, and upload new versions to it.
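A hedged sketch of such a build.gradle, reusing the properties file above (all names and the file:// repository path are illustrative, not the original project's values):

```groovy
// build.gradle (sketch) for a project publishing into a git-tracked Maven repo
apply plugin: 'java'
apply plugin: 'maven'
apply from: 'mavenProperties.gradle'   // assumed shared properties script

group = mvnGroup
version = mvnVersion

repositories {
    // the version-controlled repo directory doubles as a dependency source
    maven { url scmRepoUrl }
}

uploadArchives {
    repositories.mavenDeployer {
        // deploys jars with pom and checksum metadata into the git-tracked directory
        repository(url: scmRepoUrl)
    }
}
```

After `gradle uploadArchives` (with or without `-Prelease`), the new artifacts can be committed and pushed with git.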

The simplicity of Gradle uploadArchives

Running the task “uploadArchives” with the different switches results in the same expected behavior described by the blog, as shown in the output below.

The generated directories with the Maven metadata files are shown below:

Now, if you just want the same jars copied to a given version-controlled directory WITHOUT the metadata files, you can use the “installThirdparty” task described above.

Finally, the other piece of support needed is a Maven server. We use the open-source Nexus server. Here’s an example of uploading the same contents from the configuration. Note that it reuses the property definitions from mavenProperties.gradle.
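A hedged sketch of an uploadArchives block pointing at Nexus hosted repositories (the server URL, repository paths, and credential property names are assumptions for illustration):

```groovy
// build.gradle (sketch): deploy to a Nexus server instead of the local scm repo
uploadArchives {
    repositories.mavenDeployer {
        // illustrative Nexus hosted-repository URLs; credentials assumed to be
        // provided as project properties (e.g. in ~/.gradle/
        repository(url: "http://nexus.example.com/content/repositories/releases") {
            authentication(userName: nexusUser, password: nexusPassword)
        }
        snapshotRepository(url: "http://nexus.example.com/content/repositories/snapshots") {
            authentication(userName: nexusUser, password: nexusPassword)
        }
    }
}
```

With the “-Prelease” switch from mavenProperties.gradle, the version suffix decides whether the artifact lands in the releases or the snapshots repository.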

Gotta love gradle properties

Just as a reference, the output of the command “gradle properties” is a good way to see ALL the project variables at runtime. Here’s the output of that command; take a look at the property “projectMvn” in particular.


Running EMMA Code / Test Coverage with JAVA 7 and Gradle 1.0M9

I hit a huge roadblock while trying to integrate EMMA code coverage with the latest version of Gradle today… I found this blog post that helped me start toward a solution…

It has been 2 years since that post was last updated (although some users have commented with other solutions), but as the Gradle DSL and Java are still evolving, that version did not work. One common problem is related to Java 7’s new bytecode verification. EMMA, Cobertura and others face the same problem; just adding the JVM argument “-XX:-UseSplitVerifier” solved it for me with EMMA.

Well, after spending a few hours trying to get the previous patch working, I got a working solution running test coverage with EMMA on the latest Gradle with Java 7.

First, add the emma configuration and its dependencies.


configurations {
  emma
}

dependencies {
  // EMMA Code Coverage
  emma "emma:emma:2.1.5320"
  emma "emma:emma_ant:2.1.5320"
  testCompile group: 'junit', name: 'junit', version: '4.9'
}

Then, update the test task by adding the doFirst{} and doLast{} closures below.

test {
    // add EMMA related JVM args to our tests
    jvmArgs "-XX:-UseSplitVerifier", "-Demma.coverage.out.file=$buildDir/tmp/emma/metadata.emma", "-Demma.coverage.out.merge=true"

    doFirst {
       println "Instrumenting the classes at " + sourceSets.main.output.classesDir.absolutePath
       // define the custom EMMA ant tasks
       ant.taskdef(resource: "", classpath: configurations.emma.asPath)

       // classpath reference consumed by the instr task below
       ant.path(id: "run.classpath") {
          pathelement(location: sourceSets.main.output.classesDir.absolutePath)
       }
       def emmaInstDir = new File(sourceSets.main.output.classesDir.parentFile.parentFile, "tmp/emma/instr")
       println "Creating $emmaInstDir to instrument from " + sourceSets.main.output.classesDir.absolutePath
       // instrument our compiled classes and store them at $buildDir/tmp/emma/instr
       ant.emma(enabled: 'true', verbosity: 'info') {
          instr(merge: "true", destdir: emmaInstDir.absolutePath, instrpathref: "run.classpath",
                metadatafile: new File(emmaInstDir, '/metadata.emma').absolutePath) {
             instrpath {
                fileset(dir: sourceSets.main.output.classesDir.absolutePath, includes: "**/*.class")
             }
          }
       }
       // make sure the instrumented classes come first in the test classpath
       setClasspath(files("$buildDir/tmp/emma/instr") + configurations.emma + getClasspath())
    }

    // The report is generated directly after the tests are done.
    // We create three types (txt, html, xml) of reports here. Running your build script now should
    // result in output like that shown further below.
    doLast {
       def srcDir = sourceSets.main.java.srcDirs.iterator().next()
       println "Creating test coverage reports for classes " + srcDir
       def emmaInstDir = new File(sourceSets.main.output.classesDir.parentFile.parentFile, "tmp/emma")
       ant.emma(enabled: "true") {
          new File("$buildDir/reports/emma").mkdirs()
          report(sourcepath: srcDir) {
             fileset(dir: emmaInstDir.absolutePath) {
                include(name: "*.emma")
             }
             txt(outfile: "$buildDir/reports/emma/coverage.txt")
             html(outfile: "$buildDir/reports/emma/coverage.html")
             xml(outfile: "$buildDir/reports/emma/coverage.xml")
          }
       }
       println "Test coverage reports available at $buildDir/reports/emma."
       println "txt: $buildDir/reports/emma/coverage.txt"
       println "Test $buildDir/reports/emma/coverage.html"
       println "Test $buildDir/reports/emma/coverage.xml"
    }
}

You can run the updated build from gradle as follows:

marcello@hawaii:/u1/development/workspaces/open-source/interviews/vmware$ gradle test
:processResources UP-TO-DATE
:processTestResources UP-TO-DATE
Instrumenting the classes at /u1/development/workspaces/open-source/interviews/vmware/build/classes/main
Creating /u1/development/workspaces/open-source/interviews/vmware/build/tmp/emma/instr to instrument from /u1/development/workspaces/open-source/interviews/vmware/build/classes/main
Creating test coverage reports for classes /u1/development/workspaces/open-source/interviews/vmware/src/main/java
Test coverage reports available at /u1/development/workspaces/open-source/interviews/vmware/build/reports/emma.
txt: /u1/development/workspaces/open-source/interviews/vmware/build/reports/emma/coverage.txt
Test /u1/development/workspaces/open-source/interviews/vmware/build/reports/emma/coverage.html
Test /u1/development/workspaces/open-source/interviews/vmware/build/reports/emma/coverage.xml


Writing Functional Tests on Groovy on Grails: Experiences from CollabNet Subversion Edge.

October 30, 2010

I first wrote this technical document for the open-source project CollabNet Subversion Edge on how to design and implement functional tests using Groovy on Grails; therefore, this documentation can also be reached from the project itself.

Introduction and Setup

The CollabNet Subversion Edge functional tests are based on the Grails plugin “Functional Tests”. But before you get started with them, make sure you have covered the following required steps:

Besides the unit and integration tests used during development, the source code already contains the functional-tests plugin support and some test classes, as shown in the Eclipse “Project Explorer” view. In the file system, the files are located at CSVN_DEV/test/functional, where CSVN_DEV is the directory where you previously checked out the source code. That set of test cases is the last one run in our internal Continuous Integration server (Hudson), and it’s usually a good place to find bugs related to user-facing features during development.


Functional Tests Basics

This section covers the basics of the functional-tests infrastructure in Subversion Edge and assumes you are already familiar with the Grails Functional Tests plugin documentation. The plugin is already installed in the Subversion Edge development project, so you can use the usual commands to run a functional test and visualize the test results and reports. The test cases run as RESTful calls to the controllers defined by the application, but they can also use a plain URL. For instance:
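A minimal illustration of both styles inside a test method (the paths and the expected string are illustrative; get() and the assert* helpers come from the Functional Tests plugin):

```groovy
// inside a functional test case (sketch)
void testStatusPage() {
    // relative path: resolved against the application context (/csvn)
    get('/status/index')
    assertStatus 200

    // the full-URL form works as well
    get('http://localhost:8080/csvn/status/index')
    assertStatus 200
    // verify the response payload via this.response / the assert* helpers
    assertContentContains 'Subversion Edge'   // illustrative expected string
}
```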

After the execution of an HTTP method wrapper such as “get()” or “post()”, any test case has access to the response object “this.response” with the HTML payload. Grails uses this object to execute any of the documented “assert*” methods.

Another important piece of configuration is CSVN_DEV/grails-app/conf/Config.groovy. Although Grails uses the closure “environment.test”, SvnEdge uses the general closure “svnedge” during the development and test phases and, therefore, values from that closure’s configuration are accessible from the test classes.
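For instance, a value defined under the svnedge closure in Config.groovy can be read in a test through the config helper (the property name and value here are illustrative):

```groovy
// Config.groovy (sketch)
svnedge {
    svnServerPort = 18080   // illustrative setting
}

// in a functional test class, via the getConfig()/config helper:
def port = config.svnedge.svnServerPort
```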

Functional tests infrastructure

Considering you have your development infrastructure set up, you will find the current implementation of the functional tests in the directory “CSVN_DEV/tests/functional”. Notice that the directory structure follows the Java convention for declaring packages, and it has already been configured as a source directory in the current Eclipse “.classpath” artifact on trunk.


We have created the following convention for defining the packages and functional test classes:

  • com.collabnet.svnedge

The package containing the major abstract classes, created to maximize code reuse while aggregating reusable patterns throughout the entire test infrastructure. The reusable utility methods were extracted during the first iteration of the development of the SvnEdge functional tests. For instance, the configuration keys from “CSVN_DEV/grails-app/conf/Config.groovy” can be easily accessed from the test cases using the method “getConfig()” or just “config”. Similarly, the i18n keys can be read by calling “getMessage(‘key’)”, where “key” is one of the keys in the messages bundle that renders the strings displayed in the user interface. Note that the English version of the i18n messages is used in the functional tests. Moreover, each abstract class has its own intent, matching the scenarios found in the Subversion Edge functionality:

  1. AdminLoggedInAbstractSvnEdgeFunctionalTests: test class that sets up the test case with the user “admin” already logged in (“/csvn/status/index”).
  2. LoggedOutAbstractSvnEdgeFunctionalTests: test class that starts the application at the auth index page where a user can log in (“/csvn/auth/index”).
  • com.collabnet.svnedge.console

The test cases related to the web console, that is, Subversion Edge itself. Each component must have its own package. For instance, take the names of the controllers as the names of the components to be tested, such as “user” and “repo”; they should have their own test packages, “com.collabnet.svnedge.console.ui.user” and “com.collabnet.svnedge.console.ui.repo”, respectively. The only classes implemented at this time are the login tests.

  • com.collabnet.svnedge.teamforge

The test cases related to the TeamForge integration. As you will see, there is only one abstract class and two functional test classes covering the conversion process: when the server has repositories to be imported (Full Conversion) and when the server does not have any repository created (Fresh Conversion). The latter case is a bit tricky, as the SvnEdge environment defines a fresh conversion when its database does not have any repository defined. Consequently, test cases related to repositories need to make sure to “discover” repositories if the intent is to verify the existence of repositories in the file system.

Running Functional Tests

As described in the Grails functional tests “mini bible”, the only thing needed to run a functional test case is the following command from the CSVN_DEV directory:

grails test-app -functional OPTIONAL_CLASS_NAME

The command will start the currently installed version of Grails using the functional-tests environment. If you don’t provide the optional parameter “OPTIONAL_CLASS_NAME”, grails executes all the functional tests defined. However, since executing all the current test classes takes more than 10 minutes, use the complete name of the test class (package name + class name, without the suffix “Tests”). For instance, the following command executes the functional tests implemented in the class LoginFunctionalTests:

grails test-app -functional com.collabnet.svnedge.console.ui.LoginFunctional

The command selects the test suite class “CSVN_DEV/tests/functional/com/collabnet/svnedge/console/ui/LoginFunctionalTests.groovy” for execution, and the output identifies the environment and the location where the test reports will be saved. The recommendation here is to keep using the Eclipse STS infrastructure to save your command executions, as shown below.


As shown below, the output of the functional-tests execution is the same whether you run the tests from the command line or from the Eclipse command, as shown in the output view. The tests save their output logs and reports in the directory “CSVN_DEV/target/test-reports”.

Welcome to Grails 1.3.4 -
Licensed under Apache Standard License 2.0
Grails home is set to: /u1/svnedge/replica_admin/grails/grails-1.3.4/

Base Directory: /u1/development/workspaces/collabnet/svnedge-1.3.4/console
Resolving dependencies...
Dependencies resolved in 1565ms.
Running script /u1/svnedge/replica_admin/grails/grails-1.3.4/scripts/TestApp.groovy
Environment set to test
    [mkdir] Created dir: /u1/development/workspaces/collabnet/svnedge-1.3.4/console/target/test-reports/html
    [mkdir] Created dir: /u1/development/workspaces/collabnet/svnedge-1.3.4/console/target/test-reports/plain

Starting functional test phase ...

Once the functional-tests execution finishes, the test reports are written and can be accessed using a web browser. The following snippet shows the result of running the test case started above: how long it took Grails to execute the 4 test cases defined in the LoginFunctionalTests test suite, the stats of how many tests passed or failed, and the location of the test reports along with the final result of PASSED or FAILED. Note that the directory “target/test-reports” is relative to the directory “CSVN_DEV”, as described above.

Tests Completed in 12654ms ...
Tests passed: 4
Tests failed: 0
2010-09-28 12:19:11,334 [main] INFO  /csvn  - Destroying Spring FrameworkServlet 'gsp'
2010-09-28 12:19:11,350 [main] INFO  bootstrap.BootStrap  - Releasing resources from the discovery service.
2010-09-28 12:19:11,350 [main] INFO  bootstrap.BootStrap  - Releasing resources from the Operating System service.
2010-09-28 12:19:11,352 [main] INFO  /csvn  - Destroying Spring FrameworkServlet 'grails'
Server stopped
[junitreport] Processing /u1/development/workspaces/collabnet/svnedge-1.3.4/console/target/test-reports/TESTS-TestSuites.xml
                  to /tmp/null1620273079
[junitreport] Loading stylesheet jar:file:/home/mdesales/.ivy2/cache/org.apache.ant/ant-junit/jars/ant-junit-1.7.1.jar
[junitreport] Transform time: 2339ms
[junitreport] Deleting: /tmp/null1620273079

Tests PASSED - view reports in target/test-reports
Application context shutting down...
Application context shutdown.

Accessing the Test Results Report

Once the execution terminates, you have access to the test reports. This is where you will find all the answers about the test results, including detailed information on the entire HTTP payload transmitted between the SvnEdge server and the browser emulator that the functional tests use. As shown below, the location of the test reports is highlighted as a hyperlink to the index page of the reports. Clicking on it opens Eclipse’s built-in browser view with the reports.


This report is generated per execution and is therefore deleted before each new run. In case you need to keep the information from a test run, copy the contents of the directory “CSVN_DEV/target/test-reports”; you will find reports in both HTML and XML. The report for each test suite includes the list of test cases run, with the status and execution time of each. The report includes three main outputs:

  • Properties: system properties used.
  • System.out: the standard output of the process; the same output printed in the grails output, but better organized.
  • System.err: the output of the standard error of the process.

The most used output is System.out. Clicking on this hyperlink takes you to the organized output of the traffic, highlighting the HTTP headers, HTTP body, redirects, test assertions and test results.

Identifying Test cases report scope

The link to the System.out output is the most important and the most used throughout the development of a test case, as the output of the execution of each test case is displayed in this area.


Each test case has its own test-result scope, and you can easily identify the start of a test case’s execution by the key “Output from TEST_CASE_NAME”, where “TEST_CASE_NAME” is the name of the method that defines the test case. For instance, the log for the execution of the LoginFunctionalTests test cases includes the following strings:

--Output from testRootLogin--
--Output from testRegularLogin--
--Output from testDotsLogin--
--Output from testFailLogin--

The output of the HTTP request of a test case starts with “>>>>>”, as shown below:

>>>>>>>>>>>>>>>>>>>> Making request to / using method GET >>>>>>>>>>>>>>>>>>>>
Initializing web request settings for http://localhost:8080/csvn/
Request parameters:
Request headers:
Accept-Language: en
Accept: */*

On the other hand, the output of the HTTP response of a test case starts with “<<<<<<”, as shown below. The HTTP response header parameters are printed so that anything used by the test cases can be verified. Note that the following access to the “/csvn/” root context results in an HTTP redirect to the “Login Page”, identified by the context “/csvn” and the controller “/login/auth”; therefore, no “Content” is available.

<<<<<<<<<<<<<<<<<<<< Received response from GET http://localhost:8080/csvn/ <<<<<<<<<<<<<<<<<<<<
Response was a redirect to
  http://localhost:8080/csvn/login/auth;jsessionid=hueqpw5eaq32 <<<<<<<<<<<<<<<<<<<<
Response was 302 'Found', headers:
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Set-Cookie: JSESSIONID=hueqpw5eaq32;Path=/csvn
Location: http://localhost:8080/csvn/login/auth;jsessionid=hueqpw5eaq32
Content-Length: 0
Server: Jetty(6.1.21)


#Following redirect to http://localhost:8080/csvn/login/auth;jsessionid=hueqpw5eaq32
>>>>>>>>>>>>>>>>>>>> Making request to http://localhost:8080/csvn/login/auth;jsessionid=hueqpw5eaq32
 using method GET >>>>>>>>>>>>>>>>>>>>

If the HTTP response contains a body payload, it is output as-is in the Content section:

<<<<<<<<<<<<<<<<<<<< Received response from
  GET http://localhost:8080/csvn/login/auth;jsessionid=hueqpw5eaq32 <<<<<<<<<<<<<<<<<<<<
Response was 200 'OK', headers:
Expires: -1
Cache-Control: no-cache
max-age: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: private
Content-Type: text/html; charset=utf-8
Content-Language: en
Content-Length: 4663
Server: Jetty(6.1.21)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
<html xmlns="" lang="en" xml:lang="en">

    <title>CollabNet Subversion Edge Login</title>
    <link rel="stylesheet" href="/csvn/css/styles_new.css"
    <link rel="stylesheet" href="/csvn/css/svnedge.css"
    <link rel="shortcut icon"

Whenever a test case fails, the error message is output as follows:

"functionaltestplugin.FunctionalTestException: Expected content to loosely contain [] but it didn't"

Searching deeper in the raw output for the string “Expected content to loosely contain … but it didn’t”, you can see which HTML output was used in the evaluation of the test case. Sometimes an error is related to the current UI or to an external test verification. This specific one is related to the TeamForge integration, as the test server did not have the expected user in the list of users.

Failed: Expected content to loosely contain [] but it didn't

Writing New Functional Test Suites

This section describes how to create test suites using the Functional Tests plugin. In order to maximize code reuse, we defined a set of abstract classes that can be used for specific types of tests, as shown in the diagram below. Instead of each test case extending the regular class “functionaltestplugin.FunctionalTestCase”, we created a more general abstract class, “AbstractSvnEdgeFunctionalTests”, to provide general access to the configuration artifact, internationalization (i18n) message keys, among others. In addition to the infrastructural utility methods, the main abstract SvnEdge test class contains a set of often-used methods such as “protected void login(username, password)”, which is responsible for trying to log in to SvnEdge with a given “username” and “password”. The result of the command can then be verified in the body of the implementing class; more details later in this section. First, any test will extend one of the test scenario classes: “AdminLoggedInAbstractSvnEdgeFunctionalTests” or “LoggedOutAbstractSvnEdgeFunctionalTests”. However, the test cases for the conversion process needed a specialized abstract class, “AbstractConversionFunctionalTests”, which is of type “AdminLoggedInAbstractSvnEdgeFunctionalTests” because only the admin user can perform the conversion process.


As shown in the UML class diagram above, AbstractSvnEdgeFunctionalTests extends the Grails functional test class, so it inherits all the basic assertion methods from JUnit and Grails. The class is shown in RED because it is a “PROHIBITED” class; that is, no classes other than the GREEN ones should directly extend the RED class. In fact, the test implementation in Subversion Edge only has 2 different types of tests and, therefore, new test cases should only inherit from “AdminLoggedInAbstractSvnEdgeFunctionalTests” or “LoggedOutAbstractSvnEdgeFunctionalTests”. Similarly, additional functional tests verifying other scenarios of the conversion process have to inherit the behavior of the abstract class “AbstractConversionFunctionalTests”.

Basic Abstract Classes

As described in the previous sections, the two major types of test cases are those where the admin user is logged in and those where no user is logged in. Tests that require different users to log in can use the latter test class to perform the login and navigate through the UI. Before continuing, it is important to note that the functional-tests implementation is based on JUnit, using the JUnit 3.x method naming conventions. For instance, the methods “protected void setUp()” and “protected void tearDown()” are called before and after running each test case defined in a test class. Furthermore, it is also important to call the super implementation of each of these methods because of the dependency on the Grails infrastructure. Take a look at the following JavaDocs to get an idea of the basic utility methods implemented in each of them.

Just as a reminder: upon executing the test cases defined in a class, JUnit executes the method “setUp()”. If any failure occurs there, Grails will fail not only the first test case, but ALL the test cases defined in the test suite, because “setUp()” is executed before each test case. Once the execution of a given test case is finished, the method “tearDown()” is executed; any failure in this method also causes ALL test cases to fail.

The abstract classes are defined to give the implementing concrete classes access to all the important features for the test cases. As mentioned earlier, utility methods to access the configuration properties and internationalization (i18n) messages are provided. In addition, convenient assertion helpers are also implemented in the abstract classes. The next sections provide in-depth details on the implementation of the test suites.

Concrete Functional Tests Suites Implementation

The simplest implementation of the functional tests is the LoginFunctionalTests class used as an example before. However, executing the scenario manually against the production version is the first recommended step before writing any code: you need to collect information about the scenario to be executed, choose the UI elements to use in your test case, etc. For instance, consider the login scenario of a user with a wrong username. By default, the development and test environments of Subversion Edge are bootstrapped with different users such as “admin” and “user”. Consider a scenario where an attempt to log in with a wrong username, “marcello”, is performed; the result is shown in the screenshot below:


The test case shows that, upon entering a wrong username and password, an error message is shown while the server responds with a complete and correct page (HTTP response code 200), even though an error occurred during the execution of the scenario. Based on that information, the automated tests can be written in the test suite to verify the possible test cases for the different users in SvnEdge, including the wrong-input case. Note that each test case delegates the procedures to be verified to the super class through a call to a method “testUserLogin”, whereas testFailLogin() is the only implementation located in LoginFunctionalTests itself. Other abstract and concrete test classes are shown in the UML class diagram below. Note that the YELLOW classes are the concrete classes that extend the functionality of the abstract classes.


  • LoginFunctionalTests: the concrete functional test suite that verifies the login for each of the different usernames, as well as the failure tests.
package com.collabnet.svnedge.console.ui

import com.collabnet.svnedge.LoggedOutAbstractSvnEdgeFunctionalTests;

class LoginFunctionalTests extends LoggedOutAbstractSvnEdgeFunctionalTests {

    protected void setUp() {
        super.setUp()
        // additional setup elided
    }

    protected void tearDown() {
        // additional cleanup elided
        super.tearDown()
    }

    void testRootLogin() {
        // body elided; delegates to the login helpers in the super class
    }

    void testRegularLogin() {
        // body elided
    }

    void testDotsLogin() {
        // body elided
    }

    void testFailLogin() {
        this.login("marcello", "xyzt")
        assertContentContains getMessage("user.credential.incorrect",
            ["marcello"] as String[])
    }
}

The point is that the methods “loginAdmin()”, “loginUser()”, etc., are implemented in AbstractSvnEdgeFunctionalTests to allow code reuse in other test classes; consequently, the test case “testFailLogin()” uses the basic method “AbstractSvnEdgeFunctionalTests.login(username, password)” to verify a user that does not exist. Also note that the verification of the login scenario is as simple as checking whether a given string exists in the resulting HTTP response output. For instance, when attempting to log in with a user that does not exist, the error message shown is the one stored in the messages bundle under the key “user.credential.incorrect”, and the method “getMessage()” is the helper implemented in the class AbstractSvnEdgeFunctionalTests.

Another important thing to keep in mind is code convention. The names of test cases are written in camelCase, prefixed with the keyword “test”. A test-case name can be as long as needed; the most important point is that the method name must be coherent with the steps being performed. Also, note that Groovy accepts a more relaxed invocation notation, which makes the code easier to read:

        // JAVA method invocation Notation
        this.login("marcello", "xyzt")

        // GROOVY method invocation Notation
        this.login "marcello", "xyzt"

When it comes to the real implementation of a given scenario, you have to constantly refer to the Grails Functional Tests documentation, and that’s where you will find your “best friends”. Yes!!! Your best friends: the assert methods that will help you verify the results of the HTTP response. But first, let’s take a look at the implementation of the basic methods that perform “login” and “logout”. As we know from the definition of the abstract classes, each time a method from a class that extends “LoggedOutAbstractSvnEdgeFunctionalTests” is executed, the setUp() method inherited from that class is executed first.

public abstract class LoggedOutAbstractSvnEdgeFunctionalTests extends AbstractSvnEdgeFunctionalTests {

    protected void setUp() {
        //The web framework must be initialized.
        super.setUp()

        //Request the root context and verify the server is up.
        get('/')
        assertStatus 200

        //If a user is still logged in from a previous test, log out.
        if (this.response.contentAsString.contains(
                getMessage(""))) {
            logout()
        }
    }

    protected void tearDown() {
        //Stop Svn Server in case it is running
        stopSvnServer()

        //Log out in case a test case left a user logged in.
        get('/')
        logout()

        //The tear down method terminates all the web-related objects, and
        //therefore, must be performed at the end of the operation.
        super.tearDown()
    }
}

Note that the implementation of the concrete classes MUST call super.setUp() first, so that it executes the steps they depend on. As you can see in the class implementation above, the method setUp() first makes a request to “/”, that is, “http://localhost:8080/csvn/”, since the RESTful method “get()” uses the base URL plus the context name “/csvn”. Then, the first assertion is important to verify that the server is up and running, as well as that the request did not return any error in the UI. Bookmark RFC 2616 and use the HTTP response status codes as required. The default one to verify is “200”, even when the scenario results in an error message, as in the test case “LoginFunctionalTests.testFailLogin()”. Finally, after verifying that the status code is as expected, the test uses the “response” object to verify that the HTML content contains the string identified by the corresponding key in the i18n artifact “CSVN_DEV/grails-app/i18n/”.

Following the way JUnit implements the test execution cycle, the method “tearDown()” is executed right after each “testXYZ()” method. In our case, there are a few steps to be verified before terminating the test case. The HTTP server might have been started during a test case, and therefore the method “stopSvnServer()” is called; this is deliberately placed in the “highest” abstract class because any type of test case might start the HTTP server from the status page. After an HTTP request to “/” is performed, the output is verified and, if necessary, the method “logout()” is executed, as implemented in the abstract class “AbstractSvnEdgeFunctionalTests”. That is, if the HTML code from the response object contains the string identified by the logout key (LOGOUT), the test clicks the “LOGOUT” link. Then, it asserts that the HTTP response status equals 200 and that the content contains the “Login” header string identified by its i18n key.

    /**
     * Performs the logout by clicking on the link.
     */
    protected void logout() {
        def logout = getMessage("")
        if (this.response.contentAsString.contains(logout)) {
            click logout
        }
    }
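The setUp()/tearDown() chaining described above can be sketched as follows. This is a hypothetical illustration in plain JavaScript (runnable in node) of the inheritance pattern only; the real tests are Groovy classes on the Grails functional-test plugin, and the class names and step strings here are invented for the sketch.

```javascript
// Hypothetical sketch of the setUp() chaining: the concrete class MUST
// call super.setUp() first so the base-class steps run before its own.
class AbstractSvnEdgeFunctionalTests {
  constructor() { this.steps = []; }
  setUp() {
    // the "highest" abstract class hits "/" and asserts the server is up
    this.steps.push('GET / and assert status 200');
  }
}

class LoginFunctionalTests extends AbstractSvnEdgeFunctionalTests {
  setUp() {
    super.setUp(); // MUST run first, so the depending steps execute
    this.steps.push('navigate to the login page');
  }
}

const lifecycle = new LoginFunctionalTests();
lifecycle.setUp();
console.log(lifecycle.steps);
// [ 'GET / and assert status 200', 'navigate to the login page' ]
```

The same pattern applies to tearDown(): the base class cleans up server-wide state, and subclasses add their own steps around the super call.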

Similarly, test cases that perform login will essentially fill out the login form and click on the button “Log in”. The basic implementation of the method “login(username, password)” is shown below. An HTTP GET request to the page “/login/auth” is performed, followed by an assertion on the status code. Then, if the test environment kept the user logged in as a result of a failure in a previous test case, the test verifies whether the user is logged in and, if so, calls the method “logout()” shown above. Finally, when the user is on the front page, the login form is filled out with the correct values. Please refer to the “Grails Functional Tests Documentation” for details on how to fill out and submit form fields; it should be straightforward. The only detail needed is to capture the name of the form defined in the HTML code. A good helper is to use Google Chrome or the Firefox “Web Developer” plugin to capture the UI element “ids”. Specifically for the form submission, the ID of the form and the “id”s of the form fields are necessary. Then, the label value of the SUBMIT button is needed and, as shown in the code below, that string is located by the key “”.


    protected void login(username, password) {
        get('/login/auth')
        assertStatus 200

        if (this.response.contentAsString.contains(
                getMessage(""))) {
            logout()
        }
        def login = getMessage("")
        form('loginForm') {
            j_username = username
            j_password = password
            click login
        }
    }

It is extremely important to note here a very tricky problem when it comes to “clickable” items in the UI. Since we are using a mix of Grails GSP tags and some CSS styles from TeamForge, Grails creates buttons differently for forms and for places without an HTML form entity. Whenever a form is generated by Grails, a submit button like the “login” one shown above will only respond to the command “click LABEL” inside the form() closure. On the other hand, for buttons outside an HTML form, the command “click LABEL” will only perform its action when declared outside the form() closure. Different examples of these GOTCHAS were found while the conversion tests were being written.

To summarize the steps to automate manual tests with corresponding Functional Tests, the suggested steps are as follows:

  1. Perform the test scenario manually and gather the necessary information about the user interface, choosing unique elements that are present in the resulting action. For the case of login, the verification of the string “Logged in as:” is performed. For tests exploring failures and error messages, choose to assert the existence of those error messages.
  2. Once you are familiar with how the scenario behaves, create the main test case suite by extending one of the GREEN abstract classes in the UML class diagram shown above. Choose names related to the component.
  3. Promote code reuse by implementing new methods in AbstractSvnEdgeFunctionalTests if necessary, or if other components will use the same implementation. If not, keep the implementation in the test class being developed.
  4. Add JavaDocs to the methods that are going to be inherited or are difficult to understand. Try documenting the method execution before writing the test case, as you will understand the scope of the test better. The next section provides a good understanding of how to write that supporting documentation.

Advanced Functional Tests Techniques

Once you get used to writing automated test cases, you should be able to implement complex test cases that involve not only the local Subversion Edge server, but also external servers such as the TeamForge server used during the tests of the conversion process. Don’t forget to document the steps in a structured way inside the JavaDocs, as documentation makes it easier to understand the purpose of the tests later.

Note that the JavaDocs of the classes contain a more detailed specification of the execution of the test cases. For example, sentences starting with “Verify” are related to the assertions necessary to verify the test case, while “Go to” sentences are related to the HTTP request method “get()”. Each of the sections is identified so that the implementations of the methods setUp(), tearDown(), and the actual test method are explicitly written in Groovy. The source code has a more detailed implementation of the test cases.

Test Case 1: Successful conversion to TeamForge Mode

   * SetUp
        * Login to SvnEdge
        * Revert to Standalone Mode in case it is in TeamForge Mode

   * Steps to reproduce
         * Go to the Credentials Form
         * Enter correct credentials and existing CTF URL and try to convert;

   * Expected Results
         * Successful conversion message is shown
         * Login -> Logout as admin
         * Verify that the server is on TeamForge mode;
         * Login to CTF server and verify that the system ID
            from the SvnEdge server is listed on the list of integration servers

   * Tear Down
         * Revert conversion if necessary
         * Logout from the SvnEdge server

The implementation of complex test cases might require verification of different properties of local and external resources. The conversion process was the first challenge of this nature we had to implement. The following code snippet is the implementation of the assertions of the conversion as the expected results. Note that custom assertion methods were written to support this implementation (“assertProhibitedAccessToStandaloneModeLinksWorks()” and “assertConversionSucceededOnCtfServer()”).

    /**
     * Verify that the state of the conversion is persisted:
     * <li>The local server shows the TeamForge URL
     * <li>The CTF server shows the link to the server. This can be verified
     * by the current system ID on the list of integration servers.
     */
    protected void assertConversionSucceeded() {
        // Step 1: verify that the conversion is persistent
        assertStatus 200

        assertStatus 200

        // verify that the software version is still shown

        assertStatus 200

        // verify that prohibited links work
        assertProhibitedAccessToStandaloneModeLinksWorks()

        // Step 2: verify that the CTF server DOES list the system ID
        assertConversionSucceededOnCtfServer()
    }

Using the response object

As seen in some of the examples, assertions are the way to verify that a given expected value exists in the HTTP response payload received from the server. However, whenever the test case needs to make a decision based on the contents of the response, you can access the response object directly. For instance, instead of failing a test that needs to have the user logged out, the following code snippet verifies whether the user is logged in and then performs the logout procedure. The same logic can be applied in different scenarios, such as verifying whether the server is started/stopped via the status page button, or verifying whether any repositories already exist in the file system before creating a new test repository.

        if (this.response.contentAsString.contains(getMessage(""))) {
            logout()
        }

Dealing with external resources

The nature of Subversion Edge requires integration with TeamForge, so how about testing the state of both systems in the same test case? Considering the Grails plugin allows external HTTP requests during tests, why not perform the same steps an admin would do to verify the state of the server? This was a bit tricky, but works like a charm. As we had designed before, reusing the configuration was the first step to define which remote TeamForge server to use during tests. Then, the test case could take care of generating the URL for the CTF server based on the configuration parameters during the tests of the conversion. Here’s the closure in the file “CSVN_DEV/grails-app/conf/Config.groovy” in which one can change which TeamForge server to use (svnedge.ctfMaster).

    ctfMaster {
        ssl = false
        domainName = ""
        username = "admin"
        password = "admin"
        port = 80
        systemId = "exsy1002"
    }
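To show how these parameters combine, here is a small, hypothetical sketch of building the CTF base URL from the ctfMaster block above. The real Groovy tests use a makeCtfUrl() helper; the JavaScript function and the example domain name below are assumptions for illustration only.

```javascript
// Hypothetical sketch: derive the CTF base URL from the ctfMaster
// config fields shown above (ssl, domainName, port). Plain JavaScript;
// the domain name is a made-up example.
function makeCtfUrl(ctfMaster) {
  const scheme = ctfMaster.ssl ? 'https' : 'http';
  const defaultPort = ctfMaster.ssl ? 443 : 80;
  const port = ctfMaster.port === defaultPort ? '' : ':' + ctfMaster.port;
  return scheme + '://' + ctfMaster.domainName + port;
}

console.log(makeCtfUrl({ ssl: false, domainName: 'ctf.example.com', port: 80 }));
// http://ctf.example.com
```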

Taking a closer look at what we needed, this is related to the assertions for the last expected result: “Login to CTF server and verify that the system ID from the SvnEdge server is listed on the list of integration servers”. The translation of this sentence into Groovy code originated the method “AbstractConversionFunctionalTests.assertConversionSucceededOnCtfServer()”, as the steps to perform this assertion are used by all the different scenarios. As implemented, the first step requires that the login to TeamForge take the user to the administration page “List Integrations”, using the method “this.goToCtfListIntegrationsPage()”, before verifying whether the system ID saved by the conversion process exists on that page. However, observing how the HTTP request flow in TeamForge works was necessary to understand the forwards after the user is logged in. The method “loginToCtfServerIfNecessary()” was then implemented with all the needed values from both the Grails Config.groovy and the environment. As warned before, the clickable elements of forms can differ between Subversion Edge and TeamForge, and therefore the Grails command “click LABEL” was used here outside the form closure. Finally, don’t be tempted to verify strings in TeamForge using i18n keys, as they are different and Subversion Edge does not have direct access to them. Prefer validating steps using form elements or IDs produced by TeamForge, as the UI can change on the remote server.

    /**
     * Verifies that the CTF server lists the current SvnEdge server's
     * system ID.
     */
    protected void assertConversionSucceededOnCtfServer() {
        this.goToCtfListIntegrationsPage()
        this.loginToCtfServerIfNecessary()

        assertContentContains("Site Administration")
        assertContentContains("SCM Integrations")
        def appServerPort = System.getProperty("jetty.port", "8080")
        def csvnHostAndPort = server.hostname + ":" + appServerPort

        // TeamForge removes any double-quotes (") submitted via the SOAP API.
        assertContentContains("This is a CollabNet Subversion Edge server in " +
            "managed mode from ${csvnHostAndPort}.")
    }

    /**
     * Goes to the list of integrations on the CTF server.
     */
    private void goToCtfListIntegrationsPage() {
        // Goes to the list integrations page
        get(this.makeCtfUrl() + "/sf/sfmain/do/listSystems")
    }

    /**
     * Logs in to the CTF server from a given point that connects to the
     * server. In case the response content DOES NOT contain the string
     * "Logged in as", then make the login. The resulting page is the
     * redirected page requested earlier.
     */
    private void loginToCtfServerIfNecessary() {
        if (!this.response.contentAsString.contains("Logged in as")) {
            assertStatus 200
            def ctfUsername = config.svnedge.ctfMaster.username
            def ctfPassword = config.svnedge.ctfMaster.password
            form("login") {
                username = ctfUsername
                password = ctfPassword
            }
            // the button is a link instead of a form button. Use it outside
            // the form closure.
            click "Log In"
            assertStatus 200
        }
    }

Test Case Suites Needed

A few test cases have been written for specific functionalities of the application. However, here are some of the test cases that can still be developed.

* User Functional Tests
- Create User of each type
  - Login/Logout
  - Verify access to prohibited URLs
  - Access SVN and ViewVC Pages
- List Users
- Delete User
- Change User password
  - Logout and login with the new password.
  - Access SVN and ViewVC pages with new password
- View Self page
- Try changing the server settings, accessing other admin sections

* Repos Functional Tests
- Create Repo
- Discover Repos
- List Repos
- Edit Access Rules
  - Login with users without access to specific repos

* Statistics Functional Tests
- Access the pages for statistics

* Administration Functional Tests
- Changing server settings as Admin
- Changing the server Authentication settings
  - Login / Logout and verify changes.
  - Restart server after changing settings

* Server Logs Functional Tests
- Change log level
- View log Files
- View non-existing file
- View existing file

* Packages Update Functional Tests
- Update the software packages
- Convert the server and try to update the server to new version

If you have any questions regarding the Functional Tests specification, please don’t hesitate to send an email to

Marcello de Sales – Software Engineer – CollabNet, Inc.

Categories: java, Subversion

BlackBeltFactory: If you are a teacher at heart and love technology, this is your place…

While studying Java for fun and to take the Java certification from Sun Microsystems back in 2004, I used to hang out on different tutorial websites with reviews for the exams. I was still living in Brazil, where I grew up, when I first started studying Java at the university, being passionate about the “Write Once, Run Anywhere” premise… When I found JavaBlackBelt in 2006, I joined it to try to perfect my Java skills and keep up to date with the language. Given how Social Networking transformed the Internet, everything has changed since then: they changed their branding and name to BlackBeltFactory, and added social interaction capabilities and a marketplace for developers, technologists and those who love to teach and learn.

My previous experience was just related to my own learning: practicing the fundamentals of the Java programming language. It was essentially a website where users could go and take exams on different subjects related not only to Java, but also to related technologies such as XML, Web Services, Hibernate, etc. However, I must confess that it is hard to keep up with the exams when you have your day-to-day job, school obligations, etc. I had conquered the Java Blue Belt, and I was facing a lot of changes, starting with moving from New York to California for the dream of Silicon Valley and then having the opportunity to engage in another 2 years of my dreamed MS and work with what I love: Java and Computer Science. The academic world can take all of your time with research papers to read (ACM was my browser’s start page) and exams/finals. The only place where I could focus on practicing Java was my research projects (my thesis and conferences). So, I never became a JavaBlackBelt per se, and I was cleaning my mailbox when the name BlackBeltFactory started showing up on older and older emails. Yes, JavaBlackBelt had evolved and “taken the Social Networking train”. There is a list of changes on their website here.

The very first basic change BlackBeltFactory made was to take advantage of their infrastructure and start thinking in a more “language/vendor-agnostic” way: why not offer training in other languages? I saw C#, among other programming languages, listed on their website, and I must say that BlackBeltFactory is a cool place to hang out and take exams prepared and reviewed by peers in the community. It is definitely a place to challenge your skill set on a given track. You can only take exams when you provide contributions: review questions, add comments, etc. This approach requires the user to be active in the learning community.

In my opinion, BlackBeltFactory’s natural progression could not have been different: take advantage of the Social Networking era we currently live in to provide users a better learning experience. Teaching is one of my passions, and I must say that BlackBeltFactory did a great job adding features like “Become a coach”. After you have passed the exam to be a coach, it seems that you can offer either free or paid training to someone. Similarly, users interested in learning can ask others for the service of teaching a specific topic. This marketplace is healthy and very interesting to me in the sense that I don’t need to drive anywhere to teach someone something I’m passionate about. As far as I could see, BlackBeltFactory handles the process for both participants to engage in a program. Hummm… Now I don’t need to think about going for a PhD to teach! 😀

Another great feature is the translation capability. Although the previous version of the website, branded as JavaBlackBelt, was awesome for English-speaking users, the platform could not capture users of other nationalities without knowledge of English. As a Brazilian, I can say that it is difficult, in general, for the ones who are starting in our field of technology/science to properly “bootstrap” their career because of the restricted access to content in Portuguese. That’s why I made sure Portuguese was one of the first translated versions of CollabNet Subversion Edge while I was working on the project. BlackBeltFactory just gave me yet another reason to stick around and contribute to their community, as I have a passion for learning and sharing knowledge.

All in all, I think I have to squeeze more of my time to play around on BlackBeltFactory! For the love of teaching, I have already joined 2 Brazilian groups for the translations, and I will make time to review exams and try to get my Java BlackBelt 😀 I could not get even the yellow belt in Kung Fu when I was 15, but I think I might have potential for Java. I have linked my LinkedIn and Twitter accounts, which is nice as a linking resource.

mongoDB Shards, Cluster and MapReduce: experiments for Sensor Networks

This document describes the use of mongoDB as the persistence layer for the data collected from NetBEAMS. It is divided into sections on setup and the CRUD (Create, Retrieve, Update, Delete) operations, as well as advanced topics such as data replication and the use of MapReduce. This document is a copy of the experiments performed for my Masters Thesis report, entitled “A Key-Value-Based Persistence Layer for Sensor Networks”. The original wiki documentation can be found at MongoDBShardsClusterAndMapReduce.

The setup of the mongoDB shards must be performed on each cluster node. First, the relevant processes are started, and then the cluster must be configured with each of the shards, as well as indexes of the collections to be used. Before continuing on this section, refer to the following documentation:

In order to start collecting data, the mongoDB server must be set up in either a single-node or a distributed way. Using the distributed cluster version requires the processes shown in the following listing to be running:

marcello@netbeams-mongo-dev02:~/development/workspaces/netbeams/persistence$ ps aux | grep mongo
marcello  3391  0.0  0.2  67336  3328 pts/1    Sl   12:38   0:01 mongod --dbpath data/shards/shard-1/ --port 20001
marcello  3397  0.0  0.2  59140  3280 pts/1    Sl   12:38   0:01 mongod --dbpath data/shards/shard-2/ --port 20002
marcello  3402  0.0  0.2  59140  3276 pts/1    Sl   12:38   0:01 mongod --dbpath data/shards/shard-3/ --port 20003
marcello  3406  0.0  0.3 157452  3980 pts/1    Sl   12:38   0:01 mongod --dbpath data/shards/config --port 10000
marcello  3431  0.4  0.2  62004  3332 pts/1    Sl   12:38   0:35 mongos -vvv --configdb localhost:10000
marcello  3432  0.0  0.0   5196   704 pts/1    S    12:38   0:00 tee logs/mongos-cluster-head.log
In summary, these processes are defined as follows:
  • Shards Node: each shard process “mongod” is responsible for managing its own “chunks” of data on a given “dbpath” directory, on a given port number. These processes are used by the cluster head “mongos”;
  • Cluster Metadata Server Node: the main metadata server of the cluster can be located on a local or foreign host. The listing above shows the metadata server “config” located on the same server, managed by a “mongod” process. It carries information about the databases, the list of shards, and the list of “chunks” of each database, including their location “IP_address:port”;
  • Cluster Head Server: the orchestration of the cluster is performed by the “mongos” process. It consults the metadata server to select which shard to use, keeps statistics about counters, etc. This is the main process that accepts the client requests.

Make sure to proxy the output of the processes to log files. As shown in the listing above, the process “tee” is capturing the output of the “mongos” process. mongoDB’s processes also have parameters for that purpose.

Considering that the proper processes are running, especially the metadata server and the main cluster head, the client process can be started to issue the commands that enable shards on a given database system. Since mongoDB’s client interface uses JavaScript as the main programming-language abstraction to manipulate data, a script can be used to automate the process of setting up the server. Before continuing, make sure you have covered mongoDB’s documentation on how to set up database shards:

First, connect to the server using the client process “mongo”, as shown in the following listing:

marcello@netbeams-mongo-dev02:~/development/workspaces/netbeams/persistence$ mongo
MongoDB shell version: 1.2.0
url: test
connecting to: netbeams
Sun Dec 20 14:22:49 connection accepted from #5
type "help" for help

After connecting to the server through the client, get references to 2 important databases: “admin” and “config”. The “admin” database is responsible for running commands on the cluster server, while “config” is the reference to the metadata server. The following listing shows the use of the method “db.getSisterDB()” to retrieve those references:

> admin = db.getSisterDB("admin")
> config = db.getSisterDB("config")

Once the references are available, using these names as shortcuts makes access easier. Let’s add each of the shards running on the local and foreign servers, on different communication ports. It is important to note that the issued commands are executed on the metadata server “config”.

> admin.runCommand( { addshard: "" } )
Sun Dec 20 16:04:02 Request::process ns: admin.$cmd msg id:-2097268492 attempt: 0
Sun Dec 20 16:04:02 single query: admin.$cmd  { addshard: "" }  ntoreturn: -1

> admin.runCommand( { addshard: "" } )
Sun Dec 20 16:04:03 Request::process ns: admin.$cmd msg id:-2097268491 attempt: 0
Sun Dec 20 16:04:03 single query: admin.$cmd  { addshard: "" }  ntoreturn: -1

> admin.runCommand( { addshard: "localhost:20001", allowLocal: true } )

In order to be added to the list, a shard server must be running. In case the shard is down at this point, it will not be added to the list of available shards. On the other hand, if it is added and then goes down, mongos keeps sending heartbeats to verify whether the shard has come back. In any case, use the command “listshards” to list the existing shards the cluster head can use.

> admin.runCommand( { listshards:1 } )
Sun Dec 20 16:04:03 Request::process ns: admin.$cmd msg id:-2097268490 attempt: 0
Sun Dec 20 16:04:03 single query: admin.$cmd  { addshard: "localhost:20001", allowLocal: true }  ntoreturn: -1
Sun Dec 20 16:04:03 Request::process ns: admin.$cmd msg id:-2097268489 attempt: 0
Sun Dec 20 16:04:03 single query: admin.$cmd  { listshards: 1.0 }  ntoreturn: -1
        "shards" : [
                        "_id" : ObjectId("4b2e8b3f5e90e01ce34de6ea"),
                        "host" : ""
                        "_id" : ObjectId("4b2e8b3f5e90e01ce34de6eb"),
                        "host" : ""
                        "_id" : ObjectId("4b2e8b3f5e90e01ce34de6ec"),
                        "host" : "localhost:20001"
        "ok" : 1

Enabling sharding means giving the metadata server “config” the name of the database to be sharded, as well as the definition of the shard keys. The command “enablesharding” receives the name of the database system. The following listing shows the database “netbeams” being enabled. Later, the definition of the shard key must be given; here the key “observation.pH” is defined as the shard key:

> admin.runCommand({enablesharding:"netbeams"})
{"ok" : 1}
admin.runCommand( { shardcollection: "netbeams.SondeDataContainer", key: { "observation.pH" : 1} } )
Sun Dec 20 16:04:03 Request::process ns: admin.$cmd msg id:-2097268488 attempt: 0
Sun Dec 20 16:04:03 single query: admin.$cmd  { enablesharding: "netbeams" }  ntoreturn: -1
Sun Dec 20 16:04:03 Request::process ns: admin.$cmd msg id:-2097268487 attempt: 0
Sun Dec 20 16:04:03 single query: admin.$cmd  { shardcollection: "netbeams.SondeDataContainer", key: { observation.pH: 1.0 } }  ntoreturn: -1
{"collectionsharded" : "netbeams.SondeDataContainer" , "ok" : 1}

The chunks represent the different sections of the data. Using the reference to the metadata server, run “config.chunks.find()” to list the chunk documents.

> config.chunks.find()
{ "lastmod" : { "t" : 1261341503000, "i" : 1 }, "ns" : "netbeams.SondeDataContainer", "min" : { "observation" : { "pH" : { $minKey : 1 } } },
"minDotted" : { "observation.pH" : { $minKey : 1 } }, "max" : { "observation" : { "pH" : { $maxKey : 1 } } }, "maxDotted" : { "observation.pH" : { $maxKey : 1 } },
"shard" : "", "_id" : ObjectId("4b2e8b3fb342bcd910b62ec9") }
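The min/max boundaries in the chunk document above are what route each document to a shard. A simplified, hypothetical sketch of that routing follows, in plain JavaScript; mongos does this internally, and the two-chunk split at pH 7.0 and the shard addresses below are invented for illustration.

```javascript
// Hypothetical routing sketch: each chunk owns a [min, max) range of
// the shard key ("observation.pH" above, from $minKey to $maxKey).
const MIN_KEY = -Infinity;
const MAX_KEY = Infinity;

const chunks = [
  { min: MIN_KEY, max: 7.0,     shard: 'localhost:20001' },
  { min: 7.0,     max: MAX_KEY, shard: 'localhost:20002' },
];

function routeToShard(doc) {
  const key = doc.observation.pH;
  return chunks.find(c => key >= c.min && key < c.max).shard;
}

console.log(routeToShard({ observation: { pH: 6.25 } })); // localhost:20001
console.log(routeToShard({ observation: { pH: 8.1 } }));  // localhost:20002
```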

The next step is to create the indexes for the expected keys. This procedure can also be performed after the documents are inserted. In general, defining indexes slows down “Create” operations, but speeds up “Retrieval” ones. In order to proceed, make sure you have covered the documentation on mongoDB’s indexes.

  • mongoDB Indexes: this is the documentation regarding indexes of keys on mongoDB.
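The create/retrieve tradeoff mentioned above can be sketched with a toy example: keeping a sorted index makes every insert do extra work, but lets lookups use binary search instead of scanning the whole collection. This is plain JavaScript for illustration only, not mongoDB's actual B-tree indexes.

```javascript
// Toy sketch: a sorted "index" over observation.pH for a tiny "collection".
const docs = [];     // the "collection"
const phIndex = [];  // sorted [pH, position] pairs: the "index"

function insert(doc) {
  docs.push(doc);
  phIndex.push([doc.observation.pH, docs.length - 1]);
  phIndex.sort((a, b) => a[0] - b[0]); // extra cost on every "Create"
}

function findByPh(ph) {
  let lo = 0, hi = phIndex.length - 1;
  while (lo <= hi) {                   // fast "Retrieve": binary search
    const mid = (lo + hi) >> 1;
    if (phIndex[mid][0] === ph) return docs[phIndex[mid][1]];
    if (phIndex[mid][0] < ph) lo = mid + 1; else hi = mid - 1;
  }
  return null;
}

insert({ observation: { pH: 7.8 } });
insert({ observation: { pH: 6.25 } });
console.log(findByPh(6.25)); // { observation: { pH: 6.25 } }
```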

Note, in the following listing, that the keys are written to the metadata server “config”. A reference to the database “netbeams” is acquired by using the function “db.getSisterDB()”, as was done for the databases “config” and “admin”. The method “db.collection.ensureIndex()” is then used.

> netbeams = db.getSisterDB("netbeams")
> netbeams.SondeDataContainer.ensureIndex( { "message_id":1 } )
Sun Dec 20 16:04:03 Request::process ns: netbeams.system.indexes msg id:-2097268486 attempt: 0
Sun Dec 20 16:04:03  .system.indexes write for: netbeams.system.indexes
Sun Dec 20 16:04:03 Request::process ns: netbeams.$cmd msg id:-2097268485 attempt: 0
Sun Dec 20 16:04:03 single query: netbeams.$cmd  { getlasterror: 1.0 }  ntoreturn: -1
Sun Dec 20 16:04:03 Request::process ns: test.$cmd msg id:-2097268484 attempt: 0
Sun Dec 20 16:04:03 single query: test.$cmd  { getlasterror: 1.0 }  ntoreturn: -1

netbeams.SondeDataContainer.ensureIndex( { "sensor.ip_address":1 } )
netbeams.SondeDataContainer.ensureIndex( { "sensor.location.latitude":1 } )
netbeams.SondeDataContainer.ensureIndex( { "sensor.location.longitude":1 } )
netbeams.SondeDataContainer.ensureIndex( { "time.valid":1 } )
netbeams.SondeDataContainer.ensureIndex( { "time.transaction":1 } )
netbeams.SondeDataContainer.ensureIndex( { "observation.WaterTemperature":1 } )
netbeams.SondeDataContainer.ensureIndex( { "observation.SpecificConductivity":1 } )
netbeams.SondeDataContainer.ensureIndex( { "observation.Conductivity":1 } )
netbeams.SondeDataContainer.ensureIndex( { "observation.Resistivity":1 } )
netbeams.SondeDataContainer.ensureIndex( { "observation.Salinity":1 } )
netbeams.SondeDataContainer.ensureIndex( { "observation.Pressure":1 } )
netbeams.SondeDataContainer.ensureIndex( { "observation.Depth":1 } )
netbeams.SondeDataContainer.ensureIndex( { "observation.pH":1 } )
netbeams.SondeDataContainer.ensureIndex( { "observation.pHmV":1 } )
netbeams.SondeDataContainer.ensureIndex( { "observation.Turbidity":1 } )
netbeams.SondeDataContainer.ensureIndex( { "observation.ODOSaturation":1 } )
netbeams.SondeDataContainer.ensureIndex( { "observation.ODO":1 } )
netbeams.SondeDataContainer.ensureIndex( { "observation.Battery":1 } )

Actually, you can verify the setup performed by accessing each of the collections of the config server. Using a client to access the server in a different shell, you can directly access and modify (NOT RECOMMENDED) the settings of the metadata server, as shown in the following listing:

marcello@netbeams-mongo-dev02:~/development/workspaces/netbeams/persistence$ mongo config
MongoDB shell version: 1.2.0
url: config
connecting to: config
type "help" for help
> Sun Dec 20 16:31:57 connection accepted from #7
show collections
Sun Dec 20 16:32:01 Request::process ns: config.system.namespaces msg id:-128400130 attempt: 0
Sun Dec 20 16:32:01 single query: config.system.namespaces  { query: {}, orderby: { name: 1.0 } }  ntoreturn: 0

The method “find()” can then be used to list the contents of each of the collections. An example is listing the configured databases, showing the properties of each of them (partitioned or not, server host, etc.), as shown in the following listing.

> db.databases.find()
Sun Dec 20 16:47:48 Request::process ns: config.databases msg id:-128400129 attempt: 0
Sun Dec 20 16:47:48 single query: config.databases  {}  ntoreturn: 0
{ "name" : "admin", "partitioned" : false, "primary" : "localhost:10000", "_id" : ObjectId("4b2e8b3fb342bcd910b62ec7") }
{ "name" : "netbeams", "partitioned" : true, "primary" : "",
                  "sharded" : { "netbeams.SondeDataContainer" : { "key" : { "observation" : { "pH" : 1 } }, "unique" : false } },
                  "_id" : ObjectId("4b2e8b3fb342bcd910b62ec8") }
{ "name" : "test", "partitioned" : false, "primary" : "", "_id" : ObjectId("4b2e8b3fb342bcd910b62eca") }

Before proceeding, make sure you have covered the basics of mongoDB use:

Using the mongoDB client process “mongo”, access a given “mongos” or “mongod” server. Client access to the “mongos” process executes commands in the context of the entire cluster through the metadata server “config”, while “mongod” is used to access a given shard server directly, if necessary for debugging. Use the command specifying the server location and which database to use. The following listing shows the command to access a given shard on a given port, using the database “netbeams”.

marcello@netbeams-mongo-dev02:~/development/workspaces/netbeams/persistence$ mongo
MongoDB shell version: 1.2.0
connecting to:
type "help" for help

In order to verify the stats of a collection, use the function “collection.stats()”. This function verifies the counters stored in the metadata server.

> db.SondeDataContainer.stats()
Sun Dec 20 14:54:24 Request::process ns: netbeams.$cmd msg id:-1701410104 attempt: 0
Sun Dec 20 14:54:24 single query: netbeams.$cmd  { collstats: "SondeDataContainer" }  ntoreturn: -1
Sun Dec 20 14:54:24 passing through unknown command: collstats { collstats: "SondeDataContainer" }
        "ns" : "netbeams.SondeDataContainer",
        "count" : 2364851,
        "size" : 1155567036,
        "storageSize" : 1416246240,
        "nindexes" : 40,
        "ok" : 1

A given document can be retrieved, randomly chosen from one of the shards, by using the function “collection.findOne()”. It is a way to inspect one example of the collected data.

> db.SondeDataContainer.findOne()
Sun Dec 20 14:59:08 Request::process ns: netbeams.SondeDataContainer msg id:-1701410103 attempt: 0
Sun Dec 20 14:59:08 shard query: netbeams.SondeDataContainer  {}
Sun Dec 20 14:59:08  have to set shard version for conn: 0x2909de0 ns:netbeams.SondeDataContainer my last seq: 0  current: 4
Sun Dec 20 14:59:08     setShardVersion  netbeams.SondeDataContainer  { setShardVersion: "netbeams.SondeDataContainer",
configdb: "localhost:10000", version: Timestamp 1261341503000|1, serverID: ObjId(4b2e8b3eb342bcd910b62ec6) } 0x2909de0
Sun Dec 20 14:59:08       setShardVersion success!
{
        "_id" : ObjectId("e26f40072f68234b6af3d600"),
        "message_id" : "b405e634-fd4b-450c-9466-82dc0555ea06",
        "sensor" : {
                "ip_address" : "",
                "location" : {
                        "latitude" : 37.89155,
                        "longitude" : -122.4464
                }
        },
        "time" : {
                "valid" : "Sun Dec 06 2009 10:18:22 GMT-0800 (PST)",
                "transaction" : "Sat Dec 12 2009 01:52:42 GMT-0800 (PST)"
        },
        "observation" : {
                "WaterTemperature" : 23.45,
                "SpecificConductivity" : 35.4,
                "Conductivity" : 139.6,
                "Resistivity" : 899.07,
                "Salinity" : 0.02,
                "Pressure" : 0.693,
                "Depth" : 2.224,
                "pH" : 6.25,
                "pHmV" : -76,
                "Turbidity" : 0.2,
                "ODOSaturation" : 31.3,
                "ODO" : 54.83,
                "Battery" : 1.1
        }
}


In order to proceed with this section, make sure you have the necessary background in the programming model "MapReduce". The recommended documentation and tutorials are as follows:

  • Introduction to MapReduce: this training video class describes the MapReduce concepts using Hadoop and the Hadoop Distributed File System, which can be directly related to mongoDB's implementation; a must-watch before proceeding;
  • mongoDB's MapReduce HowTo: this is the main documentation of the MapReduce implementation and its use in mongoDB. It covers the basics and how the "map" and "reduce" functions can be implemented for a given collection of documents.

The first basic example of MapReduce in distributed systems is counting. In my opinion, it is a good example of how the counting process can be spread out across different machines. Using the regular client process "mongo", access the database "netbeams", as shown in the following listing:

marcello@netbeams-mongo-dev02:~/development/workspaces/netbeams/persistence$ mongo netbeams
MongoDB shell version: 1.2.0
url: netbeams
connecting to: netbeams
Sun Dec 20 14:22:49 connection accepted from #5
type "help" for help

At this point, you're connected to the server running on the main host. Refer to the setup process described at the beginning of this documentation for more details. Our goal is to report the number of data points collected from each server, identified by its IP address. Our strategy is to define a map function that emits the value 1 as the counter for each document, and a reduce function that sums the consolidated results after mongoDB's MapReduce engine returns the intermediate values to be reduced.

  • The Map function: the following defines the map function, which emits the IP address of the sensor as the key and a count of 1 as the value. Note that mongoDB's implementation differs from Hadoop's: it does not pass the key as a parameter to the map function, because it uses "this" to refer to the current document being processed.
> m1 = function () {
    emit(this.sensor.ip_address, {count:1});
}
  • The Reduce function: the following defines the reduce function, which receives the consolidated results mapping each key (an IP address) to the counting values found. The function iterates over the returned values and increments the "total" variable with the value of "count", which in this case equals 1 for each element. The result is returned using the key "count".
> r1 = function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i].count;
    }
    return {count:total};
}
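For illustration only, the same counting logic can be sketched outside MongoDB in plain Java. This is a hypothetical, self-contained simulation of what m1 and r1 compute together; the class name and sample IP addresses are made up for the sketch:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical local simulation of the m1/r1 pair: group documents by
// sensor IP address ("map") and sum a count of 1 per document ("reduce").
public class IpCountSimulation {

    public static Map<String, Integer> mapReduce(List<String> sensorIps) {
        // "map phase": emit (ip, 1) for each document
        Map<String, List<Integer>> emitted = new HashMap<>();
        for (String ip : sensorIps) {
            emitted.computeIfAbsent(ip, k -> new ArrayList<>()).add(1);
        }
        // "reduce phase": sum the emitted counts per key, as r1 does
        Map<String, Integer> reduced = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : emitted.entrySet()) {
            int total = 0;
            for (int count : e.getValue()) {
                total += count;
            }
            reduced.put(e.getKey(), total);
        }
        return reduced;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("10.0.0.1", "10.0.0.2", "10.0.0.1");
        System.out.println(mapReduce(docs)); // one count per distinct IP address
    }
}
```

In the real cluster, the "map phase" runs on each shard and the "reduce phase" is applied to the intermediate results, but the per-key arithmetic is exactly this.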

Having defined the "map" and "reduce" functions, you can call the collection function "db.collection.mapReduce", passing the function references as parameters. The following listing shows the execution of the command in the mongoDB shell; the shell echoes the definition of the "map" and "reduce" functions before the execution:

> res = db.SondeDataContainer.mapReduce(m1, r1);
Sun Dec 20 14:26:02 Request::process ns: netbeams.$cmd msg id:-1701410106 attempt: 0
Sun Dec 20 14:26:02 single query: netbeams.$cmd  { mapreduce: "SondeDataContainer", map: function () {
    emit(this.sensor.ip_address, {count:1});
}, reduce: function (key, values) {
    var total = 0;
    for (var i = 0; i < va... }  ntoreturn: -1

After processing the execution of the function on each of the shards, the cluster head process "mongos" consolidates and returns the values. The output is temporarily stored in a collection whose name is returned in "res.result", saving the values on a separate chunk. The output is shown as follows:

Sun Dec 20 14:33:15 ~ScopedDBConnection: _conn != null
Sun Dec 20 14:33:15 creating new connection for pool to:
Sun Dec 20 14:33:15 ~ScopedDBConnection: _conn != null
{
        "result" : "",
        "shardCounts" : {
                "" : {
                        "input" : 2364851,
                        "emit" : 2364851,
                        "output" : 254
                }
        },
        "counts" : {
                "emit" : 2364851,
                "input" : 2364851,
                "output" : 254
        },
        "ok" : 1,
        "timeMillis" : 433282,
        "timing" : {
                "shards" : 433193,
                "final" : 89
        },
        "ok" : 1
}

As shown in this output, the MapReduce result reports the counts of emitted, input, and final output documents. The final output corresponds to the distinct IP addresses in use on the subnet (the subnet address 0 and the broadcast address 255 are not assigned to sensors). The input and emit values match the total number of observations inserted during the Create operation; the Retrieve section reported the same total of about 2.36 million documents. Again, the output of the function "db.collection.stats()" shows the total number of documents:

> db.SondeDataContainer.stats()
Sun Dec 20 14:54:24 Request::process ns: netbeams.$cmd msg id:-1701410104 attempt: 0
Sun Dec 20 14:54:24 single query: netbeams.$cmd  { collstats: "SondeDataContainer" }  ntoreturn: -1
Sun Dec 20 14:54:24 passing through unknown command: collstats { collstats: "SondeDataContainer" }
{
        "ns" : "netbeams.SondeDataContainer",
        "count" : 2364851,
        "size" : 1155567036,
        "storageSize" : 1416246240,
        "nindexes" : 40,
        "ok" : 1
}

The number of "emit" is the total number of documents visited by the "map" function, and "output" is the number of reduced results. To see the result, access the collection referenced by "res.result" and use the function "find()" to list the results, as shown in the following listing (only the first 20 items are displayed):

> db[res.result].find()                        
Sun Dec 20 14:34:43 Request::process ns: msg id:-1701410105 attempt: 0
Sun Dec 20 14:34:43 single query:  {}  ntoreturn: 0
Sun Dec 20 14:34:43 creating new connection for pool to:
{ "_id" : "", "value" : { "count" : 9408 } }
{ "_id" : "", "value" : { "count" : 9371 } }
{ "_id" : "", "value" : { "count" : 9408 } }
{ "_id" : "", "value" : { "count" : 9500 } }
{ "_id" : "", "value" : { "count" : 9363 } }
{ "_id" : "", "value" : { "count" : 9355 } }
{ "_id" : "", "value" : { "count" : 9281 } }
{ "_id" : "", "value" : { "count" : 9320 } }
{ "_id" : "", "value" : { "count" : 9341 } }
{ "_id" : "", "value" : { "count" : 9464 } }
{ "_id" : "", "value" : { "count" : 9285 } }
{ "_id" : "", "value" : { "count" : 9201 } }
{ "_id" : "", "value" : { "count" : 9397 } }
{ "_id" : "", "value" : { "count" : 9258 } }
{ "_id" : "", "value" : { "count" : 9242 } }
{ "_id" : "", "value" : { "count" : 9231 } }
{ "_id" : "", "value" : { "count" : 9446 } }
{ "_id" : "", "value" : { "count" : 9550 } }
{ "_id" : "", "value" : { "count" : 9409 } }
{ "_id" : "", "value" : { "count" : 9256 } }
has more

Note that the final result shows the key "_id" as the IP address, as defined in the "map" function, and the count as "value.count", since "value" is the default output key of the MapReduce engine and "count" was the key used in the "reduce" function.

Other use cases can be implemented in the same way. This MapReduce execution was not fast because only a single shard was used. MapReduce is designed to scale with the number of servers available: if the load is distributed across more shards, results are returned proportionally faster.

The shard logs reveal the details of the map and reduce operations. The following listing is from the log of the "mongod" server process, showing the moments when the temporary collections for intermediate results are created. First, the request is received and both the map and reduce functions are set up for execution.

Sun Dec 20 14:26:02 query netbeams.$cmd ntoreturn:1 reslen:179 nscanned:0 { mapreduce: "SondeDataContainer", map: function () {
    emit(this.sensor.ip_address, {count:1});
}, reduce: function (key, values) {
    var total = 0;
    for (var i = 0; i < va..., out: "tmp.mrs.SondeDataContainer_1261347962_5" }  nreturned:1 433257ms
Sun Dec 20 14:26:02 CMD: drop
Sun Dec 20 14:26:02 CMD: drop

The "map phase" executes first and must complete before the "reduce phase" takes place. In the scenario used to count the number of documents per IP address, the phases happen at different moments, as shown in the following listing. In addition, it shows the indexing of the intermediate results during the "map phase", which saves the data into the database "":

                43700/2364851   1%
                96000/2364851   4%
                148300/2364851  6%
                200300/2364851  8%
                250900/2364851  10%
                300600/2364851  12%
                351600/2364851  14%
                403800/2364851  17%
                455800/2364851  19%
                508000/2364851  21%
                560500/2364851  23%
                601100/2364851  25%
                647500/2364851  27%
                699900/2364851  29%
                752300/2364851  31%
                804300/2364851  34%
                856100/2364851  36%
                907900/2364851  38%
                959000/2364851  40%
                1009800/2364851 42%
                1060800/2364851 44%
                1112800/2364851 47%
                1164100/2364851 49%
                1209400/2364851 51%
                1253700/2364851 53%
                1305400/2364851 55%
                1350900/2364851 57%
                1401700/2364851 59%
                1453100/2364851 61%
                1503100/2364851 63%
                1551500/2364851 65%
                1602600/2364851 67%
                1637100/2364851 69%
                1687600/2364851 71%
                1736800/2364851 73%
                1787600/2364851 75%
                1839900/2364851 77%
                1891100/2364851 79%
                1941400/2364851 82%
                1989900/2364851 84%
                2041800/2364851 86%
                2094300/2364851 88%
                2145500/2364851 90%
                2193500/2364851 92%
                2245100/2364851 94%
                2296200/2364851 97%
                2341700/2364851 99%
Sun Dec 20 14:28:24 building new index on { 0: 1 } for
Sun Dec 20 14:28:24 Buildindex idxNo:0
       { ns: "", key: { 0: 1 }, name: "0_1" }
Sun Dec 20 14:28:40      external sort used : 0 files  in 16 secs
Sun Dec 20 14:28:46 done for 1796343 records 22.486secs
Sun Dec 20 14:28:24 insert netbeams.system.indexes 22486ms
Sun Dec 20 14:28:47 building new index on { _id: ObjId(000000000000000000000000) } for
Sun Dec 20 14:28:47 Buildindex idxNo:0
      { name: "_id_", ns: "", key: { _id: ObjId(000000000000000000000000) } }
Sun Dec 20 14:28:47 done for 0 records 0.02secs

The "reduce phase" then starts and processes the intermediate results of the "map phase", saving the final results in the new temporary database "".

                100/1796343     0%
                200/1796343     0%
Sun Dec 20 14:33:15 CMD: drop
Sun Dec 20 14:33:15 CMD: drop netbeams.tmp.mrs.SondeDataContainer_1261347962_5
Sun Dec 20 14:33:15 end connection
Sun Dec 20 14:33:15 connection accepted from #15
Sun Dec 20 14:33:15 connection accepted from #16
Sun Dec 20 14:33:15 building new index on { _id: ObjId(000000000000000000000000) } for
Sun Dec 20 14:33:15 Buildindex idxNo:0
         { name: "_id_", ns: "", key: { _id: ObjId(000000000000000000000000) } }
Sun Dec 20 14:33:15 done for 0 records 0secs
Sun Dec 20 14:33:15  mapreducefinishcommand 253
Sun Dec 20 14:33:15 CMD: drop netbeams.tmp.mrs.SondeDataContainer_1261347962_5
Sun Dec 20 14:33:15 ~ScopedDBConnection: _conn != null
Sun Dec 20 14:33:15 end connection
Sun Dec 20 14:33:15 end connection
Sun Dec 20 14:34:43 connection accepted from #17

NOTE: If the results are important, make sure to save them into a permanent collection, since the temporary results returned by a map-reduce function are purged upon a new access to the server through the mongoDB client.

RSA Algorithm Explained: a step-by-step process

The art of information hiding, or Cryptography, is one of my favorite applications of Mathematics in computer science. Information hiding dates from thousands of years ago, with "non-standard hieroglyphs carved into monuments from Egypt's Old Kingdom (ca 4500+ years ago)" [History of Cryptography]. It has different real-world applications, ranging from credit-card security to online transactions on protected websites (those with the security lock). There are different approaches to information hiding, and the one used in the applications cited is the one first described by Rivest, Shamir and Adleman: RSA. A great post about public-key cryptography was written by Dr. Duke O'Connor on his website, "RSA is 30, and counting", about the 30th anniversary of the paper that gave origin to RSA:

R. Rivest, A. Shamir, L. Adleman.  A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Communications of the ACM, Vol. 21 (2), pp.120–126. 1978.

The application used on the Internet can be summarized in the following video:

If you want to teach your children :), take a look at this other video.

RSA Algorithm on a T-Shirt

Dr. O'Connor summarizes two versions of the mathematical foundation of the algorithm: one with the first flaw and the corrected version. The "fixed" version is the one I studied in school in Abstract Algebra, where the students were divided into groups of three to implement their own crypto systems and to exchange keys and information in order to test them. Of course, different assumptions were made for the exercise. This post is just the output of the application I developed in Java, which I called Cryptonline. I decided to use my free time to develop a Groovy on Grails online forum application where users can only see the messages if they log in and use their public/private keys. The application is intended for educational purposes only. Read Dr. O'Connor's post for background information before moving on (in case you don't have it :D). As the T-Shirt says, it is just an algorithm.

Suppose you are transferring an important message over an insecure communication channel, such as a letter to someone, or simply giving someone your bank account number for transfer purposes.

Bank Transfer between you and the bank.

As you can see, a secure communication channel is one where your message is encrypted so that no one between the sender and the receiver can read it, because the message is in an arrangement that only you and the bank are capable of decoding. The only way to do so is to create keys that can lock and unlock the hidden message. The RSA algorithm is based on the idea of Private and Public keys, where both are used to encode and decode the encrypted message. Suppose the string "John Smith" is the text to be encoded. Encryption is the process of hiding the information in an encrypted version, and Decryption is the process of decoding the encrypted text back to its original format.

Private Key Use

Of course, the idea is to have a key that you can distribute, so that only the people you want to read the message can do so.

Private and Public Keys

This is the basic idea of the algorithm. There are different variations of this algorithm depending on the application. Whenever you visit a website with the lock enabled, it means that any information exchanged between your computer and the visited website's server is encrypted.

By now you should have the background on this subject (assuming you have gone to Dr. O'Connor's website and read the mathematical background). The listings below are the actual output from Cryptonline, the application I developed back in 2000 at school. It is divided into 3 steps: Key Creation, Text Encryption, and Text Decryption.

Public Key Creation

Let’s start from the Public Key.

-> Configuring random prime numbers
        P = 1277
        Q = 2311
-> Calculating public keys
          N = P * Q; N = 2951147
-> FI = (P-1) * (Q-1); FI = 2947560

-> Calculating (E)
           While MCD(n >= 2 , 2947560) != 1
           MCD(2 , 2947560) = 2
           MCD(3 , 2947560) = 3
           MCD(4 , 2947560) = 4
           MCD(5 , 2947560) = 5
           MCD(6 , 2947560) = 6
           MCD(7 , 2947560) = 7
           MCD(8 , 2947560) = 8
           MCD(9 , 2947560) = 3
           MCD(10 , 2947560) = 10
           MCD(11 , 2947560) = 11
           MCD(12 , 2947560) = 12
           MCD(13 , 2947560) = 1 Correct!
          E = 13

       Public Key (N,E) = (2951147 , 13)
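The steps above can be sketched in a few lines of Java. This is a minimal sketch under the same assumptions (P = 1277, Q = 2311); the class and method names are only for illustration, and the "MCD" in the program output is simply the greatest common divisor:

```java
// Minimal sketch of the public key creation above, assuming P = 1277 and Q = 2311.
public class PublicKeySketch {

    // greatest common divisor (the "MCD" in the program output)
    static long gcd(long a, long b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    // smallest exponent E >= 2 that is coprime with FI
    static long findE(long fi) {
        long e = 2;
        while (gcd(e, fi) != 1) {
            e++;
        }
        return e;
    }

    public static void main(String[] args) {
        long p = 1277, q = 2311;
        long n = p * q;              // N = 2951147
        long fi = (p - 1) * (q - 1); // FI = 2947560
        System.out.println("Public Key (N,E) = (" + n + ", " + findE(fi) + ")");
    }
}
```

Real RSA implementations pick large random primes and often a fixed exponent such as 65537; the tiny primes here just mirror the educational output above.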

Private Key Creation

The calculation of the Private Key applies the extended Euclidean algorithm to Φ and E, tracking triples of coefficients until finding D, the modular inverse of E.

-> Calculating private keys
         Initializing (p1,p2,p3) = (1, 0 , FI(n))
         Initializing (q1,q2,q3) = (0, 1 ,  E  )
         While q3 != 0
             quoc = p3 / q3
             (t1,t2,t3) = (p1,p2,p3) - quoc * (q1,q2,q3)
             After, arrange the values:
             (p1,p2,p3) = (q1,q2,q3)
             (q1,q2,q3) = (t1,t2,t3)

           (13 <> 0) , then:
             quoc = 2947560 / 13 = 226735
             (t1,t2,t3) = (1,0,2947560) - 226735 * (0,1,13) = (1,-226735,5)
             (p1,p2,p3) = (0,1,13)
             (q1,q2,q3) = (1,-226735,5)

           (5 <> 0) , then:
             quoc = 13 / 5 = 2
             (t1,t2,t3) = (0,1,13) - 2 * (1,-226735,5) = (-2,453471,3)
             (p1,p2,p3) = (1,-226735,5)
             (q1,q2,q3) = (-2,453471,3)

           (3 <> 0) , then:
             quoc = 5 / 3 = 1
             (t1,t2,t3) = (1,-226735,5) - 1 * (-2,453471,3) = (3,-680206,2)
             (p1,p2,p3) = (-2,453471,3)
             (q1,q2,q3) = (3,-680206,2)

           (2 <> 0) , then:
             quoc = 3 / 2 = 1
             (t1,t2,t3) = (-2,453471,3) - 1 * (3,-680206,2) = (-5,1133677,1)
             (p1,p2,p3) = (3,-680206,2)
             (q1,q2,q3) = (-5,1133677,1)

           (1 <> 0) , then:
             quoc = 2 / 1 = 2
             (t1,t2,t3) = (3,-680206,2) - 2 * (-5,1133677,1) = (13,-2947560,0)
             (p1,p2,p3) = (-5,1133677,1)
             (q1,q2,q3) = (13,-2947560,0)

         q3 is zero (0). Now, verify the value of p2. If it is negative, make it positive by summing it with FI (representing the negative element of Z(n) by its positive equivalent).

         u2 = 1133677;
         D = u2; D = 1133677

      Private Key (N,D) = (2951147, 1133677);
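The trace above is the extended Euclidean algorithm computing D as the modular inverse of E. A minimal sketch of the same loop, with a cross-check against the JDK's built-in BigInteger.modInverse (the class and method names are assumptions for the sketch):

```java
import java.math.BigInteger;

// Minimal sketch of the private key step above: D is the modular inverse
// of E mod FI, computed with the extended Euclidean algorithm.
public class PrivateKeySketch {

    // returns d such that (e * d) mod fi == 1, adding fi when p2 ends up negative
    static long modInverse(long e, long fi) {
        long p1 = 1, p2 = 0, p3 = fi;
        long q1 = 0, q2 = 1, q3 = e;
        while (q3 != 0) {
            long quoc = p3 / q3;
            long t1 = p1 - quoc * q1, t2 = p2 - quoc * q2, t3 = p3 - quoc * q3;
            p1 = q1; p2 = q2; p3 = q3;   // shift: p takes the old q
            q1 = t1; q2 = t2; q3 = t3;   // q takes the freshly computed t
        }
        // p2 now holds the inverse coefficient; normalize it to a positive value
        return p2 < 0 ? p2 + fi : p2;
    }

    public static void main(String[] args) {
        long fi = 2947560, e = 13;
        System.out.println("D = " + modInverse(e, fi));
        // cross-check with the JDK
        System.out.println(BigInteger.valueOf(e).modInverse(BigInteger.valueOf(fi)));
    }
}
```

Note that the loop keeps the previous triple in (p1,p2,p3) and the new one in (q1,q2,q3), which is exactly the "arrange the values" step in the listing.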

Using the Private and Public Keys

To summarize, the program outputs the keys used throughout the application.

#### All RSA Information ####

Public Key (N, E) = (2951147, 13)
Private Key (N, D) = (2951147, 1133677)

Of course, you cannot give the original prime numbers to anybody, since they are the factors that created the public and private keys. The only key you can give away to other people is the Public one. Now, in order to illustrate Encryption and Decryption, my encryption machine uses ASCII-based character mapping for the mathematical calculations. Consider my blog's title as the input: "Marcello de Sales: because solving problems is addicting". Each character of the string is transformed into its ASCII code and grouped with other numbers into blocks.

Encryption Process

-> Original Message
Marcello de Sales: because solving problems is addicting

-> Setting the receiver's public key
(N , E) = (2951147 , 13)

-> Transforming the message to ASCII code

-> Configuring randomly selected blocks from the ASCII message
Block(x) = x ^ E mod N

Block(17) = 17 ^ 13 mod 2951147 = 2920887
Block(71) = 71 ^ 13 mod 2951147 = 1483408
Block(972) = 972 ^ 13 mod 2951147 = 363316
Block(1419) = 1419 ^ 13 mod 2951147 = 1419505
Block(920) = 920 ^ 13 mod 2951147 = 213548
Block(1) = 1 ^ 13 mod 2951147 = 1
Block(20) = 20 ^ 13 mod 2951147 = 93651
Block(8) = 8 ^ 13 mod 2951147 = 1394993
Block(2082) = 2082 ^ 13 mod 2951147 = 2878680
Block(1113) = 1113 ^ 13 mod 2951147 = 770001
Block(2200) = 2200 ^ 13 mod 2951147 = 2301917
Block(20113) = 20113 ^ 13 mod 2951147 = 787047
Block(2183) = 2183 ^ 13 mod 2951147 = 424239
Block(19) = 19 ^ 13 mod 2951147 = 1557862
Block(7) = 7 ^ 13 mod 2951147 = 2854397
Block(208) = 208 ^ 13 mod 2951147 = 375871
Block(2012) = 2012 ^ 13 mod 2951147 = 491468
Block(151) = 151 ^ 13 mod 2951147 = 2348470
Block(58) = 58 ^ 13 mod 2951147 = 966721
Block(13219) = 13219 ^ 13 mod 2951147 = 2596853
Block(820) = 820 ^ 13 mod 2951147 = 1336058
Block(11991) = 11991 ^ 13 mod 2951147 = 2624815
Block(97) = 97 ^ 13 mod 2951147 = 1340264
Block(21721) = 21721 ^ 13 mod 2951147 = 1760166
Block(52011) = 52011 ^ 13 mod 2951147 = 1685895
Block(32215) = 32215 ^ 13 mod 2951147 = 1202590
Block(21) = 21 ^ 13 mod 2951147 = 2752293
Block(1208) = 1208 ^ 13 mod 2951147 = 1414540
Block(21820) = 21820 ^ 13 mod 2951147 = 1733373
Block(5) = 5 ^ 13 mod 2951147 = 1879414
Block(21020) = 21020 ^ 13 mod 2951147 = 310870
Block(3132) = 3132 ^ 13 mod 2951147 = 519822
Block(212) = 212 ^ 13 mod 2951147 = 1315135
Block(2142) = 2142 ^ 13 mod 2951147 = 2430603
Block(1119) = 1119 ^ 13 mod 2951147 = 748920
Block(8208) = 8208 ^ 13 mod 2951147 = 2808982
Block(20) = 20 ^ 13 mod 2951147 = 93651
Block(1209) = 1209 ^ 13 mod 2951147 = 906866
Block(215) = 215 ^ 13 mod 2951147 = 396673
Block(13) = 13 ^ 13 mod 2951147 = 2564672
Block(2205) = 2205 ^ 13 mod 2951147 = 337248
Block(2) = 2 ^ 13 mod 2951147 = 8192
Block(151) = 151 ^ 13 mod 2951147 = 2348470
Block(32) = 32 ^ 13 mod 2951147 = 1191513
Block(197200) = 197200 ^ 13 mod 2951147 = 2266852
Block(200) = 200 ^ 13 mod 2951147 = 104075
Block(20) = 20 ^ 13 mod 2951147 = 93651
Block(519) = 519 ^ 13 mod 2951147 = 1459225
Block(9) = 9 ^ 13 mod 2951147 = 1601171
Block(21620) = 21620 ^ 13 mod 2951147 = 2477239
Block(5210) = 5210 ^ 13 mod 2951147 = 1598948
Block(203) = 203 ^ 13 mod 2951147 = 644537

-> Encrypted Message
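Each block transformation above is just modular exponentiation, which Java's BigInteger supports directly via modPow. A minimal sketch of one round-trip using the keys from this post (the block value 17 is the first one from the listing; the class name is an assumption for the sketch):

```java
import java.math.BigInteger;

// Minimal sketch of one encryption/decryption round-trip with the keys above:
// Block(x) = x ^ E mod N, and Ascii(y) = y ^ D mod N.
public class RsaBlockSketch {

    static final BigInteger N = BigInteger.valueOf(2951147);
    static final BigInteger E = BigInteger.valueOf(13);
    static final BigInteger D = BigInteger.valueOf(1133677);

    static BigInteger encrypt(BigInteger block) {
        return block.modPow(E, N);  // x^E mod N
    }

    static BigInteger decrypt(BigInteger cipher) {
        return cipher.modPow(D, N); // y^D mod N
    }

    public static void main(String[] args) {
        BigInteger block = BigInteger.valueOf(17);
        BigInteger cipher = encrypt(block); // 2920887, as in the listing
        System.out.println(block + " -> " + cipher + " -> " + decrypt(cipher));
    }
}
```

Decrypting with D always recovers the original block, because D was built as the modular inverse of E over Φ.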

Decryption Process

Upon receiving the encrypted message, the receiver needs the private key in order to decrypt it. The receiver has the same encryption machine, and the program decrypts the message as shown below.

-> Encrypted Message

-> Setting the private key
(N , D) = (2951147 , 1133677)

-> Decrypting each block
Ascii(x) = x ^ D mod N

Ascii(2920887) = 2920887 ^ 1133677 mod 2951147 = 17
Ascii(1483408) = 1483408 ^ 1133677 mod 2951147 = 71
Ascii(363316) = 363316 ^ 1133677 mod 2951147 = 972
Ascii(1419505) = 1419505 ^ 1133677 mod 2951147 = 1419
Ascii(213548) = 213548 ^ 1133677 mod 2951147 = 920
Ascii(1) = 1 ^ 1133677 mod 2951147 = 1
Ascii(93651) = 93651 ^ 1133677 mod 2951147 = 20
Ascii(1394993) = 1394993 ^ 1133677 mod 2951147 = 8
Ascii(2878680) = 2878680 ^ 1133677 mod 2951147 = 2082
Ascii(770001) = 770001 ^ 1133677 mod 2951147 = 1113
Ascii(2301917) = 2301917 ^ 1133677 mod 2951147 = 2200
Ascii(787047) = 787047 ^ 1133677 mod 2951147 = 20113
Ascii(424239) = 424239 ^ 1133677 mod 2951147 = 2183
Ascii(1557862) = 1557862 ^ 1133677 mod 2951147 = 19
Ascii(2854397) = 2854397 ^ 1133677 mod 2951147 = 7
Ascii(375871) = 375871 ^ 1133677 mod 2951147 = 208
Ascii(491468) = 491468 ^ 1133677 mod 2951147 = 2012
Ascii(2348470) = 2348470 ^ 1133677 mod 2951147 = 151
Ascii(966721) = 966721 ^ 1133677 mod 2951147 = 58
Ascii(2596853) = 2596853 ^ 1133677 mod 2951147 = 13219
Ascii(1336058) = 1336058 ^ 1133677 mod 2951147 = 820
Ascii(2624815) = 2624815 ^ 1133677 mod 2951147 = 11991
Ascii(1340264) = 1340264 ^ 1133677 mod 2951147 = 97
Ascii(1760166) = 1760166 ^ 1133677 mod 2951147 = 21721
Ascii(1685895) = 1685895 ^ 1133677 mod 2951147 = 52011
Ascii(1202590) = 1202590 ^ 1133677 mod 2951147 = 32215
Ascii(2752293) = 2752293 ^ 1133677 mod 2951147 = 21
Ascii(1414540) = 1414540 ^ 1133677 mod 2951147 = 1208
Ascii(1733373) = 1733373 ^ 1133677 mod 2951147 = 21820
Ascii(1879414) = 1879414 ^ 1133677 mod 2951147 = 5
Ascii(310870) = 310870 ^ 1133677 mod 2951147 = 21020
Ascii(519822) = 519822 ^ 1133677 mod 2951147 = 3132
Ascii(1315135) = 1315135 ^ 1133677 mod 2951147 = 212
Ascii(2430603) = 2430603 ^ 1133677 mod 2951147 = 2142
Ascii(748920) = 748920 ^ 1133677 mod 2951147 = 1119
Ascii(2808982) = 2808982 ^ 1133677 mod 2951147 = 8208
Ascii(93651) = 93651 ^ 1133677 mod 2951147 = 20
Ascii(906866) = 906866 ^ 1133677 mod 2951147 = 1209
Ascii(396673) = 396673 ^ 1133677 mod 2951147 = 215
Ascii(2564672) = 2564672 ^ 1133677 mod 2951147 = 13
Ascii(337248) = 337248 ^ 1133677 mod 2951147 = 2205
Ascii(8192) = 8192 ^ 1133677 mod 2951147 = 2
Ascii(2348470) = 2348470 ^ 1133677 mod 2951147 = 151
Ascii(1191513) = 1191513 ^ 1133677 mod 2951147 = 32
Ascii(2266852) = 2266852 ^ 1133677 mod 2951147 = 197200
Ascii(104075) = 104075 ^ 1133677 mod 2951147 = 200
Ascii(93651) = 93651 ^ 1133677 mod 2951147 = 20
Ascii(1459225) = 1459225 ^ 1133677 mod 2951147 = 519
Ascii(1601171) = 1601171 ^ 1133677 mod 2951147 = 9
Ascii(2477239) = 2477239 ^ 1133677 mod 2951147 = 21620
Ascii(1598948) = 1598948 ^ 1133677 mod 2951147 = 5210
Ascii(644537) = 644537 ^ 1133677 mod 2951147 = 203

-> Complete message in ASCII

-> Original Message
Marcello de Sales: because solving problems is addicting

This is fun! Now you can talk about anything with your peers :D. A practical application is in Internet chat rooms: I've used Adium for Mac to chat through a secure communication channel over the Internet :). I keep wondering what the keys are while I'm communicating (when I'm bored…).

TF-IDF in Hadoop Part 3: Documents in Corpus and TFIDF Computation

The previous 2 parts of this post did a small part of the job of calculating the TF-IDF for each "term" in the different documents in the "corpus". Since the implementation depends on concepts of Information Retrieval, especially for starters in the subject, take a look at the book by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. The authors, professors at Stanford and Stuttgart Universities, provide different exercises on the subject, and Chapter 7 has good resources on the basic concepts of the TF-IDF algorithm. As I mentioned before, I first read about the term 7 years ago while writing my BS in Computer Science degree report (Portuguese) on the problem of user profile matching and clustering. Interestingly enough, I started learning Hadoop about 2 weeks ago and was stoked about it, because my first contact with MapReduce was actually with mongoDB, during my MS in Computer Science thesis, when I needed to generate a report over a data-centric collection gathered from an environmental Sensor Network, using mongoDB's MapReduce API over distributed mongoDB Shards. All in all, it seems the time to put this into practice is now :).

Job 3: Documents in Corpus and TF-IDF computation

In order to summarize the idea of scoring words based on their occurrence in the corpus, I will use graphical and textual examples of the algorithm to take advantage of this post and make clear what the exercises really are. Ricky Ho has implemented TF-IDF using Apache PIG and documented exactly the steps of the algorithm in the very nice diagram shown below. Ricky also made a good summary of the terms "term frequency" and "inverse document frequency", so check out his website in case you need it.

The TF-IDF MapReduce Phases by Ricky Ho

This post implements the third round of the implementation, where the number of documents per word is obtained by counting the size of the list that carries all the documents for each word, based on the output of the previous phase. Let's take a look at the data format from the previous job and see if it matches the description in the diagram (note that the terms are not sorted; I selected the term "therefore" at random):

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -cat 2-word-counts/part-r-00000 | less
therefore@all-shakespeare       652/738781
therefore@leornardo-davinci-all.txt     124/149612
therefore@the-outline-of-science-vol1.txt       36/70650

The output shows each term in each document, with its number of occurrences in that document and the total number of terms in the document. So, the final Mapper and Reducer were defined as follows:

  • Map:
    • Input: ((term@document), n/N)
    • Re-arrange the mapper to have the word as the key, since we need to count the number of documents where it occurs
    • Output: (term, document=n/N)
  • Reducer:
    • D = total number of documents in corpus. This can be passed by the driver as a constant;
    • d = number of documents in corpus where the term appears. It is a counter over the reduced values for each term;
    • TFIDF = n/N * log(D/d);
    • Output: ((word@document), d/D, (n/N), TFIDF)

This post shows the implementation of the third step, which counts the number of documents in the corpus in which each "term" appears, and calculates the TF-IDF. I have made some assumptions for the final output to better present the results and, of course, to keep the scope of the example manageable.

  • The first challenge was keeping this step as the last one in the exercise. The number of documents in corpus could be calculated by another MapReduce phase, as described in Cloudera's documentation; however, the class slides conclude by mentioning that this could be done without such an additional phase. I remembered from the classes that JobConf is used to pass parameters to jobs. So, I used the FileSystem class to count the number of documents 😀 in the original input directory, since that is a constant number. I tried using the Context/Configuration classes of the Hadoop 0.20.1 API to pass that number to the last Reducer, but get(key) returned null. So, the only way I could pass the number of documents was through the job name 🙂 I know, it is a dirty hack, but it works;
  • Since the number of documents in corpus is small, the chance that a word appears in all documents is much higher than when applying the algorithm to web indexing over thousands or millions of documents. The term log(totalDocs/docsPerWord) can then "null" the result (log(3/3) = 0), so I simplified the calculation by using tfIdf = tf in that case, since the term occurs in 100% of the documents in corpus (you could implement it as tfIdf = tf * 1 as well);
  • I decided to add more information to the output for documentation purposes. It shows [word@document, documentsFrequency/documentsCorpus, wordFrequency/totalWordsInDocument, TF-IDF];
  • The final result is formatted with only a few decimal points for the purpose of displaying the values in this exercise. In production, the precision of these values matters.
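The reducer's scoring boils down to one formula. A minimal standalone sketch of that computation, including the simplification above (falling back to tf when the term appears in every document); the class name is an assumption, and the sample numbers are the "therefore" counts from the listing:

```java
// Minimal sketch of the TF-IDF computation described above, with the
// simplification tfIdf = tf when the term occurs in all documents in corpus.
public class TfIdfSketch {

    static double tfIdf(long n, long bigN, long d, long bigD) {
        double tf = (double) n / bigN;              // term frequency n/N
        if (d == bigD) {
            // log(D/d) would be 0 and "null" the score; keep tf instead
            return tf;
        }
        return tf * Math.log10((double) bigD / d);  // tf * log(D/d)
    }

    public static void main(String[] args) {
        // "therefore" appears 652 times among the 738781 terms of all-shakespeare,
        // and in 3 of the 3 documents of this small corpus
        System.out.println(tfIdf(652, 738781, 3, 3));
    }
}
```

With a larger corpus where a term appears in only some documents, the log factor kicks in and rarer terms score higher, which is the whole point of the inverse document frequency.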

Job3, Mapper

package index;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * WordsInCorpusTFIDFMapper implements the Job 3 specification for the TF-IDF algorithm.
 * @author Marcello de Sales (
 */
public class WordsInCorpusTFIDFMapper extends Mapper<LongWritable, Text, Text, Text> {

    public WordsInCorpusTFIDFMapper() {
    }

    /**
     * @param key is the byte offset of the current line in the file;
     * @param value is the line from the file
     * @param context is used to emit the resulting key,value pair
     *     PRE-CONDITION: marcello@book.txt  \t  3/1500
     *     POST-CONDITION: marcello, book.txt=3/1500
     */
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] wordAndCounters = value.toString().split("\t");
        String[] wordAndDoc = wordAndCounters[0].split("@");   // ["marcello", "book.txt"]
        // re-emit with the word alone as the key: (word, document=n/N)
        context.write(new Text(wordAndDoc[0]), new Text(wordAndDoc[1] + "=" + wordAndCounters[1]));
    }
}
Job3, Reducer
package index;

import java.io.IOException;
import java.text.DecimalFormat;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * WordsInCorpusTFIDFReducer calculates the number of documents in corpus in which a given key occurs,
 * and the TF-IDF computation. The total number of documents D is acquired from the job name 🙂
 * It is a dirty hack, but the only way I could communicate the number from the driver.
 * @author Marcello de Sales
 */
public class WordsInCorpusTFIDFReducer extends Reducer<Text, Text, Text, Text> {

    private static final DecimalFormat DF = new DecimalFormat("###.########");

    public WordsInCorpusTFIDFReducer() {
    }

    /**
     * @param key is the key of the mapper
     * @param values are all the values aggregated during the mapping phase
     * @param context contains the context of the job run
     *             PRE-CONDITION: receive a list of <word, ["doc1=n1/N1", "doc2=n2/N2"]>
     *             POST-CONDITION: <"word@doc1", "[d/D, n/N, TF-IDF]">
     */
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // get the number of documents indirectly from the driver (stored in the job name on purpose)
        int numberOfDocumentsInCorpus = Integer.parseInt(context.getJobName());
        // number of documents in the corpus in which this word appears
        int numberOfDocumentsInCorpusWhereKeyAppears = 0;
        Map<String, String> tempFrequencies = new HashMap<String, String>();
        for (Text val : values) {
            String[] documentAndFrequencies = val.toString().split("=");
            numberOfDocumentsInCorpusWhereKeyAppears++;
            tempFrequencies.put(documentAndFrequencies[0], documentAndFrequencies[1]);
        }
        for (String document : tempFrequencies.keySet()) {
            String[] wordFrequenceAndTotalWords = tempFrequencies.get(document).split("/");

            // term frequency is the quotient of the word's count in the document
            // and the total number of words in the document
            double tf = Double.valueOf(wordFrequenceAndTotalWords[0])
                    / Double.valueOf(wordFrequenceAndTotalWords[1]);

            // inverse document frequency is the quotient of the number of docs in corpus
            // and the number of docs in which the term appears
            double idf = (double) numberOfDocumentsInCorpus / (double) numberOfDocumentsInCorpusWhereKeyAppears;

            // given that log(1) = 0, when the word appears in every document just use the term frequency
            double tfIdf = numberOfDocumentsInCorpus == numberOfDocumentsInCorpusWhereKeyAppears ?
                    tf : tf * Math.log10(idf);

            context.write(new Text(key + "@" + document), new Text("[" + numberOfDocumentsInCorpusWhereKeyAppears + "/"
                    + numberOfDocumentsInCorpus + " , " + wordFrequenceAndTotalWords[0] + "/"
                    + wordFrequenceAndTotalWords[1] + " , " + DF.format(tfIdf) + "]"));
        }
    }
}


I have implemented the test cases for both the Mapper and Reducer classes, but to keep this post short, I will skip those. Let’s take a look at the driver, since it captures the total number of documents directly from the filesystem using the built-in Hadoop API. Definitely no need for another MapReduce phase for that. As described in Cloudera’s training, the fewer phases, the better, since we save resource utilization :). Anyway, let’s go to the Driver implementation.
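The skipped test cases essentially exercise the mapper’s string handling, which can be checked without a Hadoop runtime at all. A minimal plain-Java sketch of what they verify (class and method names are illustrative):

```java
// Sketch of the string parsing the Job 3 mapper performs on each input line:
// "word@doc \t n/N"  ->  key "word", value "doc=n/N".
public class MapperParsingSketch {

    /** Returns {key, value} for a line such as "marcello@book.txt\t3/1500". */
    public static String[] parse(String line) {
        String[] wordAndCounters = line.split("\t");
        String[] wordAndDoc = wordAndCounters[0].split("@");
        return new String[] { wordAndDoc[0], wordAndDoc[1] + "=" + wordAndCounters[1] };
    }

    public static void main(String[] args) {
        String[] kv = parse("marcello@book.txt\t3/1500");
        System.out.println(kv[0] + " -> " + kv[1]);  // prints "marcello -> book.txt=3/1500"
    }
}
```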

Job3, Driver
package index;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * WordsInCorpusTFIDF creates the index of the words in documents,
 * mapping each of them to its TF-IDF value.
 * @author Marcello de Sales
 * @version "Hadoop 0.20.1"
 */
public class WordsInCorpusTFIDF extends Configured implements Tool {

    // where to put the data in hdfs when we're done
    private static final String OUTPUT_PATH = "3-tf-idf";

    // where to read the data from
    private static final String INPUT_PATH = "2-word-counts";

    public int run(String[] args) throws Exception {

        Configuration conf = getConf();
        Job job = new Job(conf, "Word in Corpus, TF-IDF");

        job.setJarByClass(WordsInCorpusTFIDF.class);
        job.setMapperClass(WordsInCorpusTFIDFMapper.class);
        job.setReducerClass(WordsInCorpusTFIDFReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));

        // Getting the number of documents from the original input directory.
        Path inputPath = new Path("input");
        FileSystem fs = inputPath.getFileSystem(conf);
        FileStatus[] stat = fs.listStatus(inputPath);

        // Dirty hack to pass the total number of documents as the job name.
        // The call to context.getConfiguration().get("docsInCorpus") returned null when I tried
        // conf.set("docsInCorpus", String.valueOf(stat.length)) or even
        // conf.setInt("docsInCorpus", stat.length)
        job.setJobName(String.valueOf(stat.length));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new WordsInCorpusTFIDF(), args);
        System.exit(res);
    }
}

Continuing with the implementation, the only thing left is to compile and run the final driver. Note that the input directory is the one containing the partial counts per document from Job 2, that is, “2-word-counts”. The output is the directory reserved for step 3, “3-tf-idf”. As mentioned, the only way I could send the total number of documents in the corpus was through the job name; the Hadoop 0.20.1 API would not pass the configuration values through at any cost. As documented above, I tried using the context reference in the reducer class, but the call “context.getConfiguration().get(“docsInCorpus”)” only returned “null”. I gave up, looked for an alternative, and the job name was the only way I found :).
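For illustration, the driver’s “count the documents by listing the input directory” idea can be sketched in plain Java against a local directory, using java.nio.file instead of Hadoop’s FileSystem (the class name and the local “input” directory are hypothetical):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of the driver's document-counting step using the local filesystem
// instead of HDFS: list the input directory and count its regular files.
public class CorpusCounter {

    public static int countDocuments(Path inputDir) throws IOException {
        int count = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(inputDir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // "input" mirrors the HDFS directory used by the driver; hypothetical locally
        Path dir = Paths.get(args.length > 0 ? args[0] : "input");
        if (Files.isDirectory(dir)) {
            System.out.println(countDocuments(dir));
        }
    }
}
```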

I skipped the test session and compiled everything and ran the driver as follows:

training@training-vm:~/git/exercises/shakespeare$ ant
Buildfile: build.xml

    [javac] Compiling 11 source files to /home/training/git/exercises/shakespeare/bin

      [jar] Building jar: /home/training/git/exercises/shakespeare/indexer.jar


Then, finally running the calculator of words:

training@training-vm:~/git/exercises/shakespeare$ hadoop jar indexer.jar index.WordsInCorpusTFIDF
10/01/09 21:41:40 INFO input.FileInputFormat: Total input paths to process : 1
10/01/09 21:41:41 INFO mapred.JobClient: Running job: job_200912301017_0115
10/01/09 21:41:42 INFO mapred.JobClient:  map 0% reduce 0%
10/01/09 21:41:51 INFO mapred.JobClient:  map 100% reduce 0%
10/01/09 21:42:00 INFO mapred.JobClient:  map 100% reduce 100%
10/01/09 21:42:02 INFO mapred.JobClient: Job complete: job_200912301017_0115
10/01/09 21:42:02 INFO mapred.JobClient: Counters: 17
10/01/09 21:42:02 INFO mapred.JobClient:   Job Counters
10/01/09 21:42:02 INFO mapred.JobClient:     Launched reduce tasks=1
10/01/09 21:42:02 INFO mapred.JobClient:     Launched map tasks=1
10/01/09 21:42:02 INFO mapred.JobClient:     Data-local map tasks=1
10/01/09 21:42:02 INFO mapred.JobClient:   FileSystemCounters
10/01/09 21:42:02 INFO mapred.JobClient:     FILE_BYTES_READ=2017995
10/01/09 21:42:02 INFO mapred.JobClient:     HDFS_BYTES_READ=1920431
10/01/09 21:42:02 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=4036022
10/01/09 21:42:02 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2943390
10/01/09 21:42:02 INFO mapred.JobClient:   Map-Reduce Framework
10/01/09 21:42:02 INFO mapred.JobClient:     Reduce input groups=0
10/01/09 21:42:02 INFO mapred.JobClient:     Combine output records=0
10/01/09 21:42:02 INFO mapred.JobClient:     Map input records=48779
10/01/09 21:42:02 INFO mapred.JobClient:     Reduce shuffle bytes=2017995
10/01/09 21:42:02 INFO mapred.JobClient:     Reduce output records=0
10/01/09 21:42:02 INFO mapred.JobClient:     Spilled Records=97558
10/01/09 21:42:02 INFO mapred.JobClient:     Map output bytes=1920431
10/01/09 21:42:02 INFO mapred.JobClient:     Combine input records=0
10/01/09 21:42:02 INFO mapred.JobClient:     Map output records=48779
10/01/09 21:42:02 INFO mapred.JobClient:     Reduce input records=48779

Note that the final record counts are the same as in the previous job step, so everything went as expected. Taking a look at the file in the output directory:

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -ls 3-tf-idf
Found 2 items
drwxr-xr-x   - training supergroup          0 2010-01-09 21:41 /user/training/3-tf-idf/_logs
-rw-r--r--   1 training supergroup    2943390 2010-01-09 21:41 /user/training/3-tf-idf/part-r-00000

I decided to take a look at the same word I mentioned above, “therefore”. Here are the results; this time the output was automatically sorted by Hadoop.

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -cat 3-tf-idf/part-r-00000 | less
abook@leornardo-davinci-all.txt [1/3 , 3/149612 , 0.00000957]
aboriginal@the-outline-of-science-vol1.txt      [1/3 , 1/70650 , 0.00000675]
abortive@all-shakespeare        [2/3 , 4/738781 , 0.00000095]
therefore@all-shakespeare       [3/3 , 652/738781 , 0.00088253]
therefore@the-outline-of-science-vol1.txt       [3/3 , 36/70650 , 0.00050955]
therefore@leornardo-davinci-all.txt     [3/3 , 124/149612 , 0.00082881]

Looking back at chapter 7 of the book, we can draw the following conclusions from the output presented:

  • The term “therefore” is most relevant in the document “all-shakespeare”, since its occurrence there is more likely than in the other documents;
  • Other terms that do not appear in all documents, such as “abook”, “aboriginal” and “abortive”, have very small relevance for the given corpus of documents.
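The first conclusion can be double-checked with plain arithmetic taken from the output above (a sketch; since “therefore” appears in all 3 documents, its TF-IDF degenerates to the raw term frequency n/N):

```java
// Sketch verifying the relevance ranking of "therefore" across the three
// documents, using the n/N fractions printed in the job output.
public class ThereforeRanking {
    public static void main(String[] args) {
        double shakespeare = 652.0 / 738781;   // all-shakespeare
        double davinci = 124.0 / 149612;       // leornardo-davinci-all.txt
        double science = 36.0 / 70650;         // the-outline-of-science-vol1.txt
        // "therefore" scores highest in all-shakespeare
        System.out.printf("%.8f > %.8f > %.8f%n", shakespeare, davinci, science);
    }
}
```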

What’s Next?

I had so much fun with my first contact with Hadoop that I am going to the Cloudera Training here in the Bay Area in about 10 days. Although the training covers the Hadoop 0.18 API, I decided to use the Hadoop 0.20.1 API because I just wanted to try a “cleaner” API.

As for next steps from this exercise, one could produce a categorized classification of the terms per document in a data-centric way (document -> term -> tf-idf), or whatever your need is. Time to go play with Apache Pig and HBase.
