
Introduction

Concatenation of Strings is very easy in Java – all you need is a ‘+’. It can’t get any easier than that, right? Unfortunately there are a few pitfalls. One thing you should remember from your first Java lessons is a small albeit important detail: String objects are immutable. Once constructed they cannot be changed anymore.

Whenever you “change” the value of a String you create a new object and make that variable reference this new object. Appending a String to another existing one is the same kind of deal: a new String containing the stuff from both is created and the old one is dropped.

You might wonder why Strings are immutable in the first place. There are two very compelling reasons for it:

  1. Immutable basic types make things easier. If you pass a String to a function you can be sure that its value won’t change.
  2. Security. With mutable Strings one could bypass security checks by changing the value right after the check. (Same thing as the first point, really.)

The performance impact of String.concat()

Each time you append something via ‘+’ (String.concat()) a new String is created, the old stuff is copied, the new stuff is appended, and the old String is thrown away. The bigger the String gets the longer it takes – there is more to copy and more garbage is produced.
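To make this concrete, here is a minimal sketch of the pattern described above – building a String one character at a time with ‘+’ – that you can time yourself (absolute numbers will of course differ from machine to machine):

public class ConcatCost {
    public static void main(String[] args) {
        final int length = 65536; // same target length as in the measurement below
        long start = System.currentTimeMillis();
        String s = "";
        for (int i = 0; i < length; i++) {
            s = s + 'x'; // allocates a new String and copies all previous characters
        }
        System.out.println("length " + s.length() + " took "
                + (System.currentTimeMillis() - start) + " ms");
    }
}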

Creating a String with a length of 65536 (character by character) already takes about 22 seconds on an AMD64 X2 4200+. The following diagram illustrates the rapidly (quadratically) growing amount of required time:

Figure 1: StringBuilder vs StringBuffer vs String.concat

StringBuilder and StringBuffer are also shown, but at this scale they sit right on the x-axis. As you can see, String.concat() is slow. Amazingly slow, in fact. It’s so bad that the guys over at FindBugs added a detector for String.concat inside loops to their static code analysis tool.

When to use ‘+’

Using the ‘+’ operator for concatenation isn’t bad per se though. It’s very readable and it doesn’t necessarily affect performance. Let’s take a look at the kind of situations where you should use ‘+’.

a) Multi-line Strings:

String text=
    "line 1\n"+
    "line 2\n"+
    "line 3";

Since Java doesn’t feature a proper multi-line String construct like other languages, this kind of pattern is often used. If you really have to, you can embed massive blocks of text this way and there are no downsides at all: the compiler creates a single String out of this mess and no concatenation happens at runtime.

b) Short messages and the like:

System.out.println("x:"+x+" y:"+y);

The compiler transforms this to:

System.out.println((new StringBuilder()).append("x:").append(x).append(" y:").append(y).toString());

Looks pretty silly, doesn’t it? Well, it’s great that you don’t have to write that kind of code yourself. 😉

If you’re interested in byte code generation: According to Arno Unkrig (the amazing dude behind Janino) the optimal strategy is to use String.concat() for 2 or 3 operands, and StringBuilder for 4 or more operands (if available – otherwise StringBuffer). Sun’s compiler always uses StringBuilder/StringBuffer, though. Well, the difference is pretty negligible.
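Purely as an illustration of those two strategies (not code you would normally write by hand), a two-operand expression like "x:" + x could be compiled to either of the following forms:

public class ConcatStrategies {
    public static void main(String[] args) {
        int x = 42;
        // What Sun's javac effectively emits for "x:" + x:
        String viaBuilder = new StringBuilder().append("x:").append(x).toString();
        // The String.concat() strategy for few operands; non-String operands
        // have to be converted explicitly first:
        String viaConcat = "x:".concat(String.valueOf(x));
        System.out.println(viaBuilder.equals(viaConcat)); // true
    }
}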

When to use StringBuilder and StringBuffer

This one is easy to remember: use ’em whenever you assemble a String in a loop. If it’s a short piece of example code, a test program, or something completely unimportant you won’t necessarily need that though. Just keep in mind that ‘+’ isn’t always a good idea.
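A typical example of such a loop – a small sketch, where the method name is just for illustration:

import java.util.Arrays;
import java.util.List;

public class JoinExample {
    // Using '+' inside the loop would allocate a new String on every iteration;
    // StringBuilder appends into the same buffer instead.
    static String join(List<String> parts, char separator) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            if (sb.length() > 0) {
                sb.append(separator);
            }
            sb.append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(Arrays.asList("a", "b", "c"), ',')); // prints a,b,c
    }
}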

StringBuilder and StringBuffer compared

StringBuilder is rather new – it was introduced with Java 1.5. Unlike StringBuffer it isn’t synchronized, which makes it a tad faster:

Figure 2: StringBuilder vs StringBuffer

As you can see the graphs are sort of straight with a few bumps here and there caused by re-allocation. Also StringBuilder is indeed quite a bit faster. Use that one if you can.

Initial capacity

Both – StringBuilder and StringBuffer – allow you to specify the initial capacity in the constructor. Of course this was also a thing I had to experiment with. Creating a 0.5 MB String 50 times with different initial capacities:

Figure 3: StringBuilder and StringBuffer with different initial capacities

The step size was 8 and the default capacity is 16. So, the default is the third dot. 16 chars is pretty small and as you can see it’s a very sensible default value.

If you take a closer look you can also see that there is some kind of rhythm: the best initial capacities (local optima) are always a power of two. And the worst results are always just before the next power of two. The perfect results are of course achieved if the required size is used from the very beginning (shown as dashed lines in the diagram) and no resizing happens at all.

Some insight

That power-of-two “beat” is of course specific to Sun’s implementations of StringBuilder and StringBuffer. Other implementations may show a slightly different behavior. However, if these particular implementations are taken as the target, one can derive two golden rules from these results (a small sketch follows the list):

  1. If you set the capacity use a power of two value.
  2. Do not use the String/CharSequence constructors ever. They set the capacity to the length of the given String/CharSequence + 16, which can be virtually anything.
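A minimal sketch of both rules in one place (the capacities below are just example values):

public class CapacityRules {
    public static void main(String[] args) {
        // Rule 1: if you set the capacity, pick a power of two (here 2^16 = 65536 chars).
        StringBuilder sb = new StringBuilder(1 << 16);

        // Rule 2: instead of new StringBuilder(someCharSequence) – which sets the
        // capacity to someCharSequence.length() + 16 – size the builder yourself
        // and append:
        CharSequence seed = "some existing content";
        StringBuilder sb2 = new StringBuilder(1 << 8);
        sb2.append(seed);
        System.out.println(sb2.capacity()); // prints 256, not seed.length() + 16
    }
}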

Benchmarking method

In order to get meaningful results I took care of a few things (a minimal harness sketch follows the list):

  • VM warmup
  • separate runs for each test
  • each sample is the median of 5 runs
  • inner loops were inside of each bench unit
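A minimal sketch of that measurement scheme – only the warm-up and median-of-5 part is shown; in the actual benchmarks each test additionally ran separately, and the bench unit below is just a placeholder:

import java.util.Arrays;

public class MedianBench {
    // Times one bench unit; the unit contains its own inner loop.
    static long timeOnce(Runnable benchUnit) {
        long start = System.nanoTime();
        benchUnit.run();
        return System.nanoTime() - start;
    }

    // Warm the VM up first, then report the median of 5 timed runs.
    static long medianOf5(Runnable benchUnit) {
        for (int i = 0; i < 3; i++) {
            timeOnce(benchUnit); // warm-up, result discarded
        }
        long[] samples = new long[5];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = timeOnce(benchUnit);
        }
        Arrays.sort(samples);
        return samples[samples.length / 2];
    }

    public static void main(String[] args) {
        Runnable unit = new Runnable() {
            public void run() {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < 100000; i++) {
                    sb.append('x');
                }
            }
        };
        System.out.println(medianOf5(unit) + " ns (median of 5)");
    }
}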

18/02/2014

R Functions – Part I

  • q() / quit() to quit R
  • install.packages("RJDBC") to install new packages from the internet
  • library("packagename") / require("packagename") to load a package into memory for the running program.
  • help(package='base') to get help. [ ?data.frame, help(data.frame), ??data.frame ]
  • search() to see which packages are currently loaded into memory.
  • ls() to print all variables from the environment.
  • print(a), paste(a), a to print an object
  • getwd(), setwd("c:\\testR") to get/set the working directory.
  • dir() to list all files in the current working directory
  • source("filename.R") to execute an R code file
  • C:\program files\R\R CMD BATCH abc.R to run a script from the command prompt in batch mode
  • typeof(object) to find the type of an object
  • as.integer(10.10) to get the integer value, is.integer(10.10) to test whether it is an integer
  • length(vector) to get the length of a vector
  • nchar(productname) to get the length of a string
  • vector <- c(1,2,3) to create a vector
  • multiply <- vector * vector to multiply one vector by another; the shorter vector is recycled
  • append(vector, c(4,5,6)) to append a new vector
  • myMatrix <- matrix(amount, nrow=2, ncol=3) to create a 2x3 matrix; data is populated column by column (vertically)
  • mixedDataTypeList <- list(c(1,2,3), c("a","b","c"))
  • df1 <- data.frame(student=c(1,2,3), name=c("a","b","c"))
  • summary(df1$student) :- get summary statistics
  • df1 <- rbind(df1, c(4,"d")) :- add a new row
  • head(df1,2), tail(df1,2) :- get the first and last n rows

 

Setup Cassandra Multinode Cluster

Install Cassandra :-

[1] Download Cassandra from the website [http://cassandra.apache.org/download/] and save it to the /data path.

[2] Execute the following command to extract the tar file:

 tar -zxvf apache-cassandra-*-bin.tar.gz

[3] Rename the apache-cassandra-* folder to apache-cassandra using “mv apache-cassandra-* apache-cassandra”.

[4] Create the directories where Cassandra stores all its data.

Create the following directories:
 mkdir /data/cassandradata/commitlog
 mkdir /data/cassandradata/data
 mkdir /data/cassandradata/logs
 mkdir /data/cassandradata/saved_caches

[5] Modify the following configuration settings on the main Cassandra node [here 192.168.1.90] by editing the conf/cassandra.yaml file. [Note :- do not use tabs in this file; Cassandra will give an error if a tab is used in the configuration file.]

  1. data_file_directories:
         - /data/cassandradata/data
  2. commitlog_directory: /data/cassandradata/commitlog
  3. saved_caches_directory: /data/cassandradata/saved_caches
  4. - seeds: "192.168.1.90" [IP address of the main Cassandra node through which other nodes connect to this machine / NIC address]
  5. listen_address: 192.168.1.90 [local node IP address / NIC address]
  6. rpc_address: 192.168.1.90 [local node IP address / NIC address]

[6] Modify log folder path in log4j-server.properties

 log4j.appender.R.File=/var/log/cassandra/system.log

[7] Start the server by executing the following command; logs will be displayed on the screen.

 bin/cassandra -f     (remove the -f command line argument to run the server as a background process)

[8] Execute the following command to test the server and connect to it via the command line.

bin/cassandra-cli [or, if localhost does not work, use: bin/cassandra-cli --host 192.168.1.90]
 You will get the following output :-
 Connected to: "Test Cluster" on 127.0.0.1/9160
 Welcome to Cassandra CLI version 1.0.7
Type 'help;' or '?' for help.
 Type 'quit;' or 'exit;' to quit.
[default@unknown]
You may use the "help;" command to get more help.
 [default@unknown] help;

Create the DEMO keyspace:

 [default@unknown] create keyspace DEMO;
 f53dff10-5bd8-11e1-0000-915a024292eb
 Waiting for schema agreement...
 ... schemas agree across the cluster

Authenticate to the DEMO keyspace (use DEMO;) and create a Users column family:

 [default@DEMO] create column family Users
 ... with key_validation_class = 'UTF8Type'
 ... and comparator = 'UTF8Type'
 ... and default_validation_class = 'UTF8Type';

Now you can store data into Users column family:

 [default@DEMO] set Users[1234][name] = scott;
 Value inserted.
 Elapsed time: 10 msec(s).
 [default@DEMO] set Users[1234][password] = tiger;
 Value inserted.
 Elapsed time: 10 msec(s).

You have inserted a row into the Users column family. The row key is ‘1234’, and we set values for two columns in the row: ‘name’, and ‘password’.
Now let’s fetch the data you inserted:

 [default@DEMO] get Users[1234];
 => (column=name, value=scott, timestamp=1350769161684000)
 => (column=password, value=tiger, timestamp=1350769245191000)
Returned 2 results.
 Elapsed time: 67 msec(s).
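The same set and get can also be done from a Java client. Below is a minimal, hedged sketch assuming the Hector client library (me.prettyprint.*) is on the classpath; the cluster, keyspace and column family names match the CLI session above, but treat the exact method names as an illustration rather than a reference:

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;
import me.prettyprint.hector.api.query.ColumnQuery;

public class UsersExample {
    public static void main(String[] args) {
        // Connect to the cluster and the DEMO keyspace created above.
        Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "192.168.1.90:9160");
        Keyspace keyspace = HFactory.createKeyspace("DEMO", cluster);
        StringSerializer ss = StringSerializer.get();

        // Equivalent of: set Users[1234][name] = scott; (and the password column)
        Mutator<String> mutator = HFactory.createMutator(keyspace, ss);
        mutator.insert("1234", "Users", HFactory.createStringColumn("name", "scott"));
        mutator.insert("1234", "Users", HFactory.createStringColumn("password", "tiger"));

        // Equivalent of: get Users[1234]; (reading back a single column here)
        ColumnQuery<String, String, String> query = HFactory.createStringColumnQuery(keyspace);
        query.setColumnFamily("Users").setKey("1234").setName("name");
        HColumn<String, String> column = query.execute().get();
        System.out.println(column.getName() + " = " + column.getValue());

        HFactory.shutdownCluster(cluster);
    }
}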

[9] Now set up another node [i.e. 192.168.1.91]. Copy both the /data/apache-cassandra and /data/cassandradata folders to the new server and edit the following configuration parameters.

listen_address: 192.168.1.91 [Local Node Ip Address / NIC Card address]
rpc_address: 192.168.1.91 [Local Node Ip Address / NIC Card address]
 Execute the following command on the 192.168.1.90 server to generate the ring tokens for this cluster.
./tools/bin/token-generator 4 [Here 4 indicates that we want to configure a total of 4 nodes, so we need 4 distinct tokens.]

[root@cassandramaster apache-cassandra]# ./tools/bin/token-generator 4
DC #1:
Node #1: 0
Node #2: 42535295865117307932921825928971026432
Node #3: 85070591730234615865843651857942052864
Node #4: 127605887595351923798765477786913079296
Use the Node #2 token to set the initial_token parameter in the conf/cassandra.yaml file of the new node:
initial_token: 42535295865117307932921825928971026432

[10] Now, when the new node is started via "bin/cassandra", the server will be notified and you will see a message like: New node 192.168.1.91 is up and available in the cluster.

[11] Follow steps 9 and 10 to add more nodes to the Cassandra cluster.

With lightdm being the new graphical login manager in Ubuntu, users will need a way to disable it to boot into text mode; fortunately, the people behind lightdm have made that really easy to do.

Edit /etc/default/grub with your favorite editor,

sudo nano /etc/default/grub

Find out this line:

GRUB_CMDLINE_LINUX_DEFAULT="<whatever you find here>"

Change it to:

GRUB_CMDLINE_LINUX_DEFAULT="text"

Update Grub:

sudo update-grub

No need to remove or disable the lightdm upstart conf; it already handles that for you.

lightdm.conf

# Check kernel command-line for inhibitors, unless we are being called
# manually
for ARG in $(cat /proc/cmdline); do
    if [ "$ARG" = "text" ]; then
        plymouth quit || :
        stop
        exit 0
    fi
done

You will still be able to use X by typing startx after you have logged in.

What we want to do

In this tutorial, I will describe the required steps for setting up a multi-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux.

Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System and of MapReduce. HDFS is a highly fault-tolerant distributed file system and, like Hadoop, is designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets.

Cluster of machines running Hadoop at Yahoo! (Source: Yahoo!)

In a previous tutorial, I described how to set up a Hadoop single-node cluster on an Ubuntu box. The main goal of this tutorial is to get a more sophisticated Hadoop installation up and running, namely building a multi-node cluster using two Ubuntu boxes.

This tutorial has been tested with the following software versions:

Tutorial approach and structure

From two single-node clusters to a multi-node cluster – We will build a multi-node cluster using two Ubuntu boxes in this tutorial. In my humble opinion, the best way to do this for starters is to install, configure and test a “local” Hadoop setup for each of the two Ubuntu boxes, and in a second step to “merge” these two single-node clusters into one multi-node cluster in which one Ubuntu box will become the designated master (but also act as a slave with regard to data storage and processing), and the other box will become only a slave. It’s much easier to track down any problems you might encounter due to the reduced complexity of doing a single-node cluster setup first on each machine.


Prerequisites

Configuring single-node clusters first

The tutorial approach outlined above means that you should now read my previous tutorial on how to set up a Hadoop single-node cluster and follow the steps described there to build a single-node Hadoop cluster on each of the two Ubuntu boxes. It’s recommended that you use the same settings (e.g., installation locations and paths) on both machines, or otherwise you might run into problems later when we migrate the two machines to the final multi-node cluster setup.

Just keep in mind when setting up the single-node clusters that we will later connect and “merge” the two machines, so pick reasonable network settings etc. now for a smooth transition later.

Done? Let’s continue then!

Now that you have two single-node clusters up and running, we will modify the Hadoop configuration to make one Ubuntu box the master [here dhimantVM1] (which will also act as a slave) and the other Ubuntu box a slave [here dhimantVM2].

We will call the designated dhimantVM1 machine just the master from now on and the slave-only machine the slave. We will also give the two machines these respective hostnames in their networking setup, most notably in /etc/hosts. If the hostnames of your machines are different (e.g. node01) then you must adapt the settings in this tutorial as appropriate.

Shut down each single-node cluster with bin/stop-all.sh before continuing if you haven’t done so already.

Networking

This should come as no surprise, but for the sake of completeness I have to point out that both machines must be able to reach each other over the network. The easiest is to put both machines in the same network with regard to hardware and software configuration, for example connect both machines via a single hub or switch and configure the network interfaces to use a common network such as 192.168.1.x/24.

To make it simple, I assigned the IP address 192.168.1.13 to the master machine and 192.168.1.14 to the slave machine, because I have configured my Mac with the IP address 192.168.1.12. Update /etc/hosts on both machines with the following lines:

# /etc/hosts (for master AND slave)
192.168.1.13    dhimantVM1
192.168.1.14    dhimantVM2

SSH access

The user on the dhimantVM1 (aka dhimant@dhimantVM1) must be able to connect a) to its own user account on the dhimantVM1 – i.e. ssh dhimantVM1 in this context and not necessarily ssh localhost – and b) to the user account on the slave (aka dhimant@slave) via a password-less SSH login. If you followed my single-node cluster tutorial, you just have to add the dhimant@dhimantVM1 public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to the authorized_keys file of dhimant@slave (in this user’s $HOME/.ssh/authorized_keys). You can do this manually or use the following SSH command:

dhimant@dhimantVM1:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub dhimant@dhimantVM2

This command will prompt you for the login password of the user on the slave, then copy the public SSH key for you, creating the correct directory and fixing the permissions as necessary.

The final step is to test the SSH setup by connecting from dhimantVM1 to the user account on the slave. This step is also needed to save the slave’s host key fingerprint to dhimant@dhimantVM1’s known_hosts file.

So, connecting from dhimantVM1 to dhimantVM1…

dhimant@dhimantVM1:~$ ssh dhimantVM1
Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-23-generic x86_64)
* Documentation: https://help.ubuntu.com/
System information disabled due to load higher than 1.0
0 packages can be updated.
0 updates are security updates.
Last login: Mon May 28 10:11:17 2012 from dhimantvm1
dhimant@dhimantVM1:~$

…and from dhimantVM1 to slave.

dhimant@dhimantVM1:~$ ssh dhimantVM2
Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-23-generic x86_64)
* Documentation: https://help.ubuntu.com/
System information disabled due to load higher than 1.0
Last login: Mon May 28 10:11:31 2012 from dhimantvm1
dhimant@dhimantVM2:~$

Hadoop

Cluster Overview (aka the goal)

The next sections will describe how to configure one Ubuntu box as a master[dhimantVM1] node and the other Ubuntu box as a slave node. The master node will also act as a slave because we only have two machines available in our cluster but still want to spread data storage and processing to multiple machines.

What the final multi-node cluster will look like.

The master[dhimantVM1] node will run the “master” daemons for each layer: NameNode for the HDFS storage layer, and JobTracker for the MapReduce processing layer. Both machines will run the “slave” daemons: DataNode for the HDFS layer, and TaskTracker for MapReduce processing layer. Basically, the “master” daemons are responsible for coordination and management of the “slave” daemons while the latter will do the actual data storage and data processing work.

Masters vs. Slaves

From the Hadoop documentation:

Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker, exclusively. These are the actual “master nodes”. The rest of the machines in the cluster act as both DataNode and TaskTracker. These are the slaves or “worker nodes”.

Configuration

conf/masters (master only)

Despite its name, the conf/masters file defines on which machines Hadoop will start secondary NameNodes in our multi-node cluster. In our case, this is just the dhimantVM1 machine. The primary NameNode and the JobTracker will always be the machines on which you run the bin/start-dfs.sh and bin/start-mapred.sh scripts, respectively (the primary NameNode and the JobTracker will be started on the same machine if you run bin/start-all.sh). Note that you can also start an Hadoop daemon manually on a machine via bin/hadoop-daemon.sh start [namenode | secondarynamenode | datanode | jobtracker | tasktracker], which will not take the conf/masters and conf/slaves files into account.

Here are more details regarding the conf/masters file, taken from the Hadoop HDFS user guide:

The secondary NameNode merges the fsimage and the edits log files periodically and keeps edits log size within a limit. It is usually run on a different machine than the primary NameNode since its memory requirements are on the same order as the primary NameNode. The secondary NameNode is started by bin/start-dfs.sh on the nodes specified in the conf/masters file.

Again, the machine on which bin/start-dfs.sh is run will become the primary NameNode.

On master, update conf/masters so that it looks like this:

 dhimantVM1

conf/slaves (master[dhimantVM1] only)

This conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (DataNodes and TaskTrackers) will be run. We want both the master box and the slave box to act as Hadoop slaves because we want both of them to store and process data.

On dhimantVM1, update conf/slaves so that it looks like this:

dhimantVM1
dhimantVM2

If you have additional slave nodes, just add them to the conf/slaves file, one per line (do this on all machines in the cluster).

dhimantVM1
dhimantVM2
dhimantVM3
dhimantVM4
dhimantVM5

Note: The conf/slaves file on dhimantVM1 is used only by the scripts like bin/start-dfs.sh or bin/stop-dfs.sh. For example, if you want to add DataNodes on the fly (which is not described in this tutorial yet), you can “manually” start the DataNode daemon on a new slave machine via bin/hadoop-daemon.sh start datanode. Using the conf/slaves file on the dhimantVM1 simply helps you to make “full” cluster restarts easier.

conf/*-site.xml (all machines)

Note: As of Hadoop 0.20.0, the configuration settings previously found in hadoop-site.xml were moved to conf/core-site.xml (fs.default.name), conf/mapred-site.xml (mapred.job.tracker) and conf/hdfs-site.xml (dfs.replication).

Assuming you configured each machine as described in the single-node cluster tutorial, you will only have to change a few variables.

Important: You have to change the configuration files conf/core-site.xml, conf/mapred-site.xml and conf/hdfs-site.xml on ALL machines as follows.

First, we have to change the fs.default.name variable (in conf/core-site.xml) which specifies the NameNode (the HDFS master [dhimantVM1]) host and port. In our case, this is the master machine.

<!-- In: conf/core-site.xml -->
<property>
 <name>hadoop.tmp.dir</name>
 <value>/home/dhimant/hadoop-dhimant</value>
 <description>A base for other temporary directories.</description>
 </property>
 <property>
 <name>fs.default.name</name>
 <value>hdfs://dhimantVM1:8020</value>
 </property>

Second, we have to change the mapred.job.tracker variable (in conf/mapred-site.xml) which specifies the JobTracker (MapReduce master [dhimantVM1]) host and port. Again, this is the dhimantVM1 in our case.

<!-- In: conf/mapred-site.xml -->
<property>
 <name>mapred.job.tracker</name>
 <value>dhimantVM1:8021</value>
 </property>

Third, we change the dfs.replication variable (in conf/hdfs-site.xml) which specifies the default block replication. It defines how many machines a single file should be replicated to before it becomes available. If you set this to a value higher than the number of slave nodes (more precisely, the number of DataNodes) that you have available, you will start seeing a lot of (Zero targets found, forbidden1.size=1) type errors in the log files.

The default value of dfs.replication is 3. However, we have only two nodes available, so we set dfs.replication to 2.

<!-- In: conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

Formatting the HDFS filesystem via the NameNode

Before we start our new multi-node cluster, we have to format Hadoop’s distributed filesystem (HDFS) for the NameNode. You need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop NameNode – this will cause all your data in the HDFS filesystem to be erased.

To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable on the NameNode), run the command

dhimant@dhimantVM1:~/hadoop$ bin/hadoop namenode -format

Background: The HDFS name table is stored on the NameNode’s (here: master[dhimantVM1]) local filesystem in the directory specified by dfs.name.dir. The name table is used by the NameNode to store tracking and coordination information for the DataNodes.

Starting the multi-node cluster

Starting the cluster is done in two steps. First, the HDFS daemons are started: the NameNode daemon is started on dhimantVM1, and DataNode daemons are started on all slaves (here: master and slave). Second, the MapReduce daemons are started: the JobTracker is started on master, and TaskTracker daemons are started on all slaves (here: master and slave).

HDFS daemons

Run the command bin/start-dfs.sh on the machine you want the (primary) NameNode to run on. This will bring up HDFS with the NameNode running on the machine you ran the previous command on, and DataNodes on the machines listed in the conf/slaves file.

In our case, we will run bin/start-dfs.sh on master:

dhimant@dhimantVM1:~/hadoop$ bin/start-dfs.sh 
Warning: $HADOOP_HOME is deprecated.
starting namenode, logging to /home/dhimant/hadoop/libexec/../logs/hadoop-dhimant-namenode-dhimantVM1.out
dhimantVM1: starting datanode, logging to /home/dhimant/hadoop/libexec/../logs/hadoop-dhimant-datanode-dhimantVM1.out
dhimantVM2: starting datanode, logging to /home/dhimant/hadoop/libexec/../logs/hadoop-dhimant-datanode-dhimantVM2.out
dhimantVM1: starting secondarynamenode, logging to /home/dhimant/hadoop/libexec/../logs/hadoop-dhimant-secondarynamenode-dhimantVM1.out
dhimant@dhimantVM1:~/hadoop$

As you can see in slave‘s output above, it will automatically format its storage directory (specified by dfs.data.dir) if it is not formatted already. It will also create the directory if it does not exist yet.

MapReduce daemons

Run the command bin/start-mapred.sh on the machine you want the JobTracker to run on. This will bring up the MapReduce cluster with the JobTracker running on the machine you ran the previous command on, and TaskTrackers on the machines listed in the conf/slaves file.

In our case, we will run bin/start-mapred.sh on master:

dhimant@dhimantVM1:~/hadoop$ bin/start-mapred.sh 
Warning: $HADOOP_HOME is deprecated.
starting jobtracker, logging to /home/dhimant/hadoop/libexec/../logs/hadoop-dhimant-jobtracker-dhimantVM1.out
dhimantVM2: starting tasktracker, logging to /home/dhimant/hadoop/libexec/../logs/hadoop-dhimant-tasktracker-dhimantVM2.out
dhimantVM1: starting tasktracker, logging to /home/dhimant/hadoop/libexec/../logs/hadoop-dhimant-tasktracker-dhimantVM1.out
dhimant@dhimantVM1:~/hadoop$

Stopping the multi-node cluster

Like starting the cluster, stopping it is done in two steps. The workflow is the opposite of starting, however. First, we begin with stopping the MapReduce daemons: the JobTracker is stopped on master, and TaskTracker daemons are stopped on all slaves (here: master and slave). Second, the HDFS daemons are stopped: the NameNode daemon is stopped on master, and DataNode daemons are stopped on all slaves (here: master and slave).

MapReduce daemons

Run the command bin/stop-mapred.sh on the JobTracker machine. This will shut down the MapReduce cluster by stopping the JobTracker daemon running on the machine you ran the previous command on, and TaskTrackers on the machines listed in the conf/slaves file.

In our case, we will run bin/stop-mapred.sh on master:

dhimant@dhimantVM1:~/hadoop$ bin/stop-mapred.sh 
Warning: $HADOOP_HOME is deprecated.
stopping jobtracker
dhimantVM2: stopping tasktracker
dhimantVM1: stopping tasktracker
dhimant@dhimantVM1:~/hadoop$

HDFS daemons

Run the command bin/stop-dfs.sh on the NameNode machine. This will shut down HDFS by stopping the NameNode daemon running on the machine you ran the previous command on, and DataNodes on the machines listed in the conf/slaves file.

In our case, we will run bin/stop-dfs.sh on master:

dhimant@dhimantVM1:~/hadoop$ bin/stop-dfs.sh 
Warning: $HADOOP_HOME is deprecated.
stopping namenode
dhimantVM1: stopping datanode
dhimantVM2: stopping datanode
dhimantVM1: stopping secondarynamenode
dhimant@dhimantVM1:~/hadoop$

Running a MapReduce job

We will now run your first Hadoop MapReduce job. We will use the WordCount example job which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. More information of what happens behind the scenes is available at the Hadoop Wiki.
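For reference, the job we are about to run is essentially the following program (a condensed sketch of the classic WordCount, not the exact source that ships in hadoop-examples-1.0.2.jar):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Emits (word, 1) for every token in a line of input.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Sums the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. /user/dhimant/input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // e.g. /user/dhimant/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}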

Copy example input data

We will first create an input folder in the HDFS file system and copy files to that folder.

dhimant@dhimantVM1:~/hadoop$ bin/hadoop dfs -mkdir input
Warning: $HADOOP_HOME is deprecated.
dhimant@dhimantVM1:~/hadoop$ bin/hadoop dfs -copyFromLocal conf/*.* /user/dhimant/input
Warning: $HADOOP_HOME is deprecated.

Run the MapReduce job

Now, we actually run the WordCount example job.

dhimant@dhimantVM1:~/hadoop$ bin/hadoop jar hadoop-examples-1.0.2.jar wordcount /user/dhimant/input /user/dhimant/output

This command will read all the files in the HDFS directory /user/dhimant/input, process them, and store the result in the HDFS directory /user/dhimant/output.

Exemplary output of the previous command in the console:

dhimant@dhimantVM1:~/hadoop$ bin/hadoop jar hadoop-examples-1.0.2.jar wordcount /user/dhimant/input /user/dhimant/output
Warning: $HADOOP_HOME is deprecated.
****hdfs://dhimantVM1:8020/user/dhimant/input
12/05/28 10:19:24 INFO input.FileInputFormat: Total input paths to process : 7
12/05/28 10:19:24 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/05/28 10:19:24 WARN snappy.LoadSnappy: Snappy native library not loaded
12/05/28 10:19:26 INFO mapred.JobClient: Running job: job_201205281013_0001
12/05/28 10:19:27 INFO mapred.JobClient: map 0% reduce 0%
12/05/28 10:20:55 INFO mapred.JobClient: map 28% reduce 0%
12/05/28 10:24:06 INFO mapred.JobClient: map 42% reduce 0%
12/05/28 10:25:06 INFO mapred.JobClient: map 48% reduce 0%
12/05/28 10:28:13 INFO mapred.JobClient: map 57% reduce 0%
12/05/28 10:32:20 INFO mapred.JobClient: map 69% reduce 0%
12/05/28 10:32:29 INFO mapred.JobClient: map 85% reduce 0%
12/05/28 10:34:47 INFO mapred.JobClient: map 100% reduce 0%
12/05/28 10:35:31 INFO mapred.JobClient: map 100% reduce 9%
12/05/28 10:35:38 INFO mapred.JobClient: map 100% reduce 66%
12/05/28 10:35:47 INFO mapred.JobClient: map 100% reduce 100%
12/05/28 10:36:14 INFO mapred.JobClient: Job complete: job_201205281013_0001
12/05/28 10:36:15 INFO mapred.JobClient: Counters: 29
12/05/28 10:36:15 INFO mapred.JobClient: Job Counters 
12/05/28 10:36:15 INFO mapred.JobClient: Launched reduce tasks=1
12/05/28 10:36:15 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=1798383
12/05/28 10:36:15 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/05/28 10:36:15 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/05/28 10:36:15 INFO mapred.JobClient: Launched map tasks=9
12/05/28 10:36:15 INFO mapred.JobClient: Data-local map tasks=9
12/05/28 10:36:15 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=888063
12/05/28 10:36:15 INFO mapred.JobClient: File Output Format Counters 
12/05/28 10:36:15 INFO mapred.JobClient: Bytes Written=1412500
12/05/28 10:36:15 INFO mapred.JobClient: FileSystemCounters
12/05/28 10:36:15 INFO mapred.JobClient: FILE_BYTES_READ=4462538
12/05/28 10:36:15 INFO mapred.JobClient: HDFS_BYTES_READ=6950814
12/05/28 10:36:15 INFO mapred.JobClient: FILE_BYTES_WRITTEN=7551061
12/05/28 10:36:15 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1412500
12/05/28 10:36:15 INFO mapred.JobClient: File Input Format Counters 
12/05/28 10:36:15 INFO mapred.JobClient: Bytes Read=6949995
12/05/28 10:36:15 INFO mapred.JobClient: Map-Reduce Framework
12/05/28 10:36:15 INFO mapred.JobClient: Map output materialized bytes=2915042
12/05/28 10:36:15 INFO mapred.JobClient: Map input records=137147
12/05/28 10:36:15 INFO mapred.JobClient: Reduce shuffle bytes=2915042
12/05/28 10:36:15 INFO mapred.JobClient: Spilled Records=507854
12/05/28 10:36:15 INFO mapred.JobClient: Map output bytes=11435843
12/05/28 10:36:15 INFO mapred.JobClient: CPU time spent (ms)=108340
12/05/28 10:36:15 INFO mapred.JobClient: Total committed heap usage (bytes)=787509248
12/05/28 10:36:15 INFO mapred.JobClient: Combine input records=1174991
12/05/28 10:36:15 INFO mapred.JobClient: SPLIT_RAW_BYTES=819
12/05/28 10:36:15 INFO mapred.JobClient: Reduce input records=201008
12/05/28 10:36:15 INFO mapred.JobClient: Reduce input groups=128513
12/05/28 10:36:15 INFO mapred.JobClient: Combine output records=201008
12/05/28 10:36:15 INFO mapred.JobClient: Physical memory (bytes) snapshot=1189154816
12/05/28 10:36:15 INFO mapred.JobClient: Reduce output records=128513
12/05/28 10:36:15 INFO mapred.JobClient: Virtual memory (bytes) snapshot=8574676992
12/05/28 10:36:15 INFO mapred.JobClient: Map output records=1174991
dhimant@dhimantVM1:~/hadoop$

Check if the result is successfully stored in HDFS directory /user/dhimant/output:

dhimant@dhimantVM1:~/hadoop$ bin/hadoop dfs -ls
Warning: $HADOOP_HOME is deprecated.
Found 2 items
drwxr-xr-x - dhimant supergroup 0 2012-05-27 18:10 /user/dhimant/input
drwxr-xr-x - dhimant supergroup 0 2012-05-27 18:34 /user/dhimant/output
dhimant@dhimantVM1:~/hadoop$ bin/hadoop dfs -ls /user/dhimant/output
Warning: $HADOOP_HOME is deprecated.
Found 3 items
-rw-r--r-- 1 dhimant supergroup 0 2012-05-27 18:34 /user/dhimant/output/_SUCCESS
drwxr-xr-x - dhimant supergroup 0 2012-05-27 18:27 /user/dhimant/output/_logs
-rw-r--r-- 1 dhimant supergroup 16126 2012-05-27 18:34 /user/dhimant/output/part-r-00000

If you want to modify some Hadoop settings on the fly like increasing the number of Reduce tasks, you can use the "-D" option:

dhimant@dhimantVM1:~/hadoop$ bin/hadoop jar hadoop-examples-1.0.2.jar wordcount -D mapred.reduce.tasks=16 /user/dhimant/input /user/dhimant/output

An important note about mapred.map.tasks: Hadoop does not honor mapred.map.tasks beyond considering it a hint. But it does accept the user-specified mapred.reduce.tasks and doesn’t manipulate that. You cannot force mapred.map.tasks, but you can specify mapred.reduce.tasks.
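If you write your own driver (as in the WordCount sketch earlier) instead of using the examples jar, the same setting can be made programmatically; a small sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceTasksExample {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        // Same effect as passing -D mapred.reduce.tasks=16 on the command line.
        job.setNumReduceTasks(16);
        // There is no equally firm setting for map tasks: mapred.map.tasks is
        // only a hint, and the actual number follows from the input splits.
    }
}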

Retrieve the job result from HDFS

To inspect the file, you can copy it from HDFS to the local file system. Alternatively, you can use the command

dhimant@dhimantVM1:~/hadoop$ bin/hadoop dfs -cat /user/dhimant/output/part-r-00000

to read the file directly from HDFS without copying it to the local file system. In this tutorial, we will copy the results to the local file system though.

dhimant@dhimantVM1:~/hadoop$ mkdir WordCountOutput
dhimant@dhimantVM1:~/hadoop$ bin/hadoop dfs -getmerge /user/dhimant/output WordCountOutput/Output
dhimant@dhimantVM1:~/hadoop$ head -10 WordCountOutput/Output 
"". 4
"*" 10
"alice,bob 10
"console" 1
"hadoop.root.logger". 1
"jks". 4
# 132
#*.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40 1
#*.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both 1
#Default 1
dhimant@dhimantVM1:~/hadoop$

Note that in this specific output the quote signs (“) enclosing the words in the head output above have not been inserted by Hadoop. They are the result of the word tokenizer used in the WordCount example, and in this case they matched the beginning of a quote in the input texts. Just inspect the part-r-00000 file further to see it for yourself.

The command fs -getmerge will simply concatenate any files it finds in the directory you specify. This means that the merged file might (and most likely will) not be sorted.

What we want to do

In this short tutorial, I will describe the required steps for setting up a single-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux.

Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System and of MapReduce. HDFS is a highly fault-tolerant distributed file system and, like Hadoop, is designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets.

Cluster of machines running Hadoop at Yahoo! (Source: Yahoo!)

The main goal of this tutorial is to get a simple Hadoop installation up and running so that you can play around with the software and learn more about it.

This tutorial has been tested with the following software versions:

Prerequisites

Sun Java 6

Hadoop requires a working Java 1.5.x (aka 5.0.x) installation. However, using Java 1.6.x (aka 6.0.x aka 6) is recommended for running Hadoop. For the sake of this tutorial, I will therefore describe the installation of Java 1.6.

Please see my other post on this here.

The full JDK will be placed in /usr/lib/jvm/java-6-sun (well, this directory is actually a symlink on Ubuntu).

After installation, make a quick check whether Sun’s JDK is correctly set up:

dhimant@dhimantVM1:~# java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)

Configuring SSH

Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the user we created in the previous section.

I assume that you have SSH up and running on your machine and configured it to allow SSH public key authentication. If not, there are several guides available.

First, we have to generate an SSH key for the current user.

dhimant@dhimantVM1:~/hadoop$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/dhimant/.ssh/id_rsa): 
/home/dhimant/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Your identification has been saved in /home/dhimant/.ssh/id_rsa.
Your public key has been saved in /home/dhimant/.ssh/id_rsa.pub.
The key fingerprint is:
31:65:a5:ab:31:c2:2d:50:f1:9f:52:f0:b0:e5:66:fb dhimant@dhimantVM1
The key's randomart image is:
+--[ RSA 2048]----+
| o.o +.. |
| . . X . |
| . = B |
| o . B + |
| + S = |
| o = . |
| . E |
| |
| |
+-----------------+
dhimant@dhimantVM1:~$ cat .ssh/id_rsa.pub >> .ssh/authorized_keys
dhimant@dhimantVM1:~$ ssh localhost
Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-23-generic x86_64)
* Documentation: https://help.ubuntu.com/
System information disabled due to load higher than 1.0
165 packages can be updated.
26 updates are security updates.
Last login: Sun May 27 17:40:49 2012 from dhimantvm1.local
dhimant@dhimantVM1:~$

Disabling IPv6

One problem with IPv6 on Ubuntu is that using 0.0.0.0 for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses of my Ubuntu box.
In my case, I realized that there’s no practical point in enabling IPv6 on a box when you are not connected to any IPv6 network. Hence, I simply disabled IPv6 on my Ubuntu machine. Your mileage may vary.

You can also disable IPv6 only for Hadoop as documented in HADOOP-3437. You can do so by adding the following line to conf/hadoop-env.sh:

export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

Hadoop

Installation

You have to download Hadoop from the Apache Download Mirrors and extract the contents of the Hadoop package to a location of your choice. I picked /home/dhimant/hadoop. Make sure to change the owner of all the files to the hduser user and hadoop group, for example:

$ sudo tar xzf hadoop-1.0.2.tar.gz
$ sudo mv hadoop-1.0.2 hadoop
$ sudo chown -R hduser:hadoop hadoop

(Just to give you the idea, YMMV — personally, I create a symlink from hadoop-1.0.2 to hadoop.)

Update $HOME/.bashrc

Add the following lines to the end of the $HOME/.bashrc file of user hduser. If you use a shell other than bash, you should of course update its appropriate configuration files instead of .bashrc.

# Set Hadoop-related environment variables
 export HADOOP_HOME=/home/dhimant/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
 export JAVA_HOME=/usr/lib/jvm/default-java
# Some convenient aliases and functions for running Hadoop-related commands
 unalias fs &> /dev/null
 alias fs="hadoop fs"
 unalias hls &> /dev/null
 alias hls="fs -ls"
# If you have LZO compression enabled in your Hadoop cluster and
 # compress job outputs with LZOP (not covered in this tutorial):
 # Conveniently inspect an LZOP compressed file from the command
 # line; run via:
 #
 # $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
 #
 # Requires installed 'lzop' command.
 #
 lzohead () {
 hadoop fs -cat $1 | lzop -dc | head -1000 | less
 }
# Add Hadoop bin/ directory to PATH
 export PATH=$PATH:$HADOOP_HOME/bin

You can repeat this exercise also for other users who want to use Hadoop.

Excursus: Hadoop Distributed File System (HDFS)

From The Hadoop Distributed File System: Architecture and Design:

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop project, which is part of the Apache Lucene project.

The following picture gives an overview of the most important HDFS components.

HDFS Architecture (source: http://hadoop.apache.org/core/docs/current/hdfs_design.html)

Configuration

Our goal in this tutorial is a single-node setup of Hadoop. More information of what we do in this section is available on the Hadoop Wiki.

hadoop-env.sh

The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /home/dhimant/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory.

Change

# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

to

# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/default-java

conf/*-site.xml

Note: As of Hadoop 0.20.0, the configuration settings previously found in hadoop-site.xml were moved to core-site.xml (hadoop.tmp.dir, fs.default.name), mapred-site.xml (mapred.job.tracker) and hdfs-site.xml (dfs.replication).

In this section, we will configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop’s Distributed File System, HDFS, even though our little “cluster” only contains our single local machine.

You can leave the settings below as is with the exception of the hadoop.tmp.dir variable which you have to change to the directory of your choice. We will use the directory /home/dhimant/hadoop-dhimant in this tutorial. Hadoop’s default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS, so don’t be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point.

Now we create the directory and set the required ownerships and permissions:

$ sudo mkdir -p /home/dhimant/hadoop-dhimant
$ sudo chown hduser:hadoop /home/dhimant/hadoop-dhimant
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /home/dhimant/hadoop-dhimant

If you forget to set the required ownerships and permissions, you will see a java.io.IOException when you try to format the name node in the next section.

Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.

In file conf/core-site.xml:

<!-- In: conf/core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/dhimant/hadoop-dhimant</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:8020</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

In file conf/mapred-site.xml:

<!-- In: conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:8021</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

In file conf/hdfs-site.xml:

<!-- In: conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

See Getting Started with Hadoop and the documentation in Hadoop’s API Overview if you have any questions about Hadoop’s configuration options.

Formatting the HDFS filesystem via the NameNode

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.

Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS).

To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command

dhimant@dhimantVM1:~/hadoop$ bin/hadoop namenode -format

The output will look like this:

dhimant@dhimantVM1:~/hadoop$ bin/hadoop namenode -format
Warning: $HADOOP_HOME is deprecated.
12/05/27 17:56:56 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = dhimantVM1/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.0.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0.2 -r 1304954; compiled by 'hortonfo' on Sat Mar 24 23:58:21 UTC 2012
************************************************************/
12/05/27 17:56:57 INFO util.GSet: VM type = 64-bit
12/05/27 17:56:57 INFO util.GSet: 2% max memory = 38.6675 MB
12/05/27 17:56:57 INFO util.GSet: capacity = 2^22 = 4194304 entries
12/05/27 17:56:57 INFO util.GSet: recommended=4194304, actual=4194304
12/05/27 17:56:57 INFO namenode.FSNamesystem: fsOwner=dhimant
12/05/27 17:56:58 INFO namenode.FSNamesystem: supergroup=supergroup
12/05/27 17:56:58 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/05/27 17:56:58 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/05/27 17:56:58 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/05/27 17:56:58 INFO namenode.NameNode: Caching file names occuring more than 10 times 
12/05/27 17:56:58 INFO common.Storage: Image file of size 113 saved in 0 seconds.
12/05/27 17:56:58 INFO common.Storage: Storage directory /home/dhimant/hadoop-dhimant/dfs/name has been successfully formatted.
12/05/27 17:56:58 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at dhimantVM1/127.0.1.1
************************************************************/

Starting your single-node cluster

Run the command:

dhimant@dhimantVM1:~/hadoop$ bin/start-all.sh

This will start up a NameNode, DataNode, JobTracker and a TaskTracker on your machine.

The output will look like this:

dhimant@dhimantVM1:~/hadoop$ bin/start-all.sh 
Warning: $HADOOP_HOME is deprecated.
starting namenode, logging to /home/dhimant/hadoop/libexec/../logs/hadoop-dhimant-namenode-dhimantVM1.out
localhost: starting datanode, logging to /home/dhimant/hadoop/libexec/../logs/hadoop-dhimant-datanode-dhimantVM1.out
localhost: starting secondarynamenode, logging to /home/dhimant/hadoop/libexec/../logs/hadoop-dhimant-secondarynamenode-dhimantVM1.out
starting jobtracker, logging to /home/dhimant/hadoop/libexec/../logs/hadoop-dhimant-jobtracker-dhimantVM1.out
localhost: starting tasktracker, logging to /home/dhimant/hadoop/libexec/../logs/hadoop-dhimant-tasktracker-dhimantVM1.out

A nifty tool for checking whether the expected Hadoop processes are running is jps (part of Sun’s Java since v1.5.0). See also How to debug MapReduce programs.

You can also check with netstat if Hadoop is listening on the configured ports.

dhimant@dhimantVM1:~/hadoop$ netstat -plten | grep java
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp 0 0 0.0.0.0:50090 0.0.0.0:* LISTEN 1000 16215 3865/java 
tcp 0 0 0.0.0.0:60714 0.0.0.0:* LISTEN 1000 15719 3865/java 
tcp 0 0 0.0.0.0:50060 0.0.0.0:* LISTEN 1000 16885 4226/java 
tcp 0 0 0.0.0.0:50030 0.0.0.0:* LISTEN 1000 16327 3953/java 
tcp 0 0 127.0.0.1:51536 0.0.0.0:* LISTEN 1000 16225 4226/java 
tcp 0 0 127.0.0.1:8020 0.0.0.0:* LISTEN 1000 15482 3324/java 
tcp 0 0 127.0.0.1:8021 0.0.0.0:* LISTEN 1000 16216 3953/java 
tcp 0 0 0.0.0.0:41845 0.0.0.0:* LISTEN 1000 15268 3593/java 
tcp 0 0 0.0.0.0:50070 0.0.0.0:* LISTEN 1000 16109 3324/java 
tcp 0 0 0.0.0.0:50010 0.0.0.0:* LISTEN 1000 16200 3593/java 
tcp 0 0 0.0.0.0:50075 0.0.0.0:* LISTEN 1000 16237 3593/java 
tcp 0 0 0.0.0.0:54814 0.0.0.0:* LISTEN 1000 14510 3324/java 
tcp 0 0 0.0.0.0:50020 0.0.0.0:* LISTEN 1000 16616 3593/java 
tcp 0 0 0.0.0.0:39847 0.0.0.0:* LISTEN 1000 16053 3953/java 
dhimant@dhimantVM1:~/hadoop$

If there are any errors, examine the log files in the /home/dhimant/hadoop/logs/ directory.

Stopping your single-node cluster

Run the command

dhimant@dhimantVM1:~/hadoop$ bin/stop-all.sh 
Warning: $HADOOP_HOME is deprecated.
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
dhimant@dhimantVM1:~/hadoop$
to stop all the daemons running on your machine.

Running a MapReduce job

We will now run your first Hadoop MapReduce job. We will use the WordCount example job which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. More information of what happens behind the scenes is available at the Hadoop Wiki.

Copy example input data

We will first create an input folder in the HDFS file system and copy files to that folder.

dhimant@dhimantVM1:~/hadoop$ bin/hadoop dfs -mkdir input
Warning: $HADOOP_HOME is deprecated.
dhimant@dhimantVM1:~/hadoop$ bin/hadoop dfs -copyFromLocal conf/*.* /user/dhimant/input
Warning: $HADOOP_HOME is deprecated.

Run the MapReduce job

Now, we actually run the WordCount example job.

dhimant@dhimantVM1:~/hadoop$ bin/hadoop jar hadoop-examples-1.0.2.jar wordcount /user/dhimant/input /user/dhimant/output

This command will read all the files in the HDFS directory /user/dhimant/input, process them, and store the result in the HDFS directory /user/dhimant/output.

Exemplary output of the previous command in the console:

dhimant@dhimantVM1:~/hadoop$ bin/hadoop jar hadoop-examples-1.0.2.jar wordcount /user/dhimant/input /user/dhimant/output
Warning: $HADOOP_HOME is deprecated.
****hdfs://localhost:8020/user/dhimant/input
12/05/27 18:27:23 INFO input.FileInputFormat: Total input paths to process : 16
12/05/27 18:27:23 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/05/27 18:27:23 WARN snappy.LoadSnappy: Snappy native library not loaded
12/05/27 18:27:25 INFO mapred.JobClient: Running job: job_201205271804_0003
12/05/27 18:27:26 INFO mapred.JobClient: map 0% reduce 0%
12/05/27 18:28:50 INFO mapred.JobClient: map 12% reduce 0%
12/05/27 18:28:53 INFO mapred.JobClient: map 18% reduce 0%
12/05/27 18:30:40 INFO mapred.JobClient: map 37% reduce 0%
12/05/27 18:30:44 INFO mapred.JobClient: map 37% reduce 8%
12/05/27 18:30:50 INFO mapred.JobClient: map 37% reduce 12%
12/05/27 18:31:19 INFO mapred.JobClient: map 43% reduce 12%
12/05/27 18:31:21 INFO mapred.JobClient: map 50% reduce 12%
12/05/27 18:31:39 INFO mapred.JobClient: map 62% reduce 12%
12/05/27 18:33:13 INFO mapred.JobClient: map 62% reduce 14%
12/05/27 18:33:19 INFO mapred.JobClient: map 62% reduce 16%
12/05/27 18:33:27 INFO mapred.JobClient: map 62% reduce 20%
12/05/27 18:33:58 INFO mapred.JobClient: map 68% reduce 20%
12/05/27 18:34:01 INFO mapred.JobClient: map 75% reduce 20%
12/05/27 18:34:08 INFO mapred.JobClient: map 87% reduce 22%
12/05/27 18:34:11 INFO mapred.JobClient: map 87% reduce 25%
12/05/27 18:34:24 INFO mapred.JobClient: map 93% reduce 29%
12/05/27 18:34:27 INFO mapred.JobClient: map 100% reduce 29%
12/05/27 18:34:39 INFO mapred.JobClient: map 100% reduce 100%
12/05/27 18:34:47 INFO mapred.JobClient: Job complete: job_201205271804_0003
12/05/27 18:34:48 INFO mapred.JobClient: Counters: 29
12/05/27 18:34:48 INFO mapred.JobClient: Job Counters 
12/05/27 18:34:48 INFO mapred.JobClient: Launched reduce tasks=1
12/05/27 18:34:48 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=1541542
12/05/27 18:34:48 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/05/27 18:34:48 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/05/27 18:34:48 INFO mapred.JobClient: Launched map tasks=19
12/05/27 18:34:48 INFO mapred.JobClient: Data-local map tasks=19
12/05/27 18:34:48 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=339007
12/05/27 18:34:48 INFO mapred.JobClient: File Output Format Counters 
12/05/27 18:34:48 INFO mapred.JobClient: Bytes Written=16126
12/05/27 18:34:48 INFO mapred.JobClient: FileSystemCounters
12/05/27 18:34:48 INFO mapred.JobClient: FILE_BYTES_READ=27338
12/05/27 18:34:48 INFO mapred.JobClient: HDFS_BYTES_READ=32782
12/05/27 18:34:48 INFO mapred.JobClient: FILE_BYTES_WRITTEN=423352
12/05/27 18:34:48 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=16126
12/05/27 18:34:48 INFO mapred.JobClient: File Input Format Counters 
12/05/27 18:34:48 INFO mapred.JobClient: Bytes Read=30805
12/05/27 18:34:48 INFO mapred.JobClient: Map-Reduce Framework
12/05/27 18:34:48 INFO mapred.JobClient: Map output materialized bytes=27428
12/05/27 18:34:48 INFO mapred.JobClient: Map input records=849
12/05/27 18:34:48 INFO mapred.JobClient: Reduce shuffle bytes=27428
12/05/27 18:34:48 INFO mapred.JobClient: Spilled Records=2714
12/05/27 18:34:48 INFO mapred.JobClient: Map output bytes=41337
12/05/27 18:34:48 INFO mapred.JobClient: CPU time spent (ms)=33260
12/05/27 18:34:48 INFO mapred.JobClient: Total committed heap usage (bytes)=2311716864
12/05/27 18:34:48 INFO mapred.JobClient: Combine input records=3018
12/05/27 18:34:48 INFO mapred.JobClient: SPLIT_RAW_BYTES=1977
12/05/27 18:34:48 INFO mapred.JobClient: Reduce input records=1357
12/05/27 18:34:48 INFO mapred.JobClient: Reduce input groups=827
12/05/27 18:34:48 INFO mapred.JobClient: Combine output records=1357
12/05/27 18:34:48 INFO mapred.JobClient: Physical memory (bytes) snapshot=2657071104
12/05/27 18:34:48 INFO mapred.JobClient: Reduce output records=827
12/05/27 18:34:48 INFO mapred.JobClient: Virtual memory (bytes) snapshot=18063450112
12/05/27 18:34:48 INFO mapred.JobClient: Map output records=3018

Check if the result is successfully stored in HDFS directory /user/dhimant/output:

dhimant@dhimantVM1:~/hadoop$ bin/hadoop dfs -ls
Warning: $HADOOP_HOME is deprecated.
Found 2 items
drwxr-xr-x - dhimant supergroup 0 2012-05-27 18:10 /user/dhimant/input
drwxr-xr-x - dhimant supergroup 0 2012-05-27 18:34 /user/dhimant/output
dhimant@dhimantVM1:~/hadoop$ bin/hadoop dfs -ls /user/dhimant/output
Warning: $HADOOP_HOME is deprecated.
Found 3 items
-rw-r--r-- 1 dhimant supergroup 0 2012-05-27 18:34 /user/dhimant/output/_SUCCESS
drwxr-xr-x - dhimant supergroup 0 2012-05-27 18:27 /user/dhimant/output/_logs
-rw-r--r-- 1 dhimant supergroup 16126 2012-05-27 18:34 /user/dhimant/output/part-r-00000

If you want to modify some Hadoop settings on the fly, like increasing the number of reduce tasks, you can use the "-D" option:

dhimant@dhimantVM1:~/hadoop$ bin/hadoop jar hadoop-examples-1.0.2.jar wordcount -D mapred.reduce.tasks=16 /user/dhimant/input /user/dhimant/output

An important note about mapred.map.tasks: Hadoop does not honor mapred.map.tasks beyond treating it as a hint. It does, however, accept the user-specified mapred.reduce.tasks and does not manipulate that. In other words, you cannot force mapred.map.tasks, but you can specify mapred.reduce.tasks.

Retrieve the job result from HDFS

To inspect the file, you can copy it from HDFS to the local file system. Alternatively, you can use the command

dhimant@dhimantVM1:~/hadoop$ bin/hadoop dfs -cat /user/dhimant/output/part-r-00000

to read the file directly from HDFS without copying it to the local file system. In this tutorial, we will copy the results to the local file system though.

dhimant@dhimantVM1:~/hadoop$ mkdir WordCountOutput
dhimant@dhimantVM1:~/hadoop$ bin/hadoop dfs -getmerge /user/dhimant/output WordCountOutput/Output
dhimant@dhimantVM1:~/hadoop$ head -10 WordCountOutput/Output 
"". 4
"*" 10
"alice,bob 10
"console" 1
"hadoop.root.logger". 1
"jks". 4
# 132
#*.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40 1
#*.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both 1
#Default 1
dhimant@dhimantVM1:~/hadoop$

Note that in this specific output the quote signs (") enclosing the words in the head output above have not been inserted by Hadoop. They are the result of the word tokenizer used in the WordCount example, and in this case they matched the beginning of a quote in the ebook texts. Just inspect the part-r-00000 file further to see it for yourself.

The command fs -getmerge will simply concatenate any files it finds in the directory you specify. This means that the merged file might (and most likely will) not be sorted.

Hadoop Web Interfaces

Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:

http://localhost:50030/ – web UI for the MapReduce JobTracker
http://localhost:50060/ – web UI for the TaskTracker
http://localhost:50070/ – web UI for the HDFS NameNode

These web interfaces provide concise information about what’s happening in your Hadoop cluster. You might want to give them a try.

MapReduce Job Tracker Web Interface

The JobTracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file. It also gives access to the local machine's Hadoop log files (the machine the web UI is running on).

By default, it’s available at http://localhost:50030/.

A screenshot of Hadoop’s Job Tracker web interface.

Task Tracker Web Interface

The TaskTracker web UI shows you running and non-running tasks. It also gives access to the local machine's Hadoop log files.

By default, it’s available at http://localhost:50060/.

A screenshot of Hadoop’s Task Tracker web interface.

HDFS Name Node Web Interface

The NameNode web UI shows you a cluster summary including information about total/remaining capacity, and live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine's Hadoop log files.

By default, it’s available at http://localhost:50070/.

A screenshot of Hadoop’s Name Node web interface.

What’s next?

If you’re feeling comfortable, you can continue your Hadoop experience with my follow-up tutorial Running Hadoop On Ubuntu Linux (Multi-Node Cluster) where I describe how to build a Hadoop multi-node cluster with two Ubuntu boxes (this will increase your current cluster size by 100% :-P ).

Installing Java on other distributions of Linux can be difficult, but on Ubuntu it is usually simple as long as you have an internet connection.

  1. Open a Terminal (Applications > Accessories > Terminal).
  2. Type the following into the terminal:
         sudo add-apt-repository ppa:ferramroberto/java 
         sudo apt-get update 
         sudo apt-get install sun-java6-bin sun-java6-fonts sun-java6-jre sun-java6-plugin
  3. Wait until it finishes loading; the terminal will then change into an old-fashioned window with no mouse interface. Use the down arrow to scroll down until you can't scroll further, then press the right arrow and hit Enter to accept the license agreement.
  4. After accepting the license agreement, Java will download, and it will ask you several questions about configuration which you should answer based on what you want.

Java performance

Reducing time and space consumption

Abstract: We give some advice on improving the execution of Java programs by reducing their time and space consumption. There are no magic tricks, just advice on common problems to avoid.

1 Reducing time consumption

1.1 Standard code optimizations

Do not expect the Java compiler (such as javac or jikes) to perform many clever optimizations. Due to Java’s rather strict sequencing and thread semantics there is little the compiler can safely do to improve a Java program, in contrast to compilers for less strictly defined languages such as C or Fortran. But you can improve your Java source code yourself.

- Move loop-invariant computations out of loops. For example, avoid repeatedly computing the loop bound in a for-loop, like this:

  for (int i=0; i<size()*2; i++) { … }

  Instead, compute the loop bound only once and bind it to a local variable, like this:

  for (int i=0, stop=size()*2; i<stop; i++) { … }

- Do not compute the same subexpression twice:

  if (birds.elementAt(i).isGrower()) …
  if (birds.elementAt(i).isPullet()) …

  Instead, compute the subexpression once, bind the result to a variable, and reuse it:

  Bird bird = birds.elementAt(i);
  if (bird.isGrower()) …
  if (bird.isPullet()) …


- Every array access requires an index check, so it is worthwhile to reduce the number of array accesses. Moreover, the Java compiler usually cannot automatically optimize indexing into multidimensional arrays. For instance, every iteration of the inner (j) loop below recomputes the indexing rowsum[i] as well as the indexing arr[i] into the first dimension of arr:

  double[] rowsum = new double[n];
  for (int i=0; i<n; i++)
    for (int j=0; j<m; j++)
      rowsum[i] += arr[i][j];

  Instead, compute these indexings only once for each iteration of the outer loop:

  double[] rowsum = new double[n];
  for (int i=0; i<n; i++) {
    double[] arri = arr[i];
    double sum = 0.0;
    for (int j=0; j<m; j++)
      sum += arri[j];
    rowsum[i] = sum;
  }

  Note that the initialization arri = arr[i] does not copy row i of the array; it simply assigns an array reference (four bytes) to arri.

- Declare constant fields as final static so that the compiler can inline them and precompute constant expressions.

- Declare constant variables as final so that the compiler can inline them and precompute constant expressions.

- Replace a long if-else-if chain by a switch if possible; this is much faster.

- If a long if-else-if chain cannot be replaced by a switch (because it tests a String, for instance), and if it is executed many times, it is often worthwhile to replace it by a final static HashMap or similar, as in the sketch below.
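To make this concrete, here is a minimal sketch (my own, not from the original text; the command names and priority values are made up) of replacing a String-based if-else-if chain with a lookup in a final static HashMap:

import java.util.HashMap;
import java.util.Map;

public class Dispatch {
    // Built once when the class is loaded, then reused for every lookup.
    private final static Map<String, Integer> PRIORITY = new HashMap<String, Integer>();
    static {
        PRIORITY.put("low", 1);      // hypothetical keys and values,
        PRIORITY.put("normal", 5);   // chosen only for this example
        PRIORITY.put("high", 10);
    }

    public static int priorityOf(String name) {
        Integer p = PRIORITY.get(name);  // one hash lookup instead of a chain of equals() calls
        return (p != null) ? p : 0;      // default when the name is unknown
    }
}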

- Nothing (except obscurity) is achieved by using ‘clever’ C idioms such as performing the entire computation of a while-loop in the loop condition:

  int year = 0;
  double sum = 200.0;
  double[] balance = new double[100];
  while ((balance[year++] = sum *= 1.05) < 1000.0);

1.2 Fields and variables

- Access to local variables and parameters in a method is much faster than access to static or instance fields. For a field accessed in a loop, it may be worthwhile to copy the field's value to a local variable before the loop, and refer only to the local variable inside the loop, as in the sketch below.

- There is no runtime overhead for declaring variables inside nested blocks or loops in a method. It usually improves clarity to declare variables as locally as possible (with as small a scope as possible), and this may even help the compiler improve your program.
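As a small illustration (my own sketch, not from the original text; the Counter class and its total field are invented), copying an instance field into a local variable around a hot loop might look like this:

public class Counter {
    private int total;          // instance field: slower to access inside a tight loop

    public int sumUpTo(int n) {
        int t = total;          // copy the field into a local once, before the loop
        for (int i = 0; i < n; i++) {
            t += i;             // only the local variable is touched inside the loop
        }
        total = t;              // write the result back to the field once
        return t;
    }
}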


1.3 String manipulation

- Do not build strings by repeated string concatenation. The loop below takes time quadratic in the number of iterations and most likely causes heap fragmentation as well (see Section 2):

  String s = "";
  for (int i=0; i<n; i++) {
    s += "#" + i;
  }

  Instead, use a StringBuilder object and its append method. This takes time linear in the number of iterations, and may be several orders of magnitude faster:

  StringBuilder sbuf = new StringBuilder();
  for (int i=0; i<n; i++) {
    sbuf.append("#").append(i);
  }
  String s = sbuf.toString();

- On the other hand, an expression containing a sequence of string concatenations automatically gets compiled to use StringBuilder.append(…), so this is OK:

  String s = "(" + x + ", " + y + ")";

- Do not process strings by repeatedly searching or modifying a String or StringBuilder. Repeated use of methods such as substring and indexOf from String may be legitimate, but should be looked upon with suspicion.

1.4 Storing tables of constants in arrays

- Declaring an initialized array variable inside a method causes a new array to be allocated at every execution of the method:

  public static int monthdays(int y, int m) {
    int[] monthlengths =
      { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };
    return m == 2 && leapyear(y) ? 29 : monthlengths[m-1];
  }

  Instead, an initialized array variable or similar table should be declared and allocated once and for all as a final static field in the enclosing class:

  private final static int[] monthlengths =
    { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };

  public static int monthdays(int y, int m) {
    return m == 2 && leapyear(y) ? 29 : monthlengths[m-1];
  }

- More complicated initializations can use a static initializer block static { … } to precompute the contents of an array like this:

  private final static double[] logFac = new double[100];

  static {
    double logRes = 0.0;
    for (int i=1, stop=logFac.length; i<stop; i++)
      logFac[i] = logRes += Math.log(i);
  }

  public static double logBinom(int n, int k) {
    return logFac[n] - logFac[n-k] - logFac[k];
  }

  The static initializer is executed when the enclosing class is loaded. In this example it precomputes a table logFac of logarithms of the factorial function n! = 1 * 2 * ... * (n-1) * n, so that method logBinom(n,k) can efficiently compute the logarithm of a binomial coefficient. For instance, the number of ways to choose 7 cards out of 52 is Math.exp(logBinom(52, 7)), which equals 133 784 560.

1.5 Methods

- Declaring a method as private, final, or static makes calls to it faster. Of course, you should only do this when it makes sense in the application.

- For instance, often an accessor method such as getSize can reasonably be made final in a class, when there would be no point in overriding it in a subclass:

  class Foo {
    private int size;
    public final int getSize() {
      return size;
    }
  }

  This can make a call o.getSize() just as fast as a direct access to a public field o.size. There need not be any performance penalty for proper encapsulation (making fields private).

- Virtual method calls (to instance methods) are fast and should be used instead of instanceof tests and casts, as in the sketch below.

- In modern Java Virtual Machine implementations, such as Sun's HotSpot JVM and IBM's JVM, interface method calls are just as fast as virtual method calls to instance methods. Hence there is no performance penalty for maintenance-friendly programming, using interfaces instead of their implementing classes for method parameters and so on.
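As a rough illustration (my own sketch, not part of the original note; the Shape, Circle and Square classes are invented for this example), an instanceof-and-cast chain can be replaced by a single virtual call to an overridden instance method:

abstract class Shape {
    abstract double area();            // each subclass supplies its own implementation
}

class Circle extends Shape {
    double r;
    Circle(double r) { this.r = r; }
    double area() { return Math.PI * r * r; }
}

class Square extends Shape {
    double s;
    Square(double s) { this.s = s; }
    double area() { return s * s; }
}

class AreaDemo {
    // Avoid this style: an instanceof test plus a cast for every case.
    static double areaWithCasts(Shape sh) {
        if (sh instanceof Circle) {
            Circle c = (Circle) sh;
            return Math.PI * c.r * c.r;
        } else if (sh instanceof Square) {
            Square sq = (Square) sh;
            return sq.s * sq.s;
        }
        return 0.0;
    }

    // Prefer this: one virtual call does the dispatch.
    static double areaVirtual(Shape sh) {
        return sh.area();
    }
}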

1.6 Sorting and searching

- Never use selection sort, bubblesort or insertion sort, except on very short arrays or lists. Use heapsort (for arrays) or mergesort (for doubly linked lists) or quicksort (for arrays; but you must make a good choice of pivot element).

- Even better, use the built-in sorting routines, which are guaranteed to be fast: O(n log(n)) time for n elements, and sometimes faster if the data are nearly sorted already (see the short sketch at the end of this subsection):

  For arrays, use java.util.Arrays.sort, which is an improved quicksort; it uses no additional memory, but is not stable (does not preserve the order of equal elements). There are overloaded versions for all primitive types and for objects.

  For ArrayList<T> and LinkedList<T>, implementing interface java.util.List<T>, use java.util.Collections.sort, which is stable (preserves the order of equal elements) and smooth (near-linear time for nearly sorted lists) but uses additional memory.

- Avoid linear search in arrays and lists, except when you know that they are very short. If your program needs to look up something frequently, use one of these approaches:

  Binary search on sorted data: For arrays, use java.util.Arrays.binarySearch. The array must be sorted, as if by java.util.Arrays.sort. There are overloaded versions for all primitive types and for objects. For ArrayList<T>, use java.util.Collections.binarySearch. The array list must be sorted, as if by java.util.Collections.sort. If you also need to insert or remove elements from the set or map, use one of the approaches below instead.

  Hashing: Use HashSet<T> or HashMap<K,V> from package java.util if your key objects have a good hash function hashCode. This is the case for String and the wrapper classes Integer, Double, …, for the primitive types.

  Binary search trees: Use TreeSet<T> or TreeMap<K,V> from package java.util if your key objects have a good comparison function compareTo. This is the case for String and the wrapper classes Integer, Double, …, for the primitive types.
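A tiny usage sketch of the built-in sort and binary search (my own example; the data is made up):

import java.util.Arrays;

public class SortSearchDemo {
    public static void main(String[] args) {
        int[] ids = { 42, 7, 19, 3, 88 };        // made-up data for the example

        Arrays.sort(ids);                         // tuned O(n log n) sort for primitives
        int pos = Arrays.binarySearch(ids, 19);   // requires the array to be sorted first

        System.out.println(Arrays.toString(ids) + ", 19 found at index " + pos);
    }
}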

1.7 Exceptions

- The creation new Exception(…) of an exception object builds a stack trace, which is costly in time and space, and especially so in deeply recursive method calls. The creation of an object of class Exception or a subclass of Exception may be between 30 and 100 times slower than creation of an ordinary object. On the other hand, using a try-catch block or throwing an exception is fast.

- You can prevent the generation of this stack trace by overriding method fillInStackTrace in subclasses of Exception, as shown below. This makes creation of exception instances roughly 10 times faster.

  class MyException extends Exception {
    public Throwable fillInStackTrace() {
      return this;
    }
  }

- Thus you should create an exception object only if you actually intend to throw it. Also, do not use exceptions to implement control flow (end of data, termination of loops); use exceptions only to signal errors and exceptional circumstances (file not found, illegal input format, and so on). If your program does need to throw exceptions very frequently, reuse a single pre-created exception object, as in the sketch below.
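A sketch of such a reusable, pre-created exception (my own, not from the original text; the class name LookupFailed is invented). Note that, because the stack trace is suppressed, the object is only suitable as a cheap internal signal:

class LookupFailed extends Exception {
    // One shared, pre-created instance, reused for every throw.
    static final LookupFailed INSTANCE = new LookupFailed();

    private LookupFailed() { }

    // Suppress the stack trace so that creating (and reusing) this object stays cheap.
    public Throwable fillInStackTrace() {
        return this;
    }
}

// Elsewhere in the program:
//   throw LookupFailed.INSTANCE;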


1.8 Collection classes

Java’s collection classes in package java.util are well-designed and well-implemented. Using these classes can improve the speed of your program considerably, but you must beware of a few pitfalls.

- If you use HashSet<T> or HashMap<K,V>, make sure that your key objects have a good (uniform) and fast hashCode method, and that it agrees with the objects’ equals method.

- If you use TreeSet<T> or TreeMap<K,V>, make sure that your key objects have a good and fast compareTo method; or provide a Comparator<T> resp. Comparator<K> object explicitly when creating the TreeSet<T> or TreeMap<K,V>, as in the sketch below.
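For instance (a sketch of mine, not from the original text; the map contents are made up), a Comparator can be supplied when creating a TreeMap, here ordering String keys case-insensitively:

import java.util.TreeMap;

public class TreeMapDemo {
    public static void main(String[] args) {
        // The comparator is passed to the constructor and used for all key comparisons.
        TreeMap<String, Integer> scores =
            new TreeMap<String, Integer>(String.CASE_INSENSITIVE_ORDER);

        scores.put("Alice", 3);   // made-up entries
        scores.put("bob", 5);

        System.out.println(scores.firstKey());  // prints "Alice"
    }
}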

- Beware that indexing into a LinkedList<T> is not a constant-time operation. Hence the loop below takes time quadratic in the size of the list lst if lst is a LinkedList<T>, and should not be used:

  int size = lst.size();
  for (int i=0; i<size; i++)
    System.out.println(lst.get(i));

  Instead, use the enhanced for statement to iterate over the elements. It implicitly uses the collection's iterator, so the traversal takes linear time:

  for (T x : lst)
    System.out.println(x);

- Repeated calls to remove(Object o) on LinkedList<T> or ArrayList<T> should be avoided; it performs a linear search.

- Repeated calls to add(int i, T x) or remove(int i) on LinkedList<T> should be avoided, except when i is at the end or beginning of the linked list; both perform a linear traversal to get to the i'th element.

- Repeated calls to add(int i, T x) or remove(int i) on ArrayList<T> should be avoided, except when i is at the end of the ArrayList<T>; it needs to move all elements after i.

- Preferably avoid the legacy collection classes Vector, Hashtable and Stack, in which all methods are synchronized and every method call has a runtime overhead for obtaining a lock on the collection. If you do need a synchronized collection, use synchronizedCollection and similar methods from class java.util.Collections to create it.

- The collection classes can store only reference type data, so a value of a primitive type such as int, double, … must be wrapped as an Integer, Double, … object before it can be stored or used as a key in a collection. This takes time and space and may be unacceptable in memory-constrained embedded applications. Note that strings and arrays are reference type data and need not be wrapped.

  If you need to use collections that have primitive type elements or keys, consider using the Trove library, which provides special-case collections such as a hash set of int and so on. As a result it is faster and uses less memory than the general Java collection classes. Trove can be found at http://trove4j.sourceforge.net/.


1.9 Input and output

- Using buffered input and output (BufferedReader, BufferedWriter, BufferedInputStream, BufferedOutputStream from package java.io) can speed up input/output by a factor of 20, as in the sketch below.
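A minimal sketch of mine (the file name data.txt is made up) showing a FileReader wrapped in a BufferedReader:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LineCount {
    public static void main(String[] args) throws IOException {
        // The buffer lets the underlying file be read in large chunks,
        // rather than one character per call.
        BufferedReader in = new BufferedReader(new FileReader("data.txt"));
        int lines = 0;
        while (in.readLine() != null) {
            lines++;
        }
        in.close();
        System.out.println(lines + " lines");
    }
}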

- Using the compressed streams ZipInputStream and ZipOutputStream or GZIPInputStream and GZIPOutputStream from package java.util.zip may speed up the input and output of verbose data formats such as XML. Compression and decompression take CPU time, but the compressed data may be so much smaller than the uncompressed data that it saves time anyway, because less data must be read from disk or network. Also, it saves space on disk.

1.10 Space and object creation

- If your program uses too much space (memory), it will also use too much time: object allocation and garbage collection take time, and using too much memory leads to poor cache utilization and possibly even the need to use virtual memory (disk instead of RAM). Moreover, depending on the JVM's garbage collector, using much memory may lead to long collection pauses, which can be irritating in interactive systems and catastrophic in real-time applications.

- Object creation takes time (allocation, initialization, garbage collection), so do not create objects unnecessarily. However, do not introduce object pools (in factory classes) unless absolutely necessary. Most likely, you will just add code and maintenance problems, and your object pool may introduce subtle errors by recycling an object in the pool although it is still being referred to and modified from other parts of the program.

- Be careful that you do not create objects that are never used. For instance, it is a common mistake to build an error message string that is never actually used, because the exception in which the message is embedded gets caught by a try-catch that ignores the message.

- GUI components (created by AWT or Swing) may claim much space and may not be deallocated aggressively enough. Do not create GUI components that you do not actually need.

1.11 Bulk array operations

There are special methods for performing bulk operations on arrays. They are usually much faster than equivalent for loops, in part because they need to perform only a single bounds check. A short usage sketch follows the list.

- static void java.lang.System.arraycopy(src, si, dst, di, n) copies elements from array segment src[si..si+n-1] to array segment dst[di..di+n-1].

- static boolean java.util.Arrays.equals(arr1, arr2) returns true if the arrays arr1 and arr2 have the same length and their elements are pairwise equal. There are overloads of this method for arguments of type boolean[], byte[], char[], double[], float[], int[], long[], Object[] and short[].

- static void java.util.Arrays.fill(arr, x) sets all elements of array arr to x. This method has the same overloads as Arrays.equals.

- static void java.util.Arrays.fill(arr, i, j, x) sets elements arr[i..j-1] to x. This method has the same overloads as Arrays.equals.

- static int java.util.Arrays.hashCode(arr) returns a hash code for the array, computed from the hash codes of its elements. This method has the same overloads as Arrays.equals.
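A short usage sketch of mine (with made-up data) for the bulk methods listed above:

import java.util.Arrays;

public class BulkDemo {
    public static void main(String[] args) {
        int[] src = { 1, 2, 3, 4, 5 };              // made-up data
        int[] dst = new int[5];

        System.arraycopy(src, 0, dst, 0, 5);        // copy all five elements in one call
        System.out.println(Arrays.equals(src, dst)); // true

        Arrays.fill(dst, 1, 4, 0);                  // set dst[1..3] to 0
        System.out.println(Arrays.toString(dst));   // [1, 0, 0, 0, 5]
    }
}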


1.12 Scientific computing

If you are doing scientific computing in Java, the Colt open source library provides many high-performance and high-quality routines for linear algebra, sparse and dense matrices, statistical tools for data analysis, random number generators, array algorithms, mathematical functions and complex numbers. Don't write a new, inefficient and imprecise numerical routine if the one you need is already there. Colt can be found at http://hoschek.home.cern.ch/hoschek/colt/.

1.13 Reflection

- A reflective method call, reflective field access, and reflective object creation (using package java.lang.reflect) are far slower than ordinary method call, field access, and object creation.

- Access checks may further slow down such reflective calls; some of this cost may be avoided by declaring the class of the called method to be public. This has been seen to speed up reflective calls by a factor of 8.

1.14 Compiler and execution platform

- As mentioned above, a Java compiler cannot perform many of the optimizations that a C or Fortran compiler can. On the other hand, a just-in-time (JIT) compiler in the Java Virtual Machine (JVM) that executes the bytecode can perform many optimizations that a traditional compiler cannot perform.

- For example, a test (x instanceof C) conditionally followed by a cast (C)x may be optimized by a JVM so that at most one test is performed. Hence it is not worth the trouble to rewrite your program to avoid either the instanceof test or the cast.

- There are many different Java Virtual Machines (JVMs) with very different characteristics:

  Sun's HotSpot Client JVM performs some optimizations, but generally prioritizes fast startup over aggressive optimizations.

  Sun's HotSpot Server JVM (option -server, not available for Microsoft Windows) performs very aggressive optimizations at the expense of a longer startup delay.

  IBM's JVM performs very aggressive optimizations, comparable to Sun's HotSpot Server JVM.

  The JVMs in implementations of J2ME (mobile phones) and PersonalJava (some PDAs) do not include JIT compilation and probably perform no optimizations at all. Hence in this case it is even more important that you do as many optimizations as possible in the Java code yourself.

  I do not know the optimization characteristics of Oracle's JVM, the Kaffe JVM, Intel's Open Runtime Platform, IBM's Jikes RVM, …

  You can see what JVM you are using by typing java -version at a command-line prompt.


1.15 Profiling

If a Java program appears to be too slow, try to profile some runs of the program. Assume that the example that performs repeated string concatenation in Section 1.3 is in file MyExample.java. Then one can compile and profile it using Sun's HotSpot JVM as follows:

javac -g MyExample.java
java -Xprof MyExample 10000

The result of the profiling is shown on standard output (the console):

Flat profile of 19.00 secs (223 total ticks): main

  Interpreted + native    Method
   1.3%     1  +     0    java.lang.AbstractStringBuilder.append
   1.3%     1  +     0    java.lang.String.<init>
   2.6%     2  +     0    Total interpreted

     Compiled + native    Method
  51.3%     0  +    40    java.lang.AbstractStringBuilder.expandCapacity
  29.5%    23  +     0    java.lang.AbstractStringBuilder.append
  10.3%     8  +     0    java.lang.StringBuilder.toString
   6.4%     0  +     5    java.lang.String.<init>
  97.4%    31  +    45    Total compiled

  Thread-local ticks:
  65.0%   145             Blocked (of total)

Flat profile of 0.01 secs (1 total ticks): DestroyJavaVM

  Thread-local ticks:
 100.0%     1             Blocked (of total)

Global summary of 19.01 seconds:
 100.0%   929             Received ticks
  74.6%   693             Received GC ticks
   0.8%     7             Other VM operations

It says that 51.3 per cent of the computation time was spent in the native method expandCapacity and a further 29.5 per cent was spent in method append, both from class AbstractStringBuilder. This makes it plausible that the culprits are + and += on String, which are compiled into append calls. But what is even more significant is the bottom section, which says that 74.6 per cent of the total time was spent in garbage collection, and hence less than 25 per cent was spent in actual computation. This indicates a serious problem with allocation of too much data that almost immediately becomes garbage.


2 Reducing space consumption

- In a JVM, data are allocated on a call stack (for method parameters and local variables) and on a heap (for objects, including strings and arrays). There is a separate stack for every thread of execution, and a joint heap for all the threads. The stack of a thread grows and shrinks with the depth of method calls. Objects, strings and arrays are allocated in the heap by the executing threads; they are deallocated (garbage-collected) by an autonomous garbage collector.

- Three important aspects of space usage are allocation rate, retention and fragmentation:

  Allocation rate is the rate at which your program creates new objects, strings, and arrays. A high allocation rate costs time (for allocation, object initialization, and deallocation) and space (because the garbage collector may set aside more memory for efficiency reasons) even when the allocated data has a very short lifetime.

  Retention is the amount of live heap data, that is, the heap data transitively reachable from the call stacks at any point in time. A high retention costs space (obviously) and time (the garbage collector must perform more administrative work both for allocation and deallocation).

  Fragmentation is the creation of fragments: small unusable chunks of memory. Allocation of increasingly larger objects, such as increasingly longer strings or arrays, may cause memory fragmentation, leaving many small memory fragments that cannot be used. Such fragmentation costs time (to search for a sufficiently large hole at allocation) and space (because the fragments go unused). Most garbage collectors take care to avoid fragmentation, but that itself may cost time and space, and may not be done in embedded JVM implementations.

- A space leak is unwanted or unexpected retention, which usually causes memory consumption to grow linearly with execution time. A space leak is caused by objects, strings or arrays being reachable from live variables although those objects will actually never be used again. For instance, this may happen if you cache computation results in a HashMap: the results remain reachable from the HashMap even if you will never need them again. This can be avoided by using a WeakHashMap instead, as in the sketch below.
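As a rough illustration (my own sketch, not from the original text; the ResultCache class and its "computation" are invented), a WeakHashMap lets cached entries be garbage-collected once their keys are no longer referenced anywhere else:

import java.util.Map;
import java.util.WeakHashMap;

public class ResultCache {
    // Entries disappear automatically when nothing else references the key object.
    private final Map<Object, String> cache = new WeakHashMap<Object, String>();

    public String describe(Object key) {
        String result = cache.get(key);
        if (result == null) {
            result = "computed:" + key;   // stand-in for an expensive computation
            cache.put(key, result);
        }
        return result;
    }
}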

- A space leak may be caused by a deeply tail-recursive method that should have been written as a loop. A Java compiler does not automatically optimize a tail-recursive method to a loop, so all data reachable from the execution stack will be retained until the method returns.

- The kind of garbage collector (generational, mark-sweep, reference counting, two-space, incremental, compacting, …) strongly influences the time and space effects of allocation rate, retention, and fragmentation. However, a functioning garbage collector will never in itself cause a space leak. Space leaks are caused by mistakes in your program.

- Make sure that constant fields shared among all objects of a class are static, so that only one field is ever created. When all Car objects have the same icon, do not do this:

  public class Car {
    ImageIcon symbol = new ImageIcon("porsche.gif");
  }

  Instead, do this:

  public class Car {
    final static ImageIcon symbol = new ImageIcon("porsche.gif");
  }

- When you are not sure that an object will actually be needed, allocate it lazily: postpone its allocation until needed, but allocate it only once. The code below unconditionally creates a JButton for every Car object, although the button may never be requested by a call to the getButton method:

  public class Car {
    private JButton button = new JButton();
    public Car() {
      … initialize button …
    }
    public final JButton getButton() {
      return button;
    }
  }

  Instead, you can allocate the JButton lazily in getButton:

  public class Car {
    private JButton button = null;
    public Car() { … }
    public final JButton getButton() {
      if (button == null) { // button not yet created, so create it
        button = new JButton();
        … initialize button …
      }
      return button;
    }
  }

  This saves space (for the JButton object) as well as time (for allocating and initializing it). On the other hand, if the button is known to be needed, it is more efficient to allocate and initialize it early and avoid the test in getButton.

 

