At work we have a few of these, large and small, but they are either in production or simply too complex to use for day-to-day testing, especially when you try to debug a MapReduce job or code talking directly to HBase. I found that the excellent HBase team already had a class in place that they use to set up their JUnit tests, and the same goes for Hadoop. So I set out to extract the bare essentials, if you will, to create a tiny HBase cluster running on a tiny Hadoop distributed filesystem.
After a couple of issues had been resolved, the class below is my "culmination" of sweet cluster goodness ;)
/* File: MiniLocalHBase.java
 * Created: Feb 21, 2009
 * Author: Lars George
 *
 * Copyright (c) 2009 larsgeorge.com
 */
package com.larsgeorge.hadoop.hbase;

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.MiniHBaseCluster;
import org.apache.hadoop.hbase.util.FSUtils;
import org.apache.hadoop.hdfs.MiniDFSCluster;

/**
 * Starts a small local DFS and HBase cluster.
 *
 * @author Lars George
 */
public class MiniLocalHBase {

  static HBaseConfiguration conf = null;
  static MiniDFSCluster dfs = null;
  static MiniHBaseCluster hbase = null;

  /**
   * Main entry point to this class.
   *
   * @param args The command line arguments.
   */
  public static void main(String[] args) {
    try {
      int n = args.length > 0 && args[0] != null ?
        Integer.parseInt(args[0]) : 4;
      conf = new HBaseConfiguration();
      dfs = new MiniDFSCluster(conf, 2, true, (String[]) null);
      // set file system to the mini dfs just started up
      FileSystem fs = dfs.getFileSystem();
      conf.set("fs.default.name", fs.getUri().toString());
      Path parentdir = fs.getHomeDirectory();
      conf.set(HConstants.HBASE_DIR, parentdir.toString());
      fs.mkdirs(parentdir);
      FSUtils.setVersion(fs, parentdir);
      conf.set(HConstants.REGIONSERVER_ADDRESS, HConstants.DEFAULT_HOST + ":0");
      // disable UI or it clashes for more than one RegionServer
      conf.set("hbase.regionserver.info.port", "-1");
      hbase = new MiniHBaseCluster(conf, n);
      // add close hook
      Runtime.getRuntime().addShutdownHook(new Thread() {
        public void run() {
          if (hbase != null) hbase.shutdown();
          if (dfs != null) {
            try {
              FileSystem fs = dfs.getFileSystem();
              if (fs != null) fs.close();
            } catch (IOException e) {
              System.err.println("error closing file system: " + e);
            }
            try {
              dfs.shutdown();
            } catch (Exception e) { /* ignore */ }
          }
        }
      });
    } catch (Exception e) {
      e.printStackTrace();
    }
  } // main

} // MiniLocalHBase
The critical part for me was that if you want to start more than one region server, you have to disable the web UI of each of them, or they will all fail trying to bind the same info port, usually 60030.
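For reference, these are the two lines in the class above that do that work. The commented-out alternative at the end is a hypothetical tweak, not part of the original class: if you only ever start a single region server, you could leave the UI on its default port instead.

// bind each region server to an ephemeral port picked by the OS
conf.set(HConstants.REGIONSERVER_ADDRESS, HConstants.DEFAULT_HOST + ":0");
// disable the embedded info server so that multiple region servers
// do not clash trying to bind the same port, usually 60030
conf.set("hbase.regionserver.info.port", "-1");
// hypothetical alternative when running a single region server:
// keep the UI reachable on its default port
// conf.set("hbase.regionserver.info.port", "60030");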
I also added a small shutdown hook so that when you quit the process it shuts down nicely and leaves the data in such a state that you can restart the local cluster later on for further testing. Otherwise you may end up having to redo the file system - no biggie I guess, but hey, why not? You can specify the number of RegionServers to start on the command line; it defaults to 4 in my sample code above. Also, you do not need an hbase-site.xml or hadoop-site.xml to set anything else. All required settings are hardcoded to start the different servers in separate threads. You can of course add one and tweak further settings - just keep in mind that the ones hardcoded in the code cannot be overridden by the external XML settings files; to change those you would have to edit the code directly. To start this mini cluster you can either run it from within Eclipse for example, which makes it really easy since all the required libraries are in place, or you start it from the command line. This could work like so:
hadoop$ java -Xms512m -Xmx512m -cp bin:lib/hadoop-0.19.0-core.jar:lib/hadoop-0.19.0-test.jar:lib/hbase-0.19.0.jar:lib/hbase-0.19.0-test.jar:lib/commons-logging-1.0.4.jar:lib/jetty-5.1.4.jar:lib/servlet-api.jar:lib/jetty-ext/jasper-runtime.jar:lib/jetty-ext/jsp-api.jar:lib/jetty-ext/jasper-compiler.jar:lib/jetty-ext/commons-el.jar com.larsgeorge.hadoop.hbase.MiniLocalHBase
What I did was create a small project, have the class compile into the "bin" directory, and throw all Hadoop and HBase libraries into the "lib" directory. This was only for the sake of keeping the command line short. I suggest you have the classpath set already or have it point to the original locations where you have untar'ed the respective packages.
Running it from within Eclipse of course lets you use the integrated debugging tools at hand. The next step is to follow through with what the test classes already have implemented and be able to start MapReduce jobs with debugging enabled. Mind you though, the local cluster is not very powerful - even if you give it more memory than I did above. But fill it with a few hundred rows and use it to debug your code, and once it runs fine, run it happily ever after on your production site.
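To fill it with those few hundred rows, something like the following sketch could work. It assumes the 0.19-era client API (BatchUpdate was replaced by Put in later releases); the class name, the table name "testtable", and the column family "data:" are made up for illustration, and if you run it as a separate process you may need to point hbase.master at the mini cluster's master address.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;

public class MiniClusterTest {
  public static void main(String[] args) throws IOException {
    HBaseConfiguration conf = new HBaseConfiguration();
    // "testtable" and the "data:" family are made-up names for this sketch
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("testtable");
    desc.addFamily(new HColumnDescriptor("data:"));
    admin.createTable(desc);
    // insert a few hundred rows to have something to debug against
    HTable table = new HTable(conf, "testtable");
    for (int i = 0; i < 300; i++) {
      BatchUpdate update = new BatchUpdate("row-" + i);
      update.put("data:field", ("value-" + i).getBytes());
      table.commit(update);
    }
  }
}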
All the credit goes to the Hadoop and HBase teams of course; I simply gathered their code from various places.
What is org.apache.hadoop.hbase.MiniHBaseCluster? I can't find this class in 0.19.1 - is it something we'll see in 0.20?
Hi test,
It is part of the test package. You will have to add the src/test path or the hbase-0.19.1-test.jar to your development environment.
Hi, thanks for your post. Do you know the advantages/disadvantages/differences between MiniHBaseCluster and LocalHBaseCluster? I want to use a local cluster for my tests, and I am interested in a lightweight but realistic cluster.
Hi Alex, the MiniHBaseCluster simply ties it all together while using the LocalHBaseCluster internally - so it is pretty much the same thing. This is an ongoing topic on the IRC channel too; peeps like to test a full cluster locally under realistic scenarios. This is hardly going to work - in fact, the local cluster throws even more spanners in the works at times. You run into issues with it, and by the time you finally get those resolved and the local cluster working, you realize that it does - most of the time - *not* act like a real, distributed one. This is attributable to various things, be it more CPUs, more memory, etc.
Efforts are underway to get HBase into the Cloudera distribution, and if that works out you should consider testing on an EC2 rental cluster. That gets you closer than anything else. The local cluster is good for small tests and getting the API use ironed out, but once you have done so you need to get the big iron out.