First Munich OpenHUG Meeting
We are trying to gauge the interest in a southern Germany Hadoop User Group meeting. After seeing quite a lot of interest in the Berlin meetings, a few of us got together and decided to test the waters for another meeting at the other end of the country. We are therefore happy to announce the first Munich OpenHUG Meeting.
When: Thursday, December 17, 2009, at 5:30pm (open end)
Where: eCircle AG, Nymphenburger Straße 86, 80636 München ("Bruckmann" building, U1 stop "Maillingerstr."; the map is in German - look for the signs)
Thanks to Bob Schulze from eCircle for providing the location and projector, and for also giving a first presentation on how eCircle is planning to use the Hadoop stack.
We also have Dave Butlerdi giving an overview of his usage of Hadoop.
Finally, I will give an overview of the state of the HBase project: what it is, what it does, and how I have been using it (since early 2008).
We are also open to everyone who wants to talk about anything related to these new technologies, often combined under the rather new term "NoSQL". Take the opportunity to talk about what you are working on and find like-minded people to bounce ideas off. This is also why we chose the title OpenHUG for the meeting. While we mostly work with Hadoop and its subprojects, we also like to learn about related projects and technologies.
Last but not least, there will be something to drink and we will get pizzas in. Since we do not know how many of you will come on such short notice, we will simply stay at Bob's place and continue our chats over food.
As this is the first meeting in Munich on this topic, we scheduled it just a day after the Berlin meeting. Given there is enough interest, we will in the future settle on dates that fit nicely between the Berlin dates so that there is no overlap and you can attend both meetings.
Please RSVP at Yahoo's Upcoming or Xing.
Monday, December 14, 2009
Tuesday, November 24, 2009
HBase vs. BigTable Comparison
HBase is an open-source implementation of the Google BigTable architecture. That part is fairly easy to understand and grasp. What I personally feel is a bit more difficult is to understand how much HBase covers and where there are differences (still) compared to the BigTable specification. This post is an attempt to compare the two systems.
Before we embark on the darker technical side of things, I would like to point out one thing upfront: HBase is very close to what the BigTable paper describes. Putting aside minor differences, as of HBase 0.20, which uses ZooKeeper as its distributed coordination service, it has all the means to be nearly an exact implementation of BigTable's functionality. What I will be looking into below are mainly subtle variations or differences. Where possible I will try to point out how the HBase team is working on improving the situation, given there is a need to do so.
Scope
The comparison in this post is based on the OSDI'06 paper that describes the system Google implemented in about seven person-years and which has been in operation since 2005. The paper was published in 2006, while the HBase sub-project of Hadoop was established only around the end of that same year to early 2007. Back then the current version of Hadoop was 0.15.0. Given we are now about two years in, with Hadoop 0.20.1 and HBase 0.20.2 available, you can hopefully understand that much has indeed happened since. Please also note that I am comparing a 14-page high-level technical paper with an open-source project that can be examined freely from top to bottom. It usually means that there is more to tell about how HBase does things simply because the information is available.

Towards the end I will also address a few newer features that BigTable has nowadays and how HBase compares to those. We start though with naming conventions.
Terminology
There are a few different terms used in either system to describe the same thing. The most prominent is what HBase calls "regions" while Google refers to them as "tablets". These are the partitions of contiguous rows spread across many "region servers" - or "tablet servers" respectively. Apart from that, most differences are minor or caused by the use of related technologies, since Google's code is obviously closed-source and therefore only mirrored by open-source projects. The open-source projects are free to use other terms and, most importantly, names for the projects themselves.

Features
The following table lists various "features" of BigTable and compares them with what HBase has to offer. Some are actual implementation details, some are configurable options, and so on. This may be confusing but it would be difficult to sort them into categories without ending up with only one entry in each of them. A short client-side sketch after the table illustrates a few of these features from the API's point of view.

Feature | Google BigTable | Apache HBase | Notes |
Atomic Read/Write/Modify | Yes, per row | Yes, per row | Since BigTable does not strive to be a relational database it does not have transactions. The closest to such a mechanism is the atomic access to each row in the table. HBase also implements a row lock API which allows the user to lock more than one row at a time. |
Lexicographic Row Order | Yes | Yes | All rows are sorted lexicographically in one order and that one order only. Again, this is no SQL database where you can have different sorting orders. |
Block Support | Yes | Yes | Within each storage file data is written as smaller blocks of data. This enables faster loading of data from large storage files. The size is configurable in either system. The typical size is 64K. |
Block Compression | Yes, per column family | Yes, per column family | Google uses BMDiff and Zippy in a two step process. BMDiff works really well because neighboring key-value pairs in the store files are often very similar. This can be achieved by using versioning so that all modifications to a value are stored next to each other but still have a lot in common. Or by designing the row keys in such a way that for example web pages from the same site are all bundled. Zippy then is a modified LZW algorithm. HBase on the other hand uses the standard Java supplied GZip or with a little effort the GPL licensed LZO format. There are indications though that Hadoop also may want to have BMDiff (HADOOP-5793) and possibly Zippy as well. |
Number of Column Families | Hundreds at Most | Less than 100 | While the number of rows and columns is theoretically unbounded, the number of column families is not. This is a design trade-off but does not impose too many restrictions if the tables and keys are designed accordingly. |
Column Family Name Format | Printable | Printable | The main reason for HBase here is that column family names are used as directories in the file system. |
Qualifier Format | Arbitrary | Arbitrary | Any arbitrary byte[] array can be used. |
Key/Value Format | Arbitrary | Arbitrary | Like above, any arbitrary byte[] array can be used. |
Access Control | Yes | No | BigTable enforces access control on a column family level. HBase does not yet have that feature (see HBASE-1697). |
Cell Versions | Yes | Yes | Versioning is done using timestamps. See next feature below too. The number of versions that should be kept is freely configurable on a column family level. |
Custom Timestamps | Yes (micro) | Yes (milli) | With both systems you can either set the timestamp of a value that is stored yourself or leave the default "now". There are "known" restrictions in HBase that the outcome is indeterminate when adding older timestamps after already having stored newer ones beforehand. |
Data Time-To-Live | Yes | Yes | Besides having versions of data cells, the user can also set a time-to-live on the stored data that allows discarding data after a specific amount of time. |
Batch Writes | Yes | Yes | Both systems allow batching of table operations. |
Value based Counters | Yes | Yes | BigTable and HBase can use a specific column as an atomic counter. HBase does this by acquiring a row lock before the value is incremented. |
Row Filters | Yes | Yes | Again, both systems allow applying filters when scanning rows. |
Client Script Execution | Yes | No | BigTable uses Sawzall to enable users to process the stored data. |
MapReduce Support | Yes | Yes | Both systems have convenience classes that allow scanning a table in MapReduce jobs. |
Storage Systems | GFS | HDFS, S3, S3N, EBS | While BigTable works on Google's GFS, HBase has the option to use any file system as long as there is a proxy or driver class for it. |
File Format | SSTable | HFile | |
Block Index | At end of file | At end of file | Both storage file formats have a similar block oriented structure with the block index stored at the end of the file. |
Memory Mapping | Yes | No | BigTable can memory map storage files directly into memory. |
Lock Service | Chubby | ZooKeeper | There is a difference in where ZooKeeper is used to coordinate tasks in HBase as opposed to provide locking services. Overall though ZooKeeper does for HBase pretty much what Chubby does for BigTable with slightly different semantics. |
Single Master | Yes | No | HBase recently added support for multiple masters. These are on "hot" standby and monitor the master's ZooKeeper node. |
Tablet/Region Count | 10-1000 | 10-1000 | Both systems recommend about the same amount of regions per region server. Of course this depends on many things but given a similar setup as far as "commodity" machines are concerned it seems to result in the same amount of load on each server. |
Tablet/Region Size | 100-200MB | 256MB | The maximum region size can be configured for HBase and BigTable. HBase uses 256MB as the default value. |
Root Location | 1st META / Chubby | -ROOT- / ZooKeeper | HBase handles the Root table slightly differently from BigTable, where it is the first region in the Meta table. HBase uses its own table with a single region to store the Root table. Once either system starts, the address of the server hosting the Root region is stored in ZooKeeper or Chubby so that the clients can resolve its location without hitting the master. |
Client Region Cache | Yes | Yes | The client in either system caches the location of regions and has appropriate mechanisms to detect stale information and update the local cache respectively. |
Meta Prefetch | Yes | No (?) | A design feature of BigTable is to fetch more than one Meta region's information. This proactively fills the client cache for future lookups. |
Historian | Yes | Yes | The history of region related events (such as splits, assignment, reassignment) is recorded in the Meta table. |
Locality Groups | Yes | No | It is not entirely clear but it seems everything in BigTable is defined by locality groups. They group multiple column families into one so that they get stored together and also share the same configuration parameters. A single column family is probably a locality group with one member. HBase does not have this option and handles each column family separately. |
In-Memory Column Families | Yes | Yes | These are for relatively small tables that need very fast access times. |
KeyValue (Cell) Cache | Yes | No | This is a cache that serves hot cells. |
Block Cache | Yes | Yes | Blocks read from the storage files are cached internally in configurable caches. |
Bloom Filters | Yes | Yes | These filters allow you - at the cost of using memory on the region server - to quickly check whether a specific cell exists. |
Write-Ahead Log (WAL) | Yes | Yes | Each region server in either system stores one modification log for all regions it hosts. |
Secondary Log | Yes | No | In addition to the Write-Ahead log mentioned above BigTable has a second log that it can use when the first is going slow. This is a performance optimization. |
Skip Write-Ahead Log | ? | Yes | For bulk imports the client in HBase can opt to skip writing into the WAL. |
Fast Table/Region Split | Yes | Yes | Splitting a region or tablet is fast as the daughter regions first read the original storage file until a compaction finally rewrites the data into the region's local store. |
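To make a few of the table entries above more tangible, here is a minimal client-side sketch against the HBase 0.20 Java API; the table name "testtable" and the column names are made up for illustration and the table is assumed to exist already. It touches row locks, explicit timestamps with multiple versions, and the value-based counter call.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.RowLock;
import org.apache.hadoop.hbase.util.Bytes;

public class FeatureSketch {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "testtable");
    byte[] row = Bytes.toBytes("row-1");
    byte[] family = Bytes.toBytes("f1");

    // Atomic access per row: take an explicit row lock, mutate, release.
    RowLock lock = table.lockRow(row);
    try {
      Put put = new Put(row, lock);
      // Custom timestamp (milliseconds in HBase) and value.
      put.add(family, Bytes.toBytes("col"), 1234567890L, Bytes.toBytes("value-1"));
      table.put(put);
    } finally {
      table.unlockRow(lock);
    }

    // Cell versions: ask for up to three versions of the column.
    Get get = new Get(row);
    get.setMaxVersions(3);
    Result result = table.get(get);
    System.out.println("Cells found: " + result.size());

    // Value-based counter: atomically increments the stored long by one.
    long count = table.incrementColumnValue(row, family, Bytes.toBytes("counter"), 1L);
    System.out.println("Counter is now: " + count);
  }
}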
New Features
As mentioned above, a few years have passed since the original OSDI'06 BigTable paper. Jeff Dean - a fellow at Google - has mentioned a few new BigTable features during speeches and presentations he gave recently. We will have a look at some of them here.

Feature | Google BigTable | Apache HBase | Notes |
Client Isolation | Yes | No | BigTable is used internally to serve many separate clients and can therefore keep the data between them isolated. |
Coprocessors | Yes | No | BigTable can host code that resides with the regions and splits with them as well. See HBASE-2000 for progress on this feature within HBase. |
Corruption Safety | Yes | No | This is an interesting topic. BigTable uses CRC checksums to verify if data has been written safely. While HBase does not have this, the question is whether that is already built into Hadoop's HDFS. |
Replication | Yes | No | HBase is working on the same feature in HBASE-1295. |
Note: the color codes indicate what features have a direct match or where it is missing (yet). Weaker features are colored yellow, as I am not sure if they are immediately necessary or even applicable given HBase's implementation.
Variations and Differences
Some of the above features need a bit more looking into, as they are difficult to narrow down to simple "Yay or Nay" answers. I am addressing them separately below.

Lock Service
This is from the BigTable paper:
Bigtable uses Chubby for a variety of tasks: to ensure that there is at most one active master at any time; to store the bootstrap location of Bigtable data (see Section 5.1); to discover tablet servers and finalize tablet server deaths (see Section 5.2); to store Bigtable schema information (the column family information for each table); and to store access control lists. If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable.
There is a lot of overlap compared to how HBase uses ZooKeeper. What is different though is that schema information is not stored in ZooKeeper yet (see http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases for details). What is important here is the same reliance on the lock service being available. From my own experience and from reading the threads on the HBase mailing list, it is often underestimated what can happen when ZooKeeper does not get the resources it needs to respond in a timely manner. It is better to have a small ZooKeeper cluster on older machines not doing anything else, as opposed to having ZooKeeper nodes running next to the already heavy Hadoop or HBase processes. Once you starve ZooKeeper you will see a domino effect of HBase nodes going down with it - including the master(s).
Update: After talking to a few members of the ZooKeeper team I would like to point out that this is indeed not a ZooKeeper issue. It has to do with the fact that if an already heavily loaded node is also trying to respond to ZooKeeper requests in time, you may face a timeout situation where the HBase RegionServers and even the Master think that their coordination service is gone and shut themselves down. Patrick Hunt has responded to this by mail and by post. Please read both to see that ZooKeeper is able to handle the load. I personally recommend setting up ZooKeeper in combination with HBase on a separate cluster - maybe a set of spare machines left over from a recent cluster upgrade that are slightly outdated (no 2x quad-core CPU with 16GB of memory) but otherwise perfectly fine. This also allows you to monitor the machines separately, instead of seeing a combined CPU load of 100% on the servers without really knowing where it comes from and what effect it may have.
Another important difference is that ZooKeeper is not a lock service like Chubby - and I do think it does not have to be as far as HBase is concerned. ZooKeeper is a distributed coordination service enabling HBase to do Master node elections etc. It also allows using semaphores to indicate state or required actions. So where Chubby creates a lock file to indicate a tablet server is up and running, HBase in turn uses ephemeral nodes that exist as long as the session between the RegionServer which creates that node and ZooKeeper is active. This also causes differences in semantics: in BigTable the master can delete a tablet server's lock file to indicate that it has lost its lease on its tablets, while in HBase this has to be handled differently because of the slightly less restrictive architecture of ZooKeeper. These are only semantics as mentioned and do not mean one is better than the other - just different.
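To illustrate the ephemeral-node mechanism mentioned above, here is a minimal, hypothetical sketch using the plain ZooKeeper Java API; the connect string and znode path are made up, and HBase of course handles this internally rather than through code like this.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralPresence {
  public static void main(String[] args) throws Exception {
    // Connect to the ZooKeeper ensemble (address is just an example).
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
      public void process(WatchedEvent event) {
        System.out.println("Event: " + event);
      }
    });

    // An ephemeral znode acts like a "this server is alive" marker:
    // it disappears automatically when the session ends or times out.
    String path = zk.create("/rs-server-1-presence", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    System.out.println("Registered presence at " + path);

    // Other processes can watch the node and get notified when it vanishes.
    Thread.sleep(10000);
    zk.close(); // session ends, ZooKeeper removes the ephemeral node
  }
}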
The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the location of all tablets in a special METADATA table. Each METADATA tablet contains the location of a set of user tablets. The root tablet is just the first tablet in the METADATA table, but is treated specially - it is never split - to ensure that the tablet location hierarchy has no more than three levels.
As mentioned above, in HBase the root region is its own table with a single region. I strongly doubt that this makes a difference compared to having it as the first (non-splittable) region of the meta table. It is just the same feature, implemented differently.
The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row.
HBase does have a different layout here. It stores the start and end row with each region where the end row is exclusive and denotes the first (or start) row of the next region. Again, these are minor differences and I am not sure if there is a better or worse solution. It is just done differently.
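If you want to see that layout for yourself, the catalog table can be scanned with the regular client API. The following is only a small sketch: it prints the raw row keys, which in HBase encode the user table name and the region's start key.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class MetaLayout {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    // The catalog table can be scanned like any other table.
    HTable meta = new HTable(conf, ".META.");
    ResultScanner scanner = meta.getScanner(new Scan());
    try {
      for (Result result : scanner) {
        // The row key encodes the table name and the region's start key.
        System.out.println(Bytes.toString(result.getRow()));
      }
    } finally {
      scanner.close();
    }
  }
}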
Master Operation
To detect when a tablet server is no longer serving its tablets, the master periodically asks each tablet server for the status of its lock. If a tablet server reports that it has lost its lock, or if the master was unable to reach a server during its last several attempts, the master attempts to acquire an exclusive lock on the server's file. If the master is able to acquire the lock, then Chubby is live and the tablet server is either dead or having trouble reaching Chubby, so the master ensures that the tablet server can never serve again by deleting its server file. Once a server's file has been deleted, the master can move all the tablets that were previously assigned to that server into the set of unassigned tablets. To ensure that a Bigtable cluster is not vulnerable to networking issues between the master and Chubby, the master kills itself if its Chubby session expires.
This is quite different, even up to the current HBase 0.20.2. Here the region servers use a heartbeat protocol to report for duty to the master and, subsequently, to signal that they are still alive. I am not sure if this topic is covered by the master rewrite umbrella issue HBASE-1816 - and if it needs to be addressed at all. It could well be that what we have in HBase now is sufficient and does its job just fine. It was created when there was no lock service yet and could therefore be considered legacy code too.
Master Startup
The master executes the following steps at startup. (1) The master grabs a unique master lock in Chubby, which prevents concurrent master instantiations. (2) The master scans the servers directory in Chubby to find the live servers. (3) The master communicates with every live tablet server to discover what tablets are already assigned to each server. (4) The master scans the METADATA table to learn the set of tablets. Whenever this scan encounters a tablet that is not already assigned, the master adds the tablet to the set of unassigned tablets, which makes the tablet eligible for tablet assignment.
In line with what I mentioned above, this part of the code was created before ZooKeeper was available. So HBase actually waits for the region servers to report for duty. It also scans the .META. table to learn what is there and which server is assigned to it. ZooKeeper is (as of yet) only used to publish the server hosting the -ROOT- region.
Tablet/Region Splits
In case the split notification is lost (either because the tablet server or the master died), the master detects the new tablet when it asks a tablet server to load the tablet that has now split. The tablet server will notify the master of the split, because the tablet entry it finds in the METADATA table will specify only a portion of the tablet that the master asked it to load.
The master node in HBase uses the .META. solely to detect when a region was split but the message was lost. For that reason it scans the .META. on a regular basis to see when a region appears that is not yet assigned. It will then assign that region as per its default strategy.
Compactions
The following are more terminology differences than anything else.
As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This minor compaction process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies. Incoming read and write operations can continue while compactions occur.
HBase has a similar operation, but it is referred to as a "flush". In contrast, "minor compactions" in HBase rewrite the last N used store files, i.e. those with the most recent mutations, as they are probably much smaller than previously created files that have more data in them.
... we bound the number of such files by periodically executing a merging compaction in the background. A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished.
This again refers to what is called "minor compaction" in HBase.
A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction.
Here we have an exact match though, a "major compaction" in HBase also rewrites all files into one.
Immutable Files
Knowing that files are fixed once written BigTable makes the following assumption:
The only mutable data structure that is accessed by both reads and writes is the memtable. To reduce contention during reads of the memtable, we make each memtable row copy-on-write and allow reads and writes to proceed in parallel.
I do believe this is done similarly in HBase but am not sure. It certainly has the same architecture, as HDFS files, for example, are also immutable once written.
I can only recommend that you read the BigTable paper too and make up your own mind. This post was inspired by the idea of learning what BigTable really has to offer and how much of it HBase has already covered. The difficult part is of course that there is not too much information available on BigTable. But the numbers even the 2006 paper lists are more than impressive. If HBase, as an open-source project with just a handful of committers, most of whom have full-time day jobs, can achieve something even remotely comparable, I think this is a huge success. And looking at the 0.21 and 0.22 road map, the already small gap is going to shrink even further!
Friday, November 20, 2009
HBase on Cloudera Training Virtual Machine (0.3.2)
Note: This is a follow-up to my earlier post. Since then Cloudera has released a new VM that includes the current 0.20 branch of Hadoop. Below is the same post adjusted to work with that new release. Please note that there are subtle changes, for example the NameNode port has changed. So if you still have the older post around, please disregard it and follow this one here instead for the new VM version.
You might want to run HBase on Cloudera's Virtual Machine to get a quick start to a prototyping setup. In theory you download the VM, start it, and you are ready to go. The main issue though is that the current Hadoop Training VM does not include HBase at all (yet?). Apart from that, the install of a local HBase instance is a straightforward process.
Here are the steps to get HBase running on Cloudera's VM:
- Download VM
Get it from Cloudera's website.
- Start VM
As the above page states: "To launch the VMWare image, you will either need VMware Player for windows and linux, or VMware Fusion for Mac."
Note: I have Parallels for Mac and wanted to use that. I used Parallels Transporter to convert the "cloudera-training-0.3.2.vmx" to a new "cloudera-training-0.2-cl4-000001.hdd", created a new VM in Parallels selecting Ubuntu Linux as the OS and the newly created .hdd as the disk image. Boot up the VM and you are up and running. I gave it a bit more memory for the graphics to be able to switch the VM to 1440x900, which is the native screen resolution of the MacBook Pro I am using.
Finally follow the steps explained on the page above, i.e. open a Terminal and issue:
$ cd ~/git
$ ./update-exercises --workspace
- Pull HBase branch
We are using the brand new HBase 0.20.2 release. Open a new Terminal (or issue a $ cd .. in the open one), then:
$ sudo -u hadoop git clone http://git.apache.org/hbase.git /home/hadoop/hbase
$ sudo -u hadoop sh -c "cd /home/hadoop/hbase ; git checkout origin/tags/0.20.2"
Note: moving to "origin/tags/0.20.2" which isn't a local branch
If you want to create a new branch from this checkout, you may do so
(now or later) by using -b with the checkout command again. Example:
  git checkout -b <new_branch_name>
HEAD is now at 777fb63... HBase release 0.20.2
First we clone the repository, then switch to the actual branch. You will notice that I am using sudo -u hadoop because Hadoop itself is started under that account and so I wanted it to match. Also, the default "training" account does not have SSH set up as explained in Hadoop's quick-start guide. When sudo is asking for a password use the default, which is set to "training".
You can ignore the messages git prints out while performing the checkout.
- Build Branch
Continue in Terminal:
$ sudo -u hadoop sh -c "cd /home/hadoop/hbase/ ; export PATH=$PATH:/usr/share/apache-ant-1.7.1/bin ; ant package"
...
BUILD SUCCESSFUL
- Configure HBase
There are a few edits to be made to get HBase running.
$ sudo -u hadoop vim /home/hadoop/hbase/build/conf/hbase-site.xml

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:8022/hbase</value>
  </property>
</configuration>

$ sudo -u hadoop vim /home/hadoop/hbase/build/conf/hbase-env.sh

# The java implementation to use.  Java 1.6 required.
# export JAVA_HOME=/usr/java/jdk1.6.0/
export JAVA_HOME=/usr/lib/jvm/java-6-sun
...
- Rev up the Engine!
The final thing is to start HBase:
$ sudo -u hadoop /home/hadoop/hbase/build/bin/start-hbase.sh
$ sudo -u hadoop /home/hadoop/hbase/build/bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Version: 0.20.2, r777fb63ff0c73369abc4d799388a45b8bda9e5fd, Thu Nov 19 15:32:17 PST 2009
hbase(main):001:0>
Done!
Let's create a table and check if it was created OK.
hbase(main):001:0> list
0 row(s) in 0.0910 seconds
hbase(main):002:0> create 't1', 'f1', 'f2', 'f3'
0 row(s) in 6.1260 seconds
hbase(main):003:0> list
t1
1 row(s) in 0.0470 seconds
hbase(main):004:0> describe 't1'
DESCRIPTION                                                             ENABLED
 {NAME => 't1', FAMILIES => [{NAME => 'f1', COMPRESSION => 'NONE', VERS true
 IONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => '
 false', BLOCKCACHE => 'true'}, {NAME => 'f2', COMPRESSION => 'NONE', V
 ERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY =
 > 'false', BLOCKCACHE => 'true'}, {NAME => 'f3', COMPRESSION => 'NONE'
 , VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
 Y => 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0750 seconds
hbase(main):005:0>
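If you would rather exercise the Java client API against this local setup than the shell, a minimal sketch could look like the following; it reuses the 't1' table created above, the row and column names are made up, and it assumes the HBase build and its configuration directory are on the classpath.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VmSmokeTest {
  public static void main(String[] args) throws Exception {
    // Picks up hbase-site.xml from the classpath, i.e. the local VM setup.
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "t1");

    // Store a single cell in family "f1".
    Put put = new Put(Bytes.toBytes("row-1"));
    put.add(Bytes.toBytes("f1"), Bytes.toBytes("col"), Bytes.toBytes("hello"));
    table.put(put);

    // Read it back.
    Result result = table.get(new Get(Bytes.toBytes("row-1")));
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("f1"), Bytes.toBytes("col"))));
  }
}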
Just keep in mind that this is for prototyping only! With such a setup you will only be able to insert a handful of rows. If you overdo it you will bring it to its knees very quickly. But you can safely use it to play around with the shell to create tables or use the API to get used to it and test changes in your code etc.
Finally a screenshot of the running HBase UI:
Tuesday, October 20, 2009
HBase on Cloudera Training Virtual Machine (0.3.1)
You might want to run HBase on Cloudera's Virtual Machine to get a quick start to a prototyping setup. In theory you download the VM, start it, and you are ready to go. There are a few issues though, the worst being that the current Hadoop Training VM does not include HBase at all. Also, Cloudera is using a specific version of Hadoop that it deems stable and maintains its own release cycle. So Cloudera's version of Hadoop is 0.18.3. HBase though needs Hadoop 0.20 - but we are in luck, as Andrew Purtell of TrendMicro maintains a special branch of HBase 0.20 that works with Cloudera's release.
Here are the steps to get HBase running on Cloudera's VM:
- Download VM
Get it from Cloudera's website.
- Start VM
As the above page states: "To launch the VMWare image, you will either need VMware Player for windows and linux, or VMware Fusion for Mac."
Note: I have Parallels for Mac and wanted to use that. I used Parallels Transporter to convert the "cloudera-training-0.3.1.vmx" to a new "cloudera-training-0.2-cl3-000002.hdd", created a new VM in Parallels selecting Ubuntu Linux as the OS and the newly created .hdd as the disk image. Boot up the VM and you are up and running. I gave it a bit more memory for the graphics to be able to switch the VM to 1440x900, which is the native resolution of the MacBook Pro I am using.
Finally follow the steps explained on the page above, i.e. open a Terminal and issue:
$ cd ~/git
$ ./update-exercises --workspace
- Pull HBase branch
Open a new Terminal (or issue a $ cd .. in the open one), then:
$ sudo -u hadoop git clone http://git.apache.org/hbase.git /home/hadoop/hbase
$ sudo -u hadoop sh -c "cd /home/hadoop/hbase ; git checkout origin/0.20_on_hadoop-0.18.3"
...
HEAD is now at c050f68... pull up to release
First we clone the repository, then switch to the actual branch. You will notice that I am using sudo -u hadoop because Hadoop itself is started under that account and so I wanted it to match. Also, the default "training" account does not have SSH set up as explained in Hadoop's quick-start guide. When sudo is asking for a password use the default, which is set to "training".
- Build Branch
Continue in Terminal:
$ sudo -u hadoop sh -c "cd /home/hadoop/hbase/ ; export PATH=$PATH:/usr/share/apache-ant-1.7.1/bin ; ant package"
...
BUILD SUCCESSFUL
- Configure HBase
There are a few edits to be made to get HBase running.
$ sudo -u hadoop vim /home/hadoop/hbase/build/conf/hbase-site.xml

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:8020/hbase</value>
  </property>
</configuration>

$ sudo -u hadoop vim /home/hadoop/hbase/build/conf/hbase-env.sh

# The java implementation to use.  Java 1.6 required.
# export JAVA_HOME=/usr/java/jdk1.6.0/
export JAVA_HOME=/usr/lib/jvm/java-6-sun
...
Note: There is a small glitch in the revision 826669 of that Cloudera specific HBase branch. The master UI (on port 60010 on localhost) will not start because a path is different and Jetty packages are missing because of it. You can fix it by editing the start up script and changing the path scanned:
$ sudo -u hadoop vim /home/hadoop/hbase/build/bin/hbase
Replace
for f in $HBASE_HOME/lib/jsp-2.1/*.jar; do
with
for f in $HBASE_HOME/lib/jetty-ext/*.jar; do
This is only until the developers have fixed this in the branch (compare the revision I used r813052 with what you get). Or if you do not want the UI you can ignore this and the error in the logs too. HBase will still run, just not its web based interface.
- Rev up the Engine!
The final thing is to start HBase:
$ sudo -u hadoop /home/hadoop/hbase/build/bin/start-hbase.sh
$ sudo -u hadoop /home/hadoop/hbase/build/bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Version: 0.20.0-0.18.3, r813052, Mon Oct 19 06:51:57 PDT 2009
hbase(main):001:0> list
0 row(s) in 0.2320 seconds
hbase(main):002:0>
Done!
This sums it up. I hope you give HBase on the Cloudera Training VM a whirl as it also has Eclipse installed and therefore provides a quick start into Hadoop and HBase.
Just keep in mind that this is for prototyping only! With such a setup you will only be able to insert a handful of rows. If you overdo it you will bring it to its knees very quickly. But you can safely use it to play around with the shell to create tables or use the API to get used to it and test changes in your code etc.
Update: Updated title to include version number, fixed XML
Monday, October 12, 2009
HBase Architecture 101 - Storage
One of the more hidden aspects of HBase is how data is actually stored. While the majority of users may never have to bother with it, you may have to get up to speed when you want to learn what the various advanced configuration options at your disposal mean. "How can I tune HBase to my needs?" and similar questions are certainly interesting once you get over the (at times steep) learning curve of setting up a basic system. Another reason for wanting to know more is if, for whatever reason, disaster strikes and you have to recover an HBase installation.
In my own efforts getting to know the respective classes that handle the various files I started to sketch a picture in my head illustrating the storage architecture of HBase. But while the ingenious and blessed committers of HBase easily navigate back and forth through that maze I find it much more difficult to keep a coherent image. So I decided to put that sketch to paper. Here it is.
Please note that this is not a UML or call graph but a merged picture of classes and the files they handle. It is by no means complete, but focuses on the topic of this post. I will discuss the details below and also look at the configuration options and how they affect the low-level storage files.
The Big Picture
So what does my sketch of the HBase innards really say? You can see that HBase handles basically two kinds of file types. One is used for the write-ahead log and the other for the actual data storage. The files are primarily handled by the HRegionServers. But in certain scenarios even the HMaster will have to perform low-level file operations. You may also notice that the actual files are in fact divided up into smaller blocks when stored within the Hadoop Distributed Filesystem (HDFS). This is also one of the areas where you can configure the system to handle larger or smaller data better. More on that later.

The general flow is that a new client contacts the Zookeeper quorum (a separate cluster of Zookeeper nodes) first to find a particular row key. It does so by retrieving the server name (i.e. host name) that hosts the -ROOT- region from Zookeeper. With that information it can query that server to get the server that hosts the .META. table. Both of these two details are cached and only looked up once. Lastly it can query the .META. server and retrieve the server that has the row the client is looking for.
Once it has been told where the row resides, i.e. in what region, it caches this information as well and contacts the HRegionServer hosting that region directly. So over time the client has a pretty complete picture of where to get rows from without needing to query the .META. server again.

Note: The HMaster is responsible for assigning the regions to each HRegionServer when you start HBase. This also includes the "special" -ROOT- and .META. tables.
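From the client's point of view all of this lookup and caching happens behind the scenes. As a rough sketch (the row key and empty qualifier are just examples matching the "docs" table used later in this post), a simple read like the following triggers the -ROOT- and .META. round trips only on first use:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class LookupExample {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    // The HTable instance resolves -ROOT- and .META. internally and
    // caches the region locations it finds along the way.
    HTable table = new HTable(conf, "docs");
    Get get = new Get(Bytes.toBytes("docA"));
    Result result = table.get(get);
    byte[] value = result.getValue(Bytes.toBytes("mimetype"), Bytes.toBytes(""));
    System.out.println(Bytes.toString(value));
  }
}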
Next, when the HRegionServer opens the region, it creates a corresponding HRegion object. When the HRegion is "opened" it sets up a Store instance for each HColumnFamily for every table as defined by the user beforehand. Each of the Store instances can in turn have one or more StoreFile instances, which are lightweight wrappers around the actual storage file called HFile. An HRegion also has a MemStore and an HLog instance. We will now have a look at how they work together but also where there are exceptions to the rule.

Stay Put
So how is data written to the actual storage? The client issues an HTable.put(Put) request to the HRegionServer, which hands the details to the matching HRegion instance. The first step is to decide if the data should first be written to the "Write-Ahead-Log" (WAL), represented by the HLog class. The decision is based on the flag set by the client using the Put.writeToWAL(boolean) method. The WAL is a standard Hadoop SequenceFile (although it is currently discussed if that should not be changed to a more HBase-suitable file format) and it stores HLogKey's. These keys contain a sequential number as well as the actual data and are used to replay not yet persisted data after a server crash.
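As a small, hypothetical illustration of that decision point (the row key and column are made up, reusing the "docs" table from this post), a client that is willing to trade safety for speed - during a bulk load, for example - could disable the WAL per Put:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WalExample {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "docs");

    Put put = new Put(Bytes.toBytes("docF"));
    put.add(Bytes.toBytes("mimetype"), Bytes.toBytes(""), Bytes.toBytes("text/xml"));
    // Skip the write-ahead log for this mutation: faster, but the edit is
    // lost if the region server crashes before the MemStore is flushed.
    put.writeToWAL(false);
    table.put(put);
  }
}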
Once the data is written (or not) to the WAL it is placed in the MemStore. At the same time it is checked if the MemStore is full and in that case a flush to disk is requested. When the request is served by a separate thread in the HRegionServer, it writes the data to an HFile located in the HDFS. It also saves the last written sequence number so the system knows what was persisted so far. Let's have a look at the files now.

Files
HBase has a configurable root directory in the HDFS but the default is /hbase. You can simply use the DFS tool of the Hadoop command line to look at the various files HBase stores.

$ hadoop dfs -lsr /hbase/docs
...
drwxr-xr-x - hadoop supergroup 0 2009-09-28 14:22 /hbase/.logs
drwxr-xr-x - hadoop supergroup 0 2009-10-15 14:33 /hbase/.logs/srv1.foo.bar,60020,1254172960891
-rw-r--r-- 3 hadoop supergroup 14980 2009-10-14 01:32 /hbase/.logs/srv1.foo.bar,60020,1254172960891/hlog.dat.1255509179458
-rw-r--r-- 3 hadoop supergroup 1773 2009-10-14 02:33 /hbase/.logs/srv1.foo.bar,60020,1254172960891/hlog.dat.1255512781014
-rw-r--r-- 3 hadoop supergroup 37902 2009-10-14 03:33 /hbase/.logs/srv1.foo.bar,60020,1254172960891/hlog.dat.1255516382506
...
-rw-r--r-- 3 hadoop supergroup 137648437 2009-09-28 14:20 /hbase/docs/1905740638/oldlogfile.log
...
drwxr-xr-x - hadoop supergroup 0 2009-09-27 18:03 /hbase/docs/999041123
-rw-r--r-- 3 hadoop supergroup 2323 2009-09-01 23:16 /hbase/docs/999041123/.regioninfo
drwxr-xr-x - hadoop supergroup 0 2009-10-13 01:36 /hbase/docs/999041123/cache
-rw-r--r-- 3 hadoop supergroup 91540404 2009-10-13 01:36 /hbase/docs/999041123/cache/5151973105100598304
drwxr-xr-x - hadoop supergroup 0 2009-09-27 18:03 /hbase/docs/999041123/contents
-rw-r--r-- 3 hadoop supergroup 333470401 2009-09-27 18:02 /hbase/docs/999041123/contents/4397485149704042145
drwxr-xr-x - hadoop supergroup 0 2009-09-04 01:16 /hbase/docs/999041123/language
-rw-r--r-- 3 hadoop supergroup 39499 2009-09-04 01:16 /hbase/docs/999041123/language/8466543386566168248
drwxr-xr-x - hadoop supergroup 0 2009-09-04 01:16 /hbase/docs/999041123/mimetype
-rw-r--r-- 3 hadoop supergroup 134729 2009-09-04 01:16 /hbase/docs/999041123/mimetype/786163868456226374
drwxr-xr-x - hadoop supergroup 0 2009-10-08 22:45 /hbase/docs/999882558
-rw-r--r-- 3 hadoop supergroup 2867 2009-10-08 22:45 /hbase/docs/999882558/.regioninfo
drwxr-xr-x - hadoop supergroup 0 2009-10-09 23:01 /hbase/docs/999882558/cache
-rw-r--r-- 3 hadoop supergroup 45473255 2009-10-09 23:01 /hbase/docs/999882558/cache/974303626218211126
drwxr-xr-x - hadoop supergroup 0 2009-10-12 00:37 /hbase/docs/999882558/contents
-rw-r--r-- 3 hadoop supergroup 467410053 2009-10-12 00:36 /hbase/docs/999882558/contents/2507607731379043001
drwxr-xr-x - hadoop supergroup 0 2009-10-09 23:02 /hbase/docs/999882558/language
-rw-r--r-- 3 hadoop supergroup 541 2009-10-09 23:02 /hbase/docs/999882558/language/5662037059920609304
drwxr-xr-x - hadoop supergroup 0 2009-10-09 23:02 /hbase/docs/999882558/mimetype
-rw-r--r-- 3 hadoop supergroup 84447 2009-10-09 23:02 /hbase/docs/999882558/mimetype/2642281535820134018
drwxr-xr-x - hadoop supergroup 0 2009-10-14 10:58 /hbase/docs/compaction.dir
The first set of files are the log files handled by the HLog instances, which are created in a directory called .logs underneath the HBase root directory. There is then a subdirectory for each HRegionServer and within it a log for each HRegion.

Next there is a file called oldlogfile.log which you may not even see on your cluster. It is created by one of the exceptions I mentioned earlier as far as file access is concerned: it is the result of so-called "log splits". When the HMaster starts and finds a log file that is not handled by an HRegionServer anymore, it splits the log, copying the HLogKey's to the new regions they should be in. It places them directly in the region's directory in a file named oldlogfile.log. Now when the respective HRegion is instantiated it reads these files, inserts the contained data into its local MemStore, and starts a flush to persist the data right away and delete the file.

Note: Sometimes you may see left-over oldlogfile.log.old files (yes, there is another .old at the end), which are caused by the HMaster trying repeatedly to split the log and finding that another split log was already in place. At that point you have to consult the HRegionServer or HMaster logs to see what is going on and whether you can remove those files. I found at times that they were empty and therefore could safely be removed.

The next set of files are the actual regions. Each region name is encoded using a Jenkins hash function and a directory is created for it. The reason to hash the region name is that it may contain characters that cannot be used in a path name in DFS. The Jenkins hash always returns legal characters, as simple as that. So you get the following path structure:
/hbase/<tablename>/<encoded-regionname>/<column-family>/<filename>
In the root of the region directory there is also a .regioninfo file holding metadata about the region. This will be used in the future by an HBase fsck utility (see HBASE-7) to be able to rebuild a broken .META. table. A first use of the region info can be seen in HBASE-1867.

In each column-family directory you can see the actual data files, which I explain in the following section in detail.

Something that I have not shown above are split regions with their initial daughter reference files. When a data file within a region grows larger than the configured hbase.hregion.max.filesize, the region is split in two. This is done initially very quickly because the system simply creates two reference files in the new regions that are now supposed to host each half. The name of a reference file is an ID with the hashed name of the referenced region as a postfix, e.g. 1278437856009925445.3323223323. The reference files only hold little information: the key the original region was split at and whether it is the top or bottom reference. Of note is that these references are then used by the HalfHFileReader class (which I also omitted from the big picture above as it is only used temporarily) to read the original region's data files. Only upon a compaction are the original files rewritten into separate files in the new region directory. This also removes the small reference files as well as the original data file in the original region.

And this also concludes the file dump here; the last thing you see is a compaction.dir directory in each table directory. It is used when splitting or compacting regions as noted above. It is usually empty and is used as a scratch area to stage the new data files before swapping them into place.

HFile
So we are now at a very low level of HBase's architecture. HFile's (kudos to Ryan Rawson) are the actual storage files, specifically created to serve one purpose: store HBase's data fast and efficiently. They are apparently based on Hadoop's TFile (see HADOOP-3315) and mimic the SSTable format used in Google's BigTable architecture. The previous use of Hadoop's MapFile's in HBase proved to be not good enough performance-wise. So what do the files look like?

The files have a variable length; the only fixed blocks are the FileInfo and Trailer blocks. As the picture shows, it is the Trailer that has the pointers to the other blocks. It is written at the end of persisting the data to the file, finalizing the now immutable data store. The Index blocks record the offsets of the Data and Meta blocks. Both the Data and the Meta blocks are actually optional, but you would most likely always find data in a data store file.
How is the block size configured? It is driven solely by the HColumnDescriptor, which in turn is specified at table creation time by the user or defaults to reasonable standard values. Here is an example as shown in the master's web-based interface:
{NAME => 'docs', FAMILIES => [{NAME => 'cache', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME => 'contents', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, ...
The default is "64KB" (or 65535 bytes). Here is what the HFile JavaDoc explains:
"Minimum block size. We recommend a setting of minimum block size between 8KB to 1MB for general usage. Larger block size is preferred if files are primarily for sequential access. However, it would lead to inefficient random access (because there are more data to decompress). Smaller blocks are good for random access, but require more memory to hold the block index, and may be slower to create (because we must flush the compressor stream at the conclusion of each data block, which leads to an FS I/O flush). Further, due to the internal caching in Compression codec, the smallest possible block size would be around 20KB-30KB."
So each block, with its prefixed "magic" header, contains either plain or compressed data. What that looks like we will see in the next section.
One thing you may notice is that the default block size for files in DFS is 64MB, which is 1024 times the HFile default block size. So the HBase storage file blocks do not match the Hadoop blocks, and you have to think about both parameters separately and find the sweet spot in terms of performance for your particular setup.
One option you may see in the HBase configuration is hfile.min.blocksize.size. It seems to be used only during migration from earlier versions of HBase (which had no block-based file format) and when creating HFiles directly, for example during bulk imports.
So far so good, but how can you see whether an HFile is OK or what data it contains? There is an app for that!
The HFile.main() method provides the tools to dump a data file:
$ hbase org.apache.hadoop.hbase.io.hfile.HFile
usage: HFile [-f <file>] [-v] [-r <region>] [-a] [-p] [-m] [-k]
 -a,--checkfamily   Enable family check
 -f,--file          File to scan. Pass full-path; e.g.
                    hdfs://a:9000/hbase/.META./12/34
 -k,--checkrow      Enable row order check; looks for out-of-order keys
 -m,--printmeta     Print meta data of file
 -p,--printkv       Print key/value pairs
 -r,--region        Region to scan. Pass region name; e.g. '.META.,,1'
 -v,--verbose       Verbose output; emits file and meta data delimiters
Here is an example of what the output will look like (shortened here):
$ hbase org.apache.hadoop.hbase.io.hfile.HFile -v -p -m -f \
hdfs://srv1.foo.bar:9000/hbase/docs/999882558/mimetype/2642281535820134018
Scanning -> hdfs://srv1.foo.bar:9000/hbase/docs/999882558/mimetype/2642281535820134018
...
K: \x00\x04docA\x08mimetype\x00\x00\x01\x23y\x60\xE7\xB5\x04 V: text\x2Fxml
K: \x00\x04docB\x08mimetype\x00\x00\x01\x23x\x8C\x1C\x5E\x04 V: text\x2Fxml
K: \x00\x04docC\x08mimetype\x00\x00\x01\x23xz\xC08\x04 V: text\x2Fxml
K: \x00\x04docD\x08mimetype\x00\x00\x01\x23y\x1EK\x15\x04 V: text\x2Fxml
K: \x00\x04docE\x08mimetype\x00\x00\x01\x23x\xF3\x23n\x04 V: text\x2Fxml
Scanned kv count -> 1554
Block index size as per heapsize: 296
reader=hdfs://srv1.foo.bar:9000/hbase/docs/999882558/mimetype/2642281535820134018, \
compression=none, inMemory=false, \
firstKey=US6683275_20040127/mimetype:/1251853756871/Put, \
lastKey=US6684814_20040203/mimetype:/1251864683374/Put, \
avgKeyLen=37, avgValueLen=8, \
entries=1554, length=84447
fileinfoOffset=84055, dataIndexOffset=84277, dataIndexCount=2, metaIndexOffset=0, \
metaIndexCount=0, totalBytes=84055, entryCount=1554, version=1
Fileinfo:
MAJOR_COMPACTION_KEY = \xFF
MAX_SEQ_ID_KEY = 32041891
hfile.AVG_KEY_LEN = \x00\x00\x00\x25
hfile.AVG_VALUE_LEN = \x00\x00\x00\x08
hfile.COMPARATOR = org.apache.hadoop.hbase.KeyValue\x24KeyComparator
hfile.LASTKEY = \x00\x12US6684814_20040203\x08mimetype\x00\x00\x01\x23x\xF3\x23n\x04
The first part is the actual data stored as KeyValue pairs, explained in detail in the next section. The second part dumps the internal HFile.Reader properties as well as the Trailer block details and, finally, the FileInfo block values. This is a great way to check whether a data file is still healthy.
KeyValue's
In essence each KeyValue in the HFile is simply a low-level byte array that allows for "zero-copy" access to the data, even with lazy or custom parsing if necessary. How are the instances arranged? The structure starts with two fixed-length numbers indicating the size of the key part and the value part. With that information you can offset into the array to, for example, get direct access to the value while ignoring the key - if you know what you are doing. Otherwise you can get the required information from the key part. Once parsed into a KeyValue object you have getters to access the details.
Note: One thing to watch out for is the difference between KeyValue.getKey() and KeyValue.getRow(). I think for me the confusion arose from referring to "row keys" as the primary key to get a row out of HBase. That would be the latter of the two methods, i.e. KeyValue.getRow(); the former simply returns the complete byte array representing the raw "key" as colored and labeled in the diagram. A rough sketch of walking such a raw buffer by hand follows below.
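To make the layout a bit more tangible, here is a minimal sketch of decoding a single entry with plain java.nio. It only reflects the coarse structure described above - a 4-byte key length, a 4-byte value length, then the two parts - and ignores the internal make-up of the key (row, family, qualifier, timestamp, type), so treat it as illustrative rather than as the exact on-disk format.

import java.nio.ByteBuffer;

public class KeyValueSketch {
  // Walks one KeyValue-style entry: key length, value length, key bytes,
  // value bytes. The key's internal fields are not parsed here.
  public static void dump(byte[] buffer) {
    ByteBuffer bb = ByteBuffer.wrap(buffer);
    int keyLength = bb.getInt();
    int valueLength = bb.getInt();
    byte[] key = new byte[keyLength];
    bb.get(key);
    byte[] value = new byte[valueLength];
    bb.get(value);
    System.out.println("key is " + keyLength + " bytes, value is "
        + valueLength + " bytes");
  }
}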
This concludes my analysis of the HBase storage architecture. I hope it provides a starting point for your own efforts to dig into the gritty details. Have fun!
Update: Slightly updated with more links to JIRA issues. Also added ZooKeeper to be more precise about the current mechanisms to look up a region.
Update 2: Added details about region references.
Update 3: Added more details about region lookup as requested.
Hive vs. Pig
While I was looking at Hive and Pig for processing large amounts of data without the need to write MapReduce code, I found that there is no easy way to compare them against each other without reading into both in greater detail.
In this post I am trying to give you a 10,000ft view of both and compare some of the more prominent and interesting features. The following table - which is discussed below - compares what I deemed to be such features:
Feature | Hive | Pig |
Language | SQL-like | PigLatin |
Schemas/Types | Yes (explicit) | Yes (implicit) |
Partitions | Yes | No |
Server | Optional (Thrift) | No |
User Defined Functions (UDF) | Yes (Java) | Yes (Java) |
Custom Serializer/Deserializer | Yes | Yes |
DFS Direct Access | Yes (implicit) | Yes (explicit) |
Join/Order/Sort | Yes | Yes |
Shell | Yes | Yes |
Streaming | Yes | Yes |
Web Interface | Yes | No |
JDBC/ODBC | Yes (limited) | No |
Let us look now into each of these with a bit more detail.
General Purpose
The question is "What does Hive or Pig solve?". Both - and I think this lucky for us in regards to comparing them - have a very similar goal. They try to ease the complexity of writing MapReduce jobs in a programming language like Java by giving the user a set of tools that they may be more familiar with (more on this below). The raw data is stored in Hadoop's HDFS and can be any format although natively it usually is a TAB separated text file, while internally they also may make use of Hadoop's SequenceFile file format. The idea is to be able to parse the raw data file, for example a web server log file, and use the contained information to slice and dice them into what is needed for business needs. Therefore they provide means to aggregate fields based on specific keys. In the end they both emit the result again in either text or a custom file format. Efforts are also underway to have both use other systems as a source for data, for example HBase.
The features I am comparing are chosen pretty much at random because they stood out when I read into each of these two frameworks. So keep in mind that this is a subjective list.
Language
Hive stays close to SQL. But since it can only read already existing files in HDFS it lacks, for example, UPDATE or DELETE support. It focuses primarily on the query part of SQL. Even there it has its own spin on things to better reflect the underlying MapReduce process. Overall it seems that someone familiar with SQL can very quickly learn Hive's version of it and get results fast.
Pig on the other hand looks more like a very simplistic scripting language. As with those (and this is a nearly religious topic), some are more intuitive and some are less so. With PigLatin I was able to see what the samples do, but lacking full knowledge of its syntax I found myself wondering whether I would really be able to get what I needed without too many trial-and-error loops. Sure, Hive's SQL dialect probably needs just as many iterations to fully grasp - but at least you have a better idea of what to expect.
Schemas/Types
Hive once more uses a specific variation of SQL's Data Definition Language (DDL). It defines the "tables" beforehand and stores the schema in either a shared or a local database. Any JDBC offering will do, but it also comes with a built-in Derby instance to get you started quickly. If the database is local then only you can run specific Hive commands. If you share the database then others can also run them - or they would have to set up their own local database copy. Types are also defined upfront and the supported types are INT, BIGINT, BOOLEAN, STRING and so on. There are also array types that let you handle specific fields in the raw data files as a group.
Pig has no such metadata database. Data types and schemas are defined within each script. Furthermore, types are usually determined automatically by their use: if you use a field as an integer it is handled that way by Pig. You do have the option, though, to override this and use explicit type definitions, again within the script that needs them. Pig has a similar set of types compared to Hive; for example, it also has an array type called a "bag".
Partitions
Hive has a notion of partitions. They are basically subdirectories in HDFS. This allows, for example, processing a subset of the data by alphabet or date. It is up to the user to create these "partitions"; they are neither enforced nor required.
Pig does not seem to have such a feature. It may be that filters can achieve the same but it is not immediately obvious to me.
Server
Hive can start an optional server, which is reportedly Thrift-based. With the server, I presume, you can send queries from anywhere to the Hive server, which in turn executes them.
Pig does not seem to have such a facility yet.
User Defined Functions
Hive and Pig allow for user functionality by supplying Java code to the query process. These functions can add whatever extra feature is needed to crunch the numbers as required. A minimal sketch of such a function is shown below.
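As an illustration, a classic Hive UDF of that era is a small Java class extending Hive's UDF base class with an evaluate() method. The function itself is a made-up example; only the base class and the general pattern are Hive's.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Made-up example UDF: lower-cases a string column.
public final class LowerCaseUDF extends UDF {
  public Text evaluate(Text input) {
    if (input == null) {
      return null;
    }
    return new Text(input.toString().toLowerCase());
  }
}

After packaging it into a jar you register it from the Hive shell before using it in a query; Pig UDFs follow a similar pattern by extending its EvalFunc base class.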
Custom Serializer/Deserializer
Again, both Hive and Pig allow for custom Java classes that can read or write any file format required. I also assume that is how they will connect to HBase eventually (just a guess). You can write a parser for Apache log files or, for example, the binary Tokyo Tyrant Ulog format. The same goes for the output: write a database output class and you can write the results back into a database.
DFS Direct Access
Hive is smart about how to access the raw data. A "select * from table limit 10", for example, does a direct read from the file. If the query is too complicated it will fall back to a full MapReduce run to determine the outcome, just as expected.
With Pig I am not sure if it does the same to speed up simple PigLatin scripts. At least it does not seem to be mentioned anywhere as an important feature.
Join/Order/Sort
Hive and Pig have support for joining, ordering and sorting data dynamically. They serve the same purpose in both, pretty much allowing you to aggregate and sort the result as needed. Pig also has a COGROUP feature that allows you to do OUTER JOINs and so on. I think this is where you spend most of your time with either package - especially when you start out. But from a cursory look it seems both can do pretty much the same.
Shell
Both Hive and Pig have a shell that allows you to query specific things or run the actual queries. Pig also passes on DFS commands such as "cat" to allow you to quickly check what the outcome of a specific PigLatin script was.
Streaming
Once more, both frameworks seem to provide streaming interfaces so that you can process data with external tools or languages, such as Ruby or Python. How the streaming performs, and whether the two differ in that regard, I do not know. This is for you to tell me :)
Web Interface
Only Hive has a web interface or UI that can be used to visualize the various schemas and issue queries. This is different from the above-mentioned server, as it is an interactive web UI for a human operator. The Hive server, in contrast, is meant to be used from another programming or scripting language, for example.
JDBC/ODBC
Another Hive-only feature is the availability of a - again limited in functionality - JDBC/ODBC driver. It is another way for programmers to use Hive without having to bother with its shell or web interface, or even the Hive server. Since only a subset of features is available it will require small adjustments on the programmer's side of things, but otherwise it seems like a nice-to-have feature. A hedged sketch of what using it looks like follows below.
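For what it is worth, using that driver from Java looks like ordinary JDBC. The driver class and URL below correspond to the early Thrift-based Hive server on its conventional port 10000, and the "docs" table is just a placeholder - adjust both to whatever your installation actually exposes.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
  public static void main(String[] args) throws Exception {
    // Early HiveServer-style driver class and connection URL (assumed setup).
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT * FROM docs LIMIT 10");
    while (rs.next()) {
      System.out.println(rs.getString(1));
    }
    con.close();
  }
}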
Conclusion
Well, it seems to me that both can help you achieve the same goals, while Hive comes more naturally to database developers and Pig to "script kiddies" (just kidding). Hive has more features as far as access choices are concerned. Reportedly both projects also have roughly the same number of committers and development is going strong.
This is it from me. If you have a different opinion or a comment on the above, please feel free to reply below. Over and out!
Tuesday, May 26, 2009
HBase Schema Manager
As already mentioned in one of my previous posts, HBase at times makes it difficult to maintain or even create a new table structure. Imagine you have a running cluster and quite an elaborate table setup. Now you want to create a backup cluster for load balancing and general tasks like reporting etc. How do you get all the values from one system into the other?
While there are various examples that help you back up the data and eventually restore it, how do you "clone" the table schemas?
Or imagine you have an existing system like the one we talked about above and you simply want to change a few things around. With an RDBMS you can save the required steps in a DDL statement and execute it on the server - or the backup server etc. But with HBase there is no DDL, nor even the possibility of executing pre-built scripts against a running cluster.
What I described in my previous post was a way to store the table schemas in an XML configuration file and run that against a cluster. The code handles adding new tables and, more importantly, the addition, removal and modification of column families for any named table.
I have put this all into a separate Java application that may be useful to you. You can get it from my GitHub repository. It is really simple to use: you create an XML-based configuration file, for example:
<?xml version="1.0" encoding="UTF-8"?>
<configurations>
<configuration>
<name>foo</name>
<description>Configuration for the FooBar HBase cluster.</description>
<hbase_master>foo.bar.com:60000</hbase_master>
<schema>
<table>
<name>test</name>
<description>Test table.</description>
<column_family>
<name>sample</name>
<description>Sample column.</description>
<!-- Default: 3 -->
<max_versions>1</max_versions>
<!-- Default: DEFAULT_COMPRESSION_TYPE -->
<compression_type/>
<!-- Default: false -->
<in_memory/>
<!-- Default: false -->
<block_cache_enabled/>
<!-- Default: -1 (forever) -->
<time_to_live/>
<!-- Default: 2147483647 -->
<max_value_length/>
<!-- Default: DEFAULT_BLOOM_FILTER_DESCRIPTOR -->
<bloom_filter/>
</column_family>
</table>
</schema>
</configuration>
</configurations>
Then all you have to do is run the application like so:
java -jar hbase-manager-1.0.jar schema.xml
The "schema.xml" is the above XML configuration saved on your local machine. The output shows the steps performed:
$ java -jar hbase-manager-1.0.jar schema.xml
creating table test...
table created
done.
You can also specify more options on the command line:
usage: HbaseManager [<options>] <schema-xml-filename> [<config-name>]
-l,--list lists all tables but performs no further action.
-n,--nocreate do not create non-existent tables.
-v,--verbose print verbose output.
If you use the "verbose" option you get more details:
$ java -jar hbase-manager-1.0.jar -v schema.xml
schema filename: schema.xml
configuration used: null
using config number: default
table schemas read from config:
[name -> test
description -> Test table.
columns -> {sample=name -> sample
description -> Sample column.
maxVersions -> 1
compressionType -> NONE
inMemory -> false
blockCacheEnabled -> false
maxValueLength -> 2147483647
timeToLive -> -1
bloomFilter -> false}]
hbase.master -> foo.bar.com:60000
authoritative -> true
name -> test
tableExists -> true
changing table test...
no changes detected!
done.
Finally you can use the "list" option to check initial connectivity and the successful changes:
$ java -jar hbase-manager-1.0.jar -l schema.xml
tables found: 1
test
done.
A few notes: First and most importantly, if you change a large table, i.e. one with thousands of regions, this process can take quite a long time. This is caused by the enableTable() call having to scan the complete .META. table to assign the regions to their respective region servers. There is possibly room for improvement in my little application to handle this better - suggestions welcome! A rough sketch of the disable/modify/enable cycle involved is shown after these notes.
Also, I do not have Bloom Filter settings implemented, as this is still changing from 0.19 to 0.20. Once it has been finalized I will add support for it.
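For readers who want to script something similar themselves, the administrative calls involved look roughly like the sketch below. This is not the actual code of the tool but a hedged outline against the 0.19/0.20-era HBaseAdmin client API; the table and family names are made up, and the exact add/modify method signatures may differ in your version.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class AlterFamilySketch {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
    String tableName = "test"; // made-up example table from the schema above
    // A table has to be offline before its schema can be changed.
    admin.disableTable(tableName);
    // Add a new column family; modifying or removing one works the same way
    // through the corresponding HBaseAdmin calls.
    admin.addColumn(tableName, new HColumnDescriptor(Bytes.toBytes("sample")));
    // On a table with thousands of regions this is the expensive part.
    admin.enableTable(tableName);
  }
}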
If you do not specify a configuration name then the first one is used. Having more than one configuration allows you to have multiple clusters defined in one schema file and by specifying the name you can execute only a specific one when you need to.