Before we embark on the comparison, a few notes on its scope and the terminology used are in order.
Scope
The comparison in this post is based on the OSDI'06 paper that describes the system Google implemented in about seven person-years and which has been in operation since 2005. The paper was published in 2006, while the HBase sub-project of Hadoop was established only around the end of that same year to early 2007. Back then the current version of Hadoop was 0.15.0. Given that we are now about two years in, with Hadoop 0.20.1 and HBase 0.20.2 available, you can hopefully understand that indeed much has happened since. Please also note that I am comparing a 14-page high-level technical paper with an open-source project that can be examined freely from top to bottom. This usually means that there is more to tell about how HBase does things, simply because the information is available. Towards the end I will also address a few newer features that BigTable has nowadays and how HBase compares to those. We start, though, with naming conventions.
Terminology
There are a few different terms used in either system to describe the same thing. The most prominent is what HBase calls "regions" while Google refers to them as "tablets". These are the partitions of consecutive rows spread across many "region servers" or "tablet servers" respectively. Apart from that, most differences are minor or caused by the use of related technologies, since Google's code is obviously closed-source and therefore only mirrored by open-source projects. The open-source projects are free to use other terms and, most importantly, names for the projects themselves.
Features
The following table lists various "features" of BigTable and compares them with what HBase has to offer. Some are actual implementation details, some are configurable options, and so on. This may be confusing, but it would be difficult to sort them into categories without ending up with only one entry in each of them.
Feature | Google BigTable | Apache HBase | Notes |
Atomic Read/Write/Modify | Yes, per row | Yes, per row | Since BigTable does not strive to be a relational database it does not have transactions. The closest to such a mechanism is the atomic access to each row in the table. HBase also implements a row lock API which allows the user to lock more than one row at a time. |
Lexicographic Row Order | Yes | Yes | All rows are sorted lexicographically in one order and that one order only. Again, this is no SQL database where you can have different sorting orders. |
Block Support | Yes | Yes | Within each storage file data is written as smaller blocks of data. This enables faster loading of data from large storage files. The size is configurable in either system. The typical size is 64K. |
Block Compression | Yes, per column family | Yes, per column family | Google uses BMDiff and Zippy in a two-step process. BMDiff works really well because neighboring key-value pairs in the store files are often very similar. This can be achieved by using versioning so that all modifications to a value are stored next to each other but still have a lot in common. Or by designing the row keys in such a way that, for example, web pages from the same site are all bundled. Zippy then is a modified LZW algorithm. HBase, on the other hand, uses the standard Java-supplied GZip or, with a little effort, the GPL-licensed LZO format. There are indications, though, that Hadoop may also want to have BMDiff (HADOOP-5793) and possibly Zippy as well. |
Number of Column Families | Hundreds at Most | Less than 100 | While the number of rows and columns is theoretically unbounded, the number of column families is not. This is a design trade-off but does not impose too many restrictions if the tables and keys are designed accordingly. |
Column Family Name Format | Printable | Printable | The main reason for HBase here is that column family names are used as directories in the file system. |
Qualifier Format | Arbitrary | Arbitrary | Any arbitrary byte[] array can be used. |
Key/Value Format | Arbitrary | Arbitrary | Like above, any arbitrary byte[] array can be used. |
Access Control | Yes | No | BigTable enforces access control on a column family level. HBase does not yet have that feature (see HBASE-1697). |
Cell Versions | Yes | Yes | Versioning is done using timestamps. See the next feature below too. The number of versions that should be kept is freely configurable on a column family level. |
Custom Timestamps | Yes (micro) | Yes (milli) | With both systems you can either set the timestamp of a stored value yourself or leave the default of "now". There are "known" restrictions in HBase that the outcome is indeterminate when adding older timestamps after newer ones have already been stored. |
Data Time-To-Live | Yes | Yes | Besides keeping versions of data cells, the user can also set a time-to-live on the stored data that allows data to be discarded after a specific amount of time. |
Batch Writes | Yes | Yes | Both systems allow table operations to be batched (see the client API sketch after this table). |
Value based Counters | Yes | Yes | BigTable and HBase can use a specific column as an atomic counter. HBase does this by acquiring a row lock before the value is incremented. |
Row Filters | Yes | Yes | Again, both systems allow filters to be applied when scanning rows. |
Client Script Execution | Yes | No | BigTable uses Sawzall to enable users to process the stored data. |
MapReduce Support | Yes | Yes | Both systems have convenience classes that allow scanning a table in MapReduce jobs. |
Storage Systems | GFS | HDFS, S3, S3N, EBS | While BigTable works on Google's GFS, HBase has the option to use any file system as long as there is a proxy or driver class for it. |
File Format | SSTable | HFile | |
Block Index | At end of file | At end of file | Both storage file formats have a similar block oriented structure with the block index stored at the end of the file. |
Memory Mapping | Yes | No | BigTable can memory map storage files directly into memory. |
Lock Service | Chubby | ZooKeeper | There is a difference in that ZooKeeper is used in HBase to coordinate tasks rather than to provide locking services. Overall though, ZooKeeper does for HBase pretty much what Chubby does for BigTable, with slightly different semantics. |
Single Master | Yes | No | HBase recently added support for multiple masters. These are on "hot" standby and monitor the master's ZooKeeper node. |
Tablet/Region Count | 10-1000 | 10-1000 | Both systems recommend about the same number of regions per region server. Of course this depends on many things, but given a similar setup as far as "commodity" machines are concerned, it seems to result in the same amount of load on each server. |
Tablet/Region Size | 100-200MB | 256MB | The maximum region size can be configured for HBase and BigTable. HBase uses 256MB as the default value. |
Root Location | 1st META / Chubby | -ROOT- / ZooKeeper | HBase handles the Root table slightly differently from BigTable, where it is the first region in the Meta table. HBase uses its own table with a single region to store the Root table. Once either system starts, the address of the server hosting the Root region is stored in ZooKeeper or Chubby so that the clients can resolve its location without hitting the master. |
Client Region Cache | Yes | Yes | The clients in either system cache the location of regions and have appropriate mechanisms to detect stale information and update the local cache respectively. |
Meta Prefetch | Yes | No (?) | A design feature of BigTable is to fetch the information of more than one Meta region at a time. This proactively fills the client cache for future lookups. |
Historian | Yes | Yes | The history of region related events (such as splits, assignment, reassignment) is recorded in the Meta table. |
Locality Groups | Yes | No | It is not entirely clear, but it seems everything in BigTable is defined by Locality Groups. They group multiple column families into one so that they get stored together and also share the same configuration parameters. A single column family is probably a Locality Group with one member. HBase does not have this option and handles each column family separately. |
In-Memory Column Families | Yes | Yes | These are for relatively small tables that need very fast access times. |
KeyValue (Cell) Cache | Yes | No | This is a cache that serves hot cells. |
Block Cache | Yes | Yes | Blocks read from the storage files are cached internally in configurable caches. |
Bloom Filters | Yes | Yes | These filters make it possible - at the cost of using memory on the region server - to quickly check whether a specific cell exists at all. |
Write-Ahead Log (WAL) | Yes | Yes | Each region server in either system stores one modification log for all regions it hosts. |
Secondary Log | Yes | No | In addition to the write-ahead log mentioned above, BigTable has a second log that it can use when the first is performing slowly. This is a performance optimization. |
Skip Write-Ahead Log | ? | Yes | For bulk imports the client in HBase can opt to skip writing into the WAL. |
Fast Table/Region Split | Yes | Yes | Splitting a region or tablet is fast because the daughter regions first read the original storage files until a compaction finally rewrites the data into each region's own local store. |
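To make some of the table entries above more concrete, here is a small sketch against the HBase 0.20-era client API. It is only an illustration under my own assumptions: the table name "testtable", the column family "data", and all row keys and values are made up, and exact method names (for example Put.setWriteToWAL) may vary slightly between releases.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.RowLock;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FeatureTour {
  public static void main(String[] args) throws IOException {
    HBaseConfiguration conf = new HBaseConfiguration();

    // Cell versions, time-to-live and in-memory placement are all
    // configured per column family.
    HColumnDescriptor family = new HColumnDescriptor(Bytes.toBytes("data"));
    family.setMaxVersions(3);             // keep up to three versions per cell
    family.setTimeToLive(7 * 24 * 3600);  // discard cells older than a week
    family.setInMemory(true);             // keep this family aggressively cached
    HTableDescriptor descriptor = new HTableDescriptor("testtable");
    descriptor.addFamily(family);
    new HBaseAdmin(conf).createTable(descriptor);

    HTable table = new HTable(conf, "testtable");

    // Custom timestamps: set the version explicitly instead of "now".
    Put put = new Put(Bytes.toBytes("row-1"));
    put.add(Bytes.toBytes("data"), Bytes.toBytes("qual"),
        System.currentTimeMillis(), Bytes.toBytes("value"));
    table.put(put);

    // Batch writes: hand over a list of Puts in a single call. For bulk
    // imports the client can also opt out of the write-ahead log.
    List<Put> batch = new ArrayList<Put>();
    for (int i = 0; i < 10; i++) {
      Put p = new Put(Bytes.toBytes("row-" + i));
      p.add(Bytes.toBytes("data"), Bytes.toBytes("qual"), Bytes.toBytes("v" + i));
      p.setWriteToWAL(false);   // skip the WAL, trading durability for speed
      batch.add(p);
    }
    table.put(batch);

    // Value based counter: atomically increment a cell.
    long count = table.incrementColumnValue(Bytes.toBytes("row-1"),
        Bytes.toBytes("data"), Bytes.toBytes("counter"), 1);

    // Explicit row lock spanning several operations on the same row.
    RowLock lock = table.lockRow(Bytes.toBytes("row-1"));
    try {
      Put locked = new Put(Bytes.toBytes("row-1"), lock);
      locked.add(Bytes.toBytes("data"), Bytes.toBytes("qual2"), Bytes.toBytes("x"));
      table.put(locked);
    } finally {
      table.unlockRow(lock);
    }

    // Row filter applied while scanning, here matching a row key prefix.
    Scan scan = new Scan();
    scan.setFilter(new PrefixFilter(Bytes.toBytes("row-")));
    ResultScanner scanner = table.getScanner(scan);
    for (Result result : scanner) {
      System.out.println(result + " (counter at " + count + ")");
    }
    scanner.close();
  }
}
```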
New Features
As mentioned above, a few years have passed since the original OSDI'06 BigTable paper. Jeff Dean - a fellow at Google - has mentioned a few new BigTable features during speeches and presentations he gave recently. We will have a look at some of them here.
Feature | Google BigTable | Apache HBase | Notes |
Client Isolation | Yes | No | BigTable is internally used to serve many separate clients and can therefore keep the data between them isolated. |
Coprocessors | Yes | No | BigTable can host code that resides with the regions and splits with them as well. See HBASE-2000 for progress on this feature within HBase. |
Corruption Safety | Yes | No | This is an interesting topic. BigTable uses CRC checksums to verify that data has been written safely. While HBase does not have this, the question is whether that is built into Hadoop's HDFS. |
Replication | Yes | No | HBase is working on the same topic in HBASE-1295. |
Variations and Differences
Some of the above features need a bit more looking into, as they are difficult to narrow down to simple yes-or-no answers. I am addressing them separately below.
Lock Service
This is from the BigTable paper:
Bigtable uses Chubby for a variety of tasks: to ensure that there is at most one active master at any time; to store the bootstrap location of Bigtable data (see Section 5.1); to discover tablet servers and finalize tablet server deaths (see Section 5.2); to store Bigtable schema information (the column family information for each table); and to store access control lists. If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable.
There is a lot of overlap with how HBase uses ZooKeeper. What is different, though, is that schema information is not stored in ZooKeeper (yet; see http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases for details). What is important here, though, is the same reliance on the lock service being available. From my own experience and from reading the threads on the HBase mailing list, it is often underestimated what can happen when ZooKeeper does not get the resources it needs to respond in a timely manner. It is better to have a small ZooKeeper cluster on older machines not doing anything else, as opposed to having ZooKeeper nodes running next to the already heavy Hadoop or HBase processes. Once you starve ZooKeeper you will see a domino effect of HBase nodes going down with it - including the master(s).
Update: After talking to a few members of the ZooKeeper team I would like to point out that this is indeed not a ZooKeeper issue. It has to do with the fact that if an already heavily loaded node also has to respond to ZooKeeper in time, you may face a timeout situation in which the HBase RegionServers and even the Master may think that their coordination service is gone and shut themselves down. Patrick Hunt has responded to this by mail and by post. Please read both to see that ZooKeeper is able to handle the load. I personally recommend setting up ZooKeeper in combination with HBase on a separate cluster, maybe a set of spare machines left over from a recent upgrade of the cluster that are slightly outdated (no 2x quad-core CPU with 16GB of memory) but otherwise perfectly fine. This also allows you to monitor those machines separately, instead of seeing a combined CPU load of 100% on the servers without really knowing where it comes from and what effect it may have.
Another important difference is that ZooKeeper is not a lock service like Chubby - and I do think it does not have to be as far as HBase is concerned. ZooKeeper is a distributed coordination service enabling HBase to do master node elections and so on. It also allows using semaphores to indicate state or required actions. So where Chubby creates a lock file to indicate a tablet server is up and running, HBase in turn uses ephemeral nodes that exist as long as the session between the RegionServer which creates that node and ZooKeeper is active. This also causes differences in semantics: in BigTable the master can delete a tablet server's lock file to indicate that it has lost its lease on its tablets, while in HBase this has to be handled differently because of the slightly less restrictive architecture of ZooKeeper. These are only semantics, as mentioned, and do not mean one is better than the other - just different.
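To make that difference a bit more tangible, here is a minimal, purely illustrative sketch of the ephemeral-node mechanism using the plain ZooKeeper Java API. The connection string, znode path and payload are made up, and this is not HBase's actual registration code - it only shows the principle that the node vanishes together with the session.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralPresence {
  public static void main(String[] args) throws Exception {
    // Open a session; ZooKeeper keeps it alive through client heartbeats.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
      public void process(WatchedEvent event) {
        // React to session and node events here.
      }
    });

    // Announce a (hypothetical) region server with an ephemeral znode.
    // The node is removed automatically as soon as the session expires,
    // which is how other processes learn that the server is gone - no
    // explicit lock file has to be created or deleted.
    zk.create("/demo-region-server-1", "host:port".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

    // ... keep the session open for as long as the server is alive ...
    Thread.sleep(30000);
    zk.close();
  }
}
```

With that in mind, back to the paper and how it describes the tablet location hierarchy: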
The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the location of all tablets in a special METADATA table. Each METADATA tablet contains the location of a set of user tablets. The root tablet is just the first tablet in the METADATA table, but is treated specially - it is never split - to ensure that the tablet location hierarchy has no more than three levels.
As mentioned above, in HBase the Root region is its own table with a single region. I strongly doubt that it makes any difference compared to having it as the first (non-splittable) region of the Meta table. It is just the same feature implemented differently.
The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row.
HBase has a different layout here. It stores the start and end row with each region, where the end row is exclusive and denotes the first (or start) row of the next region. Again, these are minor differences and I am not sure if there is a better or worse solution. It is just done differently.
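If you want to see this layout for yourself, the catalog table can be scanned just like any user table. The following is only a quick sketch against the 0.20 client API; what exactly the row keys contain beyond the table name and start row is best verified against your own cluster.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class MetaLayout {
  public static void main(String[] args) throws Exception {
    // The catalog table can be opened and scanned like any other table.
    HTable meta = new HTable(new HBaseConfiguration(), ".META.");
    ResultScanner scanner = meta.getScanner(new Scan());
    for (Result result : scanner) {
      // Each row key names the user table and - as far as I can tell -
      // the region's *start* row, as opposed to BigTable, which encodes
      // the end row. The full region details live in the row's columns.
      System.out.println(Bytes.toString(result.getRow()));
    }
    scanner.close();
  }
}
```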
Master Operation
To detect when a tablet server is no longer serving its tablets, the master periodically asks each tablet server for the status of its lock. If a tablet server reports that it has lost its lock, or if the master was unable to reach a server during its last several attempts, the master attempts to acquire an exclusive lock on the server's file. If the master is able to acquire the lock, then Chubby is live and the tablet server is either dead or having trouble reaching Chubby, so the master ensures that the tablet server can never serve again by deleting its server file. Once a server's file has been deleted, the master can move all the tablets that were previously assigned to that server into the set of unassigned tablets. To ensure that a Bigtable cluster is not vulnerable to networking issues between the master and Chubby, the master kills itself if its Chubby session expires.
This is quite different, even up to the current HBase 0.20.2. Here the master uses a heartbeat protocol with which the region servers report for duty and subsequently confirm that they are still alive. I am not sure if this topic is covered by the master rewrite umbrella issue HBASE-1816 - or if it needs to be addressed at all. It could well be that what we have in HBase now is sufficient and does its job just fine. It was created when there was no lock service yet and could therefore be considered legacy code too.
Master Startup
The master executes the following steps at startup. (1) The master grabs a unique master lock in Chubby, which prevents concurrent master instantiations. (2) The master scans the servers directory in Chubby to find the live servers. (3) The master communicates with every live tablet server to discover what tablets are already assigned to each server. (4) The master scans the METADATA table to learn the set of tablets. Whenever this scan encounters a tablet that is not already assigned, the master adds the tablet to the set of unassigned tablets, which makes the tablet eligible for tablet assignment.
In line with what I mentioned above, this part of the code was created before ZooKeeper was available. So HBase actually waits for the region servers to report for duty. It also scans the .META. table to learn what is there and which server it is assigned to. ZooKeeper is (as of yet) only used to publish the location of the server hosting the -ROOT- region.
Tablet/Region Splits
In case the split notification is lost (either because the tablet server or the master died), the master detects the new tablet when it asks a tablet server to load the tablet that has now split. The tablet server will notify the master of the split, because the tablet entry it finds in the METADATA table will specify only a portion of the tablet that the master asked it to load.
The master node in HBase uses the .META. table solely to detect when a region was split but the notification was lost. For that reason it scans the .META. table on a regular basis to see when a region appears that is not yet assigned. It will then assign that region as per its default strategy.
Compactions
The following are more terminology differences than anything else.
As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This minor compaction process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies. Incoming read and write operations can continue while compactions occur.
HBase has a similar operation, but it is referred to as a "flush". As opposed to that, "minor compactions" in HBase rewrite the last N used store files, i.e. those with the most recent mutations, as they are probably much smaller than previously created files that hold more data.
... we bound the number of such files by periodically executing a merging compaction in the background. A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished.
This again refers to what is called "minor compaction" in HBase.
A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction.
Here we have an exact match: a "major compaction" in HBase also rewrites all files into one.
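As a rough illustration of how these three operations surface in HBase, the 0.20 client API exposes them as explicit administrative calls. The sketch below assumes a table called "testtable"; the calls merely request the operations, and the configuration properties named in the comments are the 0.20-era ones as I know them and would normally live in hbase-site.xml.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CompactionDemo {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());

    // BigTable "minor compaction": write the in-memory memstore out to a
    // new store file. Triggered automatically once the memstore exceeds
    // hbase.hregion.memstore.flush.size.
    admin.flush("testtable");

    // BigTable "merging compaction": an HBase minor compaction rewrites
    // the most recent store files into one. Triggered automatically once
    // a store holds more than hbase.hstore.compactionThreshold files.
    admin.compact("testtable");

    // Identical terminology: a major compaction rewrites all store files
    // of a region into a single one.
    admin.majorCompact("testtable");
  }
}
```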
Immutable Files
Knowing that files are fixed once written, the BigTable paper states:
The only mutable data structure that is accessed by both reads and writes is the memtable. To reduce contention during reads of the memtable, we make each memtable row copy-on-write and allow reads and writes to proceed in parallel.
I believe this is done similarly in HBase, but am not sure. It certainly has the same architecture, as HDFS files, for example, are also immutable once written.
I can only recommend that you read the BigTable paper yourself and make up your own mind. This post was inspired by the idea of learning what BigTable really has to offer and how much of it HBase already covers. The difficult part is, of course, that there is not too much information available on BigTable. But the numbers that even the 2006 paper lists are more than impressive. If HBase, as an open-source project with just a handful of committers, most of whom have full-time day jobs, can achieve something even remotely comparable, I think that is a huge success. And looking at the 0.21 and 0.22 road maps, the already small gap is going to shrink even further!