HBase
HBase is a columnar NoSQL store within the Hadoop ecosystem, designed for billions of rows and millions of columns. Runs on top of HDFS. No SQL - uses API methods (GET, PUT, SCAN, DELETE).
Data Model
Table
|-- Row Key (primary access path, binary sorted)
|-- Column Family 1
| |-- column1: value (timestamp v3)
| |-- column1: value (timestamp v2) <- versioning
| |-- column2: value
|-- Column Family 2
|-- columnA: value
Key Concepts
| Concept | Details |
| Row Key | Unique ID, ALL access by row key or key range |
| Column Family | Defined at creation. Groups columns with shared properties (compression, TTL, versions) |
| Versioning | Built-in. Multiple versions per cell with timestamps. Configurable per family |
| Binary storage | All values stored in binary. App handles serialization |
| Free schema | Columns added dynamically within existing column families |
HBase vs RDBMS
| Feature | HBase | RDBMS |
| Schema | Flexible | Fixed |
| Scale | Billions of rows | Limited (unless MPP) |
| Access pattern | Key/key-range lookups | Arbitrary SQL |
| Joins | Not natively | Full support |
| Transactions | Row-level atomic | ACID across tables |
| Query language | API calls | SQL |
Properties
- Column families stored separately on HDFS
- Efficient for key-based and key-range lookups
- TTL for automatic data expiration
- MapReduce integration (input/output)
- Versions stored in descending order (newest first)
Gotchas
- No SQL - all access through GET, PUT, SCAN, DELETE API
- Row key design is critical - poor key design leads to hotspots
- Column families must be defined at table creation
- Not suitable for arbitrary analytical queries (use Hive/Spark on top)
See Also