HBase in Practice at Taobao

Last edited by hyj on 2014-3-2 00:00

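Questions to keep in mind while reading
1. What does HTablePool correspond to among traditional database connection pools?
2. Which method returns a connection to the pool?
3. Under what circumstances is HBase slow?
4. How should a RowKey be designed?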

Prerequisites

HBase tutorial: you need some familiarity with HBase
MySQL basics: you need some familiarity with MySQL
Java basics: you need some familiarity with Java

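Why HBase

Advantages:

Goodbye to splitting data across databases and tables. Goodbye, TDDL.
Higher read and write performance.

Drawbacks:

No SQL
No ORM tools such as iBATIS or Hibernate; ORM for HBase is still immature
RowKey design is demanding
You have to build index tables yourself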

Using HBase

Building a singleton HBaseFactory

To build a singleton HBaseFactory we only need to care about three things:

hbase.zookeeper.quorum
zookeeper.znode.parent
maxSize of the HTablePool

We build the HBaseFactory object on top of HTablePool.

Why HTablePool

You can think of HTablePool as the equivalent of a JDBC connection pool. It is suited to multi-threaded use. To "return" a connection to the pool, just call HTableInterface.close().

The HBaseFactory interface

public interface HBaseFactory {
    /**
     * Gets the table with the given tableName from the pool
     */
    HTableInterface getHTable(String tableName);

    /**
     * Closes a table, which returns it to the pool
     */
    void closeHTable(HTableInterface hTableInterface);

    /** only for unit test*/
    boolean deleteTable(String tableName);
    /** only for unit test*/
    HTableDescriptor createTable(String tableName, int maxVersion);
}

The HBaseFactory implementation

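// Imports below are an assumption: an HBase 0.94-era client, Guava,
// commons-lang 2.x and SLF4J. Adjust package names to the versions you use.
import static com.google.common.base.Preconditions.checkArgument;
import static com.google.common.base.Preconditions.checkNotNull;
import static org.apache.commons.lang.StringUtils.isNotBlank;

import java.io.IOException;

import javax.annotation.PreDestroy;
import javax.inject.Inject; // assumption: could equally be com.google.inject.Inject

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.io.hfile.Compression;
import org.apache.hadoop.hbase.regionserver.StoreFile;
import org.apache.hadoop.hbase.util.Bytes;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.google.common.io.Closeables;

public class HBaseFactoryImpl implements HBaseFactory {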

    static Logger logger = LoggerFactory.getLogger(HBaseFactoryImpl.class);

    private HTablePool hTablePool = null;

    private HBaseAdmin hBaseAdmin = null;

    @Inject
    public HBaseFactoryImpl(String quorum, String parent, int maxSize) {
        checkArgument(isNotBlank(quorum));
        checkArgument(isNotBlank(parent));
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", quorum);
        conf.set("zookeeper.znode.parent", parent);

        conf.set("hbase.client.retries.number", "5");
        conf.set("hbase.client.pause", "200");
        conf.set("ipc.ping.interval", "3000");
        conf.setBoolean("hbase.ipc.client.tcpnodelay", true);

        hTablePool = new HTablePool(conf, maxSize);
        try {
            hBaseAdmin = new HBaseAdmin(conf);
        } catch (Exception e) {
            logger.error(e.getMessage(), e);
            throw new IllegalStateException(e);
        }
    }

    // Not declared in the HBaseFactory interface, so @Override would not compile here
    public HBaseAdmin getHBaseAdmin() {
        return checkNotNull(hBaseAdmin);
    }

    @Override
    public HTableInterface getHTable(String tableName) {
        checkArgument(isNotBlank(tableName));
        return checkNotNull(hTablePool.getTable(tableName));
    }

    @Override
    public void closeHTable(HTableInterface hTableInterface) {
        Closeables.closeQuietly(hTableInterface);
    }

    @Override
    public boolean deleteTable(String tableName) {
        checkArgument(isNotBlank(tableName));
        try {
            hBaseAdmin.disableTable(tableName);
            hBaseAdmin.deleteTable(tableName);
        } catch (IOException e) {
            logger.error(e.getMessage(), e);
            return false;
        }
        return true;
    }

    @Override
    public HTableDescriptor createTable(String tableName, int maxVersion) {
        return createTable(tableName, "cf", 0, maxVersion, null, null,
                null, 0);
    }

    protected HTableDescriptor createTable(
            String tableName, String columnFamily, int lifetime,
            int maxVersion, StoreFile.BloomType bloomType, String startKey,
            String endKey, int numRegions) {

        try {
            checkArgument(!checkNotNull(hBaseAdmin).tableExists(tableName),
                    "the table [%s] should not exist.", tableName);
        } catch (IOException e) {
            logger.error(e.getMessage(), e);
            throw new IllegalStateException(e);
        }

        HColumnDescriptor cf = getCF(columnFamily, lifetime, maxVersion,
                bloomType);
        HTableDescriptor table = new HTableDescriptor(tableName);
        table.addFamily(cf);
        try {
            if (StringUtils.isNotBlank(startKey)
                    && StringUtils.isNotBlank(endKey) && numRegions > 0)
                hBaseAdmin.createTable(table, Bytes.toBytes(startKey),
                        Bytes.toBytes(endKey), numRegions);
            else
                hBaseAdmin.createTable(table);
        } catch (IOException e) {
            logger.error(e.getMessage(), e);
            throw new IllegalStateException(e);
        }
        return describeTable(tableName);
    }

    private HColumnDescriptor getCF(String columnFamily, int lifetime,
                                    int maxVersion, StoreFile.BloomType bloomType) {
        HColumnDescriptor cf = new HColumnDescriptor(columnFamily);
        cf.setCompactionCompressionType(Compression.Algorithm.LZO);
        cf.setCompressionType(Compression.Algorithm.LZO);
        if (maxVersion > 0)
            cf.setMaxVersions(maxVersion > 1000000 ? 1000000 : maxVersion);
        if (lifetime > 0)
            cf.setTimeToLive(lifetime);
        if (null != bloomType)
            cf.setBloomFilterType(bloomType);
        else
            cf.setBloomFilterType(StoreFile.BloomType.ROW);
        return cf;
    }

    public HTableDescriptor describeTable(String tableName) {
        try {
            return checkNotNull(hBaseAdmin).getTableDescriptor(Bytes.toBytes(tableName));
        } catch (Exception e) {
            logger.error(e.getMessage(), e);
            throw new IllegalStateException(e);
        }
    }

    @PreDestroy
    public void destroy() throws Exception {
        Closeables.closeQuietly(hTablePool);
        Closeables.closeQuietly(hBaseAdmin);
    }
}

Usage

HTableInterface hTableInterface = null;
try {
    hTableInterface = hBaseFactory.getHTable("YOUR_TABLE_NAME");
    // code here …
} finally {
    hBaseFactory.closeHTable(hTableInterface);
}

Scan

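StartRow & Cache

If you do not set StartRow, the scan starts from the beginning of the table, and that is slow.
Setting Cache (scanner caching) lets a single request pull a batch of rows into client memory; otherwise every step of the iterator costs one network round trip.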
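About Filters

PrefixFilter is the most commonly used filter, and there is one point that needs great care.
If the RowKey looks like "123_1_00000" and the prefix you want is 123_1, always remember to write it as 123_1_ with the trailing separator; otherwise keys such as 123_10_00000 match as well.

Also keep the number of filters small, preferably no more than 2.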
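To make the prefix pitfall concrete, here is a minimal sketch in plain Java. It does not touch HBase at all: `String.startsWith` stands in for the prefix test, and the class name `PrefixPitfall` and the sample keys are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PrefixPitfall {
    /** Returns the row keys that start with the given prefix. */
    static List<String> matching(List<String> rowKeys, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String key : rowKeys) {
            if (key.startsWith(prefix)) {
                hits.add(key);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical keys: ids 1, 10 and 11 under the same parent 123
        List<String> rowKeys = Arrays.asList(
                "123_1_00000", "123_1_00001", "123_10_00000", "123_11_00000");

        // Without the trailing separator, ids 10 and 11 leak into the result
        System.out.println(matching(rowKeys, "123_1"));
        // With the trailing separator, only id 1 matches
        System.out.println(matching(rowKeys, "123_1_"));
    }
}
```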
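About paging

Paging is needed all the time in MySQL. To do the same in HBase, use PageFilter together with startRow. MySQL paging usually comes with a total count, though; never, ever do Count-like operations in HBase.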
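The paging idea can be sketched without a cluster. In the example below a `TreeMap` stands in for the sorted table, `pageSize` plays the role of the PageFilter limit, and the next page starts at the smallest key after the last row already seen. All names here (`PagingSketch`, `page`, `nextStartRow`) are hypothetical, and this is an in-memory stand-in rather than the HBase client API. Note also that, as far as I know, a real PageFilter is applied per region server, so the client should still cap the rows it keeps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class PagingSketch {
    /** Reads up to pageSize row keys starting at startRow (inclusive), in key order. */
    static List<String> page(TreeMap<String, String> table, String startRow, int pageSize) {
        List<String> rows = new ArrayList<>();
        for (String key : table.tailMap(startRow, true).keySet()) {
            if (rows.size() == pageSize) {
                break;
            }
            rows.add(key);
        }
        return rows;
    }

    /** Smallest key strictly greater than lastRow: append the lowest possible character. */
    static String nextStartRow(String lastRow) {
        return lastRow + '\u0000';
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        for (int i = 0; i < 5; i++) {
            table.put(String.format("row_%03d", i), "v" + i);
        }
        // Walk the table page by page; no total count is ever computed
        String start = "";
        while (true) {
            List<String> rows = page(table, start, 2);
            if (rows.isEmpty()) {
                break;
            }
            System.out.println(rows);
            start = nextStartRow(rows.get(rows.size() - 1));
        }
    }
}
```

The caller only ever remembers the last row key of the previous page, which is why "next page" is cheap while "page N of M" is not.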
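About distributed stream processing

Say there are 10 machines that need to process 10 million rows at the same time. This is where checkAndPut comes in. It works like optimistic locking in MySQL.

The concrete steps are:

Fetch a batch of rows using PageFilter and SingleColumnValueFilter together with startRow
Then use checkAndPut to mark those rows as being processed
Finally use put to mark those rows as processed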
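The steps above boil down to a compare-and-set on a status column. Here is a rough sketch of that claim logic in plain Java, with `ConcurrentMap.replace(key, expected, update)` standing in for checkAndPut. The status values and the class `ClaimSketch` are assumptions made for illustration; against a real table the same check would be done by checkAndPut on the row.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class ClaimSketch {
    static final String NEW = "NEW";
    static final String PROCESSING = "PROCESSING";
    static final String DONE = "DONE";

    // rowKey -> status column; an in-memory stand-in for the HBase table
    private final ConcurrentMap<String, String> status = new ConcurrentHashMap<>();

    void add(String rowKey) {
        status.put(rowKey, NEW);
    }

    /** Atomically flips NEW to PROCESSING; returns false if another worker got there first. */
    boolean tryClaim(String rowKey) {
        return status.replace(rowKey, NEW, PROCESSING);
    }

    /** Unconditional write once the work is finished, like the final put. */
    void markDone(String rowKey) {
        status.put(rowKey, DONE);
    }

    String statusOf(String rowKey) {
        return status.get(rowKey);
    }

    public static void main(String[] args) {
        ClaimSketch table = new ClaimSketch();
        table.add("row_1");
        System.out.println(table.tryClaim("row_1")); // true: first worker wins the row
        System.out.println(table.tryClaim("row_1")); // false: second worker must skip it
        table.markDone("row_1");
        System.out.println(table.statusOf("row_1")); // DONE
    }
}
```

A worker that gets `false` simply moves on to the next row, so no row is processed twice and no lock is held.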

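HBase hands-on experience

Backward compatibility

During development you will inevitably need to add fields, and when that happens both code and data have to stay backward compatible.

For example, suppose we add a new column. Because the column is new, existing rows have no value for it, so reading it from HBase returns null. When writing HBase code, therefore, pay attention to the following:

Always null-check data read from HBase; if it is null, use a default value
Default values in the DO classes that represent metadata should preferably not be null; null should never exist in the code
Once again, I strongly recommend storing data as Strings, because human-readable data will solve a lot of problems for you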

// firstNonNull is Guava's Objects#firstNonNull: if the first argument is null, it returns the second argument
Integer.parseInt(new String(firstNonNull(result.getValue(DEFAULT_COLUMN_FAMILIES, COLUMN), new byte[]{'0'})));

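RowKey design

1. Prefer String

Unless there is a special requirement, RowKeys should all be Strings.

Convenient for querying data and tracking down errors with the Shell in production
Easier to distribute data evenly
No need to worry about storage cost

2. Keep the RowKey as short as possible

If the RowKey is too long, first, storage overhead goes up and storage efficiency suffers; second, long RowKey fields in memory lower memory utilization, which lowers the index hit rate.

Common practice:

Represent time as a Long
Use encoding and compression where possible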

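3. Spread RowKeys out as much as possible

The most important thing in RowKey design is to keep keys spread out, so that the data does not all sit in one region and read/write load does not pile up on a few regions.

Suppose we need to store all of a user's weibo posts (ignoring reverse-time ordering for now). The RowKey design would be UserId_WeiboId, but with this design the UserId values are likely to be unevenly distributed, because RowKeys are sorted as strings.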

There are two ways to solve this:

Reverse the UserId
Hash or Mod the UserId: take its MD5 and prepend the first 6 characters to the UserId as a prefix
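Both options are easy to sketch with the JDK alone. The example below assumes that "first 6 characters" means the first 6 hex characters of the MD5 digest, and that an underscore separates the prefix from the UserId; both are my assumptions, as are the names `RowKeySalting`, `reversed` and `md5Prefixed`.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class RowKeySalting {
    /** Option 1: reverse the UserId so the fast-changing low digits come first. */
    static String reversed(String userId) {
        return new StringBuilder(userId).reverse().toString();
    }

    /** Option 2: prepend the first 6 hex characters of MD5(userId). */
    static String md5Prefixed(String userId) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(userId.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 3; i++) { // 3 bytes = 6 hex characters
                hex.append(String.format("%02x", digest[i] & 0xff));
            }
            return hex + "_" + userId; // "_" separator is an assumption
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String userId = "10000123";
        String weiboId = "42";
        System.out.println(reversed(userId) + "_" + weiboId);
        System.out.println(md5Prefixed(userId) + "_" + weiboId);
    }
}
```

Because the prefix is derived from the UserId itself, a reader who knows the UserId can always recompute the full RowKey; nothing extra has to be stored.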

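4. RowKey ordering

Suppose many users are posting weibo, and we now want to open a "square" in which all posts are shown in reverse chronological order. We then have to build an index table for the original UserId_WeiboId table, and the RowKey of that index table must be tied to time.

For the RowKey design, the long value of (current time - post time) can be used as the RowKey prefix.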
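Here is one way such a time-based prefix might look. The text above says "current time - post time"; since a stored key cannot change as the clock moves on, this sketch swaps in a fixed reference, `Long.MAX_VALUE`, in place of "current time". That substitution is my assumption, not something the original states. The value is zero-padded to 19 digits so that String ordering agrees with numeric ordering. The class and method names are hypothetical.

```java
final class ReverseTimeKey {
    /** Newer posts get smaller values, so they sort first in an ascending scan. */
    static String prefix(long postTimeMillis) {
        // Long.MAX_VALUE has 19 digits; pad so that string order equals numeric order
        return String.format("%019d", Long.MAX_VALUE - postTimeMillis);
    }

    /** Index table RowKey: time prefix first, then the original UserId_WeiboId key. */
    static String indexRowKey(long postTimeMillis, String userId, String weiboId) {
        return prefix(postTimeMillis) + "_" + userId + "_" + weiboId;
    }

    public static void main(String[] args) {
        String older = indexRowKey(1_393_689_600_000L, "u1", "w1");
        String newer = indexRowKey(1_393_689_660_000L, "u2", "w7");
        // A negative result means the newer post sorts before the older one
        System.out.println(newer.compareTo(older));
    }
}
```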

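Spreading out these RowKeys

If the data can be purged periodically

If the data does not need to be kept forever, then even if all of it lands in one region, lookups by time specify a startRow and the RowKeys are stored contiguously, so it is still very fast. It is best to discuss the data volume with your DBA, of course.

If all the data has to be kept

Use DayOfMonth as a prefix, so the RowKey becomes
DayOfMonth_(current time - post time)

This does make the code somewhat more awkward to implement, though.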

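5. About transactions

At present, HBase Put and Delete operations are transactional (atomic within a single row). If you want to issue a series of operations across several tables and have the whole series be transactional, there is no good way to do that yet. So when you use HBase, be prepared to deal with data going wrong.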
