I am able to connect and access an existing HBase table with Hive (using Hive HBase Storage Handler).
I think the interface is not much powerful. Can this interface be used for large analytical data processing?
I am able to connect and access an existing HBase table with Hive (using Hive HBase Storage Handler).
I think the interface is not much powerful. Can this interface be used for large analytical data processing?
No It can't. Any WHERE clause ends up as a full SCAN in HBase table and scans are extremely slow. Please check https://phoenix.apache.org/ as an alternative.
Apache Phoenix is more applicable for querying HBase. you can also query HBase using Hive, then your query will get converted in Map Reduce Job which will take more time then Phoenix.
PS : You can use Hive for Big Data analytics even if you are using Hbase.
A good solution to do Analytic queries over HBase faster is combine HBase with Hive and Impala.
As an example of this would be the following scenario:
I have a Kafka producer receiving thousands of signals from IoT devices from a socket in json format. I am processing this signals with a consumer in Spark streaming and putting these signals in a HBase table.
HBase table and data example
$ hbase shell
hbase> create_namespace 'device_iot'
hbase> create 'device_iot:device', 'data'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:deviceID', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23c'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:temperature', '12'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:latitude', '52.14691120000001'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:longitude', '11.658838699999933'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:time', '2019-08-14T23:30:30000'
Hive table on top of HBase table
CREATE EXTERNAL TABLE t_iot_devices (
id string, deviceID string, temperature int, latitude double, longitude double, time string) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:deviceID,data:temperature,data:latitude,data:longitude,data:time") 
TBLPROPERTIES("hbase.table.name" = "device_iot:device");
Querying in Impala
impala> invalidate metadata;
SELECT deviceID, max(temperature) AS maxTemperature 
FROM t_iot_devices 
GROUP BY deviceID;
+--------------------------------------+----------------+
| deviceid                             | maxtemperature |
+--------------------------------------+----------------+
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23b | 39             |
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23a | 39             |
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23c | 39             |
+--------------------------------------+----------------+
SELECT deviceID, substr(time,1,10) AS day, max(temperature) AS highest 
FROM t_iot_devices 
WHERE  substr(time,1,10) = '2019-07-07' 
GROUP BY deviceID, substr(time,1,10);
+--------------------------------------+------------+---------+
| deviceid                             | day        | highest |
+--------------------------------------+------------+---------+
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23c | 2019-07-07 | 34      |
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23b | 2019-07-07 | 35      |
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23a | 2019-07-07 | 22      |
+--------------------------------------+------------+---------+