4

What are the functional reasons why Hadoop cannot be a data warehouse?

On several sites one can see statements that a Hadoop cluster is not a replacement for a traditional data warehouse. However, I can't find the real reasons why.

I am aware that technically there are some things that are not available/mature in Hadoop, but I am really looking for the functional impact.


What I found so far, including mitigations

I found some arguments, but none so critical that I would advise against using Hadoop as DWH. Here is a selection:

  1. You can't do quick ad hoc queries or reporting, as Hadoop tends to incur overhead for map and reduce jobs.

However, in the situation that I am looking at this should not be a problem, as data is only made available via the (regular) datamart. Also, you could use Spark SQL if you wanted to dig into some tables.
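As a rough illustration, such an ad hoc look at a table can be a few lines of PySpark; the table and column names below are made up:

```python
# Minimal sketch of an ad hoc Spark SQL query against an existing Hive table.
# Table and column names ("sales", "customer_id", "amount") are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("adhoc-exploration")
    .enableHiveSupport()   # expose the Hive metastore tables to spark.sql()
    .getOrCreate()
)

top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM sales
    GROUP BY customer_id
    ORDER BY total_amount DESC
    LIMIT 10
""")
top_customers.show()
```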

  2. You can't get certain results, as Hadoop does not support stored procedures.

In the situation that I am looking at there are not many stored procedures (fortunately!), and with tools like R or Python you can get any result that you need.
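For example, an aggregation that might otherwise live in a stored procedure can be written as a short PySpark job; the source table, columns and target table below are assumptions for illustration only:

```python
# Hypothetical replacement for a stored procedure that builds a monthly summary.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("monthly-summary")
    .enableHiveSupport()
    .getOrCreate()
)

orders = spark.table("orders")  # assumed source table in the Hive metastore

monthly = (
    orders
    .withColumn("month", F.date_trunc("month", F.col("order_date")))
    .groupBy("month", "region")
    .agg(
        F.sum("amount").alias("revenue"),
        F.countDistinct("customer_id").alias("customers"),
    )
)

# Write the result where the datamart expects it
monthly.write.mode("overwrite").saveAsTable("datamart.monthly_revenue")
```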

  3. You can't recover from disasters, as Hadoop does not have integrated backups.

However, as all code is scripted and data can be offloaded to a backup, it should be possible to recover from disasters.
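A scripted offload can be as simple as a scheduled DistCp copy to a second cluster or a cloud bucket; this is only a sketch, and the URIs are placeholders:

```python
# Minimal sketch of a scripted backup step using DistCp.
# SOURCE_URI and BACKUP_URI are placeholders for your own clusters/buckets.
import subprocess

SOURCE_URI = "hdfs://prod-cluster/warehouse/sales"
BACKUP_URI = "hdfs://backup-cluster/backups/sales"

# DistCp copies files in parallel between clusters; -update only copies
# files that are missing or changed on the target.
subprocess.run(
    ["hadoop", "distcp", "-update", SOURCE_URI, BACKUP_URI],
    check=True,
)
```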

  4. You can't meet compliance and privacy requirements, as there is no security or data lineage.

With a toolkit like Knox + Ranger + Atlas this can be achieved.

  5. It's not easy to build queries, as you can't build the flow visually but need to write SQL or Pig code.

There appear to be several tools, like Talend, where you can build flows with icons as in typical query builders.

  6. Hadoop is harder to maintain, as it requires specific knowledge.

True, but in the situation that I am looking at there is a fair amount of knowledge, as they currently use a Hadoop analytics platform.

Dennis Jaheruddin

4 Answers

3

It's true that with Hadoop and some tricks you can do the same things a DWH is able to do.

However, it doesn't make sense to reinvent the wheel so that Hadoop does the same things as a data warehouse, inefficiently. Many say Hadoop is cheaper than a data warehouse in terms of hardware and software: that is true, and the difference is big, but we have to consider the time spent implementing such a system, the know-how and skills required, the maintenance of the cluster, the upgrades of the services, and the risk of using immature tools or tools that may be abandoned in the future.

The real aspects to consider when selecting between Hadoop and a data warehouse are:

  • Type of workloads (read vs write, tactical vs report, etc.)
  • Type of data (structured or unstructured)
  • Data Integration (schema-on-read vs schema-on-write)
  • Query SLAs (execution time, concurrency, etc.)
  • Skills required (amount of resources and know-how required for the implementation)
  • SQL compliance (integration with tools)
  • Optimization (workload management, indexes, hash maps, etc.)
  • Maturity (security, bugs, etc.)
  • Type of analysis (SQL or non-SQL)

A hybrid architecture combining both fits many use cases best. I can save data warehouse resources (CPU, storage) by offloading historical data and the ETL processing to Hadoop, I can analyse unstructured data, and at the same time I can get higher performance, data integration and high concurrency when querying the "hot" data stored in the data warehouse.

Answer to the comment:

It depends on what you want to do with Hadoop: you can load the data warehouse directly, or you can put raw data on Hadoop, do the ETL there, and then load the warehouse from it.

There are a lot of use cases related to the integration of Hadoop with a data warehouse, for example:

  • Data lake: all raw data is stored on Hadoop. This gives you a place where you can capture, refine and explore the original raw data and metadata, and perhaps run aggregations or ETL to fill a data model in the data warehouse.
  • Historicization: you can develop scripts to offload cold data to Hadoop (e.g. last year's transactions in the DWH and older transactions on Hadoop). You can access both through a query federator (e.g. Presto), which lets you join data residing on different platforms (i.e. do a UNION ALL between the historical part of a table on Hadoop and the recent part in the data warehouse), as sketched below.
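A federated query along those lines might look like the following sketch; it assumes the presto-python-client package, and the host, catalog, schema and table names are made up:

```python
# Sketch of a federated Presto query combining cold data on Hadoop (Hive catalog)
# with hot data in the data warehouse (a second catalog). Host, catalogs and
# table names are assumptions for illustration.
import prestodb  # from the presto-python-client package

conn = prestodb.dbapi.connect(
    host="presto.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# UNION ALL between the historical part of the table on Hadoop
# and the recent part stored in the data warehouse.
cur.execute("""
    SELECT tx_id, tx_date, amount FROM hive.history.transactions
    UNION ALL
    SELECT tx_id, tx_date, amount FROM dwh.core.transactions
""")
rows = cur.fetchall()
print(len(rows), "rows from both platforms")
```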

If you want to use Hadoop as a data lake, the data flow is: source -> HDFS (cleansing) -> data warehouse
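A rough PySpark sketch of that flow, where the HDFS paths, JDBC URL, credentials and table names are placeholders:

```python
# Minimal sketch of the source -> HDFS (cleansing) -> data warehouse flow.
# HDFS paths, JDBC URL, credentials and table names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-to-dwh").getOrCreate()

# 1. Read raw files that were landed on HDFS
raw = spark.read.option("header", True).csv("hdfs:///datalake/raw/orders/")

# 2. Cleanse: drop incomplete rows and normalise types
clean = (
    raw.dropna(subset=["order_id", "amount"])
       .withColumn("amount", F.col("amount").cast("double"))
)

# 3. Load the cleansed data into the warehouse over JDBC
clean.write.jdbc(
    url="jdbc:postgresql://dwh.example.com:5432/warehouse",
    table="staging.orders",
    mode="append",
    properties={"user": "etl_user", "password": "etl_password"},
)
```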

If you use Hadoop only for historicization: source -> data warehouse -> HDFS

Query federators like Presto open up a lot of use cases and make it possible to use data from different systems in the same query. This unlocks the option of keeping cold data on Hadoop and hot data in the data warehouse, OR keeping the "core" data in the data warehouse and the rest on Hadoop.

3

A Hadoop cluster is by no means a replacement for a traditional data warehouse. Bare Hadoop only does two things:

  1. Distributed storage and resources
  2. MapReduce

On top of Hadoop is built an entire ecosystem of software packages, most notably Pig, Hive, HBase, Phoenix, Spark, ZooKeeper, Cloudera Impala, Flume, Sqoop, Oozie, Storm.

Today, you get to choose what you want from a plethora of products.

Want to use SQL? Have a look at these data virtualization servers: Cirro Data Hub, Cisco/Composite Information Server, Denodo Platform, Informatica Data Services, Red Hat JBoss Data Virtualization and Stone Bond Enterprise Enabler Virtuoso.

Want the product to store data in its own native SQL database or in Hadoop? Examples are EMC/Greenplum UAP, HP Vertica (on MapR), Microsoft PolyBase, Actian ParAccel and Teradata Aster Database (via SQL-H).

Add to these:

  • Apache Hive - the original SQL-on-Hadoop
  • Stinger of Hortonworks
  • Apache Drill - open implementation of Google's Dremel (aka BigQuery)
  • Spark SQL - real-time, in-memory, parallelized processing
  • Apache Phoenix - the "SQL skin for HBase"
  • Cloudera Impala - another implementation of Dremel/Apache Drill
  • HAWQ for Pivotal HD - parallel SQL processing and high compliance with SQL standards on the Pivotal own Hadoop distribution
  • Presto - Built by Facebook's engineers and used internally
  • Oracle Big Data SQL - only integrates with Oracle Database 12c
  • IBM BigSQL - tied to IBM's Hadoop and InfoSphere BigInsights

Conclusion: Whatever your data warehouse requirements are, you can find some product on Hadoop, or a combination of products, that does what you want.

The downside: Finding your ideal product(s), learning how to drive them and what their shortcomings are, developing your distributed database application, reporting bugs and pushing for improvements - all this will take a horrendous amount of time on your part. You are looking for the functional impact - so look at the impact on you and your time, especially if you do not have a Hadoop specialist on your team.

Final conclusion: Hadoop is not a data warehouse, but the applications built on it are, and every possible flavor is catered for. But good luck navigating that jungle. If your needs are modest enough, I would suggest creating your own application built upon MapReduce, or going for a more classical solution using the tools that you know. Know also that MapReduce is not a good match for all problems.
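For a sense of what "your own application built upon MapReduce" can look like at its simplest, here is a minimal Hadoop Streaming word count sketch in Python; the file name and HDFS paths are illustrative:

```python
# wc.py - minimal word count for Hadoop Streaming (file name and paths illustrative).
# Run roughly as:
#   hadoop jar hadoop-streaming.jar \
#     -files wc.py \
#     -mapper "python wc.py map" -reducer "python wc.py reduce" \
#     -input /data/text -output /data/wordcount
import sys

def mapper():
    # Emit "word<TAB>1" for every word read from stdin
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Streaming sorts by key, so all counts for a word arrive contiguously
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```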

harrymc
1

Hadoop is one of several options for the situations you listed. It sounds like you are looking for a single system/federator/datapipe from which you can run ad hoc queries over multiple data sources. Other options for the Hadoop functions are Spark, Pentaho, Apache Pig and Hortonworks.

But instead of looking at this tool first, look at your data and analysis needs.

  1. You have multiple data sources
  2. You want to run ad-hoc queries
  3. You need to manage those multiple data sources in terms of being accessible and "queryable" to your analysts/end users. And you (thinking in IT terms here) need to be able to do this management without it becoming a second job.
  4. I'm assuming you will add more data sources as time goes on.
  5. I'm assuming your data sources will grow and the potential exists for queries on larger data sets.
  6. You want disaster recovery and security/compliance.
  7. You'd like the option to use a variety of query methods, including stored procedures.

Looking at that first, determine what tools meet those needs. There are iPaaS (Integration Platform as a Service -- essentially data integration in the cloud) vendors such as Mulesoft and SnapLogic. You have Hadoop and its cousins; I say cousins because in that space the products tend to have enough differences that I couldn't lump them together the way I would SQL databases. You have data lakes, which use raw data and thus ease the need for heavy transformation work. And you have data stream processing, which handles multiple data streams and filters data rather than tossing it.

Look at your business needs (including budget and resources), compare that to what's available and then determine the best tool for your company.

-1

Hadoop is a framework and a data warehouse is software... Confused? A data warehouse just coordinates between the data and you: it simply deals with storing data and maintaining its lifecycle. Hadoop, in addition to coordinating between the data and you, performs simple or complex operations on the data if you ask it to.

The reason Hadoop isn't a better fit for data warehousing is that there are several other tools that accomplish the same task more efficiently than Hadoop.