What are the functional reasons why Hadoop cannot be a data warehouse?
On several sites one can find statements that a Hadoop cluster is not a replacement for a traditional data warehouse. However, I can't find the real reasons why.
I am aware that technically there are some things that are not available or mature in Hadoop, but I am really looking for the functional impact.
What I found so far, including mitigations
I found some arguments, but none so critical that I would advise against using Hadoop as a DWH. Here is a selection:
- You can't do quick ad hoc queries or reporting, as Hadoop incurs overhead from launching MapReduce jobs.
However, in the situation that I am looking at, this should not be a problem, as data is only made available via the (regular) datamart. Also, you could use Spark SQL if you wanted to dig into some tables (see the first sketch after this list).
- You can't compute certain results, as Hadoop does not support stored procedures.
In the situation that I am looking at there are not many stored procedures (fortunately!), and with tools like R or Python you can get any result that you need (see the second sketch after this list).
- You can't recover from disasters, as Hadoop does not have integrated backups.
However, as all code is scripted and data can be offloaded to a backup cluster, it should be possible to recover from disasters (see the third sketch after this list).
- You can't meet compliance and privacy requirements, as there is no security or data lineage.
With a stack like Knox + Ranger + Atlas this can be achieved (see the last sketch after this list).
- It's not easy to build queries, as you can't build flows visually but need to write SQL or Pig code.
There appear to be several tools, like Talend, where you can build flows graphically, as in typical query builders.
- Hadoop is harder to maintain, as it requires specific knowledge.
True, but in the situation that I am looking at there is a fair amount of in-house knowledge, as they currently use a Hadoop analytics platform.
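
First sketch, on ad hoc queries: a minimal PySpark example, assuming a table called `sales` is registered in the Hive metastore (the table and column names are invented for illustration).

```python
from pyspark.sql import SparkSession

# Hive support lets Spark SQL read tables registered in the Hive metastore.
spark = (SparkSession.builder
         .appName("adhoc-query")
         .enableHiveSupport()
         .getOrCreate())

# An ad hoc aggregation; this runs on Spark executors,
# so no MapReduce job is launched.
result = spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    WHERE year = 2017
    GROUP BY region
""")
result.show()
```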
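Second sketch, on stored procedures: the procedural logic can live in an ordinary Python function instead. This reuses the `spark` session from the first sketch; the `orders` table and the discount rule are invented for illustration.

```python
from pyspark.sql import functions as F

def apply_discount(spark, year, threshold=1000.0):
    """Roughly what a small stored procedure might do:
    read, apply a parameterised rule, write the result back."""
    orders = spark.table("orders").where(F.col("year") == year)
    discounted = orders.withColumn(
        "final_amount",
        F.when(F.col("amount") > threshold, F.col("amount") * 0.9)
         .otherwise(F.col("amount")))
    # Persist the outcome as a new managed table.
    discounted.write.mode("overwrite").saveAsTable("orders_discounted")

apply_discount(spark, year=2017)
```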
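Third sketch, on backups: HDFS ships with the `distcp` tool for bulk copies between clusters, so offloading data can be a scheduled one-liner. The source and destination paths here are placeholders.

```python
import subprocess

def backup_hdfs(src="hdfs://prod/warehouse", dst="hdfs://backup/warehouse"):
    """Offload warehouse data to a second cluster with the stock distcp tool.
    -update only copies files that changed since the last run."""
    subprocess.run(["hadoop", "distcp", "-update", src, dst], check=True)

backup_hdfs()
```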
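Last sketch, on security: Ranger exposes a public REST API for managing authorization policies, so access rules can be scripted as well. This is a rough sketch, assuming a Ranger admin on the default port and a Hive service named `hive_service`; the URL, credentials, and names are all placeholders.

```python
import requests

# A Hive policy granting SELECT on one database to an analyst group.
policy = {
    "service": "hive_service",
    "name": "analysts-read-datamart",
    "resources": {
        "database": {"values": ["datamart"]},
        "table":    {"values": ["*"]},
        "column":   {"values": ["*"]},
    },
    "policyItems": [{
        "groups": ["analysts"],
        "accesses": [{"type": "select", "isAllowed": True}],
    }],
}

resp = requests.post(
    "http://ranger-host:6080/service/public/v2/api/policy",
    json=policy,
    auth=("admin", "admin"))  # placeholder credentials
resp.raise_for_status()
```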