Data Lakes: Hadoop Vs. In-Memory Databases
I am getting asked with increasing frequency, “What do you think is better for a data lake: Hadoop or an in-memory database?” Given the amount of FUD and novelty in data analytics, it’s a simple question but one with some potentially puzzling answers. It’s like asking, “How should I get from here to Chicago?” and being told you can drive or have your car lifted into the air by a helicopter and electromagnet à la James Bond in You Only Live Twice.
Figure 1 – Sean Connery is about to take a car trip the hard way – in “You Only Live Twice” – a scene that brings to mind the comparison between Hadoop and in-memory databases.
What’s the best way to get to your analytics destination? The term “best” may mean different things to different people. For example, is “best” the fastest? Is “best” the most efficient? Is “best” the most cost-effective? Is “best” the most novel and creative? Using these different perspectives, the best path may not be the most obvious, especially if you’re trying to make sense of the current deluge of acronyms, frameworks and marketectures. Hadoop and in-memory databases are different technologies, but they overlap. They’re not the same, but they are compatible. It doesn’t have to be an either/or conversation.
Let’s briefly compare Hadoop and in-memory databases. Hadoop is an open source framework for big data analytics. It uses distributed/grid computing to enable applications to analyze large data sets. Hadoop, which has essentially become synonymous with the idea of Big Data, allows an organization to ingest large, highly diverse sources of data and analyze them in ways that are faster and more efficient than is possible with traditional, relational database systems. It’s designed for “petabyte scale” analysis of structured, semi-structured and unstructured data. It achieves this scale by breaking large workloads into smaller pieces and then distributing them across a cluster of commodity x86 hardware.
Hadoop does not do the analytics by itself. At its core, the Hadoop framework provides the Hadoop Distributed File System (HDFS) and MapReduce. Most importantly, this framework supports a wide variety of tools (projects) which extend Hadoop’s massively parallel capabilities. Software such as Flume and Sqoop may be used to load data. Hive provides SQL-like queries, while HBase offers low-latency NoSQL table access. Kafka, Spark or Flink are used to ingest data or perform streaming analytics. A host of other tools may be employed to manage, maintain and secure the Hadoop cluster.
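The MapReduce pattern that underpins classic Hadoop jobs can be sketched in plain Python. This is a toy, single-process illustration of the map, shuffle and reduce phases; in a real cluster, Hadoop distributes each phase across many commodity nodes:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs, as a Hadoop mapper would per input split."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, mimicking Hadoop's sort/shuffle stage."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big cluster", "big data lake"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])  # 3
```

The point of the pattern is that the map and reduce steps are independent per key, which is exactly what lets Hadoop split “petabyte scale” workloads into small pieces and spread them across a cluster.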
While tools such as Spark are great at in-memory analytics on streaming data via mini-batches, Spark is not itself a database. Therefore, tools like HBase or Cloudera Impala (with HDFS in-memory caching) may be used on the Hadoop cluster. Other tools like SAP HANA, SSAS, MemSQL or VoltDB may be used as an in-memory database to store data for historical analysis.
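Spark’s mini-batch model amounts to chopping a continuous stream into small batches and updating running state once per batch. A single-process Python sketch of that idea follows; the event stream is a hypothetical list of sensor readings, and real Spark Structured Streaming handles the distribution, scheduling and fault tolerance for you:

```python
def micro_batches(stream, batch_size):
    """Chop an (in principle unbounded) event stream into small batches."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

# Running aggregate updated once per mini-batch, Spark-style.
events = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical sensor readings
running_total = 0
for batch in micro_batches(events, batch_size=3):
    running_total += sum(batch)  # incremental update, not a full recompute

print(running_total)  # 31
```

Keeping only a small running aggregate in memory, rather than the whole stream, is what makes this approach fast without requiring a database underneath.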
An in-memory database is a database designed to run completely in random access memory (RAM). SAP HANA is an example of an industry-leading, enterprise-ready in-memory database. With advances in memory technology and a drop in memory costs, it is now possible to hold data sets in RAM that would have been hard to imagine a few years ago. An in-memory database can easily hold multiple terabytes of information in active memory. The advantage of the in-memory approach is speed. Unlike a traditional database, which has to pull data off a disk before processing it, an in-memory database can access data many times faster than is possible with a spinning disk.
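To make the idea concrete, SQLite (which ships with Python) can run entirely in RAM via the `:memory:` connection string. This is only a toy stand-in for an enterprise in-memory database like HANA, but the principle, keeping the working set in memory instead of on disk, is the same:

```python
import sqlite3

# ":memory:" tells SQLite to keep the whole database in RAM -- no disk I/O.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.5), ("a", 2.5), ("b", 4.0)],
)

# Aggregations are served straight from memory.
total = conn.execute(
    "SELECT SUM(value) FROM readings WHERE sensor = 'a'"
).fetchone()[0]
print(total)  # 4.0
conn.close()
```

The trade-off, of course, is the one the rest of this article explores: RAM is fast but finite, and the data vanishes when the process does unless you persist it elsewhere.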
So far, so good. Hadoop and in-memory databases are different types of technology. One is a software framework. The other is a database designed for specific kinds of hardware. Why do we even have a “which is better?” debate? This is where the industry needs to shoulder some responsibility. There are many different applications that can run on Hadoop and keep data in-memory. As vendors and open source communities scramble to achieve dominance in the emerging field of big data, jargon and opinions abound.
It is worth pointing out that you can actually have Hadoop and in-memory databases at the same time. An in-memory database can be part of an extended Hadoop ecosystem. You can even run Hadoop in-memory. Each has its place. When is it best to use one, the other, or both? Well… it depends.
The answer revolves around speed, space and cost. In-memory databases are blazingly fast, but they are limited in what they can store. While in-memory databases in the 1TB range are common, multi-core systems can scale up to 3TB, and multi-node HANA systems can scale to 100 nodes, 4,000 computing cores and 100 TB of DRAM.
When sizing in-memory databases, it should also be noted that your raw data is compressed significantly and, in the real world, some types of data are more compressible than others. So, while compression ratios can vary widely based upon data types, cardinality and distribution, the most common compression ratios I have seen fall in the 5x to 10x range.
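The sizing arithmetic itself is simple: divide the raw data volume by the expected compression ratio to estimate the RAM footprint. A small helper makes it explicit, using the 5x to 10x range mentioned above as example inputs:

```python
def ram_footprint_tb(raw_tb, compression_ratio):
    """Estimate the in-memory footprint for raw data at a given compression ratio."""
    if compression_ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return raw_tb / compression_ratio

# 20 TB of raw data at the conservative (5x) and optimistic (10x) ends:
print(ram_footprint_tb(20, 5))   # 4.0 TB of RAM
print(ram_footprint_tb(20, 10))  # 2.0 TB of RAM
```

Because the realized ratio depends so heavily on your actual data, it is worth loading a representative sample and measuring before committing to a hardware sizing.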
Conversely, Hadoop can handle petabytes of information. We aren’t getting petabytes into solid-state memory this year. If you’re handling a really big data set, you won’t, practically, be able to do the whole thing in-memory. But you can put part of it in-memory. There are several hybrid Hadoop architectures which combine disk and in-memory elements where rapid processing is needed.
Cost will be your other factor in making the decision. Solid-state memory is more expensive than equivalent spinning hard disk drives (HDDs). Running an in-memory database will cost more, on a byte-by-byte basis, than using the commodity disk drives and servers that Hadoop is famous for running on. You have to figure out your business case. If you don’t need the speed, you may not want to invest in the in-memory option. If you do need the speed, as is often the case with real-time decision making on big data, the in-memory database is probably your best choice. But, as time goes on, it appears that a hybrid approach combining streaming data and an in-memory database with a Hadoop cluster may provide excellent performance at a reasonable cost. As you may expect, the final answer is “it depends,” and each organization’s requirements will vary.
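A back-of-the-envelope model can help frame that business case. The unit prices below are purely illustrative placeholders, not vendor quotes; the structure of the calculation is the point, so plug in your own numbers:

```python
def storage_cost(capacity_tb, cost_per_tb):
    """Total media cost for a given capacity; unit prices are hypothetical inputs."""
    return capacity_tb * cost_per_tb

# Illustrative, made-up unit prices -- DRAM is assumed to be roughly two
# orders of magnitude pricier per TB than commodity spinning disk.
hdd_cost = storage_cost(100, 30)    # 100 TB of bulk history on HDD at $30/TB
ram_cost = storage_cost(10, 3000)   # 10 TB hot working set in DRAM at $3000/TB

# A hybrid lake: bulk history on disk, hot working set in memory.
hybrid_cost = hdd_cost + ram_cost
print(hybrid_cost)  # 33000
```

Even with made-up prices, the shape of the result holds: keeping only the hot working set in memory and the bulk history on disk is what makes the hybrid approach economical.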
If you are interested in learning more about Hadoop vs. In-Memory Databases, contact us for a 1:1 consultation or schedule your complimentary Workshop with our big data experts.
Learn more about how CenturyLink Big Data as a Service can help manage your organization’s data.