Data Lakes and Master Data Management
Master Data Management (MDM) is the dental floss of the IT world. It’s the responsible, grown-up thing that you know is good for you even if it hurts sometimes. MDM refers to a combination of processes and technologies that work to ensure that your data is accurate and authoritative. Though it’s a big, complicated subject, MDM can be summed up in a simple example: if your customer database has entries for both John Smith and John Smyth, each with the email jsmith21@gmail, an MDM solution will spot the duplication and enable the database to create a single, authoritative record (or “entity”) for Mr. Smith.
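To make the example concrete, here is a minimal sketch of the kind of duplicate detection an MDM engine performs. The matching rule (two records sharing an email address) is a deliberate simplification; real MDM engines use fuzzy matching, survivorship rules, and much more.

```python
# Minimal sketch: collapse customer records that share an email address
# into a single authoritative entity. The matching rule is an assumption
# for illustration only.

def build_master_records(records):
    """Group customer records by email and merge each group into one entity."""
    entities = {}
    for record in records:
        key = record["email"].lower()
        entity = entities.setdefault(key, {"names": set(), "email": key})
        entity["names"].add(record["name"])
    return entities

customers = [
    {"name": "John Smith", "email": "jsmith21@gmail"},
    {"name": "John Smyth", "email": "jsmith21@gmail"},
]

master = build_master_records(customers)
print(len(master))  # one authoritative entity for Mr. Smith
```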
Lately, I’ve been getting asked, “Is MDM required in data lakes?” It’s a valid question, and the answer isn’t always super clear. You can’t ignore MDM and related metadata requirements in data lakes, but it’s not always mandatory. The difficulty is that MDM can be cumbersome to implement and maintain. It tends to erect (necessary) obstacles to smooth, agile data management. For those who work in big data and exciting, open-ended data lakes, MDM may seem like an optional hassle, something to be avoided at all costs.
Tonight on Weekend Update: Should MDM be applied to cloud-based data lakes?
To help resolve the issue, I am going to address a few underlying questions using the “Point/Counterpoint” approach best exemplified by Jane Curtin and Dan Aykroyd on Saturday Night Live’s “Weekend Update.” We won’t, however, be using the phrase, “Jane, you ignorant schema.”
Question 1: Do you actually need MDM for data lakes and “Big Data”?
- Point: No, you don’t. MDM is only for transactional data on systems of record. Data lakes may contain all sorts of log data and other unstructured information that is essentially outside the realm of MDM. Who cares if server log data is duplicative? It’s not going to affect the bottom line or SEC reporting requirements.
- Counterpoint: But what about data used to support decision making? What if you want to make a decision about IT security based on duplicate firewall log data in your data lake? You would be basing your decision on flawed information.
- Resolution: The best practice is to apply MDM selectively in data lakes. It’s most likely not necessary for every data set. However, there are many situations where, even if the data doesn’t seem like the typical MDM fodder (e.g., ERP or CRM data), you need to categorize it and match it against master records.
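The “apply MDM selectively” resolution can be sketched as a simple ingestion-time decision rule. The categories and the `supports_decisions` flag below are invented for illustration; the point is that classic master-data domains and decision-supporting data (like the firewall logs above) get MDM treatment, and the rest can skip it.

```python
# Hypothetical sketch of selective MDM: tag each incoming data set and
# decide whether it should be matched against master records.

MDM_CATEGORIES = {"customer", "product", "supplier"}  # classic master-data domains

def needs_mdm(dataset):
    # Data that supports decision making (e.g., firewall logs feeding a
    # security decision) also qualifies, even though it isn't classic MDM fodder.
    return dataset["category"] in MDM_CATEGORIES or dataset.get("supports_decisions", False)

datasets = [
    {"name": "web_server_logs", "category": "log", "supports_decisions": False},
    {"name": "firewall_logs", "category": "log", "supports_decisions": True},
    {"name": "crm_contacts", "category": "customer"},
]

to_master = [d["name"] for d in datasets if needs_mdm(d)]
print(to_master)  # ['firewall_logs', 'crm_contacts']
```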
Question 2: Doesn’t MDM force you into a “schema on write” situation, which slows us down?
- Point: Being able to do “schema on read” (see earlier post) is one of the main benefits of data lakes. Now, you’re telling me that I have to go through the whole schema creation process when I write the data into the lake, just so I can keep up with the MDM policies? We might as well not even bother!
- Counterpoint: You can’t just dump data into the lake without any concern for its relationship with the master data records. You run the risk of working with low quality, duplicative data. You could also possibly misinterpret the business meaning associated with the data.
- Resolution: It depends. Yes, it is highly desirable to do some of your data imports using schema on write, especially if you have users who are not data scientists and need to rely on the schema and metadata to understand the business meaning behind each data element. This is especially true for data elements which are derived based upon business rules and algorithms. However, there are certain data sets that won’t require a schema. For example, when performing text processing, natural-language text is open to interpretation. So, tying free text in a data lake to a schema may not be a good idea.
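The “it depends” split can be sketched in a few lines: enforce a schema at write time for structured, rule-derived records, and land free text schemaless for interpretation at read time. The schema and field names here are invented for illustration.

```python
# Sketch of applying schema on write only where it pays off.
# ORDER_SCHEMA and its fields are hypothetical examples.

ORDER_SCHEMA = {"order_id": int, "customer_id": str, "total": float}

def validate_on_write(record, schema):
    """Reject records that don't match the declared schema at ingest time."""
    for field, ftype in schema.items():
        if field not in record or not isinstance(record[field], ftype):
            raise ValueError(f"schema violation on field {field!r}")
    return record

# Structured, rule-derived data: enforce the schema so non-data-scientists
# can trust the business meaning of each element.
order = validate_on_write(
    {"order_id": 7, "customer_id": "C42", "total": 19.99}, ORDER_SCHEMA
)

# Free text lands schemaless; interpretation is deferred to read time.
raw_text_zone = ["Customer called about a billing issue..."]
```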
Question 3: Can you create master data entities from unstructured data like tweets?
- Point: The data is too variable. That’s the whole point of unstructured data. It doesn’t work within columns and rows.
- Counterpoint: OK, tweets don’t follow the rules of columnar databases, but they can still absolutely be assigned master data entities. For example, the @handle is a unique identifier. MDM in the data lake should append the @handle to the metadata of any tweet texts. That will ensure accuracy in analysis.
- Resolution: Certain types of unstructured data may, in fact, be impossible to match and cleanse with MDM processes. So, you can ignore those from an MDM perspective. However, the best practice is to assess the data at the ingestion stage and figure out if the analytics process will benefit from applying MDM.
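The counterpoint’s idea of appending the @handle to tweet metadata can be sketched with a simple extraction step at ingestion. The tweet content and metadata shape below are invented; the handle pattern follows Twitter’s rule of up to 15 word characters.

```python
import re

# Sketch: pull @handles out of raw tweet text and attach them to the
# tweet's metadata so downstream MDM matching has a unique identifier.

HANDLE_RE = re.compile(r"@([A-Za-z0-9_]{1,15})")  # Twitter handles: <=15 word chars

def enrich_tweet(tweet):
    """Append extracted @handles to the tweet's metadata at ingestion time."""
    tweet.setdefault("metadata", {})["handles"] = HANDLE_RE.findall(tweet["text"])
    return tweet

tweet = {"text": "Thanks @AcmeSupport, issue resolved!"}
print(enrich_tweet(tweet)["metadata"]["handles"])  # ['AcmeSupport']
```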
Question 4: Can MDM systems even work with cloud-based data lakes?
- Point: MDM systems were never meant to work with the kind of free-range data we have in our data lakes. MDM is primarily used for data warehouses and data marts. It will be a total nightmare to integrate an MDM system with your big data, if it’s even possible.
- Counterpoint: The truth is, you can use Hadoop for MDM. It takes a fair amount of specialized configuration, but it is definitely possible. For example, you can create an MDM “matching engine” using MapReduce and establish an MDM repository as an HDFS file. Some third-party MDM software solutions are also starting to become available from companies such as Informatica, SAP, SAS and others.
- Resolution: If it involves data, there will be a Hadoop solution for it. This applies to MDM. You don’t necessarily have to extend an existing enterprise MDM solution into a cloud-based data lake. However, you should look at the possibility of applying the same MDM standards to big data in the cloud as well as in your own data center.
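The “matching engine” idea from the counterpoint above can be illustrated locally: a map phase emits a blocking key per record, and a reduce phase groups candidate duplicates that share a key. A real implementation would be written against the Hadoop MapReduce API; this pure-Python sketch (with an invented blocking key) only shows the shape of the computation.

```python
from collections import defaultdict

# Local sketch of a MapReduce-style MDM matching engine. The blocking key
# (surname initial + email domain) is a simplified assumption.

def map_phase(records):
    """Emit (blocking_key, record) pairs, like a MapReduce mapper."""
    for rec in records:
        key = (rec["name"].split()[-1][0].lower(), rec["email"].split("@")[-1])
        yield key, rec

def reduce_phase(pairs):
    """Group records by key; multi-record groups are candidate duplicates."""
    groups = defaultdict(list)
    for key, rec in pairs:
        groups[key].append(rec)
    return {k: v for k, v in groups.items() if len(v) > 1}

records = [
    {"name": "John Smith", "email": "jsmith21@gmail.com"},
    {"name": "John Smyth", "email": "jsmith21@gmail.com"},
    {"name": "Ada Lovelace", "email": "ada@example.org"},
]
candidates = reduce_phase(map_phase(records))
print(len(candidates))  # 1 candidate group: the two Smith/Smyth records
```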
Question 5: How do you manage MDM processes when the data lake is not inside the enterprise?
- Point: Don’t even bother. A cloud-based data lake sits outside the regular enterprise data flows, and it may even involve data from outside organizations that aren’t under the enterprise’s control. How are you supposed to govern that?
- Counterpoint: It shouldn’t matter where you are hosting or where the data lake fits into the logical architecture of the enterprise. If you want to adapt your big data to MDM, you can do it. And, in some cases, the data lake may be part of the enterprise flow of data. For example, your data lake may feed into your data warehouse or multiple data marts. In that case, you definitely want to subject it to MDM.
- Resolution: You may actually find it easier to take advantage of the cloud architecture and make MDM simpler to deploy and manage. You’re already outside the box, so to speak.