
Data Lakes and Master Data Management

Posted by Lumen on April 12, 2016

Master Data Management (MDM) is the dental floss of the IT world. It’s the responsible, grown-up thing that you know is good for you even if it hurts sometimes. MDM refers to a combination of processes and technologies that work to ensure that your data is accurate and authoritative. Though it’s a big, complicated subject, MDM can be summed up in a simple example: if your customer database has entries for both John Smith and John Smyth, each with the email jsmith21@gmail, an MDM solution will spot the duplication and enable the database to create a single, authoritative record (or “entity”) for Mr. Smith.
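
To make the John Smith / John Smyth example concrete, here is a minimal Python sketch of one way such a match could be detected. Everything in it is an assumption for illustration: the record layout, the 0.8 similarity threshold, and the "lowest id wins" merge rule are hypothetical, and real MDM tools use far richer matching logic.

```python
from difflib import SequenceMatcher

# Hypothetical customer rows; ids and field names are illustrative only.
customers = [
    {"id": 1, "name": "John Smith", "email": "jsmith21@gmail"},
    {"id": 2, "name": "John Smyth", "email": "jsmith21@gmail"},
    {"id": 3, "name": "Ann Lee", "email": "alee@example.com"},
]

def name_similarity(a, b):
    """Return a 0..1 similarity score for two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_probable_duplicate(a, b, threshold=0.8):
    """Flag two rows as one person if emails match and names are close."""
    same_email = a["email"].strip().lower() == b["email"].strip().lower()
    return same_email and name_similarity(a["name"], b["name"]) >= threshold

def merge(a, b):
    """Assumed survivorship rule: the record with the lower id wins."""
    winner, other = sorted((a, b), key=lambda r: r["id"])
    return {
        "name": winner["name"],
        "email": winner["email"].strip().lower(),
        "merged_ids": [winner["id"], other["id"]],
        "name_variants": sorted({winner["name"], other["name"]}),
    }

entity = None
if is_probable_duplicate(customers[0], customers[1]):
    entity = merge(customers[0], customers[1])
```

Here `entity` ends up as a single record for Mr. Smith that remembers both source ids and both spellings, which is the "authoritative record" idea in miniature.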

Lately, I keep getting asked, “Is MDM required in data lakes?” It’s a valid question, and the answer isn’t always clear. You can’t ignore MDM and related metadata requirements in data lakes, but MDM isn’t always mandatory. The difficulty is that MDM can be cumbersome to implement and maintain. It tends to erect (necessary) obstacles to smooth, agile data management. For those who work in big data and exciting, open-ended data lakes, MDM may seem like an optional hassle, something to be avoided at all costs.

Tonight on Weekend Update: Should MDM be applied to cloud-based data lakes?

To help resolve the issue, I am going to address a few underlying questions using the “Point/Counterpoint” approach best exemplified by Jane Curtin and Dan Aykroyd on Saturday Night Live’s “Weekend Update.” We won’t, however, be using the phrase, “Jane, you ignorant schema.”

Question 1: Do you actually need MDM for data lakes and “Big Data”?

  • Point: No, you don’t. MDM is only for transactional data on systems of record. Data lakes may contain all sorts of log data and other unstructured information that is essentially outside the realm of MDM. Who cares if server log data is duplicative? It’s not going to affect the bottom line or SEC reporting requirements.
  • Counterpoint: But what about data used to support decision making?  What if you want to make a decision about IT security based on duplicate firewall log data in your data lake?  You would be basing your decision on flawed information.
  • Resolution: The best practice is to apply MDM selectively in data lakes. It’s most likely not necessary for every data set. However, there are many situations where, even if the data doesn’t look like typical MDM fodder (e.g., ERP, CRM, etc.), you need to categorize it and match it against master records.
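
The selective approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `MDM_POLICY` table, the data set names, and the log line format are all hypothetical. It shows how duplicate firewall log entries inflate a count that might feed a security decision, while a data set the policy ignores passes through untouched.

```python
# Hypothetical policy: which data sets in the lake get MDM-style cleansing.
MDM_POLICY = {"firewall_logs": True, "debug_traces": False}

def dedupe(lines):
    """Drop exact repeats while keeping the first occurrence in order."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique

def prepare(dataset_name, lines):
    """Cleanse only the data sets the policy marks as decision-relevant."""
    if MDM_POLICY.get(dataset_name, False):
        return dedupe(lines)
    return list(lines)

def count_denies(lines):
    """Count log lines that record a denied connection."""
    return sum(1 for line in lines if " DENY " in line)

# Made-up log lines; the second one is an accidental double ingestion.
firewall_logs = [
    "2016-04-12T10:00:01 DENY 203.0.113.7 -> 10.0.0.5:22",
    "2016-04-12T10:00:01 DENY 203.0.113.7 -> 10.0.0.5:22",
    "2016-04-12T10:00:09 DENY 198.51.100.4 -> 10.0.0.5:22",
]

raw_denies = count_denies(firewall_logs)
clean_denies = count_denies(prepare("firewall_logs", firewall_logs))
```

The raw count overstates the number of denied connections by one, which is exactly the kind of flawed input the counterpoint warns about.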

Question 2: Doesn’t MDM force you into a “schema on write” situation, which slows us down?

  • Point: Being able to do “schema on read” (see earlier post) is one of the main benefits of data lakes.  Now, you’re telling me that I have to go through the whole schema creation process when I write the data into the lake, just so I can keep up with the MDM policies?  We might as well not even bother!
  • Counterpoint: You can’t just dump data into the lake without any concern for its relationship with the master data records.  You run the risk of working with low quality, duplicative data. You could also possibly misinterpret the business meaning associated with the data.
  • Resolution: It depends. Yes, it is highly desirable to do some of your data imports using schema on write, especially if you have users who are not data scientists and need to rely on the schema and metadata to understand the business meaning behind each data element. This is especially true for data elements that are derived from business rules and algorithms. However, certain data sets won’t require a schema. For example, when performing text processing, English-language text may be subject to interpretation. So, tying text in a data lake to a schema may not be a good idea.
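
For readers less familiar with the two terms, the difference can be sketched roughly as follows. The `ORDER_SCHEMA` fields and both helper functions are hypothetical and not taken from any particular product; the point is only where the schema gets enforced.

```python
import json

# Hypothetical schema: field name -> required Python type.
ORDER_SCHEMA = {"order_id": int, "customer_id": str, "total": float}

def validate_on_write(record, schema):
    """Schema on write: reject a bad record before it lands in the lake."""
    for field, expected in schema.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], expected):
            raise ValueError(f"{field} should be {expected.__name__}")
    return json.dumps(record)

def read_with_schema(raw_line, schema):
    """Schema on read: store anything, coerce known fields when queried."""
    data = json.loads(raw_line)
    return {f: cast(data[f]) for f, cast in schema.items() if f in data}

# Schema on write: the record must conform up front.
stored = validate_on_write(
    {"order_id": 7, "customer_id": "C-42", "total": 19.5}, ORDER_SCHEMA
)

# Schema on read: a loosely typed line is accepted as-is and shaped later.
raw = '{"order_id": "8", "total": "5", "note": "free text"}'
view = read_with_schema(raw, ORDER_SCHEMA)
```

The write path costs more up front but gives non-specialist users fields they can trust. The read path accepts the missing `customer_id` and the extra `note` without complaint, leaving interpretation to whoever queries the data.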

Question 3: Can you create master data entities from unstructured data like tweets?

  • Point: The data is too variable. That’s the whole point of unstructured data. It doesn’t work within columns and rows.
  • Counterpoint: OK, tweets don’t follow the rules of columnar databases, but they can still absolutely be linked to master data entities. For example, the @handle is a unique identifier. MDM in the data lake should append the @handle to the metadata of any tweet text. That will ensure accuracy in analysis.
  • Resolution: Certain types of unstructured data may, in fact, be impossible to match and cleanse with MDM processes.  So, you can ignore those from an MDM perspective.  However, the best practice is to assess the data at the ingestion stage and figure out if the analytics process will benefit from applying MDM.
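
As a rough illustration of the @handle idea, the sketch below tags a tweet at ingestion time without altering its text. The `MASTER_BY_HANDLE` lookup, the entity ids, and the metadata field names are all hypothetical, and the handle pattern is a simplifying assumption.

```python
import re

# Hypothetical master records keyed by handle (lowercased, without "@").
MASTER_BY_HANDLE = {"acme_support": "ENT-001", "jsmith21": "ENT-002"}

# Simplifying assumption: a handle is 1 to 15 word characters after "@".
HANDLE_RE = re.compile(r"@(\w{1,15})")

def tag_tweet(author_handle, text):
    """Wrap raw tweet text with metadata linking it to master entities."""
    author = author_handle.lstrip("@").lower()
    mentioned = [h.lower() for h in HANDLE_RE.findall(text)]
    return {
        "text": text,  # the unstructured text is stored unchanged
        "metadata": {
            "author_handle": author,
            "author_entity": MASTER_BY_HANDLE.get(author),  # None if unmatched
            "mentioned_entities": sorted(
                {MASTER_BY_HANDLE[h] for h in mentioned if h in MASTER_BY_HANDLE}
            ),
        },
    }

tweet = tag_tweet("@JSmith21", "Thanks @Acme_Support for the quick fix!")
unknown = tag_tweet("@nobody", "Just some text with no mentions")
```

A tweet whose author has no master record simply gets `None` for `author_entity`, which matches the resolution above: some data cannot be matched and can be left alone.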

Question 4: Can MDM systems even work with cloud-based data lakes?

  • Point: MDM systems were never meant to work with the kind of free-range data we have in our data lakes. MDM is primarily used for data warehouses and data marts. It will be a total nightmare to integrate an MDM system with your big data, if it’s even possible.
  • Counterpoint: The truth is, you can use Hadoop for MDM. It takes a fair amount of specialized configuration, but it is definitely possible. For example, you can create an MDM “matching engine” using MapReduce and establish an MDM repository as an HDFS file. Some third-party MDM software solutions are also starting to become available from companies such as Informatica, SAP, SAS and others.
  • Resolution: If it involves data, there will be a Hadoop solution for it. This applies to MDM.  You don’t necessarily have to extend an existing enterprise MDM solution into a cloud-based data lake. However, you should look at the possibility of applying the same MDM standards to big data in the cloud as well as in your own data center.
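
The shape of a MapReduce-style matching engine can be sketched in plain Python, with no Hadoop involved. This is not a Hadoop job and does not use any Hadoop API: the three functions only mimic the map, shuffle, and reduce stages, and the record layout and the choice of email as the match key are assumptions for illustration.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical records arriving from two made-up source systems.
records = [
    {"source": "crm", "name": "John Smith", "email": "jsmith21@gmail"},
    {"source": "web", "name": "John Smyth", "email": "JSMITH21@gmail"},
    {"source": "crm", "name": "Ann Lee", "email": "alee@example.com"},
]

def map_phase(record):
    """Emit (match key, record) pairs; the key is the normalized email."""
    yield record["email"].strip().lower(), record

def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all records sharing a key into one master entry."""
    return {
        "match_key": key,
        "names": sorted({v["name"] for v in values}),
        "sources": sorted({v["source"] for v in values}),
    }

pairs = chain.from_iterable(map_phase(r) for r in records)
master = [reduce_phase(key, values) for key, values in shuffle(pairs).items()]
```

In a real cluster the map and reduce functions would run in parallel across nodes and the resulting `master` list would be written out as the repository file; the grouping logic itself stays much the same.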

Question 5: How do you manage MDM processes when the data lake is not inside the enterprise?

  • Point: Don’t even bother. How are you supposed to apply MDM to a cloud-based data lake that sits outside the regular enterprise data flows and may even involve data from outside organizations not under the enterprise’s control?
  • Counterpoint: It shouldn’t matter where you are hosting or where the data lake fits into the logical architecture of the enterprise.  If you want to adapt your big data to MDM, you can do it.  And, in some cases, the data lake may be part of the enterprise flow of data. For example, your data lake may feed into your data warehouse or multiple data marts. In that case, you definitely want to subject it to MDM.
  • Resolution: You may actually find that the cloud architecture makes MDM simpler to deploy and manage. You’re already outside the box, so to speak.
