As the Big Data buzz continues to grow, so does the number of use cases being considered, tested, and implemented. Jason Hiner’s recent article on ZDNET, “Big Data’s biggest problem: It’s too hard to get the data in,” echoes our biggest challenges in successfully delivering a Big Data solution: data governance and data quality.
Jason Hiner defines Big Data as “the marriage of structured data (your company’s proprietary information) with unstructured data (public sources such as social media streams and government feeds).” This junction of sources is the crux of the problem.
Jason’s definition of Big Data correctly implies that the task of master data management is multiplied across many sources. In a traditional data warehouse project, which typically addresses a single, well-understood domain, master data management is a surmountable challenge; spread across dozens of structured and unstructured sources, it becomes far harder.
Sullexis is no stranger to this problem. Our team has worked through a Big Data project that pulled together structured and unstructured data from 80 different sources. The biggest task when working with that many sources is identifying the same data elements across them and ensuring that these ‘keys’ were created under the same standards and governance. Even where we’ve ingested the ‘same’ semi-structured data set, if data governance is absent, it becomes very difficult to provide insights.
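To make that concrete, here is a minimal Python sketch of the key-reconciliation step, using hypothetical equipment tags; the field names and tag formats are illustrative, not drawn from an actual client project. Three source systems refer to the same pump but format its tag differently, and until the tags are normalized to a common standard the records cannot be joined.

```python
# Hypothetical sketch of reconciling equipment 'keys' across sources.
import re

def normalize_tag(raw_tag: str) -> str:
    """Collapse formatting differences so tags from different sources
    can be compared: uppercase, strip whitespace and separators."""
    return re.sub(r"[\s\-_.]", "", raw_tag).upper()

# The 'same' pump, keyed three different ways by three source systems.
source_a = {"P-101A": {"manufacturer": "Acme"}}
source_b = {"p_101a": {"install_date": "1998-04-12"}}
source_c = {"P 101 A": {"service": "crude transfer"}}

merged = {}
for source in (source_a, source_b, source_c):
    for raw_tag, attributes in source.items():
        merged.setdefault(normalize_tag(raw_tag), {}).update(attributes)

print(merged)  # one record per physical asset, keyed by 'P101A'
```

Without an agreed standard behind those keys, the normalization rules themselves become guesswork, which is exactly where governance earns its keep.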
Data cleansing and prepping can be more challenging than pulling value out of the data. In fact, one estimate suggests that 50 to 80 percent of a data scientist’s time is spent preparing data. Deciding what value to pull from structured and unstructured data is tedious enough, but most of the time on a Big Data project is spent after extraction, organizing the data so it can be analyzed at all.
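As a small illustration of where that preparation time goes, here is a sketch of a single attribute, a hypothetical install date, arriving in several formats, some of which cannot be recovered without human review:

```python
from datetime import datetime

# The same attribute as it might arrive from different source extracts.
RAW_INSTALL_DATES = ["04/12/1998", "1998-04-12", "12-Apr-1998", "", "unknown"]

def parse_install_date(value: str):
    """Try the formats observed in the extracts; return None when the
    value cannot be recovered automatically."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d", "%d-%b-%Y"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return None  # flagged for a human data preparer to resolve

cleaned = [parse_install_date(v) for v in RAW_INSTALL_DATES]
unresolved = sum(1 for d in cleaned if d is None)
print(f"{unresolved} of {len(cleaned)} values need manual review")  # 2 of 5
```

Multiply that by hundreds of attributes across dozens of sources, and the 50-to-80-percent estimate stops looking surprising.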
The ZDNET article suggests that there are three potential solutions to the issue of data cleansing:
- Big Data analytics software gets better
- The Data Preparer role becomes a bigger part of Big Data projects
- Artificial Intelligence (AI) is used to help cleanse data
Given the difficulties Sullexis typically encounters when building Master Equipment Lists (MELs) for asset-intensive operations, the second solution is the only credible option. Humans are needed to analyze unstructured data sources and label them with metadata. For example, the PDF scans that typically arrive en masse on brownfield projects, or as historical data, need to be properly identified and categorized. Analytics software will not solve this, and I believe it will be a long time before AI can help either.
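For illustration only, the output of that human review might be captured in a structure like the one below; the schema and field names are hypothetical, sketched to show the kind of metadata a person has to attach before a scanned document can feed an MEL.

```python
from dataclasses import dataclass, field

@dataclass
class DocumentMetadata:
    file_name: str
    document_type: str              # e.g. 'datasheet', 'P&ID', 'inspection report'
    equipment_tags: list = field(default_factory=list)  # normalized MEL keys found in the document
    reviewed_by: str = ""           # the person who categorized it
    review_confidence: str = "low"  # how sure the reviewer is about the labels

record = DocumentMetadata(
    file_name="unit4_scan_0117.pdf",
    document_type="datasheet",
    equipment_tags=["P101A"],
    reviewed_by="j.doe",
    review_confidence="high",
)
print(record)
```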
In addition, implementing stricter management and governance over the operations that create data reduces the time spent downstream on cleansing and enrichment. There may be no easy cure for disorganized data sets, but standards applied at the point of creation would go a long way toward minimizing this common Big Data problem.
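As a closing sketch, governance at the point of creation can be as simple as validating new records against a naming standard before they are accepted; the tag pattern below is illustrative, not a real standard.

```python
import re

# Illustrative standard: one to three letters, a dash, three digits,
# and an optional suffix letter, e.g. 'P-101A'.
TAG_STANDARD = re.compile(r"^[A-Z]{1,3}-\d{3}[A-Z]?$")

def create_equipment_record(tag: str, description: str) -> dict:
    """Reject records whose tag violates the standard at creation time,
    instead of repairing them later during cleansing."""
    if not TAG_STANDARD.match(tag):
        raise ValueError(f"tag '{tag}' does not meet the naming standard")
    return {"tag": tag, "description": description}

print(create_equipment_record("P-101A", "crude transfer pump"))  # accepted
try:
    create_equipment_record("p101 a", "crude transfer pump")
except ValueError as err:
    print(err)  # caught at creation, not months later in a cleansing pass
```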