In Big Data, Preparing the Data is Most of the Work

A common misconception about Big Data is that it is a black box: you load data and magically gain insight. This is not the case. As this New York Times article “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights” describes, loading a big data platform with quality data that has enough structure to deliver value is a lot of work. Data scientists spend a comparatively large amount of time in the data preparation phase of a project. Whether you call it data wrangling, data munging, or data janitor work, the Times article estimates that 50% to 80% of a data scientist’s time is spent on data preparation. We agree.

Data Selection

Before you start your project, define what data you need. This seems obvious, but in the world of big data, we hear a lot of people say, “just throw it all in.” If you ingest low-quality data that is not relevant to your business objectives, it will add noise to your results.

The noisier the data, the more difficult it will be to see the important trends. You must have a defined strategy for the data sources you need and for the particular subset of that data that is relevant to the questions you want to ask.
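One way to make that strategy concrete is to write it down as an explicit selection plan and apply it at ingest time. The sketch below is only an illustration: the source names, field names, and the `select_fields` helper are hypothetical and not taken from any particular platform.

```python
# Hypothetical selection plan: which sources are in scope and which
# fields from each source are relevant to the questions being asked.
SELECTION_PLAN = {
    "crm": ["cust_id", "region", "signup_date"],
    "orders": ["order_id", "cust_id", "amount"],
}


def select_fields(source, record, plan=SELECTION_PLAN):
    """Keep only the planned fields of a record.

    Returns None when the source is not part of the plan, so the
    caller can skip it instead of "throwing it all in".
    """
    wanted = plan.get(source)
    if wanted is None:
        return None
    return {field: record[field] for field in wanted if field in record}


raw = {"cust_id": 1, "region": "EU", "fax": "555-0100"}
selected = select_fields("crm", raw)       # drops the unplanned "fax" field
skipped = select_fields("weblogs", raw)    # source not in the plan
```

Keeping the plan in one place also gives you something to review with stakeholders before any data is loaded.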

Define the Relationships

In most corporate big data projects, the business challenges demand a data store that combines structured, semi-structured, and unstructured data. You often need to organize a set of unstructured or semi-structured documents from SharePoint or a shared drive against master data held in a set of structured systems. When importing structured data from multiple systems, the data relationships must be defined. Your big data platform will not magically know that “customer no” in one set of data is the same as “cust_id” in another. You must define the relationships between the data sources.
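As a minimal sketch of what defining a relationship can look like, the example below maps each source's field name to one shared name and then joins the records on it. The source names ("billing", "crm"), the canonical name `customer_id`, the sample values, and both helper functions are assumptions made up for illustration.

```python
# Hypothetical mapping from each source's field names to a canonical name.
FIELD_ALIASES = {
    "billing": {"customer no": "customer_id"},
    "crm": {"cust_id": "customer_id"},
}


def normalize(source, record):
    """Rename source-specific fields to their canonical names."""
    aliases = FIELD_ALIASES.get(source, {})
    return {aliases.get(key, key): value for key, value in record.items()}


def join_on(key, left_rows, right_rows):
    """Inner-join two lists of dicts on a shared key."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left_rows:
        for match in index.get(row[key], []):
            # Fields from the left row win if both sides share a name.
            joined.append({**match, **row})
    return joined


billing = [normalize("billing", {"customer no": 42, "balance": 310.5})]
crm = [normalize("crm", {"cust_id": 42, "name": "Acme Corp"})]
combined = join_on("customer_id", billing, crm)
```

The point is that the alias table is written by a person who knows both systems. The join itself is the easy part.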

This is a common challenge that many organizations are facing. As such, there are some interesting products coming to market to assist data scientists in identifying possible common data elements in large data sets as described here.

Extract and Organize

This is where you will spend the most time, and it is where we have spent the most time working for our clients. Acquiring the data can be a major challenge. If it is public data, is there an API, or do we have to scrape it from the web? If it is corporate data, who can provide extracts and documentation on the data structure? What are the security considerations? Organizing the data involves many steps: translating system-specific codes into meaningful, usable data; mapping common fields consistently so they can be related; handling incomplete or erroneous data; and replicating application logic to make the data self-describing. The list seems endless. You have to spend a lot of time inspecting the data, querying the data, and processing it.
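To give a flavor of the organizing steps, here is a small sketch that translates a system-specific code and flags incomplete or erroneous values in one record. The status code table, the field names, and the `clean_record` function are hypothetical. Real source systems will have their own codes and rules.

```python
# Hypothetical lookup table for one source system's status codes.
STATUS_CODES = {"A": "active", "I": "inactive", "P": "pending"}


def clean_record(raw):
    """Return (clean, problems); clean is None when the record is unusable."""
    problems = []

    # Incomplete data: the identifier must be present and non-blank.
    value = raw.get("customer_id")
    customer_id = "" if value is None else str(value).strip()
    if not customer_id:
        problems.append("missing customer_id")

    # System-specific codes: translate into a meaningful value.
    code = str(raw.get("status") or "").strip().upper()
    status = STATUS_CODES.get(code)
    if status is None:
        problems.append(f"unknown status code: {code!r}")

    # Erroneous data: the amount must parse as a number.
    try:
        amount = float(raw.get("amount"))
    except (TypeError, ValueError):
        amount = None
        problems.append("amount is not a number")

    if problems:
        return None, problems
    return {"customer_id": customer_id, "status": status, "amount": amount}, []


good, good_problems = clean_record(
    {"customer_id": " 42 ", "status": "a", "amount": "19.90"}
)
bad, bad_problems = clean_record(
    {"customer_id": None, "status": "X", "amount": "n/a"}
)
```

Returning the list of problems, rather than silently dropping bad rows, makes it easier to report back to whoever owns the source system.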

A further difficulty we have experienced during this very long phase is that you have nothing to show your stakeholders. They expect slick demos with glossy visualizations, and they expect them quickly. And you are stuck in the data.

Load the Data

You’ve done it. Finally, the data is ready to load to your big data platform, and the exciting work of analytics and visualization can begin. With clean, organized, structured data, the analytics and visualization phase will progress quickly and will deliver real value.


Preparing data for ingest to a big data platform is a lot of work. There are no shortcuts. If you want to achieve valuable insights via analytics and visualizations, you’ve got to invest the time to build a high-quality data store. Set expectations carefully with your stakeholders to prepare them for the investment in data preparation. You will be glad you did.