Our Managing Director of Data Engineering recently helped craft a call to action for oil and gas companies to address an overlooked roadblock in the journey to reduce carbon emissions in the industry. Read his posting here
Now more than ever, companies are relying on big data and analytics to support innovation and digital transformation strategies. To meet this challenge, big data teams scale out on-prem Hadoop resources to create a unified data fabric that connects all aspects of the business. However, many Hadoop users struggle with complexity, imbalanced infrastructure, excessive maintenance overhead, and, overall, unrealized value.
Elephants, Bees, and Whales, Oh My!
Let’s start with the basics: what is Hadoop? Hadoop provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Basically, it’s a set of programs enabling multiple servers to coordinate and work together to solve a single problem. Out of the box, Hadoop is massively complex and reliant on heavy user configuration, requiring users to have a deep understanding of not only Hadoop but the underlying OS as well. Hadoop is made even more complex by the ecosystem tools. If you google Hadoop Ecosystem logos, it looks like someone dropped a zoo on top of a Home Depot. Confusing logos aside, Hadoop started out as a cheap data store and efficient batch ETL processing platform, but never managed to grow into a full-fledged data analytics platform. Many companies have tried to extract analytics value from Hadoop and failed because Hadoop was never really intended to be an analytics platform.
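To make the MapReduce programming model mentioned above a little more concrete, here is a minimal in-process sketch of a word count. It does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample sentences are made up for illustration, and on a real cluster each step would be spread across many servers.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different server; here we loop locally.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two tiny "documents".
counts = word_count(["big data big plans", "data lakes and data swamps"])
```

The map and reduce functions are the easy part. The coordination Hadoop does around them (distributing the data, scheduling the work, handling failed nodes) is where the configuration burden described above comes from.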
The Dark Side of Hadoop
Hadoop is a highly efficient storage platform and great for running batch ETL processes, but many companies saw Hadoop as the solution to the real problem big data brought with it: time-to-insight. Traditional ETL and RDBMS tools were not able to keep up with the terabytes of data being generated. Hadoop promised that if all that data were centrally located in a distributed environment, all our analytics problems would be solved. Companies began pouring entire source systems and transaction logs into Hadoop, but quickly realized it wasn’t so simple. Building efficient data stores for querying was complex, and end users wishing to use that data needed to be intimately familiar with the underlying complexity in order to take advantage of the cluster’s computing power. Worse, all of that complexity demanded constant maintenance: nodes going down, services needing a restart, and so on. Nothing seemed automated without a lot of custom scripting and programming.
Companies purchased Hadoop to match the growing needs of the business for ad-hoc queries against larger data sets. They wanted to tear down data silos but instead built complicated data swamps. The Hadoop ecosystem tools never evolved to be enterprise-friendly enough for the average business user, and they required highly skilled data engineers to performance-tune every business query.
Enter the Cloud
I am not saying cloud providers have a panacea for all our problems; that would put me in the same group of people who praised Hadoop as the Second Coming. What I will say is that public cloud providers have created an attractive alternative for customers still laboring over Hadoop on-prem.
Know Before You Buy
To be fair, cloud providers developed many of their big data-friendly services by listening to the complaints of Hadoop customers, so it makes sense that those services are such an attractive alternative to running Hadoop on-prem. Big data solutions needed to be scalable, provide massive amounts of computing power, and provide a convenient entry point for users of all data literacy levels to glean insight. That said, big data problems are still present:
- Data models still need to be optimized for a purpose
- Developers should be aware of operational limitations
- Deployed use cases need to be monitored and tuned over time
The difference is cloud services abstract the complex parts of those tasks away from the customer. No more cluster maintenance means little to no downtime. Query tools require more configuration than customization, saving tons of human capital down the road when something breaks. Monitoring and logging are not low-level Linux services writing text to a document; instead, web-based rules-driven dashboards help DevOps keep an eye on usage and error detection.
Don’t Go At It Alone
Like all technology stacks, cloud services don’t come without their negatives. There’s a learning curve for the cloud. While some of your Hadoop services will translate over and play well with a SaaS architecture, some will not. Understanding your cluster service layout, dependencies, and currently deployed use cases is imperative if you plan to make the shift to the cloud. Also, moving to the cloud means re-architecting at least some percentage of your on-prem solution. No matter what anyone tells you, lift and shift is only appealing upfront; a lift and shift implementation lacks the foresight needed to avoid overpaying for cloud resources, and possibly recreating the same problems that drove the migration. If you bring the same processes that weren’t solving the problem into a new environment, you’ve only compounded your problems.
Migrations might sound scary and daunting, but staying with a solution that isn’t working is just as dangerous. If you are considering making the switch to the cloud, Sullexis is here to help. Our Data Engineering Practice has years of experience working with Hadoop and helping clients re-invent their data lakes in the cloud.
Our client was looking for cost-saving opportunities by consolidating individual software systems into the Microsoft suite of products. As part of that process, they made a decision to move from SAP Crystal Reports to Microsoft SQL Server Reporting Services (SSRS). The client identified 395 business reports that needed to be updated, tested, and deployed. These reports are critical to daily operations, covering areas such as agreements, portfolio management, regulatory, and settlements.
Challenges associated with the conversion included:
- The reports had been developed over a long period of time by different developers and included a substantial number of customizations with few commonalities.
- Crystal allows developers a great deal of flexibility for report layout and configuration, creating challenges when mapping into SSRS.
- Some of the reports were not working correctly in production.
The Sullexis team worked closely with the client to organize the reports into packages that tied to the responsible business team, complexity, and likely rework. As reports were converted into SSRS, developers identified and recommended changes to make the existing reports easier to use. The team coordinated closely with the client when reports needed to be redesigned to fix existing issues, improve readability, component reuse, and maintainability.
Sullexis was able to deploy the reports for use on schedule and on budget. During the project, the client added more reports to the scope as a result of the team’s performance, bringing the final report count to 450. The business was able to seamlessly transition from the old reports to the new without impact on its daily operations. New reports from additional data sources are currently being identified to add to the SSRS report repository as part of the client’s larger business strategy. Further benefits:
- The migration to SSRS provides improved scalability and maintenance by leveraging the client’s existing Microsoft platform.
- The client immediately saw quicker response times as loads were now managed across multiple SSRS servers.
- The client reduced its licensing and maintenance costs by retiring the old reporting tool.
- New enhancements and support for the reports have been dramatically improved.
Jeff Diaz has joined the firm as the Managing Director of Sullexis’ Data Engineering practice. For nearly a decade, Jeff has architected innovative solutions across a broad range of technology verticals. From business intelligence to machine learning-backed analytics to data storage and recovery, Jeff’s expertise has delivered high ROI use cases for clients.
Prior to joining Sullexis, Jeff served in various leadership roles at PREDICTif Solutions, where he championed and ran the company’s first cloud practice, later serving in a COO capacity. Jeff is AWS certified and an avid practitioner of machine learning. He advocates strong data governance combined with machine learning methodologies to provide a modernized, systematic approach to solving today’s energy needs. Jeff has deep subject matter expertise in multiple industries, including Upstream and Midstream Oil and Gas, Restaurant Services, Manufacturing, Supply Chain, Technology, and Utilities (electrical and gas).
Sullexis is excited to announce the launch of Linq Analytics, LLC.
Based in Austin, Texas, Linq Analytics helps B2B organizations with multiple CRM and ERP systems create and maintain a unified view of their customer account hierarchy so they can:
- Roll-up and analyze customer account information by geography and business unit
- Analyze customer account white space and penetration
- Review sales rep coverage, overlap, and potential conflict
- Assess potential targeted marketing and incentive program impact
Sullexis will continue to work closely with Linq to provide data consulting and data management services to Linq’s clients.
To learn more about Linq Analytics, visit www.linqanalytics.com.
About Sullexis: Sullexis specializes in helping its clients create, manage, and enhance data to accelerate and improve decision making. Founded in 2006, Sullexis is headquartered in Houston, TX. To learn more about Sullexis, visit www.sullexis.com.
Sullexis kicked off a major new initiative for a premier independent E&P company to revamp their Offshore employee bonus program, resulting in the creation of an Offshore Bonus Calculation Engine and reporting dashboard to enable the program.
- The new program would be based on key performance measures in four distinct categories: Health and Safety, Financial Performance, Well Production Performance, and Individual Performance. Each scored factor would contribute to the calculated bonus for each offshore employee.
- The KPIs would need to be derived from five disparate systems, including an OData API to a third-party hosted system, three separate SQL Server instances, and CSV/Excel files.
- This meant managers would need a way to visualize measurement calculations and approve final bonuses.
- The client would need a real-time interactive dashboard for employees to see how their bonuses are affected by the underlying metrics.
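As a rough illustration of how scored factors in the four categories could roll up into a single bonus score, here is a minimal weighted-average sketch. The weights, the 0-100 scoring scale, and the names `WEIGHTS` and `bonus_score` are all assumptions made for this example; the actual calculation used for the client is not described here.

```python
# Hypothetical weights for the four categories named above.
WEIGHTS = {
    "health_and_safety": 0.30,
    "financial_performance": 0.25,
    "well_production_performance": 0.25,
    "individual_performance": 0.20,
}


def bonus_score(scores, weights=WEIGHTS):
    """Return the weighted average of category scores (assumed 0-100 scale)."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing category scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    # Dividing by the total keeps the result on the same scale
    # even if the weights are later changed and no longer sum to 1.
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


# Made-up scores for one employee.
employee_scores = {
    "health_and_safety": 90,
    "financial_performance": 80,
    "well_production_performance": 70,
    "individual_performance": 100,
}
score = bonus_score(employee_scores)
```

Keeping the weights in data rather than in the formula is what lets them be changed without touching the calculation itself.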
In an accelerated six-week effort, a small team developed a Power BI dashboard solution that gathered and reported the supporting detail for each measurement and produced real-time bonus scores for each employee. The solution also provided three separate real-time reporting dashboards for the key measurements in Health and Safety, Financial Performance, and Well Production Performance, per facility in the Gulf of Mexico.
Key Components Include
- Seven key measurements derived from data located in the five data sources.
- A cross-reference mechanism built to allow users to associate data elements from the various systems.
- A bonus calculation dashboard designed to be data-driven so that the weighted factors could be adjusted by management.
- Workflow automation to load data from disparate sources into a relational table model which allowed data to be joined across sources.
- The dashboard brings to life the vision that upper management had of managers and employees having real-time access to measurements of the key factors the company wants to focus on.
- Management can now quickly see facilities that are excelling or falling behind on key factors in health and safety, financial performance, or production performance.
- Employees can go to the dashboard for real-time feedback on those factors affecting their personal bonuses.
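The cross-reference mechanism and relational table model listed among the key components can be sketched with Python's built-in `sqlite3` module. Everything in the example is hypothetical: the table names, column names, facility identifiers, and values are invented to show the idea of joining two sources that use different keys, and they do not reflect the client's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Rows as they might arrive from two source systems with different keys.
    CREATE TABLE safety_incidents (facility_code TEXT, incidents INTEGER);
    CREATE TABLE production (platform_id INTEGER, barrels INTEGER);
    -- Cross-reference table associating each system's identifier.
    CREATE TABLE facility_xref (
        facility_code TEXT, platform_id INTEGER, facility_name TEXT
    );

    INSERT INTO safety_incidents VALUES ('GOM-A', 0), ('GOM-B', 2);
    INSERT INTO production VALUES (101, 5000), (102, 4200);
    INSERT INTO facility_xref VALUES
        ('GOM-A', 101, 'Platform A'),
        ('GOM-B', 102, 'Platform B');
""")

# The cross-reference table lets one query combine both sources per facility.
rows = conn.execute("""
    SELECT x.facility_name, s.incidents, p.barrels
    FROM facility_xref AS x
    JOIN safety_incidents AS s ON s.facility_code = x.facility_code
    JOIN production AS p ON p.platform_id = x.platform_id
    ORDER BY x.facility_name
""").fetchall()
conn.close()
```

Once the sources share a relational model like this, a dashboard tool only needs to query the joined result rather than reconcile identifiers itself.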
Sullexis helped an Upstream Super Major develop a Key Performance Indicator (KPI) dashboard for integrity management engineering utilizing Microsoft Power BI.
- The customer found the existing process was inflexible and could not incorporate new safety-oriented reporting requirements.
- The existing process for gathering and reporting the KPIs to upper management was an inefficient use of resources: data was moved and manipulated multiple times in Excel spreadsheets, consuming significant time from five different roles to collect and manage the data.
- The reporting was delivered to management as a printed PDF file with no ability to dig into details that supported the monthly performance indicators being reported.
- In each step of the process, the Excel processes and PDF creation were hardcoded to the existing measurements. Adding to or changing the reporting involved an immense effort in the coordination of all participants.
- The team developed a solution to pull in details of all open defects and calculate forty-two KPIs on the detail-level data.
- Pulling the supporting detailed data into the Power BI solution, the team was able to provide drill-down capabilities to the summarized graphical visualizations.
Key components include:
- Tracks the performance of inspections and repairs of engineering resources on large production platforms in the Gulf.
- Data from disparate sources were modeled into a relational table model which allowed data to be joined across sources.
- The system added a robust look-back ability, allowing management to see not only a snapshot of this month’s performance on the forty-two measurements but also the previous month’s measurement data for comparison.
- Management can now see trend visualizations of measurements with the ability to drill down to the supporting detail.
- The management reporting now pulls directly from the source detail, eliminating two layers of Excel complexity and employee intervention and freeing those resources to focus elsewhere.
- The new solution’s added flexibility enables visualizations to handle additional data points without the rework required in the prior solution.
This post is a continuation of a singular topic: managing data science projects at an enterprise scale. In the first article, I laid out the 7 key factors (reproduced below) and focused on the first four. Just to refresh your memory, those factors are:
- Clearly define your goals and success criteria (yes, they’re two different things)
- Give your data science team time to succeed
- Redefine your use case as a machine learning problem statement
- Figure out what it would take to solve a problem without using machine learning
- Understand your correct ratio of data engineers to data scientists
- Architect with CI/CD in mind from the beginning
- Leverage data science code for data pipelining in production
Part 2 will focus on the latter three. So let’s jump into it!
5. Understand the correct ratio of data engineers to data scientists
Not all data scientists are created equal!
Some data science (DS) folks are born out of the fire that is data engineering; others are specialist PhD-holding gurus. Each has their own strengths and weaknesses, but all of them will have to learn how to manipulate data. To ease the burden on them and your systems, you’ll need some dedicated people to help them get the data they need when they need it. This is especially important if your DS types are relatively new to your organization and don’t know their way around the data landscape. Bring in some seasoned data engineers and your timeline will come down considerably. It helps if they have some experience working with DS teams, since a lot of data science requires data in a very strict, predictable (pun intended) format that lends itself to feature engineering.
6. Architect with CI/CD in mind from the beginning
The background of this recommendation is outside the scope of this article, but I plan on doing a follow-up that elaborates on this one. Data pipelines will have to be built to support the model that goes into production, but unlike batch reporting, machine learning (ML) use cases tend to target real-time needs. Even near-real-time models pose a challenge for engineers. Worse is when data scientists create models that have a shifting feature set based on the most recent data (think about how you would handle “Most Active Customers in the past 30 days” as a feature input to a model).
Go back to our previous example about product recommendation. Wouldn’t you want your model to incorporate the most up-to-date reviews and customer attributes? Of course, this all depends on the nature of your industry: business-to-consumer (B2C) marketing is much more real-time than, say, pipeline routing, where the majority of contracts are negotiated long in advance. How about dealing with new customers? Will a model need to make recommendations the same day for a new customer? Understanding how your data changes over time is key to building sustainable pipelines to support model development and refinement.
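To see why a feature like “Most Active Customers in the past 30 days” is awkward for a pipeline, consider this minimal sketch. The function name `most_active_customers`, the event format, and the sample data are all hypothetical; the point is only that the same code returns a different feature value depending on the date it is evaluated for.

```python
from collections import Counter
from datetime import date, timedelta


def most_active_customers(events, as_of, window_days=30, top_n=2):
    """Return the top_n customer ids by event count in the window ending at as_of.

    events is an iterable of (customer_id, event_date) pairs.
    """
    start = as_of - timedelta(days=window_days)
    counts = Counter(
        customer for customer, when in events if start < when <= as_of
    )
    # Sort by count descending, then by id so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [customer for customer, _ in ranked[:top_n]]


# Made-up activity log.
events = [
    ("alice", date(2024, 1, 5)),
    ("alice", date(2024, 1, 20)),
    ("bob", date(2024, 1, 25)),
    ("bob", date(2024, 2, 10)),
    ("bob", date(2024, 2, 12)),
    ("carol", date(2024, 2, 14)),
]

january_view = most_active_customers(events, as_of=date(2024, 1, 31))
february_view = most_active_customers(events, as_of=date(2024, 2, 15))
```

Because the answer depends on `as_of`, a production pipeline has to recompute the feature for the moment of each prediction, and training data has to be built with the value the feature had at the time of each historical example.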
7. Leverage data science code for data pipelining in production
I saved the best for last, and this is neither the first nor the last time I will write about it: leverage the work your data science team has already done. DS work is similar to data analytics in that both teams need ready access to multiple data sources, time to massage data, and direction on business objectives; yet the two differ when we look at the process. DS is an R&D activity. Your DS team shouldn’t be fixated on delivering day-to-day operational metrics. DS teams are like wild horses: they need room to stretch their legs and run. And like a wild horse, if you follow it long enough, you’ll find water.
Your DS team will likely spend a lot of time using applications like Jupyter, RStudio, or PyCharm, the last of which provides some benefits with more GUI tools for CI/CD and version control. Using a more “notebook” style environment to import, transform, and model data is more appealing for the DS team because these tools operate as a GUI on top of a run-time language: they can selectively execute small blocks of code, check the results, iterate changes quickly, then move on. This makes the DS development process much smoother… for development.
Once a model is trained, productionizing notebook code can be a nightmare. We need to productionize this code because the work the DS team did to build the model involved a lot of data prep (bringing in data from multiple sources, joins, filters, etc.). Why all of this needs to be captured and recreated for production pipelines is beyond the scope of this post, but I will list some reasons:
- The feature set used to train the model will be needed when an inference is requested for any supervised learning model
- The calculations to obtain numerical features (10-day averages, running totals, etc.) might be tailored to fit the specific use case (net sales can mean ANYTHING)
- Writing these features to persistent storage in your warehouse might cause governance issues if these calculations are not properly documented with the data
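One way to act on the reasons above is to keep each feature calculation in a single documented function that both the training notebook and the production pipeline import. The sketch below assumes a hypothetical `build_features` function and made-up daily net sales figures; a short window of 2 is used only to keep the numbers easy to follow.

```python
def rolling_average(values, window=10):
    """Average of the last `window` values seen so far, for each position."""
    averages = []
    for i in range(len(values)):
        recent = values[max(0, i - window + 1): i + 1]
        averages.append(sum(recent) / len(recent))
    return averages


def running_total(values):
    """Cumulative sum of the values."""
    totals, total = [], 0
    for value in values:
        total += value
        totals.append(total)
    return totals


def build_features(daily_net_sales, window=10):
    """Single definition of the features, shared by training and inference."""
    return {
        "avg_net_sales": rolling_average(daily_net_sales, window),
        "running_net_sales": running_total(daily_net_sales),
    }


# Made-up history used when the model was trained.
history = [100, 120, 80, 100]
training_features = build_features(history, window=2)

# At inference time the same function runs on the latest data,
# so the features match the ones the model was trained on.
inference_features = build_features(history + [140], window=2)
```

If the definition of net sales or the window length changes, it changes in one place, and the docstrings travel with the code into whatever documentation accompanies the stored features.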
Access to the code base where the DS team keeps their code under revision (if you aren’t using some sort of version control, shame on you) will be necessary. Also, don’t keep your engineering team in the dark until the model is ready to go to production (they need sunlight and fresh water to grow). Stand-ups should include engineering representatives as well as data scientists. These incremental updates will help everyone code towards a common goal: automating production delivery.
That concludes this two-part series about managing data science projects. I hope you all found something useful – feel free to quote me. Follow me on LinkedIn and stay up-to-date on the latest from Sullexis. Sullexis is a data-centric, client-obsessed consulting company. Our four practices work together to deliver the highest quality of service