Zach bought a new Ferrari 458. High-performing even by elite sports car standards, this vehicle boasted a V8 engine capable of producing 562 horsepower. It zoomed from 0-62 MPH in under three and a half breathtaking seconds, reaching top speeds of a very illegal (in most places) 202 MPH. Every morning, Zach meticulously washed and polished his jewel of an automobile. But he never used the high-octane fuel recommended by Ferrari, skipped all of the regular maintenance appointments, and never rotated the tires, checked the brakes, or changed the oil. As you can see, we refer to Zach's lovely vehicle in the past tense.
This is exactly what companies do when they invest megabucks in data warehousing and maintenance, yet never check or cleanse their data. Like Zach's Ferrari, the engine soon clogs, and it starts spitting a nasty-smelling funk. That's when you begin to blame your analytical tools and your data scientists, when the real problem is the data quality.
If You Start at the Wrong Place You Won't End up Where You Want Either
There's an old acronym database administrators used to toss around: GIGO. Garbage in, garbage out. It's true about putting inferior motor oil and low-octane fuel in a Ferrari, it's true about downing doughnuts and sugary drinks before bedtime, and it's true about your data. If you put bad data into good analytical tools, you'll never wind up with reliable results. In fact, the answers you get could be so skewed that you'd be better off without your expensive data warehouse and analysis.
Where Good Data Goes Bad
Data quality can become inferior in a number of ways, including:
It was bad to begin with. Highly manual processes lead to human errors, which result in unreliable data. When it comes to marketing data, people often intentionally enter false data into webforms to gain access to whitepapers and e-books without the fear of being hounded by your marketing department.
It goes bad over time. People change jobs, change titles, change email providers, change phone numbers, and move out of state with no forwarding address. As soon as you collect certain types of data (particularly marketing data), it starts to go bad.
It's poorly formatted. Folks enter data into a spreadsheet. Next quarter, they copy and paste that data into another spreadsheet, perhaps into a table in Word or PowerPoint, and back and forth... Soon the data is put in the wrong fields, formulas are corrupt, and all kinds of mistakes and errors creep in. You'll never get good business intelligence analytics out of this.
It's redundant. Most databases are bloated by duplicated data. Businesses love to talk about their data lakes filled with exabytes of data, but they don't like to think about the gigabytes (potentially petabytes) that are actually duplicate records. This leads to all sorts of trouble, which we'll discuss now.
The High Costs of Bad Data in Business Intelligence Analytics
Certain types of data have a notoriously short shelf life. They begin to go out of date as quickly as you collect them. These types of data (like marketing databases) need to be checked and cleansed more regularly than data that holds its own better.
If one sales rep calls a lead, the lead may or may not buy. If two or three sales reps contact the same lead because they're working with duplicated data, it looks like your right hand doesn't know what your left is up to. Similarly, data can contain bad information that makes your whole company look like it doesn't know what it's doing. The same is true with data used for operational intelligence, business intelligence analytics, machine learning apps, or any other data analytics endeavor. Garbage in, garbage out.
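To make the duplicate-lead problem concrete, here is a minimal sketch of how duplicated records might be spotted before several reps call the same person. The record layout (`id`, `name`, `email`, `rep`), the sample leads, and the choice of a normalized email address as the matching key are all assumptions made for illustration. Real matching usually needs fuzzier rules than this.

```python
from collections import defaultdict


def normalize_email(email: str) -> str:
    """Trim and lowercase an email so trivially different spellings compare equal."""
    return email.strip().lower()


def group_duplicate_leads(leads):
    """Group lead records that share the same normalized email address."""
    groups = defaultdict(list)
    for lead in leads:
        groups[normalize_email(lead["email"])].append(lead)
    # Keep only the addresses that appear on more than one record.
    return {email: recs for email, recs in groups.items() if len(recs) > 1}


# Hypothetical sample data: records 1 and 2 are the same person entered twice.
leads = [
    {"id": 1, "name": "Dana Smith", "email": "dana.smith@example.com", "rep": "Alice"},
    {"id": 2, "name": "D. Smith", "email": " Dana.Smith@Example.com ", "rep": "Bob"},
    {"id": 3, "name": "Lee Wong", "email": "lee.wong@example.com", "rep": "Alice"},
]

duplicates = group_duplicate_leads(leads)
for email, records in duplicates.items():
    reps = sorted({r["rep"] for r in records})
    print(f"{email}: {len(records)} records, assigned to {', '.join(reps)}")
```

In this sample, the two Dana Smith records collapse to one address assigned to both Alice and Bob, which is exactly the situation where two reps end up calling the same lead.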
Righting the Data Quality Ship
What does it take to achieve a high quality of data?
Begin with a data cleansing program. Automate as much of this process as you can, since manual handling is what introduced many of the errors in the first place, and save human insight and experience for the specific issues that tools can't resolve on their own.
Put better systems and procedures in place for managing data quality. Eliminate as many manual processes as possible through automation tools to overcome human errors, and make processes transparent. Build accountability into processes and procedures so that workers are responsible for assuring the quality of their work.
Put a cap on marketing fields. It's far better to have a marketing database filled with 20 solid, reliable data points on each lead and customer than to have a database filled with 150 data points, 25% of which are inaccurate.
Establish a regular procedure for checking data, eliminating duplicates, and addressing erroneous or out-of-date data that creeps in after the initial data cleansing.
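A recurring check like the one described above could look something like the following sketch. It flags records with missing fields, records that have not been verified recently, and records whose email repeats an earlier one. The field names, the 180-day freshness threshold, and the sample records are assumptions for illustration, not a recommendation for any particular database.

```python
from datetime import date, timedelta

REQUIRED_FIELDS = ("name", "email", "last_verified")  # assumed schema
MAX_AGE = timedelta(days=180)  # assumed freshness threshold


def audit_records(records, today):
    """Return a dict mapping each issue type to the ids of affected records."""
    issues = {"missing_field": [], "stale": [], "duplicate": []}
    seen_emails = set()
    for rec in records:
        # An empty or absent required field makes the record unusable as-is.
        if any(not rec.get(field) for field in REQUIRED_FIELDS):
            issues["missing_field"].append(rec["id"])
            continue
        # Records not verified within MAX_AGE are treated as out of date.
        if today - rec["last_verified"] > MAX_AGE:
            issues["stale"].append(rec["id"])
        # A repeated (case-insensitive) email marks the later record as a duplicate.
        key = rec["email"].strip().lower()
        if key in seen_emails:
            issues["duplicate"].append(rec["id"])
        seen_emails.add(key)
    return issues


# Hypothetical sample data.
records = [
    {"id": 1, "name": "Dana Smith", "email": "dana@example.com", "last_verified": date(2024, 5, 1)},
    {"id": 2, "name": "Dana Smith", "email": "DANA@example.com", "last_verified": date(2023, 1, 15)},
    {"id": 3, "name": "", "email": "lee@example.com", "last_verified": date(2024, 4, 20)},
]

report = audit_records(records, today=date(2024, 6, 1))
print(report)
```

Passing `today` in explicitly keeps the check repeatable, so the same audit can be rerun on a schedule and its results compared from one run to the next.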
Aaron Crouch is the Business Intelligence Practice Leader for Aptera. Upon graduating from Purdue University, Aaron immediately began working on Data Warehouses and BI implementations. With 14 years of Business Intelligence experience, Aaron has strategically planned, designed, and developed solutions for multiple Fortune 500 companies both in the United States and abroad. His work spans the fields of retail, education, healthcare, manufacturing, and automotive vertical markets.