With the recent boom in awareness and use cases surrounding artificial intelligence (AI), the conversation around data quality continues to be important. It’s easy to get swept away in awe of many of these emerging AI systems, while forgetting to consider how the data they were trained on is impacting their performance and limiting their utility in real-world scenarios.
With AI, you get out what you put in. After all, AI and machine learning (ML) models are just really powerful extrapolation machines. They “understand” patterns within the data you train them on, and then identify those patterns in instances of data that are novel but similar to the data they have seen before.
As with athletes and their diets, if the quality of the input is reduced, the quality of the output (in this case, performance) is similarly reduced, no matter how talented the athlete may be. If the athlete eats only fast food for a week, their mile time will increase, and they may be more likely to suffer injuries. Similarly, without high-quality data, organizations can’t implement AI successfully.
Data Quality Problems and Bias
To understand how data quality has an impact on AI utility, let’s look at attempts to automate the hiring process. A few years ago, Amazon tried using AI to streamline the recruitment process. It built an AI system that would take in hundreds of resumes and spit out the top few candidates, sparing human recruiters the menial work of filtering out subpar resumes so they could quickly select the best ones. Sounds like proper automation, right?
Not quite. The company trained the underlying model on all the resumes the company had received in the past, most of which were from male candidates. Since more men were hired at Amazon, especially in certain fields like software engineering, the system came to associate the language and attributes prevalent in male resumes with being a better candidate. Conversely, with less training data on female resumes, the system tended to downgrade or misinterpret attributes of top female candidates.
This is a clear example of poor data quality resulting in biased AI systems. Other systems for recruitment, granting bank loans, and granting mortgages have also been flagged for their bias against certain demographics. The intended outcome of these systems could drive tremendous value to the organizations putting them in place — and make the loan or mortgage process easier for consumers — but the quality of the training data is preventing widespread adoption.
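One simple way to flag this kind of bias is to compare how often a system selects candidates from each group. The sketch below is illustrative only: the records, the field names `group` and `selected`, and both helper functions are assumptions made for this example, not part of any system mentioned above. It computes per-group selection rates and the ratio of the lowest rate to the highest, where 1.0 means parity.

```python
from collections import defaultdict

def selection_rates(records, group_key="group", outcome_key="selected"):
    """Return the fraction of positive outcomes for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in records:
        totals[row[group_key]] += 1
        if row[outcome_key]:
            positives[row[group_key]] += 1
    return {g: positives[g] / totals[g] for g in totals}

def disparate_impact_ratio(rates):
    """Ratio of the lowest selection rate to the highest (1.0 means parity)."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest > 0 else 1.0

# Hypothetical screening outcomes, invented purely for illustration.
screened = (
    [{"group": "A", "selected": True}] * 6
    + [{"group": "A", "selected": False}] * 4
    + [{"group": "B", "selected": True}] * 3
    + [{"group": "B", "selected": False}] * 7
)

rates = selection_rates(screened)      # A is selected 60% of the time, B 30%
ratio = disparate_impact_ratio(rates)  # 0.5: B is selected half as often as A
print(rates, round(ratio, 2))
```

A ratio well below 1.0 does not prove the model is unfair on its own, but it is a cheap signal that the training data and the model's outputs deserve a closer look.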
Data Profiling: How H2O.ai Cleans, Reviews Data
H2O.ai, one of the Acceleration Economy Hyperautomation Top 10 companies, is a strong supporter of “data profiling,” or the act of cleaning and reviewing data from existing sources. In other words, H2O.ai conducts quality control not just on its own AI cloud platform, but also on the data it is fed.
Data profiling can take many shapes. Even before approaching the dataset, it’s important to evaluate your own assumptions: what you are testing for, which data points indicate the desired state, how data points can best be labeled, how data sets can be filtered or augmented to be more inclusive of edge cases, and so on. Then you can leverage platforms like H2O.ai’s, which use machine learning to filter and fix faulty data points. Preliminary analysis can identify weak spots in the data or surface unwanted biases.
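The preliminary analysis described above can start very small. The following is a minimal sketch in plain Python, not H2O.ai's API: the `profile` function, the resume fields, and the sample rows are all hypothetical. It reports how many values are missing per field and what share of the rows each group contributes, two of the first things worth checking before training.

```python
from collections import Counter

def profile(rows, fields, group_field):
    """Summarize missing values per field and each group's share of the rows."""
    total = len(rows)
    # Treat None and the empty string as missing (an assumption for this sketch).
    missing = {f: sum(1 for r in rows if r.get(f) in (None, "")) for f in fields}
    groups = Counter(r.get(group_field) for r in rows)
    shares = {g: n / total for g, n in groups.items()} if total else {}
    return {"rows": total, "missing": missing, "group_share": shares}

# Hypothetical resume records, invented purely for illustration.
resumes = [
    {"years_experience": 5, "degree": "BSc", "gender": "M"},
    {"years_experience": None, "degree": "MSc", "gender": "M"},
    {"years_experience": 7, "degree": "", "gender": "M"},
    {"years_experience": 3, "degree": "BSc", "gender": "F"},
]

report = profile(resumes, ["years_experience", "degree"], "gender")
print(report)
```

Here the report would show one missing value in each field and a 75/25 split between the two groups, the kind of imbalance that skewed the recruiting example earlier.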
How ‘Synthetic’ Data Increases Data Quality and Volume
These post hoc tactics of cleaning data aren’t always enough. Sometimes you just need to gather better data from the get-go, which can be an expensive and difficult undertaking. That’s why companies are turning to synthetic data to train their AI systems. Unlike real data, which needs to be captured, synthetic data is created by algorithms or even generative AI models.
For example, companies building the AI systems behind self-driving cars need enormous amounts of visual data on driving situations. This visual data has to cover every angle, lighting, and weather condition that a car might encounter on a drive, which makes it laborious to capture and store. It also often includes faces and license plates, so handling it must comply with a wide variety of privacy regulations. To sidestep these problems, researchers at MIT created an algorithm that generates video clips complete with 3D models of roadside objects, humans, and unique traffic situations. The models they trained on these synthetic video clips actually performed better than models trained on real video clips of people driving.
This is partly because you can include more outlying situations in synthetic data sets that would otherwise rarely occur in reality, which helps the model deal with uncommon challenges. You can also avoid real-world biases or happenstance correlations that might bias the end model, such as a historical pattern of granting smaller loans to people of color.
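To make the idea of adding rare situations concrete, here is a deliberately simple sketch. It is far cruder than the generative approach described above: it only copies rare rows and jitters their numeric values. The function name, the scene fields, and the 5% noise level are all assumptions chosen for illustration.

```python
import random

def augment_rare_cases(rows, label_key, rare_label, target_count, noise=0.05, seed=0):
    """Add jittered copies of rare rows until rare_label appears target_count times."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rare = [r for r in rows if r[label_key] == rare_label]
    if not rare:
        raise ValueError("no examples of the rare label to sample from")
    synthetic = []
    while len(rare) + len(synthetic) < target_count:
        base = rng.choice(rare)
        # Perturb float fields by up to +/- noise; copy everything else as-is.
        new_row = {
            k: v * (1 + rng.uniform(-noise, noise)) if isinstance(v, float) else v
            for k, v in base.items()
        }
        new_row["synthetic"] = True  # keep generated rows distinguishable
        synthetic.append(new_row)
    return rows + synthetic

# Hypothetical driving scenes: fog is rare in the captured data.
scenes = [{"condition": "clear", "visibility_m": 900.0, "speed_kph": 50.0}] * 5
scenes.append({"condition": "fog", "visibility_m": 60.0, "speed_kph": 30.0})

augmented = augment_rare_cases(scenes, "condition", "fog", target_count=5)
fog_count = sum(1 for r in augmented if r["condition"] == "fog")
print(len(scenes), len(augmented), fog_count)
```

Marking generated rows with a `synthetic` flag is a design choice worth keeping in real pipelines too, since it lets you evaluate the model on captured data only and check that the synthetic examples actually helped.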
Synthetic data allows us to build models that push the world in the direction we want. This can be used for good or for ill. For example, if we want to increase economic mobility in region X, we might make sure our training set features people in region X receiving larger loans. The same technique, of course, can be applied in reverse.
Conclusion
To reiterate, AI is just an extrapolation machine. The outputs are often just reflections of the way things were identified and handled in the past. Many of our human biases carry over.
But it doesn’t have to be this way. Through focused effort on improving data quality and building strategic training data sets, we can create AI systems that push the world in the direction we want to go in. These systems will improve our way of life and drive value to the organizations that build them.