In Part One of the “Initiating, Executing, and Managing Successful AI Projects in Any Organization” series, we identified the key elements of data projects and showed why projects should be treated as assets that must return value. In Part Two, we discussed how to set up and organize a team to successfully execute artificial intelligence (AI) and data projects. Now, in this third part, we’ll talk about the raw material: the data.
Fundamentals – the Why
Perhaps this is the most fundamental question: Why do we need data? If there is no data, there is no project. But let’s be cautious with the opposite idea: “If data is there, we do have a project.” The fact that data is available doesn’t mean that it is useful and that an AI project can be executed. Along the same lines, the data’s size does not grant an AI project feasibility. Neither a large nor a small amount of available data secures an AI project.
What does determine feasibility is the “state of the data” available, which is going to serve the ultimate goal as described in Part One. By “state of the data,” I mean the data’s quality. I’m sure that you have heard about Garbage-In-Garbage-Out (GIGO), and that’s exactly what I mean. A large amount of data that is not curated, staged, cleaned, stored, processed, or maintained is simply data with low-to-no value for an AI project.
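As a concrete starting point, the “state of the data” can be measured before any modeling begins. Below is a minimal sketch of such an audit using pandas; the `customer_id` and `region` columns are hypothetical stand-ins for your own dataset.

```python
# Minimal data-quality audit sketch, assuming a tabular dataset loaded
# into a pandas DataFrame. Column names here are hypothetical examples.
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize per-column issues that erode a dataset's value."""
    return pd.DataFrame({
        "missing_pct": df.isna().mean() * 100,  # share of missing values
        "n_unique": df.nunique(),               # cardinality of each column
        "dtype": df.dtypes.astype(str),         # declared type per column
    })

# Tiny, partially dirty dataset for illustration.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, None],
    "region": ["N", "S", "S", "N"],
})
report = quality_report(df)
duplicates = df.duplicated().sum()  # count of exact duplicate rows
```

Running a report like this on every incoming dataset makes “low-to-no value” visible as numbers (missing percentages, duplicate counts) instead of a vague impression.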
In my experience as a data scientist and artificial intelligence developer, I have found that a smaller dataset of higher quality delivers much better results than a large amount of low-quality data. I have also seen great AI projects, developed with good-quality data, ruined because large quantities of low-quality data were added later to retrain the model. I like to tell my partners: Good data is great, but more (i.e., “unknown quality”) data is not.
Process – the How
Here are some recommendations for leaders and managers implementing AI projects. If possible, address the following concerns prior to serving data to the AI project team:
Focus on how data is generated, stored, and maintained: Understand how data is generated, whether from a system, an application, a website, forms, human entry, and so on. Depending on how the data is generated, different challenges will arise that the project team will have to deal with. If it is system- or machine-generated, make sure you understand the data generation rules: quantity, format, size, latency, system version, etc. If it is human-generated, understand how it is entered, who inputs it, and under what conditions; it is very important to understand the context in this case. The same scrutiny applies to data storage and data maintenance.
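For machine-generated data, those generation rules can be encoded as explicit checks so that records violating them are caught at ingestion rather than during modeling. The sketch below assumes a hypothetical sensor feed; the field names and expected version are illustrative, not from any real system.

```python
# Sketch of encoding data-generation rules (format, type, system version)
# as explicit validation. The schema and field names are hypothetical.
def validate_record(record: dict) -> list:
    """Return a list of rule violations for one incoming record."""
    errors = []
    if not isinstance(record.get("sensor_id"), int):
        errors.append("sensor_id must be an integer")
    if not isinstance(record.get("value"), (int, float)):
        errors.append("value must be numeric")
    if record.get("version") != "2.1":  # expected generating-system version
        errors.append("unexpected system version")
    return errors

clean = validate_record({"sensor_id": 7, "value": 3.2, "version": "2.1"})
dirty = validate_record({})
```

A clean record returns an empty list; a malformed one returns one message per violated rule, which can be logged and reviewed with the data-producing team.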
Data needs to be prepared for modeling, beyond what information technology (IT) provides: This is often not well understood by managers and leaders. Two common situations limit their vision concerning data preparation:
- Having an IT team that runs and maintains all databases, data warehouses, etc.
- The “Excel” assumption
Data has usually been stored and prepared for historical analysis at an aggregated level; it has rarely been prepared for disaggregated statistical analysis. This means that every single record and every single feature column needs to be carefully analyzed, because that is the level at which artificial intelligence models work.
On the other hand, the assumption that “this can be done easily,” as you would probably do it in Excel, does not hold for large quantities of data. Excel only works with very small amounts of data locally on your computer. When developing a large-scale data solution, the situation is very different.
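The gap between the two can be sketched with chunked processing, where a file is streamed in pieces instead of being opened whole as Excel requires. The CSV content below is a tiny hypothetical stand-in for a file far too large for memory.

```python
# Sketch of processing a file too large to open whole (or exceeding
# Excel's row limit) by streaming it in chunks with pandas.
# The CSV content stands in for a hypothetical multi-gigabyte file.
import io
import pandas as pd

csv_text = "amount\n1.5\n2.5\n3.0\n"

total = 0.0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk["amount"].sum()  # process each chunk, then discard it
```

Only one small chunk is ever held in memory at a time, which is the basic pattern large-scale pipelines rely on and the reason a desktop-spreadsheet mental model breaks down.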
Listen first, and let data talk to you: My recommendation is to analyze data with your North Star in mind as the ultimate goal, but “listen” first to determine whether the data is capable of delivering what is expected. You may discover that it can deliver something else, something that hadn’t been considered before.
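In practice, “listening” starts with descriptive statistics and correlations computed before committing to a target. The sketch below uses two hypothetical columns, `spend` and `visits`, to illustrate the idea.

```python
# Sketch of "listening" to data before modeling: summary statistics and
# pairwise correlations. The columns and values are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "spend": [10.0, 20.0, 30.0, 40.0],
    "visits": [1, 2, 3, 4],
})

summary = df.describe()  # distribution of each feature (mean, std, quartiles)
corr = df.corr()         # which signals move together, and how strongly
```

A quick look at outputs like these often reveals what the data can realistically support, and occasionally surfaces a relationship worth pursuing that nobody had planned for.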
Final Thoughts
To summarize, data is the real asset. It is the foundation upon which artificial intelligence is built, and it is critical to understand whether that foundation is strong enough to sustain a solid, scalable artificial intelligence solution.
More data is not always better; in fact, it’s much worse if it is not curated and is lacking in quality.
Most importantly, analyze data from a statistical perspective and let it tell you what is possible and what is not.
Want more insights into all things data? Visit the Data Modernization channel: