Artificial intelligence (AI) has been “almost ready for prime time” for at least 50 years. Finally, the predictions may be coming true with the advent of Generative AI, which combines enormous amounts of processing power, available for the first time at an affordable price, with vast amounts of data that are used to train the AI models.
As Generative AI — embodied by ChatGPT — becomes practical and affordable, tech and business leaders must understand how AI impacts data and security strategies. That starts with the concept of model training. In this context, “training” means exposing the AI tool, under the direction of “knowledge engineers,” to lots and lots of data that teaches the tool how to reach conclusions from that data trove.
When I say lots and lots of data, you may be wondering exactly how much that is. According to OpenAI, the company behind ChatGPT, the model was trained on 570GB of data, or about 300 billion words, and it has 175 billion parameters (the internal values the model adjusts as it learns, which measure the model's size rather than its training data). I don’t know about your frame of reference, but that’s a lot of data to me.
As a business leader, you should have at least three concerns regarding training data as it relates to your data strategy:
- Did the training data introduce bias — deliberate or unconscious — into the model?
- Was the data relevant to the specific requirements of your business and the industry in which you compete?
- Was any of your proprietary or confidential data inadvertently (or deliberately) included in the training dataset?
In this analysis, I’ll drill into those concerns and provide specific recommendations about how CXOs can address them.
Bias and Potential Embarrassment
Bias — a skewed point of view — has been a sticking point around many AI deployments. No one wants to suffer the embarrassment Microsoft did with “Tay,” an early version of the company’s conversational AI. As was well documented, malicious users fed Tay enough slanted information that it quickly became “inflammatory,” to say the least, in its answers.
Bias can be overt — as happened with Tay — or unconscious, caused by gaps in the provided training information. For example, it appears that ChatGPT was trained using written materials from the Internet. Might that strategy have introduced bias because it didn’t include spoken word materials? Perhaps. Future AI systems will be trained with far more extensive and diverse datasets, which should help lower the risk of bias.
Recommendation for your data strategy: Even with larger, future AI models, your Chief Data Officer and Chief Digital Officer need to validate your chosen AI to ensure it’s free from bias and the embarrassment that can result.
Relevance to Your Business, Industry
The key question in the context of AI “relevance” is this: does the AI model specifically relate to your business and industry — what I’ll call your domain — so it delivers value to your organization?
Most commercial AI models start with a substantial training data set, but so far, the majority are stuffed with general knowledge. Organizations need either an “empty” model they can train with their data or general-knowledge AI that allows adding organization-specific training data. This way, domain-specific questions can be answered correctly.
For example, forensic engineering firms are hired by insurance firms and lawyers to determine why a structure collapsed, and whether the structure was built in accordance with all applicable local, regional, and national building codes. Today, such an analysis is done by highly trained and experienced engineers who visit the site to inspect, take photos and measurements, and acquire samples for lab analysis. The engineers also examine plans and drawings, and they research numerous building codes and “best practices” for that type of structure built at that site at that particular point in time.
Imagine an AI model “trained” with all building codes and best practices across time, plus photos, videos, and measurements of collapsed structures vs. intact structures. The AI model could even be trained to ask for specific images, measurements, and test results (or given command of a drone to gather its own data on-site), then issue findings relevant to settling insurance claims and even lawsuits. The training datasets for such a forensic engineering AI would of course be highly specialized — but extremely useful.
Now, let’s reconsider the “bias” question as it relates to training an AI model. If you’re training an organization-specific model, you may want to include deliberate bias to generate answers that favor your organization. For example, if your product is of higher quality but more expensive than its competitors, you might train the AI model to treat cost as less significant than quality. Or perhaps you want to be known for offering unbiased answers that inform prospects or customers, even if they make your products look inferior to others. (Think of the insurance companies that compare rates and sometimes show a prospect that a competitor is a better choice for a given product.)
Recommendation for your data strategy: Consider your culture and your data. The CMO and maybe the CEO and board, plus the Chief Data Officer, need to drive this discussion.
The Data Leakage Security Risk
Finally, let’s consider the risk of “data leakage” when it comes to AI models. Data protection is a core element of any data strategy. Inadvertent disclosure of proprietary information, or deliberate theft of it, has been a problem since we started keeping records on clay tablets.
As it relates to AI models, there’s a key question that CXOs must be able to answer: Has any proprietary organizational data leaked and become incorporated in a public model? Or has some AI model builder, perhaps one working for a business rival, purloined proprietary data and included it in a public model or a competitor’s model?
Internet-connected AI models significantly raise the risk of such events, for three reasons. First, AI models never forget anything: whatever they ingest is available as needed. Second, because AI is excellent at identifying data relationships and patterns that are obscure to a human, even seemingly trivial facts might drive decisions that hurt your organization. Third, Internet access means people around the globe might have access to information that previously existed only on a printout in an evildoer’s briefcase.
Recommendation for your data strategy: Involve the CISO, the Chief Data Officer, and “information custodians” (general counsels, internal audit, and compliance heads) in a project to evaluate and, as needed, tighten up your information protection (or Intellectual Property Protection) programs.
Conclusion
Today’s AI models consume enormous data sets for training purposes, with future tools aiming to consume “all human knowledge.” Given the power of today’s Generative AI tools, and the far more powerful products being designed and envisioned, it’s vital that your data strategy incorporates a heightened degree of data security.
Until this point, it might have been OK to have a data strategy overseen by a Chief Data Officer, along with a separate security strategy overseen by a CISO. But from today onward, organizations need a blended “data + security” strategy. Now is the time to understand the data strategy and security implications of powerful Generative AI tools and to be sure you’re protecting your organization from data bias, irrelevance, and IP leakage.