Natural language processing (NLP) is quickly becoming ubiquitous. From customer service bots to telehealth products to the education system, this technology is filling critical roles. Many of the world’s knowledge workers now rely on artificial intelligence (AI) models to disseminate knowledge, in effect influencing how the world thinks. That means the demand for keeping such systems under control, and preventing them from saying unsavory or harmful things, is also increasing.
These protections are provided through guardrails: programmatic barriers put into place to prevent large language models (LLMs) from including certain topics in their output, such as violence, profanity, criminal behavior, race, hate speech, and more. Look no further than Gary Marcus’ Substack to get a sense of the warped morality these LLMs can exhibit.
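To make the idea concrete, here is a minimal sketch of what an output guardrail can look like in code. The call_llm() helper, the blocklist, and the toy topic check are placeholders of my own; the major vendors rely on dedicated safety models and policies rather than anything this simple.

```python
# Minimal guardrail sketch: screen the model's draft response against a
# blocklist of disallowed topics before it reaches the user.
# call_llm() and classify_topics() are illustrative placeholders, not any
# vendor's actual API.

BLOCKED_TOPICS = ["violence", "hate speech", "criminal instructions"]

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call (e.g., a request to a hosted LLM API).
    raise NotImplementedError

def classify_topics(text: str) -> list[str]:
    # Stand-in for a moderation classifier; real systems use a trained
    # safety model rather than simple keyword matching.
    return [topic for topic in BLOCKED_TOPICS if topic in text.lower()]

def guarded_reply(prompt: str) -> str:
    draft = call_llm(prompt)
    if classify_topics(draft):
        # Refuse rather than return flagged content.
        return "Sorry, I can't help with that."
    return draft
```

The key design choice is simply that the system refuses instead of answering when the draft trips the filter; everything else is a detail of how the filter is built.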
Research Exposes Flaws in NLP Guardrails
Recent experiments have shown weaknesses in current guardrails. Researchers from Carnegie Mellon University and the Center for AI Safety in San Francisco conducted a study that revealed significant flaws in the systems produced by OpenAI, Google, and Anthropic.
For example, the researchers found they could break guardrails simply by appending a string of characters to a prompt. This and many other simple tactics circumvent the safety measures in place and allow the system to generate responses on unsavory topics.
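Conceptually, the attack pattern looks something like the sketch below. The send_to_model() helper and the suffix argument are placeholders; the real suffixes are produced by an automated optimization procedure rather than written by hand.

```python
# Rough sketch of the attack pattern the researchers describe: take a request
# the guardrails would normally refuse and append an automatically optimized
# string of characters to it. Illustrative only.

def send_to_model(prompt: str) -> str:
    # Stand-in for a call to a guarded LLM endpoint.
    raise NotImplementedError

def jailbreak_attempt(blocked_request: str, suffix: str) -> str:
    # The request alone is refused; the same request with the optimized
    # suffix appended may slip past the model's safety training.
    return send_to_model(blocked_request + " " + suffix)
```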
Even more unsettling, the researchers were able to produce jailbreak attempts automatically, unlocking a near-infinite supply of ways these systems can be exploited. As a result, there are concerns that LLMs will never be able to completely avoid off-the-rails behavior.
You need just a little imagination to picture the potential consequences. Maybe an AI-based learning program starts teaching schoolchildren new curse words. Or your cutting-edge AI customer service agent explains to your clients that they sometimes need to get violent. Lots of sticky situations are possible.
Benefits of LLMs
However, there is one counterpoint: the Internet creates tremendous value despite its shortcomings. It’s not realistic to assume that any powerful technological platform, whether the Internet, generative AI, or anything that comes in the future, will ever be 100% “clean.” That should certainly not impede our progress in evolving these systems and building impactful businesses around them. As always, there is a tradeoff among content control, innovation, and the utility of such LLMs. Open-source models, for example, can not only lead to innovation leaps but also pave the way for more loopholes and errors to be exploited.
Models will also take on increasingly niche roles within products, which can help minimize the risk of LLMs going haywire. For example, over time we may iron out the kinks to the point where we have a “kid-friendly” version of an LLM and a “regular” version, just as YouTube offers both YouTube Kids and the standard YouTube.
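A rough sketch of what that split might look like inside a product, with made-up model names and a hypothetical audience field:

```python
# Illustrative routing between differently scoped model variants, in the
# spirit of the YouTube / YouTube Kids split. Model names are fictional.

MODEL_BY_AUDIENCE = {
    "kids": "llm-kids-strict-filter",  # hypothetical tightly guarded variant
    "general": "llm-standard",         # hypothetical default variant
}

def pick_model(audience: str) -> str:
    # Fall back to the most restrictive option when the audience is unknown.
    return MODEL_BY_AUDIENCE.get(audience, MODEL_BY_AUDIENCE["kids"])

print(pick_model("kids"))     # llm-kids-strict-filter
print(pick_model("general"))  # llm-standard
```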
Risk of Using Generative AI in the Workplace
So, how is all of this relevant for you in the workplace? At the very least, protect yourself against litigation by explaining, in your publicly disclosed terms and conditions, your use of generative AI and its potential for errors. Others have argued that users should disclose the use of generative AI everywhere, including in any marketing slides, emails, social posts, assets, or transcripts where it is used. Personally, I find this overkill; it would be like disclosing that you used Google Search to help you create a slide deck or a report. But that’s a company culture decision, and it’s up to you.
Regulation on this topic has yet to come into full effect and will undoubtedly change the conversation about how to “grade” AI systems’ safety and their use in different environments. The EU is working hard to push the AI Act forward, and it may set a precedent for similar laws in other countries, including the U.S.
No technology is perfect, especially in its early days. As an early-stage founder, this is painfully clear to me. We just need to face the problems head-on and have tough, proactive conversations about fixing them.