Scientists develop AI monitoring agent to detect and stop harmful outputs


The monitoring system is designed to detect and thwart both prompt injection attacks and edge-case threats.

A team of researchers from artificial intelligence (AI) firm AutoGPT, Northeastern University and Microsoft Research has developed a tool that monitors large language models (LLMs) for potentially harmful outputs and prevents them from executing.

The agent is described in a preprint research paper titled “Testing Language Model Agents Safely in the Wild.” According to the research, the agent is flexible enough to monitor existing LLMs and can stop harmful outputs, such as code attacks, before they happen.

Per the research:

“Agent actions are audited by a context-sensitive monitor that enforces a stringent safety boundary to stop an unsafe test, with suspect behavior ranked and logged to be examined by humans.”
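The audit loop described in that quote can be illustrated with a minimal sketch. This is not the researchers' implementation: the `Monitor` class, the `keyword_scorer` function, the 0-to-1 safety score and the 0.5 threshold are all hypothetical names and values chosen for illustration, and the keyword check merely stands in for the LLM-based, context-sensitive scoring the paper describes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Monitor:
    """Hypothetical monitor: scores each agent action and logs suspect ones."""
    score_fn: Callable[[str], float]   # assumed scale: 0.0 (unsafe) to 1.0 (safe)
    threshold: float = 0.5             # assumed safety boundary
    log: List[Tuple[float, str]] = field(default_factory=list)

    def audit(self, action: str) -> bool:
        """Return True if the action may run; log it and return False otherwise."""
        score = self.score_fn(action)
        if score < self.threshold:
            self.log.append((score, action))
            return False
        return True

    def ranked_log(self) -> List[Tuple[float, str]]:
        """Suspect actions ordered from least safe to most safe, for human review."""
        return sorted(self.log, key=lambda entry: entry[0])


def keyword_scorer(action: str) -> float:
    """Toy stand-in for a learned scorer: flags a few risky substrings."""
    risky = ("rm -rf", "DROP TABLE")
    return 0.0 if any(token in action for token in risky) else 1.0


def run_test(actions: List[str], monitor: Monitor) -> List[str]:
    """Execute actions in order, stopping the test at the first unsafe one."""
    executed = []
    for action in actions:
        if not monitor.audit(action):
            break  # the unsafe action never runs and the test ends
        executed.append(action)
    return executed


monitor = Monitor(score_fn=keyword_scorer)
executed = run_test(["read notes.txt", "rm -rf /", "write report"], monitor)
```

In this sketch the second action falls below the threshold, so it is logged for review rather than executed, and the third action is never reached.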

The team writes that existing tools for monitoring LLM outputs for harmful interactions seemingly work well in laboratory settings, but when applied to testing models already in production on the open internet, they “often fall short of capturing the dynamic intricacies of the real world.”

The apparent reason is edge cases. Despite the best efforts of the most talented computer scientists, the idea that researchers can anticipate every possible harm vector before it occurs is widely considered unrealistic in the field of AI.

Even when the humans interacting with AI have the best intentions, unexpected harm can arise from seemingly innocuous prompts.

An illustration of the monitor in action. On the left, a workflow ending in a high safety rating. On the right, a workflow ending in a low safety rating. Source: Naihin et al., 2023

To train the monitoring agent, the researchers built a data set of nearly 2,000 safe human-AI interactions across 29 different tasks ranging from simple text-retrieval tasks and coding corrections all the way to developing entire webpages from scratch.


They also created a competing testing data set filled with manually created adversarial outputs, including dozens intentionally designed to be unsafe.

The data sets were then used to train a monitoring agent on OpenAI’s GPT-3.5 Turbo, a state-of-the-art system. The resulting agent can distinguish between innocuous and potentially harmful outputs with an accuracy of nearly 90%.
