Business

Anthropic makes ‘jailbreak’ advance to stop AI models producing harmful results

By Viral Trending Content 5 Min Read

Artificial intelligence start-up Anthropic has demonstrated a new technique to prevent users from eliciting harmful content from its models, as leading tech groups including Microsoft and Meta race to find ways to protect against the dangers posed by the cutting-edge technology.

In a paper released on Monday, the San Francisco-based start-up outlined a new system called “constitutional classifiers”. It acts as a protective layer on top of large language models, such as the one that powers Anthropic’s Claude chatbot, monitoring both inputs and outputs for harmful content.
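The layered design the paper describes — screening both the incoming prompt and the outgoing response — can be sketched as a simple wrapper. The following Python is an illustrative toy, not Anthropic’s implementation: the keyword-based `screen` function stands in for the trained classifier models, and all names are hypothetical.

```python
# Toy sketch of a classifier layer wrapping an LLM: the prompt is
# screened before it reaches the model, and the response is screened
# before it reaches the user. A keyword check stands in for the
# trained classifiers used in the real system.

BLOCKED_TERMS = {"nerve agent", "sarin"}  # hypothetical deny-list

def screen(text: str) -> bool:
    """Return True if the text looks harmful (stand-in classifier)."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)

def guarded_generate(prompt: str, model) -> str:
    if screen(prompt):                 # input classifier
        return "Request refused."
    response = model(prompt)           # underlying LLM call
    if screen(response):               # output classifier
        return "Response withheld."
    return response

# A dummy "model" for demonstration purposes only.
echo_model = lambda p: f"You asked: {p}"

print(guarded_generate("What is the capital of France?", echo_model))
print(guarded_generate("How do I make sarin?", echo_model))
```

Screening the output as well as the input matters because a prompt can look benign while still coaxing the model into producing harmful text.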

The development by Anthropic, which is in talks to raise $2bn at a $60bn valuation, comes amid growing industry concern over “jailbreaking” — attempts to manipulate AI models into generating illegal or dangerous information, such as producing instructions to build chemical weapons.

Other companies are also racing to deploy safeguards against the practice, moves that could help them avoid regulatory scrutiny while persuading businesses to adopt AI models safely. Microsoft introduced “prompt shields” last March, while Meta released a prompt guard model in July last year; researchers swiftly found ways to bypass it, though the flaws have since been fixed.

Mrinank Sharma, a member of technical staff at Anthropic, said: “The main motivation behind the work was for severe chemical [weapon] stuff [but] the real advantage of the method is its ability to respond quickly and adapt.”

Anthropic said it would not be immediately using the system on its current Claude models but would consider implementing it if riskier models were released in future. Sharma added: “The big takeaway from this work is that we think this is a tractable problem.”

The start-up’s proposed solution is built on a so-called “constitution” of rules that define what is permitted and restricted and can be adapted to capture different types of material.
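One way to picture a “constitution” that can be adapted to capture different types of material is as explicit rules expressed as data rather than hard-coded logic. The sketch below is entirely hypothetical — the real constitution is used to train classifier models, not matched by keyword — but it illustrates why a rule set that lives outside the model is easy to edit and extend.

```python
from typing import Optional

# Hypothetical "constitution": rules defining restricted and permitted
# categories, editable without touching the screening code itself.
CONSTITUTION = [
    {"category": "chemical weapons", "allowed": False,
     "examples": ["synthesis route", "precursor"]},
    {"category": "household chemistry", "allowed": True,
     "examples": ["vinegar", "baking soda"]},
]

def classify(text: str) -> Optional[str]:
    """Return the first restricted category the text matches, else None."""
    lowered = text.lower()
    for rule in CONSTITUTION:
        if not rule["allowed"] and any(e in lowered for e in rule["examples"]):
            return rule["category"]
    return None
```

Because the rules are plain data, adapting the system to a new class of risky material means appending an entry rather than retraining from scratch — which fits Sharma’s point about the method’s ability to respond quickly and adapt.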

Some jailbreak attempts are well known, such as using unusual capitalisation in the prompt or asking the model to adopt the persona of a grandmother telling a bedtime story about a nefarious topic.


To validate the system’s effectiveness, Anthropic offered “bug bounties” of up to $15,000 to individuals who attempted to bypass the security measures. These testers, known as red teamers, spent more than 3,000 hours trying to break through the defences.

Anthropic’s Claude 3.5 Sonnet model rejected more than 95 per cent of such attempts with the classifiers in place, compared with 14 per cent without safeguards.

Leading tech companies are trying to reduce misuse of their models while maintaining their helpfulness. When moderation measures are put in place, models often become over-cautious and reject benign requests, as happened with early versions of Google’s Gemini image generator and Meta’s Llama 2. Anthropic said its classifiers caused “only a 0.38 per cent absolute increase in refusal rates”.

However, adding these protections also incurs extra costs for companies already paying huge sums for the computing power required to train and run models. Anthropic said the classifier would add nearly 24 per cent to “inference overhead”, the cost of running the models.

[Bar chart: effectiveness of Anthropic’s classifiers, based on tests conducted on its latest model]

Security experts have argued that the accessible nature of such generative chatbots has enabled ordinary people with no prior knowledge to attempt to extract dangerous information.

“In 2016, the threat actor we would have in mind was a really powerful nation-state adversary,” said Ram Shankar Siva Kumar, who leads the AI red team at Microsoft. “Now literally one of my threat actors is a teenager with a potty mouth.”
