Do Models like GPT-4 Behave Safely When Given the Ability to Act?: This AI Paper Introduces MACHIAVELLI Benchmark to Improve Machine Ethics and Build Safer Adaptive Agents

On Apr 12, 2023

Natural language processing is one area where AI systems are making rapid strides, and it is important that the models need to be rigorously tested and guided toward safer behavior to reduce deployment risks. Prior evaluation metrics for such sophisticated systems focused on measuring language comprehension or reasoning in vacuums. But now, models are being taught for actual, interactive work. This means that benchmarks need to evaluate how models perform in social settings.

Interactive agents can be put through their paces in text-based games. Agents need planning abilities and the ability to grasp the natural language to progress in these games. Agents’ immoral tendencies should be considered alongside their technical talents while setting benchmarks.

A new work by the University of California, Center For AI Safety, Carnegie Mellon University, and Yale University proposes the Measuring Agents’ Competence & Harmfulness In A Vast Environment of Long-horizon Language Interactions (MACHIAVELLI) benchmark. MACHIAVELLI is an advancement in evaluating an agent’s capacity for planning in naturalistic social settings. The setting is inspired by text-based Choose Your Own Adventure games available at choiceofgames.com, which actual humans developed. These games feature high-level decisions while giving agents realistic objectives while abstracting away low-level environment interactions.

🚀 Check Out 100’s AI Tools in AI Tools Club

The environment reports the degree to which agent acts are dishonest, lower utility, and seek power, among other behavioral qualities, to keep tabs on unethical behavior. The team achieves this by following the below-mentioned steps:

Operationalizing these behaviors as mathematical formulas
Densely annotating social notions in the games, such as characters’ wellbeing
Using the annotations and formulas to produce a numerical score for each behavior.

They demonstrate empirically that GPT-4 (OpenAI, 2023) is more effective at collecting annotations than human annotators.

Artificial intelligence agents face the same internal conflict as humans do. Like language models trained for next-token prediction often produce toxic text, artificial agents trained for goal optimization often exhibit immoral and power-seeking behaviors. Amorally trained agents may develop Machiavellian strategies for maximizing their rewards at the expense of others and the environment. By encouraging agents to act morally, this trade-off can be improved.

The team discovers that moral training (nudging the agent to be more ethical) decreases the incidence of harmful activity for language-model agents. Furthermore, behavioral regularization restricts undesirable behavior in both agents without substantially decreasing reward. This work contributes to the development of trustworthy sequential decision-makers.

The researchers try techniques like an artificial conscience and ethics prompts to control agents. Agents can be guided to display less Machiavellian behavior, although much progress remains possible. They advocate for more research into these trade-offs and emphasize expanding the Pareto frontier rather than chasing after limited rewards.

Check out the Paper. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 18k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

🚀 Check Out 100’s AI Tools in AI Tools Club

Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology(IIT), Bhubaneswar. She is a Data Science enthusiast and has a keen interest in the scope of application of artificial intelligence in various fields. She is passionate about exploring the new advancements in technologies and their real-life application.

🚀 JOIN the fastest ML Subreddit Community

Credit: Source link