Marwa Abdulhai

I am a PhD student at UC Berkeley in the Berkeley Artificial Intelligence Research (BAIR) lab, advised by Sergey Levine. Previously, I completed my master's and undergraduate degrees at MIT, where I worked with Jonathan P. How.

My research aims to build AI systems that are not only capable and intelligent, but also safe, cooperative, and aligned with human values. I am currently a Cooperative AI PhD Fellow, and I am grateful to be supported by OpenAI Research, the Quad Fellowship, and the UC Berkeley AI Policy Hub. Specifically, I am interested in:

  • AI Safety & Ethics: Designing algorithms to evaluate and mitigate deceptive behavior and align AI behavior with human values.
  • Reinforcement Learning (RL): Developing multi-turn RL methods for dialogue, specifically focusing on negotiation and cooperative settings.
  • Language Models & Interaction: Investigating how LLMs can model humans more faithfully to build better simulators for reinforcement learning algorithms and social science studies.

Email  /  Github  /  Twitter  /  LinkedIn

profile photo

Preprints

Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning.
Marwa Abdulhai, Ryan Cheng, Donovan Clay, Sergey Levine, Natasha Jaques.
Pre-print.
Paper

Large Language Models (LLMs) are increasingly used to simulate human users in interactive settings such as therapy, education, and social role-play. While these simulations enable scalable training and evaluation of AI agents, off-the-shelf LLMs often drift from their assigned personas, contradict earlier statements, or abandon role-appropriate behavior. We introduce a unified framework for evaluating and improving consistency in LLM-generated dialogue, reducing inconsistency by over 55% and yielding more coherent, faithful, and trustworthy simulated users.

Deception in Dialogue: Evaluating and Mitigating Deceptive Behavior in Large Language Models.
Marwa Abdulhai, Ryan Cheng, Aryansh Shrivastava, Natasha Jaques, Yarin Gal, Sergey Levine.
Pre-print.
Paper

Large Language Models (LLMs) interact with hundreds of millions of people worldwide, powering applications such as customer support, education, and healthcare. However, their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. We systematically investigate the extent to which LLMs engage in deception within dialogue and propose the belief misalignment metric to measure deception. Benchmarking 8 state-of-the-art models indicates that LLMs naturally exhibit deceptive behaviors 24.4% of the time, even when prompted with seemingly benign objectives. When explicitly prompted to deceive, LLMs increase their deceptiveness to 43% of turns. We further explore fine-tuning LLMs with reinforcement learning to reduce deceptive behaviors, achieving a 15% reduction compared to other fine-tuned models.

Selected Publications

LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models.
Marwa Abdulhai, Isadora White, Charlie Snell, Charles Sun, Joey Hong, Yuexiang Zhai, Kelvin Xu, Sergey Levine.
ICLR GenAI4DM Workshop 2024; ICML 2025
Paper / Code / Website

We introduce LMRL-Gym, a benchmark and toolkit for developing reinforcement learning algorithms that operate with large language models (LLMs) across eight multi-turn dialogue and text-game tasks. This work demonstrates training LLMs to plan and act strategically, ask clarifying questions, and solve long-horizon tasks, bridging reactive generation and goal-driven interaction.

Defining Deception in Decision Making.
Marwa Abdulhai, Micah Carroll, Justin Svegliato, Anca Dragan, Sergey Levine.
AAMAS 2024
Paper

This paper formalizes deception in decision-making processes and proposes principled methods to detect and evaluate deceptive trajectories in multi-agent systems, using game-theoretic and reinforcement learning approaches.

Moral Foundations of Large Language Models.
Marwa Abdulhai, Gregory Serapio-García, Clement Crepy, Daria Valter, John Canny, Natasha Jaques.
AAAI-23 R2HCAI (Best Paper); EMNLP 2024
Paper / Code / Website

Moral foundations theory (MFT) is a psychological assessment tool that decomposes human moral reasoning into five factors, including care/harm, liberty/oppression, and sanctity/degradation. As large language models (LLMs) are trained on datasets collected from the internet, they may reflect the biases present in such corpora. This paper uses MFT as a lens to analyze whether popular LLMs have acquired a bias towards a particular set of moral values. We analyze known LLMs, find that they exhibit particular moral foundations, and show how these relate to human moral foundations and political affiliations. We also measure the consistency of these biases, i.e., whether they vary strongly depending on how the model is prompted. Finally, we show that we can adversarially select prompts that encourage the model to exhibit a particular set of moral foundations, and that this can affect the model's behavior on downstream tasks.

Virtual Personas for Language Models via an Anthology of Backstories.
Suhong Moon*, Marwa Abdulhai*, Minwoo Kang*, Joseph Suh*, John Canny.
EMNLP 2024
Paper / Code

Large language models (LLMs) are trained from vast repositories of text authored by millions of distinct authors, reflecting an enormous diversity of human traits. In this work, we introduce "Anthology", a method for conditioning LLMs to particular virtual personas by harnessing open-ended life narratives, which we refer to as "backstories." We show that our methodology enhances the consistency and reliability of experimental outcomes while ensuring better representation of diverse sub-populations. Across three nationally representative human surveys conducted as part of Pew Research Center's American Trends Panel (ATP), we demonstrate that Anthology achieves up to 18% improvement in matching the response distributions of human respondents and 27% improvement in consistency metrics.

Other Publications

Evaluating Generalization Capabilities of LLM Agents in Mixed-Motive Scenarios Using Concordia.
Chandler Smith, Marwa Abdulhai, Manfred Diaz, [80 authors], Dylan Hadfield-Menell, Natasha Jaques, José Hernández-Orallo, Joel Leibo.
Pre-print.
Contest / Code

Large language model (LLM) agents have demonstrated impressive capabilities for social interaction and are increasingly being deployed in situations where they might engage with both human and artificial agents. These interactions represent a critical frontier for LLM-based agents, yet existing evaluation methods fail to measure how well these capabilities generalize to novel social situations. In this paper, we introduce a method for evaluating the ability of LLM-based agents to cooperate in zero-shot, mixed-motive environments using Concordia, a natural language multi-agent simulation environment. This work introduces an approach to measuring human-appropriate cooperative intelligence, emphasizing an agent's ability to identify and exploit opportunities for mutual gain across diverse partners and contexts.

Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward.
Yanming Wan, Jiaxing Wu, Marwa Abdulhai, Lior Shani, Natasha Jaques.
2nd Workshop on Models of Human Feedback for AI Alignment at ICML 2025.
Paper

Effective conversational agents like large language models (LLMs) must personalize their interactions to adapt to user preferences, personalities, and attributes across diverse domains like education and healthcare. Current methods such as Reinforcement Learning from Human Feedback (RLHF) often prioritize helpfulness and safety but fall short of fostering truly empathetic, adaptive, and personalized dialogues. Existing personalization approaches typically rely on extensive user history, limiting their effectiveness for new or context-limited users. To address these limitations, we propose leveraging a user model to incorporate a curiosity-based intrinsic reward into multi-turn RLHF. We show improved generalization compared to standard multi-turn RLHF while maintaining conversation quality. Our method offers a promising path toward more personalized, adaptive, and engaging conversational agents.

Blind Faith? User Preference and Expert Assessment of AI-Generated Religious Content.
Sabriya Alam, Marwa Abdulhai, Niloufar Salehi.
FAccT 2025
Paper

The increasing use of AI tools like ChatGPT for religious information-seeking among Muslims raises critical questions about the intersection of technology, faith, and expertise. This mixed-methods study investigates user and expert evaluations of AI-generated religious content, drawing on quantitative and qualitative analyses of user preferences and expert feedback gathered through interviews, surveys, and expert consultations. Our findings reveal a significant disconnect: despite expressing distrust in AI for religious guidance and stating a preference for scholarly answers, Muslim users overwhelmingly preferred AI-generated responses to Islamic questions in blind evaluations, favoring them in 81.3% of cases.

Context-Specific Representation Abstraction for Deep Option Learning.
Marwa Abdulhai, Dong-Ki Kim, Matthew Riemer, Miao Liu, Gerald Tesauro, Jonathan P. How.
AAAI-22
Paper / Code / Video

We introduce Context-Specific Representation Abstraction for Deep Option Learning (CRADOL), a new framework that considers both temporal abstraction and context-specific representation abstraction to effectively reduce the size of the search over policy space.

A Policy Gradient Algorithm for Learning to Learn in Multiagent Reinforcement Learning.
Dong-Ki Kim, Miao Liu, Matthew Riemer, Chuangchuang Sun, Marwa Abdulhai, Golnaz Habibi, Sebastian Lopez-Cot, Gerald Tesauro, Jonathan P. How.
ICML-21, AAAI-20 Symposium
Paper / Code / Video

We develop a novel meta-multiagent policy gradient theorem that directly accounts for the non-stationary policy dynamics inherent to multiagent settings. Our meta-agent considers both its own non-stationary policy dynamics and those of other agents in order to adapt quickly.

Teaching

Graduate Student Instructor for CS188: Artificial Intelligence
Spring 2019
Course Website

Head Graduate Student Instructor for CS285: Deep Reinforcement Learning
Fall 2022
Course Website

Teaching Assistant for 6.141: Robotics Science & Systems
Spring 2019
Assignments / Medium Article

Student Instructor for MIT Refugee Action Hub
January 2019
Program Page

Lab Assistant for 6.009: Fundamentals of Programming in Python
December 2018
Course Page

Template from this website.