David Scott Krueger
Université de Montréal
H-index: 19
North America-Canada
Top articles of David Scott Krueger
Title | Journal | Author(s) | Publication Date |
---|---|---|---|
Affirmative Safety: An Approach to Risk Management for Advanced AI | Available at SSRN 4806274 | Akash Wasil Joshua Clymer David Krueger Emily Dardaman Simeon Campos | 2024/4/24 |
Visibility into AI Agents | arXiv preprint arXiv:2401.13138 | Alan Chan Carson Ezell Max Kaufmann Kevin Wei Lewis Hammond | 2024/1/23 |
Foundational challenges in assuring alignment and safety of large language models | arXiv preprint arXiv:2404.09932 | Usman Anwar Abulhair Saparov Javier Rando Daniel Paleka Miles Turpin | 2024/4/15 |
Safety Cases: Justifying the Safety of Advanced AI Systems | arXiv preprint arXiv:2403.10462 | Joshua Clymer Nick Gabrieli David Krueger Thomas Larsen | 2024/3/15 |
A Generative Model of Symmetry Transformations | arXiv preprint arXiv:2403.01946 | James Urquhart Allingham Bruno Kacper Mlodozeniec Shreyas Padhy Javier Antorán David Krueger | 2024/3/4 |
Thinker: Learning to Plan and Act | | Stephen Chung Ivan Anokhin David Krueger | 2023/7/27 |
Black-Box Access is Insufficient for Rigorous AI Audits | arXiv preprint arXiv:2401.14446 | Stephen Casper Carson Ezell Charlotte Siegmann Noam Kolt Taylor Lynn Curtis | 2024/1/25 |
Blockwise self-supervised learning at scale | arXiv preprint arXiv:2302.01647 | Shoaib Ahmed Siddiqui David Krueger Yann LeCun Stéphane Deny | 2023/2/3 |
Open problems and fundamental limitations of reinforcement learning from human feedback | arXiv preprint arXiv:2307.15217 | Stephen Casper Xander Davies Claudia Shi Thomas Krendl Gilbert Jérémy Scheurer | 2023/7/27 |
BaDLoss: Backdoor Detection via Loss Dynamics | | Neel Alex Shoaib Ahmed Siddiqui Amartya Sanyal David Krueger | 2023/10/13 |
Goal Misgeneralization as Implicit Goal Conditioning | | Diego Dorn Neel Alex David Krueger | 2023/11/27 |
On the fragility of learned reward functions | arXiv preprint arXiv:2301.03652 | Lev McKinney Yawen Duan David Krueger Adam Gleave | 2023/1/9 |
Mechanistic mode connectivity | | Ekdeep Singh Lubana Eric J Bigelow Robert P Dick David Krueger Hidenori Tanaka | 2023/7/3 |
Towards Meta-Models for Automated Interpretability | | Lauro Langosco Neel Alex William Baker David John Quarel Herbie Bradley | 2023/10/13 |
Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models | arXiv preprint arXiv:2312.14751 | Alan Chan Ben Bucknall Herbie Bradley David Krueger | 2023/12/22 |
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks | arXiv preprint arXiv:2311.12786 | Samyak Jain Robert Kirk Ekdeep Singh Lubana Robert P Dick Hidenori Tanaka | 2023/11/21 |
Harms from increasingly agentic algorithmic systems | | Alan Chan Rebecca Salganik Alva Markelius Chris Pang Nitarshan Rajkumar | 2023/6/12 |
Reward model ensembles help mitigate overoptimization | arXiv preprint arXiv:2310.02743 | Thomas Coste Usman Anwar Robert Kirk David Krueger | 2023/10/4 |
Characterizing manipulation from AI systems | EAAMO 2023 | Micah Carroll* Alan Chan* Henry Ashton David Krueger | 2023/3/16 |
(Out-of-context) Meta-learning in Language Models | | Dmitrii Krasheninnikov Egor Krasheninnikov Bruno Kacper Mlodozeniec David Krueger | 2023/12/12 |