Himabindu Lakkaraju
Harvard University, United States
H-index: 34

Top articles of Himabindu Lakkaraju
More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness
arXiv preprint arXiv:2404.18870, 2024/4/29
Authors: Himabindu Lakkaraju (H-index: 18)

Quantifying uncertainty in natural language explanations of large language models
2024/4/18
Authors: Chirag Agarwal (H-index: 7), Himabindu Lakkaraju (H-index: 18)

Towards Safe and Aligned Large Language Models for Medicine
arXiv preprint arXiv:2403.03744, 2024/3/6

Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems
arXiv preprint arXiv:2402.17840, 2024/2/27

Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
arXiv preprint arXiv:2402.10376, 2024/2/16
Authors: Alex Oesterling (H-index: 0), Himabindu Lakkaraju (H-index: 18)

Opening the Black Box of Large Language Models: Two Views on Holistic Interpretability
arXiv preprint arXiv:2402.10688, 2024/2/16

Which models have perceptually-aligned gradients? An explanation via off-manifold robustness
Advances in Neural Information Processing Systems, 2024/2/13
Authors: Sebastian Bordt (H-index: 1), Himabindu Lakkaraju (H-index: 18)

Discriminative Feature Attributions: Bridging Post Hoc Explainability and Inherent Interpretability
Advances in Neural Information Processing Systems, 2024/2/13
Authors: Himabindu Lakkaraju (H-index: 18)

Post hoc explanations of language models can improve language models
arXiv preprint arXiv:2305.11426, 2023/5/19
Authors: Jiaqi Ma (H-index: 6), Dylan Slack (H-index: 3), Asma Ghandeharioun (H-index: 14), Sameer Singh (H-index: 1), Himabindu Lakkaraju (H-index: 18)

Understanding the Effects of Iterative Prompting on Truthfulness
arXiv preprint arXiv:2402.06625, 2024/2/9
Authors: Chirag Agarwal (H-index: 7), Himabindu Lakkaraju (H-index: 18)

Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models
arXiv preprint arXiv:2402.04614, 2024/2/7
Authors: Chirag Agarwal (H-index: 7), Himabindu Lakkaraju (H-index: 18)

OpenXAI: Towards a Transparent Evaluation of Model Explanations
Advances in Neural Information Processing Systems, 2022/12/6
Authors: Chirag Agarwal (H-index: 7), Martin Pawelczyk (H-index: 3), Isha Puri (H-index: 1), Marinka Zitnik (H-index: 25), Himabindu Lakkaraju (H-index: 18)

Consistent explanations in the face of model indeterminacy via ensembling
2023/6/9
Authors: Dan Ley (H-index: 0), Himabindu Lakkaraju (H-index: 18)

Investigating the Fairness of Large Language Models for Predictions on Tabular Data
arXiv preprint arXiv:2310.14607, 2023/10/23

In-context unlearning: Language models as few shot unlearners
arXiv preprint arXiv:2310.07579, 2023/10/11
Authors: Martin Pawelczyk (H-index: 3), Himabindu Lakkaraju (H-index: 18)

Word-level explanations for analyzing bias in text-to-image models
arXiv preprint arXiv:2306.05500, 2023/6/3
Authors: Alexander Lin (H-index: 25), Himabindu Lakkaraju (H-index: 18)

Are Large Language Models Post Hoc Explainers?
arXiv preprint arXiv:2310.05797, 2023/10/9

On the Trade-offs between Adversarial Robustness and Actionable Explanations
arXiv preprint arXiv:2309.16452, 2023/9/28
Authors: Chirag Agarwal (H-index: 7), Himabindu Lakkaraju (H-index: 18)

The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective
2023/5/23

Certifying LLM safety against adversarial prompting
arXiv preprint arXiv:2309.02705, 2023/9/6