arXiv:2007.06381 [cs.LG]
A simple defense against adversarial attacks on heatmap explanations
Published 2020-07-13 (Version 1)
As machine learning models are used for increasingly sensitive applications, we rely on interpretability methods to verify that no discriminating attributes were used for classification. A potential concern is so-called "fair-washing": manipulating a model so that the features actually used are hidden and more innocuous features appear important instead. In our work we present an effective defense against such adversarial attacks on neural networks. Through a simple aggregation of multiple explanation methods, the resulting heatmap explanations become robust against manipulation. This holds even when the attacker has exact knowledge of the model weights and the explanation methods used.
Comments: Accepted at 2020 Workshop on Human Interpretability in Machine Learning (WHI)
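A minimal sketch of the aggregation idea described in the abstract, assuming a PyTorch classifier: the two gradient-based explanation methods and the plain averaging of normalized heatmaps shown here are illustrative assumptions, not necessarily the paper's exact procedure.

```python
import torch

def saliency(model, x):
    """Absolute input gradient of the top predicted class."""
    x = x.clone().requires_grad_(True)
    score = model(x).max(dim=1).values.sum()
    grad, = torch.autograd.grad(score, x)
    return grad.abs()

def input_x_gradient(model, x):
    """Input * gradient attribution of the top predicted class."""
    x = x.clone().requires_grad_(True)
    score = model(x).max(dim=1).values.sum()
    grad, = torch.autograd.grad(score, x)
    return (x * grad).abs()

def aggregated_heatmap(model, x, methods=(saliency, input_x_gradient)):
    """Average the per-method heatmaps after normalizing each to unit sum,
    so that manipulating any single explanation method has limited effect
    on the combined heatmap."""
    maps = []
    for explain in methods:
        h = explain(model, x)
        maps.append(h / (h.sum() + 1e-12))
    return torch.stack(maps).mean(dim=0)
```

The intuition is that an attacker who fine-tunes the model to fool one attribution method must simultaneously fool all aggregated methods, which is considerably harder.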
Related articles:
arXiv:2003.03778 [cs.LG] (Published 2020-03-08)
Adversarial Attacks on Probabilistic Autoregressive Forecasting Models
arXiv:2006.06861 [cs.LG] (Published 2020-06-11)
Robustness to Adversarial Attacks in Learning-Enabled Controllers
arXiv:1802.06552 [cs.LG] (Published 2018-02-19)
Are Generative Classifiers More Robust to Adversarial Attacks?