We aim to build neural networks that are intrinsically robust against adversarial attacks. We focus on classifying images in real-world scenarios with complex backgrounds under unforeseen adversarial attacks. Previous defenses lack interpretability and have limited robustness against unforeseen attacks, failing to deliver trustworthiness to users. We will study Bayesian models, which are more interpretable and have intrinsic robustness. We will explore two directions: extending an existing Bayesian classifier with better models and building new Bayesian models from discriminative models.
- An Ju, UC Berkeley
- David Wagner, UC Berkeley
- Trevor Darrell, UC Berkeley
- Ahmad Beirami, FAIR
Artificial intelligence has become increasingly fundamental to our society: it is changing how people move, pay, receive government benefits, and more. It also introduces new threats that are unique to the complex neural network models underpinning modern artificial intelligence techniques. Specifically, neural networks are susceptible to adversarial perturbations: input modifications that do not change human perception but can alter a neural network's behavior. For example, neural networks that recognize traffic signs can misclassify under adversarial attacks; an attacker can therefore put stickers over traffic signs that cause autonomous cars to halt suddenly, potentially causing accidents. Unfortunately, studies have revealed that adversarial examples exist in a variety of tasks involving images, speech, and natural language. For example, chatbots, commonly used in AI-powered customer service, may be triggered to reply with racially offensive sentences; voice controls, such as those used by Amazon Echo, may be manipulated by attackers to execute a command without the user's knowledge, causing privacy and economic loss. In summary, adversarial examples greatly undermine the reliability and trustworthiness of intelligent systems. This proposal aims to improve the robustness of neural networks.
Most defenses proposed thus far target a specific type of attack and do not show intrinsic robustness, that is, robustness against all types of attacks. For example, adversarial training uses adversarial examples to train neural networks. However, researchers have found that adversarially trained models are more robust against the specific attack used at training time than against other attacks. Such models therefore do not have intrinsic robustness. Since real-world threats are unforeseen, the lack of intrinsic robustness casts a shadow on a model's trustworthiness.
One promising direction for intrinsic robustness is Bayesian models. Traditionally, classifiers are discriminative models, including the models used in adversarial training. Discriminative models are black-box models that convert a complex signal, such as a high-resolution image, into an estimate of the likelihood of each class; such models are hard to interpret and tend to give extreme estimates. Bayesian models, on the other hand, capture uncertainty better. As an example, Figure 1 illustrates that while discriminative classifiers will assign high confidence to images that do not contain an object from any candidate class, a Bayesian model can correctly handle such inputs and provide a classification more consistent with human perception. In the context of adversarial examples, Bayesian models may detect inputs that are adversarially perturbed and assign them a higher uncertainty. With careful design, the estimate of a sample's conditional likelihood is more stable for Bayesian models than for discriminative models. Therefore, Bayesian models are promising as a robust and trustworthy building block for future AI infrastructure.
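To make the contrast concrete, here is a minimal sketch of a generative classifier that applies Bayes' rule to class-conditional likelihoods. It is a toy illustration, not the ABS or E-ABS model: the isotropic Gaussian class models, the two-dimensional inputs, the class means, and the `reject_below` threshold are all assumptions chosen for the example. It shows that the posterior alone can still be confident on an input far from every class, while the marginal likelihood p(x) exposes that the input fits no class.

```python
import numpy as np

def log_gaussian(x, mean, var):
    """Log density of an isotropic Gaussian N(mean, var * I) at x."""
    d = x.shape[-1]
    sq_dist = np.sum((x - mean) ** 2, axis=-1)
    return -0.5 * (d * np.log(2.0 * np.pi * var) + sq_dist / var)

def classify(x, means, var, log_prior, reject_below):
    """Return (predicted class or None, posterior over classes, log p(x))."""
    # Class-conditional log-likelihoods log p(x | y) for every class y.
    log_lik = np.array([log_gaussian(x, m, var) for m in means])
    log_joint = log_lik + log_prior
    # Marginal log p(x) = logsumexp over classes, computed stably.
    top = np.max(log_joint)
    log_marginal = top + np.log(np.sum(np.exp(log_joint - top)))
    # Bayes' rule: p(y | x) = p(x, y) / p(x).
    posterior = np.exp(log_joint - log_marginal)
    # Abstain when the input is unlikely under every class model.
    if log_marginal < reject_below:
        return None, posterior, log_marginal
    return int(np.argmax(posterior)), posterior, log_marginal

# Hypothetical two-class problem with hand-picked parameters.
means = np.array([[-2.0, 0.0], [2.0, 0.0]])
var = 1.0
log_prior = np.log(np.array([0.5, 0.5]))
reject_below = -10.0  # assumed threshold, would be tuned on held-out data

in_dist = np.array([1.9, 0.1])    # close to the class-1 mean
far_away = np.array([40.0, 0.0])  # far from both classes

label_in, post_in, logp_in = classify(in_dist, means, var, log_prior, reject_below)
label_far, post_far, logp_far = classify(far_away, means, var, log_prior, reject_below)

print("in-distribution:", label_in, post_in, logp_in)
print("far away:       ", label_far, post_far, logp_far)
```

For the far-away input the posterior still puts almost all mass on one class, which is the overconfidence a purely discriminative readout would report; the very low `log p(x)` is what lets the generative model abstain.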
Bayesian models have already yielded some promising results for defending against adversarial attacks. Schott et al. have achieved state-of-the-art robustness on MNIST, an image classification dataset; their model, Analysis-by-synthesis (ABS), shows robustness against a wide range of attacks, suggesting that ABS's Bayesian approach can achieve intrinsic robustness against adversarial attacks. Furthermore, Golan et al.'s study, which compares ABS against several defenses that use discriminative models, suggests that ABS is more consistent with human perception. However, existing Bayesian models do not generalize to more challenging tasks. Fetaya et al. have pointed out that ABS cannot handle more challenging datasets such as CIFAR. Therefore, our research aims to address the problems of Bayesian methods on challenging tasks, targeting new state-of-the-art robustness on realistic datasets.
We have already made some progress on improving Bayesian models. Our model, E-ABS, has successfully extended Bayesian models to more realistic datasets, addressing some of the problems raised in Fetaya et al.'s work. E-ABS has set a new state of the art in robustness on a traffic sign classification benchmark and a street number recognition benchmark. Therefore, our continued work on improving Bayesian models, as explained in this proposal, holds promise for advancing the robustness and trustworthiness of neural networks.
Our project comprises two parts. First, we plan to address ABS's shortcomings by introducing a more principled way of estimating sample likelihoods. Second, we will combine Bayesian models and discriminative models, leading to a new paradigm for robust image classification.
We aim to achieve new state-of-the-art empirical robustness on CIFAR-10 with our generative classifiers and variational Bayes randomized smoothing models. Our goal is to deliver a model whose clean accuracy is comparable to that of discriminative models and that is robust against all types of attacks. CIFAR-10 is an image classification benchmark widely used to evaluate computer vision models. Achieving new state-of-the-art empirical robustness on CIFAR-10 would be a significant improvement over Schott et al.'s ABS and would show real promise that Bayesian models can work on complex signals.
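For readers unfamiliar with randomized smoothing, the sketch below shows the standard prediction step: classify many Gaussian-noised copies of an input and return the majority vote. It illustrates plain randomized smoothing only, not the variational Bayes variant proposed here, whose details this document does not specify. The `base_classifier` is a hypothetical stand-in, and `sigma` and `n_samples` are assumed values.

```python
import numpy as np

def base_classifier(x):
    """Hypothetical stand-in base classifier: class 1 if first coordinate > 0."""
    return (x[..., 0] > 0.0).astype(int)

def smoothed_predict(x, sigma, n_samples, num_classes, rng):
    """Majority vote of the base classifier over Gaussian-perturbed copies of x."""
    # Draw n_samples noise vectors with the same shape as x.
    noise = rng.normal(0.0, sigma, size=(n_samples,) + x.shape)
    votes = base_classifier(x + noise)
    # Count how often each class was predicted under noise.
    counts = np.bincount(votes, minlength=num_classes)
    return int(np.argmax(counts)), counts

rng = np.random.default_rng(0)
x = np.array([1.0, -0.5])
label, counts = smoothed_predict(x, sigma=0.25, n_samples=1000, num_classes=2, rng=rng)
print("smoothed label:", label, "vote counts:", counts)
```

The vote counts double as a rough confidence signal: an input whose votes are split across classes sits near a decision boundary of the smoothed classifier.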
- Contact An Ju at firstname.lastname@example.org