Robustness for Deep Learning/Ethical AI Through Human Value Modeling

Despite recent advances in adversarial-training-based defenses, deep neural networks are still vulnerable to adversarial attacks outside the perturbation type they are trained to be robust against. In this project, we propose Protector, a two-stage pipeline to improve robustness against multiple perturbation types. We demonstrate that Protector outperforms prior adversarial-training-based defenses by over 5% when tested against the union of L1, L2 and L∞ attacks.


This work is in collaboration with Pratyush Maini (undergraduate student at IIT Delhi) and Prof. Bo Li (UIUC).


In this project, we study defenses against adversarial examples for multiple perturbation types. Recent works have proposed defenses to improve the robustness of a single model against the union of multiple perturbation types. However, when evaluated against each individual attack, these methods still suffer significant trade-offs compared to models trained specifically to be robust against that perturbation type.

To address this challenge, we introduce the problem of categorizing adversarial examples based on their Lp perturbation types. Based on our analysis, we propose Protector, a two-stage pipeline to improve robustness against multiple perturbation types. Instead of training a single predictor, Protector first categorizes the perturbation type of the input, and then uses a predictor specifically trained against the predicted perturbation type to make the final prediction.
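The two-stage design can be illustrated with a minimal sketch. This is not the project's implementation: the function `protector_predict`, the type keys `"l1"` and `"linf"`, and the toy stand-ins for the perturbation classifier and the specialized predictors are all hypothetical, chosen only to show the dispatch logic. In practice both stages would be trained neural networks.

```python
from typing import Callable, Dict, Sequence

Input = Sequence[float]
Label = int


def protector_predict(
    x: Input,
    perturbation_classifier: Callable[[Input], str],
    specialists: Dict[str, Callable[[Input], Label]],
) -> Label:
    """Two-stage prediction: categorize the perturbation type, then dispatch."""
    # Stage 1: predict which perturbation type was applied to x.
    ptype = perturbation_classifier(x)
    # Stage 2: use the predictor trained to be robust against that type.
    return specialists[ptype](x)


def toy_classifier(x: Input) -> str:
    # Hypothetical rule for illustration only: inputs with many non-zero
    # coordinates are labeled "linf", sparse inputs are labeled "l1".
    nonzero = sum(1 for v in x if abs(v) > 1e-6)
    return "linf" if nonzero > len(x) // 2 else "l1"


# Hypothetical specialists: constant predictors standing in for robust models.
toy_specialists: Dict[str, Callable[[Input], Label]] = {
    "l1": lambda x: 0,
    "linf": lambda x: 1,
}

sparse_input = [0.0, 0.0, 0.0, 0.9]
dense_input = [0.1, 0.1, 0.1, 0.1]
sparse_label = protector_predict(sparse_input, toy_classifier, toy_specialists)
dense_label = protector_predict(dense_input, toy_classifier, toy_specialists)
```

Keeping the specialists in a dictionary keyed by perturbation type means a defense for one type can be swapped out without touching the others.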

We validate our approach both theoretically and empirically. First, we present theoretical analysis showing that, for benign samples with the same ground truth label, the distributions obtained by adding different types of perturbations are highly distinct and can therefore be separated. Further, we show that there exists a natural tension between attacking the top-level perturbation classifier and attacking the second-level predictors: strong attacks against the second-level predictors make it easier for the perturbation classifier to identify the adversarial perturbation type, while fooling the perturbation classifier requires weaker (or less representative) attacks against the second-level predictors. As a result, even an imperfect perturbation classifier is sufficient to significantly improve the overall robustness of the model to multiple perturbation types.

Empirically, we compare Protector to state-of-the-art defenses against multiple perturbations on MNIST and CIFAR-10. Protector outperforms prior approaches by over 5% against the union of the L1, L2 and L∞ attacks. Past work has focused on the worst-case metric against all attacks, but on average these defenses suffer significant trade-offs against individual attacks. Across the suite of 25 different attacks tested, Protector's average improvement over the state-of-the-art baseline defense is ~15% on both MNIST and CIFAR-10. Moreover, by adding random noise to the model input at test time, we further increase the tension between attacking the top-level and second-level components, which yields additional robustness against adaptive attackers. Finally, Protector provides a modular way to integrate and update defenses against a single perturbation type.
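To make the two evaluation views concrete, the sketch below computes the union (worst-case) accuracy and the average accuracy from per-attack correctness flags, and wraps a predictor with test-time input noise. The boolean flags are made-up illustrative values, not results from this project. The helper names, the Gaussian noise distribution and the `sigma` parameter are assumptions, since the text above does not specify the noise used.

```python
import random
from typing import Callable, Dict, List, Sequence


def union_and_average_accuracy(correct: Dict[str, List[bool]]) -> Dict[str, float]:
    """correct[attack][i] is True if sample i is still classified correctly
    under that attack. Returns union, average and per-attack accuracies."""
    attacks = list(correct)
    n = len(correct[attacks[0]])
    # Per-attack robust accuracy.
    per_attack = {a: sum(correct[a]) / n for a in attacks}
    # Union (worst case): a sample counts only if it survives every attack.
    union = sum(all(correct[a][i] for a in attacks) for i in range(n)) / n
    # Average of the per-attack accuracies.
    average = sum(per_attack.values()) / len(attacks)
    return {"union": union, "average": average, **per_attack}


def with_test_time_noise(
    predict: Callable[[Sequence[float]], int],
    sigma: float,
    rng: random.Random,
) -> Callable[[Sequence[float]], int]:
    """Wrap a predictor so each input is perturbed with random noise first.
    Gaussian noise with standard deviation sigma is an assumption here."""
    def noisy_predict(x: Sequence[float]) -> int:
        noisy_x = [v + rng.gauss(0.0, sigma) for v in x]
        return predict(noisy_x)
    return noisy_predict


# Made-up flags for four samples under three attacks (illustration only).
example_flags = {
    "l1": [True, True, False, True],
    "l2": [True, False, True, True],
    "linf": [True, True, True, False],
}
example_metrics = union_and_average_accuracy(example_flags)
# Each attack alone leaves 3 of 4 samples correct (0.75 on average),
# but only the first sample survives all three attacks (union 0.25).
```

The gap between the two numbers in this toy example shows why a defense can look different under the worst-case metric than under the per-attack average.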