Responsible AI in Industry (Tutorials at AAAI 2021, FAccT 2021, and WWW 2021)

  1. Responsible AI in Industry: Practical Challenges and Lessons Learned. Tutorial, 2021. Krishnaram Kenthapadi (Amazon AWS AI), Ben Packer (Google AI), Mehrnoosh Sameki (Microsoft Azure), Nashlie Sephus (Amazon AWS AI) https://sites.google.com/view/ResponsibleAITutorial
  2. Massachusetts Group Insurance Commission (1997): anonymized medical history of state employees. William Weld vs. Latanya Sweeney: Latanya Sweeney (MIT grad student) bought the Cambridge voter roll for $20 and linked it to the released records of the governor — born July 31, 1945, resident of 02138.
  3. 64% Uniquely identifiable with ZIP + birth date + gender (in the US population) Golle, “Revisiting the Uniqueness of Simple Demographics in the US Population”, WPES 2006
  4. • Ethical challenges posed by AI systems • Inherent biases present in society • Reflected in training data • AI/ML models prone to amplifying such biases Algorithmic Bias
  5. Laws against Discrimination Immigration Reform and Control Act Citizenship Rehabilitation Act of 1973; Americans with Disabilities Act of 1990 Disability status Civil Rights Act of 1964 Race Age Discrimination in Employment Act of 1967 Age Equal Pay Act of 1963; Civil Rights Act of 1964 Sex And more...
  6. Fairness Privacy Transparency Explainability
  7. Motivation & Business Opportunities Regulatory. We need to understand why the ML model made a given prediction and also whether the prediction it made was free from bias, both in training and at inference. Business. Providing explanations to internal teams (loan officers, customer service rep, compliance teams) and end users/customers Data Science. Improving models through better feature engineering and training data generation, understanding failure modes of the model, debugging model predictions, etc.
  8. © 2020, Amazon Web Services, Inc. or its Affiliates. Scaling Fairness, Explainability & Privacy across the AWS ML Stack (Amazon SageMaker). AI services — vision, speech, text, search, chatbots, personalization, forecasting, fraud, contact centers, industrial AI, code and DevOps, healthcare AI, anomaly detection: Amazon Rekognition, Amazon Polly, Amazon Transcribe + Medical, Amazon Lex, Amazon Personalize, Amazon Forecast, Amazon Comprehend + Medical, Amazon Textract, Amazon Kendra, Amazon CodeGuru, Amazon Fraud Detector, Amazon Translate, Amazon DevOps Guru, Voice ID for Amazon Connect, Contact Lens, Amazon Monitron, AWS Panorama + Appliance, Amazon Lookout for Vision, Amazon Lookout for Equipment, Amazon HealthLake, Amazon Lookout for Metrics. ML services (SageMaker Studio IDE): label data, aggregate & prepare data, store & share features, AutoML, Spark/R, detect bias, visualize in notebooks, pick algorithm, train models, tune parameters, debug & profile, deploy in production, manage & monitor, CI/CD, human review, model management for edge devices, SageMaker JumpStart. Frameworks & infrastructure: Deep Learning AMIs & Containers, GPUs & CPUs, Elastic Inference, Trainium, Inferentia, FPGA, Deep Graph Library.
  9. LinkedIn operates the largest professional network on the Internet Tell your story 740M members 55M+ companies are represented on LinkedIn 90K schools listed (high school & college) 36K skills listed 14M+ open jobs on LinkedIn Jobs 280B Feed updates
  10. Tutorial Outline • Fairness-aware ML: An overview • Explainable AI: An overview • Privacy-preserving ML: An overview • Responsible AI tools • Case studies • Key takeaways
  11. Fairness-aware ML: An Overview Nashlie Sephus, PhD Applied Science Manager, AWS AI
  12. • ML Fairness Considerations • ML and Humans • What is fairness/inclusion for ML? • Where May Biases Occur? • Testing Techniques with Face Experiments • Takeaways Outline 13
  13. Product Introspection (1): Make Your Key Choices Explicit [Mitchell et al., 2018] Goals Decision Prediction Profit from loans Whether to lend Loan will be repaid Justice, Public safety Whether to detain Crime committed if not detained • Goals are ideally measurable • What are your non-goals? • Which decisions are you not considering? • What is the relationship between Prediction and Decision?
  14. Product Introspection (2): Identify Potential Harms • What are the potential harms? • Applicants who would have repaid are not given loans • Convicts who would not commit a crime are locked up. • Are there also longer term harms? • Applicants are given loans, then go on to default, harming their credit score • Are some harms especially bad?
  15. Seek out Diverse Perspectives • Fairness Experts • User Researchers • Privacy Experts • Legal • Social Science Backgrounds • Diverse Identities • Gender • Sexual Orientation • Race • Nationality • Religion
  16. AI/ML and Humans 17
  17. (image slide)
  18. (image slide)
  19. What ML Is Not: error-free (no system is perfect); 100% confident; intended to replace human judgement
  20. Fairness Techniques in Faces 24
  21. Detect presence of a face in an image or a video. Face Detection 25
  22. A system to determine the gender, age, emotion, presence of facial hair, etc. from a detected face. Face Analysis
  23. A system to determine a detected face's identity by matching it against a database of faces and their associated identities. Face Recognition
  24. Estimation of the confidence or certainty of any prediction Expressed in the form of a probability or confidence score Confidence Score 28
  25. Face Recognition: Common Causes of Errors ILLUMINATION VARIANCE POSE / VIEWPOINT AGING EXPRESSION / STYLE OCCLUSION Lighting, camera controls like exposure, shadows, highlights Face pose, camera angles Natural aging, artificial makeup Face expression like laughing, facial hair such as a beard, hair style Part of the face hidden as in group pictures 29
  26. Where Can Biases Exist? 30
  27. The PPB dataset [Buolamwini & Gebru 2018], GenderShades.Org (figure: example African and Scandinavian faces, with proportions 6.3% and 20.8%)
  28. Darker Skinned Female 32
  29. Racial Comparisons of Datasets [FairFace] 33
  30. Black Male
  31. Latino Hispanic Male 35
  32. Southeast Asian Female 36
  33. [Nvidia] 37
  34. Launch with Confidence: Testing for Bias • How will you know if users are being harmed? • How will you know if harms are unfairly distributed? • Detailed testing practices are often not covered in academic papers • Discussing testing requirements is a useful focal point for cross-functional teams
  35. Reproducibility - Notebook Experiments 39
  36. PPB2 Data Analytics 40
  37. (chart slide)
  38. Gender Classification – PPB2 42
  39. (chart slide)
  40. (chart slide)
  41. (chart slide)
  42. (chart slide)
  43. Short Hair 47
  44. Gender Classification w.r.t. Hair Lengths – PPB2 48
  45. FairFace Dataset Analytics 49
  46. Gender classification – FairFace 50
  47. Model Predictions Evaluate for Inclusion - Confusion Matrix
  48. Model Predictions Positive Negative Evaluate for Inclusion - Confusion Matrix
  49. Model Predictions Positive Negative ● Exists ● Predicted True Positives ● Doesn’t exist ● Not predicted True Negatives Evaluate for Inclusion - Confusion Matrix
  50. Evaluate for Inclusion - Confusion Matrix. True Positives: exists & predicted; False Negatives: exists & not predicted; False Positives: doesn't exist & predicted; True Negatives: doesn't exist & not predicted.
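A minimal sketch of how the confusion-matrix cells above can be computed per group to evaluate inclusion; the column names (group, y_true, y_pred) and the toy data are illustrative assumptions, not part of the tutorial.

```python
# Hypothetical sketch: per-group confusion-matrix metrics for inclusion testing.
import pandas as pd

def group_confusion_metrics(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for group, g in df.groupby("group"):
        tp = int(((g.y_true == 1) & (g.y_pred == 1)).sum())  # exists & predicted
        fn = int(((g.y_true == 1) & (g.y_pred == 0)).sum())  # exists & not predicted
        fp = int(((g.y_true == 0) & (g.y_pred == 1)).sum())  # doesn't exist & predicted
        tn = int(((g.y_true == 0) & (g.y_pred == 0)).sum())  # doesn't exist & not predicted
        rows.append({
            "group": group,
            "tpr": tp / max(tp + fn, 1),        # recall per group
            "fpr": fp / max(fp + tn, 1),
            "precision": tp / max(tp + fp, 1),
        })
    return pd.DataFrame(rows)

# Toy usage: compare error rates across two groups.
df = pd.DataFrame({
    "group": ["A"] * 4 + ["B"] * 4,
    "y_true": [1, 1, 0, 0, 1, 1, 0, 0],
    "y_pred": [1, 0, 0, 0, 1, 1, 1, 0],
})
print(group_confusion_metrics(df))
```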
  51. Efficient Testing for Bias • Development teams are under multiple constraints • Time • Money • Human resources • Access to data • How can we efficiently test for bias? • Prioritization • Strategic testing
  52. Choose your evaluation metrics in light of acceptable tradeoffs between False Positives and False Negatives
  53. Takeaways • Testing for blindspots amongst intersectionality is key. • Taking into account confidence scores/thresholds and error bars when measuring for biases is necessary. • Representation matters. • Transparency, reproducibility, and education can promote change. • Confidence in your product's fairness requires fairness testing • Fairness testing has a role throughout the product iteration lifecycle • Contextual concerns should be used to prioritize fairness testing 57
  54. Explainable AI: An Overview Krishnaram Kenthapadi Amazon AWS AI
  55. What is Explainable AI? Today's opaque AI (data → opaque AI → AI product; prediction, recommendation) leaves users confused: Why did you do that? Why did you not do that? When do you succeed or fail? How do I correct an error? Explainable AI (data → explainable AI → explainable AI product; prediction + explanation + feedback) gives clear & transparent predictions: I understand why; I understand why not; I know why you succeed or fail; I understand, so I trust you.
  56. - Humans may have follow-up questions - Explanations cannot answer all users’ concerns Weld, D., and Gagan Bansal. "The challenge of crafting intelligible intelligence." Communications of the ACM (2018). Example of an End-to-End XAI System
  57. How to Explain? Accuracy vs. Explainability. Learning techniques span a spectrum from high accuracy / low explainability to lower accuracy / higher interpretability: neural nets (CNN, GAN, RNN), ensemble methods (random forest, XGB), statistical models (AOG, SVM), graphical models (Bayesian belief nets, SLR, CRF, HBN, MLN), Markov models, decision trees, and linear models (non-linear, polynomial, and quasi-linear functions). Challenges: supervised and unsupervised learning. Approach: representation learning, stochastic selection. Output: correlation, not causation.
  58. Text Tabular Images On the Role of Data in XAI
  59. Explainable AI Methods. Approach 1: Post-hoc explain a given AI model ● Individual prediction explanations in terms of input features, influential examples, concepts, local decision rules ● Global prediction explanations of the entire model in terms of partial dependence plots, global feature importance, global decision rules. Approach 2: Build an interpretable model ● Logistic regression, Decision trees, Decision lists and sets, Generalized Additive Models (GAMs)
  60. Credit: Slide based on https://twitter.com/chandan_singh96/status/1138811752769101825 Integrated Gradients Opaque, complex model
  61. Explaining Model Predictions: Credit Lending 65 Fair lending laws [ECOA, FCRA] require credit decisions to be explainable
  62. Feature Attribution Problem Attribute a model’s prediction on an input to features of the input Examples ● Attribute a lending model’s prediction to its features ● Attribute an object recognition model’s prediction to its pixels ● Attribute a text sentiment model’s prediction to individual words Several attribution methods: ● Ablations ● Gradient based methods (specific to differentiable models) ● Score back-propagation based methods (specific to neural networks) ● Game theory based methods (Shapley values) 66
  63. Applications of Attributions ● Debugging model predictions E.g., Attributing an image misclassification to the pixels responsible for it ● Generating an explanation for the end-user E.g., Expose attributions for a lending prediction to the end-user ● Analyzing model robustness E.g., Craft adversarial examples using weaknesses surfaced by attributions ● Extracting rules from the model E.g., Combine attributions to craft rules (pharmacophores) capturing the prediction logic of a drug screening network
  64. Next few slides We will cover the following attribution methods** ● Ablations ● Gradient based methods (specific to differentiable models) ● Score Backpropagation based methods (specific to NNs) We will also discuss game theory (Shapley value) in attributions **Not a complete list! See Ancona et al. [ICML 2019], Guidotti et al. [arxiv 2018] for a comprehensive survey 68
  65. Ablations Drop each feature and attribute the change in prediction to that feature Pros: ● Simple and intuitive to interpret Cons: ● Unrealistic inputs ● Improper accounting of interactive features ● Can be computationally expensive 69
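As a concrete illustration of the ablation idea above, here is a minimal sketch (not from the tutorial); the predict function and the zero baseline used for "dropping" a feature are assumptions.

```python
# Illustrative sketch of ablation ("leave-one-feature-out") attribution.
import numpy as np

def ablation_attributions(predict_fn, x, baseline_value=0.0):
    """Attribute the change in prediction to each feature by replacing it with a baseline value."""
    x = np.asarray(x, dtype=float)
    base_score = predict_fn(x[None, :])[0]
    attributions = np.zeros_like(x)
    for i in range(len(x)):
        x_ablate = x.copy()
        x_ablate[i] = baseline_value                       # drop feature i (may create unrealistic inputs)
        attributions[i] = base_score - predict_fn(x_ablate[None, :])[0]
    return attributions                                    # one model call per feature: can be expensive

# Toy usage with a linear "model": attributions recover coefficient * feature value.
w = np.array([2.0, -1.0, 0.5])
predict = lambda X: X @ w
print(ablation_attributions(predict, np.array([1.0, 3.0, 2.0])))
```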
  66. Feature*Gradient Attribution to a feature is feature value times gradient, i.e., xi* 𝜕y/𝜕xi ● Gradient captures sensitivity of output w.r.t. feature ● Equivalent to Feature*Coefficient for linear models ○ First-order Taylor approximation of non-linear models ● Popularized by SaliencyMaps [NeurIPS 2013], Baehrens et al. [JMLR 2010] 70 Gradients in the vicinity of the input seem like noise?
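A small sketch of the Feature*Gradient rule above using PyTorch autograd; the two-feature sigmoid model is a toy assumption, not a model from the tutorial.

```python
# Sketch of Feature*Gradient attribution (x_i * dy/dx_i) on a tiny differentiable model.
import torch

x = torch.tensor([1.5, -2.0], requires_grad=True)
w, b = torch.tensor([0.8, 0.3]), torch.tensor(0.1)
y = torch.sigmoid(w @ x + b)          # differentiable model output

y.backward()                          # gradient of output w.r.t. input features
attributions = x.detach() * x.grad    # feature value times gradient
print(attributions)                   # reduces to feature*coefficient only for linear models
```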
  67. Local linear approximations can be too local (figure: "fireboat-ness" score of the image vs. input intensity from 0.0 to 1.0 — interesting gradients at low intensities, uninteresting gradients where the score saturates)
  68. Score Back-Propagation based Methods Re-distribute the prediction score through the neurons in the network ● LRP [JMLR 2017], DeepLift [ICML 2017], Guided BackProp [ICLR 2014] Easy case: Output of a neuron is a linear function of previous neurons (i.e., ni = ⅀ wij * nj) e.g., the logit neuron ● Re-distribute the contribution in proportion to the coefficients wij 72 Image credit heatmapping.org
  69. Score Back-Propagation based Methods Re-distribute the prediction score through the neurons in the network ● LRP [JMLR 2017], DeepLift [ICML 2017], Guided BackProp [ICLR 2014] Tricky case: Output of a neuron is a non-linear function, e.g., ReLU, Sigmoid, etc. ● Guided BackProp: Only consider ReLUs that are on (linear regime), and which contribute positively ● LRP: Use first-order Taylor decomposition to linearize the activation function ● DeepLift: Distribute the activation difference relative to a reference point in proportion to edge weights Image credit heatmapping.org
  70. Score Back-Propagation based Methods Re-distribute the prediction score through the neurons in the network ● LRP [JMLR 2017], DeepLift [ICML 2017], Guided BackProp [ICLR 2014] Pros: ● Conceptually simple ● Methods have been empirically validated to yield sensible result Cons: ● Hard to implement, requires instrumenting the model ● Often breaks implementation invariance Think: F(x, y, z) = x * y *z and G(x, y, z) = x * (y * z) Image credit heatmapping.org
  71. Baselines and additivity ● When we decompose the score via backpropagation, we imply a normative alternative called a baseline ○ “Why Pr(fireboat) = 0.91 [instead of 0.00]” ● Common choice is an informationless input for the model ○ E.g., All black or all white image for image models ○ E.g., Empty text or zero embedding vector for text models ● Additive attributions explain F(input) - F(baseline) in terms of input features
  72. Another approach: gradients at many points — scale the input from a baseline to the actual input and average the gradients of the scaled inputs (figure: score vs. intensity from 0.0 to 1.0, with interesting gradients at low intensities and uninteresting gradients in the saturated region)
  73. Integrated Gradients [ICML 2017]: integrate the gradients along a straight-line path from baseline to input, IG(input, base) ::= (input − base) * ∫₀¹ ∇F(𝛂·input + (1−𝛂)·base) d𝛂 (figure: original image and its Integrated Gradients attribution)
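A minimal sketch of the integral above via a Riemann-sum approximation; the toy differentiable model, baseline, and step count are assumptions, not the paper's setup.

```python
# Minimal Integrated Gradients sketch (Riemann-sum approximation of the path integral).
import torch

def integrated_gradients(predict_fn, x, baseline, steps=50):
    # Interpolate along the straight-line path from baseline to input.
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1)
    path = baseline + alphas * (x - baseline)          # (steps, n_features)
    path.requires_grad_(True)
    predict_fn(path).sum().backward()                  # gradients at each scaled input
    avg_grad = path.grad.mean(dim=0)                   # approximates the integral of gradients
    return (x - baseline) * avg_grad                   # IG = (input - baseline) * average gradient

# Toy usage: a small nonlinear scalar model; IG attributions sum to ~ F(input) - F(baseline).
model = lambda X: torch.sigmoid(X @ torch.tensor([1.0, -2.0]))
x, baseline = torch.tensor([2.0, 1.0]), torch.zeros(2)
ig = integrated_gradients(model, x, baseline)
print(ig, ig.sum().item(), (model(x[None]) - model(baseline[None])).item())
```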
  74. 78 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | Feature Attributions using Shapley Values Coalition Game ● Players collaborating to generate some payoff ● Shapley value: Fair way to quantify each player’s contribution to the game’s payoff 78
  75. 79 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | Shapley Value: Intuition 79
  76. Shapley Value [Annals of Mathematics Studies, 1953]: a fair way to quantify each player's contribution to the game's payoff, where f is the payoff of the game, M the set of all players, S a subset of players, and i a specific player. The Shapley value averages player i's marginal contribution (the coalition contribution with and without this player) over all player orderings and subsets: φᵢ(f) = Σ_{S ⊆ M∖{i}} [ |S|! (|M|−|S|−1)! / |M|! ] · ( f(S ∪ {i}) − f(S) )
  77. 81 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | Shapley Value Justification Shapley values are unique under four simple axioms ● Dummy: If a player never contributes to the game then it must receive zero attribution ● Efficiency: Attributions must add to the total payoff ● Symmetry: Symmetric players must receive equal attribution ● Linearity: Attribution for the (weighted) sum of two games must be the same as the (weighted) sum of the attributions for each of the games 81
  78. 82 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | Explaining Model Predictions using Shapley Values f Game payoff Game theory All players Subset of players Specific player Instance being explained Model prediction All features Subset of features Specific feature Instance being explained M S i x Machine learning https://github.com/slundberg/shap Lundberg et al: NeurIPS (2017)
  79. Explaining Model Predictions: SHAP Algorithm — how it works: φᵢ(f, x) = Σ_{S ⊆ M∖{i}} [ |S|! (|M|−|S|−1)! / |M|! ] · ( f_x(S ∪ {i}) − f_x(S) ), an average over all feature orderings & subsets. f – model prediction; M – all features; S – subset of features; i – specific feature; x – instance being explained. https://github.com/slundberg/shap Lundberg et al., NeurIPS 2017
  80. Explaining Model Predictions: SHAP Algorithm — different solutions for runtime and algorithmic challenges: exact computation takes exponential time, and it is unclear how to run a model with missing features.
  81. Explaining Model Predictions: SHAP Algorithm — Shap.KernelExplainer (open-source SHAP implementation; all models; explanations): addresses exponential time by sampling subsets (defaults to 2|M| + 2048) and missing features by filling values from a background dataset (training set, median of the dataset, k-means representatives, or an all-black image). https://github.com/slundberg/shap Lundberg et al., NeurIPS 2017
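A short sketch of the sampling-based KernelExplainer usage described above, following the open-source shap package (https://github.com/slundberg/shap); the logistic-regression model and synthetic data are stand-ins, not the tutorial's example.

```python
# Model-agnostic Shapley-value explanations with the open-source `shap` package.
import numpy as np
import shap
from sklearn.linear_model import LogisticRegression

X = np.random.RandomState(0).normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X, y)

background = shap.sample(X, 50)                           # background dataset for "missing" features
explainer = shap.KernelExplainer(model.predict_proba, background)
shap_values = explainer.shap_values(X[:1], nsamples=200)  # sampled subsets instead of exponential enumeration
print(shap_values)                                        # per-class feature attributions for the first instance
```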
  82. Baselines (or Norms) are essential to explanations [Kahneman-Miller 86] ● E.g., A man suffers from indigestion. Doctor blames it on a stomach ulcer. Wife blames it on eating turnips. Both are correct relative to their baselines. ● The baseline may also be an important analysis knob. Attributions are contrastive, whether we think about it or not. Lesson learned: Baselines are important
  83. Some limitations and caveats for attributions
  84. Some things that are missing: ● Feature interactions (ignored or averaged out) ● What training examples influenced the prediction (training agnostic) ● Global properties of the model (prediction-specific) An instance where attributions are useless: ● A model that predicts TRUE when there are even number of black pixels and FALSE otherwise Attributions don’t explain everything
  85. Attributions are for human consumption Naive scaling of attributions from 0 to 255 Attributions have a large range and long tail across pixels After clipping attributions at 99% to reduce range ● Humans interpret attributions and generate insights ○ Doctor maps attributions for x-rays to pathologies ● Visualization matters as much as the attribution technique
  86. Privacy-preserving ML: An Overview Krishnaram Kenthapadi Amazon AWS AI
  87. What is Privacy? • Right of/to privacy • “Right to be let alone” [L. Brandeis & S. Warren, 1890] • “No one shall be subjected to arbitrary interference with [their] privacy, family, home or correspondence, nor to attacks upon [their] honor and reputation.” [The United Nations Universal Declaration of Human Rights] • “The right of a person to be free from intrusion into or publicity concerning matters of a personal nature” [Merriam-Webster] • “The right not to have one's personal matters disclosed or publicized; the right to be left alone” [Nolo’s Plain-English Law Dictionary]
  88. Data Privacy (or Information Privacy) • “The right to have some control over how your personal information is collected and used” [IAPP] • “Privacy has fast-emerged as perhaps the most significant consumer protection issue—if not citizen protection issue—in the global information economy” [IAPP]
  89. Data Privacy vs. Security • Data privacy: use & governance of personal data • Data security: protecting data from malicious attacks & the exploitation of stolen data for profit • Security is necessary, but not sufficient for addressing privacy.
  90. Data Privacy: Technical Problem. Given a dataset with sensitive personal information, how can we compute and release functions of the dataset while protecting individual privacy? Credit: Kobbi Nissim
  91. A History of Privacy Failures … Credit: Kobbi Nissim,Or Sheffet
  92. Lessons Learned … • Attacker’s advantage: Auxiliary information; high dimensionality; enough to succeed on a small fraction of inputs; active; observant … • Unanticipated privacy failures from new attack methods • Need for rigorous privacy notions & techniques
  93. Threat Models
  94. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Threat Models
  95. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Threat Models
  96. Threat Models User Access Only • Users store their data • Noisy data or analytics transmitted Trusted Curator • Stored by organization • Managed only by a trusted curator/admin • Access only to noisy analytics or synthetic data External Threat • Stored by organization • Organization has access • Only privacy enabled models deployed
  97. Differential Privacy
  98. Curator Defining Privacy
  99. Defining Privacy 103 Curator Curator + your data - your data
  100. Differential Privacy 104 Databases D and D′ are neighbors if they differ in one person’s data. Differential Privacy: The distribution of the curator’s output M(D) on database D is (nearly) the same as M(D′). Curator + your data - your data Dwork, McSherry, Nissim, Smith [TCC 2006] Curator
  101. (ε, 𝛿)-Differential Privacy: The distribution of the curator’s output M(D) on database D is (nearly) the same as M(D′). Differential Privacy 105 Curator Parameter ε quantifies information leakage ∀S: Pr[M(D)∊S] ≤ exp(ε) ∙ Pr[M(D′)∊S]+𝛿. Curator Parameter 𝛿 gives some slack Dwork, Kenthapadi, McSherry, Mironov, Naor [EUROCRYPT 2006] + your data - your data Dwork, McSherry, Nissim, Smith [TCC 2006]
  102. Differential Privacy: Random Noise Addition. If the ℓ₁-sensitivity of f : D → ℝⁿ is max_{D,D′} ||f(D) − f(D′)||₁ = s, then adding Laplace noise to the true output, f(D) + Laplaceⁿ(s/ε), offers (ε, 0)-differential privacy. Dwork, McSherry, Nissim, Smith [TCC 2006]
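A minimal sketch of the Laplace mechanism above applied to a counting query; the toy dataset and the choice of ε are illustrative assumptions.

```python
# Laplace mechanism: add Laplace(s/ε) noise to a query of L1-sensitivity s.
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    scale = sensitivity / epsilon                 # larger sensitivity or smaller ε => more noise
    return true_value + np.random.laplace(loc=0.0, scale=scale)

ages = np.array([34, 45, 29, 52, 41])
true_count = int((ages > 40).sum())               # counting queries have L1-sensitivity 1
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
print(true_count, round(noisy_count, 2))          # the released value satisfies (0.5, 0)-DP
```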
  103. Examples of Attacks • Membership inference attacks • Reconstruction attacks • Model inversion attacks • Different attacks correspond to different threat models
  104. Privacy-preserving ML Techniques / Tools • Differentially private model training • Differentially private stochastic gradient descent • PATE • Differentially private synthetic data generation • Federated learning • Trusted execution environments (secure enclaves) • Secure multiparty computation • Homomorphic encryption • Different techniques applicable under different threat models • May need a hybrid approach
  105. AI Fairness and Transparency Tools Mehrnoosh Sameki Microsoft Azure
  106. Microsoft
  107. Machine Learning Transparency and Fairness
  108. Interpretability
  109. Understand and debug your model (https://github.com/interpretml). Glassbox models — model types: linear models, decision trees, decision rules, Explainable Boosting Machines. Blackbox models — model formats: Python models using the scikit predict convention, Scikit, Tensorflow, Pytorch, Keras; explainers: SHAP, LIME, Global Surrogate, Feature Permutation. Packages: Interpret (glassbox and blackbox interpretability methods for tabular data), Interpret-text (interpretability methods for text data), Interpret-community (community-driven interpretability techniques for tabular data), Azureml-interpret (AzureML SDK wrapper for Interpret and Interpret-community), DiCE (Diverse Counterfactual Explanations).
  110. Interpretability Approaches
  111. Models designed to be interpretable; lossless explainability. Decision trees, rule lists, linear models, … (illustration: a small decision tree — Fever? Internal bleeding? → stay home / go to hospital)
  112. Explain any ML system; approximate explainability. Perturb inputs → model → analyze → explanation. Methods: SHAP, LIME, partial dependence, sensitivity analysis.
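For the glassbox route above, a minimal sketch using the open-source interpret package (https://github.com/interpretml); the synthetic data is an assumption used only for illustration.

```python
# Glassbox interpretability with an Explainable Boosting Machine (interpret package).
import numpy as np
from interpret.glassbox import ExplainableBoostingClassifier

X = np.random.RandomState(0).normal(size=(300, 4))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

ebm = ExplainableBoostingClassifier().fit(X, y)   # interpretable by construction
global_exp = ebm.explain_global()                 # global, lossless per-feature explanations
local_exp = ebm.explain_local(X[:5], y[:5])       # explanations for individual predictions
# For opaque models, perturbation-based explainers (SHAP, LIME) give approximate explanations instead.
```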
  113. Fairness Useful links: • AI Show • Tutorial Video • Customer Highlight
  114. Fairness in AI: avoiding negative outcomes of AI systems for different groups of people. There are many ways that an AI system can behave unfairly: a model for screening loan or job applications might be much better at picking good candidates among white men than among other groups; a voice recognition system might fail to work as well for women as it does for men.
  115. https://github.com/fairlearn/fairlearn Assessing unfairness in your model. Fairness Assessment: use common fairness metrics and an interactive dashboard to assess which groups of people may be negatively impacted. Model formats: Python models using the scikit predict convention, Scikit, Tensorflow, Pytorch, Keras. Metrics: 15+ common group fairness metrics. Model types: classification, regression. Unfairness Mitigation: use state-of-the-art algorithms to mitigate unfairness in your classification and regression models.
  116. Fairness Assessment Input Selections Sensitive attribute Performance metric Assessment Results Disparity in performance Disparity in predictions Mitigation Algorithms Post-processing algorithm Reductions Algorithm
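A short sketch of such an assessment with the open-source fairlearn API (MetricFrame); the toy labels, predictions, and sensitive feature below are illustrative assumptions, not data from the case studies.

```python
# Group-wise fairness assessment with fairlearn's MetricFrame.
from fairlearn.metrics import MetricFrame, selection_rate
from sklearn.metrics import accuracy_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
sensitive = ["F", "F", "F", "M", "M", "M", "M", "F"]   # sensitive attribute per example

mf = MetricFrame(
    metrics={"accuracy": accuracy_score, "recall": recall_score, "selection_rate": selection_rate},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sensitive,
)
print(mf.by_group)          # disparity in performance and predictions per group
print(mf.difference())      # largest gap between groups for each metric
```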
  117. Putting fairness monitoring in production with ICU models. Customer: Philips (Philips Healthcare Informatics); Industry: Healthcare; Size: 80,000+ employees; Country: Netherlands; Products and services: Microsoft Azure DevOps, Microsoft Azure Databricks, MLflow. Situation: Philips Healthcare Informatics helps ICUs benchmark their performance (e.g., mortality rate). They create quarterly benchmark reports that compare actual performance vs. performance predicted by ML models, with models trained on the largest ICU dataset in the USA: 400+ ICUs, 6M+ patient stays, billions of vital signs & lab tests. Deploying ICU models responsibly, Philips needed a scalable, reliable, repeatable, and responsible way to bring ML models into production. Solution: Philips Healthcare used Fairlearn to check whether their ICU models perform similarly for patients with different ethnicities, gender identities, etc. Microsoft CSE (led by Tempest Van Schaik) collaborated with Philips to build a solution using Azure DevOps pipelines, Azure Databricks, and MLflow, building a pipeline that makes fairness monitoring routine: checking the fairness of predictions for patients of different genders, ethnicities, and medical conditions using Fairlearn metrics. Fairness analysis helped show that Philips' predictive model performs better than industry-standard ICU models; standard model predictions for a patient differ depending on how the ICU documented their test results.
  118. Situation: Solution: Impact: “Azure Machine Learning and its Fairlearn capabilities offer advanced fairness and explainability that have helped us deploy trustworthy AI solutions for our customers, while enabling stakeholder confidence and regulatory compliance.” —Alex Mohelsky, Partner and Advisory Data, Analytic, and AI Leader, EY Canada Customer: EY Industry: Partner Professional Services Size: 10,000+ employees Country: United Kingdom Products and services: Microsoft Azure Microsoft Azure Machine Learning Read full story here Organizations won’t fully embrace AI until they trust it. EY wanted to help its customers embrace AI to help them better understand their customers, identify fraud and security breaches sooner, and make loan decisions faster and more efficiently. The company developed its EY Trusted AI Platform, which uses Microsoft Azure Machine Learning capabilities to assess and mitigate unfairness in machine learning models. Running on Azure, the platform uses Fairlearn and InterpretML, open-source capabilities in Azure Machine Learning. When EY tested Fairlearn with real mortgage data, it reduced the accuracy disparity between men and women approved or denied loans from 7 percent to less than 0.5 percent. Through this platform, EY helps customers and regulators alike gain confidence in AI and machine learning.
  119. Reception & Adoption: educational materials are key for adoption of fairness toolkits — manuals, case studies, white papers [Lee & Singh, 2020]
  120. • Interpretability at training time • Combination of glass-box models and black-box explainers • Auto reason code generation for local predictions • Ability to cross reference to other techniques to ensure stability and consistency in results H2O
  121. IBM Open Scale Goal: to provide AI operations team with a toolkit that allows for: • Monitoring and re-evaluating machine learning models after deployment
  122. IBM Open Scale Goal: to provide AI operations team with a toolkit that allows for: • Monitoring and re-evaluating machine learning models after deployment • ACCURACY • FAIRNESS • PERFORMANCE
  123. IBM Open Scale
  124. IBM Open Scale
  125. IBM Open Scale Fairness Accuracy Performance
  126. IBM Open Scale
  127. AI Fairness 360 Datasets Toolbox Fairness metrics (30+) Bias mitigation algorithms (9+) Guidance Industry-specific tutorials
  128. AI Fairness 360 Datasets Toolbox Fairness metrics (30+) Bias mitigation algorithms (9+) Guidance Industry-specific tutorials Pre-processing algorithm: a bias mitigation algorithm that is applied to training data In-processing algorithm: a bias mitigation algorithm that is applied to a model during its training Post-processing algorithm: a bias mitigation algorithm that is applied to predicted labels
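As an example of the pre-processing category, a hedged sketch using the open-source aif360 package's Reweighing algorithm; the toy DataFrame and attribute names ("sex", "label") are assumptions for illustration, not from the tutorial.

```python
# Pre-processing bias mitigation with AIF360's Reweighing.
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.algorithms.preprocessing import Reweighing

df = pd.DataFrame({
    "feature": [0.2, 0.5, 0.1, 0.9, 0.4, 0.8],
    "sex":     [0,   0,   1,   1,   0,   1],     # protected attribute (1 = privileged, as an assumption)
    "label":   [0,   1,   0,   1,   1,   1],
})
dataset = BinaryLabelDataset(df=df, label_names=["label"], protected_attribute_names=["sex"])

rw = Reweighing(unprivileged_groups=[{"sex": 0}], privileged_groups=[{"sex": 1}])
dataset_transf = rw.fit_transform(dataset)        # reweights training data before model training
print(dataset_transf.instance_weights)            # weights that equalize group/label proportions
```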
  129. What If Tool Goal: Code-free probing of machine learning models • Feature perturbations (what if scenarios) • Counterfactual example analysis • [Classification] Explore the effects of different classification thresholds, taking into account constraints such as different numerical fairness metrics.
  130. What If Tool
  131. Datasheets for Datasets [Gebru et al., 2018] • Better data-related documentation • Datasheets for datasets: every dataset, model, or pre-trained API should be accompanied by a data sheet that documents its • Creation • Intended uses • Limitations • Maintenance • Legal and ethical considerations • Etc.
  132. Model Cards for Model Reporting[Mitchell et al., 2018]
  133. Fact Sheets [Arnold et al., 2019] • Is distinguished from “model cards” and “datasheets” in that the focus is on the final AI service: • What is the intended use of the service output? • What algorithms or techniques does this service implement? • Which datasets was the service tested on? (Provide links to datasets that were used for testing, along with corresponding datasheets.) • Describe the testing methodology. • Describe the test results. • Etc.
  134. Responsible AI Case Studies at Amazon Krishnaram Kenthapadi Amazon AWS AI
  135. 142 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | 142 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | Detect bias in ML models and understand model predictions Amazon SageMaker Clarify
  136. 143 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | Predictive Maintenance Manufacturing, Automotive, IoT Demand Forecasting Retail, Consumer Goods, Manufacturing Fraud Detection Financial Services, Online Retail Credit Risk Prediction Financial Services, Retail Extract and Analyze Data from Documents Healthcare, Legal, Media/Ent, Education Computer Vision Healthcare, Pharma, Manufacturing Autonomous Driving Automotive, Transportation Personalized Recommendations Media & Entertainment, Retail, Education Churn Prediction Retail, Education, Software & Internet https://aws.amazon.com/sagemaker/getting-started Amazon SageMaker Customer ML Use cases
  137. 144 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | Bias and Explainability: Challenges 1 Without detection, it is hard to know if bias has entered an ML model: • Imbalances may be present in the initial dataset • Bias may develop during training • Bias may develop over time after model deployment 2 Machine learning models are often complex & opaque, making explainability critical: • Regulations may require companies to be able to explain model predictions • Internal stakeholders and customers may need explanations for model behavior • Data science teams can improve models if they understand model behavior
  138. 145 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | Amazon SageMaker Clarify Detect bias in ML models and understand model predictions Detect bias during data preparation Identify imbalances in data Evaluate the degree to which various types of bias are present in your model Check your trained model for bias Understand the relative importance of each feature to your model’s behavior Explain overall model behavior Understand the relative importance of each feature for individual inferences Explain individual predictions Provide alerts and detect drift over time due to changing real-world conditions Detect drift in bias and model behavior over time Generated automated reports Produce reports on bias and explanations to support internal presentations
  139. 146 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | SageMaker Clarify works across the ML lifecycle Collect and prepare training data Train and tune model Evaluate and qualify model Deploy model in production Monitor model in production Measure Bias Metrics Measure and Tune Bias Metrics Measure Explainability Metrics Catalog Model Metrics Measure Bias Metrics Measure Explainability Metrics Monitor Bias Metric Drift Monitor Explainability Drift SageMaker Data Wrangler SageMaker Training Autopilot Hyperparameter Tuning SageMaker Processing SageMaker Hosting SageMaker Model Monitor
  140. 147 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | How SageMaker Clarify works Amazon SageMaker Clarify
  141. 148 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | SageMaker Clarify – Detect Bias During Data Preparation Bias report in SageMaker Data Wrangler
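A hedged sketch of how a pre-training bias analysis like the one above can be launched with the SageMaker Python SDK (sagemaker.clarify); the S3 paths, column names, and IAM role are placeholders, not values from the tutorial.

```python
# Hedged sketch: run a SageMaker Clarify pre-training bias job with placeholder inputs.
from sagemaker import Session, clarify

session = Session()
processor = clarify.SageMakerClarifyProcessor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",    # placeholder role
    instance_count=1,
    instance_type="ml.c5.xlarge",
    sagemaker_session=session,
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/train.csv",           # placeholder paths
    s3_output_path="s3://my-bucket/clarify-output",
    label="approved",                                        # placeholder label column
    headers=["approved", "age", "income", "gender"],
    dataset_type="text/csv",
)
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],
    facet_name="gender",                                     # attribute to check for imbalance
)

# Computes pre-training bias metrics (e.g., class imbalance) on the dataset.
processor.run_pre_training_bias(data_config=data_config, data_bias_config=bias_config)
```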
  142. 149 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | SageMaker Clarify – Check Your Trained Model for Bias Bias report in SageMaker Experiments
  143. 150 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | SageMaker Clarify – Monitor Your Model for Bias Drift Bias Drift in SageMaker Model Monitor
  144. 151 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | SageMaker Clarify – Understand Your Model Model Explanation in SageMaker Experiments
  145. 152 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | SageMaker Clarify – Monitor Your Model for Drift in Behavior Explainability Drift in SageMaker Model Monitor
  146. 153 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | Case Study & Demo: https://youtu.be/cQo2ew0DQw0
  147. 154 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | SageMaker Clarify Use Cases Regulatory Compliance Internal Reporting Operational Excellence Customer Service
  148. 155 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | Best Practices • Fairness as a Process: • The notions of bias and fairness are highly application dependent and the choice of the attribute(s) for which bias is to be measured, as well as the choice of the bias metrics, may need to be guided by social, legal, and other non-technical considerations. • Building consensus and achieving collaboration across key stakeholders (such as product, policy, legal, engineering, and AI/ML teams, as well as end users and communities) is a prerequisite for the successful adoption of fairness-aware ML approaches in practice. • Fairness and explainability considerations may be applicable during each stage of the ML lifecycle.
  149. 156 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | Fairness and Explainability by Design in the ML Lifecycle
  150. 157 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | Additional Pointers For more information on Amazon SageMaker Clarify, please refer to: • https://aws.amazon.com/sagemaker/clarify • Amazon Science / AWS Articles • https://aws.amazon.com/blogs/aws/new-amazon-sagemaker-clarify-detects-bias-and-increases-the-transparency-of-machine-learning-models • https://www.amazon.science/latest-news/how-clarify-helps-machine-learning-developers-detect-unintended-bias • Technical paper: Fairness Measures for Machine Learning in Finance • https://github.com/aws/amazon-sagemaker-clarify Acknowledgments: Amazon SageMaker Clarify core team, Amazon AWS AI team, and partners across Amazon
  151. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Trademark Amazon SageMaker Debugger Debug and profile ML model training and get real-time insights
  152. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Trademark Why debugging and profiling Training bugs Large compute instances Long training times Training ML models is difficult and compute intensive
  153. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Trademark SageMaker Debugger Real-time monitoring Relevant data capture Automatic error detection SageMaker Studio integration Debug data while training is ongoing Zero code change Persistent in your S3 bucket Built-in and custom rules Early termination Alerts about rule status Save time and cost Find issues early Accelerate prototyping Detect performance bottlenecks View suggestions on resolving bottlenecks, Interactive visualizations Monitor utilization Profile by step or time duration Right size instance Improve utilization Reduce cost System resource usage Time spent by training operations
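A hedged sketch of attaching built-in Debugger rules to a training job with the SageMaker Python SDK; the estimator settings below (image URI, role, instance type) are placeholders, not values from the tutorial.

```python
# Hedged sketch: built-in SageMaker Debugger rules attached to a training job.
from sagemaker.debugger import Rule, ProfilerRule, rule_configs
from sagemaker.estimator import Estimator

rules = [
    Rule.sagemaker(rule_configs.loss_not_decreasing()),          # built-in training rule
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),       # system/profiling report
]

estimator = Estimator(
    image_uri="<training-image-uri>",                            # placeholder
    role="<execution-role-arn>",                                 # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    rules=rules,                                                 # zero code change in the training script
)
# estimator.fit("s3://my-bucket/training-data")                  # rules run alongside the training job
```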
  154. © 2020, Amazon Web Services, Inc. or its Affiliates. Amazon Trademark SageMaker Debugger architecture: training in progress, analysis in progress, customer's S3 bucket, Amazon SageMaker, CloudWatch Event, Amazon SageMaker Studio visualization, SageMaker notebook. Actions: stop the training; analyze using the Debugger SDK; visualize tensors using managed TensorBoard.
  155. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Trademark Supported Frameworks Framework Versions TensorFlow 1.15, 2.1 MXNet 1.6 PyTorch 1.4, 1.5 XGBoost 0.90-2, 1.0-1 Zero code change
  156. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Trademark SageMaker Debugger Benefits Valuable insights into model training Pinpoints issues with low level profiling Helps to improve model accuracy Can lead to significant cost savings and faster model convergence
  157. © 2020, Amazon Web Services, Inc. or its Affiliates. Amazon Trademark Deployment Result: Optimizing Visual Search Model https://aws.amazon.com/blogs/machine-learning/autodesk-optimizes-visual-similarity-search-model-in-fusion-360-with-amazon-sagemaker-debugger/ Training Model Architecture Model Deployment Autodesk optimizes visual similarity search model in Fusion 360 with SageMaker Debugger
  158. © 2020, Amazon Web Services, Inc. or its Affiliates. Amazon Trademark https://www.amazon.science/publications/amazon-sagemaker-debugger-a-system-for-real-time-insights-into-machine-learning-model-training
  159. 166 © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | Additional Pointers For more information on Amazon SageMaker Debugger, please refer to: • https://aws.amazon.com/sagemaker/debugger • AWS Articles • https://aws.amazon.com/blogs/aws/amazon-sagemaker-debugger-debug-your-machine-learning-models • https://aws.amazon.com/blogs/machine-learning/detecting-hidden-but-non-trivial-problems-in-transfer-learning-models-using-amazon-sagemaker-debugger • Technical paper: Amazon SageMaker Debugger: A System for Real-Time Insights into Machine Learning Model Training (MLSys 2021) • https://pypi.org/project/smdebug Acknowledgments: Amazon SageMaker Debugger core team, Amazon AWS AI team, and partners across Amazon
  160. Fairness for Opaque Models via Model Tuning (Hyperparameter Optimization) • Can we tune the hyperparameters of a model to achieve both accuracy and fairness? • Can we support both opaque models and opaque fairness constraints? • Use Bayesian optimization for HPO with fairness constraints! • Explore hyperparameter configurations where fairness constraints are satisfied V. Perrone, M. Donini, K. Kenthapadi, C. Archambeau, Bayesian Optimization with Fairness Constraints, ICML 2020 AutoML workshop (best paper award).
  161. Defining Fairness
  162. Fairness Definitions - Examples
  163. ϵ-Fairness
  164. Global Optimization for Opaque Functions
  165. Bayesian Optimization
  166. Fair Bayesian Optimization
  167. Results: Model-specific vs. Model-agnostic Methods
  168. Fairness in human-in-the-loop settings • Joint learning framework to learn a classifier and a deferral system for multiple experts simultaneously • Synthetic and real-world experiments on the efficacy of our method V. Keswani, M. Lease, K. Kenthapadi, Towards Unbiased and Accurate Deferral to Multiple Experts. 2020 (Tech Report).
  169. Human-in-the-loop frameworks Desirable to augment ML model predictions with expert inputs Popular examples – Healthcare models Content moderation tasks Useful for improving accuracy, incorporating additional information, and auditing models. Child maltreatment hotline screening
  170. Errors and biases in human-in-the-loop frameworks ML tasks often suffer from group-specific bias, induced due to misrepresentative data or models. Human-in-the-loop frameworks can reflect biases or inaccuracies of the human experts. Concerns include: • Racial bias in human-in-the-loop framework for recidivism risk assessment (Green, Chen – FAT* 2019) • Ethical concerns regarding audits of commercial facial processing technologies (Raji et al. – AIES 2020) • Automation bias in time critical decision support systems (Cummings – ISTC 2004) Can we design human-in-the-loop frameworks that take into account the expertise and biases of the human experts?
  171. Model. X – non-protected attributes; Y – class label; Z – protected attribute / group membership. Number of experts available = m − 1. Input: X. Classifier F: X → Y. Deferrer D: X → {0, 1}^m; if D_i = 1, the corresponding predictor (classifier or expert) is selected. The deferrer D chooses a committee of experts, and the majority decision of the committee is the final prediction (final output of the selected committee). Experts might have access to additional information, including group membership Z. There might be a cost/penalty associated with each expert review.
  172. Privacy Research @ Amazon - Sampler Work done by Oluwaseyi Feyisetan, Tom Diethe, Thomas Drake, Borja Balle
  173. Simple but effective, privacy-preserving mechanism Task: subsample from dataset using additional information in privacy- preserving way. Building on existing exponential analysis of k-anonymity, amplified by sampling… Mechanism M is (β, ε, δ)-differentially private Model uncertainty via Bayesian NN ”Privacy-preserving Active Learning on Sensitive Data for User Intent Classification” [Feyisetan, Balle, Diethe, Drake; PAL 2019]
  174. Differentially-private text redaction Task: automatically redact sensitive text for privatizing various ML models. Perturb sentences but maintain meaning, e.g. "goalie wore a hockey helmet" → "keeper wear the nhl hat". Apply metric DP and analysis of word embeddings to scramble sentences. Mechanism M is d_χ-differentially private. Establish plausible deniability statistics: N_w := Pr[M(w) = w]; S_w := expected number of words output by M(w). "Privacy- and Utility-Preserving Textual Analysis via Calibrated Multivariate Perturbations" [Feyisetan, Drake, Diethe, Balle; WSDM 2020]
  175. Analysis of DP redaction Show plausible deniability via the distributions of N_w & S_w as a function of ε: as ε → 0, N_w decreases and S_w increases; as ε → ∞, N_w increases and S_w decreases. Impact on accuracy for a given ε (epsilon) on multi-class classification and question answering tasks, respectively (figures).
  176. Improving data utility of DP text redaction Task: redact text, but use additional structured information to better preserve utility. Can we improve redaction for models that fail for extraneous words? ~Recall-sensitive Extend d χ privacy to hyperbolic embeddings [Tifrea 2018] via Hyperbolic: utilize high-dimensional geometry to infuse embeddings with graph structure E.g. uni- or bi-directional syllogisms from WebIsADb New privacy analysis of Poincaré model and sampling procedure Mechanism takes advantage of density in data to apply perturbations more precisely. “Leveraging Hierarchical Representations for Preserving Privacy and Utility in Text” Feyisetan, Drake, Diethe; ICDM 2019 Tiling in Poincaré disk Hyperbolic Glove emb. projected into B2 Poincaré disk
  177. Analysis of Hyperbolic redaction The new method improves both privacy and utility because of its ability to encode meaningful structure in embeddings. Accuracy scores on classification tasks: * indicates results better than 1 baseline, ** better than 2 baselines. Plausible deniability statistic N_w (Pr[M(w) = w]) improved.
  178. Responsible AI Case Studies at Google Ben Packer Google AI
  179. Google Assistant
  180. Google Assistant Key Points: • Think about user harms: how does your product make people feel? • Adversarial ("stress") testing for all Google Assistant launches • People might say racist, sexist, homophobic stuff • Diverse testers • Think about expanding who your users could and should be • Consider the diversity of your users
  181. Computer Vision
  182. Google Camera Key points: • Check for unconscious bias • Comprehensive testing: "make sure this works for everybody"
  183. Night Sight
  184. This is a “Shirley Card” Named after a Kodak studio model named Shirley Page, they were the primary method for calibrating color when processing film. SKIN TONE IN PHOTOGRAPHY SOURCES Color film was built for white people. Here's what it did to dark skin. (Vox) How Kodak's Shirley Cards Set Photography's Skin-Tone Standard, NPR
  185. Until about 1990, virtually all Shirley Cards featured Caucasian women. SKIN TONE IN PHOTOGRAPHY SOURCES Color film was built for white people. Here's what it did to dark skin. (Vox) Colour Balance, Image Technologies, and Cognitive Equity, Roth How Photography Was Optimized for White Skin Color (Priceonomics)
  186. As a result, photos featuring people with light skin looked fairly accurate. SKIN TONE IN PHOTOGRAPHY SOURCES Color film was built for white people. Here's what it did to dark skin. (Vox) Colour Balance, Image Technologies, and Cognitive Equity, Roth How Photography Was Optimized for White Skin Color (Priceonomics) Film Kodachrome Year 1970 Credit Darren Davis, Flickr
  187. Photos featuring people with darker skin, not so much... SKIN TONE IN PHOTOGRAPHY SOURCES Color film was built for white people. Here's what it did to dark skin. (Vox) Colour Balance, Image Technologies, and Cognitive Equity, Roth How Photography Was Optimized for White Skin Color (Priceonomics) Film Kodachrome Year 1958 Credit Peter Roome, Flickr
  188. Google Clips
  189. Google Clips "We created controlled datasets by sampling subjects from different genders and skin tones in a balanced manner, while keeping variables like content type, duration, and environmental conditions constant. We then used this dataset to test that our algorithms had similar performance when applied to different groups." https://ai.googleblog.com/2018/05/automatic-photography-with-google-clips.html
  190. Geena Davis Inclusion Quotient [with Geena Davis Institute on Gender in Media]
  191. Smart Compose
  192. Adversarial Testing for Smart Compose in Gmail
  193. Adversarial Testing of Embedding Model in Gmail https://developers.googleblog.com/2018/04/text-embedding-models-contain-bias.html
  194. Adversarial Testing of Embedding Model in Gmail https://developers.googleblog.com/2018/04/text-embedding-models-contain-bias.html
  195. Adversarial Testing for Smart Compose in Gmail
  196. Machine Translation
  197. (Historical) Gender Pronouns in Translate
  198. Three Step Approach
  199. 1. Detect Gender-Neutral Queries Train a text classifier to detect when a Turkish query is gender-neutral. • trained on thousands of human-rated Turkish examples
  200. 2. Generate Gender-Specific Translations • Training: Modify training data to add an additional input token specifying the required gender: • (<2MALE> O bir doktor, He is a doctor) • (<2FEMALE> O bir doktor, She is a doctor) • Deployment: If step (1) predicted query is gender-neutral, add male and female tokens to query • O bir doktor -> {<2MALE> O bir doktor, <2FEMALE> O bir doktor}
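A toy sketch of the token-prepending idea in steps 1-2 above; the `is_gender_neutral` classifier and `translate` function are hypothetical stand-ins, not Google's production components.

```python
# Toy illustration: prepend gender tokens to gender-neutral queries before translation.
def gendered_translations(query: str, is_gender_neutral, translate) -> dict:
    if not is_gender_neutral(query):                    # step 1: classifier on the source query
        return {"default": translate(query)}
    return {                                            # step 2: request both gendered variants
        "feminine": translate("<2FEMALE> " + query),
        "masculine": translate("<2MALE> " + query),
    }

# e.g. gendered_translations("O bir doktor", classifier, nmt_model)
# -> {"feminine": "She is a doctor", "masculine": "He is a doctor"}, then checked for accuracy in step 3.
```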
  201. 3. Check for Accuracy Verify: 1. If the requested feminine translation is feminine. 2. If the requested masculine translation is masculine. 3. If the feminine and masculine translations are exactly equivalent with the exception of gender-related changes.
  202. Result: Reduced Gender Bias in Translate
  203. Responsible AI Case Studies at LinkedIn Krishnaram Kenthapadi Amazon AWS AI
  204. Fairness in AI @ LinkedIn Fairness-aware Talent Search Ranking* *Work done while at LinkedIn
  205. Guiding Principle: “Diversity by Design”
  206. Insights to Identify Diverse Talent Pools Representative Talent Search Results Diversity Learning Curriculum “Diversity by Design” in LinkedIn’s Talent Solutions
  207. Plan for Diversity
  208. Representative Ranking for Talent Search S. C. Geyik, S. Ambler, K. Kenthapadi, Fairness-Aware Ranking in Search & Recommendation Systems with Application to LinkedIn Talent Search, KDD'19. [Microsoft's AI/ML conference (MLADS'18). Distinguished Contribution Award] Building Representative Talent Search at LinkedIn (LinkedIn engineering blog)
  209. Intuition for Measuring and Achieving Representativeness Ideal: Top ranked results should follow a desired distribution on gender/age/… E.g., same distribution as the underlying talent pool Inspired by “Equal Opportunity” definition [Hardt et al, NeurIPS’16] Defined measures (skew, divergence) based on this intuition
  210. Measuring (Lack of) Representativeness Skew@k (Logarithmic) ratio of the proportion of candidates having a given attribute value among the top k ranked results to the corresponding desired proportion Variants: MinSkew: Minimum over all attribute values MaxSkew: Maximum over all attribute values Normalized Discounted Cumulative Skew Normalized Discounted Cumulative KL-divergence
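A minimal sketch of Skew@k as defined above: the log-ratio of an attribute value's share in the top-k results to its desired share. The toy ranking, desired distribution, and smoothing constant are illustrative assumptions.

```python
# Skew@k: log-ratio of observed top-k proportion to the desired proportion.
import math

def skew_at_k(ranked_attrs, value, desired_prop, k, alpha=1e-3):
    top_k = ranked_attrs[:k]
    observed_prop = sum(1 for a in top_k if a == value) / k
    # Small smoothing term avoids log(0) for attribute values absent from the top k.
    return math.log((observed_prop + alpha) / (desired_prop + alpha))

ranked = ["M", "M", "F", "M", "M", "F", "M", "M", "F", "M"]   # attribute of each ranked candidate
desired = {"M": 0.6, "F": 0.4}                                 # e.g., distribution of the qualified pool
skews = {v: skew_at_k(ranked, v, p, k=10) for v, p in desired.items()}
print(skews)                       # MinSkew = min(skews.values()), MaxSkew = max(skews.values())
```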
  211. Fairness-aware Reranking Algorithm (Simplified) Partition the set of potential candidates into different buckets for each attribute value Rank the candidates in each bucket according to the scores assigned by the machine-learned model Merge the ranked lists, balancing the representation requirements and the selection of highest scored candidates Representation requirement: Desired distribution on gender/age/… Algorithmic variants based on how we choose the next attribute
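A simplified sketch of the re-ranking idea above: greedily fill the ranking, at each position preferring the highest-scored candidate from any group that is still below its representation target. This is a paraphrase for illustration, not the exact production algorithm of [Geyik et al., KDD'19].

```python
# Greedy fairness-aware merge of per-attribute ranked buckets.
import math

def fair_rerank(buckets, desired_dist, k):
    """buckets: {attr_value: list of (score, candidate_id), best score first}."""
    ranked = []                                          # list of (attr_value, score, candidate_id)
    for position in range(1, k + 1):
        counts = {v: sum(1 for a, _, _ in ranked if a == v) for v in buckets}
        # Groups still below their representation target for the current prefix length.
        behind = [v for v in buckets
                  if buckets[v] and counts[v] < math.floor(desired_dist[v] * position)]
        eligible = behind or [v for v in buckets if buckets[v]]
        # Among eligible groups, take the group whose next candidate has the highest score.
        chosen = max(eligible, key=lambda v: buckets[v][0][0])
        score, cand = buckets[chosen].pop(0)
        ranked.append((chosen, score, cand))
    return ranked

# Toy usage: two gender buckets, desired distribution 50/50, top-6 list.
buckets = {"F": [(0.91, "f1"), (0.72, "f2"), (0.55, "f3")],
           "M": [(0.95, "m1"), (0.90, "m2"), (0.88, "m3")]}
print(fair_rerank(buckets, {"F": 0.5, "M": 0.5}, k=6))
```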
  212. Architecture
  213. Validating Our Approach Gender Representativeness Over 95% of all searches are representative compared to the qualified population of the search Business Metrics A/B test over LinkedIn Recruiter users for two weeks No significant change in business metrics (e.g., # InMails sent or accepted) Ramped to 100% of LinkedIn Recruiter users worldwide
  214. Lessons learned • Post-processing approach desirable • Model agnostic • Scalable across different model choices for our application • Acts as a "fail-safe" • Robust to application-specific business logic • Easier to incorporate as part of existing systems • Build a stand-alone service or component for post-processing • No significant modifications to the existing components • Complementary to efforts to reduce bias from training data & during model training • Collaboration/consensus across key stakeholders
  215. Evaluating Fairness Using Permutation Tests [DiCiccio, Vasudevan, Basu, Kenthapadi, Agarwal, KDD'20] • Is the measured discrepancy across different groups statistically significant? • Use statistical hypothesis tests! • Can we perform hypothesis tests in a metric-agnostic manner? • Non-parametric tests can help! • Permutation testing framework
  216. Brief Review of Permutation Tests Observe data from two populations, X₁, …, Xₙ and Y₁, …, Yₘ. Are the populations the same? A reasonable test statistic might be the difference in sample means, T = |X̄ − Ȳ|.
  217. Brief Review of Permutation Tests (Continued) A p-value is the chance of observing a test statistic at least as “extreme” as the value we actually observed Permutation test approach: ●Randomly shuffle the population designations of the observations ●Recompute the test statistic T ●Repeat many times Permutation p-value: the proportion of permuted datasets resulting in a larger test statistic than the original value This test is exact!
  218. A Fairness Example Consider testing whether the true positive rate of a classifier is equal between two groups. Test statistic: difference in the proportion of positively labeled observations that are classified as positive between the two groups. Permutation test: randomly reshuffle group labels, recompute the test statistic.
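A minimal sketch of this permutation test: shuffle the group labels, recompute the metric gap, and report the proportion of shuffles with a gap at least as large as the observed one. The toy data and group names are illustrative assumptions.

```python
# Permutation test for a group-wise fairness metric gap.
import numpy as np

def permutation_pvalue(metric_by_group, groups, y_true, y_pred, n_perm=10_000, seed=0):
    rng = np.random.default_rng(seed)
    observed = metric_by_group(groups, y_true, y_pred)
    perm_stats = []
    for _ in range(n_perm):
        shuffled = rng.permutation(groups)              # break any real group/metric association
        perm_stats.append(metric_by_group(shuffled, y_true, y_pred))
    return float(np.mean(np.array(perm_stats) >= observed))

def tpr_gap(groups, y_true, y_pred):
    """Absolute difference in true positive rate between groups 'A' and 'B'."""
    def tpr(g):
        mask = (groups == g) & (y_true == 1)
        return y_pred[mask].mean() if mask.any() else 0.0
    return abs(tpr("A") - tpr("B"))

rng = np.random.default_rng(1)
groups = np.array(["A", "B"] * 100)
y_true = rng.integers(0, 2, size=200)
y_pred = rng.integers(0, 2, size=200)
print(permutation_pvalue(tpr_gap, groups, y_true, y_pred, n_perm=2000))
```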
  219. Permutation Tests for Evaluating Fairness in ML Models • Issues with the classical permutation test • Want to check: just equality of the fairness metric (e.g., false positive rate) across groups, and not whether the two groups have identical distributions • Exact for the strong null hypothesis … • … but may not be valid (even asymptotically) for the weak null hypothesis • Our paper: A fix for this issue • Choose a pivotal statistic (asymptotically distribution-free; does not depend on the observed data's distribution) • E.g., Studentize the test statistic
  220. Engineering for Fairness in AI Lifecycle S.Vasudevan, K. Kenthapadi, LiFT: A Scalable Framework for Measuring Fairness in ML Applications, CIKM’20 https://github.com/linkedin/LiFT
  221. LiFT System Architecture [Vasudevan & Kenthapadi, CIKM’20] •Flexibility of Use (Platform agnostic) •Ad-hoc exploratory analyses •Deployment in offline workflows •Integration with ML Frameworks •Scalability •Diverse fairness metrics •Conventional fairness metrics •Benefit metrics •Statistical tests
  222. Acknowledgements LinkedIn Talent Solutions Diversity team, Hire & Careers AI team, Anti-abuse AI team, Data Science Applied Research team Special thanks to Deepak Agarwal, Parvez Ahammad, Stuart Ambler, Kinjal Basu, Jenelle Bray, Erik Buchanan, Bee-Chung Chen, Fei Chen, Patrick Cheung, Gil Cottle, Cyrus DiCiccio, Patrick Driscoll, Carlos Faham, Nadia Fawaz, Priyanka Gariba, Meg Garlinghouse, Sahin Cem Geyik, Gurwinder Gulati, Rob Hallman, Sara Harrington, Joshua Hartman, Daniel Hewlett, Nicolas Kim, Rachel Kumar, Monica Lewis, Nicole Li, Heloise Logan, Stephen Lynch, Divyakumar Menghani, Varun Mithal, Arashpreet Singh Mor, Tanvi Motwani, Preetam Nandy, Lei Ni, Nitin Panjwani, Igor Perisic, Hema Raghavan, Romer Rosales, Guillaume Saint-Jacques, Badrul Sarwar, Amir Sepehri, Arun Swami, Ram Swaminathan, Grace Tang, Ketan Thakkar, Sriram Vasudevan, Janardhanan Vembunarayanan, James Verbus, Xin Wang, Hinkmond Wong, Ya Xu, Lin Yang, Yang Yang, Chenhui Zhai, Liang Zhang, Yani Zhang
  223. Privacy in AI @ LinkedIn PriPeARL: Framework to compute robust, privacy-preserving analytics
  224. Analytics & Reporting Products at LinkedIn Profile View Analytics 232 Content Analytics Ad Campaign Analytics All showing demographics of members engaging with the product
  225. Admit only a small # of predetermined query types Querying for the number of member actions, for a specified time period, together with the top demographic breakdowns Analytics & Reporting Products at LinkedIn
  226. Admit only a small # of predetermined query types Querying for the number of member actions, for a specified time period, together with the top demographic breakdowns Analytics & Reporting Products at LinkedIn E.g., Title = “Senior Director” E.g., Clicks on a given ad
  227. Privacy Requirements Attacker cannot infer whether a member performed an action E.g., click on an article or an ad Attacker may use auxiliary knowledge E.g., knowledge of attributes associated with the target member (say, obtained from this member’s LinkedIn profile) E.g., knowledge of all other members that performed similar action (say, by creating fake accounts)
  228. Possible Privacy Attacks Targeting: Senior directors in US, who studied at Cornell Matches ~16k LinkedIn members → over minimum targeting threshold Demographic breakdown: Company = X May match exactly one person → can determine whether the person clicks on the ad or not Require a minimum reporting threshold? Attacker could create fake profiles! E.g., if the threshold is 10, create 9 fake profiles that all click Rounding mechanism? E.g., report counts in increments of 10 Still amenable to attacks, e.g., using the increments in counts over time to infer individuals’ actions Need rigorous techniques to preserve member privacy (not reveal exact aggregate counts)
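A toy illustration (with made-up numbers) of why ad-hoc randomization is not enough: if every repetition of the same query received fresh, independent noise, an attacker could simply average the answers and recover the exact count, which motivates the approach on the next slide.

    import numpy as np

    rng = np.random.default_rng(0)
    true_count = 17
    # 10,000 answers to the same query, each with fresh Laplace noise (std ≈ 14).
    answers = true_count + rng.laplace(scale=10.0, size=10_000)
    print(answers.mean())   # ≈ 17: the noise averages away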
  229. Problem Statement Compute robust, reliable analytics in a privacy-preserving manner, while addressing the product needs.
  230. PriPeARL: A Framework for Privacy-Preserving Analytics K. Kenthapadi, T. T. L. Tran, ACM CIKM 2018
Pseudo-random noise generation, inspired by differential privacy:
● Query key: entity id (e.g., ad creative/campaign/account), demographic dimension, stat type (impressions, clicks), time range, fixed secret seed
● Uniformly random fraction: cryptographic hash of the key, normalized to (0, 1)
● Random noise: Laplace noise with fixed ε
● Noisy count = true count + noise
To satisfy consistency requirements:
● Pseudo-random noise → the same query has the same result over time, avoiding averaging attacks
● For non-canonical queries (e.g., arbitrary time ranges, aggregates over multiple entities)
 ○ Use the hierarchy and partition into canonical queries
 ○ Compute noise for each canonical query and sum up the noisy counts
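A simplified sketch of this noise generation step (parameter names and the hash-to-uniform mapping are ours; the deployed system differs in details such as suppression and consistency handling):

    import hashlib
    import math

    def pseudo_random_laplace_noise(secret_seed, entity_id, dimension,
                                    stat_type, time_range, epsilon,
                                    sensitivity=1.0):
        # Hash the query attributes together with a fixed secret seed, map the
        # digest to a uniform fraction in (0, 1), then apply the inverse Laplace
        # CDF. The same canonical query always gets the same noise, which
        # defeats averaging attacks.
        key = "|".join([secret_seed, entity_id, dimension, stat_type, time_range])
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        u = (int(digest, 16) % 10**9 + 0.5) / 10**9   # uniform fraction in (0, 1)
        b = sensitivity / epsilon                     # Laplace scale for fixed ε
        centered = u - 0.5
        sign = 1.0 if centered >= 0 else -1.0
        return -b * sign * math.log(1.0 - 2.0 * abs(centered))

    def noisy_count(true_count, **query):
        # Add the query-keyed noise; flooring at zero is one simple consistency fix.
        return max(0, round(true_count + pseudo_random_laplace_noise(**query)))

For a non-canonical query (e.g., an arbitrary time range), the range would first be partitioned into canonical sub-queries and their noisy counts summed, as described above.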
  231. PriPeARL System Architecture
  232. Lessons Learned from Deployment (> 1 year) Semantic consistency vs. unbiased, unrounded noise Suppression of small counts Online computation and performance requirements Scaling across analytics applications Tools for ease of adoption (code/API library, hands-on how-to tutorial) help! Having a few entry points (all analytics apps built over Pinot) → wider adoption
  233. Summary Framework to compute robust, privacy-preserving analytics Addressing challenges such as preserving member privacy, product coverage, utility, and data consistency Future Utility maximization problem given constraints on the ‘privacy loss budget’ per user E.g., add noise with larger variance to impressions and less noise to clicks (or conversions) E.g., more noise to broader time range sub-queries and less noise to granular time range sub-queries Reference: K. Kenthapadi, T. Tran, PriPeARL: A Framework for Privacy-Preserving Analytics and Reporting at LinkedIn, ACM CIKM 2018.
  234. Acknowledgements Team: AI/ML: Krishnaram Kenthapadi, Thanh T. L. Tran Ad Analytics Product & Engineering: Mark Dietz, Taylor Greason, Ian Koeppe Legal / Security: Sara Harrington, Sharon Lee, Rohit Pitke Acknowledgements Deepak Agarwal, Igor Perisic, Arun Swami
  235. LinkedIn Salary
  236. LinkedIn Salary (launched in Nov, 2016)
  237. Data Privacy Challenges Minimize the risk of inferring any one individual’s compensation data Protection against data breach No single point of failure
  238. Problem Statement How do we design the LinkedIn Salary system, taking into account the unique privacy and security challenges, while addressing the product requirements? K. Kenthapadi, A. Chudhary, and S. Ambler, LinkedIn Salary: A System for Secure Collection and Presentation of Structured Compensation Insights to Job Seekers, IEEE PAC 2017 (arxiv.org/abs/1705.06976)
  239. De-identification Example
Original submission: Title = User Exp Designer, Region = SF Bay Area, Company = Google, Industry = Internet, Years of exp = 12, Degree = BS, FoS = Interactive Media, Skills = UX, Graphics, ..., $$ = 100K
De-identified cohorts derived from it (each keeps the compensation value but only a subset of attributes):
● (Title, Region): User Exp Designer, SF Bay Area → 100K (pooled with other submissions, e.g., 115K, ...)
● (Title, Region, Industry): User Exp Designer, SF Bay Area, Internet → 100K
● (Title, Region, Years of exp): User Exp Designer, SF Bay Area, 10+ → 100K
● (Title, Region, Company, Years of exp): User Exp Designer, SF Bay Area, Google, 10+ → 100K
A cohort is copied to Hadoop (HDFS) only if #data points > threshold.
Note: original submissions are stored as encrypted objects.
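A minimal sketch of the thresholding step (the cohort keys and threshold value below are illustrative, not the production LinkedIn Salary logic):

    from collections import defaultdict

    COHORT_KEYS = [
        ("title", "region"),
        ("title", "region", "industry"),
        ("title", "region", "years_of_exp"),
        ("title", "region", "company", "years_of_exp"),
    ]
    MIN_COHORT_SIZE = 10   # assumed threshold, for illustration only

    def release_cohorts(submissions):
        # Group de-identified submissions into cohorts and copy downstream
        # (e.g., to HDFS) only cohorts with more data points than the threshold.
        cohorts = defaultdict(list)
        for s in submissions:
            for keys in COHORT_KEYS:
                cohort_id = (keys, tuple(s[k] for k in keys))
                cohorts[cohort_id].append(s["salary"])
        return {cid: salaries for cid, salaries in cohorts.items()
                if len(salaries) > MIN_COHORT_SIZE}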
  240. System Architecture
  241. Acknowledgements Team: AI/ML: Krishnaram Kenthapadi, Stuart Ambler, Xi Chen, Yiqun Liu, Parul Jain, Liang Zhang, Ganesh Venkataraman, Tim Converse, Deepak Agarwal Application Engineering: Ahsan Chudhary, Alan Yang, Alex Navasardyan, Brandyn Bennett, Hrishikesh S, Jim Tao, Juan Pablo Lomeli Diaz, Patrick Schutz, Ricky Yan, Lu Zheng, Stephanie Chou, Joseph Florencio, Santosh Kumar Kancha, Anthony Duerr Product: Ryan Sandler, Keren Baruch Other teams (UED, Marketing, BizOps, Analytics, Testing, Voice of Members, Security, …): Julie Kuang, Phil Bunge, Prateek Janardhan, Fiona Li, Bharath Shetty, Sunil Mahadeshwar, Cory Scott, Tushar Dalvi, and team Acknowledgements David Freeman, Ashish Gupta, David Hardtke, Rong Rong, Ram
  242. Key Takeaways Krishnaram Kenthapadi Amazon AWS AI
  243. Good ML Practices Go a Long Way Lots of low hanging fruit in terms of improving fairness simply by using machine learning best practices • Representative data • Introspection tools • Visualization tools • Testing 01 Fairness improvements often lead to overall improvements • It’s a common misconception that it’s always a tradeoff 02
  244. Breadth and Depth Required Looking End-to-End is critical • Need to be aware of bias and potential problems at every stage of product and ML pipelines (from design, data gathering, … to deployment and monitoring) 01 Details Matter • Slight changes in features or labeler criteria can change the outcome • Must have experts who understand the effects of decisions • Many details are not technical such as how labelers are hired 02
  245. Process Best Practices Identify product goals Get the right people in the room Identify stakeholders Select a fairness approach Analyze and evaluate your system Mitigate issues Monitor continuously and have escalation plans Auditing and Transparency Policy Technology
  246. Beyond Accuracy Performance and Cost Fairness and Bias Transparency and Explainability Privacy Security Safety Robustness
  247. Fairness, Explainability & Privacy: Opportunities
  248. Fairness in ML Application-specific challenges Conversational AI systems: Unique bias/fairness/ethics considerations E.g., Hate speech, Complex failure modes Beyond protected categories, e.g., accent, dialect Entire ecosystem (e.g., including apps such as Alexa skills) Two-sided markets: e.g., fairness to buyers and to sellers, or to content consumers and producers Fairness in advertising (externalities) Tools for ensuring fairness (measuring & mitigating bias) in the AI lifecycle Pre-processing (representative datasets; modifying features/labels) ML model training with fairness constraints Post-processing Experimentation & Post-deployment
  249. Key Open Problems in Applied Fairness What if you don’t have the sensitive attributes? When should you use what approach? For example, Equal treatment vs equal outcome? How to identify harms? Process for framing AI problems: Will the chosen metrics lead to desired results? How to tell if data generation and collection method is appropriate for a task? (e.g., causal structure analysis?) Processes for mitigating harms and misbehaviors quickly
  250. Explainability in ML Actionable explanations Balance between explanations & model secrecy Robustness of explanations to failure modes (Interaction between ML components) Application-specific challenges Conversational AI systems: contextual explanations Gradation of explanations Tools for explanations across AI lifecycle Pre & post-deployment for ML models Model developer vs. End user focused
  251. Privacy in ML Privacy for highly sensitive data: model training & analytics using secure enclaves, homomorphic encryption, federated learning / on-device learning, or a hybrid Privacy-preserving model training, robust against adversarial membership inference attacks (Dynamic settings + Complex data / model pipelines) Privacy-preserving mechanisms for data marketplaces
  252. Reflections “Fairness, Explainability, and Privacy by Design” when building AI products Collaboration/consensus across key stakeholders NYT / WSJ / ProPublica test :)
  253. Related Tutorials / Resources • ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) • AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES) • Sara Hajian, Francesco Bonchi, and Carlos Castillo, Algorithmic bias: From discrimination discovery to fairness-aware data mining, KDD Tutorial, 2016. • Solon Barocas and Moritz Hardt, Fairness in machine learning, NeurIPS Tutorial, 2017. • Kate Crawford, The Trouble with Bias, NeurIPS Keynote, 2017. • Arvind Narayanan, 21 fairness definitions and their politics, FAccT Tutorial, 2018. • Sam Corbett-Davies and Sharad Goel, Defining and Designing Fair Algorithms, Tutorials at EC 2018 and ICML 2018. • Ben Hutchinson and Margaret Mitchell, Translation Tutorial: A History of Quantitative Fairness in Testing, FAccT Tutorial, 2019. • Henriette Cramer, Kenneth Holstein, Jennifer Wortman Vaughan, Hal Daumé III, Miroslav Dudík, Hanna Wallach, Sravana Reddy, and Jean Garcia-Gathright, Translation Tutorial: Challenges of incorporating algorithmic fairness into industry practice, FAccT Tutorial, 2019.