Melinda Thielbar, Data Science Practice Lead and Director of Data Science at Fidelity Investments
From corporations to governments to private individuals, most of the AI community has recognized the growing need to incorporate ethics into the development and maintenance of AI models. Much of the current discussion, though, is meant for leaders and managers. This talk is directed to data scientists, data engineers, ML Ops specialists, and anyone else who is responsible for the hands-on, day-to-day work of building, productionalizing, and maintaining AI models. We'll give a short overview of the business case for why technical AI expertise is critical to developing an AI Ethics strategy. Then we'll discuss the technical problems that cause AI models to behave unethically, how to detect problems at all phases of model development, and the tools and techniques that are available to support technical teams in Ethical AI development.
Data Con LA 2022 - AI Ethics
1. Building AI That Works for Everyone
AI Ethics for Technical People
2. About Me
• Ph.D. Statistician
• Labor Economist
• Software Developer
• Artist
• Midwest Farm Girl
• Pronouns: she, her, hers
3. About This Talk
Focused on “high-stakes AI”.
• Defined by Sambasivan, Kapania, Highfill, Akrong, Paritosh, and Aroyo (2021)
• I do recommend these exercises for everyone.
AI Ethics problems require input from technical people.
Many of our biggest issues come from manual verification of
automated systems.
When I say “AI that works for everyone,” I mean everyone.
• People using the model
• People affected by the model
• Data labelers
• Data engineers
• Machine learning engineers
• Data scientists
4. An Actual LinkedIn Poll from an AI Ethics Expert
|                      | Predicted Cancer    | Predicted No Cancer |
|----------------------|---------------------|---------------------|
| Has Cancer           | TP (True Positive)  | FN (False Negative) |
| Does Not Have Cancer | FP (False Positive) | TN (True Negative)  |
Accuracy = (True Positives + True Negatives) / Total Patients
Recall = True Positives / (True Positives + False Negatives)
Which model would you rather have?
A black box cancer screening model with 99% accuracy?
An explainable cancer screening model with 90% accuracy?
This is the wrong question!
|                      | Predicted Cancer           | Predicted No Cancer           | Row Percents    |
|----------------------|----------------------------|-------------------------------|-----------------|
| Has Cancer           | Has Cancer, More Screening | Has Cancer and Does Not Know  | 1% of Patients  |
| Does Not Have Cancer | No Cancer, More Screening  | No Cancer, No Extra Screening | 99% of Patients |

|                      | Predicted Cancer    | Predicted No Cancer | Row Percents    |
|----------------------|---------------------|---------------------|-----------------|
| Has Cancer           | TP (True Positive)  | FN (False Negative) | 1% of Patients  |
| Does Not Have Cancer | FP (False Positive) | TN (True Negative)  | 99% of Patients |
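To make the point concrete, here is a minimal sketch (simulated data with the 1% prevalence assumed in the tables above): a degenerate model that always predicts "no cancer" scores roughly 99% accuracy while catching zero actual cases.

```python
# Hypothetical illustration: accuracy vs. recall at 1% prevalence.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)  # ~1% of patients have cancer
y_pred = np.zeros_like(y_true)                    # "model" that always predicts no cancer

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")  # ~0.99
print(f"Recall:   {recall_score(y_true, y_pred):.2f}")    # 0.00, misses every real case
```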
5. Typical AI/ML Pipeline
Feedback on model performance in production is the cornerstone of an AI Ethics practice.
[Pipeline diagram: Training Data and Code produce a model → Scoring Data and Model produce decisions (summarized in a confusion matrix of Actual vs. Predicted Yes/No: TP, FN, FP, TN) → Operator reviews Decisions → People affected by Decisions give feedback, which feeds Failure Analysis, Fairness Analysis, and Impact Analysis.]
6. Typical AI/ML Pipeline
In practice, anything that isn't model training or scoring is:
• Ad hoc
• Manual
• Prone to data errors
[Same pipeline diagram as the previous slide.]
8. Technical Pillars of Trustworthy AI
Does this model work for everyone?
Pillars: Human Agency and Oversight, Prevention of Harm, Fairness, Accountability, Technical Robustness and Safety, Privacy and Data Governance, Social and Environmental Well-Being
• How often does the model fail, and what is the impact? (Failure Analysis, Impact Analysis)
• Are model failures the same for everyone? (Fairness Analysis)
• How do we know the model is failing? (Failure Monitoring)
9. Typical AI/ML Pipeline
Technical leaders and individual contributors have a role in each of these pillars.
[Same pipeline diagram, annotated with the pillars: Human Agency and Oversight, Prevention of Harm, Fairness, Social and Environmental Well-Being, Privacy and Data Governance, Accountability, and Technical Robustness and Safety. Failure Analysis, Fairness Analysis, and Impact Analysis hang off the feedback loop.]
10. Technical Pillars of Trustworthy AI
[Section divider: the pillars overview repeats. This section: How often does the model fail, and what is the impact? (Failure Analysis)]
11. Failure Analysis
1. Find the cell in the confusion matrix that causes the most harm to the least advantaged group.
2. Analyze rates and outcomes for that cell.

Cancer Screening:
|                      | Predicted Cancer           | Predicted No Cancer           |
|----------------------|----------------------------|-------------------------------|
| Has Cancer           | Has Cancer, More Screening | Has Cancer and Does Not Know  |
| Does Not Have Cancer | No Cancer, More Screening  | No Cancer, No Extra Screening |

Fraud Screening:
|                    | Predicted Fraud          | Predicted No Fraud                |
|--------------------|--------------------------|-----------------------------------|
| Fraudulent Account | Audit, Model Makes $     | Fraud and No Audit, Model Loses $ |
| Honest Account     | No Fraud, Customer Audit | No Fraud, No Audit                |

Pillars: Fairness, Prevention of Harm
12. Aequitas Fairness Tree
[Fairness tree diagram. Guiding questions: Is being predicted positive punitive or assistive? Can you intervene with most people or just a subset? Which group is harmed most by mistakes?
• Punitive interventions (Everyone vs. People who get the intervention vs. People who do not): compare # False Positives / Group Size, the False Discovery Rate (FDR), or the False Positive Rate.
• Assistive interventions (Everyone vs. People Not Assisted vs. People with Actual Need): compare # False Negatives / Group Size, the False Omission Rate, the False Negative Rate, or the True Positive Rate (Recall).]
Fairness Tree: Data Science and Public Policy, Carnegie Mellon University
http://www.datasciencepublicpolicy.org/our-work/tools-guides/aequitas/
Pillars: Accountability, Technical Robustness and Safety
13. Failure Analysis: Pre-Deployment
• Failure analysis is often ad-hoc and depends heavily on
the data sources available.
• e.g. We may not know how many cancers human screeners
miss.
• Deployment should include automating failure analysis.
• Deployment should include plans for cadence of failure
analysis.
Pillars: Accountability, Technical Robustness and Safety
14. Tools for Failure Analysis
• Every classification model can produce the statistics listed in the fairness tree (e.g., via sklearn.metrics; a sketch follows below).
• It is up to the modeling team to decide which statistics
are the most important and to display them in a way that
communicates impact to stakeholders.
• Deciding on a set of metrics that should be monitored
post-deployment is part of the analysis.
• Once the analysis is done, it should be automated so it
can be re-done at regular intervals. These scripts are
usually tailored to the business problem.
• Amazon SageMaker Clarify has a nice set of tools for calculating and displaying these statistics.
Pillars: Accountability, Technical Robustness and Safety
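As a concrete illustration, here is a minimal sketch of the fairness-tree statistics computed from sklearn's confusion matrix. The function and variable names are placeholders, and zero-division guards are omitted for brevity.

```python
from sklearn.metrics import confusion_matrix

def fairness_tree_stats(y_true, y_pred):
    """Compute the fairness-tree statistics for one group's labels and predictions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    group_size = tn + fp + fn + tp
    return {
        "fp_over_group_size": fp / group_size,  # # False Positives / Group Size
        "fn_over_group_size": fn / group_size,  # # False Negatives / Group Size
        "fdr": fp / (fp + tp),                  # False Discovery Rate
        "fpr": fp / (fp + tn),                  # False Positive Rate
        "for": fn / (fn + tn),                  # False Omission Rate
        "fnr": fn / (fn + tp),                  # False Negative Rate
        "recall": tp / (tp + fn),               # True Positive Rate (Recall)
    }
```

Which of these numbers matters most, and how they should be displayed for stakeholders, is exactly the judgment call the modeling team has to make.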
15. Failure Analysis Depends on Good Data
[Same pipeline diagram, with Failure Analysis, Fairness Analysis, and Impact Analysis attached: failure analysis is only as good as the training data, scoring data, operator reviews, and feedback it draws on.]
Pillars: Accountability, Technical Robustness and Safety
16. "Everyone wants to do the model work, not the data work": Data Cascades in
High-Stakes AI,
Nithya Sambasivan and Shivani Kapania and Hannah Highfill and Diana Akrong and
Praveen Kumar Paritosh and Lora Mois Aroyo
(2021)
17. Technical Pillars of Trustworthy AI
[Section divider: the pillars overview repeats. This section: Are model failures the same for everyone? (Fairness Analysis)]
18. Fairness Analysis
1. Focus on the cell where the most harm occurs.
2. Compare performance for underrepresented and/or unprivileged groups. (A sketch of this comparison follows the table.)

The Fraud Screening outcome table is repeated for Group A, Group B, and Group C:
|                    | Predicted Fraud          | Predicted No Fraud                |
|--------------------|--------------------------|-----------------------------------|
| Fraudulent Account | Audit, Model Makes $     | Fraud and No Audit, Model Loses $ |
| Honest Account     | No Fraud, Customer Audit | No Fraud, No Audit                |

Pillars: Fairness, Prevention of Harm
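A minimal sketch of that per-group comparison, assuming a pandas DataFrame with hypothetical columns group, actual, and predicted. Here the harm cell is fraud the model missed (no audit, model loses money), so we compare False Negative Rates across groups.

```python
import pandas as pd

def fnr_by_group(df: pd.DataFrame) -> pd.Series:
    """False Negative Rate per group, among accounts that are actually fraudulent."""
    fraud = df[df["actual"] == 1]
    return (fraud["predicted"] == 0).groupby(fraud["group"]).mean()

# Usage (hypothetical data):
# fnr = fnr_by_group(scored_accounts)
# print(fnr / fnr.min())  # disparity ratio relative to the best-served group
```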
19. Aequitas Fairness Tree
[Fairness tree diagram, now with parity checks between groups. Guiding questions: Is being predicted positive punitive or assistive? Can you intervene with most people or just a subset? Which group is harmed most by mistakes?
• Punitive interventions (Everyone vs. People who get the intervention vs. People who do not): FP/Group-Size Parity (# False Positives / Group Size), FDR Parity (False Discovery Rate), FPR Parity (False Positive Rate).
• Assistive interventions (Everyone vs. People Not Assisted vs. People with Actual Need): FN/Group-Size Parity (# False Negatives / Group Size), FOR Parity (False Omission Rate), FNR Parity (False Negative Rate), Recall Parity (True Positive Rate).]
Fairness Tree: Data Science and Public Policy, Carnegie Mellon University
http://www.datasciencepublicpolicy.org/our-work/tools-guides/aequitas/
Pillars: Fairness, Prevention of Harm
20. Technical Pillars of Trustworthy AI
[Section divider: the pillars overview repeats. This section: How do we know the model is failing? (Failure Monitoring)]
21. Failure Monitoring
[Same pipeline diagram; the feedback paths (Operator reviews Decisions; People affected by Decisions give feedback) are what surface failures.]
Pillars: Human Agency and Oversight, Prevention of Harm
22. How do we know the model is failing?
• What pipelines exist for people to give feedback on model
performance?
• Experts/Operators who are using the models.
• People who are affected by the model.
• How do we automate monitoring of the most critical model performance metrics? (A minimal sketch follows below.)
• What outside data is available as a check against our assumptions
about the model?
• There are no great tools for checking failures.
• Cloud providers do offer some tools if you are using their cloud (e.g. AWS,
Azure, and Google).
Pillars: Human Agency and Oversight, Prevention of Harm
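One minimal monitoring sketch along these lines: compare each critical metric against the threshold agreed during failure analysis and surface an alert through whatever channel your team uses. All names here are placeholders.

```python
def check_metric(name: str, value: float, floor: float, notify) -> bool:
    """Alert (and return True) when a monitored metric falls below its agreed floor."""
    if value < floor:
        notify(f"Model metric '{name}' = {value:.3f} is below its floor of {floor:.3f}")
        return True
    return False

# Usage (hypothetical): recall measured on this week's labeled feedback,
# checked against the floor set during pre-deployment failure analysis.
# check_metric("recall", weekly_recall, floor=0.85, notify=print)
```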
23. Lowest Hanging Fruit: Automate All Data Pipelines
| Tool               | Scope                              | Tests                                                     | Stack                                                                      |
|--------------------|------------------------------------|-----------------------------------------------------------|----------------------------------------------------------------------------|
| dbt                | Runs code pipeline and data checks | Built-in tests and SQL-based user-defined tests           | SQL-based, with open-source dbt Core and a subscription-based cloud option |
| Soda and SodaCL    | Data checks only                   | Built-in tests and SQL-based user-defined tests           | SQL-based, with open-source Soda Core and subscription-based Soda Cloud    |
| great-expectations | Data checks only                   | Built-in tests for Python                                 | Python-based                                                               |
| deequ              | Data checks only                   | Built-in tests and Spark/PySpark-based user-defined tests | Spark/PySpark-based                                                        |
Pillars: Human Agency and Oversight, Prevention of Harm
24. Hardening Pipelines: Obvious Tests for Tabular Data
• Uniqueness: “This column/combination of columns
should be unique by row.”
• Correctness: “Only these values allowed in this column.”
• Missingness: “These columns should be populated for
X% of rows.”
• Range: “Nothing bigger/smaller than [a,b] should be in
this column.”
• … You get the picture. (A sketch of these checks follows below.)
Pillars: Human Agency and Oversight, Prevention of Harm
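Here is what those checks can look like written against great-expectations' classic pandas-backed interface; the exact API varies across versions, and the column names, set values, and bounds are all hypothetical.

```python
import great_expectations as ge

df = ge.read_csv("accounts.csv")  # a pandas DataFrame wrapped with expectation methods

df.expect_column_values_to_be_unique("account_id")                         # uniqueness
df.expect_column_values_to_be_in_set("status", ["open", "closed"])         # correctness
df.expect_column_values_to_not_be_null("balance", mostly=0.99)             # missingness: >= 99% populated
df.expect_column_values_to_be_between("age", min_value=18, max_value=120)  # range
```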
25. Hardening Pipelines: Less Obvious Tests for Tabular Data
• Feature Drift: Are distributions of inputs changing?
• Model Drift: Are the model predictions changing?
• Kolmogorov-Smirnov: What is the probability of
observing the data we see today (or something weirder)
compared to what we think the data should look like?
• A p-value threshold of 0.05 means this test alarms 5% of the time even when nothing has changed. Use False Discovery Rate control to separate true errors from noise.
• KL Divergence (Population Stability Index)
• Sensitive to the bins you pick.
• These tests are sensitive to outliers, and outliers happen all the time. (A sketch of both tests follows below.)
Pillars: Human Agency and Oversight, Prevention of Harm
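A sketch of both drift tests: scipy's two-sample Kolmogorov-Smirnov test plus a hand-rolled Population Stability Index. The bin count, the clip floor, and the 0.2 rule of thumb are modeling choices, not fixed rules.

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline sample and today's sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) in empty bins
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Usage (hypothetical feature arrays):
# stat, p_value = ks_2samp(training_feature, todays_feature)  # alarms ~5% of the time at p < 0.05
# drift = psi(training_feature, todays_feature)               # common rule of thumb: > 0.2 is a big shift
```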
26. Data Pipelines Hardened? Automate the Workflow
| Tool               | Purpose                                                                     | Stack        |
|--------------------|-----------------------------------------------------------------------------|--------------|
| model-card-toolkit | Open-source system for creating model cards                                 | Python-based |
| Metaflow           | Runs code pipeline and data checks; developed specifically for data science | Python-based |
| deepchecks         | Data checks and performance checks for the full model pipeline              | Python-based |
| Luigi              | Full-featured; lets you automate all of your scripts for everything         | Python-based |
| Airflow            | Like Luigi, but automates some of the more tedious parts                    | Python-based |

DAG: Directed Acyclic Graph
• A collection of tasks and their dependencies.
• Directed: each task that requires output from previous tasks knows its own dependencies.
• Acyclic: a graph term meaning there are no cycles; no task can depend, directly or indirectly, on its own output.
(A minimal Airflow sketch follows below.)
Model Card
• A simplified explanation of a model's inputs, outputs, and assumptions.
Pillars: Human Agency and Oversight, Prevention of Harm
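A minimal Airflow sketch of such a workflow: data checks gate scoring, and the failure statistics refresh on the same schedule. The task functions are placeholders for your own scripts.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_data_checks(): ...        # e.g., the tabular tests from the previous slides
def score_new_data(): ...         # scoring run, only reached if the checks pass
def refresh_failure_stats(): ...  # re-run the automated failure analysis

with DAG(dag_id="scoring_pipeline", start_date=datetime(2022, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    checks = PythonOperator(task_id="data_checks", python_callable=run_data_checks)
    scoring = PythonOperator(task_id="score", python_callable=score_new_data)
    stats = PythonOperator(task_id="failure_stats", python_callable=refresh_failure_stats)
    checks >> scoring >> stats  # directed and acyclic: each task waits on its dependencies
```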
27. Technical Pillars of Trustworthy AI
[Section divider: the pillars overview repeats. This section: How often does the model fail, and what is the impact? (Impact Analysis)]
28. Impact Analysis: AI That Works for Everyone
• The least technical part of AI Ethics.
• Arguably the part of AI Ethics that most needs technical
assistance.
• Part of the initial project plan.
• Local Impacts: This model’s impact on its stakeholders.
• Social Impacts: How does this model contribute to AI’s
larger issues?
• Mitigation Analysis: What can we do within the scope of
this project to mitigate negative impacts?
Pillars: Social and Environmental Well-Being; Privacy and Data Governance
29. Local Impact of an AI Model
• Does this model improve working conditions for the people
who use it?
• e.g. An AI model that requires a lot of data input from nurses and
doctors may increase their job responsibilities without
compensating or rewarding them for extra effort.
• Does this model improve outcomes for people affected by the
model?
• e.g. A fraud detection model may speed payment for most
individuals.
• Does this model make things worse for some individuals?
• e.g. A fraud detection model may speed payment for most
individuals and slow payment for others to an unacceptable level.
• Are we collecting only the data we need? Are we keeping that
data safe?
• e.g. Does my word game really need my location?
Pillars: Social and Environmental Well-Being; Privacy and Data Governance
30. Social Impact of an AI Model
• Environmental cost of an AI model is non-negligible:
https://openai.com/blog/ai-and-compute/
• We need efficient computation, and that is a technical problem.
• Many AI models profit from free or underpaid labor:
https://www.wired.com/story/foundations-ai-riddled-errors/
• Labeling software should be good software.
• Large-scale adoption of AI models has other effects.
• Never mind the trolley problem: suppose 10% of the cars on the road are self-driving. Now, suppose there's a network outage during a heavy traffic period.
Pillars: Social and Environmental Well-Being; Privacy and Data Governance
31. AI Ethics and Model Development
• Pre-Development
• Impact Analysis: Who will use the model and how?
• Failure Analysis: What is the most impactful failure? What is an acceptable
level of failure?
• Fairness Analysis: What are the underrepresented/unprivileged groups?
• Failure Monitoring: What development is needed for Human-to-Model
feedback?
• Model Development
• Design and hardening of data pipelines, including privacy.
• Model’s ability to meet failure thresholds.
• Deployment
• Does the model meet criteria set during pre-development?
• Are the requirements in place?
32. Ethical AI is Good AI and Good AI is Ethical AI
• Ethical AI knows when it fails and the impact of those failures.
• Ethical AI fails in the same way for everyone.
• Ethical AI is monitored for failures and has strong feedback loops that
surface problems quickly.
• Ethical AI is designed for positive impact on the communities where it
is implemented and for society as a whole.
Who doesn’t want that?
Ellis-Lee, Mia (2018). "Accessible Design is Good Design & Good Design is Accessible Design." Flywheel hosted blog. https://www.flywheelstrategic.com/thinking/post/flywheel-blog/2018/04/06/accessible-design-is-good-design-good-design-is-accessible-design
Editor's notes
Deloitte's Trustworthy AI Framework: https://www2.deloitte.com/us/en/pages/deloitte-analytics/solutions/ethics-of-ai-framework.html, https://www.technologyreview.com/2020/03/25/950291/trustworthy-ai-is-a-framework-to-help-manage-unique-risk/
US ai.gov: https://www.ai.gov/strategic-pillars/advancing-trustworthy-ai/
OECD Publishing (2021) “Trustworthy AI: A Framework to Compare Implementation Tools for Trustworthy AI Systems”. https://www.oecd.org/science/tools-for-trustworthy-ai-008232ec-en.htm
Fang, Huanming, Hui Miao (2020) “Introducing the Model Card Toolkit for Easier Model Transparency and Reporting.” Google AI Blog. https://ai.googleblog.com/2020/07/introducing-model-card-toolkit-for.html
Tagliabue, J., Tuulos, V., Greco, C. and Dave, V., 2021. DAG Card is the new Model Card. arXiv preprint arXiv:2110.13601. https://arxiv.org/pdf/2110.13601.pdf
“Where State Farm Sees ‘a Lot of Fraud,’ Black Customers See Discrimination” https://www.nytimes.com/2022/03/18/business/state-farm-fraud-black-customers.html
“Aiming for truth, fairness, and equity in your company’s use of AI” https://www.ftc.gov/business-guidance/blog/2021/04/aiming-truth-fairness-equity-your-companys-use-ai
“Weighing Big Tech’s Promise to Black America” https://www.wired.com/story/big-techs-promise-to-black-america/
Self-driving cars will make you forget how to drive: Javadi, AH., Emo, B., Howard, L. et al. Hippocampal and prefrontal processing of network topology to simulate the future. Nat Commun 8, 14652 (2017). https://doi.org/10.1038/ncomms14652