Melinda Thielbar, Data Science Practice Lead and Director of Data Science at Fidelity Investments
From corporations to governments to private individuals, most of the AI community has recognized the growing need to incorporate ethics into the development and maintenance of AI models. Much of the current discussion, though, is meant for leaders and managers. This talk is directed to data scientists, data engineers, ML Ops specialists, and anyone else who is responsible for the hands-on, day-to-day work of building, productionalizing, and maintaining AI models. We'll give a short overview of the business case for why technical AI expertise is critical to developing an AI Ethics strategy. Then we'll discuss the technical problems that cause AI models to behave unethically, how to detect problems at all phases of model development, and the tools and techniques that are available to support technical teams in Ethical AI development.
Data Con LA 2022 - AI Ethics
1. Building AI That Works for Everyone
AI Ethics for Technical People
2. About Me
• Ph.D. Statistician
• Labor Economist
• Software Developer
• Artist
• Midwest Farm Girl
• Pronouns: she, her, hers
3. About This Talk
Focused on “high-stakes AI”.
• Defined by Sambasivan, Kapania, Highfill, Akrong, Paritosh, and Aroyo (2021)
• I do recommend these exercises for everyone.
AI Ethics problems require input from technical people.
Many of our biggest issues come from manual verification of
automated systems.
When I say “AI that works for everyone,” I mean everyone.
• People using the model
• People affected by the model
• Data labelers
• Data engineers
• Machine learning engineers
• Data scientists
4. An Actual LinkedIn Poll from an AI Ethics Expert
|                      | Predicted Cancer    | Predicted No Cancer |
|----------------------|---------------------|---------------------|
| Has Cancer           | TP (True Positive)  | FN (False Negative) |
| Does Not Have Cancer | FP (False Positive) | TN (True Negative)  |
Accuracy = (True Positives + True Negatives) / Total Patients
Recall = True Positives / (True Positives + False Negatives)
Which model would you rather have?
A black box cancer screening model with 99% accuracy?
An explainable cancer screening model with 90% accuracy?
This is the wrong question!
|                      | Predicted Cancer           | Predicted No Cancer           | Row Percents    |
|----------------------|----------------------------|-------------------------------|-----------------|
| Has Cancer           | Has Cancer, More Screening | Has Cancer and Does Not Know  | 1% of Patients  |
| Does Not Have Cancer | No Cancer, More Screening  | No Cancer, No Extra Screening | 99% of Patients |

|                      | Predicted Cancer    | Predicted No Cancer | Row Percents    |
|----------------------|---------------------|---------------------|-----------------|
| Has Cancer           | TP (True Positive)  | FN (False Negative) | 1% of Patients  |
| Does Not Have Cancer | FP (False Positive) | TN (True Negative)  | 99% of Patients |
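To make the point concrete, here is a minimal sketch (simulated data with the 1% prevalence assumed in the tables above): a degenerate model that always predicts "no cancer" scores roughly 99% accuracy while catching zero actual cases.

```python
# Hypothetical illustration: accuracy vs. recall at 1% prevalence.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)  # ~1% of patients have cancer
y_pred = np.zeros_like(y_true)                    # "model" that always predicts no cancer

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")  # ~0.99
print(f"Recall:   {recall_score(y_true, y_pred):.2f}")    # 0.00, misses every real case
```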
5. Typical AI/ML Pipeline
Feedback on model performance in production is the cornerstone of an AI Ethics practice.
[Pipeline diagram: Training Data and Code produce a model → Scoring Data and Model produce decisions (summarized in a confusion matrix of Actual vs. Predicted Yes/No: TP, FN, FP, TN) → Operator reviews Decisions → People affected by Decisions give feedback, which feeds Failure Analysis, Fairness Analysis, and Impact Analysis.]
6. Typical AI/ML Pipeline
In practice, anything that isn't model training or scoring is:
• Ad hoc
• Manual
• Prone to data errors
[Same pipeline diagram as the previous slide.]
8. Technical Pillars of Trustworthy AI
Does this model work for everyone?
Pillars: Human Agency and Oversight, Prevention of Harm, Fairness, Accountability, Technical Robustness and Safety, Privacy and Data Governance, Social and Environmental Well-Being
• How often does the model fail, and what is the impact? (Failure Analysis, Impact Analysis)
• Are model failures the same for everyone? (Fairness Analysis)
• How do we know the model is failing? (Failure Monitoring)
9. Typical AI/ML Pipeline
Technical leaders and individual contributors have a role in each of these pillars.
[Same pipeline diagram, annotated with the pillars: Human Agency and Oversight, Prevention of Harm, Fairness, Social and Environmental Well-Being, Privacy and Data Governance, Accountability, and Technical Robustness and Safety. Failure Analysis, Fairness Analysis, and Impact Analysis hang off the feedback loop.]
10. Technical Pillars of Trustworthy AI
[Section divider: the pillars overview repeats. This section: How often does the model fail, and what is the impact? (Failure Analysis)]
11. Failure Analysis
1. Find the cell in the confusion matrix that causes the most harm to the least advantaged group.
2. Analyze rates and outcomes for that cell.

Cancer Screening:
|                      | Predicted Cancer           | Predicted No Cancer           |
|----------------------|----------------------------|-------------------------------|
| Has Cancer           | Has Cancer, More Screening | Has Cancer and Does Not Know  |
| Does Not Have Cancer | No Cancer, More Screening  | No Cancer, No Extra Screening |

Fraud Screening:
|                    | Predicted Fraud          | Predicted No Fraud                |
|--------------------|--------------------------|-----------------------------------|
| Fraudulent Account | Audit, Model Makes $     | Fraud and No Audit, Model Loses $ |
| Honest Account     | No Fraud, Customer Audit | No Fraud, No Audit                |

Pillars: Fairness, Prevention of Harm
12. Aequitas Fairness Tree
[Fairness tree diagram. Guiding questions: Is being predicted positive punitive or assistive? Can you intervene with most people or just a subset? Which group is harmed most by mistakes?
• Punitive interventions (Everyone vs. People who get the intervention vs. People who do not): compare # False Positives / Group Size, the False Discovery Rate (FDR), or the False Positive Rate.
• Assistive interventions (Everyone vs. People Not Assisted vs. People with Actual Need): compare # False Negatives / Group Size, the False Omission Rate, the False Negative Rate, or the True Positive Rate (Recall).]
Fairness Tree: Data Science and Public Policy, Carnegie Mellon University
http://www.datasciencepublicpolicy.org/our-work/tools-guides/aequitas/
Pillars: Accountability, Technical Robustness and Safety
13. Failure Analysis: Pre-Deployment
• Failure analysis is often ad-hoc and depends heavily on
the data sources available.
• e.g. We may not know how many cancers human screeners
miss.
• Deployment should include automating failure analysis.
• Deployment should include plans for cadence of failure
analysis.
Pillars: Accountability, Technical Robustness and Safety
14. Tools for Failure Analysis
• Every classification model can produce the statistics listed in the fairness tree (e.g., via sklearn.metrics; a sketch follows below).
• It is up to the modeling team to decide which statistics
are the most important and to display them in a way that
communicates impact to stakeholders.
• Deciding on a set of metrics that should be monitored
post-deployment is part of the analysis.
• Once the analysis is done, it should be automated so it
can be re-done at regular intervals. These scripts are
usually tailored to the business problem.
• Amazon SageMaker Clarify has a nice set of tools for calculating and displaying these statistics.
Pillars: Accountability, Technical Robustness and Safety
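As a concrete illustration, here is a minimal sketch of the fairness-tree statistics computed from sklearn's confusion matrix. The function and variable names are placeholders, and zero-division guards are omitted for brevity.

```python
from sklearn.metrics import confusion_matrix

def fairness_tree_stats(y_true, y_pred):
    """Compute the fairness-tree statistics for one group's labels and predictions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    group_size = tn + fp + fn + tp
    return {
        "fp_over_group_size": fp / group_size,  # # False Positives / Group Size
        "fn_over_group_size": fn / group_size,  # # False Negatives / Group Size
        "fdr": fp / (fp + tp),                  # False Discovery Rate
        "fpr": fp / (fp + tn),                  # False Positive Rate
        "for": fn / (fn + tn),                  # False Omission Rate
        "fnr": fn / (fn + tp),                  # False Negative Rate
        "recall": tp / (tp + fn),               # True Positive Rate (Recall)
    }
```

Which of these numbers matters most, and how they should be displayed for stakeholders, is exactly the judgment call the modeling team has to make.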
15. Failure Analysis Depends on Good Data
[Same pipeline diagram, with Failure Analysis, Fairness Analysis, and Impact Analysis attached: failure analysis is only as good as the training data, scoring data, operator reviews, and feedback it draws on.]
Pillars: Accountability, Technical Robustness and Safety
16. "Everyone wants to do the model work, not the data work": Data Cascades in
High-Stakes AI,
Nithya Sambasivan and Shivani Kapania and Hannah Highfill and Diana Akrong and
Praveen Kumar Paritosh and Lora Mois Aroyo
(2021)
17. Technical Pillars of Trustworthy AI
[Section divider: the pillars overview repeats. This section: Are model failures the same for everyone? (Fairness Analysis)]
18. Fairness Analysis
1. Focus on the cell where the most harm occurs.
2. Compare performance for underrepresented and/or unprivileged groups. (A sketch of this comparison follows the table.)

The Fraud Screening outcome table is repeated for Group A, Group B, and Group C:
|                    | Predicted Fraud          | Predicted No Fraud                |
|--------------------|--------------------------|-----------------------------------|
| Fraudulent Account | Audit, Model Makes $     | Fraud and No Audit, Model Loses $ |
| Honest Account     | No Fraud, Customer Audit | No Fraud, No Audit                |

Pillars: Fairness, Prevention of Harm
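A minimal sketch of that per-group comparison, assuming a pandas DataFrame with hypothetical columns group, actual, and predicted. Here the harm cell is fraud the model missed (no audit, model loses money), so we compare False Negative Rates across groups.

```python
import pandas as pd

def fnr_by_group(df: pd.DataFrame) -> pd.Series:
    """False Negative Rate per group, among accounts that are actually fraudulent."""
    fraud = df[df["actual"] == 1]
    return (fraud["predicted"] == 0).groupby(fraud["group"]).mean()

# Usage (hypothetical data):
# fnr = fnr_by_group(scored_accounts)
# print(fnr / fnr.min())  # disparity ratio relative to the best-served group
```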
19. Aequitas Fairness Tree
[Fairness tree diagram, now with parity checks between groups. Guiding questions: Is being predicted positive punitive or assistive? Can you intervene with most people or just a subset? Which group is harmed most by mistakes?
• Punitive interventions (Everyone vs. People who get the intervention vs. People who do not): FP/Group-Size Parity (# False Positives / Group Size), FDR Parity (False Discovery Rate), FPR Parity (False Positive Rate).
• Assistive interventions (Everyone vs. People Not Assisted vs. People with Actual Need): FN/Group-Size Parity (# False Negatives / Group Size), FOR Parity (False Omission Rate), FNR Parity (False Negative Rate), Recall Parity (True Positive Rate).]
Fairness Tree: Data Science and Public Policy, Carnegie Mellon University
http://www.datasciencepublicpolicy.org/our-work/tools-guides/aequitas/
Pillars: Fairness, Prevention of Harm
20. Technical Pillars of Trustworthy AI
[Section divider: the pillars overview repeats. This section: How do we know the model is failing? (Failure Monitoring)]
21. Failure Monitoring
[Same pipeline diagram; the feedback paths (Operator reviews Decisions; People affected by Decisions give feedback) are what surface failures.]
Pillars: Human Agency and Oversight, Prevention of Harm
22. How do we know the model is failing?
• What pipelines exist for people to give feedback on model
performance?
• Experts/Operators who are using the models.
• People who are affected by the model.
• How do we automate monitoring of the most critical model performance metrics? (A minimal sketch follows below.)
• What outside data is available as a check against our assumptions
about the model?
• There are no great tools for checking failures.
• Cloud providers do offer some tools if you are using their cloud (e.g. AWS,
Azure, and Google).
Pillars: Human Agency and Oversight, Prevention of Harm
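One minimal monitoring sketch along these lines: compare each critical metric against the threshold agreed during failure analysis and surface an alert through whatever channel your team uses. All names here are placeholders.

```python
def check_metric(name: str, value: float, floor: float, notify) -> bool:
    """Alert (and return True) when a monitored metric falls below its agreed floor."""
    if value < floor:
        notify(f"Model metric '{name}' = {value:.3f} is below its floor of {floor:.3f}")
        return True
    return False

# Usage (hypothetical): recall measured on this week's labeled feedback,
# checked against the floor set during pre-deployment failure analysis.
# check_metric("recall", weekly_recall, floor=0.85, notify=print)
```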
23. Lowest Hanging Fruit: Automate All Data Pipelines
| Tool               | Scope                              | Tests                                                     | Stack                                                                      |
|--------------------|------------------------------------|-----------------------------------------------------------|----------------------------------------------------------------------------|
| dbt                | Runs code pipeline and data checks | Built-in tests and SQL-based user-defined tests           | SQL-based, with open-source dbt Core and a subscription-based cloud option |
| Soda and SodaCL    | Data checks only                   | Built-in tests and SQL-based user-defined tests           | SQL-based, with open-source Soda Core and subscription-based Soda Cloud    |
| great-expectations | Data checks only                   | Built-in tests for Python                                 | Python-based                                                               |
| deequ              | Data checks only                   | Built-in tests and Spark/PySpark-based user-defined tests | Spark/PySpark-based                                                        |
Pillars: Human Agency and Oversight, Prevention of Harm
24. Hardening Pipelines: Obvious Tests for Tabular Data
• Uniqueness: “This column/combination of columns
should be unique by row.”
• Correctness: “Only these values allowed in this column.”
• Missingness: “These columns should be populated for
X% of rows.”
• Range: “Nothing bigger/smaller than [a,b] should be in
this column.”
• … You get the picture. (A sketch of these checks follows below.)
Pillars: Human Agency and Oversight, Prevention of Harm
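Here is what those checks can look like written against great-expectations' classic pandas-backed interface; the exact API varies across versions, and the column names, set values, and bounds are all hypothetical.

```python
import great_expectations as ge

df = ge.read_csv("accounts.csv")  # a pandas DataFrame wrapped with expectation methods

df.expect_column_values_to_be_unique("account_id")                         # uniqueness
df.expect_column_values_to_be_in_set("status", ["open", "closed"])         # correctness
df.expect_column_values_to_not_be_null("balance", mostly=0.99)             # missingness: >= 99% populated
df.expect_column_values_to_be_between("age", min_value=18, max_value=120)  # range
```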
25. Hardening Pipelines: Less Obvious Tests for Tabular Data
• Feature Drift: Are distributions of inputs changing?
• Model Drift: Are the model predictions changing?
• Kolmogorov-Smirnov: What is the probability of
observing the data we see today (or something weirder)
compared to what we think the data should look like?
• A p-value threshold of 0.05 means this test alarms 5% of the time even when nothing has changed. Use False Discovery Rate control to separate true errors from noise.
• KL Divergence (Population Stability Index)
• Sensitive to the bins you pick.
• These tests are sensitive to outliers, and outliers happen all the time. (A sketch of both tests follows below.)
Pillars: Human Agency and Oversight, Prevention of Harm
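A sketch of both drift tests: scipy's two-sample Kolmogorov-Smirnov test plus a hand-rolled Population Stability Index. The bin count, the clip floor, and the 0.2 rule of thumb are modeling choices, not fixed rules.

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline sample and today's sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) in empty bins
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Usage (hypothetical feature arrays):
# stat, p_value = ks_2samp(training_feature, todays_feature)  # alarms ~5% of the time at p < 0.05
# drift = psi(training_feature, todays_feature)               # common rule of thumb: > 0.2 is a big shift
```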
26. Data Pipelines Hardened? Automate the Workflow
| Tool               | Purpose                                                                     | Stack        |
|--------------------|-----------------------------------------------------------------------------|--------------|
| model-card-toolkit | Open-source system for creating model cards                                 | Python-based |
| Metaflow           | Runs code pipeline and data checks; developed specifically for data science | Python-based |
| deepchecks         | Data checks and performance checks for the full model pipeline              | Python-based |
| Luigi              | Full-featured; lets you automate all of your scripts for everything         | Python-based |
| Airflow            | Like Luigi, but automates some of the more tedious parts                    | Python-based |

DAG: Directed Acyclic Graph
• A collection of tasks and their dependencies.
• Directed: each task that requires output from previous tasks knows its own dependencies.
• Acyclic: a graph term meaning there are no cycles; no task can depend, directly or indirectly, on its own output.
(A minimal Airflow sketch follows below.)
Model Card
• A simplified explanation of a model's inputs, outputs, and assumptions.
Pillars: Human Agency and Oversight, Prevention of Harm
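A minimal Airflow sketch of such a workflow: data checks gate scoring, and the failure statistics refresh on the same schedule. The task functions are placeholders for your own scripts.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_data_checks(): ...        # e.g., the tabular tests from the previous slides
def score_new_data(): ...         # scoring run, only reached if the checks pass
def refresh_failure_stats(): ...  # re-run the automated failure analysis

with DAG(dag_id="scoring_pipeline", start_date=datetime(2022, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    checks = PythonOperator(task_id="data_checks", python_callable=run_data_checks)
    scoring = PythonOperator(task_id="score", python_callable=score_new_data)
    stats = PythonOperator(task_id="failure_stats", python_callable=refresh_failure_stats)
    checks >> scoring >> stats  # directed and acyclic: each task waits on its dependencies
```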
27. Technical Pillars of Trustworthy AI
[Section divider: the pillars overview repeats. This section: How often does the model fail, and what is the impact? (Impact Analysis)]
28. Impact Analysis: AI That Works for Everyone
• The least technical part of AI Ethics.
• Arguably the part of AI Ethics that most needs technical
assistance.
• Part of the initial project plan.
• Local Impacts: This model’s impact on its stakeholders.
• Social Impacts: How does this model contribute to AI’s
larger issues?
• Mitigation Analysis: What can we do within the scope of
this project to mitigate negative impacts?
Pillars: Social and Environmental Well-Being; Privacy and Data Governance
29. Local Impact of an AI Model
• Does this model improve working conditions for the people
who use it?
• e.g. An AI model that requires a lot of data input from nurses and
doctors may increase their job responsibilities without
compensating or rewarding them for extra effort.
• Does this model improve outcomes for people affected by the
model?
• e.g. A fraud detection model may speed payment for most
individuals.
• Does this model make things worse for some individuals?
• e.g. A fraud detection model may speed payment for most
individuals and slow payment for others to an unacceptable level.
• Are we collecting only the data we need? Are we keeping that
data safe?
• e.g. Does my word game really need my location?
Pillars: Social and Environmental Well-Being; Privacy and Data Governance
30. Social Impact of an AI Model
• Environmental cost of an AI model is non-negligible:
https://openai.com/blog/ai-and-compute/
• We need efficient computation, and that is a technical problem.
• Many AI models profit from free or underpaid labor:
https://www.wired.com/story/foundations-ai-riddled-errors/
• Labeling software should be good software.
• Large-scale adoption of AI models has other effects.
• Never mind the trolley problem: suppose 10% of the cars on the road are self-driving. Now, suppose there's a network outage during a heavy traffic period.
Pillars: Social and Environmental Well-Being; Privacy and Data Governance
31. AI Ethics and Model Development
• Pre-Development
• Impact Analysis: Who will use the model and how?
• Failure Analysis: What is the most impactful failure? What is an acceptable
level of failure?
• Fairness Analysis: What are the underrepresented/unprivileged groups?
• Failure Monitoring: What development is needed for Human-to-Model
feedback?
• Model Development
• Design and hardening of data pipelines, including privacy.
• Model’s ability to meet failure thresholds.
• Deployment
• Does the model meet criteria set during pre-development?
• Are the requirements in place?
32. Ethical AI is Good AI and Good AI is Ethical AI
• Ethical AI knows when it fails and the impact of those failures.
• Ethical AI fails in the same way for everyone.
• Ethical AI is monitored for failures and has strong feedback loops that
surface problems quickly.
• Ethical AI is designed for positive impact on the communities where it
is implemented and for society as a whole.
Who doesn’t want that?
Ellis-Lee, Mia (2018). "Accessible Design is Good Design & Good Design is Accessible Design." Flywheel hosted blog. https://www.flywheelstrategic.com/thinking/post/flywheel-blog/2018/04/06/accessible-design-is-good-design-good-design-is-accessible-design
Editor's notes
Deloitte's Trustworthy AI Framework: https://www2.deloitte.com/us/en/pages/deloitte-analytics/solutions/ethics-of-ai-framework.html, https://www.technologyreview.com/2020/03/25/950291/trustworthy-ai-is-a-framework-to-help-manage-unique-risk/
US ai.gov: https://www.ai.gov/strategic-pillars/advancing-trustworthy-ai/
OECD Publishing (2021) “Trustworthy AI: A Framework to Compare Implementation Tools for Trustworthy AI Systems”. https://www.oecd.org/science/tools-for-trustworthy-ai-008232ec-en.htm
Fang, Huanming, Hui Miao (2020) “Introducing the Model Card Toolkit for Easier Model Transparency and Reporting.” Google AI Blog. https://ai.googleblog.com/2020/07/introducing-model-card-toolkit-for.html
Tagliabue, J., Tuulos, V., Greco, C. and Dave, V., 2021. DAG Card is the new Model Card. arXiv preprint arXiv:2110.13601. https://arxiv.org/pdf/2110.13601.pdf
“Where State Farm Sees ‘a Lot of Fraud,’ Black Customers See Discrimination” https://www.nytimes.com/2022/03/18/business/state-farm-fraud-black-customers.html
“Aiming for truth, fairness, and equity in your company’s use of AI” https://www.ftc.gov/business-guidance/blog/2021/04/aiming-truth-fairness-equity-your-companys-use-ai
“Weighing Big Tech’s Promise to Black America” https://www.wired.com/story/big-techs-promise-to-black-america/
Self-driving cars will make you forget how to drive: Javadi, AH., Emo, B., Howard, L. et al. Hippocampal and prefrontal processing of network topology to simulate the future. Nat Commun 8, 14652 (2017). https://doi.org/10.1038/ncomms14652