This is my attempt at an introduction to data ethics for mathematicians. Mathematicians increasingly need to deal with these kinds of issues, but we don't have the tradition of ethics training that other disciplines have.
I welcome comments on how to improve these slides. Did I miss any salient points? Do you want to offer a different perspective on any of these? Do you want to offer any counterpoints? (Please e-mail me directly with comments and suggestions.)
Eventually, I hope to develop these slides further into an article for a venue aimed at mathematical scientists, and of course I would love to have knowledgeable coauthors who can offer a different perspective from mine.
1. More generally: A discussion of ethics for data, research,
and publishing
Mason A. Porter (@masonporter)
Department of Mathematics
UCLA
2. § It’s important.
§ People need to be able to replicate our work.
§ Making sure their own code is correct
§ Natural self-correction in science (and ability to understand precisely every choice we
make in our work)
§ Not traditionally part of mathematical training, but increasingly we are using social
data — including potentially personal data — in our research
3. § We use a lot more real data nowadays, and in particular this includes a lot of human
(and animal) data.
§ Much less a part of the research (and thus training) tradition in mathematics than in other
disciplines
§ Other disciplines have thought a lot more about ethics than mathematics has
§ In many cases, unfortunately, because they’ve messed up the ethics historically,
sometimes substantially, and we need to learn from the best practices they’ve developed
"Look, lady. Just because my grandfather didn't rape the
environment and exploit the workers doesn't make me a
peasant. And it's not that he didn't want to rape the
environment and exploit the workers; I'm sure he did. It's
just that as a barber, he didn't have that much opportunity."
– Roger Cobb [Steve Martin], All of Me (1984)
Thanks to Peter Mucha for
the quote suggestion (and
an excuse to allude to this
movie)
4. § Be honest and fair (obviously)
§ Design ethically thoughtful research
§ Explain your decisions to others
§ [Points 2 and 3 taken from slides by Matt Salganik]
5. FOUR PRINCIPLES
§ Respect for persons
§ (Note: Animal research also has thorny ethical issues!)
§ Beneficence
§ Justice
§ Respect for Law and Public Interest
How do you balance these four principles?
7. § If you are working with personal data, you need to check with your Institutional
Review Board (IRB) to ensure that you are doing the work in an ethical way.
§ They may tell you that you don’t need to submit a formal application, or they may tell you
that you do. Let them know briefly what data you have access to (or plan to acquire, and
how) and what you plan to do with it.
§ Different IRBs of course can rule differently.
§ Rules differ across countries
§ Human data versus animal data
§ In these slides, I have human data in mind, but animal data and its acquisition of course also has
major ethical considerations.
§ Look through UCLA’s website for the Office of the Human Research Protection
Program (OHRPP): http://ora.research.ucla.edu/ohrpp/Pages/OHRPPHome.aspx
§ “IRB is a floor, not a ceiling” (from Matt Salganik’s slides)
8. § A well-known, heavily-used set of courses:
https://www.citiprogram.org/index.cfm?pageID=86
§ I found this from a link from UCLA’s OHRPP website.
§ Several years ago, I did some IRB training. (When preparing these slides, I couldn’t
find the specific online course I took.) In addition to helping to think about issues, if
something does go wrong, you do (from a practical point of view) want to be able to
say that you have appropriate ethics training.
§ Note: The training that is required/expected/available differs substantially across
countries.
§ Example: In my experience, the UK appears to be less stringent about human data
than the US, but more stringent about non-human animals.
12. § The more your research has the potential to violate personal privacy, the greater
the potential benefit to humanity needs to be.
13. § Informed Consent
§ Understanding and managing informational risk
§ Privacy
§ Making decisions in the face of uncertainty
§ Other notes
§ Put yourself in everyone else’s shoes
§ Think of research ethics as continuous, not discrete (sliding scale)
Bullet points from Matt Salganik’s slides
14. § You must provide sufficient (and precise) detail for people to be able to replicate
your work!
§ Try to include it in your papers, but people are human, so if somebody e-mails you
to ask for a clarification, a copy of code (even if poorly commented), or something
else, you should respond and send it to them, provided it's something that you have
the right to send them.
15. § To the extent possible, you should publish your data and usable (and well-commented) code
along with your work.
§ There can be tension between these ideals and issues of personal privacy, nondisclosure agreements, and
so on.
§ If using synthetic data, publish code to generate the data and the generated examples that you
used in your paper.
§ Supplementary material for the paper on the journal website, Github, Figshare, and other venues
§ Likely relevant for literally all of you
§ E.g., if you are doing any numerical computations at all, this is desirable
§ E.g., adjacency matrices for graphs in a definition–theorem–proof paper are also useful for readers (though
the level of necessity depends on how large the graphs are)
§ Admission: I have been trying to get better about this over the years. I am very good about
responding to e-mail queries, and the goal (though there exist practical considerations) is to be
precise about all of my steps and to put as much online as feasible.
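As a minimal sketch of the synthetic-data point above (the function name and parameters are illustrative, not from any particular paper): publishing the generating code together with a fixed random seed lets readers regenerate exactly the synthetic data that a paper used.

```python
import random

def generate_synthetic_network(num_nodes, edge_prob, seed):
    """Generate an Erdos-Renyi G(n, p) edge list reproducibly.

    Publishing this function and the seed value used in the paper
    lets readers regenerate exactly the same synthetic data set.
    """
    rng = random.Random(seed)  # fixed seed -> identical output every run
    edges = []
    for i in range(num_nodes):
        for j in range(i + 1, num_nodes):
            if rng.random() < edge_prob:
                edges.append((i, j))
    return edges

# The same seed always yields the same synthetic data set.
edges_a = generate_synthetic_network(50, 0.1, seed=42)
edges_b = generate_synthetic_network(50, 0.1, seed=42)
assert edges_a == edges_b
```

Without a stated seed (or with a seed drawn from the system clock), the "same" generating code produces different data on every run, and exact replication of the paper's figures becomes impossible.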
16. § For empirical data, if you have permission to post something (e.g., does the data
“belong” to somebody else?) and it doesn’t invade privacy, you should post it
because that promotes good science.
17. § Alternative name: “replication crisis”
§ https://en.wikipedia.org/wiki/Replication_crisis
Take a look, e.g., at the work of Victoria Stodden:
http://web.stanford.edu/~vcs/
18. § Be explicit about anything you did, so that others can know what choices you made
and evaluate whether they think it is the best procedure for your analysis
§ E.g., sampling biases change properties of data
§ There are many reasons that one makes choices, so it’s not that you shouldn’t make
them, but it’s part of your scientific procedure, so tell people exactly what you did
so they know exactly what these choices were. (They may want to make different
choices.)
§ “Manipulating” is a loaded word; here I mean it in a neutral way (i.e., “changes”),
rather than in a negative one.
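A toy illustration (invented numbers, not from the slides) of how a sampling choice changes measured properties of data: discovering nodes of a network by following edges oversamples high-degree nodes, which inflates the measured mean degree relative to sampling nodes uniformly.

```python
import random

def mean_degree_uniform(degrees, rng, num_samples):
    """Mean degree when sampling nodes uniformly at random."""
    return sum(rng.choice(degrees) for _ in range(num_samples)) / num_samples

def mean_degree_edge_biased(degrees, rng, num_samples):
    """Mean degree when sampling nodes proportionally to their degree,
    as happens when one discovers nodes by following edges."""
    samples = rng.choices(degrees, weights=degrees, k=num_samples)
    return sum(samples) / num_samples

rng = random.Random(0)
# A toy degree sequence: a few hubs and many low-degree nodes.
degrees = [50] * 5 + [2] * 95

uniform = mean_degree_uniform(degrees, rng, 10_000)
biased = mean_degree_edge_biased(degrees, rng, 10_000)
# Edge-following discovery systematically inflates the mean degree.
assert biased > uniform
```

Neither estimate is "wrong" on its own; the point is that the sampling procedure is part of the analysis, so readers need to know exactly which one you used.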
19. § When are things actually “anonymous”?
§ Is “full” anonymization even possible?
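One reason that "full" anonymization is so hard is linkage: quasi-identifiers that remain after names are removed can often be joined against a public data set. The following toy sketch (all records are invented) illustrates the idea.

```python
# A toy linkage attack: an "anonymized" data set with names removed can
# still be re-identified by joining quasi-identifiers (here: ZIP code,
# birth year, and gender) against a public record. All data is invented.
anonymized_health_records = [
    {"zip": "90095", "birth_year": 1980, "gender": "F", "diagnosis": "flu"},
    {"zip": "90024", "birth_year": 1975, "gender": "M", "diagnosis": "asthma"},
]

public_voter_roll = [
    {"name": "Alice Example", "zip": "90095", "birth_year": 1980, "gender": "F"},
    {"name": "Bob Example", "zip": "90024", "birth_year": 1975, "gender": "M"},
]

def link_records(anonymized, public, keys=("zip", "birth_year", "gender")):
    """Re-identify anonymized rows whose quasi-identifiers match
    exactly one row of a public data set."""
    reidentified = []
    for row in anonymized:
        matches = [p for p in public
                   if all(p[k] == row[k] for k in keys)]
        if len(matches) == 1:  # a unique match pins down the person
            reidentified.append({"name": matches[0]["name"], **row})
    return reidentified

linked = link_records(anonymized_health_records, public_voter_roll)
# Both "anonymous" records are re-identified by name.
assert len(linked) == 2
```

Removing names is therefore not the same as anonymizing: any combination of attributes that is rare in the population can serve as an identifier.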
23. § Acknowledge all sources of data
§ Include precisely how you got the data and how somebody else can get the
data (e.g., whom should they contact?), especially if there is a reason that you are unable
to post the data itself
§ Be generous when acknowledging people in papers: useful discussions, ideas, etc.
§ Be fair and appropriate when discussing work by authors in past papers
§ You are standing on the shoulders of giants. :) Give credit where it is due.
§ Difference between somebody “showing” something in a past paper versus “reporting” it.
The former is a statement of verifying validity; the latter is a historical fact (assuming what
you write is accurate).
24. § There can be complications in posting data to the public, no matter how well-
intentioned.
§ This is a great data set to advance several avenues of research in network science, and my goal
is for people to be able to do that.
§ Learning the hard way
§ Urgently arranging a phone meeting with the head of Facebook’s Data Science team
§ An important learning experience for me
§ A small chapter in the long story of data privacy
§ A blog entry that is very critical of me (though this differs from my side of the story):
http://www.michaelzimmer.org/2011/02/15/facebook-data-of-1-2-million-users-from-2005-released/
§ Led to my learning much more about these issues (though under very stressful circumstances),
a page about research using human data in Oxford’s Mathematical Institute, etc.
§ https://www.maths.ox.ac.uk/members/policies/data-protection/research-using-data-involving-humans
25. § Research in collaboration with companies or government: What is it OK to include
in a publication or post online?
§ Tension between open data and personal privacy
§ Terms-of-service agreements and nondisclosure agreements
§ In what sense can you replicate work if you can’t post everything?
§ “Softer” replication: do you observe similar phenomena in circumstances that have some
similarities but are not the same?
§ E.g., human behavior in different social networks
26. § See, e.g., the discussion around this paper:
http://science.sciencemag.org/content/early/2015/05/06/science.aaa1160.full
§ Eytan Bakshy, Solomon Messing, & Lada Adamic, Exposure to ideologically diverse news
and opinion on Facebook, Science, 2015
§ They can’t tell us Facebook’s sampling algorithm, so how are we as scientists going to go
about “replicating” their work?
§ Note: Do their insights apply to other online social networks? One should be able to do a weaker
form of replication such that the most interesting qualitative results are not merely a property of
specifics on Facebook
§ Also: What about this work being public versus being entirely within Facebook and
us never seeing any of it?
27. § A. D. I. Kramer, J. E. Guillory, and J. T. Hancock. Experimental evidence of
massive-scale emotional contagion through social networks. Proceedings of the National
Academy of Sciences of the United States of America, 111(24):8788–8790, 2014
§ Look up articles on this one
§ Experiments on Facebook with changes in people’s feeds
§ Also: What about this work being public versus being entirely within Facebook and
us never seeing any of it?
Note: Academic researchers have IRBs that need to approve a study before it starts, whereas
Facebook has a publication review board that approves publication of a study after it has already
been done. Thus, we know that this study occurred because FB concluded that it could be published.
We don't know what is done with our data by FB and other companies when it doesn't get
published.
29. § You can apply this comment generally to “data science” if you like, though the
property of connectivity in networks provides substantial additional issues beyond
just data science (and “Big Data”, etc.).
30. § Short essay by Johan Ugander (Management Science & Engineering, Stanford):
https://medium.com/@jugander/truth-lies-and-an-ethics-of-personalization-e4ccfa7f2b84#.rzap3hm70
§ As an example, he discusses “Cambridge Analytica, identified by the NY Times as
the hired guns behind Trump’s online targeting.”
§ Alexander Nix (CEO of CA) gave the following example in a video. Quoting
Ugander's essay: “if you own a private beach, he notes, you'd have more success
keeping people off your beach by putting up a “Warning: sharks beyond this point”
sign vs. a “private property” sign. The problem is: he recommends this strategy —
and personalized versions of it — without any consideration to whether there
actually are any sharks, advocating “behavioral communication” that is completely
detached from any truth about reality. In fewer words: crafting lies, and then
targeting them.”
31. § http://callingbullshit.org
§ Full title: “Calling Bullshit in the Age of Big Data”
§ A course designed by Carl Bergstrom and Jevin West (University of Washington)
§ Excellent syllabus and reading materials
§ Various parts of it relate to ethics, and they also have a unit directly about ethics:
http://callingbullshit.org/syllabus.html#Ethics
32. § Targeted advertising (different trailers for people of different races) for the movie
“Straight Outta Compton”:
http://www.businessinsider.com/why-straight-outta-compton-had-different-trailers-for-people-of-different-races
§ Different levels of prior familiarity with the gangsta-rap pioneers N.W.A. (Ice Cube, Dr. Dre, etc.)
§ Papers by Arvind Narayanan and collaborators, including:
§ http://senglehardt.com/papers/ccs16_online_tracking.pdf
§ https://5harad.com/papers/twivacy.pdf
§ J. Su et al.,“De-anonymizing Web Browsing Data with Social Networks”, 2016
§ C. Kanich et al.,“Spamalytics: An Empirical Analysis of Spam Marketing Conversion”
(2008):
§ http://www.umiacs.umd.edu/~tdumitra/courses/ENEE757/Fall14/papers/Kanich08.pdf
§ B. Markines et al.,“Social spam detection” (2009):
§ http://dl.acm.org/citation.cfm?doid=1531914.1531924
33. § “Tastes, Ties, and Time” Facebook data set
§ One discussion about the controversy associated with this data set:
http://www.chronicle.com/article/Harvards-Privacy-Meltdown/128166/
§ Research by Sinan Aral and collaborators on manipulation of voting on social
media sites
§ One discussion: https://techcrunch.com/2013/08/11/reddit-science-herd/
34. § Mathematicians are relatively new to using human data, and we don't yet have the
ethics training to help us deal with the thorny issues
§ Learn from the best practices (and past mistakes) from other disciplines
§ As in those other disciplines, mathematicians should be getting ethics training
§ Read about — and think about and discuss — various controversies and other
studies. We all may set our bars in different places, but we need to do so
conscientiously.
§ It’s a sliding bar: the more potential for invasion of personal privacy, the more valuable
the potential outcome has to be for humanity
§ IRB approval is only a lower bound
35. § While I have more training and experience with these issues than most
mathematicians, I am very much an amateur on data ethics compared to people
from the social and human sciences, for whom this is a standard part of the training
from the beginning of their education.
§ With this in mind, please contact me with any suggestions on these slides. Did I
miss any salient points? Do you disagree with any of the discussed points? Are
there any other studies that are especially crucial to bring up?
§ Eventually, I hope to develop these slides further into an article for a venue in the
mathematical sciences. Let me know if you are interested in being involved in
writing this article.
36. § Several suggestions for resources from Johan Ugander
§ Several comments on my slides and suggestions for resources from Peter Mucha
§ Website from Matt Salganik’s class on Computational Social Science (Fall 2016):
http://www.princeton.edu/~mjs3/soc596_f2016/
§ I drew some material and ideas from his slides on ethics
§ It would be pretty ironic if I plagiarized these slides, wouldn’t it?