The REAL Impact of Big Data on Privacy

946 views

The awesome promise of Big Data is tempered by the need to protect personal information. Data scientists must expertly navigate the legislative waters and acquire the skills to protect privacy and security. This talk provides enterprise leaders with answers and suggests questions to ask when the time comes to consider the vast opportunities offered by big data.

Published in: Technology

  1. LET’S GET REAL: THE REAL IMPACT OF BIG DATA (Claudiu Popa, @datarisk)
  2. THE DEFINITION
     • Short for big data analytics
     • The trend towards larger data sets that allow correlations to be found to spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions
  3. THE BUZZWORD
     • Marketing term: increasingly used to evoke computational power and large-scale algorithms that can turn mountains of data into usable information and intelligent business decisions
     • Modern-day version of Big Brother: diverse transactional data that can impact privacy
  5. The Promise of Big Data: Sexiest Job of the 21st Century?
     • Predicting crime/Fraud detection
     • Timing investments
     • Mining astro data for E.T.
     • Productivity loss of stressful travel
     • Call centre analytics
     • Dynamic ticket pricing
     • Guessing demand for better service
     • E-com & customer service
     • Capture recurring revenue
     • Using sensor data for efficiency
     • Barack Obama’s campaign
     • Medical research/treatment
     • Reduced pushback campaigns
     • Crowdfund / Crowdsource
     • Open pit data mining
     • Mapping the sex trade
     • Disease eradication
     • Tracking endangered species
  6. Big benefits
     • Source: Fast Company. Photos: Jeff Brown. Illustrations: Justin Mezzell
  7. Can’t we agree on a definition?
     • Source: 2012 IBM Big Data @Work Survey (1,144 professionals in 95 countries/26 industries)
  8. More than Mining: The Reality of Big Data
  9. More fun than mining for BitCoins
     • Full graph back to the 1500s saved at http://Popa.ca/GraphInfo
  10. Some get it...
  11. others, not so much.
  12. Seeking clarity? “Given the signal-to-noise ratio, big data itself appears to be telling us that working raw numbers at ever greater scale to force out some answers may not be the real achievement here. Instead, it is perhaps in the elegance with which it steers us towards the right questions to ask.”
  13. Big data is often personal
     • Source: 2012 IBM Big Data @Work Survey (1,144 professionals in 95 countries/26 industries)
  14. But social media is a big part of the risk
  15. Do custodians have a responsibility?
     • Google Person Finder
     • Flu Trends
     • Dengue Trends
     • Crisis Maps
     • MOOC metrics
     • Sex trade tracking
  16. Big data growing big time, and fast
     • Source #1: www.sourcelink.com/blog/guest-author-series/2012/08/13/if-you-build-it-they-will-come
  17. The 3 dimensions of Big Data
     • volume: is there enough data? Too much?
     • variety: how diverse is the data?
     • velocity: how rapidly must it be processed?
  18. The 6 new aspects of Big Data
     • validity: what is the threshold for applicability?
     • verifiability: will we continue to trust the input source?
     • variability: what’s the tolerance for variation?
     • veracity: is it useful for predicting business value?
     • value: does it offer valuable insight?
     • visualization: can it be meaningfully represented?
  19. Big Value
     • silos: are departmental silos good or bad?
     • timely exploitation: volume and velocity = complexity
     • system integration: the real enablers need to ‘get’ privacy
  20. “Visualization is the most likely V to save big data by making personalized information available to owners and not custodians in natural, frictionless ways.”
  21. Full disclosure: why I care
     • integrity: if you can’t trust the data, immense effort and expense are lost
     • confidentiality: the impact of potential breaches is vastly increased
     • availability: input availability impacts value and velocity in particular
     • privacy: impacted by any breach of data sets/clusters with PII
     • Hint: it’s not because Gartner said the space is worth up to $3.8 trillion
  22. The First 4 Vs: Where’s the Privacy?
     • Source: Data Science Central
  23. Too valuable to discard?
     • Source: http://jeffhurtblog.com/2012/07/20/three-vs-of-big-data-as-applied-conferences/
  24. Visualizing Privacy in Big Data
  25. Are we there yet?
     • Utah Data Center will handle yottabytes of data
     • TB: 10^12 bytes, Exa: 10^18, Zetta: 10^21, Yotta: 10^24
     • Each Boeing jet engine creates 20 TB/hr
     • Facebook grows by 500 TB/day
     • Ref: Visual.ly infographic on big data
  26. Scale
     • PB = 10^15: all printed material in existence in 1995, or ONE SECOND of data generated at CERN/LHC
     • 1 exa: all data created on the Internet each day
     • SKA will collect ‘a few exabytes’ of data each day, processing 10 PB/hr and producing up to 100x CERN’s output each year
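To make the prefixes on the two slides above concrete, here is a small Python sketch that turns the quoted figures into bytes. The rates (500 TB/day for Facebook, 20 TB/hr per jet engine) are the ones quoted on the slides; decimal (SI) prefixes are assumed, and the helper names `to_bytes` and `days_to_fill` are made up for illustration.

```python
# Decimal (SI) prefixes, as listed on the slide: TB = 10^12 ... YB = 10^24.
UNITS = {"TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21, "YB": 10**24}


def to_bytes(amount, unit):
    """Convert an amount in the given SI unit to bytes."""
    return amount * UNITS[unit]


def days_to_fill(capacity_bytes, rate_bytes_per_day):
    """Days needed to reach a capacity at a constant daily rate (illustrative helper)."""
    return capacity_bytes / rate_bytes_per_day


facebook_per_day = to_bytes(500, "TB")    # growth figure quoted on the slide
engine_per_day = to_bytes(20, "TB") * 24  # 20 TB/hr, assuming round-the-clock operation

days = days_to_fill(to_bytes(1, "YB"), facebook_per_day)
print(f"Days for 500 TB/day to reach one yottabyte: {days:.1e}")
print(f"One jet engine over 24 hours: {engine_per_day // UNITS['TB']} TB")
```

At a constant 500 TB/day, a single yottabyte is about two billion days away, which gives a sense of how far apart the prefixes on these slides really are.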
  27. Not just about processing
     • Using terabyte drives, a yotta centre would be as large as the states of Delaware and Rhode Island
     • Using SDXC cards, it would be as large as the Great Pyramid of Giza
     • NASA already stores 32 PB of climate data
  28. Data without analysis is useless
     • ...except in the hands of the first indiscriminate seller and unscrupulous buyer
     • unless all elements are preserved, the fear is that it will be more difficult to find gold
     • and legislation may force companies to reveal what they hold and what they share
     • Ref: California’s Right to Know bill AB 1291 demands accountability from huge data hoarding firms
  29. Overanalyzed data creates concerns
     • Big Data in medicine often revolves around gene sequencing and biosamples, vital records, and insurance claims. Data use and reuse has created grave concerns about privacy and informed consent.
     • Geotagging and social media fuel the debate over big data privacy. Does informed mean consent?
     • Personalized searches are automatically recorded and mined in every way possible. Crowd wisdom and individual preferences create a climate of unease among web search users.
     • Source: http://journalistsresource.org/studies/economics/business/what-big-data-research-roundup#
  30. Unauthorized Internet scale analysis
     • Internet Census 2012 was an unauthorized deep dive into the security of all Internet-connected devices
     • It infected vulnerable devices with a custom binary and used their processing power to expand the scale of scans
     • It found over 35 million vulnerable devices and millions of others that should not be online at all
     • The research data was analyzed and anonymously published online after the 2-year project was completed
     • 9+ TB collected and analyzed, 52 billion ICMP pings, 180 billion service probes, 71 billion ports tested
     • Source: http://internetcensus2012.bitbucket.org (image: 420,000 Carna Botnet locations)
  31. Observing people through shared data
     • anticipatory systems like Google Now already have a positive impact on individual productivity
     • crowdfunding has a global economic impact and an even bigger innovative footprint
     • crowdsourcing assists with investigations and research, especially as small data is tapped
  32. Social media & major event correlation
     • 20 million geolocated tweets during 4-day event
     • grocery shopping peaks night before hurricane
     • night life up after it
     • Manhattan skew
     • Impact area shows lowest tweet rates
     • signal problem may undermine big data value
     • Source: Rutgers Twitter/Foursquare Sandy Study 2012 http://popa.ca/SandyBigData
  33. Big data’s promise is not in aggregation
     • global data doubles every 2 years, but only 0.5% is ever analyzed
     • strength comes from pooling the data, but value is in individualizing findings
     • how can personal analytics be custom-fitted to benefit individuals without first impacting privacy?
  34. A sobering look at the Java threat
     • Hundreds of millions of devices vulnerable globally
     • 95% unpatched and vulnerable
     • 5% of those patched are still vulnerable to zero-days
     • study shows that 75% are at least 6 months behind
     • but it also shows that the focus isn’t just on one aspect. It’s a massive systemic issue that was allowed to grow into a global threat.
     • Government issued a public recommendation to discontinue the use of Java because it is unsafe
     • Source: 2013 Websense http://popa.ca/JavaSecurityPie
  35. “If you can crunch it, more data means better results. The caveats are that you get proportionately less information by volume and quality tends to decrease over time.”
  36. The Opportunity of Big Data
     • Since data storage will reach a practical asymptotic maximum, we can distribute resources
     • This will help with data quality, input filtering, metrics and statistics, layered privacy filtering, reporting filters, data siloing and segregation, neural net-style learning to maximize efficiency, etc.
  37. Redefining organized, as in crime
     • every year, hundreds of millions of records are siphoned from diverse databases globally
     • SIN/SSNs, credit card data, home addresses all amount to one thing: identities
     • The vast majority of that data has to date gone unexploited, likely due to analytic challenges
  38. IBRF: ID business requirements 1st!
     1. initial focus on customer-centric outcomes
     2. enterprise-wide big data blueprint
     3. get near-term results from existing data
     4. build analytics capabilities on business priorities
     5. create a business case on measurable outcomes
     • Source: 2012 IBM Big Data @Work Survey (1,144 professionals in 95 countries/26 industries)
  39. Who isn’t already mining it?
  40. The picture changes every minute
  41. Responsible Visualization of Big Data
     • Data discovery is ideally positioned to identify PII, create filters & create healthy correlations
     • Data quality visuals are opportunities to form segments or hierarchies & create aggregate logic
     • Storytelling often identifies outliers. Expertly narrate correlations and patterns, de-identify exceptions
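As a rough illustration of the "identify PII, create filters" step on the slide above, the Python sketch below scans free text for a few identifier layouts and redacts them. The patterns are deliberately simplified assumptions (an email shape, the US SSN layout, the Canadian SIN layout) and the function names are hypothetical; real PII discovery needs validation, context and far broader coverage.

```python
import re

# Simplified, illustrative patterns only; not production-grade detectors.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN layout
    "sin": re.compile(r"\b\d{3}[- ]\d{3}[- ]\d{3}\b"),  # Canadian SIN layout
}


def find_pii(text):
    """Return the sorted names of the PII patterns that match the text."""
    return sorted(name for name, rx in PII_PATTERNS.items() if rx.search(text))


def redact(text):
    """Replace every pattern match with a [label] placeholder."""
    for name, rx in PII_PATTERNS.items():
        text = rx.sub(f"[{name}]", text)
    return text


record = "Contact jane@example.com, SSN 123-45-6789"
print(find_pii(record))  # ['email', 'ssn']
print(redact(record))    # Contact [email], SSN [ssn]
```

Running discovery before visualization means the filter sits on the input side, so dashboards and stories built later never see the raw identifiers.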
  42. Where to look for Privacy in Big Data
     • Dashboards are used to present meaningful data and should be adequately tested for compliance
     • Tools can be tailored to audiences and should be specialized to eliminate undesirable inferences
     • Trends and predictions result from proper data analysis. This is where meaning becomes evident
  43. “Hyped indiscriminately and handled inappropriately, big data analytics can be more of a liability than an opportunity to derive rich information through intelligent refinement.”
  44. The Challenge of Big Data
     • Does logging in with social media accounts constitute consent?
     • Will aggregation and data masking still lead to personally identifiable information?
     • How can data privacy filtering be guaranteed?
  45. Big Data example: the Click dataset
     • Objective: “To study the structure and dynamics of Web traffic networks”
     • 53.5 billion click anonymized dataset @IndianaU
     • Data collected includes referrer, timestamp, URL
     • Does sanitization == anonymization == privacy?
     • Source: CNetS: http://cnets.indiana.edu
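One way to probe the question on the slide above is to measure how many records stay unique on the fields that survive sanitization. The Python sketch below does this on a handful of made-up records shaped like the three fields the slide names (referrer, timestamp, URL). The sample values, the hourly bucketing of the timestamp and the `unique_fraction` helper are all hypothetical and are not taken from the Click dataset.

```python
from collections import Counter


def unique_fraction(records, keys):
    """Fraction of records whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    singles = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return singles / len(records)


# Made-up click records that carry no user ID at all.
clicks = [
    {"referrer": "news.example", "hour": 9, "url": "a.example/home"},
    {"referrer": "news.example", "hour": 9, "url": "a.example/home"},
    {"referrer": "mail.example", "hour": 3, "url": "clinic.example/results"},
    {"referrer": "news.example", "hour": 14, "url": "b.example/jobs"},
]

print(unique_fraction(clicks, ["referrer", "hour", "url"]))  # 0.5
print(unique_fraction(clicks, ["referrer"]))                 # 0.25
```

A record that is unique on its remaining fields can still single someone out when joined with outside data, which is why removing the identifier column alone does not settle the question.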
  46. Open, Distributed DIY Big Data Tools
     • D3: Data Driven Documents
     • GitHub | SourceForge
     • Hadoop/MapReduce
     • Amazon cloud
     • Open source grids
     • Mechanical Turk?
     • Source: Wikipedia Recent Changes Map and wikistream projects
  47. Canadian Big Data: The Source
     • Detailed metrics showed preference for high-end products
     • Move away from $150 items to $650 ones increased sales 40% in high-end electronics
     • Notorious for overcollection, The Source actually does ‘the consent bit’ adequately
  48. Privacy enjoys safe sets
     • intrinsic safety in: meteorology, environmental, physics, astronomy and other sciences
     • innate risk in connectomics, biological, Internet, behavioural and sensory data sets
  49. “We simply cannot afford to entertain the notion that the proliferation of scattered data sources is the last bastion of privacy protection.”
  50. Technical solution of Big Data privacy
     • Build privacy into the input data sets
     • Use simple filtering for large data sets & output
     • Build algorithms to ensure irreversibility of privacy
     • Try to break it!
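The bullets on the slide above do not name a technique, so the following Python sketch is only one possible reading: identifiers are replaced on input with a keyed one-way token (HMAC-SHA256 from the standard library), and "try to break it" is played out as a small dictionary attack. The key, the sample identifier and all function names are assumptions made for illustration.

```python
import hashlib
import hmac


def pseudonymize(value, key):
    """Keyed one-way token (HMAC-SHA256); not recomputable without the key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def naive_hash(value):
    """Unkeyed SHA-256, shown only as the weak baseline to attack."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def dictionary_attack(token, candidates, hash_fn):
    """Try to recover the input by hashing guesses ('try to break it')."""
    return next((c for c in candidates if hash_fn(c) == token), None)


key = b"demo-key-keep-secret"  # hypothetical; a real key belongs in a secret store
guesses = [f"{n:09d}" for n in range(1000)]  # attacker's guess list, tiny on purpose

# An unkeyed hash of a low-entropy identifier falls to simple guessing.
print(dictionary_attack(naive_hash("000000042"), guesses, naive_hash))
# The keyed token does not match any guess hashed without the key.
print(dictionary_attack(pseudonymize("000000042", key), guesses, naive_hash))
```

The first attack recovers the identifier and the second returns `None`, which shows why hashing alone is not irreversibility for short numeric identifiers such as SINs or SSNs. The keyed variant is only as strong as the protection of the key.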
  51. Privacy must be tack[l]ed [head-]on
  52. “Data with the potential of being personally identifiable should be treated with the same veracity as dirty input.”
  53. 7 Steps to Building your own Healthy Information Ecosystem
     1. Articulate your vision
     2. Put your stop orders in place
     3. Assign roles and accountabilities
     4. Create processes to manage it
     5. Build controls and code standards from the bottom up
     6. Prioritize data ownership, integrity and classification
     7. Implement layered and automated audits
  54. Big Data Leadership
     • Embrace openness, build on what works
     • Adopt standards for process and technology
     • Draft progressive legislation (Model CASL, PCI and even CP laws)
     • Encourage awareness, promote accountability
     • Applaud and showcase responsible innovation
     • Put forward important notions of information life cycle, data ownership, and privacy compliance
  55. Big Data Links
     • Free course http://popa.ca/BigDataCourse (Coursera)
     • More from http://bigdatauniversity.com/courses/
     • This presentation: http://linkedIn.ClaudiuPopa.com
     • Visualization gallery: http://datavis.ca
     • InformationisBeautiful.net + Awards
     • HowBigReally.com / HowManyReally.com
     • The Signal and the Noise (Nate Silver @ amazon.ca)
  56. Big Data open discussion
     • Sharing data mining policies
     • Demonstrating fair use using audits
     • Caring about the purveyor spectrum
     • When it gets easy, it may be too late
  57. Stay in touch
     • Follow: Twitter.ClaudiuPopa.com
     • Read: Subscribe.ClaudiuPopa.com
     • Connect: LinkedIn.ClaudiuPopa.com