
Data fingerprinting

Title: Fingerprinting Latent Structure in Data

Abstract: As data-hungry algorithms find widespread application, there is increased interest in exploring these algorithms in the context of small-data. In many niche industrial applications, data is not only held in secrecy for reasons of privacy and competitive advantage, but is also limited in volume and variety. Under these constraints, data-driven algorithms are expected to exhibit low application misbehavior. If one can discard data that does not match the capabilities of the underlying algorithms, there is better control over how unexpected data can influence application behavior. In this talk, data fingerprinting techniques are presented in the context of small-data and application behavior. Capturing and representing a latent structure in data as a fingerprint helps evolve algorithm complexity, thereby improving application reliability. As an illustration, problems involving question answering, cell structure detection, and recognition of classes of short textual messages will be discussed.

Published in: Data & Analytics

  1. Fingerprinting Latent Structure in Data
     Mrityunjay Kumar & Guntur Ravindra, Technology Excellence Group, Talentica Software
     Presented at DAIR (Data Analytics and Intelligence Research, Indian Institute of Technology, Delhi)
     Copyright © 2018 Talentica Software (I) Pvt Ltd. All rights reserved.
  2. Agenda
     • Challenge with building data-driven algorithms
       • Small-data
     • Introduction to data fingerprinting
     • Two problem statements
       • Solving a question complexity problem
       • Solving an image recognition problem
     • Fingerprinting the structure in data
       • Extracting structure
       • Representing structure as a signature
     • Other complex problems
  3. What is data fingerprinting
     • A method to represent a block of data as an entity
       • Applications: easy validation, proof of originality, tamper detection, DLP
     • Classical techniques
       • Bloom filters, cryptographic hashes
     • Main issues with fingerprinting
       • Do not capture data semantics
       • Large number of fingerprints, which adds complexity
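To make the classical techniques on this slide concrete, here is a minimal sketch using only Python's standard `hashlib`. The helper names (`fingerprint`, `BloomFilter`) and the filter sizes are assumptions chosen for illustration, not anything taken from the talk. It shows why a cryptographic hash is good for tamper detection but blind to semantics: two questions that mean nearly the same thing get unrelated fingerprints.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Classical fingerprint: the SHA-256 digest of a block of data."""
    return hashlib.sha256(data).hexdigest()


class BloomFilter:
    """Tiny Bloom filter (illustrative sizes): set membership of fingerprints."""

    def __init__(self, n_bits: int = 256, n_hashes: int = 3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: bytes):
        # Derive n_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: bytes) -> bool:
        # May report false positives, never false negatives.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


q1 = b"How many students are in the class?"
q2 = b"How many pupils are in the class?"

fp1, fp2 = fingerprint(q1), fingerprint(q2)
seen = BloomFilter()
seen.add(q1)

print(fp1 == fingerprint(q1))  # identical data gives an identical fingerprint
print(fp1 == fp2)              # near-synonymous questions do not match
print(q1 in seen)              # the stored item is always reported as present
```

Both structures answer "have I seen exactly these bytes?", which is the limitation the slide points at: nothing in the digest reflects the structure or meaning of the data.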
  4. Two Problems
  5. Recognizing question complexity
  6. Recognizing question complexity
  7. Recognizing structural deformation in cells. Data source: https://www.kaggle.com/c/data-science-bowl-2018/data
  8. Data-driven algorithms with Small-Data
     • Need for problem-specific data
     • Rule-based approaches
       • Rule-based approaches are easy to implement
       • Not all data characteristics can be captured as rules
       • Do not automatically adapt to the data
     • Machine learning approach
       • ML approaches need large amounts of data
       • Generic models and open-source data are not suitable for application-specific needs
       • Can build complex structures and designs
  9. Architecting a solution
     • Knowledge has a latent structure
       • Sequence, geometry
     • There can be a hierarchy of structures
     • Convert structure to a computational representation
     • Objective: context of application capabilities
       • Influences computational representation
  10. Problem Formulation
     • A set of elements: images, questions, text messages
     • An objective
     • A subset of structures relevant to an objective
     • How do we define and how do we find
     • Transformation of elements into a structure and hence a computational entity
     • A human in the loop
  11. Structures in Data
     • How many buses are plying in Mumbai on a route originating at Dadar and ending at Vashi?
     • How many students are in the class?
  12. Structures in Data: Intensity, Projections, Oriented gradients
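The slide names its image structures without defining them. The following sketch shows one plausible reading, under stated assumptions: "projections" are taken to be row and column sums of pixel intensity, and "oriented gradients" a magnitude-weighted histogram of gradient orientations (a much simplified relative of the HOG descriptor, not the full method). The function names and the 4×4 test image are hypothetical.

```python
import math


def intensity_projections(img):
    """Row and column sums of pixel intensity for a 2-D list of numbers."""
    rows = [sum(row) for row in img]
    cols = [sum(col) for col in zip(*img)]
    return rows, cols


def orientation_histogram(img, n_bins=4):
    """Histogram of unsigned gradient orientations over interior pixels.

    Gradients use central differences; each pixel votes with its gradient
    magnitude into one of n_bins bins covering the range [0, pi).
    """
    height, width = len(img), len(img[0])
    hist = [0.0] * n_bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = img[y][x + 1] - img[y][x - 1]
            gy = img[y + 1][x] - img[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue  # flat region, no orientation to record
            angle = math.atan2(gy, gx) % math.pi
            bin_index = min(int(angle / math.pi * n_bins), n_bins - 1)
            hist[bin_index] += magnitude
    return hist


# Hypothetical test image: dark left half, bright right half (a vertical edge).
image = [[0, 0, 1, 1] for _ in range(4)]

rows, cols = intensity_projections(image)
hist = orientation_histogram(image)
print(rows, cols, hist)
```

For the vertical edge, all gradient energy falls into the first orientation bin and the column projection jumps where the edge sits, so both descriptors summarize the shape of the content rather than its exact pixel values.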
  13. Problem Formulation
     • A function that maps a structure to a vector
     • The inverse of the function results in one of many structures
     • For computational ease, we make the vector a binary bit-vector
     • The goal is to find the function so as to satisfy the constraints
     • This is a constrained optimization formulation
  14. Solution: Optimization formulation
     • Based on the problem formulation
       • We have an optimization formulation that has an inverse that results in the variable itself or a subset of variables
       • A related function is a neural auto-encoder
     • Solution boils down to
       • Training an auto-encoder with one class of data
     • Recognizing data class involves
       • Data clustering
       • Human intelligence/visual inspection to mark clusters
       • Data in clusters used to train the auto-encoder
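The idea of training an auto-encoder on a single class of data can be sketched in pure Python. Everything concrete here is an assumption made for illustration: the network size (8 inputs, 3 hidden units), the sigmoid activations, plain SGD on squared error, and the toy bit-vector "class" built from one base pattern with occasional bit flips. The talk does not specify its architecture or training procedure.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyAutoEncoder:
    """One hidden layer with sigmoid units, trained by SGD on squared error."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
        self.b2 = [0.0] * n_in

    def encode(self, x):
        return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w1, self.b1)]

    def decode(self, h):
        return [sigmoid(sum(w * hj for w, hj in zip(row, h)) + b)
                for row, b in zip(self.w2, self.b2)]

    def reconstruct(self, x):
        return self.decode(self.encode(x))

    def train_step(self, x, lr=0.5):
        h = self.encode(x)
        y = self.decode(h)
        # Backpropagate the squared reconstruction error through both layers.
        d_out = [(yi - xi) * yi * (1 - yi) for yi, xi in zip(y, x)]
        d_hid = [hj * (1 - hj) * sum(d_out[i] * self.w2[i][j] for i in range(len(x)))
                 for j, hj in enumerate(h)]
        for i, d in enumerate(d_out):
            for j, hj in enumerate(h):
                self.w2[i][j] -= lr * d * hj
            self.b2[i] -= lr * d
        for j, d in enumerate(d_hid):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * d * xi
            self.b1[j] -= lr * d


def reconstruction_error(model, x):
    y = model.reconstruct(x)
    return sum((yi - xi) ** 2 for yi, xi in zip(y, x)) / len(x)


# Toy "one class": a base bit-vector with an occasional single flipped bit.
rng = random.Random(1)
base = [1, 1, 1, 1, 0, 0, 0, 0]


def noisy(pattern):
    v = list(pattern)
    if rng.random() < 0.3:
        k = rng.randrange(len(v))
        v[k] = 1 - v[k]
    return v


train = [noisy(base) for _ in range(100)]
model = TinyAutoEncoder(n_in=8, n_hidden=3)
for _ in range(50):
    for x in train:
        model.train_step(x)

err_in = reconstruction_error(model, base)
err_out = reconstruction_error(model, [0, 0, 0, 0, 1, 1, 1, 1])
print(err_in, err_out)
```

Because the model has only ever seen one class, it reconstructs members of that class well and reconstructs unfamiliar inputs poorly, which is the property a one-class fingerprint relies on.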
  15. Recognition: Cell Structure
  16. Recognition: Question Complexity
     • How much can the SP alter income tax in Scotland?
     • What is stage 1 in the life of a bill?
     • Who is the President of Egypt?
     • Why do some people purposely resist officers of the law?
     • Why is the need for acceptance of punishment needed?
     • Why would one plead guilty to a crime involving civil disobedience?
     • Why is giving a defiant speech sometimes more harmful for the individual?
     • Why did Harvard end its early admission program?
  17. Solution: Recognition
     • The auto-encoder output has distortions
     • Detect the distortion
     • Quantify the distortion
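One way to detect and quantify distortion, given an input bit-vector and the reconstruction produced by some auto-encoder, is sketched below. The specific measures are assumptions, since the slide does not say which ones the authors used: distortion is taken as the mean absolute difference, detection as the set of bit positions that flip after rounding, and the acceptance threshold as the largest distortion seen on training data times a hypothetical safety margin.

```python
def distortion(x, y):
    """Quantify: mean absolute difference between input x and reconstruction y."""
    if len(x) != len(y):
        raise ValueError("input and reconstruction must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def flipped_bits(x, y, cut=0.5):
    """Detect: positions where the rounded reconstruction disagrees with the input."""
    return [i for i, (a, b) in enumerate(zip(x, y)) if (b >= cut) != bool(a)]


def fit_threshold(train_distortions, margin=1.5):
    """Assumed rule: accept up to margin times the worst training distortion."""
    return max(train_distortions) * margin


def is_known_class(x, y, threshold):
    return distortion(x, y) <= threshold


# Hypothetical values standing in for real auto-encoder outputs.
x = [1, 0, 1, 1]
y_good = [0.9, 0.1, 0.8, 0.9]  # close reconstruction
y_bad = [0.9, 0.1, 0.8, 0.2]   # last bit badly reconstructed

threshold = fit_threshold([0.05, 0.10, 0.08])
print(distortion(x, y_good), flipped_bits(x, y_good), is_known_class(x, y_good, threshold))
print(distortion(x, y_bad), flipped_bits(x, y_bad), is_known_class(x, y_bad, threshold))
```

Inputs whose distortion exceeds the threshold can be discarded as not matching the trained class, in line with the abstract's aim of rejecting data the application cannot handle.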
  18. Building Complexity
     • Incremental addition of data classes
     • Using stacking
       • Unique binary code injected in each stacked layer
     • Collapse stacked layers into a classification model
     • Redeploy
  19.
     Data type                              | Test cases | True positive | False positive | True negative | False negative
     With classes like in training data     | 1781       | 1774          | NA             | NA            | 7
     With classes not like in training data | 8789       | NA            | 13             | 8776          | NA
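The counts on this slide can be turned into rates with a few lines of arithmetic. The sketch below uses only the numbers reported on the slide; the variable names and the choice to report a true positive rate and a false positive rate are assumptions about how one might summarize them.

```python
# Counts as reported on the slide.
like_training = {"cases": 1781, "true_positive": 1774, "false_negative": 7}
unlike_training = {"cases": 8789, "false_positive": 13, "true_negative": 8776}

# Sanity check: each row's outcomes add up to its number of test cases.
row1_total = like_training["true_positive"] + like_training["false_negative"]
row2_total = unlike_training["false_positive"] + unlike_training["true_negative"]

# Share of known-class inputs that were recognized.
true_positive_rate = like_training["true_positive"] / like_training["cases"]

# Share of unknown-class inputs that were wrongly accepted.
false_positive_rate = unlike_training["false_positive"] / unlike_training["cases"]

print(f"true positive rate:  {true_positive_rate:.4f}")
print(f"false positive rate: {false_positive_rate:.4f}")
```

On these figures, roughly 99.6% of inputs from known classes are accepted and about 0.15% of inputs from unseen classes slip through.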
  20. Summary
     • A large number of applications are still small-data applications
     • Data has latent structure
       • Extraction is objective based and data specific
     • We can harness data-hungry algorithms for small-data applications
       • Use structures instead of raw data
       • Auto-encoders are powerful tools
       • Build incremental complexity
