January 2013 at University of Brighton

http://meetup.com/Big-Data-Brighton
Agenda
•   Miltos Petridis, Professor of Computer Science, University
    of Brighton

•   Dr Patricia Roberts, Senior Lecturer & Researcher in
    database design, development and management,
    University of Brighton - Structured vs Unstructured Data:
    why structure matters.

•   Simon Wibberley, PhD student in computational linguistics
    at the Text Analytics Group at the University of Sussex.
    Real-time text stream analysis, event detection, and entity
    recognition. Event detection on Twitter.

•   Kevin Long, Teradata - Summary and Business context
Big Data

“A new generation of technologies and
    architectures, designed to economically
    extract value from very large volumes of a
    wide variety of data, by enabling high-speed
    capture, discovery and/or analysis” [1]
New investment initiatives are coming, such as
    in the US in 2012:
“more than $200 million in new funding
    through six agencies and departments to
    improve the nation’s ability to extract
    knowledge and insights from large and
    complex collections of digital data” [2]
Knowledge and insights... hmm
Before companies rush to use the technologies
    they should be asking some questions:


• Can we make any assumptions about the
  quality of the data we are using?

• Is there a significant difference between
  structured and unstructured data?

• Can the underlying structure of the data
  affect what you can do with it?
In this brief talk, I will be examining these
   questions with reference to my research and
   recent trends
Can we make any assumptions about
 the quality of the data we are using?
• One of the problems about the recent explosion
  in the amount of data is that some data
  (particularly collected from social networking
  sites) is of dubious quality
   – A straw poll of my students found that 1 in 5
     deliberately enter incorrect data about themselves
     online to protect their identity
• We might not have any assurance that the data is
  true or that it is correctly linked to metadata
   – Is data typed?
   – Is the data related to other data? How is it related?
   – Are relationships between data and its meaning
     being lost?
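The questions above can be turned into simple checks. The following is a minimal sketch, not taken from the talk: the `Profile` record, the field names and the sample data are all hypothetical. It shows what "typed" and "related to other data" might mean in practice, and it also shows the limit of such checks, since a well-formed but deliberately false value passes them.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Profile:
    """Hypothetical typed record; the fields are assumptions for illustration."""
    user_id: int
    name: str
    birth_date: date


def parse_profile(raw: dict) -> Profile:
    """Convert an untyped record into a typed one; raises on malformed input."""
    return Profile(
        user_id=int(raw["user_id"]),
        name=str(raw["name"]).strip(),
        birth_date=date.fromisoformat(raw["birth_date"]),
    )


def dangling_references(posts: list[dict], profiles: dict[int, Profile]) -> list[dict]:
    """Return posts whose author_id does not point at any known profile."""
    return [post for post in posts if post["author_id"] not in profiles]


raw_profiles = [
    {"user_id": "1", "name": " Ada ", "birth_date": "1990-04-01"},
    {"user_id": "2", "name": "Bob", "birth_date": "not a date"},
]

profiles: dict[int, Profile] = {}
rejected: list[dict] = []
for raw in raw_profiles:
    try:
        profile = parse_profile(raw)
        profiles[profile.user_id] = profile
    except (KeyError, ValueError):
        # Type checks catch malformed values only. A plausible but
        # deliberately false birth date would still be accepted.
        rejected.append(raw)

posts = [{"author_id": 1, "text": "hello"}, {"author_id": 2, "text": "orphan"}]
orphans = dangling_references(posts, profiles)
```

Here the second profile is rejected because its date cannot be parsed, and the post that refers to it is reported as having lost its relationship to a profile.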
A view of different data models [3]
Is there a significant difference
 between structured and unstructured data?
• How is data structured?
• Does the underlying data model matter?
• What are the options for a data model?
• Over the years many models of data have
  evolved and most are still in use
• Data models used give insights into
  assumptions about the semantics of the data
Finding meaning from ‘flat’ data

• A problem with ‘flat’ or unstructured data
  representations is that it has traditionally
  been difficult to aggregate and present to
  users in a way that they can understand
• In contrast, structured data can be
  summarised easily and its structure
  represents the meaning of data within an
  organization
• Data analytics are changing this by
  presenting accessible information from ‘flat’
  data
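A small sketch of the contrast, under assumptions of my own: the sample records, the `KNOWN_CITIES` list and the extraction rules are hypothetical and not from the talk. Structured rows can be summarised directly, while flat text needs an extraction step that guesses at the structure before the same summary is possible.

```python
import re
from collections import Counter

# Structured data: the field names already carry the meaning.
structured = [
    {"city": "Brighton", "amount": 12.0},
    {"city": "Brighton", "amount": 8.0},
    {"city": "London", "amount": 5.0},
]


def total_by_city(rows):
    """Sum the amount per city."""
    totals = Counter()
    for row in rows:
        totals[row["city"]] += row["amount"]
    return dict(totals)


# Flat data: the same facts as free text, with no declared fields.
flat = [
    "Spent 12.00 in Brighton yesterday",
    "brighton again: 8.00",
    "5.00 at a cafe in London",
]

KNOWN_CITIES = ("Brighton", "London")  # assumed vocabulary
AMOUNT = re.compile(r"\d+\.\d{2}")     # assumed amount format


def extract_row(text):
    """Guess a (city, amount) row from free text; None if nothing is found."""
    amount = AMOUNT.search(text)
    city = next((c for c in KNOWN_CITIES if c.lower() in text.lower()), None)
    if amount is None or city is None:
        return None
    return {"city": city, "amount": float(amount.group())}


recovered = [row for row in map(extract_row, flat) if row is not None]
```

The extraction step only works because the vocabulary and number format were assumed in advance; text that breaks those assumptions is silently dropped, which is one way meaning gets lost.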
Can the underlying structure of the
data affect what you can do with it?
• The short answer from my research is
  ‘YES’
• How it affects what you can do with the
  data is the long answer
   – It is really easy to store a piece of data but
     retrieving it (intact with its meaning and
     its relationships to other data) is more
     difficult
   – When ‘Big Data’ technologies are used to
     extract knowledge and insights from the data we
     should be sure that the technology is not
     introducing new problems
Impedance mismatch problems

• Moving data from one paradigm to another
  often causes the meaning to be lost
• Can cause problems for developers who
  move data from one paradigm to another
• Also a problem for end users who may lose
  the connections
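One way to see the problem is to move an object graph into a flat representation and back. This is an illustrative sketch only: the `Department` and `Employee` classes are hypothetical, and JSON stands in for any flat target format. The values survive the round trip, but the fact that two employees share one department does not.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Department:
    name: str


@dataclass
class Employee:
    name: str
    department: Department


# Object paradigm: both employees refer to the same department object.
research = Department("Research")
staff = [Employee("Ada", research), Employee("Bob", research)]
shared_before = staff[0].department is staff[1].department

# Flat paradigm: each employee row gets its own copy of the department.
flat = json.dumps([asdict(employee) for employee in staff])

restored = [
    Employee(row["name"], Department(**row["department"]))
    for row in json.loads(flat)
]
shared_after = restored[0].department is restored[1].department

# Renaming the department now affects only one employee.
restored[0].department.name = "R&D"
names_after_rename = [employee.department.name for employee in restored]
```

After the round trip a rename reaches only one of the two employees, so the connection an end user relied on has been lost even though no individual value was corrupted.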
A way forward
• Working out goals in your data management
• Understanding the structure of the data you
  are using, wherever it comes from
• Getting assurance about the quality of the
  data
• Then having confidence that the knowledge
  and insights are based in firm foundations
Thank you

Any questions?
References
1.   Carter, P. (2011). Big Data Analytics: Future
     Architectures, Skills and Roadmaps for the CIO. SAS
     white paper, IDC Go-to-Market Services.
2.   Gianchandani, E. (2012). Obama administration unveils
     $200m big data R&D initiative. The Computing
     Community Consortium (CCC) Blog.
3.   Angles, R. and Gutierrez, C. (2008). Survey of
     graph database models. ACM Comput. Surv. 40, 1,
     Article 1 (February 2008).
Event Detection on Twitter

      Simon Wibberley
     Text Analytics Group
     University of Sussex
    simon.wibberley@sussex.ac.uk
What are Events? We just don’t know.
Event Categories
                   Constrained        Unconstrained
Well Reported      Relatively Easy    Interesting
Poorly Reported    Interesting        Very Tricky
Algorithms
• Query Driven
   –   Volume / rate analysis of matching data
   –   Addresses constrained event type
• Data Driven
   –   Mine stream for interesting data
   –   Addresses unconstrained event type
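The query-driven approach can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the function names, the bucket size, the sliding window and the mean-plus-deviations threshold are all assumptions. Messages matching a query are counted per time bucket, and a bucket is flagged when its count rises well above the recent baseline.

```python
from collections import Counter
from statistics import mean, pstdev


def matching_timestamps(messages, query_terms):
    """Keep the timestamps of (timestamp, text) messages that mention any query term."""
    terms = [term.lower() for term in query_terms]
    return [ts for ts, text in messages if any(t in text.lower() for t in terms)]


def bucket_counts(timestamps, bucket_seconds):
    """Count messages per fixed-width time bucket, including empty buckets."""
    if not timestamps:
        return []
    start = min(timestamps)
    counts = Counter((ts - start) // bucket_seconds for ts in timestamps)
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


def detect_bursts(counts, window=5, threshold=3.0):
    """Return indices of buckets whose count exceeds the recent baseline.

    The baseline is the mean of the previous `window` buckets; a bucket is a
    burst when it exceeds that mean by `threshold` standard deviations.
    """
    bursts = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        baseline = mean(history)
        spread = pstdev(history) or 1.0  # avoid a zero threshold on flat history
        if counts[i] > baseline + threshold * spread:
            bursts.append(i)
    return bursts


messages = [(0, "Dressage final starts"), (30, "lunch"), (70, "GB dressage gold!")]
matching = matching_timestamps(messages, ["dressage"])
```

This only addresses the constrained case, because the query terms have to be known in advance. A data-driven detector would need to find the interesting terms in the stream itself.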
GB Dressage Gold
London Riots
London Riots
Event Characterisation
• Fill in unknowns
• Self explanatory for (very) constrained events
• Select representative / well formed Tweet[s]
• Term relevance / clustering
• Topic analysis
• Geo-location / Entity extraction
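As a rough illustration of two of these steps, term relevance and selecting a representative Tweet, the sketch below scores each message by how widely its terms are shared across the messages in an event. The scoring rule, the stopword list and the sample messages are assumptions made for this example, not the method used in the talk.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "in", "at", "of", "is", "for", "to", "and"}  # assumed list


def tokens(text):
    """Lowercase word tokens with stopwords removed."""
    words = re.findall(r"[a-z0-9#@']+", text.lower())
    return [word for word in words if word not in STOPWORDS]


def term_relevance(messages):
    """Document frequency: the number of messages each term appears in."""
    frequency = Counter()
    for message in messages:
        frequency.update(set(tokens(message)))
    return frequency


def most_representative(messages):
    """Pick the message whose distinct terms are most widely shared."""
    frequency = term_relevance(messages)

    def score(message):
        return sum(frequency[term] for term in set(tokens(message)))

    return max(messages, key=score)


event_messages = [
    "GB wins dressage gold",
    "Dressage gold for GB!",
    "gold gold gold",
    "what a lovely lunch",
]
relevance = term_relevance(event_messages)
representative = most_representative(event_messages)
```

Summing the frequencies favours longer messages that cover more of the shared vocabulary; averaging instead would favour short messages built from a single common term, so the choice depends on what "representative" should mean.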
CASM
• Centre for the Analysis of Social Media
• Collaboration between DEMOS and TAG
• Applying text analytics to social media to
  answer sociological questions
• OSI funded EU sentiment analysis pilot project
   http://www.demos.co.uk/projects/casm/
Ethics
                      Narrow            Broad
Identity Preserving   Judiciary         Stasi
Anonymous             Social Science    Me!
                                            Reffin, J (2012)

Big Data Brighton | Big Data in Academia | Jan 2013
