Four talks about Big Data in Academia at Big Data Brighton Jan 2013. Two of the talks' slides are here. I'll upload Miltos' slides when I receive them.
Dr Patricia Roberts, Senior Lecturer & Researcher in database design, development and management, University of Brighton - Structured vs Unstructured Data: why structure matters.
Simon Wibberley, PhD student in computational linguistics at the Text Analytics Group at the University of Sussex. Real-time text stream analysis, event detection, and entity recognition. Event detection on Twitter.
Big Data Brighton | Big Data in Academia | Jan 2013
1. January 2013
at
University of Brighton
http://meetup.com/Big-Data-Brighton
2. Agenda
• Miltos Petridis, Professor of Computer Science, University
of Brighton
• Dr Patricia Roberts, Senior Lecturer & Researcher in
database design, development and management,
University of Brighton - Structured vs Unstructured Data:
why structure matters.
• Simon Wibberley, PhD student in computational linguistics
at the Text Analytics Group at the University of Sussex.
Real-time text stream analysis, event detection, and entity
recognition. Event detection on Twitter.
• Kevin Long, Teradata - Summary and Business context
4. Big Data
“A new generation of technologies and
architectures, designed to economically
extract value from very large volumes of a
wide variety of data, by enabling high-speed
capture, discovery and/or analysis” [1]
New investment initiatives are coming, such as
in the US in 2012:
“more than $200 million in new funding
through six agencies and departments to
improve the nation’s ability to extract
knowledge and insights from large and
complex collections of digital data” [2]
5. Knowledge and insights... hmm
Before companies rush to use the technologies
they should be asking some questions:
• Can we make any assumptions about the
quality of the data we are using?
• Is there a significant difference between
structured and unstructured data?
• Can the underlying structure of the data
affect what you can do with it?
6. In this brief talk, I will be examining these
questions with reference to my research and
recent trends
7. Can we make any assumptions about
the quality of the data we are using?
• One of the problems about the recent explosion
in the amount of data is that some data
(particularly collected from social networking
sites) is of dubious quality
– A straw poll of my students found that 1 in 5
deliberately enter incorrect data about themselves
online to protect their identity
• We might not have any assurance that the data is
true or that it is correctly linked to metadata
– Is data typed?
– Is the data related to other data? How is it related?
– Are relationships between data and its meaning
being lost?
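The questions above ("Is data typed?", "Is the data related to other data?") can be turned into simple checks. The following is a minimal Python sketch, not taken from the talk: the schema, the field names (user_id, name, joined, author_id) and the sample records are all hypothetical and chosen only for illustration.

```python
from datetime import date

# Hypothetical schema (an assumption for this sketch): field name -> expected type.
SCHEMA = {"user_id": int, "name": str, "joined": date}


def type_errors(record, schema=SCHEMA):
    """Return the names of fields that are missing or have the wrong type."""
    return [
        field
        for field, expected in schema.items()
        if not isinstance(record.get(field), expected)
    ]


def dangling_references(posts, users):
    """Return posts whose author_id does not match any known user_id."""
    known = {u["user_id"] for u in users}
    return [p for p in posts if p.get("author_id") not in known]


# Illustrative data: the second user arrived as untyped strings.
users = [
    {"user_id": 1, "name": "Ada", "joined": date(2012, 5, 1)},
    {"user_id": "2", "name": "Bob", "joined": "yesterday"},
]
posts = [{"post_id": 10, "author_id": 1}, {"post_id": 11, "author_id": 99}]

print([type_errors(u) for u in users])
print(dangling_references(posts, users))
```

Checks like these only show whether data is typed and linked. They cannot show whether a value is true, which is the harder problem raised by the straw poll.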
9. Is there a significant difference
between structured and unstructured
data?
• How is data structured?
• Does the underlying data model matter?
• What are the options for a data model?
• Over the years many models of data have
evolved and most are still in use
• Data models used give insights into
assumptions about the semantics of the data
10. Finding meaning from ‘flat’ data
• A problem with ‘flat’ or unstructured data
representations is that it has traditionally
been difficult to aggregate and present to
users in a way that they can understand
• In contrast, structured data can be
summarised easily and its structure
represents the meaning of data within an
organization
• Data analytics are changing this by
presenting accessible information from ‘flat’
data
11. Can the underlying structure of the
data affect what you can do with it?
• The short answer from my research is
‘YES’
• How it affects what you can do with the
data is the long answer
– It is really easy to store a piece of data but
retrieving it (intact with its meaning and
its relationships to other data) is more
difficult
– When ‘Big Data’ technologies are used to extract
knowledge and insights from the data we
should be sure that the technology is not
introducing new problems
12. Impedance mismatch problems
• Moving data from one paradigm to another
often causes the meaning to be lost
• Can cause problems for developers who
move data from one paradigm to another
• Also a problem for end users who may lose
the connections
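One way to picture this loss is to move a nested record into a flat file and read it back. The Python sketch below is an illustration under assumed data, not an example from the talk: the customer record and the deliberately naive to_flat_rows helper are hypothetical.

```python
import csv
import io

# Hypothetical nested record: the customer "owns" the orders.
customer = {
    "id": 7,
    "name": "Ada",
    "orders": [{"id": 1, "total": 9.5}, {"id": 2, "total": 20.0}],
}


def to_flat_rows(customer):
    """Naive flattening: one row per order, with no customer key carried over."""
    return [[order["id"], order["total"]] for order in customer["orders"]]


# Write the rows to an in-memory CSV file, then read them back.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(to_flat_rows(customer))
buffer.seek(0)
round_tripped = list(csv.reader(buffer))

# The numbers come back as strings, and nothing says who placed the orders.
print(round_tripped)
```

The types and the customer-to-order relationship lived in the structure of the original record, not in the values themselves, so a flat representation has to carry them explicitly (for example as an extra customer_id column) or they are gone.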
13. A way forward
• Working out goals in your data management
• Understanding the structure of the data you
are using, wherever it comes from
• Getting assurance about the quality of the
data
• Then having confidence that the knowledge
and insights are based on firm foundations
15. References
1. Carter, P. (2011), Big Data Analytics: Future
Architectures, Skills and Roadmaps for the CIO, SAS
White paper, IDC Go-to-Market Services
2. E. Gianchandani. Obama administration unveils
$200m big data r&d initiative. In The Computing
Community Consortium (CCC) Blog, 2012.
3. Renzo Angles and Claudio Gutierrez. 2008. Survey of
graph database models. ACM Comput. Surv. 40, 1,
Article 1 (February 2008)
16. Event Detection on Twitter
Simon Wibberley
Text Analytics Group
University of Sussex
simon.wibberley@sussex.ac.uk
18. Event Categories
                  Constrained       Unconstrained
Well Reported     Relatively Easy   Interesting
Poorly Reported   Interesting       Very Tricky
19. Algorithms
• Query Driven
– Volume / rate analysis of matching data
– Addresses constrained event type
• Data Driven
– Mine stream for interesting data
– Addresses unconstrained event type
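The query-driven approach can be sketched as counting the tweets that match a query in each time bucket and flagging buckets where the volume jumps. The Python sketch below is an assumption-laden illustration, not the method from the talk: the (timestamp, text) tweet format, the 60-second buckets, and the "more than 3 times the trailing mean" rule are all choices made for this example.

```python
from collections import Counter


def matching_counts(tweets, query, bucket_seconds=60):
    """Count tweets whose text contains the query, per time bucket."""
    counts = Counter()
    for timestamp, text in tweets:
        if query.lower() in text.lower():
            counts[int(timestamp // bucket_seconds)] += 1
    return counts


def bursts(counts, window=3, factor=3.0):
    """Flag buckets whose count exceeds `factor` times the mean of the
    previous `window` buckets (the baseline is floored at 1.0)."""
    flagged = []
    if not counts:
        return flagged
    first, last = min(counts), max(counts)
    for bucket in range(first + window, last + 1):
        history = [counts.get(k, 0) for k in range(bucket - window, bucket)]
        baseline = max(sum(history) / window, 1.0)
        if counts.get(bucket, 0) > factor * baseline:
            flagged.append(bucket)
    return flagged


# Illustrative stream of (seconds since start, text) pairs.
tweets = [
    (5, "goal by Brighton"), (70, "what a goal"), (130, "GOAL!"),
    (185, "goal"), (190, "Goal!!"), (200, "goal goal"),
    (210, "late goal"), (220, "that goal"), (230, "lunch time"),
]
counts = matching_counts(tweets, "goal")
print(sorted(counts.items()))
print(bursts(counts))
```

This only works when the query is known in advance, which is why it suits the constrained event type. The data-driven approach has to find the interesting terms itself.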
23. Event Characterisation
• Fill in unknowns
• Self explanatory for (very) constrained events
• Select representative / well formed Tweet[s]
• Term relevance / clustering
• Topic analysis
• Geo-location / Entity extraction
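As one possible reading of "term relevance", terms can be scored by how much more frequent they are in tweets from the event window than in background tweets. The Python sketch below uses an add-one smoothed frequency ratio. The scoring rule, the tokeniser, and the sample tweets are assumptions for illustration and are not taken from the talk.

```python
import re
from collections import Counter


def tokens(text):
    """Lower-case a tweet and split it into simple word-like tokens."""
    return re.findall(r"[a-z0-9#@']+", text.lower())


def term_relevance(event_tweets, background_tweets):
    """Score each event term by its add-one smoothed relative frequency
    in the event tweets divided by that in the background tweets."""
    event = Counter(t for tweet in event_tweets for t in tokens(tweet))
    background = Counter(t for tweet in background_tweets for t in tokens(tweet))
    event_total = sum(event.values())
    background_total = sum(background.values())
    vocab = len(set(event) | set(background))
    return {
        term: ((count + 1) / (event_total + vocab))
        / ((background[term] + 1) / (background_total + vocab))
        for term, count in event.items()
    }


# Illustrative tweets only.
event_tweets = ["earthquake in town", "felt the earthquake", "earthquake!"]
background_tweets = ["lunch in town", "the weather is fine"]

scores = term_relevance(event_tweets, background_tweets)
top_term = max(scores, key=scores.get)
print(top_term, scores[top_term])
```

Terms that are common everywhere, such as "in" or "the", score close to 1, while terms specific to the event window score higher and can be used to help fill in the unknowns.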
24. CASM
• Centre for the Analysis of Social Media
• Collaboration between DEMOS and TAG
• Applying text analytics to social media to
answer sociological questions
• OSI funded EU sentiment analysis pilot project
http://www.demos.co.uk/projects/casm/