Big Data Brighton | Big Data in Academia | Jan 2013

January 2013
at
University of Brighton

http://meetup.com/Big-Data-Brighton

Agenda
• Miltos Petridis, Professor of Computer Science, University
of Brighton

• Dr Patricia Roberts, Senior Lecturer & Researcher in
database design, development and management,
University of Brighton - Structured vs Unstructured Data:
why structure matters.

• Simon Wibberley, PhD student in computational linguistics
at the Text Analytics Group at the University of Sussex.
Real-time text stream analysis, event detection, and entity
recognition. Event detection on Twitter.

• Kevin Long, Teradata - Summary and Business context

Big Data

“A new generation of technologies and
architectures, designed to economically
extract value from very large volumes of a
wide variety of data, by enabling high-speed
capture, discovery and/or analysis” [1]
New investment initiatives are coming, such as
in the US in 2012:
“more than $200 million in new funding
through six agencies and departments to
improve the nation’s ability to extract
knowledge and insights from large and
complex collections of digital data” [2]

Knowledge and insights... hmm
Before companies rush to use the technologies
they should be asking some questions:

• Can we make any assumptions about the
quality of the data we are using?

• Is there a significant difference between
structured and unstructured data?

• Can the underlying structure of the data
affect what you can do with it?

In this brief talk, I will be examining these
questions with reference to my research and
recent trends

Can we make any assumptions about
the quality of the data we are using?
• One of the problems with the recent explosion
in the amount of data is that some data
(particularly data collected from social networking
sites) is of dubious quality
– A straw poll of my students found that 1 in 5
deliberately enter incorrect data about themselves
online to protect their identity
• We might not have any assurance that the data is
true or that it is correctly linked to metadata
– Is data typed?
– Is the data related to other data? How is it related?
– Are relationships between data and its meaning
being lost?
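
The "Is data typed?" question can be made concrete with a small check. This is a minimal sketch and not part of the talk: the field names, the expected types and the `check_record` helper are all hypothetical, chosen only for illustration.

```python
from datetime import date

# Hypothetical schema (an assumption for this sketch): field -> expected type.
EXPECTED_TYPES = {"name": str, "date_of_birth": date, "followers": int}


def check_record(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"{field}: missing")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


# The same person, once with typed values and once as untyped strings.
typed = {"name": "A. User", "date_of_birth": date(1990, 1, 1), "followers": 12}
untyped = {"name": "A. User", "date_of_birth": "01/01/1900", "followers": "12"}

print(check_record(typed))
print(check_record(untyped))
```

A check like this can only say whether a value has the expected type. It cannot say whether the value is true, which is the harder problem raised by deliberately incorrect profile data.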

A view of different data models [3]

Is there a significant difference
between structured and unstructured
data?
• How is data structured?
• Does the underlying data model matter?
• What are the options for a data model?
• Over the years many models of data have
evolved and most are still in use
• The data model in use gives insight into the
assumptions made about the semantics of the data

Finding meaning from ‘flat’ data

• A problem with ‘flat’ or unstructured data
representations is that they have traditionally
been difficult to aggregate and present to
users in a way that they can understand
• In contrast, structured data can be
summarised easily and its structure
represents the meaning of data within an
organization
• Data analytics are changing this by
presenting accessible information from ‘flat’
data
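
The contrast above can be sketched in a few lines. This is an illustrative example, not taken from the talk: the `orders` table, its columns and the sample values are assumptions. With a schema, the summary is a single declarative query. With ‘flat’ text, the meaning of each token has to be hard-coded by whoever reads the data.

```python
import sqlite3

# Structured: the (hypothetical) schema records what each value means.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)],
)
# Each result row is a (customer, total) pair, so dict() builds a lookup.
totals = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)
conn.close()

# 'Flat': the same facts as free text. The parser below assumes every
# line has exactly the form "<customer> paid <amount>".
flat = ["alice paid 10.0", "bob paid 5.0", "alice paid 2.5"]
flat_totals = {}
for line in flat:
    customer, _, amount = line.split()
    flat_totals[customer] = flat_totals.get(customer, 0.0) + float(amount)

print(totals)
print(flat_totals)
```

Both routes give the same totals here, but the flat version breaks as soon as a line deviates from the assumed wording.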

Can the underlying structure of the
data affect what you can do with it?
• The short answer from my research is
‘YES’
• How it affects what you can do with the
data is the long answer
– It is really easy to store a piece of data but
retrieving it (intact with its meaning and
its relationships to other data) is more
difficult
– When ‘Big Data’ technologies are used to
extract knowledge and insights from the data we
should be sure that the technology is not
introducing new problems

Impedance mismatch problems

• Moving data from one paradigm to another
often causes the meaning to be lost
• Can cause problems for developers who
move data from one paradigm to another
• Also a problem for end users who may lose
the connections
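
One way meaning can be lost when data moves between paradigms is sketched below. This is an illustrative example only: the `Department` and `Employee` classes are hypothetical, and JSON stands in for any document-style store. In the object paradigm two employees share one department. After a round trip through JSON the values survive but the shared relationship does not.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Department:
    name: str


@dataclass
class Employee:
    name: str
    department: Department


research = Department("Research")
staff = [Employee("Ann", research), Employee("Raj", research)]

# Object paradigm: both employees refer to the same Department object.
shared_before = staff[0].department is staff[1].department

# Move to a document paradigm (JSON text) and back again.
payload = json.dumps([asdict(e) for e in staff])
restored = [
    Employee(d["name"], Department(**d["department"]))
    for d in json.loads(payload)
]

# The relationship has become two independent copies of the department.
shared_after = restored[0].department is restored[1].department
restored[0].department.name = "R&D"

print(shared_before, shared_after)
print(restored[1].department.name)
```

Renaming the department for one restored employee no longer affects the other, so the two records can silently drift apart.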

A way forward
• Working out goals in your data management
• Understanding the structure of the data you
are using, wherever it comes from
• Getting assurance about the quality of the
data
• Then having confidence that the knowledge
and insights are based on firm foundations

References
1. Carter, P. (2011). Big Data Analytics: Future
Architectures, Skills and Roadmaps for the CIO. SAS
White Paper, IDC Go-to-Market Services.
2. Gianchandani, E. (2012). Obama administration unveils
$200m big data R&D initiative. The Computing
Community Consortium (CCC) Blog.
3. Angles, R. and Gutierrez, C. (2008). Survey of
graph database models. ACM Comput. Surv. 40, 1,
Article 1 (February 2008).

Event Detection on Twitter

Simon Wibberley
Text Analytics Group
University of Sussex
simon.wibberley@sussex.ac.uk

What are Events? We just don’t know.

Event Categories

                  Constrained        Unconstrained
Well Reported     Relatively Easy    Interesting
Poorly Reported   Interesting        Very Tricky

Algorithms
• Query Driven
– Volume / rate analysis of matching data
– Addresses constrained event type
• Data Driven
– Mine stream for interesting data
– Addresses unconstrained event type
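
The query-driven approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used in the talk: tweets are modelled as hypothetical `(seconds, text)` pairs, matching is a plain substring test, and a window counts as a burst when its volume exceeds `factor` times the mean of all earlier windows. The function names, window size and threshold are all invented for the example.

```python
from collections import Counter


def matching_times(tweets, query):
    """Timestamps of tweets whose text contains the query (case-insensitive)."""
    return [t for t, text in tweets if query.lower() in text.lower()]


def detect_bursts(timestamps, window=60, factor=3.0):
    """Return start times of windows whose volume exceeds
    factor x the mean volume of all earlier windows."""
    if not timestamps:
        return []
    counts = Counter(int(t // window) for t in timestamps)
    bursts, history = [], []
    for w in range(min(counts), max(counts) + 1):
        n = counts.get(w, 0)
        if history:
            baseline = sum(history) / len(history)
            # Floor the baseline at 1 so a silent history does not make
            # every single tweet look like a burst.
            if n > factor * max(baseline, 1.0):
                bursts.append(w * window)
        history.append(n)
    return bursts


# Hypothetical stream: a trickle of matches, then a spike at t = 180s.
tweets = [
    (5, "Goal disallowed?"),
    (20, "lunch"),
    (70, "no goal yet"),
    (130, "great goal chance"),
]
tweets += [(180 + i, "GOAL!!!") for i in range(8)]

bursts = detect_bursts(matching_times(tweets, "goal"))
print(bursts)
```

Because the query fixes what to look for, this only addresses the constrained event type. The data-driven case has no query to filter on and needs to mine the whole stream.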

Event Characterisation
• Fill in unknowns
• Self explanatory for (very) constrained events
• Select representative / well formed Tweet[s]
• Term relevance / clustering
• Topic analysis
• Geo-location / Entity extraction
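
Term relevance and representative-tweet selection could be approximated as below. This is a rough sketch under assumptions, not the Text Analytics Group's method: relevance is a smoothed log-ratio of how often a term appears in event tweets versus background tweets, and the representative tweet is the one whose distinct terms have the highest mean relevance. The tokeniser, function names and sample tweets are all hypothetical.

```python
import math
from collections import Counter


def tokens(text):
    """Lower-case words with surrounding punctuation stripped."""
    stripped = (w.strip(".,!?#@").lower() for w in text.split())
    return [w for w in stripped if w]


def term_relevance(event_tweets, background_tweets):
    """Score each event term by a smoothed log-ratio of the share of
    event tweets containing it to the share of background tweets."""
    ev = Counter(w for t in event_tweets for w in set(tokens(t)))
    bg = Counter(w for t in background_tweets for w in set(tokens(t)))
    n_ev, n_bg = len(event_tweets) + 1, len(background_tweets) + 1
    return {
        w: math.log(((ev[w] + 1) / n_ev) / ((bg[w] + 1) / n_bg))
        for w in ev
    }


def most_representative(event_tweets, scores):
    """Pick the tweet whose distinct terms have the highest mean score."""
    def mean_score(tweet):
        words = set(tokens(tweet))
        if not words:
            return float("-inf")
        return sum(scores.get(w, 0.0) for w in words) / len(words)
    return max(event_tweets, key=mean_score)


event = [
    "Earthquake in Lewes!",
    "Did anyone feel the earthquake?",
    "the earthquake shook my desk in lewes",
]
background = ["the weather is nice", "lunch in the park", "my train is late"]

scores = term_relevance(event, background)
print(max(scores, key=scores.get))
print(most_representative(event, scores))
```

Common words such as "the" score near zero because they are just as frequent in the background, so short tweets made of event-specific terms rise to the top.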

CASM
• Centre for the Analysis of Social Media
• Collaboration between DEMOS and TAG
• Applying text analytics to social media to
answer sociological questions
• OSI funded EU sentiment analysis pilot project
http://www.demos.co.uk/projects/casm/

Ethics

                      Narrow            Broad
Identity Preserving   Judiciary         Stasi
Anonymous             Social Science    Me!

Reffin, J (2012)
