Thu 1400 cagle_kurt_color

Your Trusted
Web Presence Partner

BIG DATA
Semantic Data
© 2011 Avalon Consulting, LLC

Your Speaker

• Kurt Cagle is an Information Architect for
Avalon Consulting.
• Author of 18 books on XML, Web
Development and the Semantic Web
• Managing Editor of XMLToday.org
• Email: caglek@avalonconsult.com


Perspectives
• In 1945, the cost to acquire data was high:
roughly $1/kB in 2011 USD.
• By 1960, that same dollar could get 1000 times as
much data. This is another expression of Moore's
Law.
• At 15 years per 1000x increase, the cost to acquire
data in 2011 is ~ $0.000000001/kB.
• By 2050, this will be
$0.00000000000000000000001/kB, or
10,000,000,000,000,000,000,000 kB/$.
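
The extrapolation rule on this slide can be sketched in a few lines. This is a rough illustration, assuming a steady 1000x drop every 15 years from a $1/kB baseline in 1945; the function name and its parameters are hypothetical. A strict extrapolation of this kind gives order-of-magnitude figures only, and they need not agree with the rounded values quoted on the slide.

```python
def cost_per_kb(year, base_year=1945, base_cost=1.0, years_per_1000x=15):
    """Dollars per kB, assuming a steady 1000x drop every `years_per_1000x` years."""
    steps = (year - base_year) / years_per_1000x
    return base_cost / (1000 ** steps)


# Print the extrapolated cost for the years mentioned on the slide.
for year in (1945, 1960, 2011, 2050):
    print(year, f"{cost_per_kb(year):.1e} $/kB")
```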


Record Data
• The fundamental unit of data is the record.
• A record has an identity, a unique code (within its
context) that differentiates it from all other
records.
• A record has zero or more properties that describe
specific characteristics of the record.
• Some of those properties may be pointers to other
records that have a given identity.
• The combination of properties and identity also
shares a semantic cohesiveness.
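
As a minimal sketch of the record model described here, the following assumes a record is just an identity plus a dictionary of properties, with a small wrapper type for properties that point at other records. The class names `Record` and `Ref` and the sample identities are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class Ref:
    """A property value that points at another record by its identity."""
    identity: str


@dataclass
class Record:
    identity: str                                   # unique within its context
    properties: Dict[str, Any] = field(default_factory=dict)


author = Record("person/42", {"name": "Ada"})
book = Record("book/7", {"title": "Notes", "author": Ref("person/42")})

# The context: every record is reachable by its identity.
records = {r.identity: r for r in (author, book)}

# Follow the pointer from the book record to the author record.
linked = records[book.properties["author"].identity]
print(linked.properties["name"])
```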


From Record to Resource
• A resource is an abstract entity that is both
unique and addressable.
• A representation of that resource is a (potentially
structured) bag of properties that describe
characteristics of that resource.
• If that resource is part of a collection, then the
representation of that resource is a record in that
collection.
• Note that the record is NOT the resource – it is
only a description of the state of the resource at
the moment that it is queried.


Addressability
• A resource is addressable if, for a given collection,
there is a key that can be used to retrieve a
representation of the resource from the collection.
• A collection in turn, can be thought of as the
context or namespace of all addresses within that
collection.
• Conversely, if no collection has a key that
retrieves an entity, then that entity is not a
resource.
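
The relationship between a collection, a key, and a representation can be sketched as below. This is an illustrative in-memory model, not a real server: the `Collection` class, its methods, and the `sensors` example are all assumptions. It also shows why a record is only a snapshot: the copy returned by `get` does not change when the resource does.

```python
import copy


class Collection:
    """A named context (namespace) that maps keys to resource state."""

    def __init__(self, name):
        self.name = name
        self._state = {}          # internal state of each resource, by key

    def put(self, key, state):
        self._state[key] = state

    def address(self, key):
        # The collection name scopes the key, giving a full address.
        return f"/{self.name}/{key}"

    def get(self, key):
        # Return a snapshot of the state at query time, not the state itself.
        return copy.deepcopy(self._state[key])


sensors = Collection("sensors")
sensors.put("s1", {"temp": 20})

record = sensors.get("s1")        # representation at this moment
sensors.put("s1", {"temp": 25})   # the resource changes afterwards

print(sensors.address("s1"), record, sensors.get("s1"))
```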


Resources and Time
• Most resources are state machines – they change
their (internal) state over time.
• Depending upon the requested representation
format, a record may be
– the most recent representation of a resource,
– the delta of changes from the last request,
– or may be a log of all changes.
• Through 2000 or so, most resources changed
slowly. That is no longer necessarily true.
• This means that all resources are services.
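
The three representation formats listed above (latest state, delta since the last request, full log) can be sketched for a single changing resource. This is a simplified illustration; the `TrackedResource` class and the format names `"latest"`, `"delta"`, and `"log"` are hypothetical, and a real service would track the last request per client rather than globally.

```python
class TrackedResource:
    """A resource whose state changes over time and can be viewed three ways."""

    def __init__(self, **state):
        self.state = dict(state)
        self.log = [dict(state)]   # every change applied, oldest first
        self._last_seen = 0        # log position at the previous delta request

    def update(self, **changes):
        self.state.update(changes)
        self.log.append(dict(changes))

    def represent(self, fmt="latest"):
        if fmt == "latest":
            return dict(self.state)
        if fmt == "delta":
            # Merge all changes recorded since the previous delta request.
            merged = {}
            for change in self.log[self._last_seen:]:
                merged.update(change)
            self._last_seen = len(self.log)
            return merged
        if fmt == "log":
            return [dict(change) for change in self.log]
        raise ValueError(f"unknown format: {fmt}")


r = TrackedResource(temp=20)
first_delta = r.represent("delta")     # includes the initial state
r.update(temp=21)
r.update(status="ok")
second_delta = r.represent("delta")    # only what changed since first_delta
print(first_delta, second_delta, r.represent("latest"), len(r.represent("log")))
```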


BIG DATA and REST
• REST – Representational State Transfer – is
becoming the primary services representation.
• The period from 2005 to 2020 will be concerned
with making previously non-addressable data
RESTful.
• REST imputes CRUD (Create, Read, Update,
Delete) semantics to addressable networks.
• REST also implies that collections are resources.
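
One common way to map HTTP verbs onto CRUD operations is sketched below as an in-memory dispatcher. It is an assumption-laden illustration rather than a real HTTP server: the `RestCollection` class, its `handle` method, and the status codes chosen are hypothetical choices. Note that a `GET` without a key reads the collection itself, which reflects the point that collections are resources too.

```python
class RestCollection:
    """In-memory stand-in for a RESTful collection endpoint."""

    def __init__(self):
        self.items = {}
        self.next_id = 1

    def handle(self, method, key=None, body=None):
        if method == "POST" and key is None:       # Create
            key = str(self.next_id)
            self.next_id += 1
            self.items[key] = body
            return 201, key
        if method == "GET" and key is None:        # Read the collection itself
            return 200, sorted(self.items)
        if key not in self.items:
            return 404, None
        if method == "GET":                        # Read one resource
            return 200, self.items[key]
        if method == "PUT":                        # Update
            self.items[key] = body
            return 200, body
        if method == "DELETE":                     # Delete
            del self.items[key]
            return 204, None
        return 405, None


books = RestCollection()
status, key = books.handle("POST", body={"title": "Notes"})
print(status, key, books.handle("GET", key), books.handle("GET"))
```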


Representations
• A collection processor (that is, a server) is an
abstraction layer between internal and external
data representations.
• Any time a request is made, there is almost always
a transformation that maps between an internal
entity and the requested output.
• Representations do not have to be (and most
times are not) in the same format as the
underlying resource – a resource stored internally
as XML can be rendered as HTML, JSON, zipped
files, graphics, PDF, etc.
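
The transformation between an internal format and a requested output can be sketched with the Python standard library. This is a minimal example under stated assumptions: the `INTERNAL` document, the `render` function, and the format names are hypothetical, and only flat child elements are handled.

```python
import json
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical internal representation of one resource, held as XML.
INTERNAL = "<book id='7'><title>Notes &amp; Queries</title><year>2011</year></book>"


def render(xml_text, fmt):
    """Map the internal XML to the requested external representation."""
    root = ET.fromstring(xml_text)
    fields = {child.tag: child.text for child in root}
    if fmt == "xml":
        return xml_text
    if fmt == "json":
        return json.dumps({"id": root.get("id"), **fields}, sort_keys=True)
    if fmt == "html":
        rows = "".join(
            f"<dt>{escape(k)}</dt><dd>{escape(v)}</dd>" for k, v in fields.items()
        )
        return f"<dl>{rows}</dl>"
    raise ValueError(f"unsupported format: {fmt}")


print(render(INTERNAL, "json"))
print(render(INTERNAL, "html"))
```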


Collections
• A collection establishes a constrained set of
resources.
• A collection can also be thought of as a category,
with each resource key as a term in the category's
taxonomy.
• A resource may belong to more than one
collection.
• Collections determine available representations
for resources.


Collections and Search
• A search is the invocation of a parametric
function on a collection that returns a set of
resource representations (typically with links to
more extensive representations).
• It is possible for a collection to have multiple
search functions bound to that collection – in that
regard a search function may itself be a resource
in a different collection.
• Search is the bridge between addressable RESTful
retrieval and imperative web services.
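
A search in this sense can be sketched as a function that takes parameters and returns brief representations, each carrying a link to the fuller one. The `search` function, the `books` data, and the `/books` base URL below are hypothetical, and exact-match filtering stands in for whatever matching a real search would do.

```python
def search(collection, base_url, **criteria):
    """Return summary representations of every resource matching all criteria."""
    results = []
    for key, resource in sorted(collection.items()):
        if all(resource.get(name) == value for name, value in criteria.items()):
            results.append({
                "title": resource.get("title"),
                "href": f"{base_url}/{key}",   # link to the full representation
            })
    return results


books = {
    "7": {"title": "Notes", "topic": "xml", "year": 2011},
    "9": {"title": "Sketches", "topic": "rdf", "year": 2011},
    "12": {"title": "Drafts", "topic": "xml", "year": 2009},
}

hits = search(books, "/books", topic="xml", year=2011)
print(hits)
```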


Resources and Data Models
• A resource is an abstraction of a physical entity or
process, and as such it is, itself, conformant to a
data model.
• A representation of a resource is a transformation
of that resource within the context of its
collection.
• This means that relationships that are only
inferred within the internal model may be made
explicit within the representation.


BIG DATA Stage 1: Hadoop
• Hadoop can take context-poor or legacy
structured data and create from it
contextually richer data records, which can in
turn inform the development of resources.
• While Hadoop can be used for performing
queries, these will be high latency searches
compared to most other systems.
• Hadoop's real value comes from its ability
to process data into more structurally
queryable or manipulable forms.
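
The kind of processing described here, turning context-poor input into richer records, can be sketched with the map/reduce pattern in plain Python. This is not Hadoop's API: the log format, `map_line`, and `reduce_group` are hypothetical, and the in-memory grouping stands in for the shuffle step that a cluster would perform.

```python
from collections import defaultdict

# Hypothetical context-poor input: raw access-log lines.
RAW = [
    "2011-03-01 GET /books/7 200",
    "2011-03-01 GET /books/9 404",
    "2011-03-02 GET /books/7 200",
]


def map_line(line):
    """Map step: emit (key, value) pairs from one raw line."""
    date, _method, path, status = line.split()
    yield path, {"date": date, "status": int(status)}


def reduce_group(path, hits):
    """Reduce step: fold all values for one key into a structured record."""
    return {
        "path": path,
        "hits": len(hits),
        "errors": sum(h["status"] >= 400 for h in hits),
        "last_seen": max(h["date"] for h in hits),
    }


# Shuffle step: group mapped values by key.
groups = defaultdict(list)
for line in RAW:
    for key, value in map_line(line):
        groups[key].append(value)

records = [reduce_group(path, hits) for path, hits in sorted(groups.items())]
print(records)
```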

BIG DATA Stage 2: SQL
• SQL relational databases work best with
single-dimensional views based upon
primary/foreign key relationships, ideal for
many data models that have relatively rigid
structure but somewhat richer semantics.
• SQL is still ideal for many forms of real
world data processing, but generally has
both non-standardized streaming
mechanisms and constrained procedure
semantics.
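
A primary/foreign key relationship and the flat view it produces can be sketched with Python's built-in `sqlite3` module. The `author` and `book` tables and their contents are hypothetical, chosen only to show a join across the key relationship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE book (
        id        INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        author_id INTEGER NOT NULL REFERENCES author(id)
    );
    INSERT INTO author VALUES (1, 'Ada');
    INSERT INTO book VALUES (10, 'Notes', 1);
    INSERT INTO book VALUES (11, 'Sketches', 1);
""")

# Join across the foreign key to build a flat, tabular view.
rows = conn.execute("""
    SELECT book.title, author.name
    FROM book
    JOIN author ON author.id = book.author_id
    ORDER BY book.title
""").fetchall()
print(rows)
```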

BIG DATA Stage 3: Hash Databases

• MongoDB and similar tools extend hash
tables to represent more complex data
models, as the combination of hashes and
sequences can readily represent objects that
have the same core namespace context.
• These are becoming popular as producers
and consumers of JSON, and typically
employ a JavaScript stack for most
transactions.
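
The combination of hashes and sequences can be sketched with Python dictionaries and lists, which map directly onto JSON objects and arrays. The `order` document below is hypothetical, and no MongoDB client is involved; the example only shows the data shape and a JSON round trip.

```python
import json

# A nested document built only from hashes (dict) and sequences (list).
order = {
    "_id": "order/1001",
    "customer": {"name": "Ada", "email": "ada@example.com"},
    "items": [
        {"sku": "A1", "qty": 2, "price": 9.5},
        {"sku": "B7", "qty": 1, "price": 20.0},
    ],
}

# Walk the sequence inside the hash to compute a derived value.
total = sum(item["qty"] * item["price"] for item in order["items"])

# The same structure serializes to JSON and back without loss.
round_tripped = json.loads(json.dumps(order))
print(total, round_tripped == order)
```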

BIG DATA Stage 4: XML Databases

• XML Databases store XML representations
of objects, which best encode narrative
structures and hierarchical entities that cross
namespace boundaries, and they provide very
sophisticated tools for building RESTful
web application services.
• XML Databases are optimal for documents
and hybrid document/data structures.
• XML Databases are standardizing upon
XQuery as the common stack language.
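
Narrative, mixed-content, namespace-crossing structures of the kind described above can be sketched with Python's standard `xml.etree.ElementTree`. This is a stand-in for illustration only, not an XML database and not XQuery; the sample document and its element names are hypothetical, and ElementTree supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# A small narrative document: inline terms inside running text, plus one
# element from another namespace (Dublin Core).
DOC = """<article xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Big Data</dc:title>
  <section>
    <para>Search bridges <term>REST</term> retrieval and <term>web services</term>.</para>
  </section>
</article>"""

NS = {"dc": "http://purl.org/dc/elements/1.1/"}
root = ET.fromstring(DOC)

title = root.find("dc:title", NS).text                   # crosses a namespace boundary
terms = [t.text for t in root.findall(".//para/term")]   # inline markup in the narrative
para_text = "".join(root.find(".//para").itertext())     # the paragraph as plain text

print(title, terms, para_text)
```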

BIG DATA Stage 5: RDF Triple Stores

• Triple stores encode relationships between
resources via RDF triples for query by
SPARQL. Triple stores work best for
working with distributed data where
relationships between resources are as
important as, or more important than, the
actual contents of the resources themselves.
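
The idea of storing relationships as subject/predicate/object triples and querying them by pattern can be sketched in plain Python. This is a toy model, not an RDF library or SPARQL: the `ex:` identifiers, the `match` function, and the `employer_locations` helper are hypothetical.

```python
# Each triple is (subject, predicate, object).
TRIPLES = {
    ("ex:kurt", "ex:worksFor", "ex:avalon"),
    ("ex:kurt", "ex:edits", "ex:xmltoday"),
    ("ex:avalon", "ex:locatedIn", "ex:plano"),
}


def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in TRIPLES
        if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])
    )


def employer_locations(person):
    """Two-hop query: where are this person's employers located?"""
    return [
        place
        for _, _, org in match(s=person, p="ex:worksFor")
        for _, _, place in match(s=org, p="ex:locatedIn")
    ]


print(match(s="ex:kurt"))
print(employer_locations("ex:kurt"))
```

The second query never inspects the contents of any resource; it answers purely by following relationships, which is the case where this model is strongest.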


Complementary Technologies

• There is a tendency to see NoSQL
technologies as competing with one another.
This is both wrong and dangerous.
• The technologies have evolved to handle
different phases of data acquisition,
processing and search, going from data
with poor semantics and rigid structure to
data with rich semantics and flexible
structure.

Generation 5 Data Stores

• The next wave of databases will likely incorporate
two or more generation 4 technologies – such as
combining an XML or JSON database with an
RDF triple store to be able to handle inferences
about stored content, or by adding Hadoop
processing support to a SQL database capable of
providing RESTful representations in XML or
JSON of internal views built from SQL tables.


Query Unification

• The central challenge of the next twenty
years will be in finding the commonalities
across these various platforms in order to
build a twenty-first-century SQL.
• This will require a deeper understanding of
the characteristics of data algebras, which
provide the underpinnings for the optimal
expression of data structures.


XQuery?

• XQuery may be a good candidate to handle
this unification, for several reasons:
– Solid extensibility model
– SQL-like syntax, but capable of handling
RESTful services inbound and outbound
– XQuery 3.1 supports maps and arrays, which are
hash/sequence constructs ideal for encoding
JSON or MongoDB structures.
– Works well in a distributed context.
– Can integrate SPARQL or SQL scripts readily.
– Standards-based, and co-designed by one of the
original authors of SQL.

Contact Us

Avalon Consulting, LLC

Dallas Office - HQ
5600 Tennyson Parkway
Suite 230
Plano, TX 75024
469-424-3449

Washington, DC Office
527 Maple Avenue East
Suite 200
Vienna, VA 22180
703-635-3302

www.avalonconsult.com
info@avalonconsult.com
