This presentation will show how an outdoor advertising company used the Oracle Big Data environment to provide real-time statistics and high-value insights to their customers. Using data from providers such as Pinsight Media and PersonicX data from Acxiom, they are able to accurately show the demographics of the consumers in the viewshed of their billboards and other digital assets. To get useful information out of the terabytes of data they were receiving, our client used Oracle BDCS, specifically Hive to create external tables connecting to flat files or MongoDB, and Impala to analyze the data. The data was then loaded into Oracle DBCS to be accessed by OACS for further analysis and dashboarding.
3. Confidential – Restricted Slide 3
Who are we?
BA, short for Business Analytics, is the technology used by companies to investigate business performance by
analyzing their data. BA encompasses three major areas: Business Intelligence (BI), Enterprise Performance
Management (EPM), and Big Data Analytics. BA in this instance indicates that we will go beyond EPM and also
address topics and technologies that pertain to BI and Big Data.
Wikipedia describes soapbox as a raised platform on which one stands to make an impromptu speech, often
about a political subject. The term originates from the days when speakers would elevate themselves by
standing on a wooden crate originally used for shipment of soap or other dry goods. I want this to be not
only a place for me to share tips and tricks, but also a platform where industry practitioners can come to
share their views and expertise.
I am not an expert! I say this because I have had the pleasure through the years of working with people I would
consider experts in specific products. Instead, I see myself as a data integration specialist, or enthusiast if you
will. I am someone who is passionate about the field and has made a considerable investment in the space.
If you are like me or find the thought of maintaining a website a bit daunting, this may be your opportunity to
share your thoughts and educate others.
So if you have a topic that you would like to contribute, please let me know at BASoapbox.com.
4. Confidential – Restricted Slide 4
The Abstract
This presentation will show how an outdoor advertising company used the Oracle Big Data environment
to provide real-time statistics and high-value insights to their customers. Using data from providers such
as Pinsight Media and PersonicX data from Acxiom, they are able to accurately show the demographics
of the consumers in the viewshed of their billboards and other digital assets. To get useful
information out of the terabytes of data they were receiving, our client used Oracle BDCS,
specifically Hive to create external tables connecting to flat files or MongoDB, and Impala to analyze the
data.
The data was then loaded into Oracle DBCS to be accessed by OACS for further analysis and
dashboarding.
In the session, I will:
Go over the overall objective of the project
Go over some of the key Hive and Impala queries
Show how ODICS was instrumental not only in the automation but also in the interaction
with MongoDB
Go over some of the lessons learned
5. Confidential – Restricted Slide 5
The Big Question
Can we identify by audience size and segment the consumer population exposed to a billboard by
time of day and day of week?
[Timeline: 12:00 AM, 7:00 AM, 12:00 PM, 5:00 PM, 11:59 PM]
Out of the billions of aggregated, anonymized devices observed in a geographical location per day:
1. We need curated assets (billboards)
2. We need to get a total count of the devices we observe within the viewshed of our assets
(billboards)
3. We need to take a look at home and secondary locations to define the segmentation of the
consumers near our assets (billboards)
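As a rough illustration of step 2, the sketch below counts distinct devices observed inside a viewshed polygon, grouped by day of week and hour. It is not the project's Hive or Impala logic: the record layout (device_id, timestamp, lon, lat), the function names, and the choice to count each device once per hour are all assumptions made for the example. The containment check is a standard ray-casting test.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def point_in_polygon(lon, lat, ring):
    """Ray-casting test. ring is a list of (lon, lat) vertices, open or closed."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which this edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def count_impressions(observations, viewshed):
    """Count distinct devices inside the viewshed per (day of week, hour).

    observations: iterable of (device_id, timestamp, lon, lat) tuples
    (a hypothetical layout, not the Pinsight schema).
    """
    seen = set()
    for device_id, ts, lon, lat in observations:
        if point_in_polygon(lon, lat, viewshed):
            seen.add((DAYS[ts.weekday()], ts.hour, device_id))
    return Counter((day, hour) for day, hour, _ in seen)


# Hypothetical viewshed: a unit square given as a closed ring.
viewshed = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
observations = [
    ("a", datetime(2018, 6, 1, 9, 5), 0.5, 0.5),
    ("a", datetime(2018, 6, 1, 9, 40), 0.6, 0.6),   # same device, same hour
    ("b", datetime(2018, 6, 1, 9, 15), 0.2, 0.8),
    ("c", datetime(2018, 6, 1, 9, 30), 2.0, 2.0),   # outside the viewshed
    ("b", datetime(2018, 6, 1, 17, 0), 0.5, 0.5),
]
counts = count_impressions(observations, viewshed)
```

Here `counts` groups the two devices seen on Friday morning under `("Friday", 9)` and the single evening sighting under `("Friday", 17)`.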
6. Confidential – Restricted Slide 6
Viewshed Impressions
Once a viewshed is constructed, an
impression then becomes the
intersection of consumers with the
asset.
Wikipedia describes viewshed as the geographical area that is visible from a location. It includes all
surrounding points that are in line-of-sight with that location and excludes points that are beyond
the horizon or obstructed by terrain and other features (e.g., buildings, trees).
7. Confidential – Restricted Slide 7
A Few Key Terms…
[Diagram: an asset (digital billboard) inside a geofence drawn as a polygon of Lat/Lon vertices, with a route through it marking a pivot, dwell time, and a visit.]
Pivot: When a user crosses a polygon or geofence boundary (e.g., New Jersey to New York City, SoHo to LES, outside to inside).
Dwell Time: The length of time a user stays inside a polygon or geofence between crossing in and crossing out.
Visit: Dwell time at a location that is material to intent.
Geofence: A designated boundary around a geometry that, if crossed, initiates a notification. Geofences are often used in real-time route web applications.
Polygon: On a map, a closed shape defined by a connected sequence of x,y coordinate pairs, where the first and last coordinate pairs are the same and all other pairs are unique.
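To make these terms concrete, here is a minimal sketch that derives pivots and dwell time from one device's track. It assumes each observation has already been flagged as inside or outside the geofence, and it treats dwell time as the time between an entering pivot and the next exiting pivot. The `is_visit` threshold of five minutes is purely hypothetical, since the slide does not say what makes dwell time "material to intent".

```python
from datetime import datetime, timedelta


def is_closed_ring(ring):
    """A polygon ring is closed when its first and last coordinate pairs match."""
    return len(ring) >= 4 and ring[0] == ring[-1]


def pivots_and_dwell(track):
    """Derive pivots and total dwell time for one device.

    track: time-ordered list of (timestamp, inside_geofence) pairs.
    Returns (pivots, dwell) where pivots is a list of (timestamp, "enter" or
    "exit"). A track that ends while still inside contributes no dwell time
    for that last, unfinished stay.
    """
    pivots = []
    dwell = timedelta(0)
    entered_at = None
    prev_inside = False
    for ts, inside in track:
        if inside and not prev_inside:
            pivots.append((ts, "enter"))
            entered_at = ts
        elif prev_inside and not inside:
            pivots.append((ts, "exit"))
            dwell += ts - entered_at
            entered_at = None
        prev_inside = inside
    return pivots, dwell


def is_visit(dwell, min_dwell=timedelta(minutes=5)):
    """Hypothetical rule: dwell time counts as a visit above a threshold."""
    return dwell >= min_dwell


day = datetime(2018, 6, 1)
track = [
    (day.replace(hour=9, minute=0), False),
    (day.replace(hour=9, minute=5), True),    # pivot: outside to inside
    (day.replace(hour=9, minute=12), True),
    (day.replace(hour=9, minute=20), False),  # pivot: inside to outside
    (day.replace(hour=9, minute=30), True),
    (day.replace(hour=9, minute=31), False),
]
pivots, dwell = pivots_and_dwell(track)
```

For this track the device pivots four times and dwells for 16 minutes in total (15 minutes on the first stay, 1 minute on the second).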
8. Confidential – Restricted Slide 8
Final Answer
[Diagram: a 9 AM to 10 AM window connected to Work, Learn, Shop & Eat, and Play locations. Labels: Audience Identity, Audience Behavioral Graph, Audience Size.]
Time
• Friday 9 AM to 10 AM
Audience Size
• 10,432 impressions
Audience Segmentation
• 42% college crowd
• 38% mobile mixers
• 20% city mixers
For an impression seen at a certain time, say between 9 AM and 10 AM, we are able to travel
backward and forward in time to understand home location type and secondary location
type. This all informs audience segmentation.
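A small sketch of how an answer in this shape could be computed once each impression carries a segment label. The record layout (device_id, timestamp, segment) and the decision to count each device once per window are assumptions for illustration. The slide does not say whether its impression count is de-duplicated, and the sample rows are made up.

```python
from collections import Counter
from datetime import datetime

FRIDAY = 4  # datetime.weekday(): Monday is 0


def audience_profile(impressions, weekday, hour):
    """Return (audience size, {segment: percent}) for one day/hour window.

    impressions: iterable of (device_id, timestamp, segment) tuples.
    Each device is counted once, using its last segment seen in the window.
    """
    devices = {}
    for device_id, ts, segment in impressions:
        if ts.weekday() == weekday and ts.hour == hour:
            devices[device_id] = segment
    size = len(devices)
    counts = Counter(devices.values())
    shares = {seg: round(100 * n / size, 1) for seg, n in counts.items()}
    return size, shares


# Made-up sample rows; 2018-06-01 is a Friday.
impressions = [
    ("d1", datetime(2018, 6, 1, 9, 5), "college crowd"),
    ("d2", datetime(2018, 6, 1, 9, 10), "college crowd"),
    ("d3", datetime(2018, 6, 1, 9, 20), "mobile mixers"),
    ("d4", datetime(2018, 6, 1, 9, 50), "mobile mixers"),
    ("d5", datetime(2018, 6, 1, 9, 59), "city mixers"),
    ("d1", datetime(2018, 6, 1, 9, 30), "college crowd"),   # repeat sighting
    ("d6", datetime(2018, 6, 1, 10, 5), "college crowd"),   # after the window
    ("d7", datetime(2018, 6, 2, 9, 15), "city mixers"),     # Saturday
]
size, shares = audience_profile(impressions, FRIDAY, 9)
```

With these rows the Friday 9 AM window holds five devices, split 40% / 40% / 20% across the three segments.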
9. Confidential – Restricted Slide 9
Primary Data Sources
Pinsight Media
• Owned by Sprint
• Parquet file representing the observed population pivoting through
Assets’ viewsheds.
• Contains FIPS/Block Group data for each viewshed impression
FIPS / Block Group
Acxiom
• Data that provides personal characteristics such as demographics, income, ethnicity, etc., to give us a
profile of the users that are passing through each asset.
• PersonicX is a segmentation system which places U.S. households into one of 70 distinctive segments
and 21 life stage groups based upon specific consumer behavior and demographic characteristics.
GSP (MongoDB)
• Asset inventory is maintained in a MongoDB application called GSP
• GSP holds Point of Interest (POI) data, asset data (e.g., location of asset, latitude, and longitude),
geospatial data, real estate, wireless, transit, and taxonomy
• Output data from Big Data Cloud Service (BDCS) will be loaded back into GSP
10. Confidential – Restricted Slide 10
Data Flow
+Big Data Manager
ODI orchestrates Big Data Manager (BDM), which is used to move Pinsight data from AWS S3
to BDCS.
+Big Data Manager
ODI orchestrates Big Data Manager (BDM), which is used to move files from Acxiom to BDCS.
ODI also orchestrates the mongodump and mongoimport utilities to load asset information from
MongoDB to BDCS.
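The deck does not show the actual ODI steps, so the following is only a sketch of what an orchestrated call to the two MongoDB utilities might look like, written in Python rather than as an ODI procedure. The database, collection, and path names are hypothetical, and connection options are omitted, so the tools would fall back to their default local connection.

```python
import subprocess


def mongodump_cmd(db, collection, out_dir):
    """Build the argument list that dumps one collection to a directory."""
    return ["mongodump", "--db", db, "--collection", collection,
            "--out", out_dir]


def mongoimport_cmd(db, collection, json_file):
    """Build the argument list that imports a JSON array file into a collection."""
    return ["mongoimport", "--db", db, "--collection", collection,
            "--file", json_file, "--jsonArray"]


def run(cmd):
    """Run one command and raise CalledProcessError on a non-zero exit code,
    so the calling orchestrator can branch to its failure path."""
    subprocess.run(cmd, check=True)


# Hypothetical names for illustration only.
dump = mongodump_cmd("gsp", "assets", "/tmp/gsp_dump")
load = mongoimport_cmd("gsp", "audience_results", "/tmp/results.json")
```

Keeping the commands as argument lists, rather than shell strings, avoids quoting problems when a path or collection name contains spaces.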
12. Confidential – Restricted Slide 12
Oracle Data Integrator (ODI)
• ODI was used in the following
ways on the project:
• As a means to orchestrate
certain processes
• Performing E-L-T type
integrations
• The use of ODI made it easier
not only to streamline but also
to automate certain processes.
13. ODI Process Orchestration
• This insert shows how we can have ODI perform
the following orchestration in a streamlined and
automated fashion:
1. Call the EDQ Workflow that will do the IMS data
cleansing.
2. If the process fails, then email the responsible
parties. Otherwise, call the EDQ Workflow that will
export the IMS (cleansed) data from EDQ to staging
DB or flat file format
3. Email responsible parties that all processes ran
successfully
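The control flow of these three steps can be sketched as below. This is plain Python standing in for the ODI package, not the ODI or EDQ API: `cleanse`, `export`, and `notify` are hypothetical callables supplied by the caller. Emailing on an export failure is also an assumption, since the slide only describes the failure branch for the cleansing step.

```python
def run_pipeline(cleanse, export, notify):
    """Run cleansing, then export, emailing the outcome.

    cleanse, export: callables that raise an exception on failure.
    notify: callable taking a message string (stands in for the email step).
    Returns True when every step succeeded.
    """
    try:
        cleanse()                      # step 1: IMS data cleansing workflow
    except Exception as exc:
        notify(f"FAILED: cleansing ({exc})")
        return False
    try:
        export()                       # step 2: export cleansed data
    except Exception as exc:
        notify(f"FAILED: export ({exc})")
        return False
    notify("SUCCESS: all processes ran successfully")  # step 3
    return True


def failing_step():
    raise RuntimeError("workflow error")


# Collect messages in lists instead of sending real email.
ok_messages, bad_messages = [], []
ok = run_pipeline(lambda: None, lambda: None, ok_messages.append)
bad = run_pipeline(failing_step, lambda: None, bad_messages.append)
```

In the failing run the export step is never called, which mirrors the "otherwise" branch in step 2.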
14. Confidential – Restricted Slide 14
The Data Lake
Big Data Cloud Service (BDCS) was used as a central data repository, divided into ponds for raw
data, canonical model and semantic layer. Large cellular data from Pinsight, consumer insights
from Acxiom and asset data from MongoDB were loaded into Hive tables in the raw data pond.
Data Ingestion Pond
• Data is stored as-is
• Recent history available on HDFS
• Older files moved to Oracle Object Storage
Common Meanings
• Additional standardized fields and values
• Optimized storage format
• Ideal layer for exploratory data analysis (EDA)
Business Ready
• Finalized model
• Data ready for consumption
• Conforming to the business’ needs
Sources: Acxiom, MongoDB, Pinsight
18. Confidential – Restricted Slide 18
Acxiom Sample File
1. ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.RegexSerDe": since the
file has a fixed format, we use a SerDe clause to tell Hive to use the RegexSerDe class
to serialize and deserialize.
2. "input.regex" is used to deserialize the rows read from the table data. The
regex pattern is applied to each row read from the file to split it into
the columns defined in the metadata for this Hive table.
3. "output.format.string" is used to serialize the rows being written out to this
table data. This value is used as a format to generate a row value (from its
column values) that is written back to the output file for this Hive table.
Note that this property is honored by the older contrib RegexSerDe; the built-in
serde2 class only reads data and treats the property as deprecated.
Regex   Description
.       Any character except a line break
{n}     Exactly n repetitions of the element that precedes it
FIPS county code. The Federal Information Processing Standard Publication 6-4 (FIPS 6-4) was a five-digit Federal Information Processing Standards code which uniquely identified counties and county equivalents in the United States, certain U.S. possessions, and certain freely associated states.
The FIPS State Code is a unique two-digit numeric code used to identify the 50 states, the District of Columbia, and the outlying areas of the United States.
The FIPS County Code is a three-digit numeric code that, when used with the FIPS State Code, provides a unique identifier for each county and statistically equivalent entity of the 50 states, the District of Columbia, and the possessions and associated areas of the United States.
Census tracts are the basic county subdivision used by the Census Bureau for census collection purposes and contain an average of 1,500 households. The tract code is made up of a 4-byte prefix and a 2-byte suffix (the next element) often separated by a decimal. Here, the tract element is provided as one 6-byte element.
A subdivision of a census tract, a block group is another county subdivision used by the Census Bureau for census collection purposes and refers to an area that contains an average of 500 households.
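The field widths described above (2-digit state, 3-digit county, 6-digit tract, 1-digit block group) lend themselves to the same `.{n}` style of pattern used for a fixed-format file. The sketch below applies that idea in Python rather than in a Hive `input.regex`. The 12-character layout and the sample value are assumptions for illustration, since the actual Acxiom record layout is not shown in the deck.

```python
import re

# One capture group per fixed-width field: state, county, tract, block group.
BLOCK_GROUP_RE = re.compile(r"(.{2})(.{3})(.{6})(.{1})")


def split_block_group(code):
    """Split a 12-character block group code into its FIPS components."""
    match = BLOCK_GROUP_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"expected 12 characters, got {len(code)}")
    state, county, tract, block_group = match.groups()
    return {
        "state": state,
        "county": county,
        "county_fips": state + county,  # five-digit state + county code
        "tract": tract,
        "block_group": block_group,
    }


# Illustrative value only, not taken from the sample file.
parts = split_block_group("360610031001")
```

Because every group has a fixed width, the pattern never has to guess where one field ends and the next begins, which is why this approach suits files without delimiters.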
Create a data platform in Big Data Cloud Service (BDCS) that will do the following:
• Act as the data hub for integrating data from Pinsight, Acxiom, GSP (MongoDB), and IMS
• Serve as the landing area for raw data, hold enriched and modeled data as work in progress, and hold finalized data for consumption
• Cleanse the data, especially IMS data that comes from a DB2 database
• Create some semblance of Master Data Management
• Provide the beginnings of an ad-hoc data model for analytics
Large data sets from cellular and GPS sources are needed for audience size by location.
Large data sets from consumer insight data and social media are needed for audience segment from size and location.
POI and geofencing data sources are needed to understand origin and destination segmentation.
Billboard inventory is needed for associating each billboard with origin, destination, POI, and neighborhood.