The document discusses big data and the need for a new data stack, Data Stack 3.0, to handle the variety, volume, and velocity of data. It notes that internet companies have already built their own data platforms to address similar problems at massive scale using open source software like Hadoop. Persistent Systems is presented as having expertise in big data through contributions to open source projects, pre-built solutions, and professional services to help enterprises implement big data solutions.
Customer summit - big data (final)
1. BIG DATA Defined:
Data Stack 3.0
Persistent Systems
June 2012
24 July 2012 1
2. The Data Revolution is Happening Now
The growing need for large-volume, multi-structured “Big Data” analytics,
as well as … “Fast Data”, have positioned the
industry at the cusp of the most radical
revolution in database architectures in 20
years.
We believe that the economics of data will
increasingly drive competitive advantage.
Source: Credit Suisse Research, Sept 2011
3. Enterprise Value is Shifting to Data
[Figure: timeline of where enterprise value resides. Layer labels: Mainframe, Operating Systems, Database, ERP, Apps, Data. Time axis: 1975, 1985, 1995, 2006, 2013.]
4. What Data Can Do For You
Organizational leaders want analytics
to exploit their growing data and
computational power to get smart,
and get innovative, in ways they never
could before.
Source: MIT Sloan Management Review, The New Intelligent Enterprise, “Big Data, Analytics
and the Path From Insights to Value” by Steve LaValle, Eric Lesser,
Rebecca Shockley, Michael S. Hopkins and Nina Kruschwitz,
December 21, 2010
5. Determining Shopping Patterns
British Grocer, Tesco Uses Big Data
by Applying Weather Results to Predict
Demand and Increase Sales
Britain often conjures images of unpredictable weather, with downpours sometimes followed
by sunshine within the same hour — several times a day.
Such randomness has prompted Tesco, the country’s largest grocery chain, to create…its own
software that calculates how shopping patterns change “for every degree of temperature and
every hour of sunshine.”
Source: New York Times, September 2, 2009. Tesco, British Grocer, Uses Weather to Predict Sales By Julia Werdigier
http://www.nytimes.com/2009/09/02/business/global/02weather.html
6. Tracking Customers in Social Media
GlaxoSmithKline Uses Big Data
to Efficiently Target Customers
GlaxoSmithKline is aiming to build direct relationships with 1 million consumers in a year using
social media as a base for research and multichannel marketing. Targeted offers and
promotions will drive people to particular brand websites where external data is integrated
with information already held by the marketing teams.
Source: Big data: Embracing the elephant in the room By Steve Hemsley
http://www.marketingweek.co.uk/big-data-embracing-the-elephant-in-the-room/3030939.article
7. What does India Think?
Persistent enables Aamir Khan Productions and Star Plus to use
Big Data to understand how people react to some of the most
excruciating social issues.
http://www.satyamevjayate.in/
Satyamev Jayate, Aamir Khan’s pioneering, interactive socio-cultural TV show, has caught the
interest of the entire nation. In 4 weeks it has already generated ~7.5M responses from viewers
the world over via SMS, Facebook, Twitter, phone calls and discussion forums. This
data is being analyzed and delivered in real time to allow the producers to understand the
pulse of the viewers, to gauge the appreciation for the show and, most importantly, to spread
the message. Harnessing the truth from all this data is a key component of the show’s success.
9. WE ALREADY HAVE DATABASES.
WHY DO WE NEED TO DO ANYTHING
DIFFERENT?
10. Relational Database Systems for
Operational Store
● Transaction processing capabilities
ideally suited for transaction-oriented
operational stores.
● Data types – numbers, text, etc.
● SQL as the Query language
● De-facto standard as the operational
store for ERP and mission critical
systems.
● Interface through application programs
and query tools
11. Enterprise Data Warehouse for Decision
Support
● Operational data stores store on-line
transactions
– Many writes, some reads.
● Large fact table, multiple dimension
tables
● Schema has a specific pattern – star
schema
● Joins are also very standard and create
cubes
● Queries focus on aggregates.
● Users access data through tools such as
Cognos, Business Objects, Hyperion etc.
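The star-schema pattern described in this slide can be sketched with Python's built-in sqlite3 module. The table names (fact_sales, dim_store, dim_product) and the sample rows are invented for illustration and do not come from the presentation; the point is only that a typical warehouse query joins one fact table to its dimension tables and returns aggregates.

```python
import sqlite3

# In-memory database; all table names and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (
    store_id INTEGER REFERENCES dim_store(store_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    amount REAL
);
INSERT INTO dim_store VALUES (1, 'North'), (2, 'South');
INSERT INTO dim_product VALUES (10, 'Grocery'), (11, 'Apparel');
INSERT INTO fact_sales VALUES
    (1, 10, 100.0), (1, 11, 50.0), (2, 10, 70.0), (2, 10, 30.0);
""")

# Standard star join: the fact table joined to each dimension, then aggregated.
rows = conn.execute("""
    SELECT s.region, p.category, SUM(f.amount) AS total
    FROM fact_sales f
    JOIN dim_store s ON s.store_id = f.store_id
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY s.region, p.category
    ORDER BY s.region, p.category
""").fetchall()

for region, category, total in rows:
    print(region, category, total)
```

Reporting tools such as those named above typically generate queries of this shape, which is why the schema and the joins are so regular.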
12. Standard Enterprise Data Architecture
[Figure: standard enterprise data architecture, shown as two stacks.]
Data Stack 1.0: Operational Data Systems
● Presentation Layer
● Application Logic
● Relational Databases
Data Stack 2.0: Enterprise Data Warehouse Systems
● Sources: Relational Databases, ERP Systems, Purchased Data, Legacy Data
● Extraction, Cleansing (ETL)
● Optimized Loader
● Data Warehouse Engine
● Metadata Repository
● Analyze, Query
13. Despite the two data stacks ...
One in two business executives believe that they do not have
sufficient information across their organization to do their job
Source: IBM Institute for Business Value
14. Data has Variety
Less than 40% of
the Enterprise
Data is stored in
Data Stack 1.0 or
Data Stack 2.0.
15. Beyond the Operational Systems, data
required for decision making is scattered
within and beyond the enterprise
[Figure: data sources grouped into four categories: Structured Data Sources, Unstructured Data Sources, Cloud Data Sources, Public Data Sources. The assignment of individual items to categories is not recoverable from the extracted text.]
Items shown: ERP Systems, CRM Systems, Enterprise Data Warehouse, Expense Management System, Supply Chain Systems, Organizational Workflow, Email Systems, Collaboration/Wiki Sites, Document Repositories, Employee Surveys, Project artifacts, Customer Call Center Records, Vendor Collaboration Systems, Sensor Data, Location and Presence Data, Weather forecasts, Twitter Feeds, Demographic Data, Maps, Economic Data, Social Networking Data.
16. Data Volumes are Growing
5 Exabytes of information was
created between the dawn of
civilization through 2003, but that
much information is now created
every 2 days, and the pace is
increasing
Eric Schmidt, at the Techonomy Conference, August 4, 2010
(1 exabyte = 10^18 bytes)
17. The Continued Explosion of Data in the
Enterprise and Beyond
80% of new information growth is
unstructured content –
90% of that is currently unmanaged
2009: 800,000 petabytes
2020: 35 zettabytes
44x as much Data and Content Over Coming Decade
[Figure: growth curve with time axis 1990, 2000, 2010, 2020]
Source: IDC, The Digital Universe Decade – Are You Ready?, May 2010
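The 44x figure follows from the two endpoints quoted on this slide. A quick check, assuming decimal units (1 zettabyte = 1,000,000 petabytes); the compound annual growth rate is derived here from those endpoints and is not a number stated on the slide.

```python
# Check the growth multiple implied by the two data points on the slide.
# Assumption: decimal (SI) units, so 1 zettabyte = 1,000,000 petabytes.
PETABYTES_PER_ZETTABYTE = 1_000_000

size_2009_pb = 800_000                        # 800,000 petabytes in 2009
size_2020_pb = 35 * PETABYTES_PER_ZETTABYTE   # 35 zettabytes in 2020

growth_multiple = size_2020_pb / size_2009_pb

# Derived value (not from the slide): the constant yearly growth rate that
# would produce this multiple over the 11 years from 2009 to 2020.
years = 2020 - 2009
annual_growth = growth_multiple ** (1 / years) - 1

print(f"growth multiple: {growth_multiple:.2f}x")
print(f"implied compound annual growth: {annual_growth:.1%}")
```

The ratio works out to 43.75, which the slide rounds to 44x.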
18. What comes first -- Structure or data?
[Figure labels: Schema/Structure, Data]
Structure First is Constraining
19. Time to create a new data
stack for unstructured data.
Data Stack 3.0.
20. The Path to Data Stack 3.0:
Must support Variety, Volume and Velocity
Data Stack 1.0                   | Data Stack 2.0                  | Data Stack 3.0
Relational Database Systems      | Enterprise Data Warehouse       | Dynamic Data Platform
Recording Business Events        | Support for Decision Making     | Uncovering Key Insights
Highly Normalized Data           | Un-normalized Dimensional Model | Schema-less Approach
GBs of Data                      | TBs of Data                     | PBs of Data
End User Access through Ent Apps | End User Access Through Reports | End User Direct Access
Structured                       | Structured                      | Structured + Semi Structured
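The schema-less approach in the Data Stack 3.0 column is often described as schema-on-read: records are stored as they arrive, and structure is applied only when a question is asked. A minimal sketch, assuming records are kept as JSON lines; the records, field names and the read_field helper are invented for illustration.

```python
import json
from collections import Counter

# Illustrative records only; field names and values are invented.
# Each record carries whatever fields its channel happened to provide.
raw_records = [
    '{"source": "sms", "text": "Great episode"}',
    '{"source": "twitter", "text": "Well done", "retweets": 3}',
    '{"source": "web", "pledge": true}',
]

def read_field(line, field, default=None):
    """Parse one stored line and pull out a field if the record has it."""
    return json.loads(line).get(field, default)

# Query 1: every free-text message, whichever channel it came from.
texts = [t for t in (read_field(r, "text") for r in raw_records) if t is not None]

# Query 2: how many records each channel produced.
counts = Counter(read_field(r, "source", "unknown") for r in raw_records)

print(texts)
print(dict(counts))
```

No table definition had to change to accept the record with a new "retweets" field, which is the contrast with the structure-first stacks in the first two columns.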
21. Can Data Stack 3.0 Address Real Problems?
● Large Data Volume at Low Price
● Diverse Data beyond Structured Data
● Queries that Are Difficult to Answer
● Answer Queries that No One Dare Ask
22. Time-out!
Internet companies
have already
addressed the same
problems.
23. Internet Companies have to deal with large
volumes of unstructured real-time data.
● Twitter has 140 million active users and more than 400
million tweets per day.
● Facebook has over 900 million active users, who generate an average
of 3.2 billion Likes and Comments per day.
● There were 3.1 billion email accounts in 2011, expected to rise to over 4
billion by 2015.
● There were 2.3 billion internet users (2,279,709,629)
worldwide in the first quarter of 2012, according to Internet
World Stats data updated 31st March 2012.
24. Their data loads and pricing requirements
do not fit traditional relational systems
● Hosted service
● Large cluster (1000s of nodes) of low-cost
commodity servers.
● Very large amounts of data: indexing
billions of documents, videos, images, etc.
● Batch updates.
● Fault tolerance.
● Hundreds of millions of users.
● Billions of queries every day.
25. They built their own systems
● It is the platform that distinguishes them from everyone else.
● They required:
– high reliability across data centers
– scalability to thousands of network nodes
– huge read/write bandwidth
– support for large blocks of data, gigabytes in size
– efficient distribution of operations across nodes to reduce bottlenecks
Relational databases were not suitable and would have been
cost-prohibitive.
26. Internet Companies have open-sourced the
source code they created for their own use.
Companies have
created business
models to support
and enhance this
software.
29. Enterprises Always had Data.
Now there is a way to handle it!
Allows for analysis of massive volumes of
information
• Structured and Unstructured
• External and Internal
Thousands of users, millions of files, and
terabytes of data need to be handled
Commoditized hardware can be used
to reduce costs
Big Data can and should integrate
with existing enterprise information
architecture
Only Big Data makes it possible!
31. Persistent Systems has an
experienced team of Big Data Experts that
has created the technology building blocks
to help you implement a Big Data Solution
that offers a direct path to unlock the value
in your data.
32. Big Data Expertise at Persistent
● 10+ projects executed with Leading ISVs and Enterprise Customers
● Group dedicated to MapReduce, Hadoop and the Big Data ecosystem
(formed 3 years ago)
● Engaged with the Big Data Ecosystem, including leading ISVs and
experts
● Preferred Big Data Services Partner of IBM and Microsoft
33. Big Data Leadership and Contributions
● Code Contributions to Big Data Open Source Projects, including:
– Hadoop, Hive, and SciDB
● Dedicated Hadoop cluster in Persistent
● Created PeBAL – Persistent Big Data Analytics Library
● Created Visual Programming Environment for Hadoop
● Created Data Connectors for Moving Data
● Pre-built Solutions to Accelerate Big Data Projects
34. Persistent’s Big Data Offerings
1. Setting up and Maintaining Big Data Platform
2. Data Analytics on Big Data Platform
3. Building Applications on Big Data
Technology Assets
● Persistent Pre-built Industry Solutions: Retail, Banking, Telco
● Persistent Pre-built Horizontal Solutions (Email, Text, IT Analytics, …)
● Persistent Platform Enhancement IP (PeBAL Analytics Library, Data Connectors)
● Foundational Infrastructure and Platform (Built Upon Selected 3rd Party Big Data Platforms and Technologies; Cluster of Commodity Hardware)
People Assets
● Big Data Custom Services
● Extension of Your Team
● Discovery Workshop
● Training for Your Team
● Methodology
● Team Formation Process
● Cluster Sizing/Config
Other labels in the figure: Visual Programming, Tools
35. Persistent Next Generation Data Architecture
[Architecture diagram; only the labels are recoverable from the extracted text, and the grouping below is approximate.]
● Presentation: Reports & Alerts, BI Tools, Admin App, Solutions, Workflow Integration
● Persistent Analytics Library (PeBAL): Graph Fn, Set Fn, …, Text Analytics Fn
● Processing and storage: PIG/JAQL, Hive, NoSQL, Text Analytics (GATE/SystemT), MapReduce and HDFS, Cluster Monitoring
● Connector Framework sources: Email Server, Media Server, Web Proxy, IBM Tivoli, BBCA, Social Connector (Twitter, Facebook), RDBMS, Data Warehouse (DW)
● Legend: Persistent IP, Commercial/Open Source Product, External Data source
36. Persistent Big Data Analytics Library
WHY PEBAL
• Lots of common problems – not all of them are solved in Map Reduce
• PigLatin, Hive and JAQL are languages, not libraries; something is
needed to run on top that is not tied to SQL-like interfaces
FEATURES
• Organized as JAQL functions, PeBAL implements several graph, set, text
extraction, indexing and correlation algorithms.
• PeBAL functions are schema agnostic.
• All PeBAL functions are tried and tested against well defined use cases.
BENEFITS OF A READY MADE SOLUTION
• Proven – well written and tested
• Reuse across multiple applications
• Quicker implementation of map reduce applications
• High performance
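To make the kind of reusable building block discussed on this slide concrete, here is a sketch of one indexing algorithm, an inverted list, written in the MapReduce style. This is not PeBAL or JAQL code, whose APIs are not shown in the presentation; it is plain Python that imitates the map, shuffle and reduce phases in a single process, and all function names and sample documents are assumptions.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(term, doc_ids):
    """Reduce: turn the grouped values into a sorted posting list."""
    return term, sorted(doc_ids)

def build_inverted_index(docs):
    pairs = (pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text))
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())

# Made-up sample documents keyed by document id.
docs = {1: "big data platform", 2: "data warehouse", 3: "big cluster"}
index = build_inverted_index(docs)
print(index["data"])
print(index["big"])
```

On a real cluster the map and reduce functions would run on different nodes and the shuffle would be done by the framework; packaging such functions once is what lets them be reused across applications.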
37. [Figure labels: Web Analytics, Text Analytics, Inverted Lists, Set, Graph, Statistics]
38. Visual Programming Environment
ADOPTION BARRIERS
• Steep Learning Curve
• Difficult to Code
• Ad-hoc reporting can’t always be done by writing programs
• Limited tooling available
VISUAL PROGRAMMING ENVIRONMENT
• Use Standard ETL tool as the UI environment for generating PIG scripts
BENEFITS
• ETL Tools are widely used in Enterprises
• Can leverage large pool of skilled people who are experts in ETL and BI
tools
• UI helps in iterative and rapid data analysis
• More people will start using it
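The idea of generating PIG scripts from a visual data flow can be sketched as a small translator. The flow format, the to_pig function and the step names below are hypothetical and are not Persistent's convertor; only the emitted statements (LOAD, FILTER, GROUP, FOREACH ... GENERATE, STORE) are standard Pig Latin.

```python
def to_pig(flow):
    """Translate a linear list of data-flow steps into Pig Latin text.

    The step dictionaries are an invented format standing in for whatever
    metadata an ETL tool's data-flow UI would export.
    """
    lines = []
    prev = None  # alias of the relation produced by the previous step
    for i, step in enumerate(flow):
        alias = f"s{i}"
        op = step["op"]
        if op == "load":
            schema = ", ".join(f"{name}:{typ}" for name, typ in step["schema"])
            lines.append(
                f"{alias} = LOAD '{step['path']}' USING PigStorage(',') AS ({schema});"
            )
        elif op == "filter":
            lines.append(f"{alias} = FILTER {prev} BY {step['condition']};")
        elif op == "group_count":
            key = step["key"]
            lines.append(f"{alias}_g = GROUP {prev} BY {key};")
            # After GROUP, the bag of grouped rows is named after the input alias.
            lines.append(
                f"{alias} = FOREACH {alias}_g GENERATE group AS {key}, COUNT({prev}) AS n;"
            )
        elif op == "store":
            lines.append(f"STORE {prev} INTO '{step['path']}';")
            continue
        else:
            raise ValueError(f"unsupported op: {op}")
        prev = alias
    return "\n".join(lines)

# Made-up flow: count server errors per user from a CSV log.
flow = [
    {"op": "load", "path": "logs.csv",
     "schema": [("user", "chararray"), ("status", "int")]},
    {"op": "filter", "condition": "status >= 500"},
    {"op": "group_count", "key": "user"},
    {"op": "store", "path": "errors_by_user"},
]
script = to_pig(flow)
print(script)
```

An analyst who knows ETL tools would only ever see the flow; the generated script is what runs on the cluster.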
39. Visual Programming Environment for
Hadoop
[Figure labels, in order of appearance: Data Sources, ETL Tool, Data Flow UI, Metadata, PIG Convertor, PIG code, PIG UDF Library, HDFS/Hive, Data, HDFS, Big Data Platform. Legend: Persistent IP.]
40. Persistent Connector Framework
[Graphic label: 20+ Years]
WHY CONNECTOR FRAMEWORK
• Pluggable Architecture
OUT OF THE BOX
• Database, Data Warehouse
• Microsoft Exchange
• Web proxy
• IBM Tivoli
• BBCA
• Generic Push connector for *any* content
FEATURES
• Bi-directional connector (as applicable)
• Supports Push/Pull mechanism
• Stores data on HDFS in an optimized format
• Supports masking of data
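A pluggable, pull-based connector design with data masking, as listed on this slide, could look roughly like the following. Every name here (Connector, ListConnector, register, ingest, mask) is hypothetical; the presentation does not show the framework's actual API, and hashing is only one possible masking approach.

```python
import hashlib
from abc import ABC, abstractmethod

class Connector(ABC):
    """Hypothetical plug-in contract: each source type subclasses this."""

    @abstractmethod
    def pull(self):
        """Return an iterable of record dicts fetched from the source."""

class ListConnector(Connector):
    """Toy connector that 'pulls' from an in-memory list of records."""

    def __init__(self, records):
        self._records = list(records)

    def pull(self):
        return iter(self._records)

# Registry that makes the architecture pluggable: new source types are
# added by registering a factory, without touching the ingest code.
REGISTRY = {}

def register(name, factory):
    REGISTRY[name] = factory

def mask(value):
    """Replace a sensitive value with a short one-way hash (an assumption)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]

def ingest(source_name, masked_fields, **kwargs):
    """Pull every record from the named source, masking selected fields."""
    connector = REGISTRY[source_name](**kwargs)
    return [
        {k: mask(v) if k in masked_fields else v for k, v in record.items()}
        for record in connector.pull()
    ]

register("list", ListConnector)
emails = [{"sender": "a@example.com", "subject": "Q2 report"}]
stored = ingest("list", masked_fields={"sender"}, records=emails)
print(stored[0]["subject"])
```

In a real deployment the returned records would be written to HDFS rather than kept in a list, and a push-style connector would call into the framework instead of being polled.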
42. Persistent’s Breadth of Big Data Capabilities
Big Data Platform Components
● Horizontal and Vertical Pre-built Solutions
● Big Data Platform (PeBAL) analytics libraries and Connectors
● Distributed File Systems
– HDFS
– IBM GPFS
● Cluster Layer
– Platform Setup on multi-node clusters, monitoring, VM based setup
– Product Deployment
Legend: Persistent IP for Big Data Solutions
Tooling
• RDBMS/DWH to import/export data
• Text Analytics libraries
• Data Visualization using Web2.0 and reporting tools - Cognos, Microstrategy
• Ecosystem tools like - Nutch, Katta, Lucene
IT Management
• Job configuration, management and monitoring with BigInsights’ job scheduler (MetaTracker)
• Job failure and recovery management
Big Data Application Programming
• Deep JAQL expertise - JAQL Programming, Extending JAQL using UDFs, Integration of third party tools/libraries, Performance tuning, ETL using JAQL
• Expertise in MR programming - PIG, Hive, Java MR
• Deep expertise in analytics - Text Analytics - IBM’s text extraction solution (AQL + SystemT)
• Statistical Analytics - R, SPSS, BigInsights Integration with R
43. Persistent Roadmap to Big Data
1. Learn: Discover and Define Use Cases
2. Initiate: Validate with a POC
3. Scale: Upgrade to Production if Successful
4. Measure: Measure Effectiveness and Business Value
5. Manage: Improve Knowledge Base and Shared Big Data Platform
44. Customer Analytics
Identifying your most influential customers?
1. Build a social graph of all customers (70 million customers).
2. Overlay sales data on the graph (> 1 billion transactions over twenty years).
3. Identify influential customers using network analysis (a few thousand influential customers).
4. Target these customers for promotions.
Targeting influential customers is the best way to improve campaign ROI!
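The pipeline on this slide (build a graph, overlay sales, rank by network analysis) can be sketched at toy scale. The scoring rule used here, a customer's influence taken as the total sales of their direct connections, is an assumption chosen for brevity and is not stated in the presentation; the customer names and amounts are invented.

```python
from collections import defaultdict

def influence_scores(edges, sales):
    """Score each customer by the total sales of their direct connections.

    This is one simple network measure chosen for illustration; the slides
    do not say which analysis was actually used.
    """
    neighbors = defaultdict(set)
    for a, b in edges:  # undirected social graph
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {
        customer: sum(sales.get(n, 0.0) for n in nbrs)
        for customer, nbrs in neighbors.items()
    }

def top_influencers(edges, sales, k):
    """Return the k highest-scoring customers, ties broken by name."""
    scores = influence_scores(edges, sales)
    return sorted(scores, key=lambda c: (-scores[c], c))[:k]

# Made-up graph and sales overlay.
edges = [("ann", "bob"), ("ann", "cal"), ("ann", "dee"), ("bob", "cal")]
sales = {"ann": 10.0, "bob": 40.0, "cal": 25.0, "dee": 5.0}
print(top_influencers(edges, sales, 2))
```

At the scale quoted on the slide the same computation would be distributed, for example as a MapReduce job over the edge list, rather than held in a single in-memory dictionary.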
45. Overview of Email Analytics
● Key Business Needs
– Ensure compliance with respect to a variety of business and IT communications and
information sharing guidelines.
– Provide an ongoing analysis of customer sentiment through email communications.
● Use Cases
– Quickly identify if there has been an information breach or if information is being shared in
ways that are not in compliance with organizational guidelines.
– Identify if a particular customer is not being appropriately managed.
● Benefits
– Ability to proactively manage email analytics and communications across the organization in a
cost-effective way.
– Reduce the response time to manage a breach and proactively address issues that emerge
through ongoing analysis of email.
46. Using Email to Analyze Customer
Sentiment
Sense the mood of your customers
through their emails
Carry out detailed analysis on customer
team interactions and response times
47. Analyzing Prescription Data
1.5 million patients are
harmed by medication
errors every year
Identifying erroneous prescriptions can save lives!
Source: Center for Medication Safety & Clinical Improvement
48. Overview of IT Analytics
● Key Business Needs
– Troubleshooting issues in the world of advanced and cloud based systems is highly complex, requiring
analysis of data from various systems.
– Information may be in different formats, locations, granularity, data stores.
– System outages have a negative impact on short-term revenue, as well as long-term credibility and
reliability.
– The ability to quickly identify if a particular system is unstable and take corrective action is imperative.
● Use Cases
– Identify security threats and isolate the corresponding external factors quickly.
– Identify if an email server is unstable, determine the priority and take preventative action before a
complete failure occurs.
● Benefits
– Reduced maintenance cost
– Higher reliability and SLA compliance
49. Consumer Insight from Social Media
Find out what customers are
saying about your organization or
product on social media
50. Insights for Satyamev Jayate – Variety of sources
1. Structured Analysis
● Responses to Pledge, multiple choice questions
2. Unstructured Analysis
● Responses to the following questions:
– Share your story
– Ask a question to Aamir
– Send a message of hope
– Share your solution
● Content Filtering Rating Tagging System (CFRTS)
● L0, L1, L2 phased analytics
3. Impact Analysis
● Crawling general internet for measuring the before & after scenario on a particular topic
Source labels in the figure: Web/TV Viewer Response to Pledge, multiple choice questions; Web, Social Media (unstructured); Social Media (Structured); Web, emails, IVR/Calls; Individual blogs; SMS; IVR; Social widgets; Videos
51. Rigorous Weekly Operation Cycle producing instant analytics
Killer combo of Human+Software to analyze the data efficiently
● Topic opens on Sunday.
● Live Analytics report is sent during the show.
● Data capture from SMS, phone calls, social media, website.
● System runs L0 Analysis; L1, L2 Analysts continue.
● JSONs are created for the external and internal dashboards.
● Featured content is delivered thrice a day throughout the week.
● Episode Tags are refined and messages are re-ingested for another pass.
53. Thank you
Anand Deshpande (anand@persistent.co.in)
http://in.linkedin.com/in/ananddeshpande
Persistent Systems Limited
www.persistentsys.com
54. Next Generation Sequencing
Sequencing machines are getting affordable
Running cost of sequencing is going down
NGS machines generate TBs of data per week.
Need to analyze this data in time
Analysis results are critical for human life, personalized medicines