3. Agenda
Big Data - definition and explanation
Cloud Computing
Data Science
Business Strategy Models
Case Studies in Insurance
4. Big Data
What is Big Data?
"Big data" is a term applied to data sets whose size is beyond the ability of
commonly used software tools to capture, manage, and process the data within a
tolerable elapsed time.
Examples include web logs, RFID, sensor networks, social networks, social data
(due to the social data revolution), Internet text and documents, Internet search
indexing, call detail records, astronomy, atmospheric science, genomics,
biogeochemical, biological, and other complex and often interdisciplinary scientific
research, military surveillance, medical records, photography archives, video
archives, and large-scale e-commerce.
5. Big Data
What is Big Data?
"extremely large data sets that may be analysed computationally to reveal
patterns, trends, and associations, especially relating to human behaviour and
interactions."
1. "much IT investment is going towards managing and maintaining big data"
https://en.wikipedia.org/wiki/Big_data
Big data is a term for data sets that are so large or complex that traditional data processing
applications are inadequate to deal with them. Challenges include analysis, capture, data curation, search, sharing, storage, transfer,
visualization, querying, updating and information privacy.
6. Big Data: Statistics
IBM- http://www-01.ibm.com/software/data/bigdata/
Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data
in the world today has been created in the last two years alone. This data comes
from everywhere: sensors used to gather climate information, posts to social
media sites, digital pictures and videos, purchase transaction records, and cell
phone GPS signals to name a few. This data is big data.
7. Big Data: Moving Fast
IBM- https://www.ibm.com/big-data/us/en/
Big data is being generated by everything around us at all times. Every digital
process and social media exchange produces it. Systems, sensors and mobile
devices transmit it. Big data is arriving from multiple sources at an alarming
velocity, volume and variety. To extract meaningful value from big data, you need
optimal processing power, analytics capabilities and skills.
8. The 4 Vs of Big Data
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
18. Who uses Big Data
http://www.sas.com/en_us/insights/big-data/what-is-big-data.html
Banking
With large amounts of information streaming in from countless sources, banks are faced with finding new and innovative ways to
manage big data. While it’s important to understand customers and boost their satisfaction, it’s equally important to minimize risk and
fraud while maintaining regulatory compliance. Big data brings big insights, but it also requires financial institutions to stay one step
ahead of the game with advanced analytics.
Education
Educators armed with data-driven insight can make a significant impact on school systems, students and curriculums. By analyzing big
data, they can identify at-risk students, make sure students are making adequate progress, and can implement a better system for
evaluation and support of teachers and principals.
Government
When government agencies are able to harness and apply analytics to their big data, they gain significant ground when it comes to
managing utilities, running agencies, dealing with traffic congestion or preventing crime. But while there are many advantages to big
data, governments must also address issues of transparency and privacy.
19. Who uses Big Data
http://www.sas.com/en_us/insights/big-data/what-is-big-data.html
Health Care
Patient records. Treatment plans. Prescription information. When it comes to health care, everything needs to be done quickly,
accurately – and, in some cases, with enough transparency to satisfy stringent industry regulations. When big data is managed
effectively, health care providers can uncover hidden insights that improve patient care.
Manufacturing
Armed with insight that big data can provide, manufacturers can boost quality and output while minimizing waste – processes that are
key in today’s highly competitive market. More and more manufacturers are working in an analytics-based culture, which means they
can solve problems faster and make more agile business decisions.
Retail
Customer relationship building is critical to the retail industry – and the best way to manage that is to manage big data. Retailers need
to know the best way to market to customers, the most effective way to handle transactions, and the most strategic way to bring back
lapsed business. Big data remains at the heart of all those things.
20. Big Data: Hadoop Stack
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of
computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering
local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and
handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be
prone to failures.
The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
http://hadoop.apache.org/
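The MapReduce model that this stack is built on can be sketched in a few lines of plain Python (a toy word count, not the Hadoop API; the documents are made up): a map step emits key-value pairs and a reduce step aggregates all values per key.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.lower().split():
        yield word, 1

def reduce_phase(pairs):
    # Shuffle: group values by key; Reduce: sum the counts per word.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

documents = ["big data moves fast", "big data is big"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(pairs)
print(counts["big"])  # "big" appears three times across both documents
```

Hadoop runs the same contract at scale: map tasks execute on the nodes holding each data split, and the framework shuffles the intermediate pairs to reduce tasks.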
21. Big Data: Hadoop Stack
Hadoop-related projects at Apache include:
Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for
Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a
dashboard for viewing cluster health (such as heatmaps) and the ability to view MapReduce, Pig and Hive applications visually,
along with features to diagnose their performance characteristics in a user-friendly manner.
Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel computation.
Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that
supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to
execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™,
Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace
Hadoop™ MapReduce as the underlying execution engine.
ZooKeeper™: A high-performance coordination service for distributed applications.
25. NoSQL
A NoSQL (Not-only-SQL) database is one that has been designed to store,
distribute and access data using methods that differ from relational databases
(RDBMS’s). NoSQL technology was originally created and used by Internet
leaders such as Facebook, Google, Amazon, and others who required database
management systems that could write and read data anywhere in the world, while
scaling and delivering performance across massive data sets and millions of
users.
28. How NoSQL Databases Differ From Each Other
https://www.datastax.com/nosql-databases
There are a variety of different NoSQL databases on the market with the key differentiators between them
being the following:
Architecture: Some NoSQL databases like MongoDB are architected in a master/slave model in somewhat
the same way as many RDBMS’s. Others (like Cassandra) are designed in a ‘masterless’ fashion where all
nodes in a database cluster are the same. The architecture of a NoSQL database greatly impacts how well
the database supports requirements such as constant uptime, multi-geography data replication, predictable
performance, and more.
Data Model: NoSQL databases are often classified by the data model they support. Some support a wide-
row tabular store, while others sport a model that is either document-oriented, key-value, or graph.
Data Distribution Model: Because of their architecture differences, NoSQL databases differ on how they
support the reading, writing, and distribution of data. Some NoSQL platforms like Cassandra support writes
and reads on every node in a cluster and can replicate / synchronize data between many data centers and
cloud providers.
Development Model: NoSQL databases differ on their development API’s with some supporting SQL-like
languages (e.g. Cassandra’s CQL).
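The data-model differences above can be illustrated with plain Python dictionaries (a sketch with invented records, not any vendor's API): a key-value store treats the value as an opaque blob, while a document store understands the nested structure of the value.

```python
import json

# Key-value model: the value is an opaque blob addressed only by its key.
kv_store = {"customer:42": '{"name": "Ada", "city": "London"}'}

# Document model: the database understands the nested structure, so
# individual fields can be queried and indexed.
doc_store = {"customer:42": {"name": "Ada", "city": "London",
                             "orders": [{"id": 1, "total": 19.99}]}}

# A key-value store can only fetch the whole blob and decode it client-side...
blob = json.loads(kv_store["customer:42"])

# ...while a document store can address fields inside the value directly.
city = doc_store["customer:42"]["city"]
print(city)  # London
```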
30. Cloud Computing
Cloud computing is a model for enabling ubiquitous,
convenient, on-demand network access to a shared pool of
configurable computing resources (e.g., networks, servers,
storage, applications, and services) that can be rapidly
provisioned and released with minimal management effort or
service provider interaction. This cloud model is composed of
five essential characteristics, three service models, and four
deployment models.
http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf
--National Institute of Standards and Technology
31. Cloud Computing: Types
five essential characteristics
1. On-demand self-service
2. Broad network access
3. Resource pooling
4. Rapid elasticity
5. Measured service
32. Cloud Computing
1. the practice of using a network of remote servers hosted on the Internet to store, manage, and process data, rather than a
local server or a personal computer.
http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf
34. Cloud Computing: Types
four deployment models (private, public, community and hybrid).
Key enabling technologies include:
1. fast networks,
2. inexpensive computers, and
3. virtualization for commodity hardware.
35. Cloud Computing: Types
major barriers to broader cloud adoption are
security, interoperability, and portability
In simple terms, cloud computing is scalable, customizable computing power
available for rent by the hour and accessible remotely. It can help you do more
computing at a fraction of the cost.
36. Data Driven Decision Making
- using data and trending historical data
- validating assumptions, if any
- using champion/challenger tests to compare scenarios
- using experiments
- using baselines
- continuous improvement
- customer experiences
- costs
- revenues
If you can't measure it, you can't manage it -Peter Drucker
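The champion/challenger idea can be sketched as a toy conversion-rate comparison (the counts below are invented; a real test would also check statistical significance before declaring a winner):

```python
def conversion_rate(conversions, trials):
    return conversions / trials

# Champion: the current process; challenger: the proposed change.
champion = conversion_rate(conversions=120, trials=4000)    # 3.0%
challenger = conversion_rate(conversions=156, trials=4000)  # 3.9%

# Relative lift of the challenger over the champion baseline.
lift = (challenger - champion) / champion
print(f"lift: {lift:.0%}")  # the challenger shows a ~30% relative lift
```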
37. BCG Matrix for Product Lines
The BCG Matrix is best used to analyze your own or a target organization’s product portfolio; it is applicable to companies
with multiple products.
It helps corporations analyze their business units
or product lines, which in turn helps the company allocate resources.
38. Porter’s 5 Forces Model for Industries
It draws upon industrial organization (IO) economics
to derive five forces that determine the competitive intensity
and therefore attractiveness of a market.
Attractiveness in this context refers to the overall industry
profitability. An “unattractive” industry is one in which
the combination of these five forces acts to drive down
overall profitability. A very unattractive industry would be
one approaching “pure competition”, in which available
profits for all firms are driven to normal profit.
39. Porter’s Diamond Model
an economical model developed by Michael Porter in his book The Competitive Advantage of Nations, where he
published his theory of why particular industries become competitive in particular locations.
40. McKinsey 7S Framework
This framework, from the famous consulting company McKinsey, helps assess which teams work well and which do not
(within an organization). It offers a strategic vision for groups, including businesses, business units, and teams. The 7S are
structure, strategy, systems, skills, style, staff and shared values. The model is most often used as a tool to assess
and monitor changes in the internal situation of an organization.
41. Greiner Model for Organizational Growth
Developed by Larry E. Greiner, it is helpful when
examining the problems associated with growth in
organizations and the impact of change on employees.
It can be argued that growing organizations move
through five relatively calm periods of evolution, each
of which ends with a period of crisis and revolution.
Each evolutionary period is characterized by the
dominant management style used to achieve
growth, while
each revolutionary period is characterized by the
dominant management problem that must be
solved before growth can continue.
42. Marketing Model
The 4P and 4C models help you identify the marketing mix:
4P: Product, Price, Promotion, Place
4C: Consumer, Cost, Communication, Convenience
43. Business Canvas Model
The Business Model Canvas is a strategic management template for developing new or documenting existing
business models. It is a visual chart with elements describing a firm’s value proposition, infrastructure, customers,
and finances. It assists firms in aligning their activities by illustrating potential trade-offs.
44. Motivation Models
Herzberg's motivation-hygiene theory
job satisfaction and job dissatisfaction act independently of each other
Leading to satisfaction
Achievement
Recognition
Work itself
Responsibility
Advancement
Leading to dissatisfaction
Company policy
Supervision
Relationship with boss
Work conditions
Salary
Relationship with peers
46. Business Strategy Models
http://decisionstats.com/2013/12/19/business-strategy-models/
1. Porter's 5 Forces Model - to analyze industries
2. Business Model Canvas
3. BCG Matrix - to analyze product portfolios
4. Porter's Diamond Model - to analyze locations
5. McKinsey 7S Model - to analyze teams
6. Greiner Theory - to analyze the growth of an organization
7. Herzberg Hygiene Theory - to analyze soft aspects of individuals
47. Data Science
What is a data scientist? A data
scientist is one who has
interdisciplinary skills in
programming, statistics, and
business domains and creates
actionable insights based on
experiments or summaries from
data.
48. Data Science
On a daily basis, a data scientist is simply a person
who can write some code
in one or more of the languages R, Python, Java, SQL, or Hadoop (Pig, HQL, MR)
for
data storage, querying, summarization, and visualization, efficiently and on time
on
databases, the cloud, or servers, and who understands enough statistics to derive insights from data
so business can make decisions
What should a data scientist know? They should know how to get data, store
it, query it, manage it, and turn it into actionable insights.
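That get-store-query loop can be sketched end to end with Python's built-in sqlite3 (invented toy sales data, not a production pipeline):

```python
import sqlite3

# Store: load raw transaction data into a queryable database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 80.0), ("north", 200.0)])

# Query and summarize: aggregate revenue by region, biggest first.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY 2 DESC"
).fetchall()

# Actionable insight: which region drives the most revenue?
top_region, top_revenue = rows[0]
print(top_region, top_revenue)  # north 320.0
```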
49. Big Data Social Media Analysis
https://rdatamining.wordpress.com/2012/05/17/an-example-of-social-network-analysis-with-r-using-package-igraph/
Social Network Analysis
50. How does information propagate through a
social network?
http://www.r-bloggers.com/information-transmission-in-a-social-network-dissecting-the-spread-of-a-quora-post/
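A minimal sketch of hop-by-hop propagation, assuming an invented follower graph and plain breadth-first search (not the igraph/Quora analysis linked above):

```python
from collections import deque

# Hypothetical share graph: each key shares the post with the listed followers.
graph = {"alice": ["bob", "carol"],
         "bob": ["dave"],
         "carol": ["dave", "erin"],
         "dave": [],
         "erin": []}

def propagation_hops(graph, source):
    """Return the hop count at which each user first sees the post."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        user = queue.popleft()
        for follower in graph[user]:
            if follower not in hops:  # count only the first exposure
                hops[follower] = hops[user] + 1
                queue.append(follower)
    return hops

# dave and erin first see alice's post two hops out.
print(propagation_hops(graph, "alice"))
```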
51. Fraud Analysis
Anomaly detection (also outlier detection) is the identification of items, events or observations that do not conform to an
expected pattern or to other items in a dataset.
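A minimal sketch of the idea, using a z-score rule over made-up transaction amounts (real fraud systems use far richer features and models; robust statistics such as median/MAD also resist the outlier inflating the threshold):

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.0):
    # Flag points more than `threshold` standard deviations from the mean.
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Typical card transactions, plus one suspicious amount.
amounts = [23.0, 19.5, 30.0, 22.0, 25.5, 21.0, 28.0, 950.0]
print(zscore_outliers(amounts))  # only the 950.0 transaction is flagged
```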
52. How they affect you: Financial Profitability
Data Storage is getting cheaper but the way it is stored is changing ( from
company servers to external cloud)
Big Data helps to store every interaction, transaction, with customer but this also
increases complexity of data
Data Science is getting cheaper ( open source) but more skilled professionals in
analytics required
53. How they affect you: Sales and Marketing
Which customers to target and who not to target ( traditional propensity models)
Where to target ( geocoded)
When to target
Forecast Demand
54. How they affect you: Operations
Optimize cost and logistics
Maximize output per resource
Can also be combined with IoT
55. How they affect you: Human Resources
Which employee is likely to leave first
Which skill is most likely to be crucial in the next 12-24 months
Forecast for skills, employees
56. Insurance Examples
http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-progressive-insurance-35951-1.html
Agents increasingly want mobile enablement, and not just the
ability to quote, but to bind and sell policies on smartphones and
tablets. -Progressive
Progressive Snapshot
https://www.progressive.com/auto/snapshot/
To participate you attach the Snapshot device to the computer in
your car, which collects data about your driving habits. According
to Progressive, the device records your vehicle identification
number (VIN), how many miles you drive each day and how often
you drive between midnight and 4 a.m.
After driving with Snapshot for 30 days, you return it to Progressive
and, depending on your driving habits, the company says you can
get a discount of up to 30%.
57. Insurance Examples
Mass Mutual http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-massmutual-35952-1.html
Created Haven Life, an online insurance agency that uses an algorithmic underwriting tool and a series of related decisions,
created in collaboration with a team of data scientists.
Insurance companies are vast decision-making engines that take and manage risk. The inputs into this engine are data, and
the capabilities created by the field of data science can and will impact every process in the company — from underwriting
to claims management to security.
58. Insurance Examples
CNA is applying big data technology to workers compensation claims and adjusters’ notes.
“That is a classic, unstructured big data kind of problem,” says Nate Root, SVP of CNA’s shared service organization. “We
have hundreds of thousands of workers compensation claims, and claims adjuster notes, and there is tremendous value in
those notes.”
Root says the insurer recently began identifying workers’ compensation claims that have the potential to turn into a total
disability, or partial permanent disability, without the right sort of attention. By examining the unstructured data, CNA has
developed a hundred different variables that can predict a propensity for a claim to become serious, and then assign a
nurse case manager to help the insured get necessary treatments for a better patient outcome, get them back to work and
lower the overall cost of coverage. For example, the program can find people who are missing appointments or who are not
engaged with physical therapy and should be.
http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-cna-35959-1.html
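A toy sketch of the kind of text-derived variable CNA describes, scoring adjuster notes for severity signals (the keywords and notes below are invented; CNA's actual hundred variables are not public):

```python
# Invented severity signals; the real model's variables are proprietary.
SEVERITY_SIGNALS = ["missed appointment", "not attending therapy",
                    "chronic pain", "no improvement"]

def severity_score(note):
    """Count how many severity signals appear in one adjuster note."""
    text = note.lower()
    return sum(signal in text for signal in SEVERITY_SIGNALS)

notes = [
    "Claimant missed appointment twice, reports chronic pain.",
    "Routine follow-up, recovery on schedule.",
]
# Claims scoring at or above the cutoff would be routed to a nurse case manager.
flagged = [n for n in notes if severity_score(n) >= 2]
print(len(flagged))  # 1
```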
59. Insurance Examples
American Family Insurance licensed APT’s Test & Learn software
(http://www.predictivetechnologies.com/products/test-learn.aspx ) to enhance
customer engagement and increase support for agents. “This is a statistical tool
that enables us to create and analyze statistical tests,”
For example, call-routing techniques affect wait times and, ultimately, claims
satisfaction. The insurer also tracks how claims are handled, and by whom, and
whether agents are involved in resolution. Using APT, the insurer can isolate
variables and accurately determine the success of one design vs. another for
various products, geographies or demographics.
http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-american-family-insurance-35953-1.html
60. Insurance Examples
http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-american-family-insurance-35953-1.html
American Family Insurance: Unstructured data, such as that collected in call center transcripts, also can be studied to
better understand what approaches are best for different situations, he says. “Hadoop and other tools enable natural-
language processing and sentiment analysis,” Cruz says. “We can look for key words or patterns in those words, do counts
and build models off textual indicators that enable us to identify three things:
1. when there could be fraud involved,
2. where there might be severity issues,
3. or how we can get ahead of that and plan for it.”
Customer communication, web design and direct mail are other areas where the insurer is, or soon will be, using APT:
“Do we see greater lift in these geographies vs. those?”
61. Insurance Examples
Like MassMutual, Nationwide has partnered with a local college — Ohio State University, the university with the third-
largest enrollment in the country. The Nationwide Center for Advanced Customer Insights (NCACI) gives OSU students in
advanced degree programs the ability to work with real-world data to solve some of the biggest insurance business
problems. Faculty and students from the marketing, statistics, psychology, economics and computer science departments
work with Nationwide to develop predictive models and data mining techniques aimed at improving
1. marketing and distribution,
2. identifying consumer behavior patterns, and
3. increasing customer satisfaction and
4. lifetime value.
62. Insurance Examples
John Hancock
A John Hancock team set out to find a way to leverage the wealth of data collected by wearable technologies, including the popular Fitbit
and the recently released Apple Watch, to give something back to their customers. The end result was John Hancock Vitality, a
new life insurance product that offers up to a 15 percent premium discount to customers who track their healthy habits with
wearables and turn that information over to the insurance company. New buyers even get their own Fitbit to begin tracking.
http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-john-hancock-35954-1.html
Fitbit Inc. is an American company known for its products of the same name, which are activity trackers,
wireless-enabled wearable technology devices that measure data such as the number of steps walked,
heart rate, quality of sleep, steps climbed, and other personal metrics.
63. Insurance Examples
Swiss Re is using more public data to improve underwriting results and decrease the number of questions the insurer has
to ask consumers to underwrite them. Swiss Re is looking at big data in terms of two major streams. In the first, big data is
being used to help reduce costs and improve the efficiency of current processes throughout the insurance value chain,
including claims and fraud management, cyber risk, customer management, pricing, risk assessment and selection,
distribution and service management, product innovation, and research and development.
In the second stream, big data also offers a new framework to think bigger in terms of market disruption. Swiss Re has
created more than 100 prototypes internally, and that as a result the entire organization sees the value and importance of
big data and smart analytics.
http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-swiss-re-35957-1.html
64. Insurance Examples
“How do you take that operationally efficient data and turn it into a customer/household view and understand all the
products attached to a person?”
Allstate has focused heavily on master data management and data governance, creating party and household IDs for
its data. The company is also building a team to work across business areas on analytics projects rather than siloing big data
projects within certain units.
“Something meant for a single purpose often leads to other insights. We know, for example, based on some call-volume
analysis in our call center, how often customers defect. We have an application in claims, QuickFoto, where a policyholder
that isn’t in a major accident can snap a picture of the damage and send it to us. But whereas in the past, that would’ve
gone into a physical folder and then a filing cabinet, now I have all those pictures of cars in a database, and there’s a lot
more that I can do.”
68. Data Science Approach
On a daily basis, a data scientist is simply a person
who can write some code
in one or more of the languages R, Python, Java, SQL, or Hadoop (Pig, HQL, MR)
for
data storage, querying, summarization, and visualization, efficiently and on time
on
databases, the cloud, or servers, and who understands enough statistics to derive insights from data so
business can make decisions
69. Data Science Approach
What should a data scientist know? They should know how to get data, store it,
query it, manage it, and turn it into actionable insights. The following approach
elaborates on this simple and sequential premise.
70. Where to get Data
A data scientist needs data to do science on, right? Some of the usual sources of data for a data scientist are:
APIs- API is an acronym for Application Programming Interface. We cover APIs in detail in Chapter 6. APIs are how the current big data
paradigm is enabled, as they enable machines to talk to and fetch data from each other programmatically. For a list of articles written by the
same author on APIs, see https://www.programmableweb.com/profile/ajayohri.
Internet Clickstream Logs- Internet clickstream logs refer to the data generated by humans when they click specific links within a
webpage. This data is time stamped, and the uniqueness of the person clicking the link can be established by IP address. IP
addresses can be looked up in registries like https://www.arin.net/whois or http://www.apnic.net/whois to examine location (country and
city), internet service provider, and owner of the address (for website owners this can be done using the website http://who.is/). In
Windows the command ipconfig, and in Linux systems ifconfig, can help us examine an IP address. You can read more about IP
addresses at http://en.wikipedia.org/wiki/IP_address. Software like Clicky (http://getclicky.com) and Google
Analytics (www.google.com/analytics) also gives us data, which can then be parsed using their APIs. (See
https://code.google.com/p/r-google-analytics/ for Google Analytics using R.)
Machine Generated Data- Machines generate a lot of data, especially from sensors meant to ensure that the machine is working properly. This
data can be logged and used with events like cracks or failures for predictive asset maintenance or M2M (Machine to Machine)
analytics.
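Parsing one clickstream log line into IP address, timestamp, and URL can be sketched with the standard library (the Apache-style log line below is made up):

```python
import re
from datetime import datetime

# One made-up log line in common Apache format.
line = '203.0.113.7 - - [12/Mar/2016:10:15:32 +0000] "GET /pricing HTTP/1.1" 200 512'

# Capture the client IP, the bracketed timestamp, and the requested URL.
pattern = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "GET (?P<url>\S+)')
match = pattern.match(line)

ip = match.group("ip")
timestamp = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
url = match.group("url")
print(ip, timestamp.hour, url)  # 203.0.113.7 10 /pricing
```

The extracted IP could then be fed to a whois registry such as ARIN or APNIC to estimate the visitor's location and provider.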
71. Where to get Data
Surveys- Surveys are mostly questionnaires filled in by humans. They used to be administered manually on paper, but online surveys are
now the definitive trend. Surveys reveal valuable data about the current preferences of current and potential customers. They do suffer
from the bias inherent in the design of questions by the creator. Since customer preferences evolve, surveys help in getting primary data
about current preferences. Coupled with stratified random sampling, they can be a powerful method for collecting data. SurveyMonkey
is one such company that helps create online questionnaires (https://www.surveymonkey.com/pricing/).
Commercial Databases- Commercial databases are proprietary databases that have been collected over time and are sold/rented
by vendors. They can be used for prospect calling, appending information to an existing database, and refining internal database quality.
Credit Bureaus- Credit bureaus collect financial information about people, and this information is then made available to marketing
organizations (subject to legal and privacy guidelines). The cost of such information is balanced by the added information about
customers.
Social Media- Social media is a relatively new source of data and offers powerful insights, albeit through a lot of unstructured data.
Companies like Datasift offer social media data, and companies like Salesforce/Radian6 offer social media tools
(http://www.salesforcemarketingcloud.com/). Facebook had 829 million daily active users on average in June 2014, with 1.32 billion
monthly active users. Twitter has 255 million monthly active users, and 500 million Tweets are sent per day. That generates a lot of
data about what current and potential customers are thinking and writing about your products.
72. Where to process data?
Now you have the data. We need computers to process it.
Local Machine - The benefit of storing data on a local machine is ease of access. The potential risks
include machine outages, data recovery, data theft (especially for laptops) and limited scalability. A
local machine is also much more expensive in terms of processing and storage, and becomes obsolete
within a relatively short period of time.
Server- Servers respond to requests across networks. They can be thought of as centralized resources
that help cut down the cost of processing and storage. They can be an intermediate solution between
local machines and clouds, though they require huge capital expenditure upfront. Not all data that can
fit on a laptop should be stored on a laptop. You can store data in virtual machines on your server
and connect through thin-shell clients with secure access.
Cloud- The cloud can be thought of as a highly scalable, metered service that handles requests from remote
networks. It can be thought of as a large bank of servers, but that is a simplistic definition. A
hindrance to cloud adoption is resistance within the existing IT department, whose members are not
trained to transition to and maintain the network on the cloud as they used to do for enterprise networks.
73. Cloud Computing Providers
We expand here on the cloud processing part.
Amazon EC2 - Amazon Elastic Compute Cloud (Amazon EC2) provides scalable processing power in the cloud. It has a web-based
management console and a command line tool, and offers resources for Linux and Windows virtual images. Further details are
available at http://aws.amazon.com/ec2/ . Amazon EC2 is generally considered the industry leader. For beginners, a 12-month
basic preview is available for free at http://aws.amazon.com/free/ that can allow practitioners to build up familiarity.
Google Compute- https://cloud.google.com/products/compute-engine/
Microsoft Azure - https://azure.microsoft.com/en-us/pricing/details/virtual-machines/ Azure Virtual Machines enable you to deploy
Windows Server, Linux, or third-party software images to Azure. You can select images from a gallery or bring your own
customized images. Virtual Machines are charged by the minute. Discounts can range from 20 to 32% depending on whether you
prepay for 6-month or 12-month plans, and based on usage tier.
IBM shut down its SmartCloud Enterprise cloud computing platform by Jan. 31, 2014 and migrated those customers to its
SoftLayer cloud computing platform, a company acquired by IBM: https://www.softlayer.com/virtual-servers
Oracle- Oracle's plans for the cloud are still in preview for enterprise customers at https://cloud.oracle.com/compute
74. Where to store data
Data needs to be stored in a secure and reliable environment for speedy and
repeated access. There is a cost of storing this data, and there is a cost of losing
the data due to some technical accident.
You can store data in the following ways:
csv files, spreadsheets and text files locally, especially for smaller files. Note that while
this increases ease of access, it also creates problems of version control as
well as security of confidential data.
relational databases (RDBMS) and data warehouses
Hadoop-based storage
75. Where to store data
NoSQL databases- are non-relational, distributed, open-source and horizontally
scalable. A complete list of NoSQL databases is at http://nosql-database.org/ .
Notable NoSQL databases are MongoDB, CouchDB and others.
key value store -Key-value stores use the map or dictionary as their fundamental data model. In
this model, data is represented as a collection of key-value pairs, such that each possible key
appears at most once in the collection
Redis -Redis is an open source, BSD licensed, advanced key-value store. It is often referred
to as a data structure server since keys can contain strings, hashes, lists, sets and
sorted sets (http://redis.io/).
Riak is an open source, distributed database. http://basho.com/riak/.
MemcacheDB is a persistence enabled variant of memcached,
column oriented databases
cloud storage
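The key-value model above can be illustrated with a minimal sketch. Python's dict has exactly the semantics described: a collection of key-value pairs in which each key appears at most once. The key naming convention shown is a hypothetical example, not part of any particular store's API.

```python
# Minimal sketch of the key-value data model: each possible key
# appears at most once in the collection, so a write to an existing
# key overwrites the earlier value rather than adding a second entry.

store = {}

def put(key, value):
    store[key] = value

def get(key, default=None):
    return store.get(key, default)

put("user:1001:name", "Alice")
put("user:1001:name", "Alice B.")   # overwrite, not a second pair

print(get("user:1001:name"))        # Alice B.
print(len(store))                   # 1
```

Real key-value stores such as Redis add persistence, richer value types and distribution on top of this basic model.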
76. Cloud Storage
Amazon- Amazon Simple Storage Services (S3)- Amazon S3 provides a simple web-services interface that can be used to store
and retrieve any amount of data, at any time, from anywhere on the web. http://aws.amazon.com/s3/. Cost is a maximum of 3
cents per GB per month. There are three types of storage: Standard Storage, Reduced Redundancy Storage and Glacier Storage.
Reduced Redundancy Storage (RRS) is a storage option within Amazon S3 that enables customers to reduce their costs by
storing non-critical, reproducible data at lower levels of redundancy than Amazon S3’s standard storage. Amazon Glacier stores
data for as little as $0.01 per gigabyte per month, and is optimized for data that is infrequently accessed and for which retrieval
times of 3 to 5 hours are suitable. These details can be seen at http://aws.amazon.com/s3/pricing/
Google - Google Cloud Storage https://cloud.google.com/products/cloud-storage/. It also has two kinds of storage. Durable
Reduced Availability Storage enables you to store data at lower cost, with the tradeoff of lower availability than standard Google
Cloud Storage. Prices are 2.6 cents per GB/month for Standard Storage and 2 cents per GB/month for Durable Reduced Availability
(DRA) Storage. They can be seen at https://developers.google.com/storage/pricing#storage-pricing
Azure - Microsoft has different terminology for its cloud infrastructure. Storage is classified into three types, with a fourth type (Files)
available as a preview. There are three levels of redundancy: Locally Redundant Storage (LRS), Geographically
Redundant Storage (GRS) and Read-Access Geographically Redundant Storage (RA-GRS). You can see details and prices at
https://azure.microsoft.com/en-us/pricing/details/storage/
Oracle - Oracle Storage is available at https://cloud.oracle.com/storage and costs around $30/TB per month
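The per-GB prices quoted above make cost comparison a one-line multiplication. The sketch below uses the slide's figures; actual cloud prices change frequently and vary by region, so treat the numbers as illustrative only, and the 500 GB data set size as a hypothetical.

```python
# Rough monthly-cost comparison using the per-GB prices quoted above.
# Prices are illustrative, taken from the slide, not current rate cards.
prices_per_gb_month = {
    "S3 Standard":    0.03,    # "maximum of 3 cents per GB per month"
    "Amazon Glacier": 0.01,
    "Google Standard": 0.026,
    "Google DRA":     0.02,
}

data_gb = 500  # hypothetical data set size

for tier, price in prices_per_gb_month.items():
    print(f"{tier}: ${data_gb * price:.2f} per month")
# S3 Standard: $15.00 per month, Amazon Glacier: $5.00 per month, ...
```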
77. Databases on the Cloud- Amazon
Amazon RDS -Managed MySQL, Oracle and SQL Server databases. http://aws.amazon.com/rds/ While relational
database engines provide robust features and functionality, scaling requires significant time and expertise.
DynamoDB - Managed NoSQL database service. http://aws.amazon.com/dynamodb/ Amazon DynamoDB focuses on
providing seamless scalability and fast, predictable performance. It runs on solid state disks (SSDs) for low-latency
response times, and there are no limits on the request capacity or storage size for a given table. This is because
Amazon DynamoDB automatically partitions your data and workload over a sufficient number of servers to meet the
scale requirements you provide.
Redshift - It is a managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently
analyze all your data using your existing business intelligence tools. You can start small for just $0.25 per hour and
scale to a petabyte or more for $1,000 per terabyte per year. http://aws.amazon.com/redshift/
SimpleDB - a highly available and flexible non-relational data store that offloads the work of database administration.
Developers simply store and query data items via web services requests http://aws.amazon.com/simpledb/. A table in
Amazon SimpleDB has a strict storage limitation of 10 GB and is limited in the request capacity it can achieve
(typically under 25 writes/second); it is up to you to manage the partitioning and re-partitioning of your data over
additional SimpleDB tables if you need additional scale. While SimpleDB has scaling limitations, it may be a good fit
for smaller workloads that require query flexibility. Amazon SimpleDB automatically indexes all item attributes and thus
supports query flexibility at the cost of performance and scale.
78. Databases on the Cloud - Others
Google
Google Cloud SQL - Relational Databases in Google's Cloud https://developers.google.com/cloud-sql/
Google Cloud Datastore - Managed NoSQL Data Storage Service
https://developers.google.com/datastore/
Google BigQuery - Enables you to write queries on huge datasets. BigQuery uses a columnar
data structure, which means that for a given query, you are only charged for data processed
in each column, not the entire table https://cloud.google.com/products/bigquery/
Azure SQL Database https://azure.microsoft.com/en-in/services/sql-database/ SQL Database is a
relational database service in the cloud based on the Microsoft SQL Server engine, with mission-
critical capabilities. Because it’s based on the SQL Server engine, SQL Database supports existing
SQL Server tools, libraries and APIs, which makes it easier for you to move and extend to the
cloud.
79. Basic Statistics
Some of the basic statistics that every data scientist should know are given here. This assumes rudimentary knowledge of
statistics (like measures of central tendency or variation) and basic familiarity with some of the terminology used by statisticians.
Random Sampling - In truly random sampling, the sample should be representative of the entire data. Random sampling remains
relevant in the era of Big Data and Cloud Computing.
Distributions - A data scientist should know the common distributions (normal, Poisson, chi-square, F) and also how to determine the
distribution of data.
Hypothesis Testing - Hypothesis testing is meant for statistically testing assumptions about measures of central tendency (mean,
median) or variation. A good example of easy-to-use software for statistical testing is the "Test" tab in the Rattle GUI in R.
Outliers - Checking for outliers is a good way for a data scientist to spot anomalies as well as assess data quality. The box plot
(exploratory data analysis) and the outlierTest function from the car package (Bonferroni Outlier Test) are ways to bring statistical
rigor to outlier detection.
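The box-plot approach mentioned above has a simple rule behind it: points further than 1.5 times the interquartile range beyond the quartiles are flagged. A minimal sketch (the Bonferroni outlier test is a different, model-based test and is not shown here):

```python
# Box-plot style outlier detection: flag points beyond 1.5 * IQR
# from the first and third quartiles.

def quartiles(xs):
    s = sorted(xs)
    def pct(p):
        # percentile by linear interpolation between order statistics
        k = (len(s) - 1) * p
        lo, hi = int(k), min(int(k) + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (k - lo)
    return pct(0.25), pct(0.75)

def iqr_outliers(xs):
    q1, q3 = quartiles(xs)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]

data = [10, 12, 11, 13, 12, 11, 95]   # 95 is an obvious anomaly
print(iqr_outliers(data))             # [95]
```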
80. Basic Techniques
Some of the basic techniques that a data scientist must know are listed as follows-
Text Mining - In text mining, text data is analyzed for frequencies, associations and correlations for predictive purposes. The tm
package in R greatly helps with text mining.
Sentiment Analysis - In sentiment analysis, text data is classified against a sentiment lexicon (e.g. one that scores "happy" as less
positive than "delighted" but more positive than "sad") to create sentiment scores for the text data mined.
Social Network Analysis - In social network analysis, the direction of relationships, the volume of messages, and the study of
nodes, edges and graphs give insights.
Time Series Forecasting - Data is said to be autoregressive with respect to time if a future value depends on a current value of
a variable. Techniques such as ARIMA and exponential smoothing, and R packages like forecast, greatly assist in time series
forecasting.
Web Analytics
Social Media Analytics
Data Mining or Machine Learning
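The frequency-counting step at the heart of text mining can be sketched in a few lines. The two example documents are made up; the R tm package builds a document-term matrix in a similar spirit.

```python
# Term-frequency counting: the basic building block of text mining.
from collections import Counter
import re

docs = [
    "the claim was settled quickly",
    "the claim was denied and the customer complained",
]

def tokenize(text):
    # lowercase and keep alphabetic tokens only
    return re.findall(r"[a-z]+", text.lower())

freq = Counter()
for doc in docs:
    freq.update(tokenize(doc))

print(freq.most_common(3))   # [('the', 3), ('claim', 2), ('was', 2)]
```

Real pipelines add stop-word removal, stemming and weighting (e.g. tf-idf) on top of these raw counts.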
81. Data Science Tools
- R
- Python
- Tableau
- Spark with ML
- Hadoop (Pig and Hive)
- SAS
- SQL
82. R
R provides a wide variety of statistical (linear and nonlinear modelling, classical
statistical tests, time-series analysis, classification, clustering, …) and graphical
techniques, and is highly extensible.
R is an integrated suite of software facilities for data manipulation, calculation and
graphical display. It includes an effective data handling and storage facility, a suite
of operators for calculations on arrays, in particular matrices, a large, coherent,
integrated collection of intermediate tools for data analysis, graphical facilities for
data analysis and display either on-screen or on hardcopy, and a well-developed,
simple and effective programming language.
https://www.r-project.org/about.html
85. Big Data: Hadoop Stack with Spark
http://spark.apache.org/ Apache Spark™ is a fast and general engine for large-scale data processing.
86. Big Data: Hadoop Stack with Mahout
https://mahout.apache.org/
The Apache Mahout™ project's goal is to build an environment for quickly creating
scalable performant machine learning applications.
Apache Mahout Samsara Environment includes
Distributed Algebraic optimizer
R-Like DSL Scala API
Linear algebra operations
Ops are extensions to Scala
IScala REPL based interactive shell
Integrates with compatible libraries like MLLib
Runs on distributed Spark, H2O, and Flink
Apache Mahout Samsara Algorithms included
Stochastic Singular Value Decomposition (ssvd, dssvd)
Stochastic Principal Component Analysis (spca, dspca)
87. Big Data: Hadoop Stack with Mahout
https://mahout.apache.org/
Apache Mahout software provides three major features:
A simple and extensible programming environment and framework for building scalable algorithms
A wide variety of premade algorithms for Scala + Apache Spark, H2O, Apache Flink
Samsara, a vector math experimentation environment with R-like syntax which works at scale
88. Data Science Techniques
- Machine Learning
- Regression
- Logistic Regression
- K Means Clustering
- Association Analysis
- Decision Trees
- Text Mining
89. What is an algorithm
a process or set of rules to be followed in calculations or other problem-
solving operations, especially by a computer.
a self-contained step-by-step set of operations to be performed
a procedure or formula for solving a problem, based on conducting a
sequence of specified actions
a procedure for solving a mathematical problem (as of finding the greatest
common divisor) in a finite number of steps that frequently involves
repetition of an operation; broadly : a step-by-step procedure for solving a
problem or accomplishing some end especially by a computer.
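The GCD example mentioned above is the classic illustration: Euclid's algorithm is a finite, step-by-step procedure that repeats a single operation (taking a remainder) until it terminates.

```python
# Euclid's algorithm for the greatest common divisor: a step-by-step
# procedure that repeats one operation a finite number of times.

def gcd(a, b):
    while b != 0:
        a, b = b, a % b   # replace (a, b) with (b, remainder)
    return a

print(gcd(48, 36))   # 12
```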
90. Machine Learning
Machine learning concerns the construction and study of systems that can learn from data. For example, a machine learning
system could be trained on email messages to learn to distinguish between spam and non-spam messages
Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a
set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a
desired output value (also called the supervisory signal).
In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a
training set of correctly identified observations is available.
In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data. Since the
examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. This distinguishes
unsupervised learning from supervised learning
The corresponding unsupervised procedure is known as clustering or cluster analysis, and involves grouping data into categories
based on some measure of inherent similarity (e.g. the distance between instances, considered as vectors in a multi-dimensional
vector space).
93. Classification
In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a
new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership
is known.
The individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables, features,
etc.
These properties may variously be categorical (e.g. "A", "B", "AB" or "O", for blood type),
ordinal (e.g. "large", "medium" or "small"),
integer-valued (e.g. the number of occurrences of a particular word in an email) or
real-valued (e.g. a measurement of blood pressure).
Some algorithms work only in terms of discrete data and require that real-valued or integer-valued data be discretized into groups
(e.g. less than 5, between 5 and 10, or greater than 10).
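The discretization step described above, using the exact groups from the slide (less than 5, between 5 and 10, greater than 10), can be sketched as:

```python
# Discretizing a real-valued feature into the discrete groups
# named above, for algorithms that require categorical input.

def discretize(x):
    if x < 5:
        return "less than 5"
    elif x <= 10:
        return "between 5 and 10"
    else:
        return "greater than 10"

print([discretize(v) for v in [3.2, 5.0, 7.9, 12.4]])
# ['less than 5', 'between 5 and 10', 'between 5 and 10', 'greater than 10']
```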
94. Regression
Regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for
modeling and analyzing several variables, when the focus is on the relationship between
a dependent variable and one or more independent variables.
More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable')
changes when any one of the independent variables is varied, while the other independent variables are held fixed.
Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent
variables – that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the
focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent
variables.
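Estimating the conditional expectation of the dependent variable, in the simplest one-variable case, is ordinary least squares and has a closed form. A minimal sketch with made-up data (roughly y = 2x):

```python
# One-variable ordinary least squares: estimate E[y | x] as a line.

def ols(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return intercept, slope

xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]   # roughly y = 2x, with noise
b0, b1 = ols(xs, ys)
print(round(b1, 2))               # 1.99, close to the true slope of 2
```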
97. Association Rules
http://en.wikipedia.org/wiki/Association_rule_learning
Based on the concept of strong rules, Rakesh Agrawal et al. introduced association rules for discovering regularities between
products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets.
For example, the rule {onions, potatoes} ⇒ {burger} found in the sales data of a supermarket would indicate that if a customer buys
onions and potatoes together, he or she is likely to also buy hamburger meat. Such information can be used as the basis for
decisions about marketing activities such as, e.g., promotional pricing or product placements.
In addition to the above example from market basket analysis association rules are employed today in many application areas
including Web usage mining, intrusion detection, Continuous production, and bioinformatics. As opposed to sequence mining,
association rule learning typically does not consider the order of items either within a transaction or across transactions
Concepts - Support, Confidence, Lift
In R
apriori() in arules package
In Python
http://orange.biolab.si/docs/latest/reference/rst/Orange.associate/
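The three concepts named above can be computed by hand on a toy basket of transactions (the transactions below are invented; apriori() in the arules package does this at scale):

```python
# Support, confidence and lift for the rule {onions, potatoes} -> {burger}
# on a toy set of market-basket transactions.

transactions = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes"},
    {"potatoes", "milk"},
    {"onions", "potatoes", "burger", "milk"},
]

def support(itemset):
    # fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"onions", "potatoes"}, {"burger"}
sup = support(antecedent | consequent)          # 0.5
conf = sup / support(antecedent)                # 0.667
lift = conf / support(consequent)               # 1.333 (> 1: positive association)
print(round(sup, 3), round(conf, 3), round(lift, 3))
```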
98. Gradient Descent
Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent,
one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.
http://econometricsense.blogspot.in/2011/11/gradient-descent-in-r.html
Start at some x value, use derivative at that value to tell
us which way to move, and repeat. Gradient descent.
http://www.cs.colostate.edu/%7Eanderson/cs545/Lectures/week6day2/week6day2.pdf
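The "start at some x, follow the negative derivative, repeat" loop described above can be shown on a simple function whose minimum is known:

```python
# Gradient descent on f(x) = (x - 3)^2, which has its minimum at x = 3.

def grad(x):
    return 2 * (x - 3)        # derivative of (x - 3)^2

x, learning_rate = 0.0, 0.1
for _ in range(100):
    x -= learning_rate * grad(x)   # step against the gradient

print(round(x, 4))            # 3.0
```

Too large a learning rate makes the iteration overshoot and diverge; too small a rate makes convergence slow. The linked lecture notes discuss these trade-offs.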
102. Random Forest
Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down each of
the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the
classification having the most votes (over all the trees in the forest).
Each tree is grown as follows:
1.If the number of cases in the training set is N, sample N cases at random - but with replacement, from the original data. This
sample will be the training set for growing the tree.
2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out
of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
3. Each tree is grown to the largest extent possible. There is no pruning.
In the original paper on random forests, it was shown that the forest error rate depends on two things:
The correlation between any two trees in the forest. Increasing the correlation increases the forest error rate.
The strength of each individual tree in the forest. A tree with a low error rate is a strong classifier. Increasing the strength of the
individual trees decreases the forest error rate.
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#intro
103. Bagging
Bagging, aka bootstrap aggregation, is a relatively simple way to increase the
power of a predictive statistical model by taking multiple random samples(with
replacement) from your training data set, and using each of these samples to
construct a separate model and separate predictions for your test set. These
predictions are then averaged to create a, hopefully more accurate, final
prediction value.
http://www.vikparuchuri.com/blog/build-your-own-bagging-function-in-r/
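The bagging recipe above (bootstrap samples, one model per sample, averaged predictions) can be sketched end to end. The training data and the deliberately weak base model here are illustrative, not from the linked post:

```python
# Bagging: bootstrap samples of the training set, a simple model per
# sample, and predictions averaged across models.

import random

random.seed(42)

# made-up training data: y is roughly 2x plus noise
train = [(x, 2 * x + random.gauss(0, 1)) for x in range(20)]

def fit_slope(sample):
    # weak base model: best-fit slope through the origin
    sx2 = sum(x * x for x, _ in sample)
    sxy = sum(x * y for x, y in sample)
    return sxy / sx2

def bagged_predict(x_new, n_models=50):
    preds = []
    for _ in range(n_models):
        boot = [random.choice(train) for _ in train]   # sample with replacement
        preds.append(fit_slope(boot) * x_new)
    return sum(preds) / len(preds)                     # average the predictions

print(round(bagged_predict(10), 1))   # close to 20, the true value of 2 * 10
```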
104. Boosting
Boosting is one of several classic methods for creating ensemble models,
along with bagging, random forests, and so forth. Boosting means that each
tree is dependent on prior trees, and learns by fitting the residual of the trees
that preceded it. Thus, boosting in a decision tree ensemble tends to improve
accuracy with some small risk of less coverage.
XGBoost is a library designed and optimized for boosting trees algorithms.
XGBoost is used in more than half of the winning solutions in machine learning
challenges hosted at Kaggle.
http://xgboost.readthedocs.io/en/latest/model.html#
And http://dmlc.ml/rstats/2016/03/10/xgboost.html
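The "each tree fits the residuals of the trees before it" idea above can be shown in miniature. The weak learner here is a single threshold stump and the data are made up; this is a bare-bones illustration of residual fitting, not XGBoost's actual algorithm (which adds regularization, shrinkage schedules and much more):

```python
# Boosting in miniature: each new weak learner (a one-split "stump")
# is fit to the residuals left by the ensemble so far.

def fit_stump(xs, rs):
    # choose the split whose two side-means best fit the residuals
    best = None
    for split in xs:
        left  = [r for x, r in zip(xs, rs) if x <= split]
        right = [r for x, r in zip(xs, rs) if x > split]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - ml) ** 2 for r in left)
               + sum((r - mr) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, split, ml, mr)
    _, split, ml, mr = best
    return lambda x: ml if x <= split else mr

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, 1, 5, 5, 5]           # a step function to learn

stumps, lr = [], 0.5              # lr is the shrinkage / learning rate
residuals = list(ys)
for _ in range(20):
    stump = fit_stump(xs, residuals)
    stumps.append(stump)
    residuals = [r - lr * stump(x) for x, r in zip(xs, residuals)]

predict = lambda x: sum(lr * s(x) for s in stumps)
print([round(predict(x), 2) for x in xs])   # [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
```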
105. Data Science Process
By Farcaster at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=40129394
106. LTV Analytics
Life Time Value (LTV) will help us answer 3
fundamental questions:
1. Did you pay enough to acquire
customers from each marketing
channel?
2. Did you acquire the best kind of
customers?
3. How much could you spend on
keeping them sweet with email and
social media?
107. LTV Analytics :Case Study
https://blog.kissmetrics.com/how-to-calculate-lifetime-value/
113. Pareto principle
The Pareto principle (also known as the 80–20 rule, the law of the vital few, and the principle of factor sparsity)
states that, for many events, roughly 80% of the effects come from 20% of the causes
80% of a company's profits come from 20% of its customers
80% of a company's complaints come from 20% of its customers
80% of a company's profits come from 20% of the time its staff spend
80% of a company's sales come from 20% of its products
80% of a company's sales are made by 20% of its sales staff
Several criminology studies have found 80% of crimes are committed by 20% of criminals.
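Whether a data set actually follows the 80-20 pattern is easy to check: sort by contribution and count how many units it takes to reach 80% of the total. The per-customer profit figures below are made up for illustration:

```python
# How many customers account for 80% of total profit?
profits = [1000, 900, 50, 40, 35, 30, 25, 20, 15, 10]   # hypothetical figures

total = sum(profits)
running, customers = 0, 0
for p in sorted(profits, reverse=True):
    running += p
    customers += 1
    if running >= 0.8 * total:
        break

print(f"{customers} of {len(profits)} customers -> 80% of profit")
# 2 of 10 customers -> 80% of profit
```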
114. RFM Analysis
RFM is a method used for analyzing customer value.
Recency - How recently did the customer purchase?
Frequency - How often do they purchase?
Monetary Value - How much do they spend?
One scoring method
Recency = 10 - the number of months that have passed since the customer last purchased
Frequency = number of purchases in the last 12 months (maximum of 10)
Monetary = value of the highest order from a given customer (benchmarked against $10k)
Alternatively, one can create categories for each attribute. For instance, the Recency attribute might be broken into three
categories: customers with purchases within the last 90 days; between 91 and 365 days; and longer than 365 days. Such
categories may be arrived at by applying business rules, or using a data mining technique, to find meaningful breaks.
A commonly used shortcut is to use deciles. One is advised to look at distribution of data before choosing breaks.
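The scoring method described above can be sketched directly. The exact rule for benchmarking the monetary value against $10k is not spelled out on the slide, so the linear 0-10 scaling below is an assumption:

```python
# RFM scoring as described above:
#   Recency   = 10 - months since last purchase (floored at 0, an assumption)
#   Frequency = purchases in the last 12 months, capped at 10
#   Monetary  = highest order benchmarked against $10k
#               (scaled linearly to 0-10 here; the exact rule is assumed)

def rfm_score(months_since_last, purchases_12m, highest_order):
    recency   = max(0, 10 - months_since_last)
    frequency = min(purchases_12m, 10)
    monetary  = min(highest_order / 10_000, 1) * 10
    return recency, frequency, monetary

print(rfm_score(months_since_last=2, purchases_12m=14, highest_order=2_500))
# (8, 10, 2.5)
```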