Analytics Talk
By Ajay Ohri at Allianz
Trivandrum
9 October 2016
Analytics Session
Introduction to Big Data, Cloud
Computing, Data Science and How They
Affect You
Agenda
Big Data - definition and explanation
Cloud Computing
Data Science
Business Strategy Models
Case Studies in Insurance
Big Data
What is Big Data?
"Big data" is a term applied to data sets whose size is beyond the ability of
commonly used software tools to capture, manage, and process the data within a
tolerable elapsed time.
Examples include web logs, RFID, sensor networks, social networks, social data
(due to the social data revolution), Internet text and documents, Internet search
indexing, call detail records, astronomy, atmospheric science, genomics,
biogeochemical, biological, and other complex and often interdisciplinary scientific
research, military surveillance, medical records, photography archives, video
archives, and large-scale e-commerce.
Big Data
What is Big Data?
"extremely large data sets that may be analysed computationally to reveal
patterns, trends, and associations, especially relating to human behaviour and
interactions.
1. "much IT investment is going towards managing and maintaining big data"
https://en.wikipedia.org/wiki/Big_data
Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal
with them. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying,
updating and information privacy.
Big Data: Statistics
IBM- http://www-01.ibm.com/software/data/bigdata/
Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data
in the world today has been created in the last two years alone. This data comes
from everywhere: sensors used to gather climate information, posts to social
media sites, digital pictures and videos, purchase transaction records, and cell
phone GPS signals to name a few. This data is big data.
Big Data: Moving Fast
IBM- https://www.ibm.com/big-data/us/en/
Big data is being generated by everything around us at all times. Every digital
process and social media exchange produces it. Systems, sensors and mobile
devices transmit it. Big data is arriving from multiple sources at an alarming
velocity, volume and variety. To extract meaningful value from big data, you need
optimal processing power, analytics capabilities and skills.
4V of BIG DATA: VOLUME, VELOCITY, VARIETY, VERACITY
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
VALUE
http://www.ibmbigdatahub.com/infographic/extracting-business-value-4-vs-big-data
Veracity and Variety; Volume and Velocity
http://www.ibmbigdatahub.com/infographic/extracting-business-value-4-vs-big-data
Example
Source- https://www.renesas.com/en-sg/about/web-magazine/edge/global/13-big-data.html
Who uses Big Data
http://www.sas.com/en_us/insights/big-data/what-is-big-data.html
Banking
With large amounts of information streaming in from countless sources, banks are faced with finding new and innovative ways to
manage big data. While it’s important to understand customers and boost their satisfaction, it’s equally important to minimize risk and
fraud while maintaining regulatory compliance. Big data brings big insights, but it also requires financial institutions to stay one step
ahead of the game with advanced analytics.
Education
Educators armed with data-driven insight can make a significant impact on school systems, students and curriculums. By analyzing big
data, they can identify at-risk students, make sure students are making adequate progress, and can implement a better system for
evaluation and support of teachers and principals.
Government
When government agencies are able to harness and apply analytics to their big data, they gain significant ground when it comes to
managing utilities, running agencies, dealing with traffic congestion or preventing crime. But while there are many advantages to big
data, governments must also address issues of transparency and privacy.
Who uses Big Data
http://www.sas.com/en_us/insights/big-data/what-is-big-data.html
Health Care
Patient records. Treatment plans. Prescription information. When it comes to health care, everything needs to be done quickly,
accurately – and, in some cases, with enough transparency to satisfy stringent industry regulations. When big data is managed
effectively, health care providers can uncover hidden insights that improve patient care.
Manufacturing
Armed with insight that big data can provide, manufacturers can boost quality and output while minimizing waste – processes that are
key in today’s highly competitive market. More and more manufacturers are working in an analytics-based culture, which means they
can solve problems faster and make more agile business decisions.
Retail
Customer relationship building is critical to the retail industry – and the best way to manage that is to manage big data. Retailers need
to know the best way to market to customers, the most effective way to handle transactions, and the most strategic way to bring back
lapsed business. Big data remains at the heart of all those things.
Big Data: Hadoop Stack
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of
computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering
local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and
handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be
prone to failures.
The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets. (A word-count sketch follows below.)
http://hadoop.apache.org/
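To make the MapReduce model concrete, here is a minimal word-count sketch for Hadoop Streaming, which lets you write the
mapper and reducer as ordinary Python scripts. The file names and the path to the streaming jar are assumptions for
illustration, not part of the Hadoop documentation quoted above.

# mapper.py - emit (word, 1) for each word read from stdin
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

# reducer.py - sum the counts per word (Hadoop sorts mapper output by key)
import sys
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

A typical (assumed) invocation would be:
hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out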
Big Data: Hadoop Stack
Hadoop-related projects at Apache include:
Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for
Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a
dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually,
along with features to diagnose their performance characteristics in a user-friendly manner.
Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel computation.
Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that
supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to
execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™,
Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace
Hadoop™ MapReduce as the underlying execution engine.
ZooKeeper™: A high-performance coordination service for distributed applications.
Big Data: Hadoop Stack (architecture diagrams)
NoSQL
A NoSQL (Not-only-SQL) database is one that has been designed to store,
distribute and access data using methods that differ from relational databases
(RDBMS’s). NoSQL technology was originally created and used by Internet
leaders such as Facebook, Google, Amazon, and others who required database
management systems that could write and read data anywhere in the world, while
scaling and delivering performance across massive data sets and millions of
users.
NoSQL
https://www.datastax.com/nosql-databases
How NoSQL Databases Differ From Each Other
https://www.datastax.com/nosql-databases
There are a variety of different NoSQL databases on the market with the key differentiators between them
being the following:
Architecture: Some NoSQL databases like MongoDB are architected in a master/slave model in somewhat
the same way as many RDBMS’s. Others (like Cassandra) are designed in a ‘masterless’ fashion where all
nodes in a database cluster are the same. The architecture of a NoSQL database greatly impacts how well
the database supports requirements such as constant uptime, multi-geography data replication, predictable
performance, and more.
Data Model: NoSQL databases are often classified by the data model they support. Some support a wide-row tabular store,
while others sport a model that is either document-oriented, key-value, or graph.
Data Distribution Model: Because of their architecture differences, NoSQL databases differ on how they
support the reading, writing, and distribution of data. Some NoSQL platforms like Cassandra support writes
and reads on every node in a cluster and can replicate / synchronize data between many data centers and
cloud providers.
Development Model: NoSQL databases differ in their development APIs, with some supporting SQL-like
languages (e.g. Cassandra's CQL).
Big Data Strategy
Cloud Computing
Cloud computing is a model for enabling ubiquitous,
convenient, on-demand network access to a shared pool of
configurable computing resources (e.g., networks, servers,
storage, applications, and services) that can be rapidly
provisioned and released with minimal management effort or
service provider interaction. This cloud model is composed of
five essential characteristics, three service models, and four
deployment models.
http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf
--National Institute of Standards and Technology
Cloud Computing: Types
five essential characteristics
1. On demand self service
2. Broad Network Access
3. Resource Pooling
4. Rapid Elasticity
5. Measured Service
Cloud Computing
1. the practice of using a network of remote servers hosted on the Internet to store, manage, and process data, rather than a
local server or a personal computer.
http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf
Cloud Computing: Types
three service models (SaaS, PaaS and IaaS)
Cloud Computing: Types
four deployment models (private, public, community and hybrid).
Key enabling technologies include:
1. fast networks,
2. inexpensive computers, and
3. virtualization for commodity hardware.
Cloud Computing: Types
major barriers to broader cloud adoption are
security, interoperability, and portability
To explain it to a layperson in simple, short terms: cloud computing is a lot of
scalable, customizable computing power available for rent by the hour and accessible
remotely. It can help you do more computing at a fraction of the cost.
Data Driven Decision Making
- using data and trending historical data
- validating assumptions if any
- using champion challenger to test scenarios (see the sketch after this list)
- using experiments
- use baselines
- continuous improvement
- customer experiences
- costs
- revenues
If you can't measure it, you can't manage it -Peter Drucker
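As a hedged illustration of the champion/challenger idea above: treat the incumbent approach as the champion and the new
one as the challenger, then test whether the difference in conversion rates is statistically significant. The counts below are
invented for the sketch.

# Champion/challenger comparison via a two-proportion z-test
# (illustrative, made-up numbers)
from math import sqrt
from scipy.stats import norm

conv_a, n_a = 180, 2000   # champion: conversions, customers contacted
conv_b, n_b = 215, 2000   # challenger

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))   # two-sided test
print(z, p_value)   # promote the challenger only if the difference is significant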
BCG Matrix for Product Lines
The BCG Matrix is best used to analyze your own or a target organization's product portfolio; it is applicable for companies
with multiple products. It helps corporations analyze their business units
or product lines, and thus allocate resources.
Porter’s 5 Forces Model for Industries
It draws upon industrial organization (IO) economics
to derive five forces that determine the competitive intensity
and therefore attractiveness of a market.
Attractiveness in this context refers to the overall industry
profitability. An “unattractive” industry is one in which
the combination of these five forces acts to drive down
overall profitability. A very unattractive industry would be
one approaching “pure competition”, in which available
profits for all firms are driven to normal profit.
Porter’s Diamond Model
An economic model developed by Michael Porter in his book The Competitive Advantage of Nations, where he
published his theory of why particular industries become competitive in particular locations.
McKinsey 7S Framework
To check which teams work and which teams don't (within an organization), use this framework from the famous
consulting company. It offers a strategic vision for groups, including businesses, business units, and teams. The 7S are
structure, strategy, systems, skills, style, staff and shared values. The model is most often used as a tool to assess
and monitor changes in the internal situation of an organization.
Greiner Model for Organizational Growth
Developed by Larry E. Greiner, it is helpful when
examining the problems associated with the growth of
organizations and the impact of change on employees.
It can be argued that growing organizations move
through five relatively calm periods of evolution, each
of which ends with a period of crisis and revolution.
Each evolutionary period is characterized by the
dominant management style used to achieve
growth, while
each revolutionary period is characterized by the
dominant management problem that must be
solved before growth can continue.
Marketing Model
The 4P and 4C models help you identify the marketing mix
Products Price Promotion Place
Consumers Cost Communication Convenience
Business Canvas Model
The Business Model Canvas is a strategic management template for developing new or documenting existing
business models. It is a visual chart with elements describing a firm’s value proposition, infrastructure, customers,
and finances. It assists firms in aligning their activities by illustrating potential trade-offs.
Motivation Models
Herzberg motivation-hygiene theory
job satisfaction and job dissatisfaction act independently of each other
Leading to satisfaction
Achievement
Recognition
Work itself
Responsibility
Advancement
Leading to dissatisfaction
Company policy
Supervision
Relationship with boss
Work conditions
Salary
Relationship with peers
Motivation Models
Maslow Hierarchy of Needs
Business Strategy Models
http://decisionstats.com/2013/12/19/business-strategy-models/
1. Porter's 5 Forces Model- To analyze industries
2. Business Canvas
3. BCG Matrix- To analyze product portfolios
4. Porter's Diamond Model- To analyze locations
5. McKinsey 7S Model- To analyze teams
6. Greiner Model- To analyze the growth of organizations
7. Herzberg Hygiene Theory- To analyze soft aspects of individuals
Data Science
What is a data scientist? A data
scientist is one who has
interdisciplinary skills in
programming, statistics and
business domains, and who creates
actionable insights based on
experiments or summaries from
data.
Data Science
On a daily basis, a data scientist is simply a person
who can write some code
in one or more of the languages of R, Python, Java, SQL, Hadoop (Pig, HQL, MR)
for
data storage, querying, summarization, visualization efficiently, and in time
on
databases, on cloud, servers and understand enough statistics to derive insights from data
so business can make decisions
What should a data scientist know? He should know how to get data, store
it, query it, manage it, and turn it into actionable insights.
Big Data Social Media Analysis
https://rdatamining.wordpress.com/2012/05/17/an-example-of-social-network-analysis-with-r-using-package-igraph/
Social Network Analysis
How does information propagate through a
social network?
http://www.r-bloggers.com/information-transmission-in-a-social-network-dissecting-the-spread-of-a-quora-post/
Fraud Analysis
anomaly detection (also outlier detection) is the identification of items, events or observations which do not conform to an
expected pattern or other items in a dataset.
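A minimal sketch of one simple anomaly-detection rule, the interquartile-range fence used in box plots; the transaction
amounts below are invented.

# Flag anomalies with the interquartile-range (IQR) rule
import numpy as np

amounts = np.array([120, 95, 130, 110, 105, 98, 4500, 115, 102, 99])
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = amounts[(amounts < lower) | (amounts > upper)]
print(outliers)   # the 4500 transaction does not conform to the expected pattern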
How they affect you: Financial Profitability
Data Storage is getting cheaper but the way it is stored is changing ( from
company servers to external cloud)
Big Data helps to store every interaction, transaction, with customer but this also
increases complexity of data
Data Science is getting cheaper (open source) but more skilled professionals in
analytics are required
How they affect you: Sales and Marketing
Which customers to target and who not to target ( traditional propensity models)
Where to target ( geocoded)
When to target
Forecast Demand
How they affect you: Operations
Optimize cost and logistics
Maximize output per resource
Can also be combined with IoT
How they affect you: Human Resources
Which employee is likely to leave first
Which skill is most likely to be crucial in the next 12-24 months
Forecasts for skills and employees
Insurance Examples
http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-progressive-insurance-35951-1.html
Agents increasingly want mobile enablement, and not just the
ability to quote, but to bind and sell policies on smartphones and
tablets. -Progressive
Progressive Snapshot
https://www.progressive.com/auto/snapshot/
To participate you attach the Snapshot device to the computer in
your car, which collects data about your driving habits. According
to Progressive, the device records your vehicle identification
number (VIN), how many miles you drive each day and how often
you drive between midnight and 4 a.m.
After driving with Snapshot for 30 days, you return it to Progressive
and, depending on your driving habits, the company says you can
get a discount up to 30%
Insurance Examples
Mass Mutual http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-massmutual-35952-1.html
Created Haven Life, an online insurance agency that uses an algorithmic underwriting tool and a series of related decisions
that were created in collaboration with a team of data scientists.
Insurance companies are vast decision-making engines that take and manage risk. The inputs into this engine are data, and
the capabilities created by the field of data science can and will impact every process in the company — from underwriting
to claims management to security.
Insurance Examples
CNA is applying big data technology to workers compensation claims and adjusters’ notes.
“That is a classic, unstructured big data kind of problem,” says Nate Root, SVP of CNA’s shared service organization. “We
have hundreds of thousands of workers compensation claims, and claims adjuster notes, and there is tremendous value in
those notes.”
Root says the insurer recently began identifying workers’ compensation claims that have the potential to turn into a total
disability, or partial permanent disability, without the right sort of attention. By examining the unstructured data, CNA has
developed a hundred different variables that can predict a propensity for a claim to become serious, and then assign a
nurse case manager to help the insured get necessary treatments for a better patient outcome, get them back to work and
lower the overall cost of coverage. For example, the program can find people who are missing appointments or who are not
engaged with physical therapy and should be.
http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-cna-35959-1.html
Insurance Examples
American Family Insurance licensed APT’s Test & Learn software
(http://www.predictivetechnologies.com/products/test-learn.aspx ) to enhance
customer engagement and increase support for agents. “This is a statistical tool
that enables us to create and analyze statistical tests,”
For example, call-routing techniques affect wait times and, ultimately claims
satisfaction. The insurer also tracks how claims are handled, and by whom, and
whether agents are involved in resolution. Using APT, the insurer can isolate
variables and accurately determine the success of one design vs. another for
various products, geographies or demographics.
http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-american-family-insurance-35953-1.html
Insurance Examples
http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-american-family-insurance-35953-1.html .
American Family Insurance: Unstructured data, such as that collected in call center transcripts, also can be studied to
better understand what approaches are best for different situations, he says. “Hadoop and other tools enable
natural-language processing and sentiment analysis,” Cruz says. “We can look for key words or patterns in those words, do counts
and build models off textual indicators that enable us to identify three things:
1. when there could be fraud involved,
2. where there might be severity issues,
3. or how we can get ahead of that and plan for it,”
Customer communication, web design and direct mail are other areas where the insurer is, or soon will be, using APT:
1. Do we see greater lift in these geographies vs. those? Or, ...
Insurance Examples
Like MassMutual, Nationwide has partnered with a local college — Ohio State University, the university with the
third-largest enrollment in the country. The Nationwide Center for Advanced Customer Insights (NCACI) gives OSU students in
advanced degree programs the ability to work with real-world data to solve some of the biggest insurance business
problems. Faculty and students from the marketing, statistics, psychology, economics and computer science departments
work with Nationwide to develop predictive models and data mining techniques aimed at improving
1. marketing and distribution,
2. identifying consumer behavior patterns, and
3. increasing customer satisfaction and
4. lifetime value.
Insurance Examples
John Hancock
his team set out to find a way to leverage the wealth of data collected by wearable technologies, including the popular FitBit
and recently released Apple Watch, to give something back to their customers. The end result was John Hancock Vitality, a
new life insurance product that offers up to a 15 percent premium discount to customers who track their healthy habits with
wearables and turn that information over to the insurance company. New buyers even get their own FitBit to begin tracking.
http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-john-hancock-35954-1.html
Fitbit Inc. is an American company known for its products of the same name, which are activity trackers,
wireless-enabled wearable technology devices that measure data such as the number of steps walked,
heart rate, quality of sleep, steps climbed, and other personal metrics.
Insurance Examples
Swiss Re is using more public data to improve underwriting results and decrease the number of questions the insurer has
to ask consumers to underwrite them. Swiss Re is looking at big data in terms of two major streams. In the first, big data is
being used to help reduce costs and improve the efficiency of current processes throughout the insurance value chain,
including claims and fraud management, cyber risk, customer management, pricing, risk assessment and selection,
distribution and service management, product innovation, and research and development.
In the second stream, big data also offers a new framework to think bigger in terms of market disruption. Swiss Re has
created more than 100 prototypes internally, and as a result the entire organization sees the value and importance of
big data and smart analytics.
http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-swiss-re-35957-1.html
Insurance Examples
Allstate: “How do you take that operationally efficient data and turn it into a customer/household view and understand all the
products attached to a person?”
Allstate has focused heavily on master data management and data governance, creating party and household IDs for
data. The company is also building a team to work across business areas on analytics projects rather than siloing big data
projects within certain units.
“Something meant for a single purpose often leads to other insights. We know, for example, based on some call-volume
analysis in our call center, how often customers defect. We have an application in claims, QuickFoto, where a policyholder
that isn’t in a major accident can snap a picture of the damage and send it to us. But whereas in the past, that would’ve
gone into a physical folder and then a filing cabinet, now I have all those pictures of cars in a database, and there’s a lot
more that I can do.”
Questions?
Data Science Tools and Techniques for
extracting maximum value from
Customer Data and Interactions
Agenda
Data Science Approach
Data Science Tools
Data Science Techniques
Data Science Approach
On a daily basis, a data scientist is simply a person
who can write some code
in one or more of the languages of R, Python, Java, SQL, Hadoop (Pig, HQL, MR)
for
data storage, querying, summarization, visualization efficiently, and in time
on
databases, on cloud, servers and understand enough statistics to derive insights from data so
business can make decisions
Data Science Approach
What should a data scientist know? He should know how to get data, store it,
query it, manage it, and turn it into actionable insights. The following approach
elaborates on this simple and sequential premise.
Where to get Data
A data scientist needs data to do science on, right! Some of the usual sources of data for a data scientist are-
APIs- API is an acronym for Application Programming Interface. We cover APIs in detail in Chapter 6. APIs are how the current big data
paradigm is enabled, as they let machines talk to and fetch data from each other programmatically. For a list of articles written by the
same author on APIs, see https://www.programmableweb.com/profile/ajayohri.
Internet Clickstream Logs- Internet clickstream logs refer to the data generated by humans when they click specific links within a
webpage. This data is time stamped, and the uniqueness of the person clicking the link can be established by IP address. IP
addresses can be parsed by registries like https://www.arin.net/whois or http://www.apnic.net/whois for examining location (country and
city), internet service provider and owner of the address (for website owners this can be done using the website http://who.is/). The
command ipconfig in Windows and ifconfig in Linux systems can help us examine IP addresses. You can read more on IP addresses at
http://en.wikipedia.org/wiki/IP_address. Software like Clicky (http://getclicky.com) and Google
Analytics (www.google.com/analytics) also give us data, which can then be parsed using their APIs. (See
https://code.google.com/p/r-google-analytics/ for Google Analytics using R.)
Machine Generated Data- Machines generate a lot of data, especially from sensors that ensure the machine is working properly. This
data can be logged and used with events like cracks or failures for predictive asset maintenance or M2M (Machine to Machine)
analytics.
Where to get Data
Surveys- Surveys are mostly questionnaires filled in by humans. They used to be administered manually over paper, but online surveys are
now the definitive trend. Surveys reveal valuable data about the preferences of current and potential customers. They do suffer
from the bias inherent in the design of questions by the creator. Since customer preferences evolve, surveys help in getting primary data
about current preferences. Coupled with stratified random sampling, they can be a powerful method for collecting data. SurveyMonkey
is one such company that helps create online questionnaires (https://www.surveymonkey.com/pricing/).
Commercial Databases- Commercial databases are proprietary databases that have been collected over time and are sold/rented
by vendors. They can be used for prospect calling, appending information to an existing database, and refining internal database quality.
Credit Bureaus- Credit bureaus collect financial information about people, and this information is then available to marketing
organizations (subject to legal and privacy guidelines). The cost of such information is balanced by the added information about
customers.
Social Media- Social media is a relatively new source of data and offers powerful insights, albeit through a lot of unstructured data.
Companies like Datasift offer social media data, and companies like Salesforce/Radian6 offer social media tools
(http://www.salesforcemarketingcloud.com/). Facebook had 829 million daily active users on average in June 2014, with 1.32 billion
monthly active users. Twitter has 255 million monthly active users, and 500 million Tweets are sent per day. That generates a lot of
data about what current and potential customers are thinking and writing about your products.
Where to process data?
Now you have the data. We need computers to process it.
Local Machine - Benefits of storing the data on a local machine are ease of access. The potential risks
include machine outages, data recovery, data theft (especially for laptops) and limited scalability. A
local machine is also much more expensive in terms of processing and storage and gets obsolete
within a relatively short period of time.
Server- Servers respond to requests across networks. They can be thought of as centralized resources
that help cut down cost of processing and storage. They can be an intermediate solution between
local machines and clouds, though they have huge capital expenditure upfront. Not all data that can
fit on a laptop should be stored on a laptop. You can store data in virtual machines on your server
and connected through thin shell clients with secure access.
Cloud- The cloud can be thought of as a highly scalable, metered service that allows requests from remote
networks. It can be thought of as a large bank of servers, but that is a simplistic definition. A
hindrance to cloud adoption is resistance within existing IT departments, whose members are not
trained to transition to and maintain networks on the cloud as they used to do for enterprise networks.
Cloud Computing Providers
We expand on the cloud processing part here.
Amazon EC2 - Amazon Elastic Compute Cloud (Amazon EC2) provides scalable processing power in the cloud. It has a web based
management console, has a command line tool, and offers resources for Linux and Windows virtual images. Further details are
available at http://aws.amazon.com/ec2/. Amazon EC2 is generally considered the industry leader. For beginners, a 12 month
basic preview is available for free at http://aws.amazon.com/free/ that can allow practitioners to build up familiarity.
Google Compute- https://cloud.google.com/products/compute-engine/
Microsoft Azure - https://azure.microsoft.com/en-us/pricing/details/virtual-machines/ Azure Virtual Machines enable you to deploy a
Windows Server, Linux, or third-party software images to Azure. You can select images from a gallery or bring your own
customized images. Virtual Machines are charged by the minute. Discounts can range from 20% to 32% depending on whether you
prepay 6 month or 12 month plans and based on usage tier.
IBM shut down its SmartCloud Enterprise cloud computing platform by Jan. 31, 2014 and migrated those customers to its
SoftLayer cloud computing platform, an IBM-acquired company: https://www.softlayer.com/virtual-servers
Oracle - Oracle's plans for the cloud are still in preview for enterprise customers at https://cloud.oracle.com/compute
Where to store data
Data needs to be stored in a secure and reliable environment for speedy and
repeated access. There is a cost of storing this data, and there is a cost of losing
the data due to some technical accident.
You can store data in the following ways:
csv files, spreadsheets and text files locally, especially for smaller files. Note that while
this increases ease of access, it also creates problems of version control as
well as security of confidential data.
relational databases (RDBMS) and data warehouses
hadoop based storage
Where to store data
noSQL databases- are non-relational, distributed, open-source and horizontally
scalable. A complete list of NoSQL databases is at http://nosql-database.org/ .
Notable NoSQL databases are MongoDB, couchDB et al.
key value store -Key-value stores use the map or dictionary as their fundamental data model. In
this model, data is represented as a collection of key-value pairs, such that each possible key
appears at most once in the collection
Redis - Redis is an open source, BSD licensed, advanced key-value store. It is often referred
to as a data structure server since keys can contain strings, hashes, lists, sets and
sorted sets (http://redis.io/). (A usage sketch follows this list.)
Riak is an open source, distributed database. http://basho.com/riak/.
MemcacheDB is a persistence-enabled variant of memcached.
column oriented databases
cloud storage
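A minimal sketch of the key-value model described above, using the redis-py client; it assumes a Redis server running
locally on the default port, and the key names are placeholders.

# Key-value access with Redis via the redis-py client
# (assumes a local Redis server on the default port)
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
r.set("customer:42:segment", "high_value")    # write a key-value pair
print(r.get("customer:42:segment"))           # b'high_value'
r.hset("customer:42", "city", "Trivandrum")   # hash: field-value pairs under one key
print(r.hgetall("customer:42"))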
Cloud Storage
Amazon- Amazon Simple Storage Services (S3)- Amazon S3 provides a simple web-services interface that can be used to store
and retrieve any amount of data, at any time, from anywhere on the web. http://aws.amazon.com/s3/ . Cost is a maximum of 3
cents per GB per month. There are three types of storage Standard Storage, Reduced Redundancy Storage, Glacier Storage.
Reduced Redundancy Storage (RRS) is a storage option within Amazon S3 that enables customers to reduce their costs by
storing non-critical, reproducible data at lower levels of redundancy than Amazon S3’s standard storage. Amazon Glacier stores
data for as little as $0.01 per gigabyte per month, and is optimized for data that is infrequently accessed and for which retrieval
times of 3 to 5 hours are suitable. These details can be seen at http://aws.amazon.com/s3/pricing/. (A programmatic access sketch follows this list.)
Google - Google Cloud Storage https://cloud.google.com/products/cloud-storage/. It also has two kinds of storage. Durable
Reduced Availability Storage enables you to store data at lower cost, with the tradeoff of lower availability than standard Google
Cloud Storage. Prices are 2.6 cents for Standard Storage (GB/Month) and 2 cents for Durable Reduced Availability (DRA)
Storage (GB/Month). They can be seen at https://developers.google.com/storage/pricing#storage-pricing
Azure- Microsoft has different terminology for its cloud infrastructure. Storage is classified in three types, with a fourth type (Files)
being available as a preview. There are three levels of redundancy: Locally Redundant Storage (LRS), Geographically
Redundant Storage (GRS), and Read-Access Geographically Redundant Storage (RA-GRS). You can see details and prices at
https://azure.microsoft.com/en-us/pricing/details/storage/
Oracle - Oracle Storage is available at https://cloud.oracle.com/storage and costs around $30/TB per month
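As a sketch of programmatic access to the Amazon S3 service described above, using the boto3 client; the bucket and
object names are placeholders, and credentials are assumed to come from your AWS configuration.

# Store and retrieve an object in Amazon S3 with boto3
# (bucket and key names are illustrative placeholders)
import boto3

s3 = boto3.client("s3")
s3.upload_file("claims.csv", "my-bucket", "raw/claims.csv")        # local file -> S3
s3.download_file("my-bucket", "raw/claims.csv", "claims_copy.csv") # S3 -> local file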
Databases on the Cloud- Amazon
Amazon RDS -Managed MySQL, Oracle and SQL Server databases. http://aws.amazon.com/rds/ While relational
database engines provide robust features and functionality, scaling requires significant time and expertise.
DynamoDB - Managed NoSQL database service. http://aws.amazon.com/dynamodb/ Amazon DynamoDB focuses on
providing seamless scalability and fast, predictable performance. It runs on solid state disks (SSDs) for low-latency
response times, and there are no limits on the request capacity or storage size for a given table. This is because
Amazon DynamoDB automatically partitions your data and workload over a sufficient number of servers to meet the
scale requirements you provide.
Redshift - It is a managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently
analyze all your data using your existing business intelligence tools. You can start small for just $0.25 per hour and
scale to a petabyte or more for $1,000 per terabyte per year. http://aws.amazon.com/redshift/
SimpleDB- It is a highly available and flexible non-relational data store that offloads the work of database administration.
Developers simply store and query data items via web services requests: http://aws.amazon.com/simpledb/. A table in
Amazon SimpleDB has a strict storage limitation of 10 GB and is limited in the request capacity it can achieve
(typically under 25 writes/second); it is up to you to manage the partitioning and re-partitioning of your data over
additional SimpleDB tables if you need additional scale. While SimpleDB has scaling limitations, it may be a good fit
for smaller workloads that require query flexibility. Amazon SimpleDB automatically indexes all item attributes and thus
supports query flexibility at the cost of performance and scale.
Databases on the Cloud - Others
Google
Google Cloud SQL - Relational Databases in Google's Cloud: https://developers.google.com/cloud-sql/
Google Cloud Datastore - Managed NoSQL Data Storage Service
https://developers.google.com/datastore/
Google Big Query- Enables you to write queries on huge datasets. BigQuery uses a columnar
data structure, which means that for a given query, you are only charged for data processed
in each column, not the entire table https://cloud.google.com/products/bigquery/
Azure SQL Database https://azure.microsoft.com/en-in/services/sql-database/ SQL Database is a
relational database service in the cloud based on the Microsoft SQL Server engine, with mission-critical
capabilities. Because it’s based on the SQL Server engine, SQL Database supports existing
SQL Server tools, libraries and APIs, which makes it easier for you to move and extend to the
cloud.
Basic Statistics
Some of the basic statistics that every data scientist should know are given here. This assumes rudimentary basic knowledge of
statistics ( like measures of central tendency or variation) and basic familiarity with some of the terminology used by statisticians.
Random Sampling- In truly random sampling, the sample should be representative of the entire data. Random sampling remains
relevant in the era of Big Data and Cloud Computing.
Distributions- A data scientist should know the distributions ( normal, Poisson, Chi Square, F) and also how to determine the
distribution of data.
Hypothesis Testing - Hypothesis testing is meant for statistically testing assumptions regarding values of central tendency (mean,
median) or variation. A good example of easy-to-use software for statistical testing is the “test” tab in the Rattle GUI in R. (A Python sketch follows this list.)
Outliers- Checking for outliers is a good way for a data scientist to spot anomalies as well as assess data quality. The box plot
(exploratory data analysis) and the outlierTest function from the car package (Bonferroni Outlier Test) are how statistical rigor can be
maintained in outlier detection.
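The tools named above are from R; an equivalent minimal sketch in Python is a one-sample t-test from scipy. The claim
amounts below are invented.

# One-sample t-test: is the mean claim amount different from 1000?
from scipy import stats

claims = [980, 1050, 1230, 890, 1110, 1005, 940, 1190, 870, 1020]
t_stat, p_value = stats.ttest_1samp(claims, popmean=1000)
print(t_stat, p_value)   # reject the null hypothesis at 5% only if p_value < 0.05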
Basic Techniques
Some of the basic techniques that a data scientist must know are listed as follows-
Text Mining - In text mining, text data is analyzed for frequencies, associations and correlation for predictive purposes. The tm
package from R greatly helps with text mining. (A frequency-count sketch follows this list.)
Sentiment Analysis- In sentiment analysis, text data is classified based on a sentiment lexicon (e.g. one which says happy is less
positive than delighted but more positive than sad) to create sentiment scores of the text data mined.
Social Network Analysis- In social network analysis, the direction of relationships, the quantum of messages, and the study of
nodes, edges and graphs is done to give insights.
Time Series Forecasting- Data is said to be auto-regressive with regard to time if a future value is dependent on a current value for
a variable. Techniques such as ARIMA and exponential smoothing, and R packages like forecast, greatly assist in time series
forecasting.
Web Analytics
Social Media Analytics
Data Mining or Machine Learning
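The frequency-count sketch referenced above: term-frequency counting is the usual starting point of text mining. The toy
documents are invented; a real pipeline would also remove stopwords and stem the tokens.

# Term-frequency counting, the first step of text mining
import re
from collections import Counter

docs = [
    "Claims were settled quickly and the agent was helpful",
    "Slow claims process, unhelpful call center",
    "Helpful agent, quick settlement of claims",
]
tokens = [w for d in docs for w in re.findall(r"[a-z']+", d.lower())]
print(Counter(tokens).most_common(5))   # most frequent terms across documents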
Data Science Tools
- R
- Python
- Tableau
- Spark with ML
- Hadoop (Pig and Hive)
- SAS
- SQL
R
R provides a wide variety of statistical (linear and nonlinear modelling, classical
statistical tests, time-series analysis, classification, clustering, …) and graphical
techniques, and is highly extensible.
R is an integrated suite of software facilities for data manipulation, calculation and
graphical display. It includes an effective data handling and storage facility, a suite
of operators for calculations on arrays, in particular matrices, a large, coherent,
integrated collection of intermediate tools for data analysis, graphical facilities for
data analysis and display either on-screen or on hardcopy, and a well-developed,
simple and effective programming language
https://www.r-project.org/about.html
Python
http://python-history.blogspot.in/ and https://www.python.org/
SAS
http://www.sas.com/en_in/home.html
Big Data: Hadoop Stack with Spark
http://spark.apache.org/ Apache Spark™ is a fast and general engine for large-scale data processing.
Big Data: Hadoop Stack with Mahout
https://mahout.apache.org/
The Apache Mahout™ project's goal is to build an environment for quickly creating
scalable performant machine learning applications.
Apache Mahout Samsara Environment includes
Distributed Algebraic optimizer
R-Like DSL Scala API
Linear algebra operations
Ops are extensions to Scala
IScala REPL based interactive shell
Integrates with compatible libraries like MLLib
Runs on distributed Spark, H2O, and Flink
Apache Mahout Samsara Algorithms included
Stochastic Singular Value Decomposition (ssvd, dssvd)
Stochastic Principal Component Analysis (spca, dspca)
Big Data: Hadoop Stack with Mahout
https://mahout.apache.org/
Apache Mahout software provides three major features:
A simple and extensible programming environment and framework for building scalable algorithms
A wide variety of premade algorithms for Scala + Apache Spark, H2O, Apache Flink
Samsara, a vector math experimentation environment with R-like syntax which works at scale
Data Science Techniques
- Machine Learning
- Regression
- Logistic Regression
- K Means Clustering
- Association Analysis
- Decision Trees
- Text Mining
What is an algorithm
a process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer.
a self-contained step-by-step set of operations to be performed
a procedure or formula for solving a problem, based on conducting a
sequence of specified actions
a procedure for solving a mathematical problem (as of finding the greatest
common divisor) in a finite number of steps that frequently involves
repetition of an operation; broadly : a step-by-step procedure for solving a
problem or accomplishing some end especially by a computer.
Machine Learning
Machine learning concerns the construction and study of systems that can learn from data. For example, a machine learning
system could be trained on email messages to learn to distinguish between spam and non-spam messages
Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a
set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a
desired output value (also called the supervisory signal).
In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a
training set of correctly identified observations is available.
In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data. Since the
examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. This distinguishes
unsupervised learning from supervised learning
The corresponding unsupervised procedure is known as clustering or cluster analysis, and involves grouping data into categories
based on some measure of inherent similarity (e.g. the distance between instances, considered as vectors in a multi-dimensional
vector space).
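A minimal sketch of the unsupervised-clustering idea described above, using k-means from scikit-learn on made-up
two-dimensional points.

# Unsupervised learning: group unlabeled points with k-means
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster id assigned to each point
print(km.cluster_centers_)  # the two discovered group centers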
CRAN VIEW Machine Learning
http://cran.r-project.org/web/views/MachineLearning.html
Machine Learning in Python
http://scikit-learn.org/stable/
Classification
In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a
new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership
is known.
The individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables, features,
etc.
These properties may variously be categorical (e.g. "A", "B", "AB" or "O", for blood type),
ordinal (e.g. "large", "medium" or "small"),
integer-valued (e.g. the number of occurrences of a particular word in an email) or
real-valued (e.g. a measurement of blood pressure).
Some algorithms work only in terms of discrete data and require that real-valued or integer-valued data be discretized into groups
(e.g. less than 5, between 5 and 10, or greater than 10).
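A quick sketch of the discretization just mentioned (less than 5, between 5 and 10, greater than 10), using pandas.cut; the
values are invented, and the bins here are right-closed.

# Discretize a real-valued feature into the groups mentioned above
import pandas as pd

values = pd.Series([2.5, 7.1, 11.0, 4.9, 10.0, 15.3])
groups = pd.cut(values, bins=[float("-inf"), 5, 10, float("inf")],
                labels=["less than 5", "5 to 10", "greater than 10"])
print(groups)   # each value mapped to its discrete group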
Regression
regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for
modeling and analyzing several variables, when the focus is on the relationship between
a dependent variable and one or more independent variables.
More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable')
changes when any one of the independent variables is varied, while the other independent variables are held fixed.
Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent
variables – that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the
focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent
variables.
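A minimal sketch of estimating the conditional expectation of the dependent variable: an ordinary least squares line fit with
numpy. The points are invented.

# Estimate the conditional mean of y given x with ordinary least squares
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])    # roughly y = 2x
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)                     # fitted E[y | x] = intercept + slope * x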
kNN
Support Vector Machines
http://axon.cs.byu.edu/Dan/678/miscellaneous/SVM.example.pdf
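The two slides above (kNN and Support Vector Machines) can be illustrated with one small scikit-learn sketch on toy data;
the points and labels are invented.

# k-nearest neighbours and a support vector machine on the same toy data
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
svm = SVC(kernel="linear").fit(X, y)
print(knn.predict([[1.5, 1.5], [8.5, 8.5]]))   # majority vote of the 3 nearest points
print(svm.predict([[1.5, 1.5], [8.5, 8.5]]))   # side of the separating hyperplane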
Association Rules
http://en.wikipedia.org/wiki/Association_rule_learning
Based on the concept of strong rules, Rakesh Agrawal et al. introduced association rules for discovering regularities between
products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets.
For example, the rule {onions, potatoes} -> {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes
together, he or she is likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing
activities such as, e.g., promotional pricing or product placements.
In addition to the above example from market basket analysis association rules are employed today in many application areas
including Web usage mining, intrusion detection, Continuous production, and bioinformatics. As opposed to sequence mining,
association rule learning typically does not consider the order of items either within a transaction or across transactions
Concepts- Support, Confidence, Lift
In R
apriori() in arules package
In Python
http://orange.biolab.si/docs/latest/reference/rst/Orange.associate/
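Beyond those packages, the three concepts can be computed by hand. A minimal plain-Python sketch for the
onions-and-potatoes rule above, with five invented transactions:

# Support, confidence and lift for the rule {onions, potatoes} -> {burger}
transactions = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes", "burger", "beer"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"burger", "beer"},
]
n = len(transactions)
lhs, rhs = {"onions", "potatoes"}, {"burger"}

support_lhs = sum(lhs <= t for t in transactions) / n         # P(lhs)
support_rule = sum((lhs | rhs) <= t for t in transactions) / n  # P(lhs and rhs)
support_rhs = sum(rhs <= t for t in transactions) / n         # P(rhs)

confidence = support_rule / support_lhs    # P(rhs | lhs)
lift = confidence / support_rhs            # > 1 means positive association
print(support_rule, confidence, lift)      # 0.4, 0.667, 1.11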
Gradient Descent
Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent,
one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.
http://econometricsense.blogspot.in/2011/11/gradient-descent-in-r.html
Start at some x value, use derivative at that value to tell
us which way to move, and repeat. Gradient descent.
http://www.cs.colostate.edu/%7Eanderson/cs545/Lectures/week6day2/week6day2.pdf
Gradient Descent
https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
A standard approach to solving this type of problem is to define an error function (also called a cost function) that measures how “good” a given line is.
initial_b = 0 # initial y-intercept guess
initial_m = 0 # initial slope guess
num_iterations = 1000
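Expanding the fragment above into a runnable sketch of gradient descent for a line fit, in the spirit of the linked post; the
points and the learning rate are assumptions for illustration.

# Gradient descent for simple linear regression y = m*x + b
# Error function: mean squared error over the points
points = [(1, 1.2), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1)]

initial_b = 0         # initial y-intercept guess
initial_m = 0         # initial slope guess
num_iterations = 1000
learning_rate = 0.01

b, m = initial_b, initial_m
n = float(len(points))
for _ in range(num_iterations):
    # partial derivatives of the mean squared error w.r.t. b and m
    b_grad = sum(-(2 / n) * (y - (m * x + b)) for x, y in points)
    m_grad = sum(-(2 / n) * x * (y - (m * x + b)) for x, y in points)
    # step in the direction opposite the gradient
    b -= learning_rate * b_grad
    m -= learning_rate * m_grad

print(m, b)   # should approach slope ~1 and intercept ~0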
Decision Trees
http://select.cs.cmu.edu/class/10701-F09/recitations/recitation4_decision_tree.pdf
Decision Trees
http://www.ise.bgu.ac.il/faculty/liorr/hbchap9.pdf
Random Forest
Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down each of
the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the
classification having the most votes (over all the trees in the forest).
Each tree is grown as follows:
1.If the number of cases in the training set is N, sample N cases at random - but with replacement, from the original data. This
sample will be the training set for growing the tree.
2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out
of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
3. Each tree is grown to the largest extent possible. There is no pruning.
In the original paper on random forests, it was shown that the forest error rate depends on two things:
The correlation between any two trees in the forest. Increasing the correlation increases the forest error rate.
The strength of each individual tree in the forest. A tree with a low error rate is a strong classifier. Increasing the strength of the
individual trees decreases the forest error rate.
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#intro
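A minimal scikit-learn sketch of the voting-forest idea described above, using the built-in iris data.

# A random forest classifier: many trees vote on the class
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:3]))         # majority vote over 100 trees
print(forest.feature_importances_)   # which inputs drive the splits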
Bagging
Bagging, aka bootstrap aggregation, is a relatively simple way to increase the
power of a predictive statistical model by taking multiple random samples(with
replacement) from your training data set, and using each of these samples to
construct a separate model and separate predictions for your test set. These
predictions are then averaged to create a, hopefully more accurate, final
prediction value.
http://www.vikparuchuri.com/blog/build-your-own-bagging-function-in-r/
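The linked post builds a bagging function in R; a hand-rolled Python sketch of the same recipe (bootstrap samples,
separate models, averaged predictions) is below. The noisy sine data is invented.

# Hand-rolled bagging: bootstrap samples -> separate models -> averaged prediction
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 80).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 80)

predictions = []
for _ in range(25):                           # 25 bootstrap models
    idx = rng.integers(0, len(X), len(X))     # sample rows with replacement
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])
    predictions.append(tree.predict(X))
bagged = np.mean(predictions, axis=0)         # average the 25 predictions
print(bagged[:5])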
Boosting
Boosting is one of several classic methods for creating ensemble models,
along with bagging, random forests, and so forth. Boosting means that each
tree is dependent on prior trees, and learns by fitting the residual of the trees
that preceded it. Thus, boosting in a decision tree ensemble tends to improve
accuracy with some small risk of less coverage.
XGBoost is a library designed and optimized for boosting trees algorithms.
XGBoost is used in more than half of the winning solutions in machine learning
challenges hosted at Kaggle.
http://xgboost.readthedocs.io/en/latest/model.html#
And http://dmlc.ml/rstats/2016/03/10/xgboost.html
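A minimal sketch using the XGBoost Python package (assumed installed) on a built-in scikit-learn dataset; the
hyperparameters are illustrative, not recommendations.

# Gradient boosting with XGBoost: each tree fits the residual of the ones before it
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = xgb.XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))   # classification accuracy on held-out data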
Data Science Process
By Farcaster at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=40129394
LTV Analytics
Life Time Value (LTV) will help us answer 3
fundamental questions:
1. Did you pay enough to acquire
customers from each marketing
channel?
2. Did you acquire the best kind of
customers?
3. How much could you spend on
keeping them sweet with email and
social media?
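One common back-of-the-envelope LTV formula, sketched with invented numbers (the case study linked below walks
through richer variants):

# Simple lifetime value: average sale * purchase frequency * retention span * margin,
# compared against acquisition cost per channel (all numbers invented)
avg_order_value = 40.0      # average spend per purchase
purchases_per_year = 4.2    # purchase frequency
years_retained = 5.0        # expected customer lifespan
profit_margin = 0.21

ltv = avg_order_value * purchases_per_year * years_retained * profit_margin
acquisition_cost = {"search_ads": 55, "email": 12, "social": 90}
for channel, cac in acquisition_cost.items():
    print(channel, "profitable" if ltv > cac else "overpaying", round(ltv - cac, 2))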
LTV Analytics: Case Study
https://blog.kissmetrics.com/how-to-calculate-lifetime-value/
LTV Analytics
http://www.kaushik.net/avinash/analytics-tip-calculate-ltv-customer-lifetime-value/
LTV Analytics
Download the zip file from http://www.kaushik.net/avinash/avinash_ltv.zip
Pareto principle
The Pareto principle (also known as the 80–20 rule, the law of the vital few, and the principle of factor sparsity)
states that, for many events, roughly 80% of the effects come from 20% of the causes
80% of a company's profits come from 20% of its customers
80% of a company's complaints come from 20% of its customers
80% of a company's profits come from 20% of the time its staff spend
80% of a company's sales come from 20% of its products
80% of a company's sales are made by 20% of its sales staff
Several criminology studies have found 80% of crimes are committed by 20% of criminals.
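A quick sketch that checks the 80-20 pattern on a heavy-tailed profit distribution; the data is simulated, so the exact share
will vary from run to run.

# Check the 80-20 rule: what share of profit comes from the top 20% of customers?
import numpy as np

rng = np.random.default_rng(1)
profit = rng.pareto(a=1.2, size=1000) * 100    # heavy-tailed simulated profits
profit_sorted = np.sort(profit)[::-1]          # largest first
top20 = int(0.2 * len(profit_sorted))
share = profit_sorted[:top20].sum() / profit_sorted.sum()
print(round(share, 2))   # typically near or above 0.8 for this distribution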
RFM Analysis
RFM is a method used for analyzing customer value.
Recency - How recently did the customer purchase?
Frequency - How often do they purchase?
Monetary Value - How much do they spend?
A method
Recency = 10 - the number of months that have passed since the customer last purchased
Frequency = number of purchases in the last 12 months (maximum of 10)
Monetary = value of the highest order from a given customer (benchmarked against $10k)
Alternatively, one can create categories for each attribute. For instance, the Recency attribute might be broken into three
categories: customers with purchases within the last 90 days; between 91 and 365 days; and longer than 365 days. Such
categories may be arrived at by applying business rules, or using a data mining technique, to find meaningful breaks.
A commonly used shortcut is to use deciles. One is advised to look at distribution of data before choosing breaks.
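A minimal pandas sketch of the RFM scoring method described above; the five customers are invented, and a real table
would be summarized from transactions.

# RFM scoring following the method above
import pandas as pd

df = pd.DataFrame({
    "customer": ["A", "B", "C", "D", "E"],
    "months_since_last": [1, 8, 3, 11, 2],
    "purchases_12m": [9, 2, 5, 1, 12],
    "highest_order": [4000, 800, 2500, 300, 9000],
})
df["R"] = 10 - df["months_since_last"]                 # recency score
df["F"] = df["purchases_12m"].clip(upper=10)           # frequency, capped at 10
df["M"] = (df["highest_order"] / 10000 * 10).round(1)  # benchmarked against $10k
df["RFM"] = df[["R", "F", "M"]].sum(axis=1)
print(df.sort_values("RFM", ascending=False))          # best customers first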
Are you ready
To use more
Data Science
How Big Data, Cloud Computing, Data Science can help business

  • 18. Who uses Big Data http://www.sas.com/en_us/insights/big-data/what-is-big-data.html Banking With large amounts of information streaming in from countless sources, banks are faced with finding new and innovative ways to manage big data. While it’s important to understand customers and boost their satisfaction, it’s equally important to minimize risk and fraud while maintaining regulatory compliance. Big data brings big insights, but it also requires financial institutions to stay one step ahead of the game with advanced analytics. Education Educators armed with data-driven insight can make a significant impact on school systems, students and curriculums. By analyzing big data, they can identify at-risk students, make sure students are making adequate progress, and can implement a better system for evaluation and support of teachers and principals. Government When government agencies are able to harness and apply analytics to their big data, they gain significant ground when it comes to managing utilities, running agencies, dealing with traffic congestion or preventing crime. But while there are many advantages to big data, governments must also address issues of transparency and privacy.
  • 19. Who uses Big Data http://www.sas.com/en_us/insights/big-data/what-is-big-data.html Health Care Patient records. Treatment plans. Prescription information. When it comes to health care, everything needs to be done quickly, accurately – and, in some cases, with enough transparency to satisfy stringent industry regulations. When big data is managed effectively, health care providers can uncover hidden insights that improve patient care. Manufacturing Armed with insight that big data can provide, manufacturers can boost quality and output while minimizing waste – processes that are key in today’s highly competitive market. More and more manufacturers are working in an analytics-based culture, which means they can solve problems faster and make more agile business decisions. Retail Customer relationship building is critical to the retail industry – and the best way to manage that is to manage big data. Retailers need to know the best way to market to customers, the most effective way to handle transactions, and the most strategic way to bring back lapsed business. Big data remains at the heart of all those things.
  • 20. Big Data: Hadoop Stack The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. The project includes these modules: Hadoop Common: The common utilities that support the other Hadoop modules. Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data. Hadoop YARN: A framework for job scheduling and cluster resource management. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets. http://hadoop.apache.org/
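To make the MapReduce model concrete, here is a minimal word-count sketch using Hadoop Streaming, which lets the mapper and reducer be plain Python scripts reading stdin and writing stdout; the script names and paths are placeholders, not part of the original deck.

```python
# mapper.py -- emits one tab-separated (word, 1) pair per word on stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py -- Hadoop Streaming sorts mapper output by key before the
# reducer runs, so all counts for a word arrive together and can be
# summed in a single pass
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

The pair can be smoke-tested locally with cat input.txt | python mapper.py | sort | python reducer.py before submitting it to a cluster via the hadoop-streaming jar.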
  • 21. Big Data: Hadoop Stack Hadoop-related projects at Apache include: Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner. Avro™: A data serialization system. Cassandra™: A scalable multi-master database with no single points of failure. Chukwa™: A data collection system for managing large distributed systems. HBase™: A scalable, distributed database that supports structured data storage for large tables. Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying. Mahout™: A scalable machine learning and data mining library. Pig™: A high-level data-flow language and execution framework for parallel computation. Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation. Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine. ZooKeeper™: A high-performance coordination service for distributed applications.
  • 25. NoSQL A NoSQL (Not-only-SQL) database is one that has been designed to store, distribute and access data using methods that differ from relational databases (RDBMSs). NoSQL technology was originally created and used by Internet leaders such as Facebook, Google, Amazon, and others who required database management systems that could write and read data anywhere in the world, while scaling and delivering performance across massive data sets and millions of users.
  • 28. How NoSQL Databases Differ From Each Other https://www.datastax.com/nosql-databases There are a variety of different NoSQL databases on the market, with the key differentiators between them being the following: Architecture: Some NoSQL databases like MongoDB are architected in a master/slave model in somewhat the same way as many RDBMSs. Others (like Cassandra) are designed in a ‘masterless’ fashion where all nodes in a database cluster are the same. The architecture of a NoSQL database greatly impacts how well the database supports requirements such as constant uptime, multi-geography data replication, predictable performance, and more. Data Model: NoSQL databases are often classified by the data model they support. Some support a wide-row tabular store, while others sport a model that is either document-oriented, key-value, or graph. Data Distribution Model: Because of their architecture differences, NoSQL databases differ on how they support the reading, writing, and distribution of data. Some NoSQL platforms like Cassandra support writes and reads on every node in a cluster and can replicate/synchronize data between many data centers and cloud providers. Development Model: NoSQL databases differ on their development APIs, with some supporting SQL-like languages (e.g. Cassandra’s CQL).
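As a rough sketch of the data-model differentiator, using plain Python structures and invented field names: in a key-value store the value is opaque to the database, while in a document store the database understands the document's fields and can query on them.

```python
import json

# Key-value model: the store only knows the key; the value is an opaque blob
kv_store = {"customer:42": json.dumps({"name": "Asha", "city": "Trivandrum"})}
record = json.loads(kv_store["customer:42"])  # the application decodes the value
print(record["city"])

# Document model: fields are visible to the database, so a document store
# (e.g. MongoDB) could index and query on "city" or the nested "policies"
document = {
    "_id": 42,
    "name": "Asha",
    "city": "Trivandrum",
    "policies": [{"type": "auto", "premium": 1200}],
}
print([p["type"] for p in document["policies"]])
```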
  • 30. Cloud Computing Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models. http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf --National Institute of Standards and Technology
  • 31. Cloud Computing: Types Five essential characteristics: 1. On-demand self-service 2. Broad network access 3. Resource pooling 4. Rapid elasticity 5. Measured service
  • 32. Cloud Computing 1. the practice of using a network of remote servers hosted on the Internet to store, manage, and process data, rather than a local server or a personal computer. http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf
  • 33. Cloud Computing: Types three service models (SaaS, PaaS and IaaS)
  • 34. Cloud Computing: Types four deployment models (private, public, community and hybrid). Key enabling technologies include: 1. fast networks, 2. inexpensive computers, and 3. virtualization for commodity hardware.
  • 35. Cloud Computing: Types Major barriers to broader cloud adoption are security, interoperability, and portability. For a layman, in simple, short terms: cloud computing is a lot of scalable and customizable computing power available for rent by the hour and accessible remotely. It can help in doing more computing at a fraction of the cost.
  • 36. Data Driven Decision Making - using data and trending historical data - validating assumptions, if any - using champion/challenger to test scenarios - using experiments - using baselines - continuous improvement - customer experiences - costs - revenues If you can't measure it, you can't manage it. -Peter Drucker
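A minimal sketch of the champion/challenger idea, assuming the statsmodels library and invented conversion counts: a two-proportion z-test checks whether the challenger scenario's lift over the champion is likely to be real.

```python
from statsmodels.stats.proportion import proportions_ztest

conversions = [130, 162]   # champion, challenger (invented counts)
exposures = [1000, 1000]   # customers exposed to each scenario

stat, p_value = proportions_ztest(conversions, exposures)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the challenger's lift is unlikely to be chance.
```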
  • 37. BCG Matrix for Product Lines The BCG Matrix is best used to analyze your own or a target organization’s product portfolio. It is applicable for companies with multiple products, helping corporations analyze their business units or product lines and allocate resources among them.
  • 38. Porter’s 5 Forces Model for Industries It draws upon industrial organization (IO) economics to derive five forces that determine the competitive intensity and therefore attractiveness of a market. Attractiveness in this context refers to the overall industry profitability. An “unattractive” industry is one in which the combination of these five forces acts to drive down overall profitability. A very unattractive industry would be one approaching “pure competition”, in which available profits for all firms are driven to normal profit.
  • 39. Porter’s Diamond Model An economic model developed by Michael Porter in his book The Competitive Advantage of Nations, where he published his theory of why particular industries become competitive in particular locations.
  • 40. McKinsey 7S Framework To check which teams work and which teams don't (within an organization), use this framework by the famous consulting company: a strategic vision for groups, including businesses, business units, and teams. The 7 S's are structure, strategy, systems, skills, style, staff and shared values. The model is most often used as a tool to assess and monitor changes in the internal situation of an organization.
  • 41. Greiner Model for Organizational Growth Developed by Larry E. Greiner, it is helpful when examining the problems associated with growth in organizations and the impact of change on employees. It can be argued that growing organizations move through five relatively calm periods of evolution, each of which ends with a period of crisis and revolution. Each evolutionary period is characterized by the dominant management style used to achieve growth, while each revolutionary period is characterized by the dominant management problem that must be solved before growth can continue.
  • 42. Marketing Model The 4P and 4C models help you identify the marketing mix: Products, Price, Promotion, Place; Consumers, Cost, Communication, Convenience.
  • 43. Business Canvas Model The Business Model Canvas is a strategic management template for developing new or documenting existing business models. It is a visual chart with elements describing a firm’s value proposition, infrastructure, customers, and finances. It assists firms in aligning their activities by illustrating potential trade-offs.
  • 44. Motivation Models Herzberg's motivation-hygiene theory: job satisfaction and job dissatisfaction act independently of each other. Leading to satisfaction: achievement, recognition, the work itself, responsibility, advancement. Leading to dissatisfaction: company policy, supervision, relationship with boss, work conditions, salary, relationship with peers.
  • 46. Business Strategy Models http://decisionstats.com/2013/12/19/business-strategy-models/ 1. Porter's 5 Forces Model- to analyze industries 2. Business Canvas 3. BCG Matrix- to analyze product portfolios 4. Porter's Diamond Model- to analyze locations 5. McKinsey 7S Model- to analyze teams 6. Greiner Theory- to analyze the growth of an organization 7. Herzberg Hygiene Theory- to analyze soft aspects of individuals
  • 47. Data Science What is a data scientist? A data scientist is one who has interdisciplinary skills in programming, statistics and business domains to create actionable insights based on experiments or summaries from data.
  • 48. Data Science On a daily basis, a data scientist is simply a person who can write some code in one or more of the languages R, Python, Java, SQL or Hadoop (Pig, HQL, MR) for data storage, querying, summarization and visualization, efficiently and in time, on databases, cloud and servers, and who understands enough statistics to derive insights from data so business can make decisions. What should a data scientist know? He should know how to get data, store it, query it, manage it, and turn it into actionable insights.
  • 49. Big Data Social Media Analysis https://rdatamining.wordpress.com/2012/05/17/an-example-of-social-network-analysis-with-r-using-package-igraph/ Social Network Analysis
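The linked example uses R's igraph package; an equivalent minimal sketch with Python's networkx (the toy edge list is invented) computes the same style of centrality measures used in social network analysis.

```python
import networkx as nx

# A toy "who-interacts-with-whom" network; edges are invented
g = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

# Degree finds well-connected members; betweenness finds brokers
# who sit on many shortest paths between others (here, C and D)
print(dict(g.degree()))
print(nx.betweenness_centrality(g))
```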
  • 50. How does information propagate through a social network? http://www.r-bloggers.com/information-transmission-in-a-social-network-dissecting-the-spread-of-a-quora-post/
  • 51. Fraud Analysis anomaly detection (also outlier detection) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset.
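One common way to operationalize anomaly detection, sketched here with scikit-learn's IsolationForest on invented claim amounts; the contamination rate is an assumption you would tune to your data.

```python
from sklearn.ensemble import IsolationForest

# Invented claim amounts; one is wildly out of pattern
claims = [[120], [135], [110], [128], [9500], [140], [125]]

model = IsolationForest(contamination=0.15, random_state=0)
labels = model.fit_predict(claims)  # 1 = normal, -1 = flagged as anomalous
print(labels)
```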
  • 52. How they affect you: Financial Profitability Data storage is getting cheaper but the way data is stored is changing (from company servers to the external cloud). Big Data helps to store every interaction and transaction with the customer, but this also increases the complexity of data. Data science is getting cheaper (open source), but more skilled analytics professionals are required.
  • 53. How they affect you: Sales and Marketing Which customers to target and whom not to target (traditional propensity models) Where to target (geocoded) When to target Forecast demand
  • 54. How they affect you: Operations Optimize cost and logistics Maximize output per resource Can also be combined with IoT
  • 55. How they affect you: Human Resources Which employee is likely to leave first Which skill is most likely to be crucial in the next 12-24 months Forecasts for skills and employees
  • 56. Insurance Examples http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-progressive-insurance-35951-1.html Agents increasingly want mobile enablement, and not just the ability to quote, but to bind and sell policies on smartphones and tablets. -Progressive Progressive Snapshot https://www.progressive.com/auto/snapshot/ To participate you attach the Snapshot device to the computer in your car, which collects data about your driving habits. According to Progressive, the device records your vehicle identification number (VIN), how many miles you drive each day and how often you drive between midnight and 4 a.m. After driving with Snapshot for 30 days, you return it to Progressive and, depending on your driving habits, the company says you can get a discount of up to 30%.
  • 57. Insurance Examples Mass Mutual http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-massmutual-35952-1.html Created Haven Life, an online insurance agency that uses an algorithmic underwriting tool and a series of related decisions created in collaboration with a team of data scientists. Insurance companies are vast decision-making engines that take and manage risk. The inputs into this engine are data, and the capabilities created by the field of data science can and will impact every process in the company — from underwriting to claims management to security.
  • 58. Insurance Examples CNA is applying big data technology to workers compensation claims and adjusters’ notes. “That is a classic, unstructured big data kind of problem,” says Nate Root, SVP of CNA’s shared service organization. “We have hundreds of thousands of workers compensation claims, and claims adjuster notes, and there is tremendous value in those notes.” Root says the insurer recently began identifying workers’ compensation claims that have the potential to turn into a total disability, or partial permanent disability, without the right sort of attention. By examining the unstructured data, CNA has developed a hundred different variables that can predict a propensity for a claim to become serious, and then assign a nurse case manager to help the insured get necessary treatments for a better patient outcome, get them back to work and lower the overall cost of coverage. For example, the program can find people who are missing appointments or who are not engaged with physical therapy and should be. http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-cna-35959-1.html
  • 59. Insurance Examples American Family Insurance licensed APT’s Test & Learn software (http://www.predictivetechnologies.com/products/test-learn.aspx ) to enhance customer engagement and increase support for agents. “This is a statistical tool that enables us to create and analyze statistical tests,” For example, call-routing techniques affect wait times and, ultimately claims satisfaction. The insurer also tracks how claims are handled, and by whom, and whether agents are involved in resolution. Using APT, the insurer can isolate variables and accurately determine the success of one design vs. another for various products, geographies or demographics, http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns- american-family-insurance-35953-1.html .
  • 60. Insurance Examples http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-american-family-insurance-35953-1.html . American Family Insurance Unstructured data, such as that collected in call center transcripts, also can be studied to better understand what approaches are best for different situations, he says. “Hadoop and other tools enable natural- language processing and sentiment analysis,” Cruz says. “We can look for key words or patterns in those words, do counts and build models off textual indicators that enable us to identify three things: 1. when there could be fraud involved, 2. where there might be severity issues, 3. or how we can get ahead of that and plan for it,” Customer communication, web design and direct mail are other areas the insurer is, or soon will be, using APT, 1. Do we see greater lift in these geographies vs. those? Or,
  • 61. Insurance Examples Like MassMutual, Nationwide has partnered with a local college — Ohio State University, the university with the third-largest enrollment in the country. The Nationwide Center for Advanced Customer Insights (NCACI) gives OSU students in advanced degree programs the ability to work with real-world data to solve some of the biggest insurance business problems. Faculty and students from the marketing, statistics, psychology, economics and computer science departments work with Nationwide to develop predictive models and data mining techniques aimed at improving 1. marketing and distribution, 2. identifying consumer behavior patterns, and 3. increasing customer satisfaction and 4. lifetime value.
  • 62. Insurance Examples John Hancock's team set out to find a way to leverage the wealth of data collected by wearable technologies, including the popular Fitbit and recently released Apple Watch, to give something back to their customers. The end result was John Hancock Vitality, a new life insurance product that offers up to a 15 percent premium discount to customers who track their healthy habits with wearables and turn that information over to the insurance company. New buyers even get their own Fitbit to begin tracking. http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-john-hancock-35954-1.html Fitbit Inc. is an American company known for its products of the same name, which are activity trackers, wireless-enabled wearable technology devices that measure data such as the number of steps walked, heart rate, quality of sleep, steps climbed, and other personal metrics.
  • 63. Insurance Examples Swiss Re is using more public data to improve underwriting results and decrease the number of questions the insurer has to ask consumers to underwrite them. Swiss Re is looking at big data in terms of two major streams. In the first, big data is being used to help reduce costs and improve the efficiency of current processes throughout the insurance value chain, including claims and fraud management, cyber risk, customer management, pricing, risk assessment and selection, distribution and service management, product innovation, and research and development. In the second stream, big data also offers a new framework to think bigger in terms of market disruption. Swiss Re has created more than 100 prototypes internally, and that as a result the entire organization sees the value and importance of big data and smart analytics. http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-swiss-re-35957-1.html
  • 64. Insurance Examples “How do you take that operationally efficient data and turn it into a customer/household view and understand all the products attached to a person?” Allstate has focused heavily on master data management and data governance, creating party and household IDs for data. The company is also building a team to work across business areas on analytics projects rather than siloing big data projects within certain units. “Something meant for a single purpose often leads to other insights. We know, for example based on some call-volume analysis in our call center, how often customers defect.” “We have an application in claims, QuickFoto, where a policyholder that isn’t in a major accident can snap a picture of the damage and send it to us. But whereas in the past, that would’ve gone into a physical folder and then a filing cabinet, now I have all those pictures of cars in a database, and there’s a lot more that I can do.”
  • 66. Data Science Tools and Techniques for extracting maximum value from Customer Data and Interactions
  • 67. Agenda Data Science Approach Data Science Tools Data Science Techniques
  • 68. Data Science Approach On a daily basis, a data scientist is simply a person who can write some code in one or more of the languages R, Python, Java, SQL or Hadoop (Pig, HQL, MR) for data storage, querying, summarization and visualization, efficiently and in time, on databases, cloud and servers, and who understands enough statistics to derive insights from data so business can make decisions.
  • 69. Data Science Approach What should a data scientist know? He should know how to get data, store it, query it, manage it, and turn it into actionable insights. The following approach elaborates on this simple and sequential premise.
  • 70. Where to get Data A data scientist needs data to do science on, right! Some of the usual sources of data for a data scientist are- APIs- API is an acronym for Application Programming Interface. We cover APIs in detail in Chapter 6. APIs are how the current big data paradigm is enabled, as they enable machines to talk to and fetch data from each other programmatically. For a list of articles written by the same author on APIs, see https://www.programmableweb.com/profile/ajayohri. Internet Clickstream Logs- Internet clickstream logs refer to the data generated by humans when they click specific links within a webpage. This data is time stamped, and the uniqueness of the person clicking the link can be established by IP address. IP addresses can be parsed by registries like https://www.arin.net/whois or http://www.apnic.net/whois for examining location (country and city), internet service provider and owner of the address (for website owners this can be done using the website http://who.is/). In Windows the command ipconfig, and in Linux systems ifconfig, can help us examine IP addresses. You can read this for learning more on IP addresses: http://en.wikipedia.org/wiki/IP_address. Software like Clicky (http://getclicky.com) and Google Analytics (www.google.com/analytics) also helps give us data, which can then be parsed using their APIs. (See https://code.google.com/p/r-google-analytics/ for Google Analytics using R.) Machine Generated Data- Machines generate a lot of data, especially from sensors that ensure the machine is working properly. This data can be logged and used with events like cracks or failures for predictive asset maintenance or M2M (Machine to Machine) analytics.
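As a hedged sketch of the API route: fetching JSON with Python's requests library. The URL, parameters and field names below are placeholders, not a real endpoint.

```python
import requests

# Hypothetical endpoint returning a JSON list of customer records
resp = requests.get(
    "https://api.example.com/v1/customers",
    params={"limit": 10},
    timeout=10,
)
resp.raise_for_status()

for record in resp.json():  # assumes the endpoint returns a JSON array
    print(record.get("id"), record.get("city"))
```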
  • 71. Where to get Data Surveys- Surveys are mostly questionnaires filled in by humans. They used to be administered manually over paper, but online surveys are now the definitive trend. Surveys reveal valuable data about the preferences of current and potential customers. They do suffer from the bias inherent in the design of questions by the creator. Since customer preferences evolve, surveys help in getting primary data about current preferences. Coupled with stratified random sampling, they can be a powerful method for collecting data. SurveyMonkey is one such company that helps create online questionnaires (https://www.surveymonkey.com/pricing/). Commercial Databases- Commercial databases are proprietary databases that have been collected over time and are sold/rented by vendors. They can be used for prospect calling, appending information to an existing database, and refining internal database quality. Credit Bureaus- Credit bureaus collect financial information about people, and this information is then available to marketing organizations (subject to legal and privacy guidelines). The cost of such information is balanced by the added information about customers. Social Media- Social media is a relatively new source of data and offers powerful insights, albeit through a lot of unstructured data. Companies like Datasift offer social media data, and companies like Salesforce/Radian6 offer social media tools (http://www.salesforcemarketingcloud.com/). Facebook had 829 million daily active users on average in June 2014, with 1.32 billion monthly active users. Twitter has 255 million monthly active users, and 500 million Tweets are sent per day. That generates a lot of data about what current and potential customers are thinking and writing about your products.
  • 72. Where to process data? Now you have the data. We need computers to process it. Local Machine- The benefit of storing the data on a local machine is ease of access. The potential risks include machine outages, data recovery, data theft (especially for laptops) and limited scalability. A local machine is also much more expensive in terms of processing and storage and gets obsolete within a relatively short period of time. Server- Servers respond to requests across networks. They can be thought of as centralized resources that help cut down the cost of processing and storage. They can be an intermediate solution between local machines and clouds, though they have a huge capital expenditure upfront. Not all data that can fit on a laptop should be stored on a laptop. You can store data in virtual machines on your server and connect through thin shell clients with secure access. Cloud- The cloud can be thought of as a highly scalable, metered service that allows requests from remote networks. It can be thought of as a large bank of servers, but that is a simplistic definition. A hindrance to cloud adoption is resistance within the existing IT department, whose members are not trained to transition to and maintain the network over the cloud as they used to do for enterprise networks.
  • 73. Cloud Computing Providers We expand on the cloud processing part. Amazon EC2- Amazon Elastic Compute Cloud (Amazon EC2) provides scalable processing power in the cloud. It has a web-based management console, has a command line tool, and offers resources for Linux and Windows virtual images. Further details are available at http://aws.amazon.com/ec2/. Amazon EC2 is generally considered the industry leader. For beginners, a 12-month basic preview is available for free at http://aws.amazon.com/free/ that can allow practitioners to build up familiarity. Google Compute- https://cloud.google.com/products/compute-engine/ Microsoft Azure- https://azure.microsoft.com/en-us/pricing/details/virtual-machines/ Azure Virtual Machines enable you to deploy Windows Server, Linux, or third-party software images to Azure. You can select images from a gallery or bring your own customized images. Virtual Machines are charged by the minute. Discounts can range from 20% to 32% depending on whether you prepay 6-month or 12-month plans, and based on usage tier. IBM shut down its SmartCloud Enterprise cloud computing platform by Jan. 31, 2014 and migrated those customers to its SoftLayer cloud computing platform, an IBM-acquired company (https://www.softlayer.com/virtual-servers). Oracle- Oracle's plans for the cloud are still in preview for enterprise customers at https://cloud.oracle.com/compute
• 74. Where to store data Data needs to be stored in a secure and reliable environment for speedy and repeated access. There is a cost of storing this data, and there is a cost of losing it due to some technical accident. You can store data in the following ways: csv files, spreadsheets and text files locally, especially for smaller files (note that while this increases ease of access, it also creates problems of version control as well as security of confidential data); relational databases (RDBMS) and data warehouses; Hadoop-based storage
• 75. Where to store data noSQL databases- These are non-relational, distributed, open-source and horizontally scalable. A complete list of NoSQL databases is at http://nosql-database.org/ . Notable NoSQL databases are MongoDB, CouchDB et al. key value store- Key-value stores use the map or dictionary as their fundamental data model. In this model, data is represented as a collection of key-value pairs, such that each possible key appears at most once in the collection. Redis- Redis is an open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets (http://redis.io/). Riak is an open source, distributed database (http://basho.com/riak/). MemcacheDB is a persistence-enabled variant of memcached. Other options include column-oriented databases and cloud storage.
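To make the key-value model concrete, here is a minimal sketch in R that uses a plain environment as an in-memory map; this illustrates the data model only and is not a Redis or Riak client, and the keys and values are invented.

```r
# Key-value data model: each possible key appears at most once.
store <- new.env(hash = TRUE)

assign("user:1001", list(name = "Asha", plan = "premium"), envir = store)
assign("user:1002", list(name = "Ravi", plan = "basic"),   envir = store)

get("user:1001", envir = store)$plan   # look up a value by its key
exists("user:9999", envir = store)     # FALSE: no such key
```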
• 76. Cloud Storage Amazon- Amazon Simple Storage Service (S3)- Amazon S3 provides a simple web-services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. http://aws.amazon.com/s3/ . Cost is a maximum of 3 cents per GB per month. There are three types of storage: Standard Storage, Reduced Redundancy Storage, and Glacier Storage. Reduced Redundancy Storage (RRS) is a storage option within Amazon S3 that enables customers to reduce their costs by storing non-critical, reproducible data at lower levels of redundancy than Amazon S3's standard storage. Amazon Glacier stores data for as little as $0.01 per gigabyte per month, and is optimized for data that is infrequently accessed and for which retrieval times of 3 to 5 hours are suitable. These details can be seen at http://aws.amazon.com/s3/pricing/ Google- Google Cloud Storage https://cloud.google.com/products/cloud-storage/ . It also has two kinds of storage. Durable Reduced Availability Storage enables you to store data at lower cost, with the tradeoff of lower availability than standard Google Cloud Storage. Prices are 2.6 cents for Standard Storage (GB/month) and 2 cents for Durable Reduced Availability (DRA) Storage (GB/month). They can be seen at https://developers.google.com/storage/pricing#storage-pricing Azure- Microsoft has different terminology for its cloud infrastructure. Storage is classified into three types, with a fourth type (Files) available as a preview. There are three levels of redundancy: Locally Redundant Storage (LRS), Geographically Redundant Storage (GRS), and Read-Access Geographically Redundant Storage (RA-GRS). You can see details and prices at https://azure.microsoft.com/en-us/pricing/details/storage/ Oracle- Oracle Storage is available at https://cloud.oracle.com/storage and costs around $30/TB per month
• 77. Databases on the Cloud- Amazon Amazon RDS- Managed MySQL, Oracle and SQL Server databases. http://aws.amazon.com/rds/ While relational database engines provide robust features and functionality, scaling them requires significant time and expertise. DynamoDB- Managed NoSQL database service. http://aws.amazon.com/dynamodb/ Amazon DynamoDB focuses on providing seamless scalability and fast, predictable performance. It runs on solid state disks (SSDs) for low-latency response times, and there are no limits on the request capacity or storage size for a given table. This is because Amazon DynamoDB automatically partitions your data and workload over a sufficient number of servers to meet the scale requirements you provide. Redshift- A managed, petabyte-scale data warehouse service that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools. You can start small for just $0.25 per hour and scale to a petabyte or more for $1,000 per terabyte per year. http://aws.amazon.com/redshift/ SimpleDB- A highly available and flexible non-relational data store that offloads the work of database administration; developers simply store and query data items via web services requests. http://aws.amazon.com/simpledb/ A table in Amazon SimpleDB has a strict storage limitation of 10 GB and is limited in the request capacity it can achieve (typically under 25 writes/second); it is up to you to manage the partitioning and re-partitioning of your data over additional SimpleDB tables if you need additional scale. While SimpleDB has scaling limitations, it may be a good fit for smaller workloads that require query flexibility. Amazon SimpleDB automatically indexes all item attributes and thus supports query flexibility at the cost of performance and scale.
• 78. Databases on the Cloud- Others Google Google Cloud SQL- Relational databases in Google's cloud. https://developers.google.com/cloud-sql/ Google Cloud Datastore- Managed NoSQL data storage service. https://developers.google.com/datastore/ Google BigQuery- Enables you to write queries on huge datasets. BigQuery uses a columnar data structure, which means that for a given query, you are only charged for data processed in each column, not the entire table. https://cloud.google.com/products/bigquery/ Azure SQL Database https://azure.microsoft.com/en-in/services/sql-database/ SQL Database is a relational database service in the cloud based on the Microsoft SQL Server engine, with mission-critical capabilities. Because it's based on the SQL Server engine, SQL Database supports existing SQL Server tools, libraries and APIs, which makes it easier for you to move and extend to the cloud.
• 79. Basic Statistics Some of the basic statistics that every data scientist should know are given here. This assumes rudimentary knowledge of statistics (like measures of central tendency or variation) and basic familiarity with some of the terminology used by statisticians. Random Sampling- In truly random sampling, the sample should be representative of the entire data. Random sampling remains relevant in the era of Big Data and Cloud Computing. Distributions- A data scientist should know the common distributions (normal, Poisson, Chi-square, F) and also how to determine the distribution of data. Hypothesis Testing- Hypothesis testing is meant for statistically testing assumptions about values of central tendency (mean, median) or variation. A good example of easy-to-use software for statistical testing is the "test" tab in the Rattle GUI in R. Outliers- Checking for outliers is a good way for a data scientist to spot anomalies as well as assess data quality. The box plot (exploratory data analysis) and the outlierTest function from the car package (Bonferroni Outlier Test) are ways to bring statistical rigor to outlier detection (see the sketch below).
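As a concrete illustration of the tests above, a minimal R sketch on the built-in mtcars data; it assumes the car package is installed (install.packages("car")).

```r
library(car)

# Hypothesis test: is mean mpg different between transmission types?
t.test(mpg ~ am, data = mtcars)

# Outlier checks: a box plot for exploration, then the Bonferroni test.
fit <- lm(mpg ~ wt + hp, data = mtcars)
boxplot(mtcars$mpg, main = "mpg")
outlierTest(fit)   # car's Bonferroni Outlier Test on the model residuals
```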
• 80. Basic Techniques Some of the basic techniques that a data scientist must know are listed as follows- Text Mining- In text mining, text data is analyzed for frequencies, associations and correlation for predictive purposes. The tm package from R greatly helps with text mining. Sentiment Analysis- In sentiment analysis, text data is classified against a sentiment lexicon (e.g. one which says happy is less positive than delighted but more positive than sad) to create sentiment scores of the text data mined. Social Network Analysis- In social network analysis, the direction of relationships, the quantum of messages, and the study of nodes, edges and graphs is done to give insights. Time Series Forecasting- Data is said to be auto-regressive with regard to time if a future value is dependent on a current value for a variable. Techniques such as ARIMA and exponential smoothing, and R packages like forecast, greatly assist in time series forecasting (see the sketch below). Other techniques include web analytics, social media analytics, and data mining or machine learning.
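A minimal time series forecasting sketch, assuming the forecast package is installed; AirPassengers is a data set that ships with R.

```r
library(forecast)

fit <- auto.arima(AirPassengers)   # let the package select an ARIMA model
fc  <- forecast(fit, h = 12)       # forecast the next 12 months
plot(fc)                           # point forecasts with prediction intervals
```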
  • 81. Data Science Tools - R - Python - Tableau - Spark with ML - Hadoop (Pig and Hive) - SAS - SQL
• 82. R R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes: an effective data handling and storage facility; a suite of operators for calculations on arrays, in particular matrices; a large, coherent, integrated collection of intermediate tools for data analysis; graphical facilities for data analysis and display either on-screen or on hardcopy; and a well-developed, simple and effective programming language. https://www.r-project.org/about.html
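A small taste of the facilities described above, runnable in any R session with no extra packages:

```r
# Operators for calculations on arrays, in particular matrices.
m <- matrix(1:6, nrow = 2)             # a 2 x 3 matrix
m %*% t(m)                             # matrix multiplication
colMeans(m)                            # column means

# A classical statistical model and graphics in two lines.
fit <- lm(dist ~ speed, data = cars)   # built-in data set
plot(cars); abline(fit)                # scatter plot with fitted line
```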
  • 85. Big Data: Hadoop Stack with Spark http://spark.apache.org/ Apache Spark™ is a fast and general engine for large-scale data processing.
• 86. Big Data: Hadoop Stack with Mahout https://mahout.apache.org/ The Apache Mahout™ project's goal is to build an environment for quickly creating scalable, performant machine learning applications. The Apache Mahout Samsara environment includes: a distributed algebraic optimizer; an R-like DSL Scala API; linear algebra operations; ops implemented as extensions to Scala; an IScala REPL based interactive shell; integration with compatible libraries like MLlib; and it runs on distributed Spark, H2O, and Flink. Apache Mahout Samsara algorithms included: Stochastic Singular Value Decomposition (ssvd, dssvd); Stochastic Principal Component Analysis (spca, dspca)
• 87. Big Data: Hadoop Stack with Mahout https://mahout.apache.org/ Apache Mahout software provides three major features: a simple and extensible programming environment and framework for building scalable algorithms; a wide variety of premade algorithms for Scala + Apache Spark, H2O, and Apache Flink; and Samsara, a vector math experimentation environment with R-like syntax which works at scale
  • 88. Data Science Techniques - Machine Learning - Regression - Logistic Regression - K Means Clustering - Association Analysis - Decision Trees - Text Mining
• 89. What is an algorithm? An algorithm is: a process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer; a self-contained step-by-step set of operations to be performed; a procedure or formula for solving a problem, based on conducting a sequence of specified actions; a procedure for solving a mathematical problem (such as finding the greatest common divisor) in a finite number of steps that frequently involves repetition of an operation; broadly, a step-by-step procedure for solving a problem or accomplishing some end, especially by a computer.
• 90. Machine Learning Machine learning concerns the construction and study of systems that can learn from data. For example, a machine learning system could be trained on email messages to learn to distinguish between spam and non-spam messages. Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. This distinguishes unsupervised learning from supervised learning. The corresponding unsupervised procedure is known as clustering or cluster analysis, and involves grouping data into categories based on some measure of inherent similarity (e.g. the distance between instances, considered as vectors in a multi-dimensional vector space).
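A short sketch contrasting the two paradigms on the built-in mtcars data; the choice of variables here is purely illustrative.

```r
# Supervised: learn a labeled outcome (transmission type am) from inputs.
sup <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
head(round(fitted(sup), 2))    # predicted probabilities for the labels

# Unsupervised: cluster the same rows without using any label.
unsup <- kmeans(scale(mtcars[, c("mpg", "wt")]), centers = 2)
table(cluster = unsup$cluster, actual = mtcars$am)
```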
  • 91. CRAN VIEW Machine Learning http://cran.r-project.org/web/views/MachineLearning.html
  • 92. Machine Learning in Python http://scikit-learn.org/stable/
• 93. Classification In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. The individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables, features, etc. These properties may be categorical (e.g. "A", "B", "AB" or "O", for blood type), ordinal (e.g. "large", "medium" or "small"), integer-valued (e.g. the number of occurrences of a particular word in an email) or real-valued (e.g. a measurement of blood pressure). Some algorithms work only in terms of discrete data and require that real-valued or integer-valued data be discretized into groups (e.g. less than 5, between 5 and 10, or greater than 10).
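A minimal classification sketch with a decision tree; the rpart package ships with standard R distributions, and iris is a built-in data set.

```r
library(rpart)

fit  <- rpart(Species ~ ., data = iris, method = "class")
pred <- predict(fit, iris, type = "class")
table(predicted = pred, actual = iris$Species)   # confusion matrix
```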
• 94. Regression Regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables – that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent variables.
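A minimal regression sketch estimating the conditional expectation described above, on the built-in cars data:

```r
fit <- lm(dist ~ speed, data = cars)            # stopping distance vs speed
coef(fit)                                       # intercept and slope
predict(fit, newdata = data.frame(speed = 20))  # average distance at speed 20
```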
  • 95. kNN
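Since the slide's content is an image, here is a hedged k-nearest-neighbours sketch using the class package (shipped with standard R distributions); k = 5 is an arbitrary illustrative choice.

```r
library(class)

set.seed(1)
train_idx <- sample(nrow(iris), 100)        # 100 training rows, rest held out
pred <- knn(train = iris[train_idx, 1:4],
            test  = iris[-train_idx, 1:4],
            cl    = iris$Species[train_idx],
            k     = 5)
mean(pred == iris$Species[-train_idx])      # hold-out accuracy
```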
• 97. Association Rules http://en.wikipedia.org/wiki/Association_rule_learning Based on the concept of strong rules, Rakesh Agrawal et al. introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, a rule found in the sales data of a supermarket might indicate that if a customer buys onions and potatoes together, he or she is likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as, e.g., promotional pricing or product placements. In addition to the above example from market basket analysis, association rules are employed today in many application areas including web usage mining, intrusion detection, continuous production, and bioinformatics. As opposed to sequence mining, association rule learning typically does not consider the order of items either within a transaction or across transactions. Concepts- Support, Confidence, Lift. In R: apriori() in the arules package (see the sketch below). In Python: http://orange.biolab.si/docs/latest/reference/rst/Orange.associate/
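A minimal sketch with arules (assumed installed); Groceries is a sample transactions data set bundled with the package, and the support and confidence thresholds are illustrative.

```r
library(arules)
data(Groceries)

rules <- apriori(Groceries,
                 parameter = list(support = 0.01, confidence = 0.5))
inspect(head(sort(rules, by = "lift"), 3))   # strongest rules by lift
```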
• 98. Gradient Descent Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. http://econometricsense.blogspot.in/2011/11/gradient-descent-in-r.html Start at some x value, use the derivative at that value to tell us which way to move, and repeat. http://www.cs.colostate.edu/%7Eanderson/cs545/Lectures/week6day2/week6day2.pdf
• 99. Gradient Descent https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/ A standard approach to solving this type of problem is to define an error function (also called a cost function) that measures how "good" a given line is. The linked post starts from initial guesses: initial_b = 0 # initial y-intercept guess, initial_m = 0 # initial slope guess, num_iterations = 1000. An R re-implementation of the same idea is sketched below.
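A hedged R sketch of gradient descent for a line fit on the built-in cars data; this re-implements the idea rather than reproducing the linked post's code, and the learning rate and iteration count are illustrative, not tuned.

```r
x <- cars$speed; y <- cars$dist
m <- 0; b <- 0                      # initial slope and intercept guesses
alpha <- 0.005                      # learning rate (illustrative)
for (i in 1:20000) {
  err <- (m * x + b) - y            # residuals under the current line
  m <- m - alpha * mean(err * x)    # step against the gradient for the slope
  b <- b - alpha * mean(err)        # ... and for the intercept
}
c(slope = m, intercept = b)         # should approach coef(lm(dist ~ speed, data = cars))
```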
• 102. Random Forest Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest). Each tree is grown as follows: 1. If the number of cases in the training set is N, sample N cases at random - but with replacement - from the original data. This sample will be the training set for growing the tree. 2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing. 3. Each tree is grown to the largest extent possible. There is no pruning. In the original paper on random forests, it was shown that the forest error rate depends on two things: the correlation between any two trees in the forest (increasing the correlation increases the forest error rate), and the strength of each individual tree in the forest (a tree with a low error rate is a strong classifier; increasing the strength of the individual trees decreases the forest error rate). https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#intro
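A minimal sketch with the randomForest package (assumed installed), following the voting scheme described above:

```r
library(randomForest)

set.seed(42)
fit <- randomForest(Species ~ ., data = iris, ntree = 500)
print(fit)         # confusion matrix and out-of-bag (OOB) error estimate
importance(fit)    # variable importance across the ensemble
```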
• 103. Bagging Bagging, aka bootstrap aggregation, is a relatively simple way to increase the power of a predictive statistical model by taking multiple random samples (with replacement) from your training data set, and using each of these samples to construct a separate model and separate predictions for your test set. These predictions are then averaged to create a (hopefully more accurate) final prediction value. http://www.vikparuchuri.com/blog/build-your-own-bagging-function-in-r/
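A hand-rolled bagging sketch in the spirit of the linked post (not its code): grow trees on bootstrap resamples and average their predictions.

```r
library(rpart)

set.seed(1)
preds <- replicate(25, {
  idx <- sample(nrow(mtcars), replace = TRUE)        # bootstrap sample
  fit <- rpart(mpg ~ wt + hp, data = mtcars[idx, ])  # tree on the resample
  predict(fit, mtcars)                               # predict on all rows
})
bagged <- rowMeans(preds)                            # averaged prediction
head(cbind(actual = mtcars$mpg, bagged = bagged))
```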
• 104. Boosting Boosting is one of several classic methods for creating ensemble models, along with bagging, random forests, and so forth. Boosting means that each tree is dependent on prior trees, and learns by fitting the residual of the trees that preceded it. Thus, boosting in a decision tree ensemble tends to improve accuracy with some small risk of less coverage. XGBoost is a library designed and optimized for boosted tree algorithms. XGBoost has been used in more than half of the winning solutions in machine learning challenges hosted at Kaggle. http://xgboost.readthedocs.io/en/latest/model.html# and http://dmlc.ml/rstats/2016/03/10/xgboost.html
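A minimal XGBoost sketch in R (package assumed installed); the feature set and parameters are illustrative, not tuned values.

```r
library(xgboost)

X <- as.matrix(mtcars[, c("wt", "hp", "disp")])
y <- mtcars$am                                   # 0/1 label: transmission type
bst <- xgboost(data = X, label = y, nrounds = 20,
               objective = "binary:logistic", verbose = 0)
head(round(predict(bst, X), 3))                  # predicted probabilities
```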
  • 105. Data Science Process By Farcaster at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=40129394
  • 106. LTV Analytics Life Time Value (LTV) will help us answer 3 fundamental questions: 1. Did you pay enough to acquire customers from each marketing channel? 2. Did you acquire the best kind of customers? 3. How much could you spend on keeping them sweet with email and social media?
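One common back-of-envelope way to answer question 1 is a simplified LTV formula (margin per customer per month divided by monthly churn); the figures below are invented for illustration and are not from the case study that follows.

```r
arpu   <- 50     # average revenue per user per month (assumed)
margin <- 0.60   # gross margin (assumed)
churn  <- 0.05   # monthly churn rate (assumed)
ltv    <- arpu * margin / churn
ltv              # 600: a rough ceiling on what to pay to acquire a customer
```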
  • 107. LTV Analytics :Case Study https://blog.kissmetrics.com/how-to-calculate-lifetime-value/
  • 112. LTV Analytics Download the zip file from http://www.kaushik.net/avinash/avinash_ltv.zip
• 113. Pareto principle The Pareto principle (also known as the 80–20 rule, the law of the vital few, and the principle of factor sparsity) states that, for many events, roughly 80% of the effects come from 20% of the causes. For example: 80% of a company's profits come from 20% of its customers; 80% of a company's complaints come from 20% of its customers; 80% of a company's profits come from 20% of the time its staff spend; 80% of a company's sales come from 20% of its products; 80% of a company's sales are made by 20% of its sales staff. Several criminology studies have found 80% of crimes are committed by 20% of criminals.
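A quick way to check for a Pareto pattern in your own data is to compute the revenue share of the top 20% of customers; the data below is synthetic and skewed purely for illustration.

```r
set.seed(7)
revenue <- rlnorm(1000, meanlog = 4, sdlog = 1.2)   # 1,000 customers, skewed spend
top20   <- sum(sort(revenue, decreasing = TRUE)[1:200])
top20 / sum(revenue)                                # share from the top 20% (~0.6 here)
```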
• 114. RFM Analysis RFM is a method used for analyzing customer value. Recency- How recently did the customer purchase? Frequency- How often do they purchase? Monetary Value- How much do they spend? One method: Recency = 10 minus the number of months that have passed since the customer last purchased; Frequency = number of purchases in the last 12 months (maximum of 10); Monetary = value of the highest order from a given customer (benchmarked against $10k). Alternatively, one can create categories for each attribute. For instance, the Recency attribute might be broken into three categories: customers with purchases within the last 90 days; between 91 and 365 days; and longer than 365 days. Such categories may be arrived at by applying business rules, or using a data mining technique, to find meaningful breaks. A commonly used shortcut is to use deciles. One is advised to look at the distribution of data before choosing breaks (see the sketch below).
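A minimal RFM scoring sketch on synthetic transactions, using quintile breaks (a decile-style shortcut as suggested above); all column names and figures are invented for illustration.

```r
set.seed(3)
tx <- data.frame(
  customer = sample(1:100, 500, replace = TRUE),
  days_ago = sample(1:365, 500, replace = TRUE),
  amount   = round(runif(500, 5, 200), 2)
)
recency   <- tapply(tx$days_ago, tx$customer, min)     # days since last purchase
frequency <- tapply(tx$amount,   tx$customer, length)  # number of purchases
monetary  <- tapply(tx$amount,   tx$customer, sum)     # total spend

quint <- function(v, reverse = FALSE) {
  q <- cut(rank(v, ties.method = "first"), 5, labels = FALSE)
  if (reverse) 6L - q else q          # recent buyers should score high
}
rfm <- data.frame(R = quint(recency, reverse = TRUE),
                  F = quint(frequency),
                  M = quint(monetary))
head(rfm)                             # 1-5 scores per customer
```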
• 115. Are you ready to use more Data Science?