
How Big Data, Cloud Computing, and Data Science Can Help Business

Two talks given by Ajay Ohri of Decisionstats at Allianz Trivandrum India on 9 Nov 2016 on Big Data, Cloud Computing and Data Science



  1. Analytics Talk by Ajay Ohri at Allianz Trivandrum, 9 October 2016
  2. Analytics Session: Introduction to Big Data, Cloud Computing, Data Science and How They Affect You
  3. Agenda: Big Data (definition and explanation); Cloud Computing; Data Science; Business Strategy Models; Case Studies in Insurance
  4. Big Data
  What is Big Data? "Big data" is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Examples include web logs, RFID, sensor networks, social networks, social data (due to the social data revolution), Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and often interdisciplinary scientific research, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce.
  5. Big Data
  What is Big Data? "Extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions", as in "much IT investment is going towards managing and maintaining big data". Per https://en.wikipedia.org/wiki/Big_data: big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy.
  6. Big Data: Statistics
  IBM (http://www-01.ibm.com/software/data/bigdata/): Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.
  7. Big Data: Moving Fast
  IBM (https://www.ibm.com/big-data/us/en/): Big data is being generated by everything around us at all times. Every digital process and social media exchange produces it. Systems, sensors and mobile devices transmit it. Big data is arriving from multiple sources at an alarming velocity, volume and variety. To extract meaningful value from big data, you need optimal processing power, analytics capabilities and skills.
  8. The 4 Vs of Big Data http://www.ibmbigdatahub.com/infographic/four-vs-big-data
  9. VOLUME http://www.ibmbigdatahub.com/infographic/four-vs-big-data
  10. VELOCITY http://www.ibmbigdatahub.com/infographic/four-vs-big-data
  11. VARIETY http://www.ibmbigdatahub.com/infographic/four-vs-big-data
  12. VERACITY http://www.ibmbigdatahub.com/infographic/four-vs-big-data
  13. VALUE http://www.ibmbigdatahub.com/infographic/extracting-business-value-4-vs-big-data
  14. Veracity and Variety http://www.ibmbigdatahub.com/infographic/extracting-business-value-4-vs-big-data
  15. Volume and Velocity http://www.ibmbigdatahub.com/infographic/extracting-business-value-4-vs-big-data
  16. Example. Source: https://www.renesas.com/en-sg/about/web-magazine/edge/global/13-big-data.html
  17. Example. Source: https://www.renesas.com/en-sg/about/web-magazine/edge/global/13-big-data.html
  18. Who uses Big Data http://www.sas.com/en_us/insights/big-data/what-is-big-data.html
  Banking: With large amounts of information streaming in from countless sources, banks are faced with finding new and innovative ways to manage big data. While it’s important to understand customers and boost their satisfaction, it’s equally important to minimize risk and fraud while maintaining regulatory compliance. Big data brings big insights, but it also requires financial institutions to stay one step ahead of the game with advanced analytics.
  Education: Educators armed with data-driven insight can make a significant impact on school systems, students and curriculums. By analyzing big data, they can identify at-risk students, make sure students are making adequate progress, and implement a better system for evaluation and support of teachers and principals.
  Government: When government agencies are able to harness and apply analytics to their big data, they gain significant ground when it comes to managing utilities, running agencies, dealing with traffic congestion or preventing crime. But while there are many advantages to big data, governments must also address issues of transparency and privacy.
  19. Who uses Big Data http://www.sas.com/en_us/insights/big-data/what-is-big-data.html
  Health Care: Patient records. Treatment plans. Prescription information. When it comes to health care, everything needs to be done quickly, accurately – and, in some cases, with enough transparency to satisfy stringent industry regulations. When big data is managed effectively, health care providers can uncover hidden insights that improve patient care.
  Manufacturing: Armed with insight that big data can provide, manufacturers can boost quality and output while minimizing waste – processes that are key in today’s highly competitive market. More and more manufacturers are working in an analytics-based culture, which means they can solve problems faster and make more agile business decisions.
  Retail: Customer relationship building is critical to the retail industry – and the best way to manage that is to manage big data. Retailers need to know the best way to market to customers, the most effective way to handle transactions, and the most strategic way to bring back lapsed business. Big data remains at the heart of all those things.
  20. Big Data: Hadoop Stack
  The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failures. The project includes these modules:
  Hadoop Common: The common utilities that support the other Hadoop modules.
  Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  Hadoop YARN: A framework for job scheduling and cluster resource management.
  Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
  http://hadoop.apache.org/
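The MapReduce programming model named above can be sketched in plain Python. This is a single-machine simulation of the model, not actual Hadoop code: mappers emit (word, 1) pairs, the framework groups pairs by key (the shuffle), and reducers sum the counts.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/sort: group pairs by key; Reduce: sum each key's counts.
    counts = {}
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        counts[key] = sum(v for _, v in group)
    return counts

docs = ["big data moves fast", "big data is big"]
result = reduce_phase(map_phase(docs))
print(result["big"], result["data"])  # 3 2
```

On a real cluster the map and reduce calls run in parallel across many nodes, and HDFS supplies the input splits; the logical contract is the same.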
  21. Big Data: Hadoop Stack
  Hadoop-related projects at Apache include:
  Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  Avro™: A data serialization system.
  Cassandra™: A scalable multi-master database with no single points of failure.
  Chukwa™: A data collection system for managing large distributed systems.
  HBase™: A scalable, distributed database that supports structured data storage for large tables.
  Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  Mahout™: A scalable machine learning and data mining library.
  Pig™: A high-level data-flow language and execution framework for parallel computation.
  Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
  ZooKeeper™: A high-performance coordination service for distributed applications.
  22. Big Data: Hadoop Stack
  23. Big Data: Hadoop Stack
  24. Big Data: Hadoop Stack
  25. NoSQL
  A NoSQL (Not-only-SQL) database is one that has been designed to store, distribute and access data using methods that differ from relational databases (RDBMSs). NoSQL technology was originally created and used by Internet leaders such as Facebook, Google, Amazon, and others who required database management systems that could write and read data anywhere in the world, while scaling and delivering performance across massive data sets and millions of users.
  26. NoSQL https://www.datastax.com/nosql-databases
  27. NoSQL https://www.datastax.com/nosql-databases
  28. How NoSQL Databases Differ From Each Other https://www.datastax.com/nosql-databases
  There are a variety of different NoSQL databases on the market, with the key differentiators between them being the following:
  Architecture: Some NoSQL databases, like MongoDB, are architected in a master/slave model in somewhat the same way as many RDBMSs. Others (like Cassandra) are designed in a ‘masterless’ fashion where all nodes in a database cluster are the same. The architecture of a NoSQL database greatly impacts how well the database supports requirements such as constant uptime, multi-geography data replication, predictable performance, and more.
  Data Model: NoSQL databases are often classified by the data model they support. Some support a wide-row tabular store, while others sport a model that is either document-oriented, key-value, or graph.
  Data Distribution Model: Because of their architecture differences, NoSQL databases differ in how they support the reading, writing, and distribution of data. Some NoSQL platforms like Cassandra support writes and reads on every node in a cluster and can replicate/synchronize data between many data centers and cloud providers.
  Development Model: NoSQL databases differ in their development APIs, with some supporting SQL-like languages (e.g. Cassandra’s CQL).
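The data-model distinction above can be made concrete with a toy sketch. Plain Python structures stand in for real databases here, and the record contents are invented for illustration: a key-value store treats the value as an opaque blob looked up by key, while a document store understands the record's structure and can filter on fields inside it.

```python
import json

# Key-value model: an opaque value retrieved by a single key.
kv_store = {"customer:42": json.dumps({"name": "Asha", "city": "Trivandrum"})}

# Document model: the store knows the record structure and can query inside it.
doc_store = [{"_id": 42, "name": "Asha", "city": "Trivandrum", "policies": ["auto", "life"]}]

# A key-value lookup must fetch and decode the whole value...
record = json.loads(kv_store["customer:42"])
print(record["city"])  # Trivandrum

# ...while a document store can filter on a field directly
# (sketched here with a list comprehension in place of a query language).
in_tvm = [d for d in doc_store if d["city"] == "Trivandrum"]
print(len(in_tvm))  # 1
```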
  29. Big Data Strategy
  30. Cloud Computing
  Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models. http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf -- National Institute of Standards and Technology
  31. Cloud Computing: Types
  Five essential characteristics: 1. On-demand self-service 2. Broad network access 3. Resource pooling 4. Rapid elasticity 5. Measured service
  32. Cloud Computing
  The practice of using a network of remote servers hosted on the Internet to store, manage, and process data, rather than a local server or a personal computer. http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf
  33. Cloud Computing: Types
  Three service models: SaaS, PaaS and IaaS
  34. Cloud Computing: Types
  Four deployment models: private, public, community and hybrid. Key enabling technologies include: 1. fast networks, 2. inexpensive computers, and 3. virtualization for commodity hardware.
  35. Cloud Computing: Types
  Major barriers to broader cloud adoption are security, interoperability, and portability. In simple terms for a layperson: cloud computing is a lot of scalable, customizable computing power available for rent by the hour and accessible remotely. It can help in doing more computing at a fraction of the cost.
  36. Data Driven Decision Making
  - using data and trending historical data
  - validating assumptions, if any
  - using champion challenger to test scenarios
  - using experiments
  - using baselines
  - continuous improvement
  - customer experiences
  - costs
  - revenues
  "If you can't measure it, you can't manage it" - Peter Drucker
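The champion/challenger item above can be sketched as a simple statistical test: compare the conversion rate of the incumbent offer (champion) with a new one (challenger). The counts below are invented for illustration; this is a standard two-proportion z-test, not a method prescribed by the talk.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic comparing two conversion rates (champion vs challenger)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Champion offer converted 200 of 5,000 customers; challenger 260 of 5,000.
z = two_proportion_z(200, 5000, 260, 5000)
print(round(z, 2))  # a z above ~1.96 suggests the challenger's lift is real at the 5% level
```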
  37. BCG Matrix for Product Lines
  The BCG Matrix is best used to analyze your own or a target organization's product portfolio; it is applicable to companies with multiple products. It helps corporations analyze their business units or product lines, which in turn helps the company allocate resources.
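The BCG classification can be sketched as a small function. The cut-offs used here (10% market growth, 1.0 relative market share) are the textbook defaults, not values from the talk:

```python
def bcg_quadrant(market_growth, relative_share):
    """Classify a product line on the BCG matrix.
    market_growth: annual market growth rate (e.g. 0.15 for 15%).
    relative_share: market share relative to the largest competitor."""
    high_growth = market_growth >= 0.10
    high_share = relative_share >= 1.0
    if high_growth and high_share:
        return "Star"
    if high_growth:
        return "Question Mark"
    if high_share:
        return "Cash Cow"
    return "Dog"

print(bcg_quadrant(0.15, 1.4))  # Star
print(bcg_quadrant(0.03, 2.0))  # Cash Cow
```

Mapping each product line through such a rule is the "analyze the portfolio, then allocate resources" step the slide describes: cash cows fund question marks and stars.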
  38. Porter’s 5 Forces Model for Industries
  It draws upon industrial organization (IO) economics to derive five forces that determine the competitive intensity and therefore attractiveness of a market. Attractiveness in this context refers to the overall industry profitability. An “unattractive” industry is one in which the combination of these five forces acts to drive down overall profitability. A very unattractive industry would be one approaching “pure competition”, in which available profits for all firms are driven to normal profit.
  39. Porter’s Diamond Model
  An economic model developed by Michael Porter in his book The Competitive Advantage of Nations, where he published his theory of why particular industries become competitive in particular locations.
  40. McKinsey 7S Framework
  To check which teams work and which don't (within an organization), use this framework by the famous consulting company. It offers a strategic vision for groups, including businesses, business units, and teams. The 7 S's are structure, strategy, systems, skills, style, staff and shared values. The model is most often used as a tool to assess and monitor changes in the internal situation of an organization.
  41. Greiner Model for Organizational Growth
  Developed by Larry E. Greiner, it is helpful when examining the problems associated with growth of organizations and the impact of change on employees. It can be argued that growing organizations move through five relatively calm periods of evolution, each of which ends with a period of crisis and revolution. Each evolutionary period is characterized by the dominant management style used to achieve growth, while each revolutionary period is characterized by the dominant management problem that must be solved before growth can continue.
  42. Marketing Model
  The 4P and 4C models help you identify the marketing mix:
  4P: Products, Price, Promotion, Place
  4C: Consumers, Cost, Communication, Convenience
  43. Business Canvas Model
  The Business Model Canvas is a strategic management template for developing new or documenting existing business models. It is a visual chart with elements describing a firm’s value proposition, infrastructure, customers, and finances. It assists firms in aligning their activities by illustrating potential trade-offs.
  44. Motivation Models
  Herzberg's motivation-hygiene theory: job satisfaction and job dissatisfaction act independently of each other.
  Leading to satisfaction: Achievement, Recognition, Work itself, Responsibility, Advancement
  Leading to dissatisfaction: Company policy, Supervision, Relationship with boss, Work conditions, Salary, Relationship with peers
  45. Motivation Models: Maslow's Hierarchy of Needs
  46. Business Strategy Models http://decisionstats.com/2013/12/19/business-strategy-models/
  1. Porter's 5 Forces Model - to analyze industries
  2. Business Canvas
  3. BCG Matrix - to analyze product portfolios
  4. Porter's Diamond Model - to analyze locations
  5. McKinsey 7S Model - to analyze teams
  6. Greiner Theory - to analyze growth of organizations
  7. Herzberg Hygiene Theory - to analyze soft aspects of individuals
  47. Data Science
  What is a data scientist? A data scientist is one who has interdisciplinary skills in programming, statistics and business domains, and who creates actionable insights based on experiments or summaries from data.
  48. Data Science
  On a daily basis, a data scientist is simply a person who can write some code in one or more of the languages R, Python, Java, SQL, or Hadoop (Pig, HQL, MR) for data storage, querying, summarization, and visualization, efficiently and on time, on databases, on the cloud, or on servers, and who understands enough statistics to derive insights from data so the business can make decisions.
  What should a data scientist know? He should know how to get data, store it, query it, manage it, and turn it into actionable insights.
  49. Big Data: Social Media Analysis
  Social Network Analysis: https://rdatamining.wordpress.com/2012/05/17/an-example-of-social-network-analysis-with-r-using-package-igraph/
  50. How does information propagate through a social network? http://www.r-bloggers.com/information-transmission-in-a-social-network-dissecting-the-spread-of-a-quora-post/
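At its simplest, propagation through a social network is a breadth-first traversal of the follower graph: everyone reachable from the original poster can eventually see the post. The graph below is a hypothetical toy example, not data from the linked analysis.

```python
from collections import deque

def reach(graph, seed):
    """Breadth-first spread of a post from one seed user through follower links."""
    seen, queue = {seed}, deque([seed])
    while queue:
        user = queue.popleft()
        for follower in graph.get(user, []):
            if follower not in seen:
                seen.add(follower)
                queue.append(follower)
    return seen

# Hypothetical follower graph: author -> people who see (and reshare) the post.
graph = {"author": ["a", "b"], "a": ["c"], "b": ["c", "d"], "d": ["e"]}
print(sorted(reach(graph, "author")))  # ['a', 'author', 'b', 'c', 'd', 'e']
```

Real analyses (like the linked Quora-post dissection) add timestamps and reshare probabilities on the edges, but the reachability skeleton is the same.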
  51. Fraud Analysis
  Anomaly detection (also outlier detection) is the identification of items, events or observations which do not conform to an expected pattern or to other items in a dataset.
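A minimal anomaly-detection sketch using z-scores, with invented claim amounts. Note that the threshold matters: an extreme outlier inflates the mean and standard deviation and can mask itself at a strict cut-off, which is why robust alternatives based on the median are often preferred in practice.

```python
import statistics

def anomalies(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > threshold * sd]

# Illustrative claim amounts: one value is far outside the usual pattern.
claims = [120, 130, 110, 125, 118, 122, 5000]
print(anomalies(claims))  # [5000]
```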
  52. How they affect you: Financial Profitability
  Data storage is getting cheaper, but the way it is stored is changing (from company servers to the external cloud).
  Big data helps to store every interaction and transaction with the customer, but this also increases the complexity of data.
  Data science is getting cheaper (open source), but more skilled analytics professionals are required.
  53. How they affect you: Sales and Marketing
  Which customers to target and whom not to target (traditional propensity models)
  Where to target (geocoded)
  When to target
  Forecast demand
  54. How they affect you: Operations
  Optimize cost and logistics
  Maximize output per resource
  Can also be combined with IoT
  55. How they affect you: Human Resources
  Which employee is likely to leave first
  Which skill is most likely to be crucial in the next 12 to 24 months
  Forecast for skills and employees
  56. Insurance Examples http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-progressive-insurance-35951-1.html
  Agents increasingly want mobile enablement, and not just the ability to quote, but to bind and sell policies on smartphones and tablets. - Progressive
  Progressive Snapshot (https://www.progressive.com/auto/snapshot/): To participate, you attach the Snapshot device to the computer in your car, which collects data about your driving habits. According to Progressive, the device records your vehicle identification number (VIN), how many miles you drive each day, and how often you drive between midnight and 4 a.m. After driving with Snapshot for 30 days, you return it to Progressive and, depending on your driving habits, the company says you can get a discount of up to 30%.
  57. Insurance Examples: MassMutual http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-massmutual-35952-1.html
  Created Haven Life, an online insurance agency that uses an algorithmic underwriting tool and a series of related decisions, created in collaboration with a team of data scientists. Insurance companies are vast decision-making engines that take and manage risk. The inputs into this engine are data, and the capabilities created by the field of data science can and will impact every process in the company, from underwriting to claims management to security.
  58. Insurance Examples
  CNA is applying big data technology to workers compensation claims and adjusters’ notes. “That is a classic, unstructured big data kind of problem,” says Nate Root, SVP of CNA’s shared service organization. “We have hundreds of thousands of workers compensation claims, and claims adjuster notes, and there is tremendous value in those notes.” Root says the insurer recently began identifying workers’ compensation claims that have the potential to turn into a total disability, or partial permanent disability, without the right sort of attention. By examining the unstructured data, CNA has developed a hundred different variables that can predict a propensity for a claim to become serious, and then assign a nurse case manager to help the insured get necessary treatments for a better patient outcome, get them back to work and lower the overall cost of coverage. For example, the program can find people who are missing appointments or who are not engaged with physical therapy and should be. http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-cna-35959-1.html
  59. Insurance Examples
  American Family Insurance licensed APT’s Test & Learn software (http://www.predictivetechnologies.com/products/test-learn.aspx) to enhance customer engagement and increase support for agents. “This is a statistical tool that enables us to create and analyze statistical tests.” For example, call-routing techniques affect wait times and, ultimately, claims satisfaction. The insurer also tracks how claims are handled, and by whom, and whether agents are involved in resolution. Using APT, the insurer can isolate variables and accurately determine the success of one design vs. another for various products, geographies or demographics. http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-american-family-insurance-35953-1.html
  60. Insurance Examples http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-american-family-insurance-35953-1.html
  American Family Insurance: Unstructured data, such as that collected in call center transcripts, also can be studied to better understand what approaches are best for different situations, he says. “Hadoop and other tools enable natural-language processing and sentiment analysis,” Cruz says. “We can look for key words or patterns in those words, do counts and build models off textual indicators that enable us to identify three things: 1. when there could be fraud involved, 2. where there might be severity issues, 3. or how we can get ahead of that and plan for it.” Customer communication, web design and direct mail are other areas where the insurer is, or soon will be, using APT, asking questions such as: Do we see greater lift in these geographies vs. those?
  61. Insurance Examples
  Like MassMutual, Nationwide has partnered with a local college: Ohio State University, the university with the third-largest enrollment in the country. The Nationwide Center for Advanced Customer Insights (NCACI) gives OSU students in advanced degree programs the ability to work with real-world data to solve some of the biggest insurance business problems. Faculty and students from the marketing, statistics, psychology, economics and computer science departments work with Nationwide to develop predictive models and data mining techniques aimed at 1. improving marketing and distribution, 2. identifying consumer behavior patterns, 3. increasing customer satisfaction, and 4. increasing lifetime value.
  62. Insurance Examples
  At John Hancock, the team set out to find a way to leverage the wealth of data collected by wearable technologies, including the popular Fitbit and recently released Apple Watch, to give something back to their customers. The end result was John Hancock Vitality, a new life insurance product that offers up to a 15 percent premium discount to customers who track their healthy habits with wearables and turn that information over to the insurance company. New buyers even get their own Fitbit to begin tracking. http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-john-hancock-35954-1.html
  Fitbit Inc. is an American company known for its products of the same name, which are activity trackers: wireless-enabled wearable technology devices that measure data such as the number of steps walked, heart rate, quality of sleep, steps climbed, and other personal metrics.
  63. Insurance Examples
  Swiss Re is using more public data to improve underwriting results and decrease the number of questions the insurer has to ask consumers to underwrite them. Swiss Re is looking at big data in terms of two major streams. In the first, big data is being used to help reduce costs and improve the efficiency of current processes throughout the insurance value chain, including claims and fraud management, cyber risk, customer management, pricing, risk assessment and selection, distribution and service management, product innovation, and research and development. In the second stream, big data also offers a new framework to think bigger in terms of market disruption. Swiss Re has created more than 100 prototypes internally, and as a result the entire organization sees the value and importance of big data and smart analytics. http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-swiss-re-35957-1.html
  64. Insurance Examples
  “How do you take that operationally efficient data and turn it into a customer/household view and understand all the products attached to a person?” Allstate has focused heavily on master data management and data governance, creating party and household IDs for data. The company is also building a team to work across business areas on analytics projects rather than siloing big data projects within certain units. “Something meant for a single purpose often leads to other insights. We know, for example, based on some call-volume analysis in our call center, how often customers defect.” “We have an application in claims, QuickFoto, where a policyholder that isn’t in a major accident can snap a picture of the damage and send it to us. But whereas in the past, that would’ve gone into a physical folder and then a filing cabinet, now I have all those pictures of cars in a database, and there’s a lot more that I can do.”
  65. Questions?
  66. Data Science: Tools and Techniques for extracting maximum value from Customer Data and Interactions
  67. Agenda: Data Science Approach; Data Science Tools; Data Science Techniques
  68. Data Science Approach
  On a daily basis, a data scientist is simply a person who can write some code in one or more of the languages R, Python, Java, SQL, or Hadoop (Pig, HQL, MR) for data storage, querying, summarization, and visualization, efficiently and on time, on databases, on the cloud, or on servers, and who understands enough statistics to derive insights from data so the business can make decisions.
  69. Data Science Approach
  What should a data scientist know? He should know how to get data, store it, query it, manage it, and turn it into actionable insights. The following approach elaborates on this simple and sequential premise.
  70. Where to get Data
  A data scientist needs data to do science on, right! Some of the usual sources of data for a data scientist are:
  APIs: API is an acronym for Application Programming Interface. We cover APIs in detail in Chapter 6. APIs are how the current big data paradigm is enabled, as they enable machines to talk to and fetch data from each other programmatically. For a list of articles written by the same author on APIs, see https://www.programmableweb.com/profile/ajayohri.
  Internet Clickstream Logs: Internet clickstream logs refer to the data generated by humans when they click specific links within a webpage. This data is time stamped, and the uniqueness of the person clicking the link can be established by IP address. IP addresses can be parsed by registries like https://www.arin.net/whois or http://www.apnic.net/whois for examining location (country and city), internet service provider, and owner of the address (for website owners this can be done using the website http://who.is/). In Windows the command ipconfig, and in Linux systems ifconfig, can help us examine IP addresses. You can read more on IP addresses at http://en.wikipedia.org/wiki/IP_address. Software like Clicky (http://getclicky.com) and Google Analytics (www.google.com/analytics) also helps give us data, which can then be parsed using their APIs. (See https://code.google.com/p/r-google-analytics/ for Google Analytics using R.)
  Machine Generated Data: Machines generate a lot of data, especially from sensors that ensure the machine is working properly. This data can be logged and used with events like cracks or failures for predictive asset maintenance or M2M (Machine to Machine) analytics.
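Before the WHOIS lookup described above, an IP address pulled from a clickstream log can be validated, and private (internal) addresses separated from public ones, with Python's standard `ipaddress` module. The addresses below are examples only:

```python
import ipaddress

# Validate raw strings from a log and separate private from public addresses;
# only public addresses are worth sending to a WHOIS registry.
for raw in ["8.8.8.8", "192.168.1.10", "not-an-ip"]:
    try:
        ip = ipaddress.ip_address(raw)
        kind = "private" if ip.is_private else "public"
        print(raw, "->", kind)
    except ValueError:
        # Malformed log entries raise ValueError rather than crashing the pipeline.
        print(raw, "-> invalid")
```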
  71. Where to get Data
  Surveys: Surveys are mostly questionnaires filled in by humans. They used to be administered manually over paper, but online surveys are now the definitive trend. Surveys reveal valuable data about the current preferences of current and potential customers. They do suffer from the bias inherent in the design of questions by the creator. Since customer preferences evolve, surveys help in getting primary data about current preferences. Coupled with stratified random sampling, they can be a powerful method for collecting data. SurveyMonkey is one such company that helps create online questionnaires (https://www.surveymonkey.com/pricing/).
  Commercial Databases: Commercial databases are proprietary databases that have been collected over time and are sold/rented by vendors. They can be used for prospect calling, appending information to an existing database, and refining internal database quality.
  Credit Bureaus: Credit bureaus collect financial information about people, and this information is then available to marketing organizations (subject to legal and privacy guidelines). The cost of such information is balanced by the added information about customers.
  Social Media: Social media is a relatively new source of data and offers powerful insights, albeit through a lot of unstructured data. Companies like Datasift offer social media data, and companies like Salesforce/Radian6 offer social media tools (http://www.salesforcemarketingcloud.com/). Facebook had 829 million daily active users on average in June 2014, with 1.32 billion monthly active users. Twitter has 255 million monthly active users, and 500 million Tweets are sent per day. That generates a lot of data about what current and potential customers are thinking and writing about your products.
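The stratified random sampling mentioned above can be sketched in a few lines: split the population into strata (here, an invented customer-segment field) and draw the same fraction from each, so small segments are not drowned out by large ones.

```python
import random

def stratified_sample(population, strata_key, frac, seed=42):
    """Draw the same fraction from every stratum so each group is represented."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    strata = {}
    for item in population:
        strata.setdefault(strata_key(item), []).append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(frac * len(members)))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Illustrative population: 80 retail and 20 corporate customers.
customers = [{"id": i, "segment": "retail" if i < 80 else "corporate"} for i in range(100)]
picked = stratified_sample(customers, lambda c: c["segment"], frac=0.1)
print(len(picked))  # 8 retail + 2 corporate = 10
```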
  72. Where to process data?
  Now you have the data. We need computers to process it.
  Local Machine: The benefit of storing the data on a local machine is ease of access. The potential risks include machine outages, data recovery, data theft (especially for laptops) and limited scalability. A local machine is also much more expensive in terms of processing and storage and gets obsolete within a relatively short period of time.
  Server: Servers respond to requests across networks. They can be thought of as centralized resources that help cut down the cost of processing and storage. They can be an intermediate solution between local machines and clouds, though they have huge capital expenditure upfront. Not all data that can fit on a laptop should be stored on a laptop. You can store data in virtual machines on your server and connect through thin-shell clients with secure access.
  Cloud: The cloud can be thought of as a highly scalable, metered service that allows requests from remote networks. It can be thought of as a large bank of servers, but that is a simplistic definition. A hindrance to cloud adoption is resistance within existing IT departments whose members are not trained to transition to, and maintain, the network over the cloud as they used to do for enterprise networks.
  73. 73. Cloud Computing Providers We expand on the cloud processing part. Amazon EC2- Amazon Elastic Compute Cloud (Amazon EC2) provides scalable processing power in the cloud. It has a web-based management console and a command line tool, and offers resources for Linux and Windows virtual images. Further details are available at http://aws.amazon.com/ec2/ . Amazon EC2 is generally considered the industry leader. For beginners, a 12-month basic preview is available for free at http://aws.amazon.com/free/ that can allow practitioners to build up familiarity. Google Compute- https://cloud.google.com/products/compute-engine/ Microsoft Azure- https://azure.microsoft.com/en-us/pricing/details/virtual-machines/ Azure Virtual Machines enable you to deploy Windows Server, Linux, or third-party software images to Azure. You can select images from a gallery or bring your own customized images. Virtual Machines are charged for by the minute. Discounts can range from 20% to 32% depending on whether you prepay 6-month or 12-month plans, and on usage tier. IBM shut down its SmartCloud Enterprise cloud computing platform by Jan. 31, 2014 and migrated those customers to its SoftLayer cloud computing platform, an IBM-acquired company https://www.softlayer.com/virtual-servers Oracle- Oracle's plans for the cloud are still in preview for enterprise customers at https://cloud.oracle.com/compute
  74. 74. Where to store data We need to store data in a secure and reliable environment for speedy and repeated access. There is a cost of storing this data, and there is a cost of losing the data due to a technical accident. You can store data in the following ways: csv files, spreadsheets and text files locally, especially for smaller files (note that while this increases ease of access, it also creates problems of version control as well as security of confidential data); relational databases (RDBMS) and data warehouses; Hadoop-based storage
  75. 75. Where to store data noSQL databases- are non-relational, distributed, open-source and horizontally scalable. A complete list of NoSQL databases is at http://nosql-database.org/ . Notable NoSQL databases are MongoDB, CouchDB et al. key-value stores- Key-value stores use the map or dictionary as their fundamental data model. In this model, data is represented as a collection of key-value pairs, such that each possible key appears at most once in the collection. Redis- Redis is an open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets (http://redis.io/). Riak is an open source, distributed database: http://basho.com/riak/. MemcacheDB is a persistence-enabled variant of memcached. column-oriented databases cloud storage
  76. 76. Cloud Storage Amazon- Amazon Simple Storage Service (S3)- Amazon S3 provides a simple web-services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. http://aws.amazon.com/s3/ . Cost is a maximum of 3 cents per GB per month. There are three types of storage: Standard Storage, Reduced Redundancy Storage, and Glacier Storage. Reduced Redundancy Storage (RRS) is a storage option within Amazon S3 that enables customers to reduce their costs by storing non-critical, reproducible data at lower levels of redundancy than Amazon S3’s standard storage. Amazon Glacier stores data for as little as $0.01 per gigabyte per month, and is optimized for data that is infrequently accessed and for which retrieval times of 3 to 5 hours are suitable. These details can be seen at http://aws.amazon.com/s3/pricing/ Google- Google Cloud Storage https://cloud.google.com/products/cloud-storage/ . It also has two kinds of storage. Durable Reduced Availability Storage enables you to store data at lower cost, with the tradeoff of lower availability than standard Google Cloud Storage. Prices are 2.6 cents for Standard Storage (GB/month) and 2 cents for Durable Reduced Availability (DRA) Storage (GB/month). They can be seen at https://developers.google.com/storage/pricing#storage-pricing Azure- Microsoft has different terminology for its cloud infrastructure. Storage is classified into three types, with a fourth type (Files) available as a preview. There are three levels of redundancy: Locally Redundant Storage (LRS), Geographically Redundant Storage (GRS), and Read-Access Geographically Redundant Storage (RA-GRS). You can see details and prices at https://azure.microsoft.com/en-us/pricing/details/storage/ Oracle- Storage is available at https://cloud.oracle.com/storage and costs around $30/TB per month
  77. 77. Databases on the Cloud- Amazon Amazon RDS- Managed MySQL, Oracle and SQL Server databases. http://aws.amazon.com/rds/ While relational database engines provide robust features and functionality, scaling them requires significant time and expertise. DynamoDB- Managed NoSQL database service. http://aws.amazon.com/dynamodb/ Amazon DynamoDB focuses on providing seamless scalability and fast, predictable performance. It runs on solid state disks (SSDs) for low-latency response times, and there are no limits on the request capacity or storage size for a given table. This is because Amazon DynamoDB automatically partitions your data and workload over a sufficient number of servers to meet the scale requirements you provide. Redshift- A managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. You can start small for just $0.25 per hour and scale to a petabyte or more for $1,000 per terabyte per year. http://aws.amazon.com/redshift/ SimpleDB- A highly available and flexible non-relational data store that offloads the work of database administration. Developers simply store and query data items via web services requests http://aws.amazon.com/simpledb/. A table in Amazon SimpleDB has a strict storage limitation of 10 GB and is limited in the request capacity it can achieve (typically under 25 writes/second); it is up to you to manage the partitioning and re-partitioning of your data over additional SimpleDB tables if you need additional scale. While SimpleDB has scaling limitations, it may be a good fit for smaller workloads that require query flexibility. Amazon SimpleDB automatically indexes all item attributes and thus supports query flexibility at the cost of performance and scale.
  78. 78. Databases on the Cloud- Others Google Google Cloud SQL- Relational databases in Google's cloud https://developers.google.com/cloud-sql/ Google Cloud Datastore- Managed NoSQL data storage service https://developers.google.com/datastore/ Google BigQuery- Enables you to write queries on huge datasets. BigQuery uses a columnar data structure, which means that for a given query, you are only charged for data processed in each column, not the entire table https://cloud.google.com/products/bigquery/ Azure SQL Database https://azure.microsoft.com/en-in/services/sql-database/ SQL Database is a relational database service in the cloud based on the Microsoft SQL Server engine, with mission-critical capabilities. Because it’s based on the SQL Server engine, SQL Database supports existing SQL Server tools, libraries and APIs, which makes it easier for you to move and extend to the cloud.
  79. 79. Basic Statistics Some of the basic statistics that every data scientist should know are given here. This assumes rudimentary knowledge of statistics (like measures of central tendency or variation) and basic familiarity with some of the terminology used by statisticians. Random Sampling- In truly random sampling, the sample should be representative of the entire data. Random sampling remains relevant in the era of Big Data and Cloud Computing. Distributions- A data scientist should know the common distributions (normal, Poisson, Chi Square, F) and also how to determine the distribution of data. Hypothesis Testing- Hypothesis testing is meant for statistically testing assumptions about values of central tendency (mean, median) or variation. A good example of easy-to-use software for statistical testing is the “test” tab in the Rattle GUI in R. Outliers- Checking for outliers is a good way for a data scientist to spot anomalies as well as assess data quality. The box plot (exploratory data analysis) and the outlierTest function from the car package (Bonferroni Outlier Test) are ways statistical rigor can be brought to outlier detection.
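As a sketch of the outlier-checking idea (a simple z-score rule, not the Bonferroni test from the car package mentioned above), the following Python snippet flags points far from the mean; the data and the 2-standard-deviation threshold are illustrative choices, not from the talk:

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Flag points more than `threshold` sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > threshold * s]

data = [10, 12, 11, 13, 12, 11, 10, 95]  # 95 is an obvious anomaly
print(zscore_outliers(data, threshold=2.0))  # -> [95]
```

A box plot, as the slide suggests, is the visual counterpart of the same check.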
  80. 80. Basic Techniques Some of the basic techniques that a data scientist must know are listed as follows- Text Mining- In text mining, text data is analyzed for frequencies, associations and correlation for predictive purposes. The tm package from R greatly helps with text mining. Sentiment Analysis- In sentiment analysis, text data is classified based on a sentiment lexicon (e.g. one that says happy is less positive than delighted but more positive than sad) to create sentiment scores of the text data mined. Social Network Analysis- In social network analysis, the direction of relationships, the quantum of messages and the study of nodes, edges and graphs give insights. Time Series Forecasting- Data is said to be autoregressive with regard to time if a future value is dependent on a current value of a variable. Techniques such as ARIMA and exponential smoothing, and R packages like forecast, greatly assist in time series forecasting. Web Analytics Social Media Analytics Data Mining or Machine Learning
  81. 81. Data Science Tools - R - Python - Tableau - Spark with ML - Hadoop (Pig and Hive) - SAS - SQL
  82. 82. R R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes an effective data handling and storage facility, a suite of operators for calculations on arrays, in particular matrices, a large, coherent, integrated collection of intermediate tools for data analysis, graphical facilities for data analysis and display either on-screen or on hardcopy, and a well-developed, simple and effective programming language https://www.r-project.org/about.html
  83. 83. Python http://python-history.blogspot.in/ and https://www.python.org/
  84. 84. SAS http://www.sas.com/en_in/home.html
  85. 85. Big Data: Hadoop Stack with Spark http://spark.apache.org/ Apache Spark™ is a fast and general engine for large-scale data processing.
  86. 86. Big Data: Hadoop Stack with Mahout https://mahout.apache.org/ The Apache Mahout™ project's goal is to build an environment for quickly creating scalable performant machine learning applications. Apache Mahout Samsara Environment includes Distributed Algebraic optimizer R-Like DSL Scala API Linear algebra operations Ops are extensions to Scala IScala REPL based interactive shell Integrates with compatible libraries like MLLib Runs on distributed Spark, H2O, and Flink Apache Mahout Samsara Algorithms included Stochastic Singular Value Decomposition (ssvd, dssvd) Stochastic Principal Component Analysis (spca, dspca)
  87. 87. Big Data: Hadoop Stack with Mahout https://mahout.apache.org/ Apache Mahout software provides three major features: A simple and extensible programming environment and framework for building scalable algorithms A wide variety of premade algorithms for Scala + Apache Spark, H2O, Apache Flink Samsara, a vector math experimentation environment with R-like syntax which works at scale
  88. 88. Data Science Techniques - Machine Learning - Regression - Logistic Regression - K Means Clustering - Association Analysis - Decision Trees - Text Mining
  89. 89. What is an algorithm a process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer. a self-contained step-by-step set of operations to be performed a procedure or formula for solving a problem, based on conducting a sequence of specified actions a procedure for solving a mathematical problem (as of finding the greatest common divisor) in a finite number of steps that frequently involves repetition of an operation; broadly: a step-by-step procedure for solving a problem or accomplishing some end, especially by a computer.
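The greatest-common-divisor problem mentioned in the definition above has a classic finite step-by-step procedure, Euclid's algorithm, which can be written in a few lines of Python:

```python
def gcd(a, b):
    """Euclid's algorithm: replace (a, b) with (b, a mod b) until the remainder is zero."""
    while b:
        a, b = b, a % b
    return a

print(gcd(48, 36))  # -> 12
```

Each loop iteration repeats the same operation on smaller numbers, which is exactly the "repetition of an operation in a finite number of steps" in the definition.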
  90. 90. Machine Learning Machine learning concerns the construction and study of systems that can learn from data. For example, a machine learning system could be trained on email messages to learn to distinguish between spam and non-spam messages. Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. This distinguishes unsupervised learning from supervised learning. The corresponding unsupervised procedure is known as clustering or cluster analysis, and involves grouping data into categories based on some measure of inherent similarity (e.g. the distance between instances, considered as vectors in a multi-dimensional vector space).
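The spam example above can be sketched with a deliberately tiny supervised learner: a word-overlap classifier trained on labeled messages. The messages and the train/classify helpers are made up for illustration; a real spam filter would use a proper model (e.g. naive Bayes):

```python
# Toy supervised learning: each training example is a (text, label) pair,
# and a new message is labeled by which class's vocabulary it shares more words with.
def train(examples):
    vocab = {"spam": set(), "ham": set()}
    for text, label in examples:
        vocab[label].update(text.lower().split())
    return vocab

def classify(vocab, text):
    words = set(text.lower().split())
    spam_hits = len(words & vocab["spam"])
    ham_hits = len(words & vocab["ham"])
    return "spam" if spam_hits > ham_hits else "ham"

training = [
    ("win free money now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("monthly project report attached", "ham"),
]
model = train(training)
print(classify(model, "free money prize"))        # -> spam
print(classify(model, "project meeting report"))  # -> ham
```

An unsupervised learner would instead receive the same messages without labels and have to group them by similarity on its own.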
  91. 91. CRAN VIEW Machine Learning http://cran.r-project.org/web/views/MachineLearning.html
  92. 92. Machine Learning in Python http://scikit-learn.org/stable/
  93. 93. Classification In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. The individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables, features, etc. These properties may variously be categorical (e.g. "A", "B", "AB" or "O", for blood type), ordinal (e.g. "large", "medium" or "small"), integer-valued (e.g. the number of occurrences of a particular word in an email) or real-valued (e.g. a measurement of blood pressure). Some algorithms work only in terms of discrete data and require that real-valued or integer-valued data be discretized into groups (e.g. less than 5, between 5 and 10, or greater than 10).
  94. 94. Regression regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables – that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent variables.
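The conditional-expectation idea above can be made concrete with the closed-form ordinary least squares fit for a single predictor. The function name linfit and the data points are illustrative, not from the talk:

```python
def linfit(xs, ys):
    """Ordinary least squares fit of y = m*x + b for one predictor (closed form)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - m * mx  # intercept so the line passes through the mean point
    return m, b

m, b = linfit([1, 2, 3, 4], [3, 5, 7, 9])  # data lies exactly on y = 2x + 1
print(m, b)  # -> 2.0 1.0
```

The fitted line gives the average value of the dependent variable at each fixed value of the independent variable, which is what the definition above calls the conditional expectation.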
  95. 95. kNN
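A minimal k-nearest-neighbours classifier can be written in plain Python (requires Python 3.8+ for math.dist); the points and labels below are invented for illustration, and real work would use a library such as scikit-learn or the class package in R:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest labeled training points."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_predict(train, (2, 2), k=3))  # -> A
```

kNN is a lazy learner: there is no training step beyond storing the labeled points, and all work happens at prediction time.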
  96. 96. Support Vector Machines http://axon.cs.byu.edu/Dan/678/miscellaneous/SVM.example.pdf
  97. 97. Association Rules http://en.wikipedia.org/wiki/Association_rule_learning Based on the concept of strong rules, Rakesh Agrawal et al.[2] introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, a rule found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, he or she is likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as, e.g., promotional pricing or product placements. In addition to the above example from market basket analysis, association rules are employed today in many application areas including web usage mining, intrusion detection, continuous production, and bioinformatics. As opposed to sequence mining, association rule learning typically does not consider the order of items either within a transaction or across transactions. Concepts- Support, Confidence, Lift In R: apriori() in the arules package In Python: http://orange.biolab.si/docs/latest/reference/rst/Orange.associate/
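Support, confidence and lift for an onions-and-potatoes style rule can be computed directly on a small basket of transactions; the baskets below are invented for illustration, not real point-of-sale data:

```python
# Candidate rule: X -> Y, here {onions, potatoes} -> {burger}.
transactions = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes", "burger", "beer"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"burger", "beer"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    """Of the transactions containing X, what fraction also contain Y."""
    return support(x | y) / support(x)

def lift(x, y):
    """How much more often X and Y co-occur than if they were independent."""
    return confidence(x, y) / support(y)

x, y = {"onions", "potatoes"}, {"burger"}
print(support(x | y), confidence(x, y), lift(x, y))  # 0.4, ~0.667, ~1.11
```

A lift above 1 suggests the rule reflects a genuine association rather than chance co-occurrence; apriori() in arules computes the same measures at scale.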
  98. 98. Gradient Descent Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. http://econometricsense.blogspot.in/2011/11/gradient-descent-in-r.html Start at some x value, use derivative at that value to tell us which way to move, and repeat. Gradient descent. http://www.cs.colostate.edu/%7Eanderson/cs545/Lectures/week6day2/week6day2.pdf
  99. 99. Gradient Descent https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/ A standard approach to solving this type of problem is to define an error function (also called a cost function) that measures how “good” a given line is. initial_b = 0 # initial y-intercept guess initial_m = 0 # initial slope guess num_iterations = 1000
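The error-function-plus-iteration setup above can be sketched for simple linear regression, keeping the initial_b / initial_m / num_iterations naming from the slide; the data points and learning rate are illustrative choices:

```python
# Gradient descent on the mean-squared-error cost of y = m*x + b.
points = [(1, 3), (2, 5), (3, 7), (4, 9)]  # lies exactly on y = 2x + 1

def step(b, m, points, learning_rate):
    """One update: move (b, m) against the gradient of the cost."""
    n = len(points)
    b_grad = sum(-(2 / n) * (y - (m * x + b)) for x, y in points)
    m_grad = sum(-(2 / n) * x * (y - (m * x + b)) for x, y in points)
    return b - learning_rate * b_grad, m - learning_rate * m_grad

b, m = 0.0, 0.0          # initial_b, initial_m: starting guesses
for _ in range(1000):    # num_iterations
    b, m = step(b, m, points, learning_rate=0.05)
print(round(m, 2), round(b, 2))  # converges close to 2.0 and 1.0
```

Each step moves the parameters a small distance against the gradient, so the cost decreases until the line fits the data.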
  100. 100. Decision Trees http://select.cs.cmu.edu/class/10701-F09/recitations/recitation4_decision_tree.pdf
  101. 101. Decision Trees Http://www.ise.bgu.ac.il/faculty/liorr/hbchap9.pdf
  102. 102. Random Forest Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest). Each tree is grown as follows: 1.If the number of cases in the training set is N, sample N cases at random - but with replacement, from the original data. This sample will be the training set for growing the tree. 2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing. 3. Each tree is grown to the largest extent possible. There is no pruning. In the original paper on random forests, it was shown that the forest error rate depends on two things: The correlation between any two trees in the forest. Increasing the correlation increases the forest error rate. The strength of each individual tree in the forest. A tree with a low error rate is a strong classifier. Increasing the strength of the individual trees decreases the forest error rate. https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#intro
  103. 103. Bagging Bagging, aka bootstrap aggregation, is a relatively simple way to increase the power of a predictive statistical model by taking multiple random samples(with replacement) from your training data set, and using each of these samples to construct a separate model and separate predictions for your test set. These predictions are then averaged to create a, hopefully more accurate, final prediction value. http://www.vikparuchuri.com/blog/build-your-own-bagging-function-in-r/
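The resample-fit-average mechanics described above can be sketched with a deliberately simple base learner (a 1-nearest-neighbour regressor); the data set, seed and number of models are invented for illustration:

```python
import random

random.seed(42)  # fixed seed so the bootstrap draws are reproducible

def one_nn_predict(sample, x):
    """Base learner: predict the y of the nearest x in one bootstrap sample."""
    return min(sample, key=lambda p: abs(p[0] - x))[1]

def bagged_predict(data, x, n_models=50):
    """Bagging: fit the base learner on bootstrap resamples (with replacement)
    and average the resulting predictions."""
    preds = []
    for _ in range(n_models):
        sample = [random.choice(data) for _ in data]
        preds.append(one_nn_predict(sample, x))
    return sum(preds) / len(preds)

data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.1), (5, 9.8)]
print(bagged_predict(data, 2.5))  # averaged estimate near the middle of the data
```

Averaging over resamples smooths out the jumpy predictions a single nearest-neighbour model would make, which is the variance reduction bagging is used for.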
  104. 104. Boosting Boosting is one of several classic methods for creating ensemble models, along with bagging, random forests, and so forth. Boosting means that each tree is dependent on prior trees, and learns by fitting the residual of the trees that preceded it. Thus, boosting in a decision tree ensemble tends to improve accuracy with some small risk of less coverage. XGBoost is a library designed and optimized for boosting trees algorithms. XGBoost is used in more than half of the winning solutions in machine learning challenges hosted at Kaggle. http://xgboost.readthedocs.io/en/latest/model.html# And http://dmlc.ml/rstats/2016/03/10/xgboost.html
  105. 105. Data Science Process By Farcaster at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=40129394
  106. 106. LTV Analytics Life Time Value (LTV) will help us answer 3 fundamental questions: 1. Did you pay enough to acquire customers from each marketing channel? 2. Did you acquire the best kind of customers? 3. How much could you spend on keeping them sweet with email and social media?
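A common back-of-the-envelope LTV formula multiplies average order value, purchase frequency and expected customer lifespan, optionally scaled by margin. The figures below are made up for illustration and are not from the case study that follows:

```python
def lifetime_value(avg_order_value, purchases_per_year, years_retained, margin=1.0):
    """Simple LTV: revenue per year times years retained, scaled by margin."""
    return avg_order_value * purchases_per_year * years_retained * margin

# A hypothetical customer: $50 orders, 4 per year, retained 5 years, 20% margin.
ltv = lifetime_value(avg_order_value=50.0, purchases_per_year=4,
                     years_retained=5, margin=0.2)
print(ltv)  # -> 200.0
```

Comparing this number to the cost of acquiring a customer from each marketing channel answers the first question above.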
  107. 107. LTV Analytics :Case Study https://blog.kissmetrics.com/how-to-calculate-lifetime-value/
  108. 108. LTV Analytics https://blog.kissmetrics.com/how-to-calculate-lifetime-value/
  109. 109. LTV Analytics https://blog.kissmetrics.com/how-to-calculate-lifetime-value/
  110. 110. LTV Analytics https://blog.kissmetrics.com/how-to-calculate-lifetime-value/
  111. 111. LTV Analytics http://www.kaushik.net/avinash/analytics-tip-calculate-ltv-customer-lifetime-value/
  112. 112. LTV Analytics Download the zip file from http://www.kaushik.net/avinash/avinash_ltv.zip
  113. 113. Pareto principle The Pareto principle (also known as the 80–20 rule, the law of the vital few, and the principle of factor sparsity) states that, for many events, roughly 80% of the effects come from 20% of the causes 80% of a company's profits come from 20% of its customers 80% of a company's complaints come from 20% of its customers 80% of a company's profits come from 20% of the time its staff spend 80% of a company's sales come from 20% of its products 80% of a company's sales are made by 20% of its sales staff Several criminology studies have found 80% of crimes are committed by 20% of criminals.
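The 80/20 claim is easy to check on any revenue table by ranking customers and summing the top 20% share; the sales figures below are invented for illustration:

```python
def top_share(values, fraction=0.2):
    """Share of the total contributed by the top `fraction` of values."""
    ranked = sorted(values, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)

sales = [500, 450, 20, 15, 10, 8, 5, 4, 2, 1]  # revenue from ten customers
print(round(top_share(sales), 2))  # top 2 customers' share -> 0.94
```

Here the top 20% of customers account for about 94% of revenue, an even more skewed split than the classic 80/20.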
  114. 114. RFM Analysis RFM is a method used for analyzing customer value. Recency - How recently did the customer purchase? Frequency - How often do they purchase? Monetary Value - How much do they spend? A method Recency = 10 - the number of months that have passed since the customer last purchased Frequency = number of purchases in the last 12 months (maximum of 10) Monetary = value of the highest order from a given customer (benchmarked against $10k) Alternatively, one can create categories for each attribute. For instance, the Recency attribute might be broken into three categories: customers with purchases within the last 90 days; between 91 and 365 days; and longer than 365 days. Such categories may be arrived at by applying business rules, or using a data mining technique, to find meaningful breaks. A commonly used shortcut is to use deciles. One is advised to look at distribution of data before choosing breaks.
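The scoring method described above can be sketched directly; rfm_score is a hypothetical helper and the cut-offs follow the slide's benchmarks (10 months for recency, 10 purchases for frequency, $10k for monetary value):

```python
def rfm_score(last_purchase_months_ago, purchases_last_12m, highest_order):
    """Score a customer on Recency, Frequency and Monetary value, each on a 0-10 scale."""
    recency = max(0, 10 - last_purchase_months_ago)     # 10 minus months since last purchase
    frequency = min(purchases_last_12m, 10)             # purchases in last 12 months, capped
    monetary = min(highest_order / 10000 * 10, 10)      # benchmarked against $10k
    return recency, frequency, monetary

print(rfm_score(last_purchase_months_ago=2, purchases_last_12m=6, highest_order=2500))
# -> (8, 6, 2.5)
```

As the slide notes, an alternative is to bucket each attribute into categories (e.g. deciles) after inspecting the distribution, rather than using fixed benchmarks like these.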
  115. 115. Are you ready To use more Data Science