2. Smarter Business 2012
Mobility – bring your own device
Smarter Analytics
Social Collaboration
Smarter Security
Smarter Cities
Insight to Action – Big Data: Challenge and Opportunity
Smarter Commerce & Marketing
Smarter Product Innovation
Smarter Process Optimization
Smarter Infrastructure Management
Automation
3. Agenda
10:30 IBM Big Data Platform
Flemming Bagger, Big Data Analytics Leader, Nordic
11:15 Break
11:30 Achieve concrete results with Big Data Analytics
Lauren Walker, Big Data Analytics Leader, Europe
12:15 Lunch
13:30 Success or failure? How to handle Big Data in the financial sector
Keith Prince, EMEA Industry Solutions Executive, Financial Services, IBM
14:15 Break
14:30 Data collection and monitoring across social media
Ulrik Bo Larsen, Founder & CEO, FALCON Social
15:10 Wrap-up
Nothing illustrates the breakthrough of Twitter better than a simple comparison between two Olympic Games. To put it in perspective, look at the time between the Beijing Olympics in 2008 and the 2012 London Games. +CLICK+ As the London Olympics begin, there are over 500 million active users on Twitter, pushing out over 400 million tweets a day. This is a massive increase from the six million Twitter users during the Beijing Olympics in 2008, who pushed out about 300,000 tweets per day. How big? +CLICK+ The number of Twitter users between these two points in time increased by 83x and the number of tweets by a whopping 1,333x!
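The multipliers quoted above follow directly from the figures in the talk; a quick sanity check:

```python
# Sanity-check the growth multipliers quoted above, using the
# approximate figures cited in the talk.
beijing_users, london_users = 6_000_000, 500_000_000
beijing_tweets, london_tweets = 300_000, 400_000_000  # tweets per day

user_growth = london_users / beijing_users    # ~83x
tweet_growth = london_tweets / beijing_tweets  # ~1333x

print(f"Users grew {user_growth:.0f}x, tweets grew {tweet_growth:.0f}x")
```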
C&A is a Brazilian retailer that has ‘Smart Hangers’, where shoppers can Like a piece of clothing and see the number of Likes an item has on Facebook. On the C&A Web site, each piece of clothing has its own post, and the Likes just keep piling up. While folks can get click-happy – so it’s tough to tell just how popular something really is – the point here is where we are headed.
Obviously, there are many other forms of data. Let’s start with the hottest topic associated with Big Data today: social networks. Twitter generates about 12 terabytes of tweet data – every single day. Now, keep in mind, these numbers are hard to keep accurate, so the point is that they’re big, right? Don’t fixate on the actual numbers, because they change all the time; even if these numbers are out of date by two years, they are too staggering to handle exclusively using traditional approaches. +CLICK+ Facebook over a year ago was generating 25 terabytes of log data every day (Facebook log data reference: http://www.datacenterknowledge.com/archives/2009/04/17/a-look-inside-facebooks-data-center/) and probably about 7 to 8 terabytes of data that goes up on the Internet. +CLICK+ Google, who knows? Look at Google Plus, YouTube, Google Maps, and all that kind of stuff. So that’s the left hand of this chart – the social network layer. +CLICK+ Now let’s get back to instrumentation: there is a massive proliferation of technologies that allow us to be more interconnected than at any point in the history of the world – and it isn’t just P2P (people to people) interconnections, it’s M2M (machine to machine) as well. Again, with these numbers, who cares what the current number is; I try to keep them updated, but the point is that even if they are out of date, it’s almost unimaginable how large these numbers are. Over 4.6 billion camera phones that leverage built-in GPS to tag your location or your photos, purpose-built GPS devices, smart meters. If you recall the bridge that collapsed in Minneapolis a number of years ago in the USA, it was rebuilt with smart sensors inside it that measure the contraction of the concrete based on weather conditions, ice build-up, and so much more. So I didn’t realise how true it was when Sam P launched Smart Planet: I thought it was a marketing play.
But truly the world is more instrumented, interconnected, and intelligent than it’s ever been before, and this capability allows us to address new problems and gain new insight never before thought possible – and that’s what the Big Data opportunity is going to be all about!
Big data comes from many sources. It’s much more than traditional data sources, and in order to capitalize on the breakthrough opportunities we’ve discussed, you definitely need to look beyond traditional sources. But at the same time, don’t forget that big data comes from those traditional sources too. Transactional data and application data are growing at a significant rate. Although it’s structured, that data is large and is contained in many different structures. Big data includes machine data – logs, web logs, instrumentation data, network data. Data generated by machines is multiplying quickly, and it contains valuable insights that need to be discovered. Social data also needs to be incorporated. Most social data is really textual data, and the valuable insights remain locked within that text and its many possible meanings. And most of that data isn’t valuable, or has a very short expiry date during which it is valuable. That makes social data very challenging – extracting insight from largely textual content in very little time. And enterprise content must be amalgamated as well. That data comes in many forms, and also in significant volume.
Big data has 4 key characteristics. The first is volume. This may seem obvious, but it is more complex than you may think. Yes, the volume of data is growing: experts predict that the volume of data in the world will grow to 25 zettabytes by 2020, and that same phenomenon affects every business – their data is growing at the same exponential rate too. But it isn’t just the volume of data that is growing; it’s the number of sources of that data. And that leads to the third characteristic of big data, variety, which we will cover later. The second characteristic is velocity – data is accelerating in the rate at which it is created and at which it is integrated. We’ve moved from batch to a real-time business. Data comes at you at a record or byte level, not always in bulk. And the demands of the business have increased as well – from an answer next week to an answer in a minute. The world is also becoming more instrumented and interconnected, and the volume of data streaming off those instruments is exponentially larger than it was even two years ago. Variety presents an equally difficult challenge. The growth in data sources has fuelled the growth in data types. In fact, 80% of the world’s data is unstructured, yet most traditional methods apply analytics only to structured information. And finally we have veracity. How can you act upon information if you don’t trust it? Establishing trust in big data presents a huge challenge as the sources and the variety grow.
In this slide you can see a graph – it’s not to scale, but you get the point – showing that the percentage of data available to an enterprise is growing enormously; you can see that in the top bar. And as the amount of data available to an organization grows, the percentage of data that the organization can actually process is decreasing. It’s kind of like we’re getting “dumber” as organizations – in proportion to the data we are collecting, we understand less and less of it. +CLICK+ I call the shaded area between these opposite trending lines “The Blind Spot”: it contains signals and noise. This area has got all this data in there, and perhaps it would make sense for us to ingest it into our traditional analytic systems, but we don’t know whether that data will yield value or not – it’s a blind spot. We have a hunch that there is value in there, but truly we have no idea what’s in the shaded area. Furthermore, while we suspect there is value in here, we know it’s not all going to be useful, so how do we sift through the noise to find the signals? Can we start ingesting 10 TB of data a day and ask the CIO for her or his approval to triple OPEX and CAPEX costs on a hunch? We have to find a way to find the signals within all the noise in a cost-effective manner. Now, if we can leverage some new approach to find the value in the blind spot at a relatively low cost – if we could tie together things like Big Data social media around the core trusted information that we know about our customers, and drop the stuff that isn’t related to what the business is trying to accomplish – we could really start to monetize relationships and intent, not just transactions. And that’s the difference, right? Do we monetize intent and relationships? That’s a problem domain that includes Big Data. In the previous paragraph I just gave a ubiquitous example, since social media is so obviously tied to Big Data.
But you can imagine this dichotomy in any industry. For example, think Oil and Gas (O&G) drill well readings streaming in – and wanting to apply analytics to that with geological data that is unstructured and comes from other sources in various formats and is likely often changing (from an attribute perspective). Harvesting wind energy, traffic patterns, and more.
Another reason that big data is a hot topic in the market today is the new technology that enables an organization to take advantage of the natural resource of big data. Big data itself isn’t new – it’s been here for a while and is growing exponentially. What is new is the technology to process and analyze it. The purpose of big data technology is to cost-effectively manage and analyze all of the available data – any data, as is. If you want to analyze structured data, then structure it. If you want to analyze an acoustic file, then analyze the acoustic file with appropriate analytics. You’ll see the wide variety of sources of big data. It comes from our traditional systems – billing systems, ERP systems, CRM systems. It also comes from machine data – from RFID tags, sensors, network switches. And it comes from humans – website data, social media, etc.
Key Points
Many use cases require multiple technologies to address big data challenges:
- Pre-processing – ingest multiple data types, structure the data, identify insights, then store those insights in a structured DW
- Combined structured and unstructured – a structured DW and an unstructured Hadoop system analyzing data and sharing insights back and forth
- High velocity and historical – stream computing to analyze in-motion data, storing insights in a structured DW for deeper analysis and/or reporting
- Reuse structured – unload structured data into Hadoop and experiment; some companies have found entirely new uses for data that could become new service offerings (e.g., a large bank discovered that it can segment its client base by financial profile and potentially offer a service telling customers how they rate vs. their profile – e.g., you have a 20% higher mortgage than clients in your financial profile)
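The pre-processing pattern above can be sketched in a few lines: ingest semi-structured input, extract a structured insight, and stage it in a tabular form a warehouse loader could consume. This is a minimal illustration only – the log format and field names are invented for the example, not taken from any IBM product.

```python
import csv, io, re

# Minimal sketch of the "pre-processing" pattern: ingest raw,
# semi-structured log lines, keep only the events of interest,
# and stage them as structured rows for a DW load.
# The log format and field names are illustrative assumptions.
raw_logs = [
    "2012-11-05 10:31:02 user=alice action=search q=winter+coats",
    "2012-11-05 10:31:09 user=bob action=purchase item=4711 amount=59.95",
    "2012-11-05 10:32:44 user=alice action=purchase item=1234 amount=19.50",
]

LINE = re.compile(r"^(\S+ \S+) user=(\S+) action=(\S+)")

def extract(lines):
    """Keep only purchase events; structure them as (ts, user, amount)."""
    rows = []
    for line in lines:
        m = LINE.match(line)
        if m and m.group(3) == "purchase":
            amount = float(line.rsplit("amount=", 1)[1])
            rows.append((m.group(1), m.group(2), amount))
    return rows

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["timestamp", "user", "amount"])  # DW staging header
writer.writerows(extract(raw_logs))
print(buf.getvalue())
```

In a real deployment this extraction step would run at scale on Hadoop, with the resulting structured insights loaded into the warehouse.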
Let’s first look at unlocking big data. The customer need is to understand existing data sources without moving any of the data – to discover, navigate, view, and search big data in a federated manner. One customer was able to get up and running in a few months to search and navigate big data across many existing sources. This type of implementation can yield significant business value – from cutting the manual effort to search and retrieve big data, to gaining a better understanding of existing sources of big data before further analysis. The payback period is often short. Customer example – Procter & Gamble …. The entry point in the big data platform is Vivisimo Velocity – it enables federated search and navigation.
Next we have a pain point around analyzing raw data. The primary need is to analyze unstructured, or semi-structured, data from one or multiple sources. Often the content is textual – and the meaning is hidden within the text. Another common need is to combine different data types – structured and unstructured – for combined analysis. Customers often gain significant value from this approach – they unlock insights that were previously unknown. Those insights can be the key to retaining a valuable customer, to identifying a previously undetected fraud, or to discovering a game-changing efficiency in operational processes. One client, a financial services regulatory organization, analyzed a variety of new data sources and integrated the insights with their existing data warehouse to further enhance their risk modeling processes. The big data platform entry point is InfoSphere BigInsights, a Hadoop-based analytics system.
Often data warehouse environments are anything but simple. Warehouses can become glutted with data and not be well suited to any one particular task. Organizations are often hampered by poor analytics performance – queries can take hours or even days to run. And the cost of the data warehouse, and of improving its performance, can be prohibitively high. The value is striking. Many organizations realize a 10 to 100 times performance boost on deep analytics: queries that took hours now take minutes. So the cost and performance benefit is significant – and the efficiency of employees is boosted. It’s also extremely simple to install and administer, yielding significantly lower administration costs. One customer example is Catalina Marketing, which executes 10x the amount of predictive workloads with the same staffing level. The entry point for this pain point is IBM Netezza.
Hadoop is a cost-efficient platform, and it has the ability to significantly lower the cost of certain workloads. Organizations may have particular pain around reducing the overall cost of their data warehouse. Certain groups of data may be seldom used and are possible candidates to offload to a lower-cost platform. Certain operations, such as transformations, can often be offloaded to a more cost-efficient platform. The primary area of value creation is cost savings. By pushing workloads and data sets onto a Hadoop platform, organizations are able to preserve their queries and take advantage of Hadoop’s cost-effective processing capabilities. One customer example, a financial services firm, moved processing of applications and reports from an operational data warehouse to Hadoop HBase; they were able to preserve their existing queries and reduce the operating costs of their data management platform. The entry point for this pain is InfoSphere BigInsights – IBM’s Hadoop-based product.
Key Points
Hadoop is not a product but an open source framework for more cost-effectively and efficiently analyzing large amounts of structured and unstructured data. However, using open source Hadoop requires the download, installation, configuration, and maintenance of a myriad of different software pieces (Hadoop, MapReduce, Hive, Pig, HBase, etc.). On the left, you see the characteristics that make Hadoop different and so valuable for analyzing big data. Some vendors try to simplify the installation and configuration of the Hadoop framework and projects by prepackaging all the components into a single “distribution” without providing any real added value. IBM’s approach to Hadoop is different. On the right, you see the innovations and enhancements we added to our BigInsights Hadoop analytics product, making it significantly better for enterprises than open source Hadoop.
In the area of performance and reliability, we’ve added ground-breaking innovations:
- “Adaptive MapReduce”, which speeds up MapReduce workloads by enabling dynamic changes to resource utilization (CPU, disk space, memory) without human intervention. Without this innovation, Hadoop users would need to monitor their MapReduce workloads and manually turn the configuration “knobs” to adjust resource utilization.
- Compression enhancements that reduce storage needs and costs as well as query time
- Indexing that reduces the latency of text searches
- A Workload Scheduler capability that makes it easy to schedule and optimize Hadoop analytics runs
Other enhancements:
- Accelerators of prepackaged content and knowledge (best-practice patterns) to solve discrete big data problems
- UIs and tooling needed by data scientists, developers, and administrators to minimize the Hadoop learning curve
- Out-of-the-box integration connectors to access any data type and source
- Security to control data access – critical to maintaining data privacy and protecting confidential data
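For readers new to the MapReduce model mentioned above, the canonical example is word count: a map phase emits key/value pairs, and a reduce phase aggregates them by key. The sketch below illustrates the programming model in plain Python; it is not Hadoop itself (which distributes these phases across a cluster), and certainly not IBM's Adaptive MapReduce.

```python
from collections import defaultdict
from itertools import chain

# Plain-Python illustration of the map/shuffle/reduce model that
# Hadoop implements at scale (word count, the canonical example).
# This sketches the programming model only, not Hadoop's runtime.

def map_phase(line):
    """Emit (word, 1) pairs for each word in a line of input."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    """Group pairs by key and sum the counts, as reducers would."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big insights", "data in motion"]
pairs = chain.from_iterable(map_phase(l) for l in lines)
print(reduce_phase(pairs))  # e.g. {'big': 2, 'data': 2, ...}
```

On a real cluster, the framework handles partitioning the input, shuffling intermediate pairs to reducers, and recovering from node failures – which is exactly the machinery a distribution packages up for you.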
Customers often have many sources of streaming data, yet they are unable to take full advantage of them. Sometimes it’s because there is simply too much data to collect and store before analyzing it. Or it may be because of timing – by the time they store the data on disk, analyze it, and respond, it’s too late. They need a way to harness the natural resource of streaming data and turn it into actionable insight. The benefits of streaming analytics are immediately obvious: dramatic cost savings from analyzing data and storing only what is necessary, and the ability to detect and make real-time decisions – from retaining a customer, to detecting fraud, to cross-selling a product. One client, Ufone, analyzed Call Detail Records (CDRs) as data streamed off their network. By analyzing CDRs in real time, they were able to detect potential customer service issues and proactively respond, thereby reducing customer churn. The entry point to the big data platform is InfoSphere Streams, which is often accompanied by a system to persist insights and perform deeper analysis to adjust the streaming analytic models – either Netezza or InfoSphere BigInsights.
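The analyze-first, store-selectively idea behind stream computing can be sketched with a generator: inspect each record as it arrives and persist only the ones that matter. The CDR fields and the drop-count threshold below are illustrative assumptions, not Ufone's actual logic or the InfoSphere Streams API.

```python
# Minimal sketch of in-motion analysis: records are examined one at
# a time and only the interesting ones are emitted for storage.
# The CDR fields and the threshold are illustrative assumptions.

def stream_filter(records, max_drops=3):
    """Yield an alert once a caller accumulates max_drops dropped calls."""
    dropped_by_caller = {}
    for rec in records:  # records arrive one at a time, never stored in bulk
        if rec["status"] == "dropped":
            n = dropped_by_caller.get(rec["caller"], 0) + 1
            dropped_by_caller[rec["caller"]] = n
            if n >= max_drops:
                yield {"caller": rec["caller"], "dropped_calls": n}

cdrs = [
    {"caller": "A", "status": "dropped"},
    {"caller": "A", "status": "ok"},
    {"caller": "A", "status": "dropped"},
    {"caller": "B", "status": "ok"},
    {"caller": "A", "status": "dropped"},  # third drop -> alert
]
alerts = list(stream_filter(cdrs))
print(alerts)  # [{'caller': 'A', 'dropped_calls': 3}]
```

Only the alert reaches persistent storage; the bulk of the stream is analyzed and discarded, which is where the cost savings come from.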
There are many entry points to the big data platform. It isn’t a one-time, one-size-fits-all proposition. There are many entry points – illustrated on this slide and in the previous slides. <Read pains and entry points to re-iterate>. The key point is that clients will start with one pain and entry point, and adopt others over time. And there is a benefit to doing so – they can leverage reusable aspects of the platform as they adopt new capabilities, sharing analytics, accelerators, etc. from one implementation to the next. And that is the power of the platform – the ability to leverage from one project to the next, and to go faster.