Exceptionally large data volumes that cannot be processed, or can only be processed inadequately, with standard databases and data-management tools. The main problems here are the capture, storage, search, distribution, analysis, and visualization of large data volumes. The size of these data sets runs into the terabytes, petabytes, exabytes, and zettabytes.
1 zettabyte = 10^21 bytes = 1,000,000,000,000,000,000,000 bytes
1 terabyte = one consumer hard disk
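The unit ladder above can be sanity-checked with a quick sketch (decimal SI prefixes, as used on the slide):

```python
# Decimal (SI) byte units, as on the slide.
UNITS = {
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
}

# 1 zettabyte written out in full.
assert UNITS["zettabyte"] == 1_000_000_000_000_000_000_000

# How many 1-terabyte consumer disks does one zettabyte fill?
disks = UNITS["zettabyte"] // UNITS["terabyte"]
print(disks)  # 1000000000 (a billion 1 TB disks)
```
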
Social media (Web 2.0) and sensor data; structured and unstructured data.

According to IDC, “In 2011, the amount of information created and replicated will surpass 1.8 zettabytes (1.8 trillion gigabytes), growing by a factor of nine in just five years. That’s nearly as many bits of information in the digital universe as stars in the physical universe” (The 2011 Digital Universe Study: Extracting Value from Chaos, IDC, June 2011).

The explosion of data isn’t new. It continues a trend that started in the 1970s. What has changed is the velocity of growth, the diversity of the data, and the imperative to make better use of it all to transform the business.

“IDC’s prediction is that by 2020, the amount of information that IT is responsible for managing will grow by a factor of 44,” said May. “That means there will be work for IT to do. What is concerning is that the forecast is that IT headcount is only going to grow by a factor of 1.4. The situation is becoming unsustainable.”
Conclusion: data are growing massively in volume and diversity. Does this affect me?? NEXT: It's relative.

Now let’s take a look at the factors that are driving the need for a more strategic information management approach. This slide describes those factors and the diversity that Information Management must support so that organizations can better capitalize on data and information. This lets us contrast Information Management with a more traditional or limited approach to managing data. The diagram represents the various factors and then highlights an example of the factors that dictate the need for a comprehensive information management approach.

Each element included has a diversity of requirements or broad-scope considerations that drive the need for a more strategic information management approach. Touch on each factor briefly and be familiar with the details below so that you can highlight the items most relevant to the account:

Data Diversity – the information management strategy should accommodate the diversity of data, including the following attributes:
Big data: volume, velocity, variety – big data is relative, but all organizations need to think about the entirety of the data at their disposal and how the requirements for storage, scale, processing, etc. will multiply in the future.
Big text / unstructured data – in many cases the big data phenomenon is driven by text or unstructured data, and the need to mine value out of that text is critical for both analytical and operational scenarios.
It is critical that information management strategies effectively accommodate text and unstructured data; this highlights the need for a combined analytics and data management capability.
In-flight (perishable) data, data at rest – the ability to support in-flight data is becoming increasingly important given the large volumes of data people are faced with and the growing number of streaming data sources.
Internal, external – it’s not just about internal data, and not all internal data is “on premise”. It’s also about partner data and other third-party data sources that need to be leveraged and managed effectively.
Text / unstructured / semi-structured – although all data and data-source types need to be accommodated in a comprehensive strategy, data types outside the traditional legacy or relational structure are worth noting. This is where many organizations struggle: they have a decent handle on transactional or relational data, but text, unstructured, and semi-structured data (whatever you want to call it) are typically a challenge.
All forms of information assets – addressing data diversity needs to be expanded beyond traditional sources of data. For example, analytical models are information assets that need to be leveraged and managed effectively.

Consumption Diversity – the information management strategy needs to support an increasingly diverse set of consumption “devices” in a continuous and concurrent fashion:
Support for multiple form factors and form-factor versions, including PCs, smartphones, tablets, and industrial devices (medical, network, etc.)
Support for user-based interaction as well as automated and machine-to-machine support using a service-based approach.

Use Case Diversity – the information management strategy should support all relevant use cases for an organization, including:
Analytical prep, data migrations, single view, enterprise application integration, B2B integration, etc.
Refer to the Data Management market slides that outline the use cases based on Gartner’s definition.

Application Diversity – the information management strategy should support all types of applications, spanning operational and analytical:
Analytics – BI, data mining, text analytics, forecasting, optimization, etc.
Operational / transactional – ERP, CRM, HR, supply chain, etc.

Temporal Diversity – the information management strategy should support the various temporal requirements relating to data, from batch to extremely low latency (real time or near real time):
Batch – for information processing suited to batch requirements, the processing must be supported within the appropriate batch window. In many cases IT is being squeezed on both ends: batch windows are shrinking (to support 24x7x365 processing driven by international business, online commerce, etc.) while the amount of data and the processing or analytical requirements are expanding.
Real time / near real time – there is an increased need to support processing in real time via the introduction of rich information and analytic services.

Deployment Diversity & Scope – the information management strategy needs to support multiple deployment approaches:
The strategy should effectively leverage multiple deployment or infrastructure options, including cloud (public, private, hybrid) and the various “as a service” types such as SaaS, PaaS, and Analytic Results as a Service; on premise; and appliance.
The strategy should also take an expansive view of the capabilities and services needed to deploy and manage production information systems, and of the support necessary for the development-test-production cycle associated with operational applications and the iterative sandbox-to-production cycle associated with analytical applications.
Architecture Scope – the information management strategy should support a robust architecture that covers a broad range of technical requirements:
Enterprise architecture approach – although not related to diversity, it’s important that an enterprise architecture approach is used to formulate the overall information strategy.
The architecture approach should accommodate the ability to manage the analytical assets or models leveraged in analytical scenarios.
Centralized or distributed – based on the needs of the organization (which will likely change), the strategy should support a centralized or a distributed approach.
Service-based (SOA) – supporting the diverse set of information and analytic requirements typically calls for a service-based approach.

Persona / Role Diversity – the information management strategy should support all of the different roles associated with managing information, including:
Business – support for the key business stakeholders, both business analysts and end users
IT – support for the various roles that work with data in IT, including architects and DBAs
Data stewards – support for those responsible for managing data governance
Data scientists (data analysts, statisticians, etc.) – support for those responsible for managing analytics
Gartner IT trends from October 2011, bracketed by solution approaches on the topic of Big Data.
We have to prepare for large data volumes. How do I deal with them?
INFORMATION OVERLOAD: not all 900 million Facebook users are relevant to me -> which of them are talking about my products / services?
Merely storing the data creates no added value.
This aggravates the fundamental problem of IT: out of the many data (Volume), which are very heterogeneously structured (Variety) and grow quickly (Velocity), the valuable, relevant data must be filtered out.

17% of the world’s population used a social networking site in 2011.
Twitter logs 100 million Tweets per day.
Facebook counts 350 million unique visitors per day.
60 hours of video is uploaded to YouTube every 60 seconds.
80% of companies use social media for recruitment.

Many competitors are talking about the first three Vs. We believe that Value is the key one: just as speed alone isn’t enough, gaining value from the data is all that matters.
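The filtering idea above can be sketched in a few lines. This is a purely hypothetical example (the product names and posts are made up) of keeping only the posts that mention our products out of a much larger stream:

```python
# Hypothetical sketch: filter the relevant posts out of a social stream.
# The product keywords and the posts below are invented for illustration.
PRODUCT_KEYWORDS = {"analytics", "dataloader"}

posts = [
    {"user": "a", "text": "great weather today"},
    {"user": "b", "text": "the new Analytics release is fast"},
    {"user": "c", "text": "DataLoader keeps crashing for me"},
]

def is_relevant(post):
    """A post is relevant if any word matches a product keyword."""
    words = post["text"].lower().split()
    return any(keyword in words for keyword in PRODUCT_KEYWORDS)

relevant = [p for p in posts if is_relevant(p)]
print(len(relevant))  # 2
```

In a real deployment the matching would be far more sophisticated (text analytics rather than keyword lookup), but the shape of the problem is the same: most of the stream is noise and is dropped early.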
Big Data is relevant to all phases. No more sampling! NEXT: the solution?

While organizations may represent the process differently, they are all using the same lifecycle. And frankly, simply having more data doesn’t give you a competitive advantage. It’s only when an organization begins to revisit its analytic lifecycle, taking advantage of the additional data and compute power, that it begins to gain an advantage – both over the competition and over where it is today.

NEXT: What does a solution look like?
Processing large data volumes in relatively little time.
“Monte Rosa” at CSCS (ETH Zürich) near Lugano
BUT: the processors are commodity hardware – the rest, unfortunately, is not!
High costs due to custom-built components
Roughly CHF 15 million in operating costs per year (“Monte Rosa” plus “Tödi” plus others)
Focus on scientific applications and weather forecasting
NEXT: the solution approach can be carried over…
Gigantic demands on I/O throughput (an investment of several million in storage).
Distribute the data across many nodes.
Each node processes only a part of the data and is correspondingly faster.
Consolidate the partial results into an overall result (the MapReduce approach).
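The three steps above can be sketched as a minimal in-process MapReduce (a word count over text chunks; in a real cluster each chunk would live on a different node):

```python
from collections import defaultdict

def map_chunk(chunk):
    """Map phase: each 'node' counts words only in its own chunk."""
    counts = defaultdict(int)
    for line in chunk:
        for word in line.split():
            counts[word] += 1
    return counts

def reduce_counts(partials):
    """Reduce phase: consolidate the per-node partial results."""
    total = defaultdict(int)
    for partial in partials:
        for word, n in partial.items():
            total[word] += n
    return dict(total)

data = ["big data big", "data nodes", "big nodes nodes"]
chunks = [data[0:1], data[1:2], data[2:3]]  # distribute across 3 "nodes"
result = reduce_counts(map_chunk(c) for c in chunks)
print(result["big"], result["nodes"])  # 3 3
```

Because the map phase touches each chunk independently, it parallelizes trivially; only the small partial results have to travel to the consolidation step.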
ADW = Analytical Data Warehouse -> relevant data from the Enterprise Data Warehouse and operational systems.

The SAS Enterprise Analytics Architecture, which includes SAS High Performance Computing technology, enables organizations to “raise the relevance” in their data. SAS has introduced the concept of an Analytical Data Warehouse (ADW) to sit alongside the traditional Enterprise Data Warehouse (EDW). The ADW takes the complexity out of the EDW by surfacing only relevant data and INFORMATION to the business users.

To draw a parallel . . . we all go to the grocery store. But do you realize there are two parts to a grocery store? There are the aisles with the bread, milk, cereal, etc. This is the part that the CONSUMER sees, and the groceries are organized the way the CONSUMER typically shops. The other part is generally behind the wall of the store, where items are organized quite differently: for maximum storage and for quick replenishment of the store shelves. So the “AISLES” are the ADW, while the part “BEHIND THE WALL” is the EDW.

In our world, SAS leverages both the EDW and the ADW, depending upon the nature of the analytical problem, the required data, and the needs of the consumer. Our Information Management capabilities, combined with our advanced analytics, provide the necessary linkages to make this happen. In addition, SAS Grid, In-Database, and In-Memory technologies allow organizations to take large computational problems and distribute the analysis across the appropriate computing resources. This enables IT to get out of the “problem solving” business and into the “information providing” business. We provide the BUSINESS users with INFORMATION, organized for consumption, enabling them to solve problems quickly and confidently. SAS is the only vendor that can drive this high-performance data-to-delivery process.
Massively parallel database systems with SAS as an embedded process: EMC Greenplum or Teradata.
Apache Hadoop as the open-source alternative:
Proven (and developed) by Amazon, Google, Facebook, Yahoo
Scales practically without limit!
Commodity hardware (blade servers)
A second Linux (success story)?!
2-4 CPUs (each with 6-8 cores)
Up to 2 terabytes of main memory (96 gigabytes standard)
Extreme CPU and memory density
Inexpensive, because it is off-the-shelf hardware
DELL PowerEdge M910: 4 GB to 1 TB of DDR3 RAM, Intel Xeon Processor 7500 (8 cores / 16 threads per CPU)
32 servers fit in one enclosure. A single rack can support 128 servers for a total of 1024 CPU cores and 2 terabytes of memory, and peak performance per rack is a staggering 12.3 teraflops.
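The rack-level figures quoted above can be cross-checked with simple arithmetic. The per-server values here are inferred from the quoted totals (128 servers, 1024 cores per rack), not stated on the slide:

```python
# Check the quoted rack figures; inferred values are marked as such.
servers_per_enclosure = 32
enclosures_per_rack = 4        # inferred: 128 servers / 32 per enclosure
servers_per_rack = servers_per_enclosure * enclosures_per_rack
cores_per_server = 8           # inferred: 1024 cores / 128 servers

print(servers_per_rack)                      # 128
print(servers_per_rack * cores_per_server)   # 1024

# Peak performance per core, at the quoted 12.3 teraflops per rack:
print(round(12.3e12 / 1024 / 1e9, 1), "GFLOPS per core")
```
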
Symmetric multiprocessing: in information technology, a symmetric multiprocessor system (SMP) is a multiprocessor architecture in which two or more identical processors share a common address space. This means that every processor addresses the same memory cell or the same peripheral register with the same (physical) address. Most multiprocessor systems today are SMP architectures.

Massively Parallel Processing (MPP) refers in computer science to distributing a task across several main processors, each of which can also have its own working memory. A massively parallel computer is accordingly a parallel machine with a large number (sometimes several thousand) of independent execution units.
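The SMP/MPP distinction can be illustrated in miniature. Below, the SMP part uses threads that all write into one shared address space; the MPP part is simulated in-process, with each "node" seeing only its own partition and returning a partial result that must be consolidated:

```python
import threading

# SMP sketch: all threads share one address space and write
# directly into the same list (the same physical memory).
shared = [0] * 4

def smp_worker(i):
    shared[i] = i * i  # direct write into shared memory

threads = [threading.Thread(target=smp_worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared)  # [0, 1, 4, 9]

# MPP sketch (simulated): each "node" has only its own partition in
# local memory; partial results are sent back and consolidated.
def mpp_node(partition):
    return sum(x * x for x in partition)  # local computation only

data = list(range(8))
partitions = [data[0:4], data[4:8]]            # distribute the data
total = sum(mpp_node(p) for p in partitions)   # consolidation step
print(total)  # 140
```

The practical consequence is the one the slides draw: SMP scales only as far as one shared-memory machine, while the MPP/shared-nothing style scales by adding nodes, at the cost of an explicit distribution and consolidation step.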
Big Data is a hype!
The buzzword will disappear, but the large data volumes will stay (eCommerce, Web 2.0).
The technical prerequisites for Big Analytics are in place: SAS High Performance Analytics, at a manageable financial cost.
Big Analytics provides the decisive competitive advantage – ALREADY TODAY.
The ever-growing data stream must be sorted early: some data should trigger events immediately, other information must end up in sales reporting, and still other data is archived cheaply for later analysis.
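That early triage can be sketched as a simple router over incoming records. This is a hypothetical example (the record fields and routing rules are invented) of the three destinations just described:

```python
# Hypothetical sketch: triage each record as it arrives, instead of
# storing everything and sorting it out later. Fields are made up.
alerts, reporting, archive = [], [], []

def route(record):
    if record.get("severity") == "critical":
        alerts.append(record)      # trigger an event immediately
    elif record.get("type") == "sale":
        reporting.append(record)   # goes into sales reporting
    else:
        archive.append(record)     # kept cheaply for later analysis

stream = [
    {"type": "sensor", "severity": "critical"},
    {"type": "sale", "amount": 99},
    {"type": "clickstream"},
]
for record in stream:
    route(record)
print(len(alerts), len(reporting), len(archive))  # 1 1 1
```

In production this role is played by streaming/event-processing infrastructure rather than a Python loop, but the point stands: the routing decision happens once, as early as possible.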
Classical BI quickly reaches its limits here. It is increasingly about advanced analytics, with which valuable patterns and correlations can be detected. And the whole thing must be visualized expressively so that it remains understandable.