Introducing Aginity's "Big Data" Research Lab

Launched March 2009
Background
Google changed everything….

What makes Google great isn't the user interface … or the word processor, or even Gmail, although these
are great tools.

What made Google great was their massive database of searches and indexes to content, which allows them
to understand what you are searching for even better than you do yourself.

Google is a database company. They process more data every day than almost any other company
in the world. And unlike other big data companies, most of Google’s data is unstructured.

To pull this off, Google invented a new class of database that could perform analytics on the fly,
"In-Database", on largely unstructured data using large clusters of off-the-shelf computers.

From this work was launched a new class of data warehouse that we believe will change the world.
What Was Our Goal?
We wanted to see what could be built using the framework invented by Google for
under $10,000 in hardware cost and $15,000 per terabyte for the data warehouse
software.

Our goal was to build a 10-terabyte, always-on MPP data warehouse using desktop-class commodity
hardware, an open-source operating system, and the leading MPP database software on the planet.

This is a technology sandbox in which we are seeing how close we can get to the $2 million data
warehouse of five years ago for $10,000 to $20,000.

Obviously, this is not a production-class system, but it is a good illustration of the power of the latest
software-only "Big Data" systems and of Aginity's mastery of those systems.
What Is An MPP Data Warehouse?
MPP, or Massively Parallel Processing, is a class of architectures aimed specifically at addressing
the processing requirements of very large databases. MPP architecture has been accepted as the
only way to go at the high end of the data warehousing world.
Degrees of Massively Parallel Processing
John O'Brien
InfoManagement Direct, February 26, 2009
What Is MapReduce?
MapReduce, invented by Google, is a programming model and an associated implementation for processing and
generating large data sets.

The core ideas of MapReduce are:

• MapReduce isn’t about data management, at least not primarily. It’s about parallelism.

• In principle, any alphanumeric data at all can be stuffed into tables. But in high-dimensional scenarios, those tables are
  super-sparse. That’s when MapReduce can offer big advantages by bypassing relational databases. Examples of such
  scenarios are found in CRM and relationship analytics.

• MapReduce offers dramatic performance gains in analytic application areas that still need great performance speed-up.

• On its own, MapReduce can do a lot of important work in data manipulation and analysis. Integrating it with SQL should
  just increase its applicability and power.

• At its core, most data analysis is really pretty simple – it boils down to arithmetic, Boolean logic, sorting, and not a lot
  else. MapReduce can handle a significant fraction of that.

• MapReduce isn’t needed for tabular data management. That’s been efficiently parallelized in other ways. But if you want
  to build non-tabular structures such as text indexes or graphs, MapReduce turns out to be a big help.

DBMS2
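
To make the last point concrete, here is a minimal sketch of the map/reduce pattern used to build a
non-tabular structure, a tiny inverted text index, in plain Python with no framework. The function names
and sample documents are invented for illustration; they are not from the DBMS2 article or from any
particular MapReduce engine.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit (word, doc_id) pairs for every word in one document."""
    for word in text.lower().split():
        yield word, doc_id

def reduce_phase(word, doc_ids):
    """Reduce: merge all postings for one word into a sorted, de-duplicated list."""
    return word, sorted(set(doc_ids))

def build_inverted_index(documents):
    """Run both phases serially; a real MapReduce engine would shard each phase."""
    intermediate = defaultdict(list)              # the shuffle / group-by-key step
    for doc_id, text in documents.items():
        for word, value in map_phase(doc_id, text):
            intermediate[word].append(value)
    return dict(reduce_phase(w, ids) for w, ids in intermediate.items())

if __name__ == "__main__":
    docs = {1: "big data changes everything",
            2: "google made big data mainstream"}
    print(build_inverted_index(docs))
    # {'big': [1, 2], 'data': [1, 2], 'changes': [1], ...}
```

Because every map call and every reduce call is independent, a real engine can spread them across a
cluster of commodity machines; the serial loop here only makes the data flow visible.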
What are we testing?
• Very large 5 TB database with 2 TB fact table

• Ability to do "on-the-fly" analytics without creating cubes or any form of pre-aggregation at sub-second speed.

• Very large complex queries that span nodes

• The benefits of using the MapReduce indexing model

• In-Database Analytics

• Fault tolerance at scale? What happens if I unplug one of the nodes during a complex process?
How much MPP power can $5,682.10 buy in 2009?
At least 10 terabytes. We constructed a 9-box server farm using off-the-shelf components. Our
Chief Architect, Ted Westerheide, personally oversaw the construction of a 10-terabyte enterprise-wide
"data production" system about 10 years ago. The cost at that time? $2.2 million. Here's the
story of how we built similar capabilities for our lab for $5,682.10 U.S.




Photos: Then · Our Lab · Real-world blade servers
The Hardware Parts List and Cost: $5,682.10
The Databases We Are Testing
Think of these as “The Big Three”. All matter to us and all are in our lab. Databases such as the
ones we work with cost about $15,000 per terabyte per year to operate.
The Foundation
The databases are running on SUSE, Novell's open-source Linux.
About 11 hours to assemble the boxes
MapReduce
MapReduce: Simplified Data Processing on Large Clusters
Google Research
Complete article here

MapReduce is a programming model and an associated implementation for processing and generating large data
sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs,
and a reduce function that merges all intermediate values associated with the same intermediate key. Many real
world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity
machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's
execution across a set of machines, handling machine failures, and managing the required inter-machine
communication. This allows programmers without any experience with parallel and distributed systems to easily
utilize the resources of a large distributed system.

Our [Google’s] implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable:
a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find
the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand
MapReduce jobs are executed on Google's clusters every day….

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose
computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to
compute various kinds of derived data…continued in paper.
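
As a sketch of what "specify a map function … and a reduce function" looks like in practice, here is a toy
word count in plain Python, with the standard library's multiprocessing pool standing in for Google's
cluster runtime: the programmer supplies only the two functions, and the surrounding code plays the role of
partitioning the input, grouping intermediate values by key, and running tasks in parallel. The splits,
function names, and counts are our own illustration, not code from the paper.

```python
from collections import defaultdict
from multiprocessing import Pool

def map_task(split):
    """Map: emit partial word counts for one input split (one worker's share)."""
    counts = defaultdict(int)
    for word in split.lower().split():
        counts[word] += 1
    return counts

def reduce_task(item):
    """Reduce: merge every partial count recorded for one word."""
    word, partial_counts = item
    return word, sum(partial_counts)

def word_count(splits):
    with Pool() as pool:
        partials = pool.map(map_task, splits)              # map phase, run in parallel
        grouped = defaultdict(list)                         # shuffle: group values by key
        for counts in partials:
            for word, n in counts.items():
                grouped[word].append(n)
        return dict(pool.map(reduce_task, grouped.items()))  # reduce phase, run in parallel

if __name__ == "__main__":
    splits = ["big data changed everything",
              "google processes big data on commodity machines",
              "big clusters of commodity machines"]
    print(word_count(splits))   # {'big': 3, 'data': 2, 'commodity': 2, ...}
```

The same two functions would run unchanged on thousands of machines under a real MapReduce runtime, which
also handles input partitioning, machine failures, and inter-machine communication as described above.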
In-Database Analytics
In-Database Analytics: A Passing Lane for Complex Analysis
Seth Grimes
Intelligent Enterprise, December 15, 2008
What once took one company three to four weeks now takes four to eight hours thanks to in-database
computation. Here's what Netezza, Teradata, Greenplum and Aster Data Systems are doing to make it
happen.

A next-generation computational approach is earning front-line operational relevance for data warehouses,
long a resource appropriate solely for back-office, strategic data analyses. Emerging in-database analytics
exploits the programmability and parallel-processing capabilities of database engines from vendors Teradata,
Netezza, Greenplum, and Aster Data Systems. The programmability lets application developers move
calculations into the data warehouse, avoiding data movement that slows response time. Coupled with
performance and scalability advances that stem from database platforms with parallelized, shared-nothing
(MPP) architectures, database-embedded calculations respond to growing demand for high-throughput,
operational analytics for needs such as fraud detection, credit scoring, and risk management.

Data-warehouse appliance vendor Netezza released its in-database analytics capabilities last May, and in
September the company announced five partner-developed applications that rely on in-database
computations to accelerate analytics. "Netezza's [on-stream programmability] enabled us to create
applications that were not possible before," says Netezza partner Arun Gollapudi, CEO of Systech Solutions.
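
A minimal sketch of the difference the article describes, in Python with SQLite standing in for a parallel
warehouse (the table, columns, and numbers are invented for illustration; Netezza, Teradata, Greenplum, and
Aster expose this through their own in-database programmability): the first approach drags every row out to
the application, while the second pushes the calculation into the engine so only the small answer set moves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (account_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO txn VALUES (?, ?)",
                 [(1, 120.0), (1, 80.0), (2, 1500.0), (2, 40.0)])

# Client-side approach: pull every row out of the warehouse, then compute in the app.
rows = conn.execute("SELECT account_id, amount FROM txn").fetchall()
totals = {}
for account_id, amount in rows:
    totals[account_id] = totals.get(account_id, 0.0) + amount

# In-database approach: push the calculation into the engine; only the small
# result set crosses the wire, and an MPP engine can run it on every node at once.
in_db_totals = dict(conn.execute(
    "SELECT account_id, SUM(amount) FROM txn GROUP BY account_id"))

assert totals == in_db_totals
print(in_db_totals)   # {1: 200.0, 2: 1540.0}
```

On a shared-nothing MPP platform the pushed-down version also runs on every node in parallel, which is the
combination of avoided data movement and parallelism the article credits for the weeks-to-hours improvement.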
Massively Parallel Processing (MPP)
Degrees of Massively Parallel Processing
John O'Brien
InfoManagement Direct, February 26, 2009

The concept of linear growth is obsolete. In the closing decades of the 20th century, we got used to the rapid
pace of change, but the shape of that change was still one of incremental growth. Now we’re contending with
a breakneck speed of change and exponential growth almost everywhere we look, especially with the
information we generate. As documented in “Richard Winter’s Top Ten” report from 2005, the very largest
databases in the world at that time are literally dwarfed by today’s databases.

The fact that the entire Library of Congress’s holdings comprised 20 terabytes of data was breathtaking.
Today, some telecommunications, energy and financial companies can generate that much data in a month.
Even midsized organizations are coping with data sets that will soon outgrow the Library of Congress.

MPP is a class of architectures aimed specifically at addressing the processing requirements of very large
databases. MPP architecture has been accepted as the only way to go at the high end of the data
warehousing world. If it’s so well-suited to the very large data warehouses, why hasn’t everyone adopted it?
The answer lies in its previous complexity. Engineering an MPP system is difficult and remains the purview of
organizations and specialized vendors that have a deep layer of dedicated R&D resources. These specialized
vendors are bringing solutions to the market that shield the user from the complexity of implementing their
own MPP systems. These solutions take a variety of forms, such as custom-built deployments,
software/hardware configurations and all-in-one appliances.
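
To ground the term, here is a toy sketch of the shared-nothing pattern O'Brien describes, with Python worker
processes standing in for warehouse nodes (the partition layout, data, and function names are our own
illustration): each node aggregates only the slice of the fact table it owns, and a coordinator merges the
small partial results.

```python
from concurrent.futures import ProcessPoolExecutor

# Shared-nothing layout: each "node" owns its own slice of the fact table
# and never reads another node's data.
NODE_PARTITIONS = [
    [("east", 10.0), ("west", 5.0)],    # node 0
    [("east", 7.5),  ("north", 2.0)],   # node 1
    [("west", 1.0),  ("north", 4.0)],   # node 2
]

def local_aggregate(rows):
    """Runs on one node: total up only the rows that node stores."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    return totals

def mpp_query(partitions):
    """Coordinator: scatter the work to every node, then merge the partial results."""
    merged = {}
    with ProcessPoolExecutor() as pool:
        for partial in pool.map(local_aggregate, partitions):
            for region, amount in partial.items():
                merged[region] = merged.get(region, 0.0) + amount
    return merged

if __name__ == "__main__":
    print(mpp_query(NODE_PARTITIONS))   # {'east': 17.5, 'west': 6.0, 'north': 6.0}
```

Scaling out then means adding nodes and re-splitting the data rather than buying a bigger single machine,
which is what the custom deployments, software/hardware configurations, and appliances mentioned above
package up for their users.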
