Introducing Aginity’s “Big Data” Research Lab

              Launched March 2009
Background
Google changed everything….

What makes Google great isn’t the user interface …
or the word processor, or even Gmail, although these are great tools.

What made Google great was their massive database of searches and indexes to content that
allows them to understand what you are searching for even better than you do yourself.

Google is a database company. They process more data every day than almost any other company
in the world. And unlike other big data companies, most of Google’s data is unstructured.

To pull this off, Google invented a new class of database that could perform analytics on the fly,
“in-database”, on largely unstructured data, using large clusters of off-the-shelf computers.

This work launched a new class of data warehouse that we believe will change the
world.
What Was Our Goal?
We wanted to see what could be built on the framework invented by Google while
spending under $10,000 on hardware and $15,000 per terabyte on the data warehouse
software.

Our goal was to build a 10 terabyte, always-on MPP data warehouse using
desktop-class commodity hardware, an open source operating system, and the
leading MPP database software on the planet.

This is a technology sandbox in which we are seeing how close we can get to a
$2 million data warehouse of five years ago for $10,000 to $20,000.

Obviously, this is not a production-class system, but it is a good illustration of the
power of the latest software-only “Big Data” systems and Aginity’s mastery of
those systems.
What Is an MPP Data Warehouse?
MPP, or Massively Parallel Processing, is a class of architectures aimed specifically at addressing
the processing requirements of very large databases. MPP architecture has been accepted as the
only way to go at the high end of the data warehousing world.
           Degrees of Massively Parallel Processing
           John O'Brien
           InfoManagement Direct, February 26, 2009
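
The parallelism that MPP refers to can be sketched in a few lines. The example below is only an illustration under stated assumptions, not how any particular MPP product works: the function names (`partition`, `local_totals`, `mpp_totals`) and the sample `sales` rows are hypothetical, and threads on one machine stand in for separate nodes. Each "node" owns a disjoint slice of the rows, aggregates only its own slice, and a coordinator merges the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32


def partition(rows, num_nodes):
    """Hash-distribute (key, amount) rows so each node owns a disjoint slice."""
    slices = [[] for _ in range(num_nodes)]
    for key, amount in rows:
        # crc32 gives a stable hash, so a key always lands on the same node.
        slices[crc32(key.encode()) % num_nodes].append((key, amount))
    return slices


def local_totals(node_rows):
    """Work done independently on one node: aggregate only its own rows."""
    totals = Counter()
    for key, amount in node_rows:
        totals[key] += amount
    return totals


def mpp_totals(rows, num_nodes=4):
    """Scatter the rows, aggregate each slice in parallel, then gather."""
    slices = partition(rows, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(local_totals, slices))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # adds the per-node partial sums together
    return dict(merged)


# Hypothetical sample data, for illustration only.
sales = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]
print(mpp_totals(sales))
```

Because all rows for a given key are sent to the same node, no node needs to see another node's data until the final merge, which is the essence of a shared-nothing design.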
What Is MapReduce?
MapReduce was invented by Google and is a programming model and an associated implementation for processing and
generating large data sets.

The core ideas of MapReduce are:

• MapReduce isn’t about data management, at least not primarily. It’s about parallelism.

• In principle, any alphanumeric data at all can be stuffed into tables. But in high-dimensional scenarios, those tables are
  super-sparse. That’s when MapReduce can offer big advantages by bypassing relational databases. Examples of such
  scenarios are found in CRM and relationship analytics.

• MapReduce offers dramatic performance gains in analytic application areas that still need a large speed-up.

• On its own, MapReduce can do a lot of important work in data manipulation and analysis. Integrating it with SQL should
  just increase its applicability and power.

• At its core, most data analysis is really pretty simple – it boils down to arithmetic, Boolean logic, sorting, and not a lot
  else. MapReduce can handle a significant fraction of that.

• MapReduce isn’t needed for tabular data management. That’s been efficiently parallelized in other ways. But if you want
  to build non-tabular structures such as text indexes or graphs, MapReduce turns out to be a big help.

DBMS2
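
The last point, building a text index with MapReduce, can be sketched as follows. This is a minimal single-process illustration, assuming a tiny in-memory document set; the names `map_doc`, `shuffle`, `reduce_postings` and `build_index` and the sample `docs` are hypothetical and do not come from any MapReduce implementation.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map step: emit one (word, doc_id) pair per distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def shuffle(pairs):
    """Group intermediate values by key, as a framework would between the steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_postings(word, doc_ids):
    """Reduce step: merge the document ids for one word into a sorted posting list."""
    return word, sorted(doc_ids)


def build_index(docs):
    """Run map over every document, then reduce once per distinct word."""
    pairs = [pair for doc_id, text in docs.items() for pair in map_doc(doc_id, text)]
    return dict(reduce_postings(word, ids) for word, ids in shuffle(pairs).items())


# Hypothetical sample documents, for illustration only.
docs = {1: "big data warehouse", 2: "data lab", 3: "big lab"}
index = build_index(docs)
print(index["data"])
```

The resulting structure maps each word to the documents that contain it, which is a non-tabular shape that is awkward to express as a single relational query but falls out naturally from the map and reduce steps.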
What are we testing?
• Very large 5 TB database with 2 TB fact table

• Ability to do “on-the-fly” analytics at sub-second speed, without creating cubes or any form of
  pre-aggregation

• Very large complex queries that span nodes

• The benefits of using the MapReduce indexing model

• In-Database Analytics

• Fault tolerance at scale: what happens if we unplug one of the nodes during a complex process?
How much MPP power can $5,682.10 buy in 2009?
At least 10 terabytes. We constructed a 9-box server farm using off-the-shelf components. Our
Chief Architect, Ted Westerheide, personally oversaw the construction of a 10 terabyte
enterprise-wide “data production” system about 10 years ago. The cost at that time? $2.2 million.
Here’s the story of how we built similar capabilities for our lab for US$5,682.10.
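
The comparison can be put in per-terabyte terms with a little arithmetic. The sketch below uses only the three figures quoted above; it assumes both systems are counted at 10 terabytes, and it compares the lab's hardware bill with the older system's quoted total cost, which may have covered more than hardware.

```python
CAPACITY_TB = 10            # capacity quoted for both systems
THEN_COST_USD = 2_200_000.00  # cost of the system built about 10 years earlier
LAB_COST_USD = 5_682.10       # hardware parts cost of the lab


def cost_per_tb(total_cost, capacity_tb=CAPACITY_TB):
    """Return the cost per terabyte, rounded to cents."""
    return round(total_cost / capacity_tb, 2)


then_per_tb = cost_per_tb(THEN_COST_USD)
lab_per_tb = cost_per_tb(LAB_COST_USD)
ratio = THEN_COST_USD / LAB_COST_USD

print(f"Then: ${then_per_tb:,.2f} per TB")
print(f"Lab:  ${lab_per_tb:,.2f} per TB")
print(f"The older system cost roughly {ratio:.0f} times as much")
```

On these figures the lab hardware comes to about $568 per terabyte against $220,000 per terabyte for the older system.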




[Photos: Then | Our Lab | Real-world blade servers]
The Hardware Parts List and Cost: $5,682.10
The Databases We Are Testing
Think of these as “The Big Three”. All matter to us and all are in our lab. Databases such as the
ones we work with cost about $15,000 per terabyte per year to operate.
The Foundation
The databases are running on SUSE, Novell’s open source Linux distribution.
About 11 hours to assemble the boxes
MapReduce
MapReduce: Simplified Data Processing on Large Clusters
Google Research

MapReduce is a programming model and an associated implementation for processing and generating large data
sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value
pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many
real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity
machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's
execution across a set of machines, handling machine failures, and managing the required inter-machine
communication. This allows programmers without any experience with parallel and distributed systems to easily
utilize the resources of a large distributed system.

Our [Google’s] implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable:
a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find
the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand
MapReduce jobs are executed on Google's clusters every day….

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose
computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to
compute various kinds of derived data…continued in paper.
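
The split described in the abstract, where the user supplies a map function and a reduce function and the run-time handles the rest, can be sketched as below. This is a toy single-machine stand-in, not Google's implementation: `run_mapreduce` is a hypothetical helper, threads take the place of cluster machines, and there is no handling of machine failures. Word counting is used as the task, and the input records are made up.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_mapreduce(records, map_fn, reduce_fn, workers=4):
    """Toy run-time: parallel map, group by intermediate key, then reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each input record is mapped independently, so the calls can overlap.
        mapped = list(pool.map(lambda kv: list(map_fn(*kv)), records))
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    # One reduce call per intermediate key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# User code: only these two functions are specific to the task.
def word_count_map(filename, contents):
    """Emit (word, 1) for every word in one input file."""
    for word in contents.split():
        yield word, 1


def word_count_reduce(word, counts):
    """Merge all intermediate values for one word into a total."""
    return sum(counts)


# Hypothetical input records of the form (key, value) = (file name, file text).
records = [("a.txt", "big data big lab"), ("b.txt", "data lab data")]
counts = run_mapreduce(records, word_count_map, word_count_reduce)
print(counts)
```

Swapping in a different pair of map and reduce functions changes the computation without touching `run_mapreduce`, which is the separation of concerns the abstract describes.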
In-Database Analytics
In-Database Analytics: A Passing Lane for Complex Analysis
Seth Grimes
Intelligent Enterprise, December 15, 2008
What once took one company three to four weeks now takes four to eight hours thanks to in-database
computation. Here's what Netezza, Teradata, Greenplum and Aster Data Systems are doing to make it
happen.

A next-generation computational approach is earning front-line operational relevance for data warehouses,
long a resource appropriate solely for back-office, strategic data analyses. Emerging in-database analytics
exploits the programmability and parallel-processing capabilities of database engines from vendors Teradata,
Netezza, Greenplum, and Aster Data Systems. The programmability lets application developers move
calculations into the data warehouse, avoiding data movement that slows response time. Coupled with
performance and scalability advances that stem from database platforms with parallelized, shared-nothing
(MPP) architectures, database-embedded calculations respond to growing demand for high-throughput,
operational analytics for needs such as fraud detection, credit scoring, and risk management.

Data-warehouse appliance vendor Netezza released its in-database analytics capabilities last May, and in
September the company announced five partner-developed applications that rely on in-database
computations to accelerate analytics. "Netezza's [on-stream programmability] enabled us to create
applications that were not possible before," says Netezza partner Arun Gollapudi, CEO of Systech Solutions.
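
The idea of moving the calculation to the data, rather than the data to the calculation, can be shown on a small scale. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in: SQLite is not parallel and is not one of the products named above. The `txn` table, its rows, and the two function names are hypothetical.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a small, made-up transaction table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE txn (account TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO txn VALUES (?, ?)",
        [("a", 120.0), ("a", 9800.0), ("b", 40.0), ("b", 55.0)],
    )
    return conn


def totals_client_side(conn):
    """Pull every row out of the database, then aggregate in the application."""
    totals = {}
    for account, amount in conn.execute("SELECT account, amount FROM txn"):
        totals[account] = totals.get(account, 0.0) + amount
    return totals


def totals_in_database(conn):
    """Let the database aggregate; only one row per account leaves it."""
    rows = conn.execute("SELECT account, SUM(amount) FROM txn GROUP BY account")
    return dict(rows)


conn = make_db()
print(totals_client_side(conn))
print(totals_in_database(conn))
```

Both functions return the same totals, but the second one transfers one row per account instead of one row per transaction. On a warehouse with billions of rows, that difference in data movement is the response-time saving the article refers to.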
Massively Parallel Processing (MPP)
Degrees of Massively Parallel Processing
John O'Brien
InfoManagement Direct, February 26, 2009

The concept of linear growth is obsolete. In the closing decades of the 20th century, we got used to the rapid
pace of change, but the shape of that change was still one of incremental growth. Now we’re contending with
a breakneck speed of change and exponential growth almost everywhere we look, especially with the
information we generate. As documented in “Richard Winter’s Top Ten” report from 2005, the very largest
databases in the world are literally dwarfed by today’s databases.

The fact that the entire Library of Congress’s holdings comprised 20 terabytes of data was breathtaking.
Today, some telecommunications, energy and financial companies can generate that much data in a month.
Even midsized organizations are coping with data sets that will soon outgrow the Library of Congress.

MPP is a class of architectures aimed specifically at addressing the processing requirements of very large
databases. MPP architecture has been accepted as the only way to go at the high end of the data
warehousing world. If it’s so well-suited to the very large data warehouses, why hasn’t everyone adopted it?
The answer lies in its previous complexity. Engineering an MPP system is difficult and remains the purview of
organizations and specialized vendors that have a deep layer of dedicated R&D resources. These specialized
vendors are bringing solutions to the market that shield the user from the complexity of implementing their
own MPP systems. These solutions take a variety of forms, such as custom-built deployments,
software/hardware configurations and all-in-one appliances.

Aginity Big Data Research Lab V3

  • 1. Introducing Aginity’s “Big Data” Research Lab. Launched March 2009.
  • 2. Background. Google changed everything. What makes Google great isn’t the user interface, the word processor, or even Gmail, although these are great tools. What made Google great was its massive database of searches and indexes to content, which allows it to understand what you are searching for even better than you do yourself. Google is a database company. It processes more data every day than almost any other company in the world. And unlike other big data companies, most of Google’s data is unstructured. To pull this off, Google invented a new class of database that could perform analytics on the fly, “In-Database”, on largely unstructured data using large clusters of off-the-shelf computers. This work launched a new class of data warehouse that we believe will change the world.
  • 3. What Was Our Goal? We wanted to see what could be built using the framework invented by Google for under $10,000 in hardware cost and $15,000 per terabyte for the data warehouse software. Our goal was to build a 10 terabyte, always-on MPP data warehouse using desktop-class commodity hardware, an open source operating system, and the leading MPP database software on the planet. This is a technology sandbox in which we are seeing how close we can get to a $2 million data warehouse of 5 years ago for $10,000 to $20,000. Obviously, this is not a production-class system, but it is a good illustration of the power of the latest software-only “Big Data” systems and Aginity’s mastery of those systems.
  • 4. What Is an MPP Data Warehouse? MPP, or Massively Parallel Processing, is a class of architectures aimed specifically at addressing the processing requirements of very large databases. MPP architecture has been accepted as the only way to go at the high end of the data warehousing world. (Source: “Degrees of Massively Parallel Processing”, John O'Brien, InfoManagement Direct, February 26, 2009)
  • 5. What Is MapReduce? MapReduce was invented by Google and is a programming model and an associated implementation for processing and generating large data sets. The core ideas of MapReduce are:
      • MapReduce isn’t about data management, at least not primarily. It’s about parallelism.
      • In principle, any alphanumeric data at all can be stuffed into tables. But in high-dimensional scenarios, those tables are super-sparse. That’s when MapReduce can offer big advantages by bypassing relational databases. Examples of such scenarios are found in CRM and relationship analytics.
      • MapReduce offers dramatic performance gains in analytic application areas that still need large speed-ups.
      • On its own, MapReduce can do a lot of important work in data manipulation and analysis. Integrating it with SQL should just increase its applicability and power.
      • At its core, most data analysis is really pretty simple: it boils down to arithmetic, Boolean logic, sorting, and not a lot else. MapReduce can handle a significant fraction of that.
      • MapReduce isn’t needed for tabular data management. That’s been efficiently parallelized in other ways. But if you want to build non-tabular structures such as text indexes or graphs, MapReduce turns out to be a big help.
    (Source: DBMS2)
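To make the text-index point concrete, here is a minimal single-process sketch of building an inverted index in the map/reduce style. It is an illustration only, not Google's implementation or anything run in the lab: the function names (`map_doc`, `reduce_word`, `build_index`) and the three sample documents are hypothetical, and a real system would spread the map and reduce calls across many machines.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map step: emit one (word, doc_id) pair per word occurrence."""
    for word in text.lower().split():
        yield word, doc_id


def reduce_word(word, doc_ids):
    """Reduce step: merge all doc ids seen for a word into a sorted posting list."""
    return word, sorted(set(doc_ids))


def build_index(docs):
    """Run map, group the intermediate pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for word, emitted_id in map_doc(doc_id, text):
            grouped[word].append(emitted_id)
    return dict(reduce_word(word, ids) for word, ids in grouped.items())


# Hypothetical sample documents, keyed by document id.
docs = {1: "big data lab", 2: "data warehouse", 3: "big warehouse lab"}
index = build_index(docs)
```

Each map call looks at one document and each reduce call at one word, so neither needs to see the whole data set. That independence is what lets the work be split across nodes.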
  • 6. What are we testing?
      • A very large 5 TB database with a 2 TB fact table
      • The ability to do “on-the-fly” analytics at sub-second speed without creating cubes or any form of pre-aggregation
      • Very large, complex queries that span nodes
      • The benefits of using the MapReduce indexing model
      • In-Database Analytics
      • Fault tolerance at scale: what happens if I unplug one of the nodes during a complex process?
  • 7. How much MPP power can $5,682.10 buy in 2009? At least 10 terabytes. We constructed a 9-box server farm using off-the-shelf components. Our Chief Architect, Ted Westerheide, personally oversaw the construction of a 10 terabyte enterprise-wide “data production” system about 10 years ago. The cost at that time? $2.2 million. Here’s the story of how we built similar capabilities for our lab for $5,682.10 (U.S.). [Image labels: Then; Our Lab; Real-world blade servers]
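The comparison on this slide comes down to simple arithmetic, worked through below as a rough sketch. It assumes both systems are treated as exactly 10 terabytes and takes the two dollar figures at face value; the lab figure covers hardware only, so the ratio does not account for database software licensing.

```python
# Figures taken from the slide; both systems are assumed to be 10 TB.
then_cost_usd = 2_200_000.00   # enterprise system built about 10 years earlier
lab_cost_usd = 5_682.10        # lab hardware parts list
capacity_tb = 10

then_per_tb = then_cost_usd / capacity_tb   # dollars per terabyte, then
lab_per_tb = lab_cost_usd / capacity_tb     # dollars per terabyte, lab hardware
cost_ratio = then_cost_usd / lab_cost_usd   # how many times cheaper the lab hardware is

print(f"Then: ${then_per_tb:,.2f} per TB")
print(f"Lab:  ${lab_per_tb:,.2f} per TB")
print(f"Ratio: about {cost_ratio:.0f}x")
```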
  • 8. The Hardware Parts List and Cost: $5,682.10
  • 9. The Databases We Are Testing Think of these as “The Big Three”. All matter to us and all are in our lab. Databases such as the ones we work with cost about $15,000 per terabyte per year to operate.
  • 10. The Foundation. The databases are running on SUSE, Novell’s open source Linux.
  • 11. About 11 hours to assemble the boxes
  • 12. MapReduce. From “MapReduce: Simplified Data Processing on Large Clusters” (Google Research): MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our [Google’s] implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day…. Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data…continued in paper.
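The map and reduce contract described in the abstract can be sketched in a few lines. The example below counts URL hits in a web request log, one of the raw data types the abstract mentions. It is a single-process illustration under stated assumptions: `run_mapreduce`, `map_request`, `reduce_count`, and the `METHOD URL STATUS` log format are all hypothetical, and the partitioning, scheduling, and failure handling that the real run-time system provides are left out.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy sequential driver: map every record, group by key, reduce each group."""
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


def map_request(line_number, line):
    """Map: take one log line and emit (url, 1). Assumes 'METHOD URL STATUS' lines."""
    _method, url, _status = line.split()
    yield url, 1


def reduce_count(url, counts):
    """Reduce: merge all counts emitted for the same URL."""
    return sum(counts)


# Hypothetical request log; input keys are line numbers, values are the lines.
log = ["GET /home 200", "GET /about 200", "GET /home 304"]
hits = run_mapreduce(enumerate(log), map_request, reduce_count)
```

Only the two small user functions know anything about log lines. The driver is generic, which is why a distributed run-time can replace it without changes to user code.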
  • 13. In-Database Analytics. From “In-Database Analytics: A Passing Lane for Complex Analysis”, Seth Grimes, Intelligent Enterprise, December 15, 2008: What once took one company three to four weeks now takes four to eight hours thanks to in-database computation. Here's what Netezza, Teradata, Greenplum and Aster Data Systems are doing to make it happen. A next-generation computational approach is earning front-line operational relevance for data warehouses, long a resource appropriate solely for back-office, strategic data analyses. Emerging in-database analytics exploits the programmability and parallel-processing capabilities of database engines from vendors Teradata, Netezza, Greenplum, and Aster Data Systems. The programmability lets application developers move calculations into the data warehouse, avoiding data movement that slows response time. Coupled with performance and scalability advances that stem from database platforms with parallelized, shared-nothing (MPP) architectures, database-embedded calculations respond to growing demand for high-throughput, operational analytics for needs such as fraud detection, credit scoring, and risk management. Data-warehouse appliance vendor Netezza released its in-database analytics capabilities last May, and in September the company announced five partner-developed applications that rely on in-database computations to accelerate analytics. “Netezza's [on-stream programmability] enabled us to create applications that were not possible before,” says Netezza partner Arun Gollapudi, CEO of Systech Solutions.
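The data-movement argument can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module purely as a stand-in; SQLite is not an MPP engine and is not one of the databases named above, and the `txns` table and its rows are hypothetical. The point is only the difference between pulling every row out to the application and pushing the calculation into the database.

```python
import sqlite3

# In-memory database with a hypothetical transactions table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (account TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO txns VALUES (?, ?)",
    [("a", 10.0), ("a", 30.0), ("b", 5.0), ("b", 15.0), ("b", 40.0)],
)

# Approach 1: move all the data to the application, then calculate there.
rows = conn.execute("SELECT account, amount FROM txns").fetchall()
client_totals = {}
for account, amount in rows:
    client_totals[account] = client_totals.get(account, 0.0) + amount

# Approach 2: move the calculation into the database; only results come back.
summary = conn.execute(
    "SELECT account, SUM(amount) FROM txns GROUP BY account"
).fetchall()
in_db_totals = dict(summary)

rows_moved_client_side = len(rows)     # every detail row crosses the boundary
rows_moved_in_database = len(summary)  # one row per account crosses the boundary
conn.close()
```

Both approaches give the same totals, but the second returns one row per account instead of one row per transaction. On a multi-terabyte fact table that gap is the data movement the article says slows response time.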
  • 14. Massively Parallel Processing (MPP). From “Degrees of Massively Parallel Processing”, John O'Brien, InfoManagement Direct, February 26, 2009: The concept of linear growth is obsolete. In the closing decades of the 20th century, we got used to the rapid pace of change, but the shape of that change was still one of incremental growth. Now we’re contending with a breakneck speed of change and exponential growth almost everywhere we look, especially with the information we generate. As documented in “Richard Winter’s Top Ten” report from 2005, the very largest databases in the world are literally dwarfed by today’s databases. The fact that the entire Library of Congress’s holdings comprised 20 terabytes of data was breathtaking. Today, some telecommunications, energy and financial companies can generate that much data in a month. Even midsized organizations are coping with data sets that will soon outgrow the Library of Congress. MPP is a class of architectures aimed specifically at addressing the processing requirements of very large databases. MPP architecture has been accepted as the only way to go at the high end of the data warehousing world. If it’s so well-suited to the very large data warehouses, why hasn’t everyone adopted it? The answer lies in its previous complexity. Engineering an MPP system is difficult and remains the purview of organizations and specialized vendors that have a deep layer of dedicated R&D resources. These specialized vendors are bringing solutions to the market that shield the user from the complexity of implementing their own MPP systems. These solutions take a variety of forms, such as custom-built deployments, software/hardware configurations and all-in-one appliances.