SlideShare ist ein Scribd-Unternehmen logo
1 von 59
The other Apache 
Technologies your Big 
Data solution needs! 
Nick Burch 
CTO, Quanticate
Nick Burch 
CTO, Quanticate 
Copyright © Quanticate 2013
The Apache Software Foundation 
Apache Technologies as in the ASF 
154 Top Level Projects 
33 Incubating Projects (> 100 past ones) 
Y is the only letter we lack 
A, C, O, S & T are popular letters, >= 12 each 
Meritocratic, Community driven 
Open Source
A Growing Foundation... 
November 2014: 
154 Top Level Projects 
33 Incubating Projects 
June 2013: 
117 Top Level Projects 
37 Incubator Projects 
June 2011: 
91 Top Level Projects 
59 Incubating Projects
What we're not covering
Projects not being covered include 
Cassandra cassandra.apache.org 
CouchDB couchdb.apache.org 
Flume flume.apache.org 
Giraph giraph.apache.org 
Hadoop hadoop.apache.org 
Hive hive.apache.org 
HBase hbase.apache.org
Projects not being covered include 
Lucene lucene.apache.org 
Mahout mahout.apache.org 
Mesos mesos.apache.org 
Nutch nutch.apache.org 
OODT oodt.apache.org 
Spark 
Storm etc!
What we are looking at
Talk Structure 
New Things from the Incubator 
Established Big Data projects not to forget 
about 
Non Big-Data projects that can help flesh out 
an overall solution 
Many projects to cover – 
this is only an overview!
Audience Participation: 
A show of hands...
Making the most of the talk 
Slides are available on the Conference site 
Lots of information and projects to cover 
Just an overview, see project sites for more 
Try to take note / remember the projects you 
think will matter to you 
Don't try to take notes on 
everything – there's too much!
What's new(ish) in the 
Apache Incubator?
Argus : argus.incubator.apache.org 
Data Security framework for Hadoop 
Central administration of all security related 
tasks, with central UI + Rest API 
Fine-grained model controlling access to 
data, components, actions, operations 
Centralised Auditing and Monitoring 
Role based, attribute based etc 
Standardised authz for Hadoop
Aurora : aurora.incubator.apache.org 
Service Scheduler on top of Mesos 
To run+manage long-running services 
Deployment and scheduling of jobs 
Uses a DSL to define services 
Heath checking, failure monitoring, restarting 
etc 
Aurora jobs are made up of many 
different Mesos tasks
Blur : incubator.apache.org/blur 
Search engine for massive amounts of 
structured data at high speed 
Query rich, structured data model 
US Census example: show me all of the 
people in the US who were born in Alaska 
between 1940 and 1970 who are now living in 
Kansas. 
Built on Apache Hadoop
Brooklyn : brooklyn.incubator.a.o 
Framework for modelling, monitoring and 
managing applications from blueprints 
Blueprints describe app from components, 
many built in, using bash, Java, Chef etc 
Deployed across multiple machines 
automatically: cloud, private, docker etc 
Scale, replace, restart etc 
Metrics monitored
Calcite : calcite.incubator.a.o 
Dynamic Data Management framework 
Highly customisable engine for planning and 
parsing queries on data from a wide variety of 
formats 
SQL interface for data not in relational 
databases, with query optimisation 
Complementary to Hadoop and NoSQL 
systems, esp. combinations of them 
Formerly known as Optiq
DataFu : datafu.incubator.apache.org 
Collection of libraries for working with large-scale 
data in Hadoop, for data mining, statistics 
etc 
Provides Map-Reduce jobs and high level 
language functions for data analysis, eg 
statistics calculations 
Incremental processing with Hadoop 
with sliding data, eg computing 
daily and weekly statistics
Drill : incubator.apache.org/drill 
Drill is a distributed system for interactive 
analysis of large-scale datasets, inspired by 
Google's Dremel 
Low-latency queries natively on rapidly 
evening multi-structured datasets 
Aiming to be able to scale to 10k+ servers, 
petabytes or data and trillions of 
records, all within seconds
Falcon : falcon.incubator.apache.org 
Data management and processing framework 
built on Hadoop 
Quickly onboard data + its processing into a 
Hadoop based system 
Declarative definition of data endpoints and 
processing rules, inc dependencies 
Orchestrates data pipelines, 
management, lifecycle, motion etc
Flink : flink.incubator.apache.org 
Flink is an open source system for 
expressive, declarative, fast, and efficient 
data analysis. Flink combines the scalability 
and programming flexibility of distributed 
MapReduce-like platforms with the 
efficiency, out-of-core execution, and query 
optimization capabilities found in parallel 
databases.
Ignite : ignite.incubator.apache.org 
Formerly known as GainGrid 
Only just entered incubation 
In-Memory data fabric 
High performance, distributed data 
management between heterogeneous data 
sources and user applications 
Stream processing and compute grid 
Structured and unstructured data
MetaModel: metamodel.incubator.a.o 
MetaModel is a data access framework, 
providing a common interface for exploration 
and querying of different types of datastores 
Safe & Uniform model for querying datastores 
Uses native support where available 
Implements it on top for other datastores 
RDBMS, CSV, NoSQL, XML, 
XLS, JSON etc
MRQL : mrql.incubator.apache.org 
Large scale, distributed data analysis system, built 
on Hadoop, Hama, Spark 
Query processing and optimisation 
SQL-like query for data analysis 
Works on raw data in-situ, such as XML, JSON, 
binary files, CSV 
Powerful query constructs avoid the need to write 
MapReduce code 
Write data analysis tasks as SQL-like
Parquet : parquet.incubator.a.o 
Columnar storage format for Hadoop 
Compressed, efficient columnar data 
representation for any Hadoop projects 
Allows complex nested data structures 
Based on record shredding/assembly 
algorithm from the Dremel Paper 
Per-column compressions / 
encodings
REEF : reef.incubator.apache.org 
REEF (Retainable Evaluator Execution 
Framework) is a scale-out computing fabric 
that eases the development of Big Data 
applications on top of resource managers 
such as Apache YARN and Mesos.
Samza : samza.incubator.apache.org 
Samza provides a system for processing 
stream data from publish-subscribe systems 
such as Apache Kafka. The developer 
writes a stream processing task, and 
executes it as a Samza job. Samza then 
routes messages between stream 
processing tasks and the publish-subscribe 
systems that the messages are addressed 
to.
Sentry : sentry.incubator.apache.org 
Sentry is a highly modular system for 
providing fine grained role based 
authorization to both data and metadata 
stored on an Apache Hadoop cluster.
Slider : slider.incubator.apache.org 
Slider is a collection of tools and technologies 
to package, deploy, and manage long 
running applications on Apache Hadoop 
YARN clusters.
Twill : twill.incubator.apache.org 
Twill is an abstraction over Apache Hadoop 
YARN that reduces the complexity of 
developing distributed applications, allowing 
developers to focus more on their business 
logic
UserGrid : usergrid.incubator.a.o 
Backend-as-a-Service “Baas” “mBaaS” 
Distributed NoSQL database + asset storage 
Mobile and server-side SDKs 
Rapidly build mobile and/or web applications, 
inc content driven ones 
Provides key services, eg users, 
queues, storage, queries etc
Loading and Querying
Pig – pig.apache.org 
Originally from Yahoo, entered the Incubator 
in 2007, graduated 2008, very widely used 
Provides an easy way to query data, which is 
compiled into Hadoop M/R (not SQL-like) 
Typically 1/20th of the lines of code, and 1/15th 
of the development time 
Optimising compiler – often only 
slower, occasionally faster!
Gora – gora.apache.org 
ORM Framework for (NoSQL) Column Stores 
Grew out of the Nutch project 
Supports HBase, Cassanda, Hypertable 
K/V: Voldermort, Redis 
HDFS Flat Files, plus basic SQL 
Data is stored in Avro (more later) 
Query with Pig, Lucene, Hive, Cascading, 
Hadoop M/R, or native Store code
Sqoop – sqoop.apache.org (no u!) 
Bulk data transfer tool for big data systems 
Hadoop (HDFS), HBase and Hive on one side 
SQL Databases on the other 
Can be used to import existing data into your 
big data cluster 
Or, export the results of a big data 
job out to your data wharehouse 
Generates code to work with data
Building MapReduce Jobs
Avro – avro.apache.org 
Language neutral data serialization 
Rich data structures (JSON based) 
Compact and fast binary data format 
Code generation optional for dynamic 
languages 
Supports RPC 
Data includes schema details
Avro – avro.apache.org 
Schema is always present – allows dynamic 
typing and smaller sizes 
Java, C, C++, C#, Python, Perl, Ruby, PHP 
Different languages can transparently talk to 
each other, and make RPC calls to each 
other 
Often faster than Thrift and ProtoBuf 
No streaming support though
Thrift – thrift.apache.org 
Java, C++, Python, PHP, Ruby, Erlang, Perl, 
Haskell, C#, JS, Cocoa, OCaml and more! 
From Facebook, at Apache since 2008 
Rich data structure, compiled down into 
suitable code 
RPC support too 
Streaming is available 
Worth reading the White Paper!
MRUnit – mrunit.apache.org 
Built on top of JUnit 
Checks Map, Reduce, then combined 
Provides test drivers for Hadoop 
Avoids you needing lots of boiler plate code 
to start/stop Hadoop 
Avoids brittle mock objects 
Handles multiple input K/Vs 
Counter checking
For the Cloud
Provider Independent Cloud APIs 
Lets you provision, manage and query Cloud 
services, without vendor lock-in 
Translates general calls to the specific (often 
proprietary) ones for a given cloud provider 
Work with remote and local cloud providers 
(almost) transparently
Provider Independent Cloud APIs 
Create, stop, start, reboot and destroy 
instances 
Control what's run on new instances 
List active instances 
Fetch available and active profiles 
EC2, Eycalyptos, Rackspace, RHEV, 
vSphere, Linode, OpenStack
LibCloud – libcloud.apache.org 
Python library (limited Java support) 
Very wide range of providers 
Script your cloud services 
from libcloud.compute.types import Provider 
from libcloud.compute.providers import get_driver 
EC2_ACCESS_ID = 'your access id' 
EC2_SECRET_KEY = 'your secret key' 
Driver = get_driver(Provider.EC2) 
conn = Driver(EC2_ACCESS_ID, EC2_SECRET_KEY) 
nodes = conn.list_nodes() 
# [<Node: uuid=..., state=3, public_ip=['1.1.1.1'], 
provider=EC2 ...>, ...]
DeltaCloud – deltacloud.apache.org 
REST API (xml) + web portal 
Major Cloud Providers, RHEV-M, vSphere 
<instances> 
<instance href="http://fancycloudprovider.com/api/instances/inst1" id='inst1'> 
<owner_id>larry</owner_id> 
<name>Production JBoss Instance</name> 
<image href="http://fancycloudprovider.com/api/images/img3"/> 
<hardware_profile href="http://fancycloudprovider.com/api/hardware_profiles/m1-small"/> 
<realm href="http://fancycloudprovider.com/api/realms/us"/> 
<state>RUNNING</state> 
<actions> 
<link rel="reboot" href="http://fancycloudprovider.com/api/instances/inst1/reboot"/> 
<link rel="stop" href="http://fancycloudprovider.com/api/instances/inst1/stop"/> 
</actions> 
<public_addresses> 
<address>inst1.larry.fancycloudprovider.com</address> 
</public_addresses> 
<private_addresses> 
<address>inst1.larry.internal</address> 
</private_addresses> 
</instance> 
</instances>
jclouds – jclouds.apache.org 
Apache jclouds is an open source multi-cloud 
toolkit for the Java platform that gives 
you the freedom to create applications that 
are portable across clouds while giving you 
full control to use cloud-specific features.
Building out your Solution
Tika : tika.apache.org 
Text and Metadata extraction 
Identify file type, language, encoding 
Extracts text as structured XHTML 
Consistent Metadata across formats 
Java library, CLI and Network Server 
SOLR integration 
Handles format differences for you
UIMA : uima.apache.org 
Unstructured Information analysis 
Lets you build a tool to extract information 
from unstructured data 
Language Ident, Segmentation, Entities etc 
Components in C++ and Java 
Network enabled – can spread work out 
across a cluster 
Helped IBM to win Jeopardy!
OpenNLP : opennlp.apache.org 
Natural Language Processing 
Various tools for sentence detection, 
tokenization, tagging, chunking, entity 
detection etc 
Maximum Entropy, Perception Based M-L 
UIMA likely to be better for a whole-solution 
OpenNLP good when integrating 
NLP into your own solution
cTakes : ctakes.apache.org 
Clinical Text Analysis & Knowledge Extraction 
Builds on top of UIMA and OpenNLP 
Extracts information from free text in electronic 
medical records (and OCR'd paper) 
Identifies named entities from common or 
custom dictionaries (eg UMLS) 
For each, produce attributes, eg 
subject, mapping, context
MINA : mina.apache.org 
Framework for writing scalable, high 
performance network apps in Java 
TCP and UDP, Client and Server 
Build non blocking, event driven networking 
code in Java 
MINA also provides pure Java SSH, XMPP, 
Web and FTP servers
Etch : etch.apache.org 
Framework for building, producing and 
consuming network services. 
Cross platform, language and transport 
Java, C#, C, C++, Go, JS, Python 
Produce a formal description of the exchange 
1-way, 2-way and realtime communication 
Scalable, high performance 
Heterogeneous systems
Commons : commons.apache.org 
Collection of libraries for Java projects 
Some historic, many still useful! 
Commons CLI – parameters / options 
Commons Codec – Base64, Hex, Phonetic 
Commons Compress – zip, tar, gz, bz2 
Commons Daemon – OS friendly start/stop 
Commons Pool – Object pools (db, conn etc)
JMeter : jmeter.apache.org 
Loading testing tool 
Performance test network services 
Defined a series of tasks, execute in parallel 
Talks to Web, SOAP, LDAP, JSM, JDBC etc 
Handy for checking how external resources 
and systems will hold up, once a 
big data system start to make 
heavy use of them!
Chemistry : chemistry.apache.org 
Java, Python, .Net, PHP, JS, Android, iOS 
interface to Content Management Systems 
Implements the OASIS CMIS spec 
Browse, read and write data in your content 
repositories 
Rich information and structure 
Supported by Alfresco, Microsoft, SAP, 
Adobe, EMC, OpenText and more
ManifoldCF : manifoldcf.apache.org 
Framework for content (mostly text) 
extraction from content repositories 
Aimed at indexing solutions, eg SOLR 
Connectors for reading and writing 
Simpler than Chemistry, but also works for 
CIFS, file systems, RSS etc 
Extract from SharePoint, FileNet, 
Documentum, LiveLink etc
Questions?
Thanks! 
Twitter - @Gagravarr 
Email – nick.burch@quanticate.com 
The Apache Software Foundation: 
http://www.apache.org/ 
Apache projects list: 
http://projects.apache.org/

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sqlaftab alam
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksLegacy Typesafe (now Lightbend)
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5Yan Zhou
 
The Challenges of SQL on Hadoop
The Challenges of SQL on HadoopThe Challenges of SQL on Hadoop
The Challenges of SQL on HadoopDataWorks Summit
 
Data Migration with Spark to Hive
Data Migration with Spark to HiveData Migration with Spark to Hive
Data Migration with Spark to HiveDatabricks
 
Spark,Hadoop,Presto Comparition
Spark,Hadoop,Presto ComparitionSpark,Hadoop,Presto Comparition
Spark,Hadoop,Presto ComparitionSandish Kumar H N
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons Provectus
 
How to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsHow to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsDataWorks Summit
 
Applied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jApplied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jDataWorks Summit
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKSkills Matter
 
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...DataWorks Summit
 
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Simplilearn
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark Hortonworks
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBill Liu
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop GuideSimplilearn
 
Hw09 Sqoop Database Import For Hadoop
Hw09   Sqoop Database Import For HadoopHw09   Sqoop Database Import For Hadoop
Hw09 Sqoop Database Import For HadoopCloudera, Inc.
 

Was ist angesagt? (20)

Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sql
 
Spark sql
Spark sqlSpark sql
Spark sql
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5
 
Spark sql
Spark sqlSpark sql
Spark sql
 
The Challenges of SQL on Hadoop
The Challenges of SQL on HadoopThe Challenges of SQL on Hadoop
The Challenges of SQL on Hadoop
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Data Migration with Spark to Hive
Data Migration with Spark to HiveData Migration with Spark to Hive
Data Migration with Spark to Hive
 
Spark,Hadoop,Presto Comparition
Spark,Hadoop,Presto ComparitionSpark,Hadoop,Presto Comparition
Spark,Hadoop,Presto Comparition
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons
 
How to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsHow to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and Analytics
 
Applied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jApplied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4j
 
Node Labels in YARN
Node Labels in YARNNode Labels in YARN
Node Labels in YARN
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
 
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
 
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
 
Hw09 Sqoop Database Import For Hadoop
Hw09   Sqoop Database Import For HadoopHw09   Sqoop Database Import For Hadoop
Hw09 Sqoop Database Import For Hadoop
 

Ähnlich wie The other Apache Technologies your Big Data solution needs

The other Apache technologies your big data solution needs!
The other Apache technologies your big data solution needs!The other Apache technologies your big data solution needs!
The other Apache technologies your big data solution needs!gagravarr
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data scienceAjay Ohri
 
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...Frank Munz
 
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Dataconomy Media
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big dealeduarderwee
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellKhalid Imran
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...DataWorks Summit/Hadoop Summit
 
Big data talking stories in Healthcare
Big data talking stories in Healthcare Big data talking stories in Healthcare
Big data talking stories in Healthcare Mostafa
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big pictureJ S Jodha
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSjavier ramirez
 
RDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkRDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkLaxmi8
 
Building Big data solutions in Azure
Building Big data solutions in AzureBuilding Big data solutions in Azure
Building Big data solutions in AzureMostafa
 

Ähnlich wie The other Apache Technologies your Big Data solution needs (20)

The other Apache technologies your big data solution needs!
The other Apache technologies your big data solution needs!The other Apache technologies your big data solution needs!
The other Apache technologies your big data solution needs!
 
Ess1000 glossary
Ess1000 glossaryEss1000 glossary
Ess1000 glossary
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop Introduction
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
Hadoop
HadoopHadoop
Hadoop
 
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
 
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big deal
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : Nutshell
 
In15orlesss hadoop
In15orlesss hadoopIn15orlesss hadoop
In15orlesss hadoop
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
 
Big data talking stories in Healthcare
Big data talking stories in Healthcare Big data talking stories in Healthcare
Big data talking stories in Healthcare
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
 
Big Data , Big Problem?
Big Data , Big Problem?Big Data , Big Problem?
Big Data , Big Problem?
 
RDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkRDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs Spark
 
Building Big data solutions in Azure
Building Big data solutions in AzureBuilding Big data solutions in Azure
Building Big data solutions in Azure
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
963
963963
963
 

Mehr von gagravarr

Turning XML to XLS on the JVM, without loosing your Sanity, with Groovy
Turning XML to XLS on the JVM, without loosing your Sanity, with GroovyTurning XML to XLS on the JVM, without loosing your Sanity, with Groovy
Turning XML to XLS on the JVM, without loosing your Sanity, with Groovygagravarr
 
But we're already open source! Why would I want to bring my code to Apache?
But we're already open source! Why would I want to bring my code to Apache?But we're already open source! Why would I want to bring my code to Apache?
But we're already open source! Why would I want to bring my code to Apache?gagravarr
 
What's new with Apache Tika?
What's new with Apache Tika?What's new with Apache Tika?
What's new with Apache Tika?gagravarr
 
What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzz...
What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzz...What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzz...
What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzz...gagravarr
 
The Apache Way
The Apache WayThe Apache Way
The Apache Waygagravarr
 
How Big is Big – Tall, Grande, Venti Data?
How Big is Big – Tall, Grande, Venti Data?How Big is Big – Tall, Grande, Venti Data?
How Big is Big – Tall, Grande, Venti Data?gagravarr
 
If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!gagravarr
 
But We're Already Open Source! Why Would I Want To Bring My Code To Apache?
But We're Already Open Source! Why Would I Want To Bring My Code To Apache?But We're Already Open Source! Why Would I Want To Bring My Code To Apache?
But We're Already Open Source! Why Would I Want To Bring My Code To Apache?gagravarr
 
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...gagravarr
 
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...gagravarr
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-endgagravarr
 
Apache Content Technologies
Apache Content TechnologiesApache Content Technologies
Apache Content Technologiesgagravarr
 

Mehr von gagravarr (12)

Turning XML to XLS on the JVM, without loosing your Sanity, with Groovy
Turning XML to XLS on the JVM, without loosing your Sanity, with GroovyTurning XML to XLS on the JVM, without loosing your Sanity, with Groovy
Turning XML to XLS on the JVM, without loosing your Sanity, with Groovy
 
But we're already open source! Why would I want to bring my code to Apache?
But we're already open source! Why would I want to bring my code to Apache?But we're already open source! Why would I want to bring my code to Apache?
But we're already open source! Why would I want to bring my code to Apache?
 
What's new with Apache Tika?
What's new with Apache Tika?What's new with Apache Tika?
What's new with Apache Tika?
 
What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzz...
What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzz...What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzz...
What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzz...
 
The Apache Way
The Apache WayThe Apache Way
The Apache Way
 
How Big is Big – Tall, Grande, Venti Data?
How Big is Big – Tall, Grande, Venti Data?How Big is Big – Tall, Grande, Venti Data?
How Big is Big – Tall, Grande, Venti Data?
 
If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!
 
But We're Already Open Source! Why Would I Want To Bring My Code To Apache?
But We're Already Open Source! Why Would I Want To Bring My Code To Apache?But We're Already Open Source! Why Would I Want To Bring My Code To Apache?
But We're Already Open Source! Why Would I Want To Bring My Code To Apache?
 
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
 
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
 
Apache Content Technologies
Apache Content TechnologiesApache Content Technologies
Apache Content Technologies
 

Kürzlich hochgeladen

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 

Kürzlich hochgeladen (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 

The other Apache Technologies your Big Data solution needs

  • 1. The other Apache Technologies your Big Data solution needs! Nick Burch CTO, Quanticate
  • 2. Nick Burch CTO, Quanticate Copyright © Quanticate 2013
  • 3. The Apache Software Foundation Apache Technologies as in the ASF 154 Top Level Projects 33 Incubating Projects (> 100 past ones) Y is the only letter we lack A, C, O, S & T are popular letters, >= 12 each Meritocratic, Community driven Open Source
  • 4. A Growing Foundation... November 2014: 154 Top Level Projects 33 Incubating Projects June 2013: 117 Top Level Projects 37 Incubator Projects June 2011: 91 Top Level Projects 59 Incubating Projects
  • 5. What we're not covering
  • 6. Projects not being covered include Cassandra cassandra.apache.org CouchDB couchdb.apache.org Flume flume.apache.org Giraph giraph.apache.org Hadoop hadoop.apache.org Hive hive.apache.org HBase hbase.apache.org
  • 7. Projects not being covered include Lucene lucene.apache.org Mahout mahout.apache.org Mesos mesos.apache.org Nutch nutch.apache.org OODT oodt.apache.org Spark Storm etc!
  • 8. What we are looking at
  • 9. Talk Structure New Things from the Incubator Established Big Data projects not to forget about Non Big-Data projects that can help flesh out an overall solution Many projects to cover – this is only an overview!
  • 10. Audience Participation: A show of hands...
  • 11. Making the most of the talk Slides are available on the Conference site Lots of information and projects to cover Just an overview, see project sites for more Try to take note / remember the projects you think will matter to you Don't try to take notes on everything – there's too much!
  • 12. What's new(ish) in the Apache Incubator?
  • 13. Argus : argus.incubator.apache.org Data Security framework for Hadoop Central administration of all security related tasks, with central UI + Rest API Fine-grained model controlling access to data, components, actions, operations Centralised Auditing and Monitoring Role based, attribute based etc Standardised authz for Hadoop
  • 14. Aurora : aurora.incubator.apache.org Service Scheduler on top of Mesos To run+manage long-running services Deployment and scheduling of jobs Uses a DSL to define services Heath checking, failure monitoring, restarting etc Aurora jobs are made up of many different Mesos tasks
  • 15. Blur : incubator.apache.org/blur Search engine for massive amounts of structured data at high speed Query rich, structured data model US Census example: show me all of the people in the US who were born in Alaska between 1940 and 1970 who are now living in Kansas. Built on Apache Hadoop
  • 16. Brooklyn : brooklyn.incubator.a.o Framework for modelling, monitoring and managing applications from blueprints Blueprints describe app from components, many built in, using bash, Java, Chef etc Deployed across multiple machines automatically: cloud, private, docker etc Scale, replace, restart etc Metrics monitored
  • 17. Calcite : calcite.incubator.a.o Dynamic Data Management framework Highly customisable engine for planning and parsing queries on data from a wide variety of formats SQL interface for data not in relational databases, with query optimisation Complementary to Hadoop and NoSQL systems, esp. combinations of them Formerly known as Optiq
  • 18. DataFu : datafu.incubator.apache.org Collection of libraries for working with large-scale data in Hadoop, for data mining, statistics etc Provides Map-Reduce jobs and high level language functions for data analysis, eg statistics calculations Incremental processing with Hadoop with sliding data, eg computing daily and weekly statistics
  • 19. Drill : incubator.apache.org/drill Drill is a distributed system for interactive analysis of large-scale datasets, inspired by Google's Dremel Low-latency queries natively on rapidly evening multi-structured datasets Aiming to be able to scale to 10k+ servers, petabytes or data and trillions of records, all within seconds
  • 20. Falcon : falcon.incubator.apache.org Data management and processing framework built on Hadoop Quickly onboard data + its processing into a Hadoop based system Declarative definition of data endpoints and processing rules, inc dependencies Orchestrates data pipelines, management, lifecycle, motion etc
  • 21. Flink : flink.incubator.apache.org Flink is an open source system for expressive, declarative, fast, and efficient data analysis. Flink combines the scalability and programming flexibility of distributed MapReduce-like platforms with the efficiency, out-of-core execution, and query optimization capabilities found in parallel databases.
  • 22. Ignite : ignite.incubator.apache.org Formerly known as GainGrid Only just entered incubation In-Memory data fabric High performance, distributed data management between heterogeneous data sources and user applications Stream processing and compute grid Structured and unstructured data
  • 23. MetaModel: metamodel.incubator.a.o MetaModel is a data access framework, providing a common interface for exploration and querying of different types of datastores Safe & Uniform model for querying datastores Uses native support where available Implements it on top for other datastores RDBMS, CSV, NoSQL, XML, XLS, JSON etc
  • 24. MRQL : mrql.incubator.apache.org Large scale, distributed data analysis system, built on Hadoop, Hama, Spark Query processing and optimisation SQL-like query for data analysis Works on raw data in-situ, such as XML, JSON, binary files, CSV Powerful query constructs avoid the need to write MapReduce code Write data analysis tasks as SQL-like
  • 25. Parquet : parquet.incubator.a.o Columnar storage format for Hadoop Compressed, efficient columnar data representation for any Hadoop projects Allows complex nested data structures Based on record shredding/assembly algorithm from the Dremel Paper Per-column compressions / encodings
  • 26. REEF : reef.incubator.apache.org REEF (Retainable Evaluator Execution Framework) is a scale-out computing fabric that eases the development of Big Data applications on top of resource managers such as Apache YARN and Mesos.
  • 27. Samza : samza.incubator.apache.org Samza provides a system for processing stream data from publish-subscribe systems such as Apache Kafka. The developer writes a stream processing task, and executes it as a Samza job. Samza then routes messages between stream processing tasks and the publish-subscribe systems that the messages are addressed to.
  • 28. Sentry : sentry.incubator.apache.org Sentry is a highly modular system for providing fine grained role based authorization to both data and metadata stored on an Apache Hadoop cluster.
  • 29. Slider : slider.incubator.apache.org Slider is a collection of tools and technologies to package, deploy, and manage long running applications on Apache Hadoop YARN clusters.
  • 30. Twill : twill.incubator.apache.org Twill is an abstraction over Apache Hadoop YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their business logic
  • 31. UserGrid : usergrid.incubator.a.o Backend-as-a-Service “Baas” “mBaaS” Distributed NoSQL database + asset storage Mobile and server-side SDKs Rapidly build mobile and/or web applications, inc content driven ones Provides key services, eg users, queues, storage, queries etc
  • 33. Pig – pig.apache.org Originally from Yahoo, entered the Incubator in 2007, graduated 2008, very widely used Provides an easy way to query data, which is compiled into Hadoop M/R (not SQL-like) Typically 1/20th of the lines of code, and 1/15th of the development time Optimising compiler – often only slower, occasionally faster!
  • 34. Gora – gora.apache.org ORM Framework for (NoSQL) Column Stores Grew out of the Nutch project Supports HBase, Cassanda, Hypertable K/V: Voldermort, Redis HDFS Flat Files, plus basic SQL Data is stored in Avro (more later) Query with Pig, Lucene, Hive, Cascading, Hadoop M/R, or native Store code
  • 35. Sqoop – sqoop.apache.org (no u!) Bulk data transfer tool for big data systems Hadoop (HDFS), HBase and Hive on one side SQL Databases on the other Can be used to import existing data into your big data cluster Or, export the results of a big data job out to your data wharehouse Generates code to work with data
  • 37. Avro – avro.apache.org Language neutral data serialization Rich data structures (JSON based) Compact and fast binary data format Code generation optional for dynamic languages Supports RPC Data includes schema details
  • 38. Avro – avro.apache.org Schema is always present – allows dynamic typing and smaller sizes Java, C, C++, C#, Python, Perl, Ruby, PHP Different languages can transparently talk to each other, and make RPC calls to each other Often faster than Thrift and ProtoBuf No streaming support though
  • 39. Thrift – thrift.apache.org Java, C++, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, JS, Cocoa, OCaml and more! From Facebook, at Apache since 2008 Rich data structure, compiled down into suitable code RPC support too Streaming is available Worth reading the White Paper!
  • 40. MRUnit – mrunit.apache.org Built on top of JUnit Checks Map, Reduce, then combined Provides test drivers for Hadoop Avoids you needing lots of boiler plate code to start/stop Hadoop Avoids brittle mock objects Handles multiple input K/Vs Counter checking
  • 42. Provider Independent Cloud APIs Lets you provision, manage and query Cloud services, without vendor lock-in Translates general calls to the specific (often proprietary) ones for a given cloud provider Work with remote and local cloud providers (almost) transparently
  • 43. Provider Independent Cloud APIs Create, stop, start, reboot and destroy instances Control what's run on new instances List active instances Fetch available and active profiles EC2, Eycalyptos, Rackspace, RHEV, vSphere, Linode, OpenStack
  • 44. LibCloud – libcloud.apache.org Python library (limited Java support) Very wide range of providers Script your cloud services from libcloud.compute.types import Provider from libcloud.compute.providers import get_driver EC2_ACCESS_ID = 'your access id' EC2_SECRET_KEY = 'your secret key' Driver = get_driver(Provider.EC2) conn = Driver(EC2_ACCESS_ID, EC2_SECRET_KEY) nodes = conn.list_nodes() # [<Node: uuid=..., state=3, public_ip=['1.1.1.1'], provider=EC2 ...>, ...]
  • 45. DeltaCloud – deltacloud.apache.org REST API (xml) + web portal Major Cloud Providers, RHEV-M, vSphere <instances> <instance href="http://fancycloudprovider.com/api/instances/inst1" id='inst1'> <owner_id>larry</owner_id> <name>Production JBoss Instance</name> <image href="http://fancycloudprovider.com/api/images/img3"/> <hardware_profile href="http://fancycloudprovider.com/api/hardware_profiles/m1-small"/> <realm href="http://fancycloudprovider.com/api/realms/us"/> <state>RUNNING</state> <actions> <link rel="reboot" href="http://fancycloudprovider.com/api/instances/inst1/reboot"/> <link rel="stop" href="http://fancycloudprovider.com/api/instances/inst1/stop"/> </actions> <public_addresses> <address>inst1.larry.fancycloudprovider.com</address> </public_addresses> <private_addresses> <address>inst1.larry.internal</address> </private_addresses> </instance> </instances>
  • 46. jclouds – jclouds.apache.org Apache jclouds is an open source multi-cloud toolkit for the Java platform that gives you the freedom to create applications that are portable across clouds while giving you full control to use cloud-specific features.
  • 47. Building out your Solution
  • 48. Tika : tika.apache.org Text and Metadata extraction Identify file type, language, encoding Extracts text as structured XHTML Consistent Metadata across formats Java library, CLI and Network Server SOLR integration Handles format differences for you
  • 49. UIMA : uima.apache.org Unstructured Information analysis Lets you build a tool to extract information from unstructured data Language Ident, Segmentation, Entities etc Components in C++ and Java Network enabled – can spread work out across a cluster Helped IBM to win Jeopardy!
  • 50. OpenNLP : opennlp.apache.org Natural Language Processing Various tools for sentence detection, tokenization, tagging, chunking, entity detection etc Maximum Entropy, Perception Based M-L UIMA likely to be better for a whole-solution OpenNLP good when integrating NLP into your own solution
  • 51. cTakes : ctakes.apache.org Clinical Text Analysis & Knowledge Extraction Builds on top of UIMA and OpenNLP Extracts information from free text in electronic medical records (and OCR'd paper) Identifies named entities from common or custom dictionaries (eg UMLS) For each, produce attributes, eg subject, mapping, context
  • 52. MINA : mina.apache.org Framework for writing scalable, high performance network apps in Java TCP and UDP, Client and Server Build non blocking, event driven networking code in Java MINA also provides pure Java SSH, XMPP, Web and FTP servers
  • 53. Etch : etch.apache.org Framework for building, producing and consuming network services. Cross platform, language and transport Java, C#, C, C++, Go, JS, Python Produce a formal description of the exchange 1-way, 2-way and realtime communication Scalable, high performance Heterogeneous systems
  • 54. Commons : commons.apache.org Collection of libraries for Java projects Some historic, many still useful! Commons CLI – parameters / options Commons Codec – Base64, Hex, Phonetic Commons Compress – zip, tar, gz, bz2 Commons Daemon – OS friendly start/stop Commons Pool – Object pools (db, conn etc)
  • 55. JMeter : jmeter.apache.org Loading testing tool Performance test network services Defined a series of tasks, execute in parallel Talks to Web, SOAP, LDAP, JSM, JDBC etc Handy for checking how external resources and systems will hold up, once a big data system start to make heavy use of them!
  • 56. Chemistry : chemistry.apache.org Java, Python, .Net, PHP, JS, Android, iOS interface to Content Management Systems Implements the OASIS CMIS spec Browse, read and write data in your content repositories Rich information and structure Supported by Alfresco, Microsoft, SAP, Adobe, EMC, OpenText and more
  • 57. ManifoldCF : manifoldcf.apache.org Framework for content (mostly text) extraction from content repositories Aimed at indexing solutions, eg SOLR Connectors for reading and writing Simpler than Chemistry, but also works for CIFS, file systems, RSS etc Extract from SharePoint, FileNet, Documentum, LiveLink etc
  • 59. Thanks! Twitter - @Gagravarr Email – nick.burch@quanticate.com The Apache Software Foundation: http://www.apache.org/ Apache projects list: http://projects.apache.org/