The need to process huge volumes of data is increasing day by day. Processing data at this scale involves compute, network and storage. In terms of Big Data, what does it take to innovate, and what is innovation in the end? This talk provides high-level details on the need for big data and the capabilities of the MapR Converged Data Platform.
Speaker: Vijaya Saradhi Uppaluri, Technical Director at MapR Technologies
2. Presentation Overview
1. Big data. Why is it really big?
2. Technologies that are available today.
3. Need for a Converged Data Platform.
4. Innovation!! What does it take?
3. Video Surveillance
A 704 x 576 resolution CCTV camera generates roughly 1 GB per hour.
Video surveillance data is estimated at 6000 PB in 2017.
Surge in Biometric applications.
Who stole my jacket? I forgot it on the desk. The office has CCTV!!
4. Autonomous Cars
A driverless car generates roughly 1 GB/sec.
The expectation is 2 PB per car.
The car goes on a trip and comes back safely. “Did the car drive well?”
What if someone files a lawsuit after six months?
5. Aadhaar
Biometric identity for Indian citizens.
~5 MB per citizen.
Maps to around 15 PB of raw data.
100 million authentications per day. Each authentication is roughly 4 KB or more of data.
Sub-second response needed.
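The authentication figures above translate into a daily volume and an average request rate. A minimal sketch of that arithmetic, assuming decimal units (1 KB = 1000 bytes) and an even spread of requests over the day; real traffic will have peaks well above the average:

```python
# Figures quoted on the slide. Decimal units and an even spread of
# requests across the day are assumptions made for this sketch.
AUTHS_PER_DAY = 100_000_000
BYTES_PER_AUTH = 4_000  # "roughly 4 KB" per authentication
SECONDS_PER_DAY = 24 * 60 * 60


def daily_auth_volume_gb(auths_per_day=AUTHS_PER_DAY, bytes_per_auth=BYTES_PER_AUTH):
    """Total authentication data per day, in gigabytes."""
    return auths_per_day * bytes_per_auth / 1e9


def average_auths_per_second(auths_per_day=AUTHS_PER_DAY):
    """Average request rate if authentications were spread evenly over the day."""
    return auths_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"Daily volume: {daily_auth_volume_gb():.0f} GB")
    print(f"Average rate: {average_auths_per_second():.0f} authentications/sec")
```

Under these assumptions the system takes in about 400 GB of authentication data per day and must answer more than a thousand requests every second, each within the sub-second budget.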
6. Aadhaar Continued ...
Enrolled data moves from hot to cold. Data temperature varies.
Data analytics are needed.
https://uidai.gov.in/images/FrontPageUpdates/uid_doc_30012012.pdf
Data is stored on MapR technology.
https://uidai.gov.in/images/AadhaarTechnologyArchitecture_March2014.pdf
7. Retailers View
Walmart needs to process 2.4 PB per hour.
Insights must be drawn from the data within 30 to 40 minutes.
Errors in insights caused by bugs and miscalculation burn money.
Need to model 40 PB of recent transactional data.
8. Retailers View Continued ...
Data insights revealed that two particular stores were not selling popular cookies. That is not easy to find!!
Alert when a particular metric threshold is violated. This helps to reduce the turnaround time.
200 billion rows of transaction data have to be processed.
9. Retailer Needs ….
Building a 360-degree view of the customer. Measuring brand sentiment.
Creating customized promotions.
Improving store layout. Layout matters in making you purchase more!!
Click streams.
Inventory management.
Selling baby lotions to pregnant women; selling pizzas when tracking shows that the weather is bad.
10. BIG DATA: Technologies Primer
Search for “GOD” on a laptop with a 1 terabyte drive. Assume a throughput of 100 MB/sec.
How to speed up the search for “GOD”?
Add more CPUs. Okay, how many? 128 or 256 or 512?
Add more memory. How much? How many DIMMs? 16 or 64 or 128?
Tired!! Ah, I realize now that a single machine cannot solve the problem.
Do it with multiple machines. Maybe commodity machines, but scaled out in a huge way.
How to distribute storage?
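The search example comes down to simple arithmetic: a full scan is bounded by drive throughput, not by CPU or memory. A rough sketch, assuming decimal units, a perfectly even split of the data, and that every machine reads its own share from its own drive at the same 100 MB/sec (coordination overhead is ignored):

```python
DRIVE_BYTES = 1_000_000_000_000      # 1 TB drive, decimal units (assumption)
THROUGHPUT_BYTES_PER_SEC = 100_000_000  # 100 MB/sec, as stated on the slide


def scan_seconds(total_bytes, bytes_per_second, machines=1):
    """Ideal time to read all the data once when it is split evenly across machines."""
    return total_bytes / (bytes_per_second * machines)


if __name__ == "__main__":
    for machines in (1, 10, 100):
        seconds = scan_seconds(DRIVE_BYTES, THROUGHPUT_BYTES_PER_SEC, machines)
        print(f"{machines:>3} machine(s): {seconds:,.0f} s ({seconds / 3600:.2f} h)")
```

A single drive needs 10,000 seconds (about 2.8 hours) just to read the data once, however many CPUs or DIMMs the laptop has. Spreading the same data across 100 machines brings the ideal scan time down to 100 seconds, which is why the storage itself has to be distributed.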
11. Technologies : Compute, Storage and Network
Scale by moving compute close to data.
Store data efficiently on multiple nodes.
No compromise on reliability.
No compromise on availability.
Automatically take care of the addition and removal of nodes.
Help to extract the performance characteristics of the underlying devices.
Network:
Do not let compute operate on data over the network.
12. Technologies Available Today
Hadoop! What exactly is Hadoop?
Map-Reduce! When is this the right choice?
YARN? Is it a refined Map-Reduce? Tighter control over resource management and job scheduling/monitoring?
It looks like the Hadoop core is distributed storage and Map-Reduce is the compute engine. Is the processing real time? Are we good to go??
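To make the Map-Reduce question concrete, here is a single-process sketch of the programming model using word count. It is plain Python, not the Hadoop API; the function names are made up for illustration, and a real framework would run the map and reduce steps on many nodes next to the data:

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data moves from hot to cold"]))
```

The model fits batch jobs where each record can be processed independently and the results combined by key. Nothing is produced until the whole input has been mapped and shuffled, which is why the real-time question on this slide remains open.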
13. Technologies continued ...
How to push data to Hadoop storage? Use Flume?
How to push data from an existing application that writes to a legacy file system? Does it have to be rebuilt?
Can the entire big data storage (aka Hadoop) be accessed over NFS?
Okay, we somehow manage to get data into Hadoop. Does it solve all needs? Is there a way to address data as key-value pairs?
14. Unstructured Data as Key-Value Pairs
Why do we need unstructured data as key-value pairs?
Aadhaar needs to store biometric signatures, addresses, fingerprints, etc.
Retailers need to show various attributes of their products: images, technical specifications, tables, columns, reviews, etc.
IoT (Internet of Things) devices generate a lot of unstructured data.
How to store and process them? More technologies are needed ...
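A small sketch of what "unstructured data as key-value pairs" can look like for the retailer case. The class below is an in-memory stand-in invented for illustration, not the client API of HBase, Cassandra or MapR DB; the key format and the product fields are likewise hypothetical:

```python
import json


class KeyValueStore:
    """In-memory stand-in for a key-value table (hypothetical, for illustration)."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        """Store any JSON-serializable value under a string key."""
        self._rows[key] = json.dumps(value)

    def get(self, key):
        """Return the stored value, or None if the key is absent."""
        raw = self._rows.get(key)
        return None if raw is None else json.loads(raw)


if __name__ == "__main__":
    store = KeyValueStore()
    # Records under different keys need not share the same attributes.
    store.put("product:1001", {"name": "cookies", "reviews": ["tasty"], "image": "cookies.jpg"})
    store.put("product:1002", {"name": "laptop", "specs": {"ram_gb": 16, "drive_tb": 1}})
    print(store.get("product:1002")["specs"])
```

Each value carries whatever attributes that item has, so no fixed schema has to be agreed on up front. The price is that the store only knows how to look up by key, which leads to the questions about JOINs and queries on the following slides.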
15. Big Table
HBase. Tries to address the key-value pair problem.
Cassandra. Tries to address the key-value pair problem.
MapR DB. Addresses the key-value pair problem.
Is there a JOIN operation on these tables? Can there be atomic operations across different rows? How about calling the above NoSQL DBs?
How can one decide on the right technology?
16. NOSQL DB
MongoDB.
CouchDB.
MapR DB - JSON
Why are there still more databases? What more do these tables provide?
Is querying data still a challenge?
18. Real Time Analysis of Data
Hadoop, connectors to Hadoop, unstructured key-value pairs, Big Table, SQL engines. Ready to go?
Is there a need to process data as soon as it arrives?
Maybe streams are needed. Streams are like pipes!!
APACHE KAFKA
APACHE STORM
APACHE FLINK
MAPR STREAMS
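The "streams are like pipes" idea can be sketched with Python generators: each stage pulls one event at a time from the stage before it and passes results on immediately. This is only an analogy in plain Python, not the API of Kafka, Storm, Flink or MapR Streams; the event shape and the threshold are made up for the example:

```python
def source(events):
    """Yield events one by one, standing in for data arriving over time."""
    for event in events:
        yield event


def only_above(stream, threshold):
    """Pass along only the events whose value exceeds the threshold."""
    for event in stream:
        if event["value"] > threshold:
            yield event


def running_count(stream):
    """Attach a running count to each event as it flows through."""
    count = 0
    for event in stream:
        count += 1
        yield count, event


if __name__ == "__main__":
    readings = [{"value": 1}, {"value": 5}, {"value": 7}]
    pipeline = running_count(only_above(source(readings), threshold=3))
    for count, event in pipeline:
        print(count, event)
```

Every event is handled as soon as it enters the pipe, without waiting for the full data set, which is the contrast with the batch model. A real streaming system adds what this sketch leaves out: durable storage of events, multiple producers and consumers, and distribution across nodes.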
19. AI, GRAPH, ...
Need to represent data as a graph.
Apache Giraph
Machine learning.
Apache Mahout
20. Platform
Purchased 1000 nodes.
Have to connect several pieces of software to make sense of the data.
IT needs a standard platform that runs day after day.
Development and business need continuous engagement with new tools and new software.
Security and fraud detection keep changing day by day.
What to do? Do I need virtualization software?
21. Virtualization
Go for existing virtualization techniques? Are they expensive?
How about Linux Containers?
How about scheduling Containers? Do we need scheduling software?
Apache Mesos
Kubernetes
How do I provision storage for containers?
Carve out disk independently for each container?
Is there a way to plug in storage from any node in the cluster to a container running on any node?
22. Performance and Security Problems
A 1000-node cluster is not performing well.
Back to the Big Data problem again.
Wade through the logs of 1000 nodes to identify the issue?
Security.
Is data access kept confidential?
Authentication and authorization are a must. Are they the same across all the software?
Is data encrypted on the wire?
DoS problems.
23. Multi Tenancy
Have 1000 tenants working on a 1000-node cluster.
How to provision storage, compute and network?
Is this going to be like the Amazon cloud? Does each enterprise have the scale and capacity to develop Amazon-like cloud software?
Is there a way for tenants to share data?
24. Hot and Cold Data
As time moves forward, data can become cold.
A need may arise to keep hot data on solid state drives.
How to retain cold data?
Move it to the cloud.
Does this need yet another piece of software?
Is there a way to watch the attributes of data that has moved to the cloud? Let’s say the file is /A/B/C. Can one see the time when C was modified while the data stays in the cloud?
Is there a way to dynamically move data between solid state drives and hard disk drives?
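One way to picture data temperature is a policy that picks a storage tier from the time since last access. The sketch below is purely illustrative: the tier names and the 7-day and 90-day windows are hypothetical values chosen for the example, not settings of any platform mentioned in this talk:

```python
from datetime import datetime, timedelta

HOT_WINDOW = timedelta(days=7)    # hypothetical: accessed within a week is "hot"
WARM_WINDOW = timedelta(days=90)  # hypothetical: accessed within ~3 months is "warm"


def choose_tier(last_access, now):
    """Pick a storage tier for a file from how long ago it was last accessed."""
    age = now - last_access
    if age <= HOT_WINDOW:
        return "ssd"
    if age <= WARM_WINDOW:
        return "hdd"
    return "cloud"


if __name__ == "__main__":
    now = datetime(2017, 6, 1)
    for days in (1, 30, 365):
        print(days, "days old ->", choose_tier(now - timedelta(days=days), now))
```

A policy like this only decides where data should live. The questions on the slide are about the harder part: moving the data without another piece of software, and still being able to see attributes such as the modification time of /A/B/C after the move.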
25. Reliability: Does it mean 3-way replication?
Data reliability by and large means 3-way replication.
Petabytes of data replicated three ways waste storage.
How to eliminate the waste?
A platform should try to represent data in an erasure-coded format (probably 1.5x).
Yet while storing data in erasure-coded format, it should still allow the data to be modified if the need arises.
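The difference between 3x and 1.5x is easy to quantify. A minimal sketch of the storage arithmetic, assuming an erasure-coding layout of 4 data fragments plus 2 parity fragments; that layout is an illustrative choice that happens to give 1.5x, not something stated in the talk:

```python
def replication_overhead(copies):
    """Raw bytes stored per logical byte with N full copies."""
    return float(copies)


def erasure_overhead(data_fragments, parity_fragments):
    """Raw bytes stored per logical byte with k data + m parity fragments."""
    return (data_fragments + parity_fragments) / data_fragments


def raw_storage_pb(logical_pb, overhead):
    """Raw capacity needed to hold a given amount of logical data."""
    return logical_pb * overhead


if __name__ == "__main__":
    logical_pb = 15  # the raw-data figure quoted on the Aadhaar slide
    print("3-way replication:", raw_storage_pb(logical_pb, replication_overhead(3)), "PB")
    print("4+2 erasure coding:", raw_storage_pb(logical_pb, erasure_overhead(4, 2)), "PB")
```

For 15 PB of logical data, 3-way replication needs 45 PB of raw capacity while the 4+2 layout needs 22.5 PB. The savings come at a cost: updating erasure-coded data means recomputing parity, which is why the ability to modify data in that format is called out as a requirement.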
26. IoT Devices : Edge Clusters
IoT devices generate a lot of data.
Data from each IoT device has to be processed and stored with high reliability to meet government regulations.
IoT devices have to process data.
We know a single machine is limited in processing data by its CPUs, memory and hard disks.
A single machine also poses data reliability problems if the drive or CPU goes bad.
Does this call for a cluster near the IoT devices? How can we do it? A NUC (Next Unit of Computing) cluster may be the answer!!
27. IoT Edge Clusters
Process data and push it to the centralized cluster.
Access data in the centralized cluster and the local cluster when the need arises.
Unified global namespace access is a must.
Ability to stream data from the edge cluster to the centralized cluster.
Edge cluster applications may not be sophisticated. They may have to write data with standard file system calls.
Can the software platform we choose provide edge cluster processing?
28. Application Data Access Model
Table Format.
Big data files. Hadoop files (write once, read many) or MapR files (readable and writable).
Object Store.
Flat name space.
Data is accessed as objects with strict SLAs.
Used to store videos, images, etc.
29. Converged Data Platform
Needed as a Big Data store.
Ability to support unstructured key-value pairs.
Ability to support data with SQL engines like Drill, Hive, etc.
Ability to support real time streaming of data.
Ability to support container virtualization.
Ability to support applications accessing data through objects.
Ability to support global namespace for IoT Edge Clusters.
30. Converged Data Platform Continued ...
Ability to support Multi Tenancy.
Ability to ensure security across several users and tenants.
Ability to provision CPU, Storage and network across tenants or users.
Ability to support different temperatures of the data.
Ability to move data between cloud and the cluster.
31. Innovation
Is innovation a function of knowledge?
Isn’t knowledge a function of time?
What promotes innovation?
Salary?
Stock?
Recognition?
Peer competition?
32. Innovation Continued ...
Innovation needs an innocent mind.
How can one be innocent in this world?
Is there a way the mind can be made innocent?
Recognizing innovation is innovation.
33. Questions
I may not be able to answer all your questions!!
We can investigate the questions together!! Not alone.