Big Data is on everyone's lips, but what technical solutions are available to deal with it? This talk gives a brief overview of three families of solutions: distributed filesystems, NoSQL databases, and end-to-end solutions that also handle computation.
Big Data - SysFera presentation at the CSCI
1. 29.03.12 SysFera
Big Data
Technologies
SysFera
Benjamin Depardon
2. 29.03.12 SysFera
SysFera
• 2001: Research project from the Graal team
(Inria/ENS)
– DIET: grid middleware
• 2007: SysFera-DS used within the Décrypthon
project
– Used in production 24/7/365 since then
– Selected by IBM to replace Univa-UD
• 2010: Creation of SysFera, INRIA spin-off
• 2012: A team of 14 (R&D: 4 engineers and 5 PhDs)
– Supported by two experts from INRIA and ENS
– SysFera-DS
3. 29.03.12 SysFera
What is Big Data?
• All kinds of data
• Valuable insight, but difficult to
extract
• Several dimensions
– Variety
• Structured/unstructured
• Text, audio, video…
– Velocity
• Time sensitivity
• Streaming
– Volume
• Large files
• Small files in large quantities
– Variability
• Different meanings/formats over different time periods
4. 29.03.12 SysFera
What can you do with Big Data?
• Analyze a Variety of Information
– Social media/sentiment analysis
– Geospatial analysis
– Brand strategy
– Scientific research
– Epidemic early warning system
– Market analysis
– Video analysis
– Audio analysis
• Analyze Information in Motion
– Smart Grid management
– Multimodal surveillance
– Real-time promotions
– Cyber security
– ICU monitoring
– Options trading
– Click-stream analysis
– CDR processing
– IT log analysis
– RFID tracking & analysis
• Analyze Extreme Volumes of Information
– Transaction analysis to create insight-based product/service offerings
– Fraud modeling & detection
– Risk modeling & management
– Social media/sentiment analysis
– Environmental analysis
• Discovery & Experimentation
– Sentiment analysis
– Brand strategy
– Scientific research
– Ad-hoc analysis
– Model development
– Hypothesis testing
– Transaction analysis to create insight-based product/service offerings
• Manage and Plan
– Operational analytics – BI reporting
– Planning and forecasting analysis
– Predictive analysis
– …
5. 29.03.12 SysFera
What can you do with Big Data?
• Financial Services
– Fraud detection
– Risk management
– 360° View of the Customer
• Utilities
– Weather impact analysis on power generation
– Transmission monitoring
– Smart grid management
• Transportation
– Weather and traffic impact on logistics and fuel consumption
• IT
– Transition log analysis for multiple transactional systems
– Cybersecurity
• Health & Life Sciences
– Epidemic early warning system
– ICU monitoring
– Remote healthcare monitoring
• Retail
– 360° View of the Customer
– Click-stream analysis
– Real-time promotions
• Telecommunications
– CDR processing
– Churn prediction
– Geomapping / marketing
– Network monitoring
• Law Enforcement
– Real-time multimodal surveillance
– Situational awareness
– Cyber security detection
6. 29.03.12 SysFera
What do you need?
• Hardware
– Storage capacity
– Computing power
• Software
– Storage
• Filesystems
• Databases
– Computation framework
8. 29.03.12 SysFera
HDFS
• Hadoop Distributed File System
• Open source (Apache)
• Design
– High throughput instead of low latency
– Large data sets (large files), data locality
– Fault tolerance (replication)
– Write once, read many (WORM)
– Userspace
• Limitations
– Write-once model
– Cannot be mounted by existing OS
– No quotas/access permissions
– Name node is a single point of failure
• Used by Yahoo, Twitter, Rackspace, LinkedIn, Facebook…
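The replication idea behind the fault-tolerance bullet can be sketched in a few lines. This is a simplified illustration, not HDFS code: the 64 MB block size and replication factor of 3 are assumed values, the names `place_blocks` and `surviving_blocks` are hypothetical, and the round-robin placement stands in for the real placement policy.

```python
# Illustrative only: split a file into fixed-size blocks and assign replicas.
BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size (64 MB)
REPLICATION = 3                # assumed replication factor


def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, [datanode, ...]) placements."""
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    n_blocks = max(1, -(-file_size // block_size))  # ceiling division
    placements = []
    for b in range(n_blocks):
        # Round-robin placement: replicas of one block land on distinct nodes.
        replicas = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        placements.append((b, replicas))
    return placements


def surviving_blocks(placements, failed_node):
    """Blocks that still have at least one replica after one node fails."""
    return [b for b, replicas in placements if any(n != failed_node for n in replicas)]


nodes = ["dn1", "dn2", "dn3", "dn4"]
layout = place_blocks(200 * 1024 * 1024, nodes)
```

With every block held on three distinct nodes, losing any single data node leaves all blocks readable, which is what `surviving_blocks` checks.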
9. 29.03.12 SysFera
GlusterFS
• Open source (GPLv3) NAS file system
• Runs in userspace
• File-based distributed mirroring,
replication, striping, load balancing
• FUSE, POSIX compliant
• Storage quotas
• No meta-data server (fully distributed
architecture, elastic hash)
• Unified global namespace:
aggregation of disk and memory in a
single pool
• Data is stored in logical volumes that
are abstracted from the hardware and
logically partitioned from each other
• Multiprotocol client support:
GlusterFS native, NFS, CIFS, HTTP,
WebDAV, FTP
• Real-time self-healing
• VM live replication
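The "no metadata server" point means a client can compute where a file lives from its name alone. The sketch below illustrates that principle under loud assumptions: MD5 is only a stand-in hash, the equal-range split is a simplification, and `brick_for` and the brick names are hypothetical, not GlusterFS internals.

```python
import hashlib


def brick_for(path, bricks):
    """Pick a brick from the file name alone, with no metadata lookup."""
    # Hash the path to a 32-bit value (MD5 is a stand-in, not Gluster's hash).
    h = int.from_bytes(hashlib.md5(path.encode("utf-8")).digest()[:4], "big")
    # Split the 32-bit hash space into equal contiguous ranges, one per brick.
    range_size = 2**32 // len(bricks)
    index = min(h // range_size, len(bricks) - 1)
    return bricks[index]


bricks = ["server1:/export/brick", "server2:/export/brick", "server3:/export/brick"]
location = brick_for("/videos/talk.mp4", bricks)
```

Every client that runs the same function over the same brick list reaches the same answer, so no central lookup service is needed.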
10. 29.03.12 SysFera
LUSTRE
• Open Source (GPL)
• Object based: separate metadata
and file data
– Meta Data Servers (MDS) nodes
– Object Storage Servers (OSS)
nodes
• Consistency: Lustre distributed
lock manager (MDS and OSS)
• Performance:
– data can be striped
– The metadata target (MDT) is only involved in pathname
and permission checks, and is not
involved in any file I/O operations
• POSIX interface
• Lustre Network (LNET):
InfiniBand, TCP/IP, Myrinet…
• Targeted to manage large files
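Striping can be pictured as a RAID-0 style mapping from a file offset to a storage target. The following is a minimal sketch of that arithmetic, assuming a 1 MiB stripe size and 4 targets chosen for illustration; `locate` is a hypothetical helper, not a Lustre API.

```python
def locate(offset, stripe_size, stripe_count):
    """Map a file byte offset to (target_index, offset_within_object)."""
    stripe_number = offset // stripe_size      # which stripe overall
    target_index = stripe_number % stripe_count  # round-robin over targets
    row = stripe_number // stripe_count        # full rounds before this stripe
    return target_index, row * stripe_size + offset % stripe_size


MiB = 1024 * 1024
first = locate(0, MiB, 4)
later = locate(5 * MiB + 10, MiB, 4)
```

Consecutive stripes land on different targets, so a large sequential read can be served by several storage servers in parallel.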
12. 29.03.12 SysFera
CAP theorem (Brewer’s theorem)
It is impossible for a distributed computer
system to simultaneously provide all three of
the following guarantees:
• Consistency
• Availability
• Partition tolerance
13. 29.03.12 SysFera
NoSQL
• Relax ACID constraints
• 4 types of NoSQL databases
– Key-value
(Memcached, Voldemort):
data agnostic
– Document oriented
(CouchDB, MongoDB):
data conscious
– Column oriented (BigTable,
HBase, Cassandra)
– Graph (Neo4j)
• Requires more work on
the client side
14. 29.03.12 SysFera
Memcached
• Free & open source, high-performance, distributed
memory object caching system, generic in nature, but
intended for use in speeding up dynamic web
applications by alleviating database load.
• Simple Key/Value Store
• Smarts Half in Client, Half in Server
• Servers are Disconnected From Each Other
• O(1) Everything
• Forgetting Data is a Feature
• Used by
LiveJournal, Flickr, WordPress.org, Wikipedia, YouTube
…
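The bullets "smarts half in client, half in server", "servers are disconnected" and "forgetting data is a feature" can be illustrated together. This is a toy model in plain Python, not the Memcached protocol or client library: `CacheServer` and `CacheClient` are hypothetical names, CRC32 modulo the server count is an assumed key-distribution scheme, and eviction is a simple least-recently-used policy.

```python
import zlib
from collections import OrderedDict


class CacheServer:
    """One independent cache node; it knows nothing about the others."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def set(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry

    def get(self, key):
        if key not in self.items:
            return None  # a miss: the caller falls back to the database
        self.items.move_to_end(key)
        return self.items[key]


class CacheClient:
    """The client alone decides which server holds a given key."""

    def __init__(self, servers):
        self.servers = servers

    def _server_for(self, key):
        return self.servers[zlib.crc32(key.encode("utf-8")) % len(self.servers)]

    def set(self, key, value):
        self._server_for(key).set(key, value)

    def get(self, key):
        return self._server_for(key).get(key)
```

Because a miss is always acceptable, a full server simply drops old entries instead of refusing writes, and the application reloads them from the database when needed.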
15. 29.03.12 SysFera
MongoDB
• Document oriented
• Transport and storage: BSON format (derived
from JSON, but binary)
• Queries
– no join
– Map/reduce
• Database contains collections
• Collections contain documents
• Master-slave replication
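The database/collection/document hierarchy and the map/reduce query style can be mimicked with plain Python structures. The sketch below deliberately avoids any MongoDB driver: the sample data, `map_order`, `reduce_totals` and `map_reduce` are all hypothetical and only show the shape of the computation.

```python
from collections import defaultdict

# A "database" holds collections; a collection holds documents (here: dicts).
database = {
    "orders": [
        {"_id": 1, "customer": "alice",
         "items": [{"sku": "a", "price": 10}, {"sku": "b", "price": 5}]},
        {"_id": 2, "customer": "bob", "items": [{"sku": "a", "price": 10}]},
        {"_id": 3, "customer": "alice", "items": [{"sku": "c", "price": 7}]},
    ]
}


def map_order(doc):
    """Emit (key, value) pairs for one document."""
    yield doc["customer"], sum(item["price"] for item in doc["items"])


def reduce_totals(key, values):
    """Combine all values emitted for one key."""
    return sum(values)


def map_reduce(collection, map_fn, reduce_fn):
    grouped = defaultdict(list)
    for doc in collection:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


totals = map_reduce(database["orders"], map_order, reduce_totals)
```

Since the line items are embedded in each order document, the total per customer is computed without any join.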
16. 29.03.12 SysFera
Cassandra
• Column oriented (inspired by BigTable and
Dynamo)
• Notion of super-columns
– (sorted) associative array of columns
• Range queries on keys
• Low latency: sequential access to disk
• O(1) DHT
• Eventual Consistency
• Values limited to 2GB
• RPC with Thrift
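The "O(1) DHT" bullet refers to a hash ring where every node knows all tokens, so a key is located in one local lookup. Here is a minimal sketch of such a ring, assuming MD5-derived tokens and one token per node; `Ring` and its methods are hypothetical names, not Cassandra's API.

```python
import bisect
import hashlib


class Ring:
    """Toy hash ring: each node owns the range ending at its token."""

    def __init__(self, nodes):
        # Tokens are kept sorted so a lookup is a binary search.
        self.entries = sorted((self._token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    @staticmethod
    def _token(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def replicas(self, key, n=1):
        """First n distinct nodes clockwise from the key's token."""
        start = bisect.bisect_left(self.tokens, self._token(key)) % len(self.entries)
        count = min(n, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]


ring = Ring(["node1", "node2", "node3"])
owners = ring.replicas("user:42", n=2)
```

Writing to several consecutive nodes on the ring and acknowledging before all of them have answered is one way a system ends up with eventual consistency.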
17. 29.03.12 SysFera
Neo4j
• Graph oriented
• Fully ACID transactions
• Data is stored as a graph/network
– Nodes and relationships with properties
– "Property graph" or "edge-labeled multidigraph"
• Queries
– Indexing of nodes and properties
– Graph traversal
• Disk-based, native storage
• Java, REST API
• Master-slave load balancing
• Use case: social network
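A property graph and a traversal query can be illustrated without a graph database. The following sketch uses plain Python containers: the sample people, the `KNOWS` and `WORKS_WITH` relationship types and the `traverse` helper are hypothetical, and a real engine would use indexes instead of scanning the relationship list.

```python
from collections import deque

# Nodes carry properties.
nodes = {
    1: {"name": "Ann"}, 2: {"name": "Bob"}, 3: {"name": "Cid"}, 4: {"name": "Dee"},
}
# Relationships are directed, typed, and carry their own properties.
relationships = [
    (1, "KNOWS", 2, {"since": 2009}),
    (2, "KNOWS", 3, {"since": 2011}),
    (3, "KNOWS", 4, {"since": 2012}),
    (1, "WORKS_WITH", 4, {}),
]


def traverse(start, rel_type, max_depth):
    """Breadth-first traversal along outgoing relationships of one type."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for src, kind, dst, _props in relationships:
            if src == node and kind == rel_type and dst not in seen:
                seen.add(dst)
                found.append((nodes[dst]["name"], depth + 1))
                queue.append((dst, depth + 1))
    return found


friends = traverse(1, "KNOWS", 2)
```

This "friends and friends of friends" query is the typical social-network case: it follows relationships from one node rather than joining whole tables.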
18. 29.03.12 SysFera
PaaS Databases
• Different providers
– Amazon: RDS, SimpleDB
– Google: AppEngine (GQL)
– Microsoft: SQL Azure
• Different cost models
– CPU hour
– CPU hour + traffic
– Monthly fee + CPU hour + traffic
All depend on the load (number of users)
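The three cost models differ only in which terms enter the bill. The sketch below makes that arithmetic explicit; the function names and every rate used with them are hypothetical, not actual provider prices, and amounts are in an arbitrary unit such as cents.

```python
def cpu_hour_cost(cpu_hours, rate_per_hour):
    """Model 1: pay only for CPU hours."""
    return cpu_hours * rate_per_hour


def cpu_traffic_cost(cpu_hours, rate_per_hour, traffic_gb, rate_per_gb):
    """Model 2: CPU hours plus network traffic."""
    return cpu_hours * rate_per_hour + traffic_gb * rate_per_gb


def flat_plus_usage_cost(monthly_fee, cpu_hours, rate_per_hour, traffic_gb, rate_per_gb):
    """Model 3: fixed monthly fee plus CPU hours plus traffic."""
    return monthly_fee + cpu_traffic_cost(cpu_hours, rate_per_hour, traffic_gb, rate_per_gb)
```

Since CPU hours and traffic both grow with the number of users, comparing providers means evaluating these functions at the expected load rather than comparing unit prices alone.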
22. 29.03.12 SysFera
IBM Big Data Platform
• InfoSphere BigInsights (Hadoop)
– Hadoop-based low latency analytics for variety and volume
• InfoSphere Streams (Stream Computing)
– Low Latency Analytics for streaming data
• InfoSphere Information Server (Information Integration)
– High volume data integration and transformation
• MPP Data Warehouse
– IBM InfoSphere Warehouse: Large volume structured data analytics
– IBM Netezza High Capacity Appliance: Queryable Archive Structured Data
– IBM Netezza 1000: BI+Ad Hoc Analytics Structured Data
– IBM Smart Analytics System: Operational Analytics on Structured Data
– IBM Informix Timeseries: Time-structured analytics
25. 29.03.12 SysFera
DAGDA
• Meta data-manager
• Data management from end to
end
• Data replication
– Explicit
– Implicit
• Data persistency
• Memory and disk quotas
• Replacement algorithms (LRU,
LFU, FIFO)
• Best source selection
• Strong link with task manager
• Pluggable policies, local data
managers
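The combination of quotas and replacement algorithms can be sketched as a bounded store with a pluggable eviction policy. This is an illustrative model only, not DAGDA's implementation: `DataStore`, its methods and the logical clock are hypothetical, and sizes are in arbitrary units.

```python
class DataStore:
    """Bounded store that evicts entries with a pluggable replacement policy."""

    def __init__(self, quota, policy="LRU"):
        if policy not in ("LRU", "LFU", "FIFO"):
            raise ValueError("unknown policy: " + policy)
        self.quota = quota
        self.policy = policy
        self.sizes = {}      # name -> size
        self.inserted = {}   # name -> logical time of insertion
        self.last_used = {}  # name -> logical time of last access
        self.hits = {}       # name -> number of accesses
        self.clock = 0

    def _tick(self):
        self.clock += 1
        return self.clock

    def _remove(self, name):
        for table in (self.sizes, self.inserted, self.last_used, self.hits):
            table.pop(name, None)

    def _victim(self):
        if self.policy == "FIFO":
            return min(self.sizes, key=self.inserted.get)
        if self.policy == "LRU":
            return min(self.sizes, key=self.last_used.get)
        # LFU: fewest accesses first, oldest access breaks ties.
        return min(self.sizes, key=lambda n: (self.hits[n], self.last_used[n]))

    def add(self, name, size):
        if size > self.quota:
            raise ValueError("item larger than quota")
        self._remove(name)  # re-adding replaces the previous copy
        while sum(self.sizes.values()) + size > self.quota:
            self._remove(self._victim())
        now = self._tick()
        self.sizes[name] = size
        self.inserted[name] = now
        self.last_used[name] = now
        self.hits[name] = 0

    def use(self, name):
        self.last_used[name] = self._tick()
        self.hits[name] += 1
```

The same sequence of operations evicts a different item under each policy, which is why the policy is worth making configurable per data manager.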
27. 29.03.12 SysFera
Bibliography
• « Big Data & Open Source: Une convergence inévitable ? », Stefane Fermigier, http://www.fermigier.com/blog/2012/03/new-whitepaper-big-data-open-source/
• « Visual Guide to NoSQL Systems », http://blog.beany.co.kr/archives/275
• « The Cassandra Distributed Database », Eric Evans, http://www.parleys.com/#st=5&id=1866&sl=40
• « Big Data Architecture », Julio Philippe, http://www.slideshare.net/PhilippeJulio/big-data-architecture
• « Big Data in Real-Time analysis at Twitter », Nick Allen, http://www.slideshare.net/nkallen/q-con-3770885
• …