Big Data Analytics and Advanced Computer Networking Scenarios
1. August 2013
Institute for Big Data Analytics –
Dalhousie University
Big Data Analytics and Advanced Computer
Networking Scenarios: Research Challenges and
Opportunities
Stenio Fernandes
CIn/UFPE, Recife, Brazil
2. Agenda
A bit of technical background
– Measurements and Analysis in Computer Networks
Advanced Networking Architectures
– Software-Defined Networking (SDN)
– Information-Centric Networking (ICN)
– Network Virtualization (NV)
Tools and Techniques for High-Performance Network Traffic Analysis
– Visual Analytics, GPU, MapReduce
Applied Research on Computer Networking
– Opportunities and Directions
Research agenda
– CIn/UFPE and DalhousieU
4. Essential (Core) motivation
Profiling Internet traffic
• Is an essential task for precise network management
• At both access and backbone networks
It provides useful information for
• Proper (re)configuration of networks
• Deployment of accurate policies (security, routing, throttling, capping, etc.)
• Optimization of network resources
• Support for network design and planning
• Counterattack abnormal behavior
5. Why do operators need Internet profiling?
• Network-wide reporting: generating basic information about usage and reliability
• Performance/reliability troubleshooting: detecting and diagnosing anomalous events
• Security: detecting, diagnosing, and blocking security problems
• Traffic engineering: adjusting network configuration to the prevailing traffic
• Capacity planning: deciding where and when to install new equipment
6. Reporting
Examples
• Total volume of traffic sent to/from each private peer
• Mixture of traffic by application (e.g., Web, Streaming, P2P, SPAM)
• Mixture of traffic to/from individual customers
• Usage, loss, and reliability trends for each link
Requirements
• Network-wide view of basic traffic statistics
• Ability to have different views: by application, by customer, by peer, by link type
• Real-time and offline monitoring of high-speed links
7. Core Network Troubleshooting
Detecting and diagnosing problems
• Recognizing and explaining anomalous events
Why is a backbone link suddenly overloaded?
Why are DNS queries failing with high probability?
Why does a router processor have high CPU utilization?
Why can a customer not reach certain networks?
8. Core Security
Detecting and diagnosing problems
• Recognizing suspicious traffic or disruptions
Examples
• Denial-of-service attack on a customer or service
• Spread of a worm or virus through the network
• Router hijack
Requirements
• Detailed measurements from multiple places
• Include payload inspection, in some cases
• Online analysis of the data
• Installing filters to block the offending traffic
9. Core Traffic Engineering
Resource allocation policies
• Active queue management and link scheduling
• Green Networking
Examples
• Divert traffic from congested links
• Balance load on peering links
• Link-scheduling weights to reduce delay for premium traffic
Requirements
• Network-wide view of the traffic carried in the backbone
• Timely view of the network topology
• Analytical models to assess and predict performance of control operations
10. Core Capacity Planning
Deploying new equipment
• What? Where? When?
Examples
• Where to put the next backbone router
• When to upgrade a link to higher capacity
• Whether to add/remove a particular peer
• Whether the network can accommodate a new customer
• Whether to install a caching proxy
Requirements
• Projections of future traffic patterns from measurements
• Cost estimates for buying/deploying the new equipment
• Model of the potential impact of the change (e.g., latency reduction and bandwidth savings)
12. Technical Background: Measurements
Packet
• More detailed: from link to application layer (with timestamps)
• Huge storage and processing requirements
• Header or payload (full or partial)
Flow
• Flow summaries
• connection info, number of packets, duration, volume
• IPFIX / Cisco NetFlow v5/v9 records
Aggregate
• SNMP counts
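The step from packet-level to flow-level data can be sketched as a simple aggregation by 5-tuple. This is a minimal illustration, not the NetFlow or IPFIX record format: the `FlowRecord` fields, the `aggregate_flows` helper, and the sample packets are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FlowRecord:
    """Per-flow summary: packet count, volume, and first/last timestamps."""
    first_ts: float
    last_ts: float
    packets: int = 0
    bytes: int = 0

    @property
    def duration(self) -> float:
        return self.last_ts - self.first_ts


def aggregate_flows(packets):
    """Group packets by 5-tuple and keep only per-flow counters."""
    flows = {}
    for ts, src, dst, sport, dport, proto, size in packets:
        key = (src, dst, sport, dport, proto)
        rec = flows.get(key)
        if rec is None:
            rec = flows[key] = FlowRecord(first_ts=ts, last_ts=ts)
        rec.packets += 1
        rec.bytes += size
        rec.first_ts = min(rec.first_ts, ts)
        rec.last_ts = max(rec.last_ts, ts)
    return flows


# Hypothetical packet trace: (timestamp, src, dst, sport, dport, proto, size)
packets = [
    (0.00, "10.0.0.1", "10.0.0.2", 40000, 80, "tcp", 60),
    (0.01, "10.0.0.1", "10.0.0.2", 40000, 80, "tcp", 1500),
    (0.50, "10.0.0.3", "10.0.0.9", 5353, 53, "udp", 80),
    (0.75, "10.0.0.1", "10.0.0.2", 40000, 80, "tcp", 1500),
]
flows = aggregate_flows(packets)
for key, rec in flows.items():
    print(key, rec.packets, rec.bytes, rec.duration)
```

The flow table keeps a handful of counters per connection instead of every packet, which is where the storage savings of flow-level measurement come from.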
15. Technical Background: Analysis of Packet Traces
IP header
• Traffic volume by IP addresses or ASes
• Burstiness of the stream of packets
• Packet properties (e.g., sizes, out-of-order)
Transport header
• Traffic breakdown by protocol
• TCP congestion and flow control
• Number of bytes and packets per session
Application header
• URLs, HTTP headers, file type
• DNS queries and responses
• Mobile devices
16. Core Modelling
Exploratory Data Analysis
• Maximize insight into the data set
• Extract important variables
• Detect outliers and anomalies
• Develop parsimonious models
Statistical Inference
• Does the data follow a particular PDF?
• Maximum Likelihood Estimation
• Hypothesis testing
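As a small worked illustration of the inference steps, the sketch below fits an exponential distribution to packet inter-arrival times by maximum likelihood and then computes a Kolmogorov-Smirnov distance as a rough goodness-of-fit check. The sample values are made up, and the exponential model is an assumption chosen for simplicity. The 1.36/sqrt(n) threshold is the usual large-sample 5% value and is only indicative here, since n is small and the rate was estimated from the same sample.

```python
import math


def exponential_mle(samples):
    """MLE of the exponential rate: lambda_hat = n / sum(x)."""
    return len(samples) / sum(samples)


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF and the model CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Compare the model against the ECDF just after and just before x
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Hypothetical inter-arrival times in milliseconds
inter_arrivals = [0.8, 1.9, 0.3, 2.4, 1.1, 0.5, 3.2, 0.2, 1.4, 0.7]

rate = exponential_mle(inter_arrivals)
d = ks_statistic(inter_arrivals, lambda x: 1.0 - math.exp(-rate * x))
critical = 1.36 / math.sqrt(len(inter_arrivals))  # approximate 5% threshold

print(f"rate={rate:.3f}, KS distance={d:.3f}, threshold={critical:.3f}")
print("exponential model not rejected" if d < critical else "exponential model rejected")
```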
18. Research Challenges: Measurements
Network-wide view
• Crucial for evaluating control actions
• Multiple kinds of data from multiple locations
Large scale
• Large number of high-speed links and routers
• Large volume of measurement data
The “do no harm” principle (passive measurements)
• Don’t degrade router performance
• Don’t require disabling key router features
• Don’t overload the network with measurement data
19. Research Challenges: Packet Measurements
Building efficient DPI engines
• 1 packet every 5 ns!
• Based on DFA/NFA built from regular expressions that express application signatures
• For hardware-based or commodity platforms
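The per-packet time budget can be checked with a little arithmetic. The sketch below assumes minimum-size 64-byte Ethernet frames on a saturated link. The slide does not say which link speed the 5 ns figure refers to, so the 100 Gbps reading is an assumption. The function name is illustrative.

```python
def packet_interarrival_ns(link_gbps, frame_bytes=64, overhead_bytes=20):
    """Time between back-to-back frames on a saturated link, in nanoseconds.

    overhead_bytes covers the Ethernet preamble and start delimiter (8 bytes)
    plus the inter-frame gap (12 bytes).
    """
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    return bits_on_wire / link_gbps  # 1 Gbit/s carries 1 bit per nanosecond


for gbps in (10, 40, 100):
    with_overhead = packet_interarrival_ns(gbps)
    frame_only = packet_interarrival_ns(gbps, overhead_bytes=0)
    print(f"{gbps:>3} Gbps: {with_overhead:.2f} ns per frame "
          f"({frame_only:.2f} ns ignoring preamble and gap)")
```

Under these assumptions, a budget of roughly 5 ns corresponds to 64-byte frames at 100 Gbps when framing overhead is ignored. That leaves only a few CPU cycles per packet on a commodity core, which is why DPI engines need carefully built automata or hardware support.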
Update of app signatures database
• Inspecting encrypted traffic is not possible
• Analysis of packet payload is forbidden in a number of countries
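Signature-based classification can be sketched with ordinary regular expressions. The three signatures below are simplified and hypothetical, for illustration only. Python's `re` module is a backtracking matcher rather than a compiled DFA, so this shows the idea but not the performance of a real DPI engine.

```python
import re

# Hypothetical, simplified application signatures (illustration only)
SIGNATURES = {
    "http": re.compile(rb"^(GET|POST|HEAD|PUT|DELETE) \S+ HTTP/1\.[01]\r\n"),
    "ssh": re.compile(rb"^SSH-\d\.\d+-"),
    "bittorrent": re.compile(rb"^\x13BitTorrent protocol"),
}


def classify_payload(payload: bytes) -> str:
    """Return the first application whose signature matches the payload."""
    for app, pattern in SIGNATURES.items():
        if pattern.search(payload):
            return app
    return "unknown"


samples = [
    b"GET /index.html HTTP/1.1\r\nHost: example.org\r\n\r\n",
    b"SSH-2.0-OpenSSH_8.9\r\n",
    b"\x13BitTorrent protocol" + b"\x00" * 8,
    b"\x16\x03\x01\x02\x00",  # start of a TLS record: payload is opaque
]
for payload in samples:
    print(classify_payload(payload))
```

The last sample shows the limitation noted above: once the payload is encrypted, there is nothing for an application signature to match.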
20. High-Performance Traffic Monitoring Systems
• Large number of application signatures
• Complexity of the signature patterns
• Unpredictability of signature location in the network flow, as well as within the packet payload
• Performance bottlenecks at OS and hardware levels
Visual Analytics
21. Research Challenges: Flow-level Analysis
Tries to identify applications or classes of applications without looking at the payload
• May extract high-level models for unsupervised classification and learning
Less data volume to analyse
• Still tough to do in real time on high-speed links
• 40 Gbps and beyond
Address privacy issues for lawful interception
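Unsupervised classification on flow summaries can be illustrated with a plain k-means clustering over two payload-free features. This is a sketch under stated assumptions: the flow values are invented, the choice of features (log of bytes and log of duration) is illustrative, and k-means stands in for whatever learning method a real system would use.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical flow summaries: (bytes, duration in seconds)
flows = [
    (300, 0.1), (5e8, 120.0),
    (500, 0.2), (2e9, 600.0),
    (800, 0.05), (9e8, 300.0),
]
features = [(math.log10(b), math.log10(d)) for b, d in flows]
labels, centroids = kmeans(features, k=2)
print(labels)
```

The short, small flows and the long, large flows end up in separate clusters without any payload inspection, which is the appeal of flow-level analysis where payload access is restricted.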
26. SDN – Motivation
Current networks cannot support this growth!
– Not service-oriented
– Static configuration
– Status not available to apps/users
– Cannot provide dynamic negotiation to users
28. The Need for a New Network Architecture (The ONF view)
Key computing trends:
– Changing traffic patterns
In contrast to client-server applications, today’s apps access different services
Access to content and applications from any type of device, anywhere, at any time
– The rise of cloud services
Agility to access applications, infrastructure, and other IT resources on demand and à la carte
– Big data means more bandwidth
Mega datasets are fueling a constant demand for additional network capacity in the data center
29. Limitations of Current Networking Technologies (The ONF View)
Meeting current market requirements using device-level management tools and manual processes
Complexity that leads to stasis
– The static nature of networks is in stark contrast to the
dynamic nature of today’s environment
Inconsistent policies
– To implement a network-wide policy, thousands of
devices and mechanisms must be configured
Inability to scale
– traffic patterns are dynamic and unpredictable
– users with different apps and performance needs
30. SDN (the ONF view)
Emerging network architecture where network control is decoupled from forwarding and is directly programmable
– Migration of control into accessible computing devices enables the underlying infrastructure to be abstracted for applications and network services, which can treat the network as a logical or virtual entity
Network intelligence is (logically) centralized
– SDN controllers maintain a global view of the network
Network appears to the applications and policy engines as a single, logical switch
– Infrastructure gains vendor-independent control over the entire network from a single logical point
33. Motivation: what drives SDN research and development?
Reduced network costs (CAPEX / OPEX)
Support for innovative new products (applications, services)
Synergy with Cloud Computing Services and Infrastructure
And most importantly: real-time network programmability
This is the quest for networks with improved performance while keeping them simple, scalable, and “smart”
34. Innovation Roadblocks vs. Enablers for Big Data Analytics
Roadblocks
– From the Network Layer
Proprietary software in network devices
Developers have to rely on the network as is
– Support for data-intensive science and applications
One-size-fits-all approach to network data flows
Enablers
– From the Network Layer
Let developers communicate with and program the network itself
Allow developers to optimize the network for specific applications
– Support for data-intensive science and applications
Allow special solutions for high-performance data flows
Include support for network programmability
38. A Simplified View of SDN
1. A network in which the control plane is physically separate from
the forwarding (data) plane
• A single control plane controls several forwarding devices
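The separation can be sketched as a toy model in which switches only look up a match-action table and a single controller decides what goes into it. All class and method names here are hypothetical. This is not the OpenFlow API, only an illustration of the division of labour.

```python
class Switch:
    """Forwarding device: holds a match -> action table, no routing logic."""

    def __init__(self, name, controller):
        self.name = name
        self.controller = controller
        self.flow_table = {}  # destination -> output port

    def install_rule(self, dst, out_port):
        self.flow_table[dst] = out_port

    def forward(self, dst):
        if dst not in self.flow_table:
            # Table miss: defer the decision to the control plane
            self.controller.handle_miss(self, dst)
        return self.flow_table.get(dst, "drop")


class Controller:
    """Single control plane with a global view of several switches."""

    def __init__(self, routes):
        self.routes = routes  # (switch name, destination) -> output port
        self.misses = 0

    def handle_miss(self, switch, dst):
        self.misses += 1
        port = self.routes.get((switch.name, dst))
        if port is not None:
            switch.install_rule(dst, port)


controller = Controller({("s1", "10.0.0.2"): 2, ("s2", "10.0.0.2"): 1})
s1 = Switch("s1", controller)
s2 = Switch("s2", controller)

print(s1.forward("10.0.0.2"))  # first packet triggers a controller lookup
print(s1.forward("10.0.0.2"))  # later packets hit the installed rule
print(s2.forward("10.0.0.2"))
print(s1.forward("10.9.9.9"))  # no route known anywhere: dropped
```

Only the first packet of a flow involves the controller, which is why flow arrival rate, rather than packet rate, drives the load on the control plane.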
39. Consequences of SDN adoption
1. Hardware and software from different vendors
2. Simplified programmability
3. Enables application-level control/programming of the network
4. Enables centralized control, which implies simplification of network operations
5. Prospective integration with Network Virtualization technologies (cf. next section)
40. Supporting SDN with OpenFlow
First standard communications interface for SDN
– between the control and forwarding layers
It allows direct access to and manipulation of the
forwarding plane of network devices
– both physical and virtual (hypervisor-based)
OpenFlow IS NOT SDN!
41. SDN - Challenges
North (apps) to South (devices) Traffic Pattern
– Needs precise classification systems
– Needs model building
– At high-speed
– Real-time
– Adapt to abrupt and long-term changes
– Cope with millions to billions of flows in the short term (e.g., mice flows in a 5-minute time window)
Core challenge: deciding which service policy to apply to a flow (a classification and optimization problem)
42. OF-based SDN Benefits (1/2)
Centralized control of multi-vendor environments
– use SDN-based orchestration and management tools to
quickly deploy, configure, and update devices across
the entire network
Reduced complexity through automation
– develop tools that automate many management tasks
Higher rate of innovation
– Allowing operators to program and reprogram the network in real time to meet specific business needs and user requirements
43. OF-based SDN Benefits (2/2)
Increased network reliability and security
– define high-level configuration and policy statements
More granular network control
– apply policies at a very granular level: session, user, device, and application levels
Better user experience
– Centralized network control and state information
available to higher-level applications
Infrastructure can better adapt to dynamic user needs
– E.g.: Adaptive Video Streaming
45. SDN: Research Challenges (1/2)
SDN Architecture Design
– accommodating consistency, dependability, and scalability
requirements
control plane: centralized or distributed processing?
– controller placement problem
How many? Where to place them? How to distribute tasks?
– Maximizing fault tolerance and dependable infrastructure
to support high-performance intra-DC data exchange for Big
Data Analytics
Optimized Policy Framework
– automatic policy transformation
46. SDN Challenges (2/2)
Resiliency to security and DoS attacks
– Vulnerability in the Control Plane
Multi-Dimensional Aggregation of Rules
– Use multi-dimensional tags
– Ensure policy consistency
Example: Mobile Infrastructure
48. NV: concepts
What is NV?
– Decoupling of the services provided by a (virtualized)
network from the physical network
Virtual network is a “container” of network services (L2 -
L7) provisioned by software
– Faithful reproduction of services provided by physical
network
Analogy to a VM – complete reproduction of physical
machine (CPU, memory, I/O, etc.)
53. ICN: Motivation
Traditional Internet communication model is based
on end-to-end communication
There is a growing need for highly scalable and efficient distribution of content
– CDN is a success, although it might be seen as a patch
Information-driven communication breaks the traditional packet-based model, allowing content-centric communication
– ICN architectures take advantage of
in-network storage
multiparty communication
interaction models (e.g., publish-subscribe)
54. ICN: Technical Background
New location-independent approach to
communicate
– more suitable for content distribution
ICN architectures are replacing where with what
Ruled by the consumers of data
– Interest and Data packets
i) a content consumer asks for some content by
broadcasting its interest to all nodes it can reach
ii) any node that receives the Interest packet and has the
content responds with a Data packet
55. ICN: Technical Background
The basic operation of an ICN node is similar to that of an IP host
– A packet arrives on an interface
A longest-match lookup is performed on its name
Building blocks for ICN architectures
– Information Objects
– Content Naming
– Security
– Content Forwarding
– In-Network Caching
– Routing and Transport
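The longest-match lookup on a name mentioned above can be sketched for hierarchical names as follows. The forwarding table contents, the face labels, and the `/`-separated name format are assumptions made for the example.

```python
def longest_prefix_match(fib, name):
    """Return (prefix, face) for the longest name prefix present in the FIB."""
    components = name.strip("/").split("/")
    # Try the full name first, then drop one trailing component at a time
    for length in range(len(components), 0, -1):
        prefix = "/" + "/".join(components[:length])
        if prefix in fib:
            return prefix, fib[prefix]
    return None


# Hypothetical forwarding table: name prefix -> outgoing face
fib = {
    "/ufpe": "face0",
    "/ufpe/cin": "face1",
    "/ufpe/cin/videos": "face2",
}

print(longest_prefix_match(fib, "/ufpe/cin/videos/talk.mp4/seg3"))
print(longest_prefix_match(fib, "/ufpe/cin/papers/x.pdf"))
print(longest_prefix_match(fib, "/ufpe/cinema"))
print(longest_prefix_match(fib, "/dal/cs"))
```

Matching is done per name component rather than per character, so `/ufpe/cinema` falls back to `/ufpe` instead of wrongly matching `/ufpe/cin`. This is the name-based counterpart of longest-prefix matching on IP addresses.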
56. ICN: Technical Background
Information Objects (IO)
– An IO represents content information without taking into consideration its storage location and physical representation
– An IO can have multiple copies of itself
Content Naming
– Treat content as a network primitive
Uniqueness, Persistence, Scalability
– Hierarchical or Flat Naming
57. ICN: Technical Background
Security
– Content Validation
– Name Persistence
– Owner Authentication and Identification
Content Forwarding
58. ICN: Technical Background
In-Network Caching
– temporarily store content in the network core elements
– small but popular content generates most Internet traffic
Heavy-tailed nature of Internet traffic
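The claim that a small set of popular content accounts for most requests can be made concrete with a short calculation. The sketch assumes a Zipf popularity law with exponent 1.0 and a catalogue of 10,000 objects. Both are illustrative assumptions, not measurements.

```python
def zipf_top_share(n_items, top_k, alpha=1.0):
    """Fraction of requests that hit the top_k most popular of n_items objects,
    assuming request probability proportional to 1 / rank**alpha."""
    weights = [1.0 / rank ** alpha for rank in range(1, n_items + 1)]
    return sum(weights[:top_k]) / sum(weights)


catalogue = 10_000
for top_k in (10, 100, 1_000):
    share = zipf_top_share(catalogue, top_k)
    print(f"top {top_k:>5} objects ({top_k / catalogue:.1%} of catalogue): "
          f"{share:.1%} of requests")
```

Under these assumptions, caching only the top 1% of the catalogue would already serve roughly half of all requests, which is the intuition behind placing modest caches inside the network.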
Routing and Transport
– IO identifiers are not bound to a specific location
– Common topology-based routing and forwarding algorithms are not effective for routing IOs
Current Architectures:
CCN
Publish-Subscribe Internet Routing Paradigm (PSIRP)
4WARD-NetInf
DONA
CCNx
59. ICN: challenges
Scalability
– To be effective, routers should be able to keep TBs of
information in cache
Security
– naming scheme that allows both self-certification and
human-friendly identification while avoiding the use of a
PKI is an open issue
Privacy
– makes information visible and identifiable at the network
level
Economic model
– Adoption of ICN depends not only on technical aspects
60. TOOLS AND TECHNIQUES FOR HIGH-
PERFORMANCE NETWORK TRAFFIC
ANALYSIS
Visual Analytics
61. VA: Motivation
Effectively use the immense wealth of data and information acquired, computed, and stored
Analysts can get lost in irrelevant or inappropriately processed or presented information
– For computer networks, acquisition of raw data is no longer a problem
Visualization techniques might be very effective
– But for some analyses, pure visualization does not completely expose insights hidden in the data
62. VA: definition
Science of analytical reasoning supported by
highly interactive visual interfaces,
transcending simple and direct data
visualization, and requiring active user
participation
65. VA: Challenges
Challenges for Visualization Systems for computer
networks data
– Limited scalability
– Knowledge discovery
– Appropriateness to perform data transformation
– Data presentation
– Interaction with the visualization system
– Hardware bottlenecks
– Multi-attribute visualization
66. TOOLS AND TECHNIQUES FOR HIGH-
PERFORMANCE NETWORK TRAFFIC
ANALYSIS
Graphics Processing Units (GPU)
68. Research Challenges and Opportunities
Cloud Computing Services are driving huge changes in the computer networking field
– Distributed and hybrid clouds will be a reality soon
Massive amounts of data to be moved
SDN seems to be a smart solution to address scalability and other issues for Big Data
– NV is available as the supporting technology
CCN is a paradigm shift and might face barriers to full deployment
Opportunities for advanced research are everywhere in those new scenarios
– Content is becoming king in networking
69. Center For Informatics (CIn)
Federal University Of Pernambuco (UFPE)
Recife, Brazil
About
70. CIn/UFPE
UFPE
• ~42K students, ~1K PhD professors
Recognition
• Top 5 CS Graduate Program in Brazil
• Evaluation: CAPES level 6 (scale 1 to 7)
• Top 10 most important CS Research Center in Latin America
Faculty
• 80+ PhD professors
• ~25% CNPq Research Chairs
Programs
• Computer Science, Computer Engineering, Information Systems
71. CIn/UFPE
2000+ students
International collaboration: Europe, Asia, and North America
Research projects (privately and publicly funded)
• CNPq, CAPES, FACEPE
• Samsung, Ericsson, Motorola, Nokia, LG, HP, etc.
Recipient of a number of awards:
• 2011 Most Innovative Brazilian Research Center
• Microsoft Imagine Cup (since 2005)
• ACM Intl. Programming Marathon
Recruitment: Google, Microsoft, Facebook
73. Research Agenda with Dalhousie
New R&D program
• International Science & Technology Partnership (ISTP) and Pernambuco State Research Funding Agency (FACEPE)
• UFPE, Dalhousie University
• GSTS, Neurotech
• ~CAD 2M over 2 years
Further Collaboration
• Open to new ideas and interests
Protocols tend to be defined in isolation, however, with each solving a specific problem and without the benefit of any fundamental abstractions. This has resulted in one of the primary limitations of today’s networks: complexity. For example, to add or move any device, IT must touch multiple switches, routers, firewalls, Web authentication portals, etc. and update ACLs, VLANs, quality of service (QoS), and other protocol-based mechanisms using device-level management tools. In addition, network topology, vendor switch model, and software version all must be taken into account. Due to this complexity, today’s networks are relatively static as IT seeks to minimize the risk of service disruption.
SDN also greatly simplifies the network devices themselves, since they no longer need to understand and process thousands of protocol standards but merely accept instructions from the SDN controllers.
The open standards (north and south)
Suppose that you have distributed cloud services that compute and visualize in different locations. Can you imagine how the network might suffer when transporting a massive amount of data between datacenters? So, how can the network support such operations? It can’t, using current technologies.
As an example, datacenters can now offer multiple clouds to different tenants, instead of separating virtual networks. This is a more abstract view and facilitates infrastructure management.
The FIB (Forwarding Information Base) is a table used to forward Interest packets to potential sources of their content. The CS (Content Store) acts like the buffer memory of an IP router. However, the CS has a different replacement policy: it remembers arriving Data packets for as long as possible (using an LRU or LFU scheme) to maximize the probability of sharing and minimize upstream bandwidth demand. The PIT (Pending Interest Table) keeps track of the IOs recently requested and not yet served.
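The interplay of the three tables can be sketched as a small node model. This is a minimal illustration with hypothetical class, method, and face names, with an LRU Content Store and component-wise longest-prefix FIB lookup. It omits Interest lifetimes, nonces, and everything else a real CCN forwarder needs.

```python
from collections import OrderedDict


class CCNNode:
    def __init__(self, fib, cs_capacity=2):
        self.fib = fib            # name prefix -> upstream face
        self.pit = {}             # name -> set of faces waiting for the Data
        self.cs = OrderedDict()   # name -> Data, kept in LRU order
        self.cs_capacity = cs_capacity

    def on_interest(self, name, in_face):
        if name in self.cs:                      # 1. Content Store hit
            self.cs.move_to_end(name)
            return ("data", in_face, self.cs[name])
        if name in self.pit:                     # 2. already pending: aggregate
            self.pit[name].add(in_face)
            return ("aggregated", None, None)
        out_face = self._fib_lookup(name)        # 3. forward using the FIB
        if out_face is None:
            return ("drop", None, None)
        self.pit[name] = {in_face}
        return ("forward", out_face, None)

    def on_data(self, name, data):
        faces = self.pit.pop(name, set())
        if not faces:
            return []                            # unsolicited Data is discarded
        self.cs[name] = data
        self.cs.move_to_end(name)
        if len(self.cs) > self.cs_capacity:
            self.cs.popitem(last=False)          # evict least recently used
        return sorted(faces)                     # faces to send the Data back on

    def _fib_lookup(self, name):
        parts = name.strip("/").split("/")
        for n in range(len(parts), 0, -1):
            prefix = "/" + "/".join(parts[:n])
            if prefix in self.fib:
                return self.fib[prefix]
        return None


node = CCNNode({"/ufpe": "up0"})
print(node.on_interest("/ufpe/a", "f1"))   # forwarded upstream
print(node.on_interest("/ufpe/a", "f2"))   # aggregated in the PIT
print(node.on_data("/ufpe/a", b"A"))       # Data returned to both faces
print(node.on_interest("/ufpe/a", "f3"))   # served from the Content Store
```

The second Interest never leaves the node and the third is answered from the cache, which is how the PIT and CS together reduce upstream bandwidth demand.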
ICN frequently has to validate the binding between names and content. One technique for doing so is known as self-certification. Self-certification can cover all the data or just pieces of an IO, depending on the approach chosen. Self-certification therefore ensures that the only way to make unauthorized changes to the data is to change the IO’s ID (i.e., the content name). Persistent names ensure that content names do not change when the storage location changes.
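One simple form of self-certification derives the name from a cryptographic hash of the content, so any receiver can check the binding without trusting the node that served the copy. The sketch below assumes a flat name of the form `sha256:<digest>`. This naming format is hypothetical and not taken from any specific ICN architecture.

```python
import hashlib


def self_certifying_name(content: bytes) -> str:
    """Derive a flat content name from the SHA-256 digest of the content."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


def verify(name: str, content: bytes) -> bool:
    """Check that the received content matches the name it was requested by."""
    return name == self_certifying_name(content)


original = b"Big Data Analytics and Advanced Computer Networking Scenarios"
name = self_certifying_name(original)

print(name)
print(verify(name, original))                # True: content matches its name
print(verify(name, original + b" (edited)")) # False: any change breaks the binding
```

The name stays valid wherever the copy is stored, which gives persistence, but it is not human-friendly. That tension is the open naming issue listed under the ICN security challenges.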
Some of these challenges can be tackled by research work on big data.