1. RAPID PROTOTYPING FOR
BIG DATA WITH AWS
Tuesday, March 15, 2016
8 AM PST/4 PM BST/5 PM CEST
webinar webinar@softserveinc.com
2. SPEAKERS
Serge Haziyev
VP of Technology Services,
SoftServe
Taras Bachynskyy
Data Architect,
SoftServe
Vadim Astakhov
Solution Architect,
Amazon Web Services
Ariel Weil
VP of Marketing and Business
Development, Yottaa
4. TYPICAL BIG DATA CHALLENGES
Data Sources (ranging from structured to unstructured, low to high complexity):
• Archives
• Docs
• Business Apps
• Media
• Social Networks
• Public Web
• Data Storages
• Machine Log Data
• Sensor Data
Dimensions: Volume, Velocity, Variety, Complexity
Architecture Concerns:
• Scalability
• Performance
• Extensibility
• Data Quality
• Fault-Tolerance and Availability
• Security
• Cost
• Skills Availability
5. WHY IS PROTOTYPING IMPORTANT?
Typical signs to start prototyping:
• Requirements are uncertain
• Technologies are new
• No comparable system has been previously developed
• No full buy-in from the business
They said they didn’t
need a prototype
7. WHEN AND WHY TO PROTOTYPE?
Find more info at: “Strategic Prototyping for Developing Big Data Systems”,
IEEE Software, March-April, 2016
Prototypes along the project timeline (When?):
• Initial Architecture Analysis
• Rapid Horizontal Prototype
• Vertical Evolutionary Prototype
• PoC
• MVP
Goals (Why?):
• Identification of missing, conflicting, or ambiguous architectural requirements
• Creation of initial architecture design and selection of candidate technologies
• Confirmation of user interface requirements and system scope
• Demonstration version of the system to obtain buy-in from the business
• Integration of selected technologies
• Clarification of complex requirements
• Testing critical functionality and quality attribute scenarios
• Validation of technologies and scenarios that pose risks
• Getting early feedback from end users and updating the product accordingly
• Presentation of a working version at a trade show or customer event
• Evaluation of team progress and alignment
10. SIMPLIFY BIG DATA PROCESSING
Data → Ingest → Collect → Process → Analyze → Answers (over time)
11. AWS BIG DATA TECHNOLOGIES
Ingest: AWS Direct Connect, AWS Import/Export, VPN/Public Web
Store: S3, Amazon Kinesis, Glacier, DynamoDB
Process: EMR, Redshift, EC2
Automate: AWS Data Pipeline
12. BIG DATA PROCESSING
Collect and Store (data collection and storage): S3, Kinesis, DynamoDB, RDS (Aurora)
Process (event processing): AWS Lambda, KCL Apps
Process (data processing): EMR
Analyze (data analysis): Redshift, Machine Learning
Data → Answers
14. DATA CHARACTERISTICS: HOT, WARM, COLD
              Hot         Warm       Cold
Volume        MB–GB       GB–TB      PB
Item size     B–KB        KB–MB      KB–TB
Latency       ms          ms, sec    min, hrs
Durability    Low–High    High       Very High
Request rate  Very High   High       Low
Cost/GB       $$–$        $–¢¢       ¢
16. YOUR BIG DATA APPLICATION ON AWS
• Log4J
• EMR-Kinesis Connector
• Hive with Amazon S3
• Amazon Redshift parallel COPY from Amazon S3
• Amazon Kinesis processing state
18. YOTTAA CREATES AN ABSTRACTION LAYER ON TOP OF
INFRASTRUCTURE, APP & VISITOR BROWSER
19. YOTTAA’S PROXY-BASED SOLUTION SEES EVERY VISITOR
REQUEST & INFRASTRUCTURE RESPONSE
Components: Visitor Browser → YOTTAA Network (WAF, Asset Optimization) → Incumbent CDN → Primary Web (www) Domain and Resource Domain(s); 3rd Party WAF (if present) → 3rd Party Domain(s); Non-optimized Assets
20. REAL-TIME WEB ANALYTICS – LOB & IT USE CASES TO
DRIVE YOTTAA'S BUSINESS FORWARD
“The Business”
Customer Journey
• User experience
• Visitor Targeting
• Vendor Attribution
• Business Agility
IT & Operations
Service Levels
• Speed
• Scalability
• Security
• Standards
21. THE SOLUTION: IMPACTANALYTICS™ BIG DATA
ANALYTICS FOR ACTIONABLE INSIGHT
Complete Visibility
• Centralized log delivery & analytics
• Role-based access control
• Dual-factor authentication
• Account lockout
Actionable Insights
• Real-time traffic & threat analysis
• Event management
• In-line actions via Yottaa Portal
22. TECHNICAL SOLUTION
Architecture Drivers
▪ Volume (> 100 TB scale)
▪ Throughput (> 20K/sec)
▪ Performance (low latency)
▪ Exploratory analytics
▪ Near Real-time (5 sec latency)
▪ Historical view (5 years data)
Solution: Lambda Architecture
Combine different techniques:
▪ Stream (recent data) – hot data
▪ Batch (all data) – cold and warm data
Layers (handling Volume, Velocity, Variety):
▪ Batch Layer: Master Data → Batch Processing → Batch View
▪ Speed Layer: Stream Processing → Real-time View
▪ Serving Layer: serves Batch View and Real-time View
Input: Web Logs
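The layer split above can be sketched in miniature: the batch layer recomputes an aggregate over all master data, the speed layer aggregates only recent events, and the serving layer merges both views at query time. This is an illustrative sketch (counting events from web logs), not Yottaa's actual implementation; all names here are hypothetical.

```python
from collections import Counter

# All historical events (master dataset) -- batch layer input.
master_data = ["login", "purchase", "login", "view", "view", "login"]

# Events that arrived after the last batch run -- speed layer input.
recent_stream = ["view", "login", "purchase"]

def batch_view(events):
    """Batch layer: recompute an aggregate over the full master dataset."""
    return Counter(events)

def realtime_view(events):
    """Speed layer: incrementally aggregate only recent events."""
    return Counter(events)

def serving_layer_query(key, batch, realtime):
    """Serving layer: merge batch and real-time views at query time."""
    return batch.get(key, 0) + realtime.get(key, 0)

batch = batch_view(master_data)
realtime = realtime_view(recent_stream)
print(serving_layer_query("login", batch, realtime))  # 3 batch + 1 recent = 4
```

The key design point is that the expensive batch recomputation can lag by hours, while the cheap real-time view covers only the gap since the last batch run.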
In the modern world, data is produced with ever-increasing volume, velocity, and variety of formats. This data can be extremely valuable: it can be used to understand and track application or service behavior so that we can find errors or suboptimal user experience, and we can mine it for patterns and correlations to generate recommendations. Examples include e-commerce sites that analyze user access logs to provide product recommendations, and social networking or dating sites that recommend new friends or help find qualified soul mates, and so forth.
Also, consumers and businesses these days are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing of historical data.
We are also finding that as data creation becomes more real-time and continuous, there is a need to manage it at high speed.
To simplify big data processing, we present it as a data bus comprising various stages: ingest, store or collect, process, and finally analyze and present data for visualization.
The right technology for each stage is chosen based on criteria such as data volume and structure, as well as query latency, request rate, and item size.
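The data-bus stages can be sketched as a chain of simple functions. The stage names come from the slides; the function bodies below are illustrative placeholders, not any real AWS service's behavior.

```python
def ingest(raw_records):
    # Ingest: accept raw records from producers (files, APIs, agents).
    return [r.strip() for r in raw_records]

def store(records, warehouse):
    # Store/collect: persist records durably before processing.
    warehouse.extend(records)
    return warehouse

def process(warehouse):
    # Process: transform stored records (here, parse "key=value" pairs).
    return [dict([r.split("=")]) for r in warehouse]

def analyze(parsed):
    # Analyze: reduce processed records to an answer for visualization.
    return len(parsed)

warehouse = []
records = ingest([" user=alice \n", "user=bob"])
store(records, warehouse)
answer = analyze(process(warehouse))
print(answer)  # 2
```

In a real deployment each function would be a separate managed service (e.g. Kinesis for ingest, S3 for store, EMR for process), but the staged hand-off is the same.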
AWS delivers technologies to accommodate all of those processing stages. Here you can see the extensive portfolio offered to deal with various aspects of big data. But which services should you use, and why, when, and how?
The first question is usually: how can I move my data to AWS?
Once data starts moving into AWS, we can persist it in a number of storage services for further analysis. The Relational Database Service (RDS), the S3 object store, the Kinesis streaming storage solution, and the DynamoDB key-value store, plus the Hadoop file system on Elastic MapReduce and the Redshift warehouse, give you a wide range of options to persist data.
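As one concrete ingestion detail: Kinesis's PutRecords API accepts at most 500 records per call, so producers typically chunk their buffers before sending. A minimal sketch of that chunking logic follows; the boto3 call is shown commented out since it needs AWS credentials, and the stream name and `key_for` helper are hypothetical.

```python
def chunk(records, max_batch=500):
    """Split a record buffer into PutRecords-sized batches (max 500 each)."""
    return [records[i:i + max_batch] for i in range(0, len(records), max_batch)]

# With credentials configured, each batch would be sent roughly like:
# import boto3
# kinesis = boto3.client("kinesis")
# for batch in chunk(buffer):
#     kinesis.put_records(
#         StreamName="web-logs",  # hypothetical stream name
#         Records=[{"Data": r, "PartitionKey": key_for(r)} for r in batch],
#     )

batches = chunk([b"rec"] * 1200)
print([len(b) for b in batches])  # [500, 500, 200]
```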
But which data storage should be used, and when?
Here at Amazon, we don't believe there is one tool that can do everything; rather, if you use the right tools, you can build a highly configurable big data architecture to meet your specific needs.
AWS comes with a variety of services that provide customers with the right tool, optimized for data structure, query complexity, and other data characteristics such as access frequency patterns.
Here you can see AWS services grouped into four classes based on data structure and query complexity. The top two quadrants represent structured data and the bottom two represent unstructured data. At the same time, the left column groups services that are well optimized for simple query patterns, while the two groups on the right present services optimized for complex queries.
Another way to think about big data design is the access frequency pattern, which can be visualized as data temperature.
Data is labeled hot if it is accessed very frequently by customers, typically within a window of a second or a few seconds.
On the opposite side of the temperature scale we have cold data, which is typically archived data with a rare chance of being accessed, or data that can tolerate an hour's access delay.
Warm usually refers to data whose access pattern ranges from a few seconds to a few minutes.
Other parameters such as total data volume, item size, request rate, and query latency, as well as durability and cost, play an equally important role in building a highly configurable big data architecture that meets customer-specific needs.
Usually, for hot data we are talking about small objects of a few KB, with a total volume of a few GB at most, but with low expected query latency and a high request rate. When we are talking about cold data, it is usually large data volumes with a low request rate and processing response times within minutes if not hours.
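These rules of thumb can be expressed as a small classifier. The thresholds below are illustrative, taken loosely from the hot/warm/cold characteristics discussed here; they are not an AWS-defined formula, and the service suggestions in the comments are only the examples named in this talk.

```python
def data_temperature(latency_ms, requests_per_sec):
    """Classify data temperature from required latency and request rate.

    Thresholds are illustrative rules of thumb, not AWS-defined limits.
    """
    if latency_ms <= 50 and requests_per_sec >= 1000:
        return "hot"   # e.g. ElastiCache, DynamoDB
    if latency_ms <= 5_000:
        return "warm"  # e.g. RDS, CloudSearch, EMR/HDFS, S3
    return "cold"      # e.g. Glacier

print(data_temperature(10, 5000))       # hot: ms latency, very high rate
print(data_temperature(500, 50))        # warm: sub-second latency
print(data_temperature(3_600_000, 1))   # cold: can tolerate an hour
```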
This heat map combines the notion of data temperature with query latency, and summarizes the AWS solutions available in the context of data temperature, data volume, durability, request rate, processing latency, and pricing requirements.
ElastiCache and DynamoDB are a good fit for hot data, while RDS, CloudSearch, EMR/HDFS, and S3 provide options for warm data, and finally Glacier is the offering for cold data.
There are certain intersections in terms of latency, request rate, and data volume among ElastiCache, DynamoDB, and RDS, or among DynamoDB, RDS, and HDFS.
Thus our customers always have a few options to implement their solution.
Finally, to provide a complete toolset for big data problems, AWS provides processing applications called connectors, which can write to multiple data stores, and
processing frameworks such as Storm, Hive, Spark, etc., which can read from multiple data stores.
For the visualization tier, AWS works with many partners providing business intelligence platforms that can connect to AWS big data services through standard APIs.
That is the end of my presentation, and I thank you for your time.
20 years ago, IT/OPS managed as much of the application delivery chain as possible
Content was aggregated at the web server
Experiences were optimized using Application Delivery Controllers – hardware appliances in your datacenter
Threats were mitigated by hardware-based firewalls
And load balancers ensured scalability
All of these components were under your control and had ample opportunity to accelerate and secure applications and data.
[BUILD]
But today, content is aggregated in the browser. Consider some of the standard 3rd party components that together make up engaging, personalized experiences.
DISCO:
What are some of the 3rd party components that you folks include in your apps today?
Have you had challenges either adding the components you want, or ensuring an optimal experience with the components you have?
What are some of the things you’ve tried to fix that?
Have they worked?
What has it cost you?
[BUILD]
Not only has the aggregation point moved out to the browser, but web architectures have evolved to include more aaS solutions for your infrastructure, platform and software needs.
DISCO:
What ‘aaS’ components do you use today or are you planning to include?
What were your goals for using ‘aaS’ components?
How has that change impacted your business?
[GETTING TO THE YOTTAA POINT]
The industry is moving to a services-based model – if you've heard of SOA (service-oriented architecture), it's the way developers prefer to build modern applications, because it makes them far more efficient and capable of achieving far more.
However it also changes things:
Applications connect directly to the internet – they’re not managing connections and data via application delivery controllers and your firewall
Moving content aggregation to the browser also means that
ADCs have no access to optimize the application
And neither do CDNs
…because the BROWSER is requesting and rendering all of the content. ADCs and CDNs do not extend to the browser.
The ADC stops at your datacenter
And the CDN stops at the edge
[BUILD]
So Yottaa has built an app optimization platform that extends from your datacenter all the way to the user’s browser.
It was designed from the ground-up to work with legacy and modern cloud architectures
This means that we are completely platform, infrastructure and software agnostic – we have to be able to work with any networked solutions you have in place today
And, to enable developers, IT professionals, marketers and the businesses they support to remain agile and focused on the customer, we require no code change. Every Yottaa optimization is configuration-based and delivered in real-time via our cloud service.
SEGUE: the net effect is significant
We've been certified as a NetSuite "Built for NetSuite" (SuiteApp) technology partner and proven to accelerate eCommerce sites with
NO modification to NetSuite, which means we don’t require a cartridge
NO limitations to other components you might use on the NetSuite platform
And NO slowdowns or other dependencies because we require no code change