12. Real problem
● ~85 million active users, ~115 million registered users
● 11.5 messages per user per day
● ~11000 req/sec
● Peaks 6x
● 99% HTTPS with response time < 100ms
● Reliable and scalable for future growth up to 80k req/sec
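The load figures above hang together if each message is treated as one request spread evenly over the day. That is an assumption, not something the slide states. A minimal sketch of the arithmetic (the class and method names are illustrative only):

```java
public class LoadEstimate {
    static final long ACTIVE_USERS = 85_000_000L;          // from the slide
    static final double MESSAGES_PER_USER_PER_DAY = 11.5;  // from the slide
    static final int PEAK_FACTOR = 6;                      // from the slide
    static final int SECONDS_PER_DAY = 24 * 60 * 60;

    // Assumption: one message equals one request, evenly spread over 24 hours.
    static double averageRequestsPerSecond() {
        return ACTIVE_USERS * MESSAGES_PER_USER_PER_DAY / SECONDS_PER_DAY;
    }

    static double peakRequestsPerSecond() {
        return averageRequestsPerSecond() * PEAK_FACTOR;
    }

    public static void main(String[] args) {
        System.out.printf("average: %.0f req/sec%n", averageRequestsPerSecond());
        System.out.printf("peak:    %.0f req/sec%n", peakRequestsPerSecond());
    }
}
```

Under these assumptions the average comes out at roughly 11.3k req/sec and the 6x peak at roughly 68k req/sec, which leaves some headroom below the 80k target.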
15. SOLR? Why?
● Pros:
○ Quick search on complex queries
○ Has a lot of built-in features (push notifications, master-slave replication, RDBMS integration)
● Cons:
○ HTTP only; embedded mode performs worse
○ Not easy for beginners
○ Max load is ~100 req/sec per instance
16. “Simple” query
"-(-connectionTypes:" + "\"" + getConnectionType() + "\"" + " AND connectionTypes:[* TO *]) AND "
+ "-connectionTypeExcludes:" + "\"" + getConnectionType() + "\"" + " AND "
+ "-(-OSes:" + "(\"" + osQuery + "\" OR \"" + getOS() + "\")" + " AND OSes:[* TO *]) AND "
+ "-osExcludes:" + "(\"" + osQuery + "\" OR \"" + getOS() + "\")"
+ " AND (runOfNetwork:T OR appIncludes:" + getAppId() + " OR pubIncludes:" + getPubId() + " OR categories:(" + categoryList + "))"
+ " AND -appExcludes:" + getAppId() + " AND -pubExcludes:" + getPubId() + " AND -categoryExcludes:(" + categoryList + ") AND "
+ keywordQuery + " AND "
+ "-(-devices:" + "\"" + getHandsetNormalized() + "\"" + " AND devices:[* TO *]) AND "
+ "-deviceExcludes:" + "\"" + getHandsetNormalized() + "\"" + " AND "
+ "-(-carriers:" + "\"" + getCarrier() + "\"" + " AND carriers:[* TO *]) AND "
+ "-carrierExcludes:" + "\"" + getCarrier() + "\"" + " AND "
+ "-(-locales:" + "(\"" + locale + "\" OR \"" + langOnly + "\")" + " AND locales:[* TO *]) AND "
+ "-localeExcludes:" + "(\"" + locale + "\" OR \"" + langOnly + "\") AND "
+ "-(-segments:(" + segmentQuery + ") AND segments:[* TO *]) AND "
+ "-segmentExcludes:(" + segmentQuery + ")"
+ " AND -(-geos:" + geoQuery + " AND geos:[* TO *]) AND "
+ "-geosExcludes:" + geoQuery
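Most of the query repeats two clause shapes: `-(-field:X AND field:[* TO *])`, which keeps a document when the field is empty or contains X, and `-fieldExcludes:X`, which drops a document that explicitly excludes X. A sketch of how those shapes could be factored into helpers follows. The class, the method names and the sample values are hypothetical and do not come from the original code.

```java
public class TargetingClauses {
    /** Wraps a value in double quotes, escaping backslashes and embedded quotes. */
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Keeps documents where the field is empty or contains the value. */
    static String unsetOrMatches(String field, String value) {
        return "-(-" + field + ":" + quote(value) + " AND " + field + ":[* TO *])";
    }

    /** Drops documents whose exclude field contains the value. */
    static String notExcluded(String excludeField, String value) {
        return "-" + excludeField + ":" + quote(value);
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        String query = String.join(" AND ",
                unsetOrMatches("connectionTypes", "wifi"),
                notExcluded("connectionTypeExcludes", "wifi"),
                unsetOrMatches("carriers", "ExampleCarrier"),
                notExcluded("carrierExcludes", "ExampleCarrier"));
        System.out.println(query);
    }
}
```

The helpers only shorten the string building; the resulting query is just as heavy for Solr to evaluate.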
17. Solr
Index size < 1 GB: response time 20-30 ms
Index size < 100 GB: response time 1-2 sec
Index size < 400 GB: response time from 10 sec
20. Why NoSQL?
● Realtime data
● Quick response time
● Simple queries by key
● 1-2 NoSQL queries on every request. Average load 10-20k req/sec and >120k req/sec at peak.
● Cheap solution
21. Why Redis? Pros
● Easy and lightweight
● Low latency and response time: 99% of requests take < 1 ms, average latency is ~0.2 ms
● Up to 100k 'get' commands per second on c1.xlarge
● Cool features (atomic increments, sets, hashes)
● Available as a managed AWS service: ElastiCache
22. Why Redis? Cons
● Single-threaded out of the box
● Using all cores requires sharding/clustering
● Scaling and failover are not easy
● Limited by the maximum instance memory (240 GB on the largest AWS instance)
● Persistence/swapping may delay responses
● Cluster solution is not production-ready
● Data loss possible
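Since one Redis process uses a single core, spreading keys over several instances is usually done on the client side. A minimal sketch of modulo sharding follows; `ShardRouter` is a hypothetical helper and not part of any Redis client library.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ShardRouter {
    private final int shardCount;

    public ShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    /** Maps a key to a shard index in [0, shardCount) using CRC32 of its UTF-8 bytes. */
    public int shardFor(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        // getValue() is an unsigned 32-bit value held in a long, so the remainder is non-negative.
        return (int) (crc.getValue() % shardCount);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(4); // e.g. one Redis instance per core
        System.out.println("user:42 goes to shard " + router.shardFor("user:42"));
    }
}
```

Plain modulo sharding remaps most keys when the shard count changes, which is one reason scaling is listed among the cons.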
25. Redis RAM problem
● 1 user entry: from ~80 bytes to ~3 KB
● ~85 million users
● Required RAM: from ~1 GB to ~300 GB
26. Data compression
JSON → Kryo binary → 4x less data → gzip → 2x less data == 8x less data
Now we need < 40 GB
+ Less load on the network stack
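With the worst case of ~300 GB, an 8x reduction gives 300 / 8 = 37.5 GB, which is where the "< 40 GB" figure fits. Kryo is a third-party library, so the sketch below covers only the gzip step, applied to a byte array that is assumed to be already serialized. `GzipCodec` is an illustrative name, and the sample payload is made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCodec {
    /** Compresses an already-serialized entry before it is written to the store. */
    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Restores the serialized entry after it is read back. */
    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = gzip.read(chunk)) != -1) {
                out.write(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder sample = new StringBuilder();
        for (int i = 0; i < 50; i++) {
            sample.append("{\"segment\":\"sports\",\"os\":\"android\"}");
        }
        byte[] raw = sample.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(raw);
        System.out.println(raw.length + " bytes raw, " + packed.length + " bytes gzipped");
    }
}
```

The actual ratio depends on the data: gzip adds a fixed header and trailer, so very small entries may not shrink at all.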
28. AdServer BE
● Logging — 12% of time (5% on SSD);
● Response generation — 15% of time;
● Redis request — 50% of time;
● All business logic — 23% of time;
31. Log structure
● 1 million records == 0.6 GB.
● ~900 million records a day == ~0.55 TB.
● 1 month: up to 20 TB of data.
● Zipped data is 10 times smaller.
32. Reporting
Customer: “And we need fancy reporting”
But 20 TB of data per month is huge. So what can we do?
33. Reporting
Dimensions:
device, os, osVer, screenWidth, screenHeight, country, region, city, carrier, advertisingId, preferences, gender, age, income, sector, company, language, etc.
Use case:
I want to know how many users saw my ad in San Francisco.
35. Predefined report types → aggregation by
predefined dimensions → 500-1000 times less
data
20 TB per month → 40 GB per month
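Aggregating by predefined dimensions means the raw log rows are collapsed into one counter per combination of the chosen dimensions. A minimal in-memory sketch follows, assuming a hypothetical `Impression` record with only two of the dimensions; the real pipeline runs on Hadoop, not on Java streams.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class ImpressionReport {
    /** Hypothetical log record reduced to two dimensions. */
    static final class Impression {
        final String adId;
        final String city;

        Impression(String adId, String city) {
            this.adId = adId;
            this.city = city;
        }
    }

    /** Collapses raw records into one counter per (adId, city) pair. */
    static Map<String, Long> countByAdAndCity(List<Impression> log) {
        return log.stream().collect(Collectors.groupingBy(
                i -> i.adId + "|" + i.city,   // composite key of the predefined dimensions
                TreeMap::new,
                Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Impression> log = Arrays.asList(
                new Impression("ad-1", "San Francisco"),
                new Impression("ad-1", "San Francisco"),
                new Impression("ad-1", "Kyiv"));
        countByAdAndCity(log).forEach((key, count) -> System.out.println(key + " = " + count));
    }
}
```

This counts impressions rather than distinct users; counting unique users would need a distinct step on a user identifier before the count.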
36. Of course: Hadoop
● Pros:
○ Unlimited (depends) horizontal scaling
● Cons:
○ Not real-time
○ Processing time depends directly on code quality and on infrastructure cost
○ Not all input can be scaled
○ Cluster startup is so... long
38. Timing
● Hadoop (Cascading):
○ 25 GB in a peak hour takes ~40 min (-10 min). CSV output is 300 MB. Cluster of 4 c3.xlarge.
● MySQL:
○ Loading 300 MB into the DB with insert statements: ~40 min.
○ Loading 300 MB into the DB with optimizations: ~5 min.
39. Optimizations
● No “insert into”, only “load data”: ~10 times faster
● “ENGINE=MyISAM” instead of “InnoDB” when possible: ~5 times faster
● For “upsert”: temp table with “ENGINE=MEMORY” for IO savings
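The three optimizations can be combined into one load sequence: stage the CSV in an in-memory temporary table with `LOAD DATA`, then merge it into the target table in a single statement. The sketch below only builds the SQL strings. The table `report`, its columns and the file path are hypothetical, and the merge assumes a unique key on `(day, city)`.

```java
public class BulkLoadSql {
    // Hypothetical staging table held in RAM, so staging itself costs no disk IO.
    static final String CREATE_STAGING =
            "CREATE TEMPORARY TABLE report_staging ("
            + "day DATE NOT NULL, city VARCHAR(64) NOT NULL, impressions BIGINT NOT NULL"
            + ") ENGINE=MEMORY";

    /** Bulk-loads a CSV file instead of issuing one INSERT per row. */
    static String loadCsv(String csvPath) {
        return "LOAD DATA LOCAL INFILE '" + csvPath + "' INTO TABLE report_staging"
                + " FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'"
                + " (day, city, impressions)";
    }

    // Upsert: new keys are inserted, existing keys get their counters increased.
    static final String MERGE =
            "INSERT INTO report (day, city, impressions)"
            + " SELECT day, city, impressions FROM report_staging"
            + " ON DUPLICATE KEY UPDATE impressions = report.impressions + VALUES(impressions)";

    public static void main(String[] args) {
        System.out.println(CREATE_STAGING);
        System.out.println(loadCsv("/tmp/report.csv"));
        System.out.println(MERGE);
    }
}
```

The statements would be sent over an ordinary JDBC connection; `LOAD DATA LOCAL` also has to be allowed on both the client and the server.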
40. Cascading
Hadoop:
void map(K key, V val, OutputCollector collector) {
    ...
}
void reduce(K key, Iterator<V> vals, OutputCollector collector) {
    ...
}
Cascading:
Scheme sinkScheme = new TextLine(new Fields("word", "count"));
Pipe assembly = new Pipe("wordcount");
assembly = new Each(assembly, new Fields("line"), new RegexGenerator(new Fields("word"), ","));
assembly = new GroupBy(assembly, new Fields("word"));
Aggregator count = new Count(new Fields("count"));
assembly = new Every(assembly, count);
41. Why Cascading?
Hadoop Job 1 → Hadoop Job 2 → Hadoop Job 3
The result of one job must be processed by the next job.
60. Small tweaks. FOR loop
for (A a : arrayListA) {
    // do something
    for (B b : arrayListB) {
        // do something
        for (C c : arrayListC) {
            // do something
        }
    }
}
61. Small tweaks. FOR loop
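// What the compiler generates for "for (A a : arrayListA)"
for (Iterator<A> i = arrayListA.iterator(); i.hasNext();) {
    A a = i.next();
}
// ArrayList: every iterator() call allocates a new Itr object
public Iterator<E> iterator() {
    return new Itr();
}
private class Itr implements Iterator<E> {
    int cursor = 0;
    int lastRet = -1;
    int expectedModCount = modCount;
    // hasNext(), next(), remove() omitted
}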
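Each execution of a for-each loop over an `ArrayList` asks for a fresh `Itr`, so the inner loops of a nested construct allocate one iterator per outer iteration. One common way to avoid that, sketched here as an assumption about the tweak the slides are leading to, is to index into the list directly. `IndexedLoop` and its method are illustrative names.

```java
import java.util.ArrayList;
import java.util.Arrays;

public class IndexedLoop {
    /** Sums all pairwise products using indexed loops, so no Iterator objects are created. */
    static long sumOfProducts(ArrayList<Integer> as, ArrayList<Integer> bs) {
        long total = 0;
        for (int i = 0, n = as.size(); i < n; i++) {
            int a = as.get(i);
            for (int j = 0, m = bs.size(); j < m; j++) {
                total += (long) a * bs.get(j);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        ArrayList<Integer> as = new ArrayList<>(Arrays.asList(1, 2));
        ArrayList<Integer> bs = new ArrayList<>(Arrays.asList(3, 4));
        System.out.println(sumOfProducts(as, bs));
    }
}
```

This only pays off for random-access lists such as `ArrayList`, and the JIT can sometimes remove the iterator allocation on its own, so the gain is worth measuring on the real hot path.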
82. Hadoop
public boolean equals(Object obj) {
    EqualsBuilder equalsBuilder = new EqualsBuilder();
    equalsBuilder.append(id, otherKey.getId());
    ...
}
public int hashCode() {
    HashCodeBuilder hashCodeBuilder = new HashCodeBuilder();
    hashCodeBuilder.append(id);
    ...
}
Wrong
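The slide does not say why this is wrong. One plausible reading is that each call allocates a builder object, which adds up when Hadoop hashes and compares keys for every record. A sketch of the same methods written against the fields directly follows; `ReportKey` and its fields are hypothetical.

```java
import java.util.Objects;

public final class ReportKey {
    private final long id;
    private final String city;

    public ReportKey(long id, String city) {
        this.id = id;
        this.city = Objects.requireNonNull(city, "city");
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof ReportKey)) {
            return false;
        }
        ReportKey other = (ReportKey) obj;
        // Compare the cheap primitive first, then the string.
        return id == other.id && city.equals(other.city);
    }

    @Override
    public int hashCode() {
        // No helper objects are allocated here.
        return 31 * Long.hashCode(id) + city.hashCode();
    }
}
```

Equal keys still produce equal hash codes, so the class behaves the same in hash-based collections as the builder version would.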
84. Hadoop
public void map(...) {
    …
    for (String word : words) {
        // allocates two new objects for every word
        output.collect(new Text(word), new IntWritable(1));
    }
}
Wrong
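85. Hadoop
class MyMapper extends Mapper {
    // Writable instances are created once and reused for every record
    private final Text wordText = new Text();
    private final IntWritable one = new IntWritable(1);
    public void map(...) {
        for (String word : words) {
            wordText.set(word);
            output.collect(wordText, one);
        }
    }
}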
87. AWS ElastiCache
● Strange timeouts (with SO_TIMEOUT 50ms)
● No replication to another cluster
● «Cluster» is not a cluster
● Cluster nodes are regular instances, so you pay for 4 cores while using 1
88. AWS Limits. You never know where
● Network limit
● PPS rate limit
● LB limit
● Cluster start time up to 20 mins
● Scalability limits
● S3 is slow for many files