Yahoo! Mail antispam - Bay area Hadoop user group

Yokai Versus the Elephant: Hadoop and the Fight Against Shape-Shifting Spam
Vishwanath Ramarao & Mark Risher, Yahoo! Mail

Agenda
- Shape-shifting spam
- Antispam Origins
- Hadoop
- Algorithms
- Applications to Security
- Resources for Implementers

http://f915fde2cf53df18.ligh tto pbody.com]*!}v}]along especially consecutive important dmvfu

1,300,925,111,156,286,160,896 (http://bit.ly/cpOyLi)

Typical attack/response profile
[Chart annotation: Rule change (1/23@01:15)]

MORE YOKAI - TARGETED ATTACKS <style>mechanic CC0066 getimage 3A00 lectroniques repertoires spiel proscribing ammonoid 10110 radiobuttontelefoons Jermaine iesaporitoroshan 3026 janatatrennungpalillos toughest ncapitolecalzado 20200 Omnimedia collective saudadedizaines 205px hardener elongating InvasionofyourprivacyPersonnalftsbedingungenMontanerprozacSerpellfcardbvh capacitate 12502 courtship kiranjiutroligt transducer tyee Delhaize clueless toffee nnioZoapochino sterns 622 Verordnung carbons waterresistant assessing footerTextperrine url0 potatoes 999933 Rightmove positively thmb closer secures Amarillo suffer 314992 32599 8849 GJ initialling cockleshell JTA Justiaguardo jibes Chubb inflammatory iteration granfaldasseoir considerations 692px treasured Allotransplantationtwoyearsappx Bowers doorgeven 1487 bigpicture repeatedly Popp MPEG4 webbsidaliefdeVoeding Elena Kernighan sternway laggardly Zwischendurch commons equis sewing f17 apadrinasareiniqueslugoquotedblbayr 3500 CI addressee optativelygazzetta 616px mingus 23238 PhotoLink desuetude tofu keychains molding redevelopment stucco deltage astrology2 thumbscrews probablemente 700g rnsfuseactionrepristaires restraint manchettestrendlineseffectuedespatchMinskyestadual doses danbrown Muenster jind7n7 smashes gourmandesashantisentants rows kyk coated Incontournablescoincidenjspa stalker CDS contienen expletives s8 eof replenishing puyalluppratosondravalidarorientale sonnets steamer Niwangoacrocentric dozens elr tempting poing jails ingredi Sep3 misdirection vested tecniciconciertos dear martini 3D35 MBR DNAME 2650 violation Egyptiin NCR sposoriss hl 12450 connectors circumcision transform CFA employeur 153 comunicazioni miner 19905 citronella PlissierHellmich Randall CaradonnaspringaregistradahauptEntran 3060 Rochin capacitor sotol 3413 smirk interditeServicePoint capabilities bouncefeeLinkov 3Dg auntie OSP CaeciliaPlatzierung wrangler pisosbanlieueDaniellaenderleisraelprofessionnellessusto 39800 Espanaplena radian 
antic!...........................200KB………. </style> <center><a href="http://ivywhere.info/52210088504303.hrmj.1/285/1000/1006/1000/1237976a102c0176c7b3fb3164f83590.html">Please Click Here if You Can't See Images<br><imgsrc="http://ivywhere.info/images/usacpm1.jpg" border="0"></a><br><a href="http://ivywhere.info/52210088504303.hrmj.1/40106/1000/1000/1000/a.html"><imgsrc="http://ivywhere.info/images/usacpm2.jpg" border="0"></a><br><a href="http://ivywhere.info/gp.html"><imgsrc="http://ivywhere.info/images/please2.jpg" border="0"></a><br> 12 [400kb…] <center><a href="http://corfair.info/52210088504303.hrmj.1/129286/1000/1006/1000/d1c7b1fa06980b08bf9b3a9c14844623.html">Please Click Here if You Can't See Images<br><imgsrc="http://corfair.info/images/ivblg1.jpg" border="0"></a><br><a href="http://corfair.info/52210088504303.hrmj.1/40126/1000/1000/1000/a.html"><imgsrc="http://corfair.info/images/ivblg2.jpg" border="0"></a><br><a href="http://corfair.info/gp.html"><imgsrc="http://corfair.info/images/please2.jpg" border="0"></a><br>

Why is the antispam problem hard?
- Scale of the problem: 25B connections, 5B deliveries, 450M mailboxes
- User feedback is often late, noisy, and not always actionable
- Large, diverse stream of legitimate traffic that looks like spam
- Slow adoption of authentication technologies like DKIM and SPF
- Spammers are clever; they target and specialize attacks
- Rapidly changing spam campaigns with a large bot-controlled IP base; large variations even within a single campaign
- A significant percentage of spam comes from large ESPs like Hotmail, Google, and Yahoo

Generation 1: Manual management layer
- Heuristics, blocks, blacklists
  - Provide attack mitigation and operational flexibility; highly explainable
  - Not durable; expensive to keep pace with fast-morphing spam
- Ad hoc queries
  - Proprietary implementations, not very scalable, steep learning curve
  - Reactive and usually late

Generation 2: Machine management layer
- Online reputation models
  - Simple, mostly scoring/counter/ratio-based models
  - Highly scalable due to the absence of any state/memory
  - Generalize too broadly, lack expressive power
- Batch-trained reputation models
  - Typically digested memory-based hashing or machine learning models
  - Difficult to implement and, because they need labeled examples, scale only moderately well
  - Slow to update and learn, lack explainability, limited operational control
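To make "scoring/counter/ratio-based" concrete, here is a minimal sketch of what such an online reputation model might look like. It is an illustration only, not the production system: the class name `RatioReputation`, its methods, and the choice of a spam-votes-per-message ratio are all assumptions. It keeps just two counters per sender and no memory of individual messages.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a counter/ratio-based reputation model.
// Names and the scoring rule are illustrative assumptions, not from the talk.
class RatioReputation {
    // Per-sender counters: [0] = messages seen, [1] = spam votes received.
    private final Map<String, long[]> counters = new HashMap<>();

    // Count one delivered message for this sender (an IP or a domain).
    void recordMessage(String sender) {
        counters.computeIfAbsent(sender, k -> new long[2])[0]++;
    }

    // Count one "this is spam" vote against this sender.
    void recordSpamVote(String sender) {
        counters.computeIfAbsent(sender, k -> new long[2])[1]++;
    }

    // Spam votes per message seen; 0.0 for senders with no recorded messages.
    double spamRatio(String sender) {
        long[] c = counters.get(sender);
        if (c == null || c[0] == 0) {
            return 0.0;
        }
        return (double) c[1] / c[0];
    }
}
```

A single ratio per sender is cheap to update, but it also shows why such models "generalize too broadly": every campaign sent from the same sender collapses into one number.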

Distributed computing paradigm
Map:Reduce + distributed storage:
- Expressiveness of offline analysis
- Ease of management

The map:reduce paradigm
[Diagram: Mappers emit pairs such as <k1,v1>, <k2,v2>, and <k1,v3>; pairs are grouped by key into <k1,{v1,v3}> and <k2,v2>; a Reducer produces output such as <k1,W1>.]
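The data flow in the diagram can be imitated in a single process. The sketch below is not Hadoop code and uses no Hadoop classes; `MiniMapReduce` and its method names are made up for illustration. It shows the three steps: mappers emit <key, value> pairs, the pairs are grouped by key, and a reducer collapses each group.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process imitation of map -> group-by-key -> reduce (illustrative only).
class MiniMapReduce {
    // Map step: emit a <word, 1> pair for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // Grouping step: collect all values that share a key, giving <k, {v1, v3, ...}>.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce step: collapse one key's list of values into a single sum.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run all three steps over the input lines and return <word, count>.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
        System.out.println(run(Arrays.asList(
                "Hello World Bye World", "Hello Hadoop Goodbye Hadoop")));
    }
}
```

On a real grid the map calls run in parallel on different machines and the grouping step moves data across the network. This sketch only shows the contract between the steps.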

A simple map:reduce example

    $ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
    Hello World Bye World
    $ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
    Hello Hadoop Goodbye Hadoop

    # Split up input files (MAP), iterate over chunks, reassemble results (REDUCE)
    $ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

    $ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
    Bye 1
    Goodbye 1
    Hadoop 2
    Hello 2
    World 2

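A simple map:reduce example (bit.ly/bdyi0l)

    // Fields of the enclosing Mapper class, reused across calls to map()
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emit <word, 1> for every whitespace-separated token in the input line
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }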

A simple map:reduce example (bit.ly/bdyi0l)

    // Sum the counts collected for each word
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

Let's review our design goals again
- Classifiers are notorious for lack of explainability
- Engineers and analysts need to know what the classifier is missing
- Engineers and analysts need to know about emerging threats
- Analysts need “canned” reports along interesting dimensions
- Machines need smart feature engineering
- Develop a scalable system to provide deep insight into spammer campaigns
- Double up as a platform for standard reporting
- Also double up as a platform for ad hoc analysis and data probing
- Signal amplification and smart feature extraction platform

Our antispam analytic platform
- Hadoop: implements map reduce; written in Java but supports many other languages, including Perl and C++, using the streaming interface
- Feature engineering with small, simple Perl programs for data extraction and transformation
- SQL-like “Pig” programming language for data analysis and management
- Mahout: data mining libraries that provide shrink-wrapped, scalable, sophisticated algorithms
- Other proprietary algorithms and frameworks for specialized tasks

Various aspects of a grid-driven solution
- Standard reporting
- Ad hoc querying
- Campaign discovery from spam feedback using frequent itemset mining
- “Gaming” detection in notspam feedback using connected components

Ad hoc queries for antispam research
- Identify domains that had few spam votes in the previous time window but have a high number of spam votes today
- All IPs in the last hour that sent a particular URL pattern, or that sent any unknown URL more than 500 times
- Which domains/IPs suddenly increased their sending volume after a positive reputation change?
- Which FROM addresses exhibit low message size entropy?
- All messages that had nothing but a URL, where the domain of the URL had low page rank
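The "low message size entropy" query relies on a simple statistic. As a sketch, assuming entropy here means the Shannon entropy of the distribution of message sizes seen for one FROM address (the deck does not spell out the definition), it could be computed as below. The class name `SizeEntropy` is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: Shannon entropy of one sender's message sizes.
// Assumes exact byte sizes are the symbols; this definition is an assumption.
class SizeEntropy {
    // Returns H = -sum(p * log2(p)) over the distinct sizes, in bits.
    static double entropyBits(int[] sizes) {
        if (sizes.length == 0) {
            return 0.0;
        }
        // Count how often each distinct size occurs.
        Map<Integer, Integer> counts = new HashMap<>();
        for (int s : sizes) {
            counts.merge(s, 1, Integer::sum);
        }
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / sizes.length;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }
}
```

A sender whose messages are all the same size scores 0 bits, which is the kind of machine-generated uniformity the query looks for. A sender with varied sizes scores higher. Bucketing sizes before counting would make the measure less sensitive to small padding differences.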

Ad hoc queries - anatomy of a Pig query

    --- This includes some basic string functions, including splitting a string on the '@' character
    register /homes/jpujara/pig_scripts/string.jar;
    define splitEmail string.Tokenize('2','@');

    --- Load up some data - incoming messages at a date and time, and our trusted user database
    MESSAGES = load '/projects/antispam/mta_feature_logs/$date*/*/*-$time*' using com.yahoo.ymail.pigfunctions.AsStorage('__record_key__,firstrcpt,mailfrom') as (mid:chararray,to:chararray,from:chararray);
    USERS = load '/projects/antispam/TrustedUser.bz2' using com.yahoo.ymail.pigfunctions.AsStorage('user,t') as (user:chararray,trusted:int);

    --- Split the e-mail addresses into user+domain and generate the appropriate user-id for yahoo users and partners
    EXPLODED_MESSAGES = FOREACH MESSAGES GENERATE to,FLATTEN(splitEmail(to)) as (user,udomain),FLATTEN(splitEmail(from)) as (sender,sdomain);
    YAHOO_MESSAGES = FOREACH EXPLODED_MESSAGES GENERATE (udomain MATCHES '.*yahoo.*' ? user : to ) as yuser,sdomain;

    --- Combine the message and sender domains with the trusted user data and select only trusted messages
    YAHOO_MESSAGES_TRUST = JOIN YAHOO_MESSAGES by yuser, USERS by user;
    TRUSTED_MESSAGES = FILTER YAHOO_MESSAGES_TRUST by trusted > 0;

    --- Group by domain, and generate a count, order by descending count
    DOMAIN_GROUPS = GROUP TRUSTED_MESSAGES by sdomain;
    DOMAIN_GROUPS_COUNT = FOREACH DOMAIN_GROUPS GENERATE group,COUNT(TRUSTED_MESSAGES) as count;
    DOMAIN_GROUPS_ORDER = ORDER DOMAIN_GROUPS_COUNT by count DESC;

    --- Output the results
    STORE DOMAIN_GROUPS_ORDER into '$targetdir/topDomains';

Campaign discovery in spam feedback
- Frequent itemset mining
  - Classical method
  - Finds interesting relationships between variables in a large database
  - Primarily applied to market basket analysis
  - Many good implementations
- APRIORI
  - Easy to implement
  - Parallelizes moderately well but bottlenecks on extremely large data sets
  - Not very efficient in the number of dataset scans
- ECLAT
  - Parallelizes easily
  - Amenable to a good grid implementation
  - Fewer scans of the dataset
- Parallel FP-Growth
  - Designed explicitly for systems like Hadoop
  - Implemented in Mahout 0.2
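As a rough illustration of the APRIORI idea, the sketch below treats each spam report as a set of feature strings and finds every feature combination that appears in at least `minSupport` reports. It is a single-machine toy and not the grid implementation discussed in the talk: the class `AprioriSketch` is hypothetical, and it omits the usual subset-pruning optimization. It makes one pass over the data per itemset size, which is the scan cost noted above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy level-wise (Apriori-style) frequent itemset miner; illustrative only.
class AprioriSketch {
    // Returns each itemset found in at least minSupport transactions, with its count.
    static Map<Set<String>, Integer> frequentItemsets(List<Set<String>> transactions, int minSupport) {
        Map<Set<String>, Integer> frequent = new HashMap<>();

        // Level 1 candidates: every distinct single item.
        Set<Set<String>> candidates = new HashSet<>();
        for (Set<String> t : transactions) {
            for (String item : t) {
                Set<String> single = new TreeSet<>();
                single.add(item);
                candidates.add(single);
            }
        }

        while (!candidates.isEmpty()) {
            // One scan of the dataset per level to count candidate support.
            Map<Set<String>, Integer> counts = new HashMap<>();
            for (Set<String> t : transactions) {
                for (Set<String> c : candidates) {
                    if (t.containsAll(c)) {
                        counts.merge(c, 1, Integer::sum);
                    }
                }
            }

            // Keep only candidates that meet the support threshold.
            List<Set<String>> survivors = new ArrayList<>();
            for (Map.Entry<Set<String>, Integer> e : counts.entrySet()) {
                if (e.getValue() >= minSupport) {
                    frequent.put(e.getKey(), e.getValue());
                    survivors.add(e.getKey());
                }
            }

            // Join step: merge pairs of frequent k-itemsets into (k+1)-item candidates.
            candidates = new HashSet<>();
            for (int i = 0; i < survivors.size(); i++) {
                for (int j = i + 1; j < survivors.size(); j++) {
                    Set<String> union = new TreeSet<>(survivors.get(i));
                    union.addAll(survivors.get(j));
                    if (union.size() == survivors.get(i).size() + 1) {
                        candidates.add(union);
                    }
                }
            }
        }
        return frequent;
    }
}
```

With reports encoded as sets such as {FROMUSER:sales, ip_D:192.0.2.1, FROMDOM:a.example} (invented values), a combination like {FROMUSER:sales, ip_D:192.0.2.1} surfaces once enough reports share it, even when the sending domain changes from report to report.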

Frequent itemset: example dataset

Frequent itemset mining
Slide courtesy: dortmund.de

Frequent itemset mining on one day's spam reports

9 2595 (IPTYPE:none,FROMUSER:sales,SUBJ:It's Important You Know,FROMDOM:dappercom.info,URL:dappercom.info,ip_D:66.206.14.77,)
9 2457 (IPTYPE:none,FROMUSER:sales,SUBJ:Save On Costly Repairs,FROMDOM:aftermoon.info,URL:aftermoon.info,ip_D:66.206.14.78,)
9 2447 (IPTYPE:none,FROMUSER:sales,SUBJ:Car-Dealers-Compete-On-New-Vehicles,FROMDOM:sherge.info,URL:sherge.info,ip_D:66.206.25.227,)
9 2432 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:zaninte.info,URL:zaninte.info,ip_D:66.206.25.227,)
9 2376 (IPTYPE:none,FROMUSER:health,SUBJ:Finally. Coverage for the whole family,FROMDOM:fiatchimera.com,URL:articulatedispirit.com,ip_D:216.218.201.149,)
9 2184 (IPTYPE:none,FROMUSER:health,SUBJ:Finally. Coverage for the whole family,FROMDOM:fiatchimera.com,URL:stratagemnepheligenous.com,ip_D:216.218.201.149,)
9 1990 (IPTYPE:none,FROMUSER:sales,SUBJ:Closeout 2008-2009-2010 New Cars,FROMDOM:sastlg.info,URL:sastlg.info,ip_D:66.206.25.227,)
9 1899 (IPTYPE:none,FROMUSER:sales,FROMDOM:brunhil.info,SUBJ:700-CreditScore-What-Is-Yours?,URL:brunhil.info,ip_D:66.206.25.227,)
9 1743 (IPTYPE:none,FROMUSER:sales,SUBJ:Now exercise can be fun,FROMDOM:accordpac.info,URL:accordpac.info,ip_D:66.206.14.78,)
9 1706 (IPTYPE:none,FROMUSER:sales,SUBJ:Closeout 2008-2009-2010 New Cars,FROMDOM:rionel.info,URL:rionel.info,ip_D:66.206.25.227,)
9 1693 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:astroom.info,URL:astroom.info,ip_D:66.206.25.227,)
9 1689 (IPTYPE:none,FROMUSER:sales,SUBJ:eBay: Work@Home w/Solid-Income-Strategies,FROMDOM:stamine.info,URL:stamine.info,ip_D:66.165.232.203,)

2432 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:zaninte.info,URL:zaninte.info,ip_D:66.206.25.227,)
2447 (IPTYPE:none,FROMUSER:sales,SUBJ:Car-Dealers-Compete-On-New-Vehicles,FROMDOM:sherge.info,URL:sherge.info,ip_D:66.206.25.227,)

Gaming detection in NOTSPAM feedback
- Delays classification of spamming IP addresses
- Throws off the classifiers if the feedback is not filtered well
- Model the problem as a bipartite graph
  - Well-known model for matching algorithms
  - Broadly applied in various fields like coding theory
  - A graph whose vertices form two disjoint sets U and V
  - Every edge connects a vertex in U to a vertex in V

Connected components - explained
Y1 = Yahoo user 1, Y2 = Yahoo user 2
IP1 = IP address of the host Y1 “voted” notspam from
[Diagram labels: y1, IP1, IP2; SQUARING; weight = 2]

Connected components for “gaming” detection
- Set of IPs/YIDs used exclusively for voting notspam
- Set of (likely new) spamming IPs which are “worth” voting for
[Diagram: Yahoo IDs y1, y2, y3 and IP addresses IP1, IP2, IP3, IP4, grouped as the set of “voted from” IPs, the set of Yahoo IDs voting notspam, and the set of “voted on” IPs]
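One way to find such clusters is to compute connected components over the graph of Yahoo IDs and IP addresses. The sketch below uses an in-memory union-find structure. That is a stand-in chosen for brevity and not the grid algorithm from the talk. The class `VoteComponents`, the `addVote` method, and the `Y:`/`IP:` node prefixes are hypothetical, and only "voted on" edges are modeled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative union-find over a bipartite graph of Yahoo IDs and IPs.
class VoteComponents {
    // Each node points to a parent; a root points to itself.
    private final Map<String, String> parent = new HashMap<>();

    // Find the representative of a node's component, compressing the path.
    private String find(String x) {
        parent.putIfAbsent(x, x);
        String root = x;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        while (!parent.get(x).equals(root)) {
            String next = parent.get(x);
            parent.put(x, root);
            x = next;
        }
        return root;
    }

    // Add an edge: this Yahoo ID voted notspam on mail from this IP.
    void addVote(String yahooId, String ip) {
        String a = find("Y:" + yahooId);
        String b = find("IP:" + ip);
        if (!a.equals(b)) {
            parent.put(a, b);
        }
    }

    // Group every node seen so far by its component representative.
    List<Set<String>> components() {
        Map<String, Set<String>> byRoot = new TreeMap<>();
        for (String node : new ArrayList<>(parent.keySet())) {
            byRoot.computeIfAbsent(find(node), k -> new TreeSet<>()).add(node);
        }
        return new ArrayList<>(byRoot.values());
    }
}
```

If y1 and y2 both vote for IP1, and y2 also votes for IP2, all four nodes land in one component. A small, tightly connected component of accounts that vote notspam only on a handful of IPs is the pattern worth a closer look.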

Connected components - results
[Figure: connected components for IPs that notspam was voted from]

Connected components - results
[Figure: connected components for IPs that notspam was voted on]

Conclusions
- We have had success leveraging parallel, stateful algorithms on grid systems to keep pace with polymorphic spam that evades traditional analysis and algorithms
- Frequent Itemset Mining rapidly identifies cohesive campaigns in ISSPAM feedback
- Connected Components amplifies weak signals in gamed NOTSPAM feedback and helps separate signal from noise in the feedback
- Grid-system-based analysis platforms may be broadly applicable across the security domain

Apply
- Download Hadoop distribution: http://hadoop.apache.org
- Try out Pig on a standalone, single Linux box
- Identify source data to aggregate
- Start simple: IP patterns across web access logs
- Begin with offline aggregation; yesterday's attacks are still interesting
- Read Connected Components and Frequent Itemset Mining papers
- Stop looking for a single, invariant “tell”: far too costly
- Start thinking about co-occurrence of innocuous features

Resources for implementers
- Hadoop setup, documentation and resources: http://hadoop.apache.org/
- Pig documentation and resources: http://hadoop.apache.org/pig/
- Mahout documentation and resources: http://lucene.apache.org/mahout/
- Frequent itemset mining implementation repository: http://fimi.cs.helsinki.fi/src/
- Connected components description: [link not yet live]
- Ranger, Raghuraman, Penmetsa, Bradski, and Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In HPCA 2007.
