Zipfs Law & Zipfian Distribution in SEO - Pubcon Virtual Fall 2020 - Dawn Anderson

Zipf's Law is prevalent throughout many forms of data and that includes the internet at large and within sectors of the internet, websites and web pages plus linguistics. How does this impact SEO if at all?

  1. Does Zipfian Distribution Explain SEO? Presented by: Dawn Anderson #Pubcon @dawnieando
  2. Does Zipfian Distribution Explain SEO? @dawnieando
  3. We Know That SEO Is All About • Words • Links • Crawling • Indexing • Ranking • Importance • Popularity • Probability determination • Disambiguation
  4. There is a phenomenon ‘hiding in plain sight’ (blending in) which impacts all of these elements
  5. Zipf’s Law - The Principle of Least Effort is Everywhere
  6. We know words & linguistics have unusual patterns, e.g. “You shall know a word by the company it keeps” (Firth, 1957) (co-occurrence)
  7. Zipf’s Law is another one of those strange linguistic phenomena… but MUCH more than that
  8. Information Retrieval Lecture - University of Freiburg – Lecture 1 – Zipf’s Law
  9. Vsauce Video – The Zipf Mystery - You should watch it
  10. Mind blown
  11. So… Just What is Zipf’s Law?
  12. Zipf’s Law is about frequency distribution, rank law, & is often associated with linguistics
  13. Zipf’s Law is Prevalent When It Comes to ‘Popularity’ (Rank) Distributions
  14. Zipf’s Law is Empirical (Widely accepted). Not just a theory
  15. Popularised by George Kingsley Zipf • George Kingsley Zipf (1902-1950) • Linguist & philologist • Popularised ‘Zipf’s Law’ but did not claim to have discovered it • Observed patterns in word frequency distribution • Extended Zipf’s Law to data types other than words • Developed ‘Principle of Least Effort’ (Path of least resistance) • Image attribution: Freeport High School, Freeport, Illinois, Public domain, via Wikimedia Commons
  16. ‘Normal’ is definitely something which Zipf’s Law is NOT
  17. A ‘Standard Deviation Curve’ (Normal Distribution). This file is licensed under the Creative Commons Attribution 2.5 Generic license.
  18. In a ‘Normal’ Distribution • Very few observations with very low values • Quite a few observations with low values • Many observations with average values • Quite a few observations with high values • Very few observations with very high values
  19. AKA ‘A Bell Curve’
  20. A Zipfian Distribution is VERY Different
  21. A Zipfian Distribution is Very Skewed & Heavily Biased
  22. ‘Zipfian’ Distribution • Very few observations with very high values • The most frequent item (e.g. word) is twice as frequent as the second most frequent item • The third most frequent item is 1/3 the value of the first most frequent item… and so on • Very, very many observations with low values
  23. A Long Tail Distribution
  24. Of Course. As SEOs We Are Used To Seeing ‘Long Tail Distributions’ In Search Behaviour
  25. But It Goes Much Further Than Search Demand Curves
  26. In Linguistics - By Zipf’s Law – The Frequency of Any Word Is Inversely Proportional To Its Rank In The Frequency Table
  27. Zipfian Distribution • Layman’s terms – In any count of the frequency of words in a large body of content, the probability of occurrence of words or other items starts high and tapers off. A few occur very often while many others occur rarely.
  28. ‘The’ is the most used word in the English language (7% frequency) • ‘The’ is used twice as often as the next most used word in English (3.5%) • And three times as often as the third most used word • Etcetera, etcetera, etcetera, ad infinitum
  29. The Most Used Words In The English Language Follow a Zipfian Distribution
  30. Brown Corpus – 1 Million Word Text
  31. It starts off looking something like this [Bar chart of word occurrence, axis 0 to 80,000, for the words: The, Of, And, To, A, In, That, Is, Was, He, For, It, With, As, His, On]
  32. Till it’s this - Brown Corpus Words Ranked by Frequency Distribution [chart label: ‘The’]
  33. And it doesn’t just apply to English either
  34. It applies to ALL languages
  35. Even languages untranslated as yet
  36. Zipf’s Law & 30 Languages of Wikipedia (Image credit: Wikiwand)
  37. By Zipf’s Law • A few words will be used very frequently yet account for a large part of all text • Many words will be used infrequently and account for a small part of all text
  38. In most languages the most popular 100 words make up around 50% of all text
  39. “In the Brown Corpus, consisting of over one million words, half of the word volume consists of repeated uses of only 135 words.” (Fagan & Gencay, 2011)
  40. Historically, of course, many of these words were considered to add no value (stop words)
  41. ASIDE - Any value from most of these words is from ‘contextual glue’ in natural language ‘post BERT and friends’
  42. Zipf’s Law is Aligned with Pareto’s 80:20 Rule (The Law of The Vital Few)
  43. By Pareto - 20% of the words account for 80% of the text
  44. Pareto is EVERYWHERE too – Examples • 20% of drivers cause 80% of all traffic accidents. • 20% of a company’s products represent 80% of the sales. • 20% of employees are responsible for 80% of the results.
  45. 20% of pea pods produce 80% of the peas
  46. Even… 20% of a carpet receives 80% of the wear
  47. So Why Does This ‘Zipfyness’ Exist?
  48. Zipf’s Law Appears to Apply to... Lots and lots of types of data
  49. Zipf’s Law & The Population of Cities In Countries • Appears to hold true only when the cities are economically related (e.g. in the same country as each other)
  50. Zipf rank plot for 276 metropolitan areas in the United States, after results of the census in 2000. Source: factfinder.census.gov. The straight line has slope 1.11.
  51. This biased spread holds true for income distribution!
  52. On the Internet
  53. So… There are these same predictable patterns across many aspects of the web, within ccTLDs & even website & webpage features themselves
  54. Pretty much every aspect follows a Power Law
  55. And Zipf’s Law is a Power Law • “A power law is a functional relationship between two quantities, where a relative change in one quantity results in a proportional relative change in the other quantity, independent of the initial size of those quantities: one quantity varies as a power of another.” (Source: Wikipedia)
  56. These Laws Don’t Change as ‘Scale Free’ Networks Such as The Web Grow or Age Either. This file is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.
  57. A Small Example in All The Words of a Blog Site
  58. Etymology Nerd’s Study • 490-page blog • Top 40 most frequently occurring words ordered by rank • Followed a Zipfian distribution
  59. But It Goes Well Beyond Just The ‘Words’ on The Internet
  60. Zipfian Distribution Examples On The Internet • Number of inlinks to pages • Number of outlinks from pages • Size of websites • No. of pages in websites • Number of words on a page • Number of videos, images & audio files
  61. In word distribution on a ccTLD • In internal link distribution within a ccTLD • In external link distribution within a ccTLD • In the popularity of sites • In the popularity of queries • In the popularity of web pages
  62. Zipf’s Law and The Internet • Whilst the data is older, Zipf’s Law still prevails • Adamic, L.A. and Huberman, B.A., 2002. Zipf’s law and the Internet. Glottometrics, 3(1), pp.143-150.
  63. In… EVERYTHING
  64. I asked an eminent search engineer to be sure this was correct. He replied yes
  65. Some Baeza-Yates Papers on Web Characteristics • Characterization of National Web Domains (2007) • Link Analysis in National Web Domains (2005) • Characteristics of the Web of Spain (2005) • Bias on The Web (2018)
  66. Some Caveats
  67. It is true. All aspects of the web show this power law but it is less precise on webpages at the very far ends of the distribution (Baeza-Yates, 2005, 2007)
  68. Shame Law in web pages (Minimal Shame), Baeza-Yates, 2007 • Where authors or organisations do some minimal work because they don’t want to be ashamed of their work (minimal effort). Makes Zipf’s Law less precise on the small pages or websites.
  69. What About At The Long, Long, Long Tail? • Ricardo explained it’s because there are too many unique values for it to be exact
  70. ccTLD Power Laws Versus a ‘Whole Internet’ Power Law • Each ccTLD will have a different power law (different exponent) • Although in general the sum of ccTLD power laws is not a power law, in many cases it is because you have a dominant one. • For example for the whole Web, the USA will dominate. Experimental data shows this. • Many thanks to Ricardo Baeza-Yates for providing this response when questioned
  71. Given a known Power Law Distribution in EVERYTHING on the web, how might this be reflected in search engine systems?
  72. Given That By Zipf’s & Pareto’s Law • A few sites will be visited very frequently and account for a large part of search traffic • Many sites will be visited infrequently and account for a small part of overall search traffic
  73. They Know ALL About Zipf’s Law • “No paper on statistics of web pages is complete without a graph showing a power-law distribution.” (Fetterly, 2005)
  74. Zipf’s Law Helps with Search Systems Built for Scale
  75. Since EVERYTHING Is ‘Naturally’ Much More Predictable
  76. Priorities are VERY obvious via Zipf’s Law
  77. Given That By Zipf’s & Pareto’s Law • A few sites will be visited very frequently and account for a large part of search traffic • Many sites will be visited infrequently and account for a small part of search traffic
  78. Some ‘things’ are simply NOT as important as others • Cities, towns & villages (Population) • Many low quality content URLs • Sites: Think CNN vs Mom & Pop Blog • Few High vs Many, Many Low or No PageRank URLs • Head terms versus long tail search terms
  79. Much Literature Confirms Zipf’s Law is Considered in: • Crawling • Indexing • Caching • Ranking • Quality Determination
  80. Zipf’s Law & Crawl Frequency Scheduling – AKA ‘Crawl Budget’? • Crawl budget • John Mueller • Gary Illyes • Link Juice • Meme with cats • Duplicate content
  81. Crawl Budget is absolutely tied to Zipfian Skewed ‘Demand’
  82. ‘Most’ Sites Will NOT Have High Crawl Budget… EVER https://support.google.com/webmasters/answer/9689511?hl=en
  83. The Rate of Change on Sites Will Likely Follow a Power Law Too (Predictable)
  84. As Will The Rate of Content Creation on Sites (Predictable)
  85. ‘Quality’ Issues Can Arise as a Result of Too Many Automatically Generated Pages Defying Zipf’s Law (The rate of creation is obviously not by human hand)
  86. Zipfian Quality Threshold Meeting will ‘probably’ shift for the site. Quality impacted; crawling will ‘probably’ be impacted
  87. You Threw a ‘Long Tail’ of Low, Low Importance URLs Into The Mix
  88. You Just Bought Some Places At The Back of The ‘Crawl Schedule’ (Or Didn’t Make The Grade At All)
  89. ‘Crawl Budget’ Issues are Often OVERALL ‘Site Quality’ Issues
  90. Since You’re Being Judged on The ‘Whole’ Cake
  91. Aside from ‘Demand’ & Overall Quality
  92. Crawl ‘budget’ is mostly about ‘Importance’ (which follows Zipf’s Law) (aside from host load capabilities) • Demand • PageRank (or equivalent) • Inclusion in sitemap • Internal links • External links • Canonical tag • Redirections • Quality parents
  93. All of These Things Likely Are Impacted by Zipf’s Law
  94. The Zipfian Distribution of Many Things
  95. What About Zipfyness in Indexing Systems?
  96. One Index. Multiple Tiers Within It (Probably) • “Each tier is the next level of document popularity and usually popularity is similar to a Zipf distribution in its central part” • (This is Ricardo’s answer to me when I asked about this last week)
  97. What About Web Caching?
  98. Given That By Zipf’s & Pareto’s Law • A few sites will be visited very frequently and account for a large part of search traffic • Many sites will be visited infrequently and account for a small part of search traffic
  99. Search Engine Web Caching & Zipfian Distribution • Web caching policies make use of Zipf’s Law (many papers confirm) • A small number of pages will be very popular (stored in near computer memory) • A lot of pages will be rarely called (stored on disk) and probably in a low level tier
  100. Web Caching Uses Zipf’s Law Too • “It is important to note the effectiveness of caching relies heavily on the existence of Zipf’s law” (Adamic & Huberman, 2002) • Adamic, L.A. and Huberman, B.A., 2002. Zipf’s law and the Internet. Glottometrics, 3(1), pp.143-150.
  101. What About Zipfian Distribution in Ranking?
  102. Yep. Web Ranking is Zipfy Too
  103. By Zipf’s Law in Ranking, Position 2 is ‘probably’ judged ‘half’ as relevant as Position 1
  104. Despite Some Debate PageRank is Largely Thought To Follow a Zipfian Type Power Law • “We suggest that power-law distributions of PageRank in Web graphs have been observed because the typical damping factor used in practice is between 0.85 and 0.90.” (Becchetti & Castillo, 2006)
  105. On ‘Long Tail’ SERPs that half is ‘probably’ very small though
  106. ‘Could be’ Why ‘Same-Site Same-Intent’ URLs Appear For The Same Query - THEORY ALERT
  107. Regardless of Theory Search Engines Absolutely Use Prioritised ‘Thinning Out’ Systems For Rankings
  108. Top-K Ranking Systems (2 (or multi) Stage Shortlisting)
  109. Stage 1 (Full Ranking): Relevance inclusion (Is the page relevant at all?) • Costs less • Using ‘scale systems’ • Gather a Top-K bunch of URLs for re-ranking. Stage 2 (Re-Ranking): Precision refinement amongst Top-K (top x number of results from stage 1) • Utilising more computationally expensive machine learning resources • Likely judging Top-K on further features (added value)
  110. First Stage Ranking Gets Rid of MOST of the Candidates For The Top 10 / 20
  111. So why does all this even matter?
  112. And what can we do about it?
  113. Zipf’s Law Is Aligned with An Exceptionally Strong bias of ‘Importance Identification’ Beyond The ‘Norm’
  114. You Are Being Judged On The ‘Whole’ Cake
  115. ‘Overall’ Matters
  116. The ‘Whole Pie’ ‘Probably’ Includes EVERY URL You Ever Created
  117. Whilst ‘Your Index Inventory’ ‘Importance’ Is Paramount
  118. Google Might Think You’re a Time-Waster Aside From What Is Indexed
  119. Particularly Problematic on ‘Long Tail’ Content
  120. Much of That ‘Discovered Not Indexed’ Content Is Often In The ‘Unimportant’ End of The Zipfian Curve • A few URLs which can satisfy many users • URLs that can satisfy few users’ needs but there’s a lot of them (long, long tail) • The majority of URLs – Medium importance
  121. As A Site Grows You Can End Up With Messy ‘Entrails’
  122. If You Keep Throwing These Unimportant URLs into The Mix You Will NEVER Get Enough Googlebot Crawl to Reconsolidate
  123. As sites grow the challenge is around healthy ‘index-worthy’ ‘inventory’ (of varying content types) management
  124. Keeping A Strong ‘Importance’ Heart Within A Site
  125. At this stage SEO is as much about saying what is NOT important as it is about saying what IS important
  126. You Need To Think Carefully About How You Will Manage Content as It Grows, Ages & Expires
  127. Simply Deleting Content is Rarely The Answer
  128. It Can Be A Disaster
  129. If You Change Anything in Your Site You Change The Zipfian Distribution – EVERYTHING Matters
  130. ‘For every action there is an equal and opposite reaction’ is TRUE – Newton’s Third Law
  131. If You Prune Without Thought You Change the Zipfian Distribution in Your Site • In the words • In the internal links • In the out links from pages • In the ‘topical themes’ prevalent in your site • In the ‘needs met’ prevalence
  132. Instead of deletion try ‘importance’ suppression
  133. You Archive
  134. Archiving Can Retain Categorical Value Whilst Dampening ‘Importance’ • https://www.searchenginejournal.com/google-john-mueller-rank-important-pages/345192/
  135. What About Redirection?
  136. Strong ‘Entity’ Redirection is Powerful
  137. But Inconsistent Redirections Cause Confusion
  138. Be Careful You Do Not Keep Redirecting To Products Which Will Also Go Away
  139. Consider Redirection To ‘Entity-Aligned’ Specific Subcategories Which WILL NOT Go Away
  140. Build ‘Legacy Redirection Review & Consolidation’ Into Your ‘Business As Usual’ Activities
  141. Often You Will Find ‘Forked & Split’ Inconsistent Legacy Redirection Patterns
  142. So Continually Gather Everything Back Together
  143. Point Everything In The Right Direction
  144. Be Aware That Often On The Long Tail Google Will Let You Handle The Redirection On Your Side Without Updating The SERPs
  145. And Will Use Page Titles & Snippets From ‘Redirected-From’ URLs with ‘Redirected-To’ URLs
  146. When SHOULD You Delete Content?
  147. When You Stop Providing The Service or Product Category Altogether
  148. You Mostly Can’t Rank For Things Which Have No Presence In Your Site
  149. OR… Improve the template with minimum boilerplate & maximum added value overall
  150. A GREAT dynamic template built for scale can be super powerful on many levels. Particularly the long tail
  151. BUT… It MUST be done well (not spam) & designed to meet ALL the important specific intent needs
  152. An excellent product or subcategory template is a super ‘long-tail net’
  153. BEWARE OR DELIGHT? It’s easy to add (or lose) at scale ‘value-add’ on templates
  154. A small change dynamically to many pages can kill or cure
  155. Badly Designed Templated Dynamic Menus & Filters Can Be Disastrous
  156. Plan Programmatic Approaches With Care. Lots of Testing & Tentative Scaling
  157. If Things Go Wrong They Will Likely Go VERY Wrong - Yikes
  158. Employing Zipf’s Law Thinking in Quality Thresholds • Skewed proportions?? Reset the quality proportions • If you have less control over one section improve another • ‘Overall’ matters • Consider ‘everything that has gone before’ too (legacy cruft can be devastating)
  159. Pro-Tip: Don’t Just Wait For Googlebot in Log File Analysis When Trying to Find & Fix Dynamic Cruft
  160. Use The Crawling Patterns of Other Bots to Find Crufty URLs From the Past. Redirect Them Before Googlebot Arrives Next Time Round
  161. You Will Rarely Find Long Legacy Cruft From A Crawl of A Current Site or in GSC. They are NOT individually important enough to make the 1k sample
  162. Microsoft Index Explorer is a Gem For Legacy Fixing in a Well Organised Map of The Site Past & Present
  163. Become a Cruftbuster
  164. Employing Zipf’s Law Thinking in Quality Thresholds • Skewed proportions?? Improve the quality proportions • If you have less control over one section improve another
  165. And Don’t JUST Have Dynamic Content
  166. If All You Can Offer is Thin Product Pages How Are You Any More Important Than Others?
  167. Quality Supplements – Redeem your site with ‘high quality’ sections to supplement less-creative transactional pages
  168. You Are Building More Than An Ecommerce Site • You are building a valuable resource on a domain of knowledge. The by-product is you monetise with ecommerce
  169. You need to meet as many niche-relevant informational needs as possible
  170. Since You Are Being Judged On The ‘Whole’ ‘Needs-Meeting’ Cake
  171. But Be Careful You Do Not ‘Inadvertently’ Morph Into a Different Type of Site https://www.sistrix.com/blog/disciplined-how-dailymail-co-uk-got-placed-amongst-peers/
  172. Whole Pie Proportions Matter Too
  173. Build Strong Theme Clusters & ‘Little Bow Ties’ (Strongly Connected Components) (Hubs)
  174. Whole Pie Quality is Paramount
  175. Keep Google (& Users) Away From Poor Quality & Too Much ‘Low Importance’
  176. Since Some ‘things’ are simply NOT as important as others • Cities, towns & villages (Population) • Many low quality content URLs • Sites: Think CNN vs Mom & Pop Blog • Few High vs Many, Many Low or No PageRank URLs • Head terms versus long tail search terms
  177. Indicate ‘IMPORTANCE’ & Quality • Avoid linking internally to low importance pages • Avoid linking to products which are out of date or out of stock • Avoid linking to ‘far too long a tail’ too much
  178. Internally Linking to Zero Inventory is a Bad Sign – But Programmatic Solutions Help With Resource Limitations • If stock level ≤ 0, don’t link in relatedness
  179. “Stay out of the black and into the red, Nothing in this game for two in a bed.” (Jim Bowen, Bullseye)
  180. Zipf’s Law Means Even We Need Scaleable Solutions
  181. Employ Programmatic Internal Linking Approaches On Ecommerce Templates For Scale
  182. But NOT Spam – Plan Well – Map to Demand
  183. Your Important Pages Should Be Like ‘Brazil Nuts’
  184. But Flat Architectures Send a Signal That Everything is of Equal Importance
  185. That’s Simply NOT True
  186. Undertake Recency, Frequency & Monetary Value Analysis – The Results will be Zipfy
  187. Some ‘things’ are simply MUCH MORE important than others • In ‘local search’ human population • Seasonal product importance (Halloween does not matter in March) • Product importance to the business (Some offerings make a loss) • Audience importance to the business (Some CLTVs are loss-making) • Product importance to top audience • A select ‘few’ top tasks of that audience (top pains & gains)
  188. It’s Important to VERY MUCH Emphasise Importance
  189. Employing ‘Zipf’s Law Thinking’ in Internal Linking in Ecommerce • Important categories absolutely must enjoy emphasized internal linking
  190. Wrong Page Ranking – Probably ‘Skewed’ Importance Signals
  191. Or… Over-optimization (Google finds another page to rank)
  192. If You Insist on Flat Architectures & Few Categories – Then You Must Use Linking & Pagination Very Well
  193. Employing Zipf’s Law in Pagination • If you must use pagination utilize programmatic approaches to keep the most important products high in the paginated series, e.g.: • By (genuine) availability • By (genuine) popularity • By seasonality
  194. Automated Categorical & Subcategorical Clustering From Product Pages is Both Powerful & Scaleable
  195. Quality Sub-categorization Is Gold and Can Help with Avoiding Many Issues
  196. Too Many Products in Too Few Categories • Could well be one of the reasons why deep paginated results appear in SERPs – Google’s attempts to surface inventory in search
  197. Such as: Competing with Yourself in eCommerce • Products equally important • Competing with each other • Competing with their category • Old product floating around in SERPs – Entity redirect to highly relevant subcategory
  198. Quality Sub-categorization Is Specific Enough for Quality ‘Entity Redirection’
  199. And Specific Enough To Rank Against Product Pages
  200. Nudge Forward a Clear Winner – The Subcategory
  201. But Utilise ‘Overflow SEO’ • Don’t create subcategories just because you can • Nobody wants a subcategory or category with zero products • Grow and contract into subcategorization as your inventory dictates
  202. Don’t Try To Wear Shoes Bigger Than Your True ‘Value Add’ Zipfian Footprint
  203. And Finally…
  204. Zipf argued Zipf’s Law could be in part to do with ‘laziness’ (Zipf’s Law AKA ‘The Principle of Least Effort’)
  205. Zipf’s Principle of Least Effort alludes to: • Make it easy to find what I need • Make sure there is consistency so I can get quicker at finding what I need easily through repetition
  206. Consider ‘The Law of Least Effort’ in Copy
  207. Zipfian Approaches in Primary Navigations • ‘Top Task’ Identification • Limited ‘High Importance’ categories • Links to ‘View all x’ in menu structures
  208. Use popular words people use. Not your company jargon
  209. So Does Zipfian Distribution Explain SEO?
  210. Let’s Look At The Evidence… Zipf’s Law applies to: • Word frequency distribution • The internet at large • Link graphs (internal & external) • Crawl frequency (probably) • Web caching policies and likely index inclusion • Probably first stage ranking at least • ‘Importance’ (and unimportance) identification
  211. Thank You
  212. References
  213. • Adamic, L.A. and Huberman, B.A., 2002. Zipf’s law and the Internet. Glottometrics, 3(1), pp.143-150. • Baeza-Yates, R., Castillo Ocaranza, C. and López Martínez, V., 2005. Characteristics of the Web of Spain. • Baeza-Yates, R., Gionis, A., Junqueira, F., Murdock, V., Plachouras, V. and Silvestri, F., 2007, July. The impact of caching on search engines. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 183-190). • Baeza-Yates, R., Castillo, C. and Efthimiadis, E.N., 2007. Characterization of national web domains. ACM Transactions on Internet Technology (TOIT), 7(2), pp.9-es. • Baeza-Yates, R., Boldi, P. and Chierichetti, F., 2015, May. Essential web pages are easy to find. In Proceedings of the 24th International Conference on World Wide Web (pp. 97-107). • Baeza-Yates, R., 2018. Bias on the web. Communications of the ACM, 61(6), pp.54-61. • Becchetti, L. and Castillo, C., 2006, May. The distribution of PageRank follows a power-law only for particular values of the damping factor. In Proceedings of the 15th international conference on World Wide Web (pp. 941-942). • Zipf, G.K., 2016. Human behavior and the principle of least effort: An introduction to human ecology. Ravenio Books.
  214. Appendix
  215. Zipfian Distribution & Efficiency Scaling in SEO • There is no shame in using well thought through and intelligent programmatic approaches for scale in SEO • On long tail content (low search volume but many combined searches) it is a ‘realistic’ approach • Build ‘high quality’ scalable templates
  216. Subcategorisation & Zipf’s Law • Target ‘torso terms’ (e.g. black dresses, diamante shoes, lacy dresses) • Internally linking subcategory to subcategory can be super relevant • Subcategories provide an excellent way to ‘dampen’ importance and raise importance to ‘fewer’ categories (moving ‘things’ down to subcategory level in the architecture) • Often a more natural match to queries • Provides a more natural feature-rich ‘ontology’ • Builds out the semantic richness of a site in internal anchors
  217. Maximal Shame Law & Zipfian Distribution • (With regards to Zipfian distribution in web pages): “To be precise it is not true at the beginning (e.g., small pages) because what I call the shame law (you do some minimal work) and it is not true at the end of the long tail because it is very long (e.g., many unique values).” (Baeza-Yates, 2020)
  218. Maximal Shame Power Law (Baeza-Yates) • “One phenomenon that has appeared before in our own studies and now is completely clear is the smaller power law exponent at the beginning of several of the measures presented. In fact, this happens for file sizes up to 25Kb, pages per-site up to (15–30), pages per-domain up to 10 (except South Korea), number of outlinks in a page up to 10 to 40, and average number of internal links per-site up to 15 to 30, where a range is given to show the variability for different countries. We argue that this is due to another empirical power law that we call maximal shame which forces people to work a bit more than the minimum until they feel good about their work. Notice that this maximal shame can be for an individual or for a group (e.g., in the case of a Web site).” Baeza-Yates, R., Castillo, C. and Efthimiadis, E.N., 2007. Characterization of national web domains. ACM Transactions on Internet Technology (TOIT), 7(2), pp.9-es.
  219. Automated Internal Linking & Strong Templates • There is no shame in using well thought through and intelligent programmatic approaches for scale • Avoid linking internally to products which have a short shelf life unless you have programmatically set up the system to change based on availability • Internally linking to unavailable products is probably not a good quality signal
  220. Be Prepared To Give Some Things Up • Some ‘things’ WILL matter Zipfianly less than others
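
Worked example for slides 26 to 28 and slide 50: the rank-frequency rule (frequency inversely proportional to rank) and the straight-line rank plot can be illustrated with a short sketch. It computes the counts an idealised Zipfian distribution would predict and estimates the exponent of observed values with a least-squares fit in log-log space. The function names and sample numbers are hypothetical, not taken from the deck, and a simple log-log fit is only a rough check rather than a rigorous test for a power law.

```python
import math


def zipf_expected_counts(top_count: float, n_ranks: int) -> list[float]:
    """Counts predicted when frequency is inversely proportional to rank."""
    return [top_count / rank for rank in range(1, n_ranks + 1)]


def estimate_exponent(values: list[float]) -> float:
    """Fit log(value) against log(rank) by least squares; return the positive slope."""
    ranked = sorted(values, reverse=True)
    if len(ranked) < 2 or ranked[-1] <= 0:
        raise ValueError("need at least two positive values")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(value) for value in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


if __name__ == "__main__":
    # Hypothetical top count, chosen only to make the arithmetic easy to follow.
    ideal = zipf_expected_counts(top_count=60_000, n_ranks=50)
    print([round(count) for count in ideal[:4]])
    print(round(estimate_exponent(ideal), 2))
```

On an ideal series the estimated exponent comes out at about 1. Real data such as page inlinks or query popularity would be passed in as a plain list of counts, and the fit is expected to be weaker at both ends of the distribution, as the caveats on slides 67 to 69 suggest.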

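
Worked example for slides 108 to 110: the two-stage Top-K shortlisting idea can be sketched as a cheap first pass that discards most candidates, followed by a costlier re-ranking of the shortlist only. This is a toy illustration under stated assumptions, not a description of how any search engine implements ranking. Both scoring functions (`term_overlap`, `match_density`) and all parameter names are hypothetical stand-ins.

```python
import heapq
from typing import Callable

Scorer = Callable[[str, str], float]


def term_overlap(query: str, doc: str) -> float:
    """Cheap stage-1 score: number of distinct query terms found in the document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def match_density(query: str, doc: str) -> float:
    """Stand-in for a costlier stage-2 model: share of document words matching the query."""
    terms = set(query.lower().split())
    words = doc.lower().split()
    return sum(word in terms for word in words) / max(len(words), 1)


def two_stage_rank(
    query: str,
    docs: list[str],
    cheap_score: Scorer,
    expensive_score: Scorer,
    k: int = 100,
    final_n: int = 10,
) -> list[str]:
    # Stage 1: score everything cheaply, drop non-matching documents, keep the top k.
    scored = [(cheap_score(query, doc), doc) for doc in docs]
    relevant = [pair for pair in scored if pair[0] > 0]
    shortlist = heapq.nlargest(k, relevant, key=lambda pair: pair[0])
    # Stage 2: run the expensive scorer on the shortlist only.
    reranked = sorted(
        (doc for _, doc in shortlist),
        key=lambda doc: expensive_score(query, doc),
        reverse=True,
    )
    return reranked[:final_n]


if __name__ == "__main__":
    pages = ["red shoes", "red shoes for sale in many many sizes", "blue hat", "red hat"]
    print(two_stage_rank("red shoes", pages, term_overlap, match_density, k=2, final_n=2))
```

The design point is that the expensive scorer runs on at most `k` documents regardless of how large the candidate set is, which is why a skewed importance distribution makes this kind of shortlisting practical at scale.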
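
Worked example for slides 178 and 193: the advice to leave out-of-stock products out of related links, and to keep the most important products high in a paginated series, can be sketched as follows. The `Product` fields, the popularity metric and the function names are assumptions made for illustration; a real catalogue would supply its own availability, popularity and seasonality signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    name: str
    stock_level: int
    popularity: float  # hypothetical metric, e.g. recent sales or views


def related_links(candidates: list[Product], limit: int = 4) -> list[Product]:
    """Pick related products to link to, skipping anything with no stock."""
    in_stock = [product for product in candidates if product.stock_level > 0]
    return sorted(in_stock, key=lambda product: product.popularity, reverse=True)[:limit]


def paginate(products: list[Product], page_size: int) -> list[list[Product]]:
    """Order by availability first, then popularity, and split into pages."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    ordered = sorted(
        products,
        key=lambda product: (product.stock_level > 0, product.popularity),
        reverse=True,
    )
    return [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]


if __name__ == "__main__":
    catalogue = [
        Product("a", 0, 9.0),
        Product("b", 3, 5.0),
        Product("c", 1, 7.0),
        Product("d", 2, 1.0),
    ]
    print([product.name for product in related_links(catalogue, limit=2)])
    print([[product.name for product in page] for page in paginate(catalogue, 2)])
```

Sorting on availability before popularity means a popular but unavailable product drops to the last page rather than disappearing, which keeps the URL reachable while reducing the internal emphasis it receives.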