Finding Nobel prize window by
         PageRank




           FUJITA Yuji, Turnstone Research Inst., Nihon Univ.
Graph and Network
●   Graph theory
    –   Part of mathematics
●   Network science
    –   Inter-disciplinary study of
        ●   Graph theory
        ●   Physics
        ●   Social science
        ●   Informatics
        ●   particular topics from finance, biology, ...
Graph theory
    Dates back to the 1730s
●   Objectives
    –   Lower dimensional topological structure
    –   Combinatorial and topological studies
●   Topics
    –   Four colour theorem
    –   Invariants


                                                  From Wikipedia
Network science
●   Objectives
    –   Statistics and dynamics
    –   Social, Financial, Technological themes
●   Topics
    –   6 degrees of separation
    –   Scale-free networks
    –   PageRank
Bibliometrics
●   Quantitative evaluation of (academic) documents
●   Conventional approach: number of citations

●   Citation network
    –   Node: paper Edge: citation
    –   directed graph
●   A more faithful metric: PageRank
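
As a minimal sketch of the citation network described on this slide, the snippet below stores citations as directed edges and computes the conventional metric, the citation count (in-degree). The paper IDs and the edge list are made up for illustration.

```python
# Hypothetical toy data: each pair (citing, cited) is one directed edge.
citations = [
    ("B", "A"),
    ("C", "A"),
    ("C", "B"),
    ("D", "C"),
]

def build_graph(edges):
    """Return {paper: set of papers it cites} for a directed citation graph."""
    graph = {}
    for citing, cited in edges:
        graph.setdefault(citing, set()).add(cited)
        graph.setdefault(cited, set())  # papers that cite nothing are still nodes
    return graph

def citation_counts(edges):
    """Conventional metric: number of incoming citations per paper."""
    counts = {}
    for citing, cited in edges:
        counts.setdefault(citing, 0)
        counts[cited] = counts.get(cited, 0) + 1
    return counts

graph = build_graph(citations)
counts = citation_counts(citations)
```

Here `counts` treats every citation as equally valuable, which is exactly the assumption PageRank relaxes.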
Citation vs PageRank




The most-cited papers do not necessarily have the best PageRank score
Top articles
Field                          Title
Clinical Medicine              Effects of an angiotensin-converting-enzyme inhibitor, ramipril, on cardiovascular events in high-risk patients
Clinical Medicine              Vitamin E supplementation and cardiovascular events in high-risk patients
Immunology                     Cytotoxic T lymphocyte-associated antigen 4 plays an essential role in the function of CD25(+)CD4(+) regulatory cells that control intestinal inflammation
Immunology                     Immunologic self-tolerance maintained by CD25(+)CD4(+) regulatory T cells constitutively expressing cytotoxic T lymphocyte-associated antigen 4
Physics                        String theory and noncommutative geometry
Physics                        Large-N limit of non-commutative gauge theories
Molecular Biology & Genetics   Smac, a mitochondrial protein that promotes cytochrome c-dependent caspase activation by eliminating IAP inhibition
Molecular Biology & Genetics   Identification of DIABLO, a mammalian protein that promotes apoptosis by binding to and antagonizing IAP proteins
Molecular                      Systematic variation in gene expression patterns in human
Graph representation
●   Embedding: drawing on a surface (e.g. a sphere) or in space
●   Matrix
PageRank overview
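●   A link from an important node counts for more
      ↔ in contrast to using degree (the raw link count) as the score
●   But how can this be computed? A node's score depends on the scores
    of the nodes linking to it, so the process can get lost in a loop.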




                        Figure from “The PageRank Citation Ranking: Bringing Order to the Web”
Finite state Markov chain
●   Node: state; transition matrix: moving along
    an edge
    –   Row: the links a node receives (who cites it)
    –   Column: the links a node makes (what it cites)
●   The probability vector is updated by multiplying it by the
    transition matrix
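
The update on this slide can be sketched in a few lines of plain Python. The column-stochastic convention (entry `M[i][j]` is the probability of moving from node `j` to node `i`) and the three-node edge list are assumptions made for this example.

```python
def transition_matrix(n, edges):
    """Column-stochastic matrix M for a walk along (citing, cited) edges.

    M[i][j] is the probability of moving from node j to node i, so
    column j describes what j cites and row i describes who cites i.
    Assumes every node has at least one outgoing edge.
    """
    out_degree = [0] * n
    for citing, _cited in edges:
        out_degree[citing] += 1
    M = [[0.0] * n for _ in range(n)]
    for citing, cited in edges:
        M[cited][citing] = 1.0 / out_degree[citing]
    return M

def step(M, p):
    """One update of the probability vector: p_next = M p."""
    n = len(p)
    return [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]

# Hypothetical example: 0 links to 1 and 2, 1 links to 2, 2 links to 0.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
M = transition_matrix(3, edges)
p0 = [1.0, 0.0, 0.0]  # the walker starts at node 0
p1 = step(M, p0)      # half the probability moves to node 1, half to node 2
```

Because every column of `M` sums to 1, each step preserves the total probability.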
Steady state gives PageRank
●   Some Markov chains have a unique steady state
●   The steady state is given by an eigenvector
    –   A vector x such that Mx = ax
●   Eigenvectors are computed by standard linear algebra
    –   How to compute them is widely known
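
For a stochastic transition matrix, the steady state is the eigenvector with eigenvalue a = 1, normalised so that its entries sum to 1. A small worked example, with a two-state matrix chosen here purely for illustration:

```latex
\begin{align*}
M &= \begin{pmatrix} 0.9 & 0.5 \\ 0.1 & 0.5 \end{pmatrix},
  \qquad Mx = x, \quad x_1 + x_2 = 1 \\
0.9\,x_1 + 0.5\,x_2 &= x_1
  \;\Longrightarrow\; x_1 = 5\,x_2
  \;\Longrightarrow\;
  x = \begin{pmatrix} 5/6 \\ 1/6 \end{pmatrix}
\end{align*}
```

The second row gives the same condition (0.1 x_1 + 0.5 x_2 = x_2), so the solution is consistent.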
Why does PageRank work?
●   Not all citations are equally significant
●   Fewer citations can even be a sign of
    greater work
    –   Fundamental work is often not cited directly
●   Academic cascade
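
A small worked example may make the cascade concrete. It assumes the usual damped form of PageRank, in which every paper receives a base amount b from random jumps and passes a fraction d of its score to the papers it cites. The symbols b and d, the hypothetical papers A, B and E, and the simplification that each citing paper cites exactly one paper are all introduced here for illustration.

```latex
\begin{align*}
% E is cited directly by 2 otherwise uncited papers
s_E &= b + d\,(b + b) = b\,(1 + 2d) \\
% A is cited only by B, and B is cited by 3 otherwise uncited papers
s_B &= b + d\,(b + b + b) = b\,(1 + 3d) \\
s_A &= b + d\,s_B = b\,(1 + d + 3d^2) \\
% with d = 0.85
s_A &\approx 4.02\,b \;>\; s_E = 2.70\,b
\end{align*}
```

Paper A has one citation and paper E has two, yet A scores higher because its single citation comes from a well-cited paper.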
Meanings of citation
●   Brainchild
●   History
●   Respect
●   Identity


    something more than <a>tag</a>
To reach the top
●   Many great children
    –   Each child gives birth to many works


    = great scientific achievement
Limitations
●   Prof. Yamanaka's work (Cell, 2006) has a poor
    PageRank score, which is a shame, to say the
    least.
●   Spam issues exist, but are less serious than with a naive citation
    count
To practice
●   Get citation data
    –   Commercial product or web scraping
●   Build the transition matrix
    –   Random surfer model
●   Iterate the matrix-vector product
    –   Sparse matrix operation
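
The three steps above can be sketched as a plain-Python power iteration. This is a minimal illustration rather than the code used for the talk: the damping value 0.85 is the commonly quoted default, the uniform treatment of papers that cite nothing is one common convention, and the three-paper edge list is hypothetical.

```python
def pagerank(n, edges, damping=0.85, tol=1e-12, max_iter=1000):
    """Power iteration under the random surfer model.

    edges: (citing, cited) pairs over nodes 0..n-1.
    damping: probability of following a citation; otherwise the
    surfer jumps to a uniformly random node.
    """
    out_links = [[] for _ in range(n)]
    for citing, cited in edges:
        out_links[citing].append(cited)

    rank = [1.0 / n] * n
    for _ in range(max_iter):
        # Score held by papers that cite nothing is spread uniformly.
        dangling = sum(rank[j] for j in range(n) if not out_links[j])
        new_rank = [(1.0 - damping) / n + damping * dangling / n] * n
        for j in range(n):
            for i in out_links[j]:
                new_rank[i] += damping * rank[j] / len(out_links[j])
        change = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if change < tol:
            break
    return rank

# Hypothetical data: paper 1 cites paper 0; paper 2 cites papers 0 and 1.
ranks = pagerank(3, [(1, 0), (2, 0), (2, 1)])
```

The inner loop only visits existing edges, which is the sparse matrix-vector product in disguise.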
Data
●   Thomson Reuters, Elsevier, …
●   Scrape the web (arXiv, ...)
●   An ordinary SQL server can hold the data
●   NLP required
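
One possible way to hold citation data in SQL is sketched below using Python's built-in `sqlite3` module. The table and column names and the sample rows are hypothetical, and the talk does not say which database or schema was actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.executescript("""
    CREATE TABLE paper (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        year  INTEGER
    );
    CREATE TABLE citation (
        citing INTEGER NOT NULL REFERENCES paper(id),
        cited  INTEGER NOT NULL REFERENCES paper(id),
        PRIMARY KEY (citing, cited)
    );
""")
conn.executemany(
    "INSERT INTO paper VALUES (?, ?, ?)",
    [(1, "Paper A", 2000), (2, "Paper B", 2003), (3, "Paper C", 2006)],
)
conn.executemany("INSERT INTO citation VALUES (?, ?)", [(2, 1), (3, 1), (3, 2)])

# Edge list for the transition matrix.
edges = conn.execute(
    "SELECT citing, cited FROM citation ORDER BY citing, cited"
).fetchall()

# Conventional citation counts, most cited first.
top = conn.execute("""
    SELECT cited, COUNT(*) AS n FROM citation
    GROUP BY cited ORDER BY n DESC, cited
""").fetchall()
```

The `edges` list has the same (citing, cited) shape that a PageRank routine would consume.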
Transition matrix
●   Not every transition matrix has a unique
    steady-state eigenvector
●   Random surfer model: random jumps make the graph
    connected and let the walk escape from loops


[Figure: link graph + random jumps = connected graph]
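
The random surfer model is commonly written as a sum of two matrices. In the sketch below, M is the column-stochastic transition matrix of the link graph (with any all-zero column replaced by a uniform one), and the symbols d (damping factor), N (number of nodes) and the all-ones vector are notation introduced here, not taken from the slides.

```latex
G \;=\; d\,M \;+\; \frac{1-d}{N}\,\mathbf{1}\mathbf{1}^{\mathsf{T}},
\qquad 0 < d < 1
```

Every entry of G is strictly positive, so by the Perron-Frobenius theorem the steady state Gx = x is unique, and the walk cannot be trapped in a loop.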
Adaptation to papers
●   An old paper cannot cite a newer one
    –   Non-uniform random surfing
●   Adjust decay rate
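
One way to read "non-uniform random surfing" is to let the random jump favour recent papers, with an adjustable decay constant. The sketch below is an assumption about how that could look, not the method used in the talk: the exponential weighting, the parameter `tau`, the damping value 0.5 and the sample data are all illustrative choices.

```python
import math

def teleport_by_age(ages, tau):
    """Jump distribution favouring recent papers: weight exp(-age / tau)."""
    weights = [math.exp(-age / tau) for age in ages]
    total = sum(weights)
    return [w / total for w in weights]

def pagerank_with_teleport(n, edges, teleport, damping=0.5, iters=200):
    """Power iteration where random jumps follow `teleport`, not uniform."""
    out_links = [[] for _ in range(n)]
    for citing, cited in edges:
        out_links[citing].append(cited)
    rank = list(teleport)
    for _ in range(iters):
        # Score of papers that cite nothing re-enters via the jump distribution.
        dangling = sum(rank[j] for j in range(n) if not out_links[j])
        new_rank = [(1.0 - damping + damping * dangling) * t for t in teleport]
        for j in range(n):
            for i in out_links[j]:
                new_rank[i] += damping * rank[j] / len(out_links[j])
        rank = new_rank
    return rank

# Hypothetical data: ages in years; newer papers cite older ones.
ages = [10, 5, 1]
edges = [(1, 0), (2, 0), (2, 1)]
aged = pagerank_with_teleport(3, edges, teleport_by_age(ages, tau=2.0))
uniform = pagerank_with_teleport(3, edges, [1.0 / 3] * 3)
```

Comparing `aged` with `uniform` shows how the decay constant shifts score toward the newest paper, which has had no time to collect citations.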
Sparse matrix
●   Most of the elements are zero
●   Compressed form reduces space and time

●   libcsparse
    –   written by people at the University of Florida (UFL) and others,
        distributed under the LGPL
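
To illustrate why a compressed form saves space and time, here is a plain-Python sketch of compressed sparse row (CSR) storage and a matrix-vector product over it. It is not libcsparse's API or storage layout; the function names and the 3x3 example matrix are made up for this illustration.

```python
def to_csr(n_rows, entries):
    """Build CSR arrays from (row, col, value) triples; zeros are never stored."""
    values, col_index, row_start = [], [], [0] * (n_rows + 1)
    for row, col, value in sorted(entries):  # sorting groups entries by row
        values.append(value)
        col_index.append(col)
        row_start[row + 1] += 1
    for r in range(n_rows):  # prefix sums turn per-row counts into offsets
        row_start[r + 1] += row_start[r]
    return values, col_index, row_start

def csr_matvec(csr, x):
    """Compute y = A x while touching only the stored non-zero entries."""
    values, col_index, row_start = csr
    y = [0.0] * (len(row_start) - 1)
    for r in range(len(y)):
        for k in range(row_start[r], row_start[r + 1]):
            y[r] += values[k] * x[col_index[k]]
    return y

# Hypothetical 3x3 transition matrix with 4 non-zero entries out of 9.
entries = [(1, 0, 0.5), (2, 0, 0.5), (2, 1, 1.0), (0, 2, 1.0)]
csr = to_csr(3, entries)
y = csr_matvec(csr, [1.0, 0.0, 0.0])
```

The cost of `csr_matvec` grows with the number of stored entries rather than with the square of the number of papers, which is what makes iteration over a large citation network feasible.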
Reference
●   L. Page, S. Brin, R. Motwani, T. Winograd: The PageRank Citation Ranking: Bringing Order to the Web
●   Dylan Walker, Huafeng Xie, Koon-Kiu Yan, Sergei Maslov: Ranking Scientific Publications Using a Simple Model of Network Traffic
●   P. Chen, H. Xie, S. Maslov, S. Redner: Finding Scientific Gems with Google
●   Hajime BABA: Google の秘密 - PageRank 徹底解説 (The Secret of Google: A Thorough Explanation of PageRank)
Acknowledgment
●   Mr. Kazuhisa Takei for the Ruby FFI interface to
    libcsparse
●   Dr. Mari Jibu for citation data handling
●   Dr. Wataru Souma for suggestions and comments on
    network science
●   Dr. Yoshi Fujiwara for choosing this topic and for the
    invitation
●   Free software developers
About me
●   2010- Turnstone Research Inst.
●   2011- Nihon Univ. researcher
●   2009-2010 finance sector
●   2007-2009 Network analysis at NiCT
●   2001-2007 Venture firm CEO
●   1994-2002 Discrete math graduate student

●   Ski, climbing, bicycle, art

Editor's Notes

  1. Historically, graph theory came first, but network science itself grew out of social network analysis, which sociologists were practicing before computers existed. It was only after the Internet became widespread, with the growth in available data and processing power, that statistical methods became meaningful and statistical mechanics came to be applied.
  2. The Protein Data Bank; Effects of an angiotensin-converting-enzyme inhibitor, ramipril, on cardiovascular events in high-risk patients; The genome sequence of Drosophila melanogaster; String theory and noncommutative geometry; The complete atomic structure of the large ribosomal subunit at 2.4 angstrom resolution; Smac, a mitochondrial protein that promotes cytochrome c-dependent caspase activation by eliminating IAP inhibition; Identification of DIABLO, a mammalian protein that promotes apoptosis by binding to and antagonizing IAP proteins; The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000; Class switch recombination and hypermutation require activation-induced cytidine deaminase (AID), a potential RNA editing enzyme; Cytotoxic T lymphocyte-associated antigen 4 plays an essential role in the function of CD25(+)CD4(+) regulatory cells that control intestinal inflammation
  3. Any graph can be embedded in three-dimensional space. Besides geometric representations, such as graphs that can be embedded in a surface of genus 0 (the plane) and graphs that cannot, a graph can also be represented as a matrix. The matrix representation is the one we rely on here.
  4. The score of the node we are looking at cannot be determined until the scores of the nodes that precede it are known, but those cannot be determined until this node's score is known either. So what should we do?