Case study of Rujhaan.com (A social news app)

Case Study of Rujhaan.com
November 2014 Meetup
Rahul Jain
@rahuldausa

About Me…
• Big-data/Search Consultant based out of Hyderabad, India
• Provide Consulting services and solutions for Solr, Elasticsearch and other Big
data solutions (Apache Hadoop and Spark)
• Organizer of two Meetup groups in Hyderabad
• Hyderabad Apache Solr/Lucene
• Big Data Hyderabad

What does it do?
Rujhaan, which means "#interest", is a news app that
aggregates trending #news and #trends, together with the #buzz
around them, from social media.
It also works as a content discovery tool where a user can see
information based on their interests (under development).

What I am going to talk about
• Introduction
• Software Stack
• Crawler
• Apache Solr
• MongoDB
• Redis
• Machine Learning stack
• Classification
• Clustering
• NER
• POS Tagging

How does it look?
http://www.rujhaan.com

Trends : Arpita Khan
http://www.rujhaan.com/topic/Arpita-Khan.html

Trends : Phil Hughes
http://www.rujhaan.com/topic/Phillip-Hughes.html

Major challenge:
Response time of 500ms is Critical

High level Flow: Processing
[Flow diagram with numbered steps. Components shown: Internet, Fetch, Managed Cache, Parse, HTML Cleaner, Junk/Spam Cleaner (Text), Language Detection, Topics Extraction, Classification/Clustering, Scoring, Summary (Most Meaningful text of Story), Social Media, MongoDB, Apache Solr]

High level Flow: View
[Flow diagram with numbered steps 1-5. Components shown: Internet, HAProxy, Nginx, Tomcat (App), Redis, Managed Cache, MongoDB, Apache Solr]

Current Traffic Stats
Traffic:
• 16k users/month
• ~38k pageviews/month
• 200k requests/day by 24+ bots
• Traffic growing by 60-70%/month
• Alexa rank : ~211000

Application Stack
• Crawler
• Apache Solr
• MongoDB
• Redis

Crawler
• A web crawler (also known as a web spider or ant) is a program that browses the
World Wide Web in a methodical, automated manner.
• Web crawlers are mainly used to create a copy of all the visited pages for later
processing by a search engine, which indexes the downloaded pages to provide fast
searches.
http://www.codeproject.com/Articles/13486/A-Simple-Crawler-Using-C-Sockets

How does it work?
http://wiki.eclipse.org/SMILA/Documentation/Importing/Crawler/Web
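The crawl loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the crawler used by Rujhaan: the names `LinkExtractor`, `crawl` and `FAKE_WEB` are made up for this example, and the `fetch` function is passed in so that a small in-memory "site" can stand in for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl starting at `seed`.

    `fetch(url)` returns the page HTML, or None if the page is unavailable.
    Returns a dict {url: html} of every page visited.
    """
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # URLs already queued, to avoid loops
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html      # keep a copy for later processing/indexing
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# Hypothetical three-page site used instead of real HTTP requests.
FAKE_WEB = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.test/a.html": '<a href="/b.html">B</a>',
    "http://example.test/b.html": '<a href="/">home</a>',
}

pages = crawl("http://example.test/", FAKE_WEB.get)
print(sorted(pages))
```

A production crawler would also need politeness delays, robots.txt handling and error handling, which are left out here.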

Search@ApacheSolr
• Enterprise search platform built on Apache Lucene
• Open source
• Highly reliable, scalable, fault tolerant
• Supports distributed indexing (SolrCloud),
replication, and load-balanced querying
• http://lucene.apache.org/solr

High level overview
Source: http://www.slideshare.net/erikhatcher/solr-search-at-the-speed-of-light

Apache Solr - Features
• full-text search
• faceted search (similar to a GROUP BY clause in an RDBMS)
• scalability
– caching
– replication
– distributed search
• near real-time indexing
• geospatial search
• and many more : highlighting, database integration, rich document
(e.g., Word, PDF) handling
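To make the faceting feature concrete, here is a small sketch that builds a Solr select URL with faceting enabled. The parameters `q`, `rows`, `wt`, `facet` and `facet.field` are standard Solr query parameters; the host, the core name `news`, and the assumption that `title` and `category_label` are indexed fields are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; adjust to the actual deployment.
SOLR_SELECT = "http://localhost:8983/solr/news/select"


def facet_query_url(text, facet_field, rows=10):
    """Build a select URL that returns matches plus per-value counts."""
    params = {
        "q": text,                   # the full-text query
        "rows": rows,                # number of documents to return
        "wt": "json",                # response format
        "facet": "true",             # switch faceting on
        "facet.field": facet_field,  # count matches per value of this field
    }
    return SOLR_SELECT + "?" + urlencode(params)


url = facet_query_url("title:yelp", "category_label")
print(url)
```

The facet counts in the response play the role of `SELECT category_label, COUNT(*) ... GROUP BY category_label` in a relational database, restricted to the documents that match the query.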

Database: #MongoDB
• Document-oriented NoSQL database
• Dynamic schema
• JSON based
• Fast read and write
• Well suited to non-relational data
Stats:
• 2 million tweets
• 70k news articles
• ~25GB raw HTML (unstructured data)
• ~16GB structured data

Why NoSQL
• Large Volume of Data
• Dynamic Schemas
• Auto-sharding
• Replication
• Horizontally Scalable
* Some of the above can be achieved with enterprise-class RDBMS software, but at a very high cost

Major NoSQL Categories
• Document databases
• pair each key with a complex data structure
known as a document.
• MongoDB
• Graph databases
• store information about networks, such as social
connections
• Neo4j
Contd.

Major NoSQL Categories
• Key-Value stores
• Every single item in the database is stored as an
attribute name (or "key"), together with its value
• Riak , Voldemort, Redis
• Wide-column stores
• store columns of data together, instead of rows
• Google’s Bigtable, Cassandra and HBase

Sample Record (JSON)
{
"_id" : ObjectId("53f087c69144ca452acadfb0"),
"id" : "7a622c50e95d4debb1376d4f6e2d0a47",
"title" : "Yelp Swings To Profitability In Strong Q2 With $88.8M In Revenue, EPS Of $0.04",
"summary_gs" : "Today after the bell Yelp reported its second-quarter financial performance, including revenue of $88.79 million, and a profit of $0.04 per share. The company had net income of $2.7 million in the period, up from a $878,000 loss in the year-ago quarter. Investors had expected Yelp to lose 3 cents per share on revenue of $86.32 million. The company’s revenue tally for its most recent quarter is up 61 percent on a year-over-year basis. The company also reported strong guidance for its third quarter, with revenues forecasted to land in the $98 to $99 million range. ",
"link" : "http://techcrunch.com/2014/07/30/yelp-swings-to-profitability-in-strong-q2-with-88-8m-in-revenue-eps-of-0-04/",
"category_label" : "business",
"image_url" : "http://tctechcrunch2011.files.wordpress.com/2014/04/yelp-earnings.jpg",
"score" : 38.0,
"boost" : 1.0,
"keywords" : ["news", "yelp", "revenue"]
}

Cache: #Redis
• Advanced In-Memory key-value store
• Insanely fast
• Response time on the order of 5-10ms
• Provides cache behavior (set, get) with
advanced data structures like hashes, lists,
sets, sorted sets, bitmaps etc.
• http://redis.io/
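The set/get cache behavior mentioned above is commonly used in a "cache-aside" pattern: look in the cache first, and only build the page on a miss. The sketch below illustrates that pattern under stated assumptions. `TinyCache` is an in-memory stand-in written for this example, not a Redis client, and the `topic:` key prefix, the 60-second expiry and the `slow_render` function are all hypothetical.

```python
import time


class TinyCache:
    """In-memory stand-in for a key-value cache with expiry (not a Redis client)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def set(self, key, value, ttl_seconds):
        # Store the value together with its expiry time.
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]   # expired entries behave like misses
            return None
        return value


def get_topic_page(cache, topic, render, ttl_seconds=60):
    """Cache-aside: return the cached page, or render and cache it on a miss."""
    key = "topic:" + topic        # hypothetical key naming scheme
    page = cache.get(key)
    if page is None:
        page = render(topic)      # slow path: hit the database / search index
        cache.set(key, page, ttl_seconds)
    return page


calls = []


def slow_render(topic):
    """Hypothetical expensive page build; records each call."""
    calls.append(topic)
    return "<h1>" + topic + "</h1>"


cache = TinyCache()
first = get_topic_page(cache, "Arpita-Khan", slow_render)
second = get_topic_page(cache, "Arpita-Khan", slow_render)
print(first == second, len(calls))
```

With a real Redis server the same flow would use the server's own get and set-with-expiry commands; only the second request being served without calling `slow_render` matters here.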

Machine Learning
• Classification
• Clustering
• NER (Named Entity Recognition)
• Summarization (Relevant text)
• Topics Extraction

Classification
• Classify a document into a predefined category.
– E.g. news can be classified into business, politics,
finance etc.
• Documents can be text or images.
• A popular choice is the Naive Bayes classifier.
• Steps:
– Step 1: Train the program (build a model) using a
training set labelled with categories, e.g. sports, cricket, news.
– The classifier computes, for each word, the
probability that it makes a document belong to each of
the considered categories.
– Step 2: Test the model against a test data set.
• http://en.wikipedia.org/wiki/Naive_Bayes_classifier
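The two steps above can be sketched with a tiny multinomial Naive Bayes classifier in plain Python. This is an illustration only, not the classifier used by Rujhaan: the function names, the four-document training set and the use of add-one (Laplace) smoothing are choices made for this example.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Step 1: count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)   # category -> word -> count
    doc_counts = Counter()               # category -> number of documents
    vocabulary = set()
    for text, category in labelled_docs:
        words = text.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-probability for `text`."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        # log P(category)
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # log P(word | category) with add-one (Laplace) smoothing
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical toy training set; real models need far more data.
TRAINING_SET = [
    ("shares profit revenue quarter", "business"),
    ("revenue investors market stock", "business"),
    ("election minister parliament vote", "politics"),
    ("vote party minister campaign", "politics"),
]

model = train(TRAINING_SET)
business_guess = classify("quarter revenue beat investors", model)
politics_guess = classify("minister wins vote", model)
print(business_guess, politics_guess)
```

Step 2 (testing) would run `classify` over a held-out labelled set and compare the predictions with the true categories.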

Clustering
• Clustering is the task of grouping a set of objects in
such a way that objects in the same group (called
a cluster) are more similar to each other than to those in other groups
• The groups are not predefined
• E.g. these keywords
– “man’s shoe”
– “women’s shoe”
– “women’s t-shirt”
– “man’s t-shirt”
– can be clustered into 2 categories: “shoe” and “t-shirt”, or
“man” and “women”
• Popular methods are K-means clustering and hierarchical
clustering

K-means Clustering
• partition n observations into k clusters in which each observation belongs
to the cluster with the nearest mean, serving as a prototype of the cluster.
• http://en.wikipedia.org/wiki/K-means_clustering
http://pypr.sourceforge.net/kmeans.html
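The definition above translates into a short loop: assign every point to its nearest mean, recompute the means, and repeat until nothing changes. The following sketch assumes 2-D points and, to stay deterministic, simply takes the first k points as the initial centroids; real implementations usually pick random or smarter starting points. All names and the sample data are made up for this example.

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100):
    """Partition `points` into k clusters; returns (centroids, clusters)."""
    centroids = list(points[:k])   # naive initialisation: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            mean(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break                  # converged: assignments no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical sample data: two visibly separate groups of points.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(POINTS, k=2)
print(centroids)
print(clusters)
```

For text, each document would first be turned into a numeric vector (for example word counts) before a loop like this can be applied.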

Summarization
• Finding the most relevant text related to story/article
• There are multiple possible approaches, with varying accuracy.
• Below is our approach:
[Flow diagram with numbered steps 1-7. Stages shown: Cleaned text; Cluster based on stop words; Score each cluster; Find low-value cluster; Take highest-score cluster; Sentence extractor; Some more scoring…; Summary text]
* A summary can also be content generated by the computer system itself, i.e. rewriting the story in its own sentences (out of scope)

POS (Part of Speech) Tagging
• The process of marking up a word in a text (corpus) as
corresponding to a particular part of speech, based on both its
definition and its context
• The context is its relationship with adjacent and related words in a
phrase, sentence, or paragraph.
• 9 parts of speech in English: noun, verb, article,
adjective, preposition, pronoun, adverb,
conjunction, and interjection.
• “This is a sample sentence” will be output as
• This/DT is/VBZ a/DT sample/NN sentence/NN
• We use Stanford MaxentTagger
• http://nlp.stanford.edu/software/tagger.shtml
Number  Tag    Description
1.      CC     Coordinating conjunction
2.      CD     Cardinal number
3.      DT     Determiner
7.      JJ     Adjective
8.      JJR    Adjective, comparative
9.      JJS    Adjective, superlative
10.     LS     List item marker
11.     MD     Modal
12.     NN     Noun, singular or mass
13.     NNS    Noun, plural
14.     NNP    Proper noun, singular
15.     NNPS   Proper noun, plural
16.     PDT    Predeterminer
17.     POS    Possessive ending
18.     PRP    Personal pronoun
19.     PRP$   Possessive pronoun
20.     RB     Adverb
21.     RBR    Adverb, comparative
22.     RBS    Adverb, superlative
23.     RP     Particle
24.     SYM    Symbol
25.     TO     to
26.     UH     Interjection
28.     VBD    Verb, past tense
32.     VBZ    Verb, 3rd person singular present
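The word/TAG output format shown above is easy to post-process. The sketch below does not call the Stanford tagger; it only parses a string that is already tagged, using the sample sentence from this slide. The helper names `parse_tagged` and `words_with_tag` are made up for this example.

```python
def parse_tagged(tagged_sentence):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in tagged_sentence.split():
        # Split on the last slash so words that contain '/' stay intact.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def words_with_tag(pairs, prefix):
    """Return the words whose tag starts with `prefix`, e.g. 'NN' for nouns."""
    return [word for word, tag in pairs if tag.startswith(prefix)]


pairs = parse_tagged("This/DT is/VBZ a/DT sample/NN sentence/NN")
nouns = words_with_tag(pairs, "NN")
print(pairs)
print(nouns)
```

Filtering by tag prefix like this is one simple way to pull candidate keywords (nouns and proper nouns) out of tagged text.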

NER
• Identifying named entities such as person names, locations and organizations in a text
• Needs a pre-trained model.

Machine Learning Stack
• Stanford NER & Tagger
• LingPipe
• OpenNLP
• Carrot2

We are Hiring!
rockstar@rujhaan.com
Want to make an impact on millions of
lives ?
Join Us

Thanks!
@rahuldausa on twitter and slideshare
http://www.linkedin.com/in/rahuldausa
Join us @ http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/
for Solr, Lucene, Elasticsearch, Machine Learning, IR
Join us @ http://www.meetup.com/Big-Data-Hyderabad/
for Hadoop, Spark, Cascading, Scala, NoSQL, Crawlers and all cutting-edge technologies.
