SociaLite: High-level Query Language for Big Data Analysis
1. SociaLite: High-level Query Language for Big Data Analysis
Jiwon Seo, *Jongsoo Park, Jaeho Shin, Stephen Guo, and Monica S. Lam
STANFORD MOBISOCIAL RESEARCH GROUP
* INTEL PARALLEL RESEARCH LAB
2. Problems in existing platforms
Too difficult (low-level primitives)
Inefficient (not network bound)
Too many (sub) frameworks
Graph analysis
Data mining (or machine learning)
Relational query
Why Another Big Data Platform?
3. SociaLite is a high-level query language
Easy & efficient
Compiled to distributed code
1,000x Hadoop
Hadoop compatible
Python integration
Good for
Graph analysis
Data mining
Relational queries
Introducing SociaLite
17. Built-in aggregate functions
min, max, sum, avg, argmin
User-defined functions
in Java or Python
Aggregation
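To make the idea of an aggregate function concrete, here is a minimal plain-Python sketch of what a grouped aggregate computes: rows are grouped by key and each group is reduced to one value. The helper names `aggregate` and `avg` and the sample rows are hypothetical illustrations, not part of SociaLite.

```python
from collections import defaultdict

def aggregate(rows, agg):
    """Group (key, value) rows by key and reduce each group with agg."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: agg(values) for key, values in groups.items()}

def avg(values):
    # A user-defined aggregate, analogous in spirit to a custom function.
    return sum(values) / len(values)

# Hypothetical sample data.
rows = [("a", 3.0), ("a", 1.0), ("b", 5.0)]
mins = aggregate(rows, min)   # built-in style aggregate
avgs = aggregate(rows, avg)   # user-defined style aggregate
```

Any function that maps a list of values to a single value can be passed as `agg`, which is roughly the role a user-defined aggregate plays.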
18. Head table also appears in rule body
Foo(a,c) :- Foo(a,b), Bar(b,c).
Semantics
– rule executed repeatedly until no changes to Foo
Recursive Rules
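The "executed repeatedly until no changes" semantics can be sketched in plain Python as a naive fixpoint loop. This is only an illustration of the semantics for the rule `Foo(a,c) :- Foo(a,b), Bar(b,c).`; the function name `fixpoint` and the sample tuples are hypothetical, and a real engine would likely evaluate the rule far more efficiently.

```python
def fixpoint(foo, bar):
    """Apply Foo(a,c) :- Foo(a,b), Bar(b,c). until Foo stops changing."""
    foo = set(foo)
    while True:
        # One application of the rule: join Foo and Bar on the shared variable b.
        derived = {(a, c) for (a, b) in foo for (b2, c) in bar if b == b2}
        new = derived - foo
        if not new:
            return foo        # no changes to Foo: fixpoint reached
        foo |= new

# Hypothetical sample tables.
result = fixpoint({(1, 2)}, {(2, 3), (3, 4)})
```

Here the first pass derives `(1, 3)`, the second derives `(1, 4)`, and the third derives nothing new, so the loop stops.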
19. SociaLite: Datalog Extensions for Efficient Social Network Analysis, ICDE’13
Distributed SociaLite: A Datalog-Based Language for Large-Scale Graph Analysis, VLDB’14
Recursive Rules
`Edge(int s, (int t, double len)) indexby s.
Path(int n, double dist) indexby n.`
`Path(t, $min(d)) :- t=$SRC, d=0;
                 :- Path(n, d1), Edge(n, t, d2), d=d1+d2.`
Shortest Path algorithm in recursion + aggregation
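For readers less familiar with Datalog, the following plain-Python sketch mirrors the two rule bodies above: the base case sets the source distance to 0, and the recursive case keeps the minimum of `d1+d2` per target. The function name `shortest_paths`, the adjacency-dict layout, and the sample graph are assumptions made for illustration; the sketch also assumes non-negative edge lengths and says nothing about how SociaLite actually executes the query.

```python
def shortest_paths(edges, src):
    """edges maps a node to a list of (target, length) pairs."""
    path = {src: 0.0}               # Path(t, $min(d)) :- t=$SRC, d=0
    changed = {src}
    while changed:                  # repeat until Path stops changing
        next_changed = set()
        for n in changed:
            d1 = path[n]
            for t, d2 in edges.get(n, []):
                d = d1 + d2         # :- Path(n, d1), Edge(n, t, d2), d=d1+d2
                if d < path.get(t, float("inf")):
                    path[t] = d     # $min keeps only the smallest distance
                    next_changed.add(t)
        changed = next_changed
    return path

# Hypothetical sample graph.
edges = {0: [(1, 4.0), (2, 1.0)], 2: [(1, 2.0)], 1: [(3, 5.0)]}
dist = shortest_paths(edges, 0)
```

Only nodes whose distance improved are revisited in the next round, which is one simple way to avoid re-deriving unchanged tuples.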
20. SociaLite queries in Python code
`Queries are quoted in backticks`
Python SociaLite
Python functions and variables are accessible in SociaLite queries
SociaLite tables are readable from Python
Python Integration (Jython)
22. Python Integration
print "This is Python code!"
# now we use SociaLite queries below
`Foo[int i](String s).
Foo[i](s) :- i=42, s="the answer".`
23. Python Integration
print "This is Python code!"
# now we use SociaLite queries below
`Foo[int i](String s).
Foo[i](s) :- i=42, s="the answer".`
v="Python variable"
`Foo[i](s) :- i=43, s=$v.`
24. Python Integration
print "This is Python code!"
# now we use SociaLite queries below
`Foo[int i](String s).
Foo[i](s) :- i=42, s="the answer".`
v="Python variable"
`Foo[i](s) :- i=43, s=$v.`
@returns(str)
def func(): return "Python func"
`Foo[i](s) :- i=44, s=$func().`
25. Python Integration
print "This is Python code!"
# now we use SociaLite queries below
`Foo[int i](String s).
Foo[i](s) :- i=42, s="the answer".`
v="Python variable"
`Foo[i](s) :- i=43, s=$v.`
@returns(str)
def func(): return "Python func"
`Foo[i](s) :- i=44, s=$func().`
for i, s in `Foo[i](s)`:
    print i, s
36. SociaLite is
Distributed query language
Easy and efficient
Integration with Python
Algorithms in SociaLite (graph, data mining)
Competitive performance
Summary
44. Table column can be
Bloom filter
Sketches
Approximation
45. Bloom Filter
Probabilistic set data structure
Elements represented as bits
Cannot enumerate elements
Quickly (approximately) computes set membership
Can have false positives, but not false negatives
Approximation
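The properties listed above can be seen in a small plain-Python Bloom filter. This is a minimal sketch, not SociaLite's implementation: the class name `BloomFilter`, the default sizes, and the use of seeded SHA-256 hashes are all assumptions chosen for clarity.

```python
import hashlib

class BloomFilter:
    """Probabilistic set: elements are represented only as bits."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True if every bit is set: may be a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
for name in ["alice", "bob"]:
    bf.add(name)
```

After this, `"alice" in bf` is always true. A name that was never added usually tests false, but it can test true when its bits happen to collide with bits set by other elements. The filter stores no elements, so there is no way to list its contents.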
46. Analysis example
Social Network (friendship)
Each person’s friends-of-friends
Count the number of people in startups
Call it a Startup Score
Approximation
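As a reference point for the analysis above, here is an exact (non-approximate) plain-Python sketch of the Startup Score. The function name `startup_scores`, the dict-of-sets layout, and the sample network are hypothetical. Excluding the person themselves from their own friends-of-friends set is an assumption, since the slide does not specify it.

```python
def startup_scores(friends, in_startup):
    """friends maps a person to a set of friends; in_startup is a set of people."""
    scores = {}
    for person, direct in friends.items():
        fof = set()
        for f in direct:
            fof |= friends.get(f, set())   # collect each friend's friends
        fof.discard(person)                # assumption: do not count oneself
        scores[person] = len(fof & in_startup)
    return scores

# Hypothetical sample friendship network.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "eve"},
    "dan": {"bob"},
    "eve": {"cat"},
}
scores = startup_scores(friends, in_startup={"dan", "eve", "bob"})
```

The exact version keeps a full friends-of-friends set per person, which is the memory cost that motivates approximate structures on large graphs.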
A