SlideShare ist ein Scribd-Unternehmen logo
1 von 23
1
Compiled Python UDFs for Impala
Uri Laserson
20 May 2014
Impala User-defined Functions (UDFs)
• Tuple => Scalar value
• Substring
• sin, cos, pow, …
• Machine-learning models
• Supports Hive UDFs (Java)
• Relatively unpleasurable
• Slower
• Impala (native) UDFs
• C++ interface designed for efficiency
• Similar to Postgres UDFs
• Runs any LLVM-compiled code
2
LLVM compiler infrastructure
3
LLVM: C++ example
4
bool StringEq(FunctionContext* context,
const StringVal& arg1,
const StringVal& arg2) {
if (arg1.is_null != arg2.is_null)
return false;
if (arg1.is_null)
return true;
if (arg1.len != arg2.len)
return false;
return (arg1.ptr == arg2.ptr) ||
memcmp(arg1.ptr, arg2.ptr, arg1.len) == 0;
}
LLVM: IR output
5
; ModuleID = '<stdin>'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.7.0"
%"class.impala_udf::FunctionContext" = type { %"class.impala::FunctionContextImpl"* }
%"class.impala::FunctionContextImpl" = type opaque
%"struct.impala_udf::StringVal" = type { %"struct.impala_udf::AnyVal", i32, i8* }
%"struct.impala_udf::AnyVal" = type { i8 }
; Function Attrs: nounwind readonly ssp uwtable
define zeroext i1 @_Z8StringEqPN10impala_udf15FunctionContextERKNS_9StringValES4_(%"class.impala_udf::FunctionContext"* nocapture %context, %"struct.impala_udf::StringVal"*
nocapture %arg1, %"struct.impala_udf::StringVal"* nocapture %arg2) #0 {
entry:
%is_null = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 0, i32 0
%0 = load i8* %is_null, align 1, !tbaa !0, !range !3
%is_null1 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 0, i32 0
%1 = load i8* %is_null1, align 1, !tbaa !0, !range !3
%cmp = icmp eq i8 %0, %1
br i1 %cmp, label %if.end, label %return
if.end: ; preds = %entry
%tobool = icmp eq i8 %0, 0
br i1 %tobool, label %if.end7, label %return
if.end7: ; preds = %if.end
%len = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 1
%2 = load i32* %len, align 4, !tbaa !4
%len8 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 1
%3 = load i32* %len8, align 4, !tbaa !4
%cmp9 = icmp eq i32 %2, %3
br i1 %cmp9, label %if.end11, label %return
if.end11: ; preds = %if.end7
%ptr = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 2
%4 = load i8** %ptr, align 8, !tbaa !5
%ptr12 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 2
%5 = load i8** %ptr12, align 8, !tbaa !5
%cmp13 = icmp eq i8* %4, %5
br i1 %cmp13, label %return, label %lor.rhs
lor.rhs: ; preds = %if.end11
%conv17 = sext i32 %2 to i64
%call = tail call i32 @memcmp(i8* %4, i8* %5, i64 %conv17)
%cmp18 = icmp eq i32 %call, 0
br label %return
LLVM: IR output
6
; ModuleID = '<stdin>'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.7.0"
%"class.impala_udf::FunctionContext" = type { %"class.impala::FunctionContextImpl"* }
%"class.impala::FunctionContextImpl" = type opaque
%"struct.impala_udf::StringVal" = type { %"struct.impala_udf::AnyVal", i32, i8* }
%"struct.impala_udf::AnyVal" = type { i8 }
; Function Attrs: nounwind readonly ssp uwtable
define zeroext i1 @_Z8StringEqPN10impala_udf15FunctionContextERKNS_9StringValES4_(%"class.impala_udf::FunctionContext"* nocapture %context, %"struct.impala_udf::StringVal"*
nocapture %arg1, %"struct.impala_udf::StringVal"* nocapture %arg2) #0 {
entry:
%is_null = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 0, i32 0
%0 = load i8* %is_null, align 1, !tbaa !0, !range !3
%is_null1 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 0, i32 0
%1 = load i8* %is_null1, align 1, !tbaa !0, !range !3
%cmp = icmp eq i8 %0, %1
br i1 %cmp, label %if.end, label %return
if.end: ; preds = %entry
%tobool = icmp eq i8 %0, 0
br i1 %tobool, label %if.end7, label %return
if.end7: ; preds = %if.end
%len = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 1
%2 = load i32* %len, align 4, !tbaa !4
%len8 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 1
%3 = load i32* %len8, align 4, !tbaa !4
%cmp9 = icmp eq i32 %2, %3
br i1 %cmp9, label %if.end11, label %return
if.end11: ; preds = %if.end7
%ptr = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 2
%4 = load i8** %ptr, align 8, !tbaa !5
%ptr12 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 2
%5 = load i8** %ptr12, align 8, !tbaa !5
%cmp13 = icmp eq i8* %4, %5
br i1 %cmp13, label %return, label %lor.rhs
lor.rhs: ; preds = %if.end11
%conv17 = sext i32 %2 to i64
%call = tail call i32 @memcmp(i8* %4, i8* %5, i64 %conv17)
%cmp18 = icmp eq i32 %call, 0
br label %return
Data type compatibility
7
struct AnyVal {
bool is_null;
};
struct StringVal : public AnyVal {
int len;
uint8_t* ptr;
};
%AnyVal = type { i8 }
%StringVal = type { %AnyVal, i32, i8* }
; or
%StringVal = type { { i8 }, i32, i8* }
C++LLVMIR
Register and execute the function
8
CREATE FUNCTION StringEq(STRING, STRING)
RETURNS BOOLEAN
LOCATION '/path/to/bitcode.ll’
SYMBOL=’StringEq’;
SELECT StringEq(a, b) FROM mytable;
Numba compiler
9
NumbaPython
Impyla: Python Library for Impala
• pip install impyla
• DB API v2.0 (PEP 249) compatible
• Prototype sklearn API for Impala ML
• Numba integration (described here)
• See blog post:
http://blog.cloudera.com/blog/2014/04/a-new-
python-client-for-impala/
10
LLVM: Python example
11
@udf(IntVal(FunctionContext, StringVal))
def hour_from_weird_date_format(context, date):
return int(split(date, '-')[1])
ship_udf(cursor, hour_from_weird_data_format,
'/path/to/store/udf.ll', 'my.impala.host')
cur.execute('SELECT hour_from_weird_data_format(date) ’ +
‘AS hour FROM mytable LIMIT 100’)
Model Scoring: BigML on Census Data
12
MLaaS
Model Scoring: BigML on Census Data
13
Example: 100 Node Decision Tree
14
def predict_income(impala_function_context, age, workclass, final_weight, education, education_num, marital_status, occupation, relationship,
race, sex, hours_per_week, native_country, income):
if (marital_status is None):
return '<=50K'
if (marital_status == 'Married-civ-spouse'):
if (education_num is None):
return '<=50K'
if (education_num > 12):
if (hours_per_week is None):
return '>50K'
if (hours_per_week > 31):
if (age is None):
return '>50K'
if (age > 28):
if (education_num > 13):
if (age > 58):
return '>50K'
if (age <= 58):
return '>50K'
if (education_num <= 13):
if (occupation is None):
return '>50K'
if (occupation == 'Exec-managerial'):
return '>50K'
if (occupation != 'Exec-managerial'):
return '>50K'
if (age <= 28):
if (age > 24):
if (occupation is None):
return '<=50K'
if (occupation == 'Tech-support'):
return '>50K'
if (occupation != 'Tech-support'):
return '<=50K'
if (age <= 24):
if (final_weight is None):
return '<=50K'
if (final_weight > 492053):
return '>50K'
if (final_weight <= 492053):
return '<=50K'
if (hours_per_week <= 31):
if (sex is None):
return '<=50K'
if (sex == 'Male'):
Batch Scoring with PySpark
15
# parse the text data
observations = sc.textFile('/path/to/census_data').map(parse_obs)
# perform batch scoring
predictions = observations.map(lambda tup: predict_income(*tup))
# trigger computation
distinct = predictions.distinct().collect()
Batch Scoring with Impala
16
# compile the scoring function
predict_income = udf(signature)(predict_income)
ship_udf(cursor, predict_income, ...)
# perform batch scoring
cursor.execute(‘SELECT DISTINCT predict_income(age, ... ) ‘ +
‘FROM census_text’)
distinct = cursor.fetchall()
Execution Time
17
execution_time =
per_job_overhead +
N * ( per_record_exec + memcmp_exec )
PySpark vs. Impala Performance
18
Tree size
(nodes)
Spark
execution
time (s)
Impala
execution
time (s)
Fold
differenc
e
Impala
compilati
on time
(s)
Bytecode
size
(bytes)
Percent
memcmp
nodes
0 160 9 17x 0 4
100 175 22 8x 1 2254 22%
500 178 27 7x 4 9803 35%
1000 184 32 6x 16 23495 34%
1500 188 35 5x 18 28301 34%
2000 196 37 5x 31 42442 33%
Execution Time
19
execution_time =
per_job_overhead +
N * ( per_record_exec + memcmp_exec )
Spark: 24 threads / node
[ ]
Impala: 1 thread / node
PySpark vs. Impala Performance
20
Tree size
(nodes)
Spark
execution
time (s)
Impala
execution
time (s)
Fold
differenc
e
Impala
compilati
on time
(s)
Bytecode
size
(bytes)
Percent
memcmp
nodes
0 160 9 17x 0 4
100 175 22 8x 1 2254 22%
500 178 27 7x 4 9803 35%
1000 184 32 6x 16 23495 34%
1500 188 35 5x 18 28301 34%
2000 196 37 5x 31 42442 33%
Current Status
• Support for all Impala UDF data types (e.g., IntVal,
StringVal, etc.)
• Support for casts to/from primitive types:
• Any operations on primitives should work on Impala types
• Support for NULL types as Python None
• Proof-of-principle support for Python string module
• len
• split
• Concatenation
• Call out to any extern C functions
• Proposed directions
• Array handling
• Numpy support
• What else?
21
UDFs with Impala + Numba
• Simplicity of Python interface/syntax
• Performance of compiled language like C++
• Developed at: https://github.com/cloudera/impyla
• Please try it and tell us what features would be useful
• Please contribute!
22
pip install impyla
23

Weitere ähnliche Inhalte

Was ist angesagt?

Python meetup: coroutines, event loops, and non-blocking I/O
Python meetup: coroutines, event loops, and non-blocking I/OPython meetup: coroutines, event loops, and non-blocking I/O
Python meetup: coroutines, event loops, and non-blocking I/OBuzzcapture
 
05 pig user defined functions (udfs)
05 pig user defined functions (udfs)05 pig user defined functions (udfs)
05 pig user defined functions (udfs)Subhas Kumar Ghosh
 
Nikita Popov "What’s new in PHP 8.0?"
Nikita Popov "What’s new in PHP 8.0?"Nikita Popov "What’s new in PHP 8.0?"
Nikita Popov "What’s new in PHP 8.0?"Fwdays
 
PHP Performance Trivia
PHP Performance TriviaPHP Performance Trivia
PHP Performance TriviaNikita Popov
 
NSC #2 - D1 01 - Rolf Rolles - Program synthesis in reverse engineering
NSC #2 - D1 01 - Rolf Rolles - Program synthesis in reverse engineeringNSC #2 - D1 01 - Rolf Rolles - Program synthesis in reverse engineering
NSC #2 - D1 01 - Rolf Rolles - Program synthesis in reverse engineeringNoSuchCon
 
Scalaz By Example (An IO Taster) -- PDXScala Meetup Jan 2014
Scalaz By Example (An IO Taster) -- PDXScala Meetup Jan 2014Scalaz By Example (An IO Taster) -- PDXScala Meetup Jan 2014
Scalaz By Example (An IO Taster) -- PDXScala Meetup Jan 2014Susan Potter
 
Letswift19-clean-architecture
Letswift19-clean-architectureLetswift19-clean-architecture
Letswift19-clean-architectureJung Kim
 
Introduction to Swift programming language.
Introduction to Swift programming language.Introduction to Swift programming language.
Introduction to Swift programming language.Icalia Labs
 
A deep dive into PEP-3156 and the new asyncio module
A deep dive into PEP-3156 and the new asyncio moduleA deep dive into PEP-3156 and the new asyncio module
A deep dive into PEP-3156 and the new asyncio moduleSaúl Ibarra Corretgé
 
PyCon lightning talk on my Toro module for Tornado
PyCon lightning talk on my Toro module for TornadoPyCon lightning talk on my Toro module for Tornado
PyCon lightning talk on my Toro module for Tornadoemptysquare
 
PHP Enums - PHPCon Japan 2021
PHP Enums - PHPCon Japan 2021PHP Enums - PHPCon Japan 2021
PHP Enums - PHPCon Japan 2021Ayesh Karunaratne
 
Evolutionary Testing for Crash Reproduction
Evolutionary Testing for Crash ReproductionEvolutionary Testing for Crash Reproduction
Evolutionary Testing for Crash ReproductionAnnibale Panichella
 
Let's build a parser!
Let's build a parser!Let's build a parser!
Let's build a parser!Boy Baukema
 
Building and Incredible Machine with Pipelines and Generators in PHP (IPC Ber...
Building and Incredible Machine with Pipelines and Generators in PHP (IPC Ber...Building and Incredible Machine with Pipelines and Generators in PHP (IPC Ber...
Building and Incredible Machine with Pipelines and Generators in PHP (IPC Ber...dantleech
 
About Those Python Async Concurrent Frameworks - Fantix @ OSTC 2014
About Those Python Async Concurrent Frameworks - Fantix @ OSTC 2014About Those Python Async Concurrent Frameworks - Fantix @ OSTC 2014
About Those Python Async Concurrent Frameworks - Fantix @ OSTC 2014Fantix King 王川
 
Metaprogramming and Reflection in Common Lisp
Metaprogramming and Reflection in Common LispMetaprogramming and Reflection in Common Lisp
Metaprogramming and Reflection in Common LispDamien Cassou
 
PHP applications/environments monitoring: APM & Pinba
PHP applications/environments monitoring: APM & PinbaPHP applications/environments monitoring: APM & Pinba
PHP applications/environments monitoring: APM & PinbaPatrick Allaert
 
Adding 1.21 Gigawatts to Applications with RabbitMQ (PHP Oxford June Meetup 2...
Adding 1.21 Gigawatts to Applications with RabbitMQ (PHP Oxford June Meetup 2...Adding 1.21 Gigawatts to Applications with RabbitMQ (PHP Oxford June Meetup 2...
Adding 1.21 Gigawatts to Applications with RabbitMQ (PHP Oxford June Meetup 2...James Titcumb
 

Was ist angesagt? (20)

Python meetup: coroutines, event loops, and non-blocking I/O
Python meetup: coroutines, event loops, and non-blocking I/OPython meetup: coroutines, event loops, and non-blocking I/O
Python meetup: coroutines, event loops, and non-blocking I/O
 
05 pig user defined functions (udfs)
05 pig user defined functions (udfs)05 pig user defined functions (udfs)
05 pig user defined functions (udfs)
 
Nikita Popov "What’s new in PHP 8.0?"
Nikita Popov "What’s new in PHP 8.0?"Nikita Popov "What’s new in PHP 8.0?"
Nikita Popov "What’s new in PHP 8.0?"
 
PHP Performance Trivia
PHP Performance TriviaPHP Performance Trivia
PHP Performance Trivia
 
NSC #2 - D1 01 - Rolf Rolles - Program synthesis in reverse engineering
NSC #2 - D1 01 - Rolf Rolles - Program synthesis in reverse engineeringNSC #2 - D1 01 - Rolf Rolles - Program synthesis in reverse engineering
NSC #2 - D1 01 - Rolf Rolles - Program synthesis in reverse engineering
 
Scalaz By Example (An IO Taster) -- PDXScala Meetup Jan 2014
Scalaz By Example (An IO Taster) -- PDXScala Meetup Jan 2014Scalaz By Example (An IO Taster) -- PDXScala Meetup Jan 2014
Scalaz By Example (An IO Taster) -- PDXScala Meetup Jan 2014
 
Letswift19-clean-architecture
Letswift19-clean-architectureLetswift19-clean-architecture
Letswift19-clean-architecture
 
Introduction to Swift programming language.
Introduction to Swift programming language.Introduction to Swift programming language.
Introduction to Swift programming language.
 
A deep dive into PEP-3156 and the new asyncio module
A deep dive into PEP-3156 and the new asyncio moduleA deep dive into PEP-3156 and the new asyncio module
A deep dive into PEP-3156 and the new asyncio module
 
node ffi
node ffinode ffi
node ffi
 
PyCon lightning talk on my Toro module for Tornado
PyCon lightning talk on my Toro module for TornadoPyCon lightning talk on my Toro module for Tornado
PyCon lightning talk on my Toro module for Tornado
 
PHP Enums - PHPCon Japan 2021
PHP Enums - PHPCon Japan 2021PHP Enums - PHPCon Japan 2021
PHP Enums - PHPCon Japan 2021
 
Evolutionary Testing for Crash Reproduction
Evolutionary Testing for Crash ReproductionEvolutionary Testing for Crash Reproduction
Evolutionary Testing for Crash Reproduction
 
Let's build a parser!
Let's build a parser!Let's build a parser!
Let's build a parser!
 
asyncio internals
asyncio internalsasyncio internals
asyncio internals
 
Building and Incredible Machine with Pipelines and Generators in PHP (IPC Ber...
Building and Incredible Machine with Pipelines and Generators in PHP (IPC Ber...Building and Incredible Machine with Pipelines and Generators in PHP (IPC Ber...
Building and Incredible Machine with Pipelines and Generators in PHP (IPC Ber...
 
About Those Python Async Concurrent Frameworks - Fantix @ OSTC 2014
About Those Python Async Concurrent Frameworks - Fantix @ OSTC 2014About Those Python Async Concurrent Frameworks - Fantix @ OSTC 2014
About Those Python Async Concurrent Frameworks - Fantix @ OSTC 2014
 
Metaprogramming and Reflection in Common Lisp
Metaprogramming and Reflection in Common LispMetaprogramming and Reflection in Common Lisp
Metaprogramming and Reflection in Common Lisp
 
PHP applications/environments monitoring: APM & Pinba
PHP applications/environments monitoring: APM & PinbaPHP applications/environments monitoring: APM & Pinba
PHP applications/environments monitoring: APM & Pinba
 
Adding 1.21 Gigawatts to Applications with RabbitMQ (PHP Oxford June Meetup 2...
Adding 1.21 Gigawatts to Applications with RabbitMQ (PHP Oxford June Meetup 2...Adding 1.21 Gigawatts to Applications with RabbitMQ (PHP Oxford June Meetup 2...
Adding 1.21 Gigawatts to Applications with RabbitMQ (PHP Oxford June Meetup 2...
 

Ähnlich wie Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)

Pragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the CompilerPragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the CompilerMarina Kolpakova
 
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
[FT-11][suhorng] “Poor Man's” Undergraduate CompilersFunctional Thursday
 
Compiler2016 by abcdabcd987
Compiler2016 by abcdabcd987Compiler2016 by abcdabcd987
Compiler2016 by abcdabcd987乐群 陈
 
Introduction to reactive programming & ReactiveCocoa
Introduction to reactive programming & ReactiveCocoaIntroduction to reactive programming & ReactiveCocoa
Introduction to reactive programming & ReactiveCocoaFlorent Pillet
 
Hacking parse.y (RubyKansai38)
Hacking parse.y (RubyKansai38)Hacking parse.y (RubyKansai38)
Hacking parse.y (RubyKansai38)ujihisa
 
Exploit techniques - a quick review
Exploit techniques - a quick reviewExploit techniques - a quick review
Exploit techniques - a quick reviewCe.Se.N.A. Security
 
openMP loop parallelization
openMP loop parallelizationopenMP loop parallelization
openMP loop parallelizationAlbert DeFusco
 
Intro to OpenMP
Intro to OpenMPIntro to OpenMP
Intro to OpenMPjbp4444
 
Q1 Consider the below omp_trap1.c implantation, modify the code so t.pdf
Q1 Consider the below omp_trap1.c implantation, modify the code so t.pdfQ1 Consider the below omp_trap1.c implantation, modify the code so t.pdf
Q1 Consider the below omp_trap1.c implantation, modify the code so t.pdfabdulrahamanbags
 
Analysis of Haiku Operating System (BeOS Family) by PVS-Studio. Part 2
Analysis of Haiku Operating System (BeOS Family) by PVS-Studio. Part 2Analysis of Haiku Operating System (BeOS Family) by PVS-Studio. Part 2
Analysis of Haiku Operating System (BeOS Family) by PVS-Studio. Part 2PVS-Studio
 
Being functional in PHP (DPC 2016)
Being functional in PHP (DPC 2016)Being functional in PHP (DPC 2016)
Being functional in PHP (DPC 2016)David de Boer
 
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...Databricks
 
Sergi Álvarez & Roi Martín - Radare2 Preview [RootedCON 2010]
Sergi Álvarez & Roi Martín - Radare2 Preview [RootedCON 2010]Sergi Álvarez & Roi Martín - Radare2 Preview [RootedCON 2010]
Sergi Álvarez & Roi Martín - Radare2 Preview [RootedCON 2010]RootedCON
 
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Modern Data Stack France
 
Rust LDN 24 7 19 Oxidising the Command Line
Rust LDN 24 7 19 Oxidising the Command LineRust LDN 24 7 19 Oxidising the Command Line
Rust LDN 24 7 19 Oxidising the Command LineMatt Provost
 
EcmaScript unchained
EcmaScript unchainedEcmaScript unchained
EcmaScript unchainedEduard Tomàs
 

Ähnlich wie Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14) (20)

Pragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the CompilerPragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
 
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
 
Compiler2016 by abcdabcd987
Compiler2016 by abcdabcd987Compiler2016 by abcdabcd987
Compiler2016 by abcdabcd987
 
Introduction to reactive programming & ReactiveCocoa
Introduction to reactive programming & ReactiveCocoaIntroduction to reactive programming & ReactiveCocoa
Introduction to reactive programming & ReactiveCocoa
 
Hacking parse.y (RubyKansai38)
Hacking parse.y (RubyKansai38)Hacking parse.y (RubyKansai38)
Hacking parse.y (RubyKansai38)
 
Exploit techniques - a quick review
Exploit techniques - a quick reviewExploit techniques - a quick review
Exploit techniques - a quick review
 
openMP loop parallelization
openMP loop parallelizationopenMP loop parallelization
openMP loop parallelization
 
Intro to OpenMP
Intro to OpenMPIntro to OpenMP
Intro to OpenMP
 
Introduction to OpenMP
Introduction to OpenMPIntroduction to OpenMP
Introduction to OpenMP
 
Introduction to c part -3
Introduction to c   part -3Introduction to c   part -3
Introduction to c part -3
 
Q1 Consider the below omp_trap1.c implantation, modify the code so t.pdf
Q1 Consider the below omp_trap1.c implantation, modify the code so t.pdfQ1 Consider the below omp_trap1.c implantation, modify the code so t.pdf
Q1 Consider the below omp_trap1.c implantation, modify the code so t.pdf
 
Analysis of Haiku Operating System (BeOS Family) by PVS-Studio. Part 2
Analysis of Haiku Operating System (BeOS Family) by PVS-Studio. Part 2Analysis of Haiku Operating System (BeOS Family) by PVS-Studio. Part 2
Analysis of Haiku Operating System (BeOS Family) by PVS-Studio. Part 2
 
Being functional in PHP (DPC 2016)
Being functional in PHP (DPC 2016)Being functional in PHP (DPC 2016)
Being functional in PHP (DPC 2016)
 
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
 
Sergi Álvarez & Roi Martín - Radare2 Preview [RootedCON 2010]
Sergi Álvarez & Roi Martín - Radare2 Preview [RootedCON 2010]Sergi Álvarez & Roi Martín - Radare2 Preview [RootedCON 2010]
Sergi Álvarez & Roi Martín - Radare2 Preview [RootedCON 2010]
 
CompilersAndLibraries
CompilersAndLibrariesCompilersAndLibraries
CompilersAndLibraries
 
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
 
Rust LDN 24 7 19 Oxidising the Command Line
Rust LDN 24 7 19 Oxidising the Command LineRust LDN 24 7 19 Oxidising the Command Line
Rust LDN 24 7 19 Oxidising the Command Line
 
OpenMP
OpenMPOpenMP
OpenMP
 
EcmaScript unchained
EcmaScript unchainedEcmaScript unchained
EcmaScript unchained
 

Mehr von Uri Laserson

Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)Uri Laserson
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Uri Laserson
 
Genomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyGenomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyUri Laserson
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Uri Laserson
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreUri Laserson
 
Hadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciencesHadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciencesUri Laserson
 

Mehr von Uri Laserson (6)

Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
 
Genomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyGenomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive Biology
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 
Hadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciencesHadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciences
 

Kürzlich hochgeladen

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 

Kürzlich hochgeladen (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)

  • 1. 1 Compiled Python UDFs for Impala Uri Laserson 20 May 2014
  • 2. Impala User-defined Functions (UDFs) • Tuple => Scalar value • Substring • sin, cos, pow, … • Machine-learning models • Supports Hive UDFs (Java) • Relatively unpleasurable • Slower • Impala (native) UDFs • C++ interface designed for efficiency • Similar to Postgres UDFs • Runs any LLVM-compiled code 2
  • 4. LLVM: C++ example 4 bool StringEq(FunctionContext* context, const StringVal& arg1, const StringVal& arg2) { if (arg1.is_null != arg2.is_null) return false; if (arg1.is_null) return true; if (arg1.len != arg2.len) return false; return (arg1.ptr == arg2.ptr) || memcmp(arg1.ptr, arg2.ptr, arg1.len) == 0; }
  • 5. LLVM: IR output 5 ; ModuleID = '<stdin>' target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128" target triple = "x86_64-apple-macosx10.7.0" %"class.impala_udf::FunctionContext" = type { %"class.impala::FunctionContextImpl"* } %"class.impala::FunctionContextImpl" = type opaque %"struct.impala_udf::StringVal" = type { %"struct.impala_udf::AnyVal", i32, i8* } %"struct.impala_udf::AnyVal" = type { i8 } ; Function Attrs: nounwind readonly ssp uwtable define zeroext i1 @_Z8StringEqPN10impala_udf15FunctionContextERKNS_9StringValES4_(%"class.impala_udf::FunctionContext"* nocapture %context, %"struct.impala_udf::StringVal"* nocapture %arg1, %"struct.impala_udf::StringVal"* nocapture %arg2) #0 { entry: %is_null = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 0, i32 0 %0 = load i8* %is_null, align 1, !tbaa !0, !range !3 %is_null1 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 0, i32 0 %1 = load i8* %is_null1, align 1, !tbaa !0, !range !3 %cmp = icmp eq i8 %0, %1 br i1 %cmp, label %if.end, label %return if.end: ; preds = %entry %tobool = icmp eq i8 %0, 0 br i1 %tobool, label %if.end7, label %return if.end7: ; preds = %if.end %len = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 1 %2 = load i32* %len, align 4, !tbaa !4 %len8 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 1 %3 = load i32* %len8, align 4, !tbaa !4 %cmp9 = icmp eq i32 %2, %3 br i1 %cmp9, label %if.end11, label %return if.end11: ; preds = %if.end7 %ptr = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 2 %4 = load i8** %ptr, align 8, !tbaa !5 %ptr12 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 2 %5 = load i8** %ptr12, align 8, !tbaa !5 %cmp13 = icmp eq i8* %4, %5 br i1 %cmp13, label %return, label %lor.rhs lor.rhs: ; preds = %if.end11 %conv17 = sext i32 %2 to i64 %call = tail call i32 @memcmp(i8* %4, i8* %5, i64 %conv17) %cmp18 = icmp eq i32 %call, 0 br label %return
  • 6. LLVM: IR output 6 ; ModuleID = '<stdin>' target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128" target triple = "x86_64-apple-macosx10.7.0" %"class.impala_udf::FunctionContext" = type { %"class.impala::FunctionContextImpl"* } %"class.impala::FunctionContextImpl" = type opaque %"struct.impala_udf::StringVal" = type { %"struct.impala_udf::AnyVal", i32, i8* } %"struct.impala_udf::AnyVal" = type { i8 } ; Function Attrs: nounwind readonly ssp uwtable define zeroext i1 @_Z8StringEqPN10impala_udf15FunctionContextERKNS_9StringValES4_(%"class.impala_udf::FunctionContext"* nocapture %context, %"struct.impala_udf::StringVal"* nocapture %arg1, %"struct.impala_udf::StringVal"* nocapture %arg2) #0 { entry: %is_null = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 0, i32 0 %0 = load i8* %is_null, align 1, !tbaa !0, !range !3 %is_null1 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 0, i32 0 %1 = load i8* %is_null1, align 1, !tbaa !0, !range !3 %cmp = icmp eq i8 %0, %1 br i1 %cmp, label %if.end, label %return if.end: ; preds = %entry %tobool = icmp eq i8 %0, 0 br i1 %tobool, label %if.end7, label %return if.end7: ; preds = %if.end %len = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 1 %2 = load i32* %len, align 4, !tbaa !4 %len8 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 1 %3 = load i32* %len8, align 4, !tbaa !4 %cmp9 = icmp eq i32 %2, %3 br i1 %cmp9, label %if.end11, label %return if.end11: ; preds = %if.end7 %ptr = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 2 %4 = load i8** %ptr, align 8, !tbaa !5 %ptr12 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 2 %5 = load i8** %ptr12, align 8, !tbaa !5 %cmp13 = icmp eq i8* %4, %5 br i1 %cmp13, label %return, label %lor.rhs lor.rhs: ; preds = %if.end11 %conv17 = sext i32 %2 to i64 %call = tail call i32 @memcmp(i8* %4, i8* %5, i64 %conv17) %cmp18 = icmp eq i32 %call, 0 br label %return
  • 7. Data type compatibility 7 struct AnyVal { bool is_null; }; struct StringVal : public AnyVal { int len; uint8_t* ptr; }; %AnyVal = type { i8 } %StringVal = type { %AnyVal, i32, i8* } ; or %StringVal = type { { i8 }, i32, i8* } C++LLVMIR
  • 8. Register and execute the function 8 CREATE FUNCTION StringEq(STRING, STRING) RETURNS BOOLEAN LOCATION '/path/to/bitcode.ll’ SYMBOL=’StringEq’; SELECT StringEq(a, b) FROM mytable;
  • 10. Impyla: Python Library for Impala • pip install impyla • DB API v2.0 (PEP 249) compatible • Prototype sklearn API for Impala ML • Numba integration (described here) • See blog post: http://blog.cloudera.com/blog/2014/04/a-new- python-client-for-impala/ 10
  • 11. LLVM: Python example 11 @udf(IntVal(FunctionContext, StringVal)) def hour_from_weird_date_format(context, date): return int(split(date, '-')[1]) ship_udf(cursor, hour_from_weird_data_format, '/path/to/store/udf.ll', 'my.impala.host') cur.execute('SELECT hour_from_weird_data_format(date) ’ + ‘AS hour FROM mytable LIMIT 100’)
  • 12. Model Scoring: BigML on Census Data 12 MLaaS
  • 13. Model Scoring: BigML on Census Data 13
  • 14. Example: 100 Node Decision Tree 14 def predict_income(impala_function_context, age, workclass, final_weight, education, education_num, marital_status, occupation, relationship, race, sex, hours_per_week, native_country, income): if (marital_status is None): return '<=50K' if (marital_status == 'Married-civ-spouse'): if (education_num is None): return '<=50K' if (education_num > 12): if (hours_per_week is None): return '>50K' if (hours_per_week > 31): if (age is None): return '>50K' if (age > 28): if (education_num > 13): if (age > 58): return '>50K' if (age <= 58): return '>50K' if (education_num <= 13): if (occupation is None): return '>50K' if (occupation == 'Exec-managerial'): return '>50K' if (occupation != 'Exec-managerial'): return '>50K' if (age <= 28): if (age > 24): if (occupation is None): return '<=50K' if (occupation == 'Tech-support'): return '>50K' if (occupation != 'Tech-support'): return '<=50K' if (age <= 24): if (final_weight is None): return '<=50K' if (final_weight > 492053): return '>50K' if (final_weight <= 492053): return '<=50K' if (hours_per_week <= 31): if (sex is None): return '<=50K' if (sex == 'Male'):
  • 15. Batch Scoring with PySpark 15 # parse the text data observations = sc.textFile('/path/to/census_data').map(parse_obs) # perform batch scoring predictions = observations.map(lambda tup: predict_income(*tup)) # trigger computation distinct = predictions.distinct().collect()
  • 16. Batch Scoring with Impala 16 # compile the scoring function predict_income = udf(signature)(predict_income) ship_udf(cursor, predict_income, ...) # perform batch scoring cursor.execute(‘SELECT DISTINCT predict_income(age, ... ) ‘ + ‘FROM census_text’) distinct = cursor.fetchall()
  • 17. Execution Time 17 execution_time = per_job_overhead + N * ( per_record_exec + memcmp_exec )
  • 18. PySpark vs. Impala Performance 18 Tree size (nodes) Spark execution time (s) Impala execution time (s) Fold differenc e Impala compilati on time (s) Bytecode size (bytes) Percent memcmp nodes 0 160 9 17x 0 4 100 175 22 8x 1 2254 22% 500 178 27 7x 4 9803 35% 1000 184 32 6x 16 23495 34% 1500 188 35 5x 18 28301 34% 2000 196 37 5x 31 42442 33%
  • 19. Execution Time 19 execution_time = per_job_overhead + N * ( per_record_exec + memcmp_exec ) Spark: 24 threads / node [ ] Impala: 1 thread / node
  • 20. PySpark vs. Impala Performance 20 Tree size (nodes) Spark execution time (s) Impala execution time (s) Fold differenc e Impala compilati on time (s) Bytecode size (bytes) Percent memcmp nodes 0 160 9 17x 0 4 100 175 22 8x 1 2254 22% 500 178 27 7x 4 9803 35% 1000 184 32 6x 16 23495 34% 1500 188 35 5x 18 28301 34% 2000 196 37 5x 31 42442 33%
  • 21. Current Status • Support for all Impala UDF data types (e.g., IntVal, StringVal, etc.) • Support for casts to/from primitive types: • Any operations on primitives should work on Impala types • Support for NULL types as Python None • Proof-of-principle support for Python string module • len • split • Concatenation • Call out to any extern C functions • Proposed directions • Array handling • Numpy support • What else? 21
  • 22. UDFs with Impala + Numba • Simplicity of Python interface/syntax • Performance of compiled language like C++ • Developed at: https://github.com/cloudera/impyla • Please try it and tell us what features would be useful • Please contribute! 22 pip install impyla
  • 23. 23

Hinweis der Redaktion

  1. Much easier than C++ workflow. UDF in principle available to others.
  2. Much easier than C++ workflow. UDF in principle available to others.
  3. Much easier than C++ workflow. UDF in principle available to others.
  4. Significant fold change. Fold change gets closer to 1 and stabilizes as memcmp dominates execution. Compilation time linear in size of byte code. Every extra 500 nodes is about an extra few seconds of work each Wall clock time
  5. Significant fold change. Fold change gets closer to 1 and stabilizes as memcmp dominates execution. Compilation time linear in size of byte code. But working to improve it. Every extra 500 nodes is about an extra few seconds of work each Wall clock time