7. Data type compatibility
C++:
struct AnyVal {
    bool is_null;
};
struct StringVal : public AnyVal {
    int len;
    uint8_t* ptr;
};
LLVM IR:
%AnyVal = type { i8 }
%StringVal = type { %AnyVal, i32, i8* }
; or
%StringVal = type { { i8 }, i32, i8* }
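The correspondence between the C++ struct and the LLVM struct type comes down to field order and offsets. As a rough illustration only (not how impyla itself does it), the same layout can be mirrored with Python's standard ctypes module. The field name `base` is a hypothetical stand-in for the inherited AnyVal part, and the offsets assume the usual C layout rules on common platforms.

```python
import ctypes

class AnyVal(ctypes.Structure):
    # One byte holding the null flag, like LLVM's { i8 }.
    _fields_ = [("is_null", ctypes.c_bool)]

class StringVal(ctypes.Structure):
    # The base struct comes first, then len and ptr, like { { i8 }, i32, i8* }.
    # "base" is a hypothetical field name standing in for C++ inheritance.
    _fields_ = [
        ("base", AnyVal),
        ("len", ctypes.c_int32),
        ("ptr", ctypes.POINTER(ctypes.c_uint8)),
    ]

# Byte offset of each field inside StringVal.
layout = {name: getattr(StringVal, name).offset for name, _ in StringVal._fields_}
print(layout)
```

The padding after the one-byte flag is why `len` does not start at offset 1: both the C++ compiler and the LLVM struct type align the 32-bit integer.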
8. Register and execute the function
CREATE FUNCTION StringEq(STRING, STRING)
RETURNS BOOLEAN
LOCATION '/path/to/bitcode.ll'
SYMBOL='StringEq';
SELECT StringEq(a, b) FROM mytable;
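When several UDFs need registering, the statement can be assembled from its parts. The helper below is a hypothetical convenience, not part of Impala or impyla; it only builds the string shown on the slide and does no quoting or escaping of its inputs.

```python
def create_function_sql(name, arg_types, return_type, location, symbol):
    """Build a CREATE FUNCTION statement for a UDF (hypothetical helper).

    Inputs are inserted verbatim, so they must come from trusted code.
    """
    args = ", ".join(arg_types)
    return (
        f"CREATE FUNCTION {name}({args}) "
        f"RETURNS {return_type} "
        f"LOCATION '{location}' "
        f"SYMBOL='{symbol}'"
    )

ddl = create_function_sql(
    "StringEq", ["STRING", "STRING"], "BOOLEAN",
    "/path/to/bitcode.ll", "StringEq",
)
print(ddl)
```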
10. Impyla: Python Library for Impala
• pip install impyla
• DB API v2.0 (PEP 249) compatible
• Prototype sklearn API for Impala ML
• Numba integration (described here)
• See blog post:
http://blog.cloudera.com/blog/2014/04/a-new-python-client-for-impala/
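Because the client follows DB API v2.0, code written against the standard cursor interface should carry over. The sketch below uses the standard library's sqlite3 as a stand-in so it runs without a cluster; with impyla the connection would instead come from `impala.dbapi.connect`, with a host and port for your own deployment. The helper name `fetch_all` and the table contents are made up for illustration.

```python
import sqlite3

def fetch_all(conn, sql):
    """Run one query through a PEP 249 connection and return all rows."""
    cur = conn.cursor()
    try:
        cur.execute(sql)
        return cur.fetchall()
    finally:
        cur.close()

# Stand-in database; sqlite3 also implements PEP 249.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE mytable (a TEXT, b TEXT)")
# The "?" placeholder style is sqlite3's; other drivers may use a different paramstyle.
cur.executemany("INSERT INTO mytable VALUES (?, ?)", [("x", "x"), ("y", "z")])
cur.close()

rows = fetch_all(conn, "SELECT a = b FROM mytable ORDER BY a")
print(rows)
```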
14. Example: 100 Node Decision Tree
def predict_income(impala_function_context, age, workclass, final_weight, education, education_num, marital_status, occupation, relationship,
                   race, sex, hours_per_week, native_country, income):
    if (marital_status is None):
        return '<=50K'
    if (marital_status == 'Married-civ-spouse'):
        if (education_num is None):
            return '<=50K'
        if (education_num > 12):
            if (hours_per_week is None):
                return '>50K'
            if (hours_per_week > 31):
                if (age is None):
                    return '>50K'
                if (age > 28):
                    if (education_num > 13):
                        if (age > 58):
                            return '>50K'
                        if (age <= 58):
                            return '>50K'
                    if (education_num <= 13):
                        if (occupation is None):
                            return '>50K'
                        if (occupation == 'Exec-managerial'):
                            return '>50K'
                        if (occupation != 'Exec-managerial'):
                            return '>50K'
                if (age <= 28):
                    if (age > 24):
                        if (occupation is None):
                            return '<=50K'
                        if (occupation == 'Tech-support'):
                            return '>50K'
                        if (occupation != 'Tech-support'):
                            return '<=50K'
                    if (age <= 24):
                        if (final_weight is None):
                            return '<=50K'
                        if (final_weight > 492053):
                            return '>50K'
                        if (final_weight <= 492053):
                            return '<=50K'
            if (hours_per_week <= 31):
                if (sex is None):
                    return '<=50K'
                if (sex == 'Male'):
                    ...  # the slide cuts off here
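A function of this shape is presumably generated from a fitted tree rather than typed by hand. The sketch below is an assumption about how such generation could work, not the authors' tool: the nested-dict tree format and the names `emit` and `generate` are hypothetical, and only numeric threshold splits are handled.

```python
def emit(node, depth=1):
    """Return the source lines for one tree node, indented by depth."""
    pad = "    " * depth
    if "label" in node:
        # Leaf: return the predicted class.
        return [f"{pad}return {node['label']!r}"]
    feature, threshold = node["feature"], node["threshold"]
    lines = [
        f"{pad}if ({feature} is None):",            # NULL input falls back to a default
        f"{pad}    return {node['if_null']!r}",
        f"{pad}if ({feature} > {threshold}):",
    ]
    lines += emit(node["gt"], depth + 1)
    lines.append(f"{pad}if ({feature} <= {threshold}):")
    lines += emit(node["le"], depth + 1)
    return lines

def generate(name, args, tree):
    """Return Python source for a function that walks the tree."""
    header = f"def {name}({', '.join(args)}):"
    return "\n".join([header] + emit(tree))

# A tiny hypothetical tree with two internal nodes.
tree = {
    "feature": "age", "threshold": 28, "if_null": ">50K",
    "gt": {"label": ">50K"},
    "le": {
        "feature": "final_weight", "threshold": 492053, "if_null": "<=50K",
        "gt": {"label": ">50K"},
        "le": {"label": "<=50K"},
    },
}

source = generate("predict", ["age", "final_weight"], tree)
namespace = {}
exec(source, namespace)          # compile the generated source into a function
predict = namespace["predict"]
print(source)
```

Each internal node emits a NULL check followed by two complementary branches, which is the same pattern visible in the slide's code.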
15. Batch Scoring with PySpark
# parse the text data
observations = sc.textFile('/path/to/census_data').map(parse_obs)
# perform batch scoring
predictions = observations.map(lambda tup: predict_income(*tup))
# trigger computation
distinct = predictions.distinct().collect()
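The `parse_obs` function is not shown on the slide. A hypothetical version might look like the sketch below, which assumes each line holds 13 comma-separated fields in the same order as the parameters of `predict_income`, that missing values appear as an empty field or `?`, and that a leading `None` fills the `impala_function_context` slot when running outside Impala.

```python
# Positions of age, final_weight, education_num and hours_per_week (assumed order).
NUMERIC_POSITIONS = {0, 2, 4, 10}
EXPECTED_FIELDS = 13

def parse_obs(line):
    """Turn one text record into the argument tuple for predict_income."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")
    values = []
    for position, raw in enumerate(fields):
        if raw in ("", "?"):
            values.append(None)          # missing value maps to None
        elif position in NUMERIC_POSITIONS:
            values.append(int(raw))
        else:
            values.append(raw)
    # Leading None stands in for the unused function context argument.
    return (None,) + tuple(values)

obs = parse_obs(
    "39, State-gov, 77516, Bachelors, 13, Never-married, ?, "
    "Not-in-family, White, Male, 40, United-States, <=50K"
)
print(obs)
```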
20. PySpark vs. Impala Performance
Tree size  Spark execution  Impala execution  Fold        Impala compilation  Bytecode size  Percent memcmp
(nodes)    time (s)         time (s)          difference  time (s)            (bytes)        nodes
0          160              9                 17x         0                   4
100        175              22                8x          1                   2254           22%
500        178              27                7x          4                   9803           35%
1000       184              32                6x          16                  23495          34%
1500       188              35                5x          18                  28301          34%
2000       196              37                5x          31                  42442          33%
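The fold difference is the Spark time divided by the Impala time. The sketch below recomputes it from the whole-second figures in the table and also reports compilation seconds per 1000 bytes of bytecode. Because the inputs are already rounded, the recomputed ratios can differ slightly from the values printed on the slide.

```python
# (tree size, Spark time s, Impala time s, Impala compile time s, bytecode bytes)
rows = [
    (0, 160, 9, 0, 4),
    (100, 175, 22, 1, 2254),
    (500, 178, 27, 4, 9803),
    (1000, 184, 32, 16, 23495),
    (1500, 188, 35, 18, 28301),
    (2000, 196, 37, 31, 42442),
]

# Fold difference: how many times longer Spark took than Impala.
folds = [spark_s / impala_s for _, spark_s, impala_s, _, _ in rows]

# Compilation cost per 1000 bytes of bytecode.
compile_per_kb = [1000 * compile_s / nbytes for _, _, _, compile_s, nbytes in rows]

for (nodes, *_), fold, per_kb in zip(rows, folds, compile_per_kb):
    print(f"{nodes:>5} nodes: fold {fold:.1f}x, compile {per_kb:.2f} s per 1000 bytes")
```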
21. Current Status
• Support for all Impala UDF data types (e.g., IntVal, StringVal, etc.)
• Support for casts to/from primitive types:
  • Any operations on primitives should work on Impala types
• Support for NULL types as Python None
• Proof-of-principle support for Python string module
  • len
  • split
  • Concatenation
• Call out to any extern C functions
• Proposed directions
  • Array handling
  • NumPy support
  • What else?
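The features listed above suggest the kind of Python a UDF body can contain. The two functions below are a sketch in plain Python only: the impyla decorator and type signature that would hand them to Numba are left out, and both function names are hypothetical. The first argument mirrors the context parameter used in the decision tree example.

```python
def string_eq(context, a, b):
    """Compare two strings; a NULL input (None) yields a NULL result."""
    if a is None or b is None:
        return None
    return a == b

def full_name_length(context, first, last):
    """Use concatenation and len, two of the string operations listed above."""
    if first is None or last is None:
        return None
    return len(first + " " + last)

print(string_eq(None, "abc", "abc"))
print(full_name_length(None, "Ada", "Lovelace"))
```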
22. UDFs with Impala + Numba
• Simplicity of Python interface/syntax
• Performance of compiled language like C++
• Developed at: https://github.com/cloudera/impyla
• Please try it and tell us what features would be useful
• Please contribute!
Much easier than C++ workflow.
UDF in principle available to others.
Significant fold difference between Spark and Impala.
The fold difference moves closer to 1 and stabilizes as memcmp dominates execution.
Compilation time is linear in the size of the bytecode, but we are working to improve it.
Every extra 500 nodes adds about a few extra seconds of work for each system.
All times are wall-clock times.