GE Aviation Digital
17 Oct 2019
Dr Lucas Partridge
Dr Peter Knight

Bridging the Gap Between Data Scientists and Software Engineers
Deploying legacy Python algorithms to Apache Spark with minimum pain
#UnifiedDataAnalytics #SparkAISummit
About us
Peter Knight (Data Scientist)
- predicts wear on aircraft engines to minimize unplanned downtime.
Lucas Partridge (Software Engineer)
- helps the Data Scientists scale up their algorithms for big data.
Outline
• About GE Aviation
• The problem
• Starting point
• Approach taken
• Some code
• Challenges
• Benefits
• Conclusions and recommendations
General Electric - Aviation
• 48k employees
• $30.6B revenue - 2018
• >33k commercial engines
“Every two seconds, an aircraft powered by GE technology takes off somewhere in the world”
General problem
• GE Aviation has 100s of data scientists and engineers developing
Python algorithms.
• But most develop and test their algorithms on their local machines and
don’t have the time to learn Spark.
• Spark = good candidate to make these algorithms scale as the engine
fleet grows.
• But how do we deploy these legacy algorithms to Spark as quickly as
possible?
Specific problem
• Forecasting when aircraft engines should be removed for maintenance.
• So we can predict what engine parts will be needed, where, and when.
• ‘Digital Twin’ model exists for each important part to be tracked.
• Tens of engine lines, tens of tracked parts → 100s of algorithms to scale!
Starting point – a typical legacy Python algorithm
def execute(input_data):   # input_data is a Pandas DataFrame!
    # Calculate results from input data
    # …
    return results         # results is also a Pandas DataFrame!
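For concreteness, a purely hypothetical algorithm in this style might look like the sketch below. The column names ('esn', 'cycles', 'egt_margin') and the wear formula are invented; the important part is the contract the wrappers rely on: a Pandas DataFrame of raw engine data in, a Pandas DataFrame of results out.

import pandas as pd

def execute(input_data):
    # input_data: Pandas DataFrame with (illustrative) columns 'esn', 'cycles', 'egt_margin'.
    df = input_data.sort_values('cycles')
    # Toy wear model: wear grows with accumulated cycles and with a shrinking EGT margin.
    df['wear'] = df['cycles'] / 1000.0 + (100.0 - df['egt_margin']) * 0.01
    results = df.groupby('esn')['wear'].max().reset_index()
    results['predicted_removal_cycles'] = 20000 - results['wear'] * 500
    return results  # one row per engine serial number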
The legacy Python algorithms
• Used Pandas DataFrames.
• Were run on laptops. Didn’t exploit Spark.
• Each algorithm was run independently.
• Each fetched its own data and read from, and wrote to, csv files.
• Some Java Hadoop Map-Reduce and R ones too - not covered here.
The legacy Python algorithms
• Often failed at runtime.
• Typically processed data for more than one asset (engine) at a time;
they often tried to process all engines!
• All the data would be read into a single Pandas DataFrame → ran out of memory!
The legacy Python algorithms
• Weren’t consistently written
• function arguments vs globals
• using different names for the same data column.
• Had complex config in JSON – hard to do what-if runs.
• Other considerations:
• The problem domain suggested the need for a pipeline of algorithms.
• Few data scientists and software engineers know about Spark, much less about
ML Pipelines!
Working towards a solution
• Studied representative legacy algorithms
• structure
• how do they process data – columns required, sorting of data rows
• are any tests available?! E.g., csv files of inputs and expected outputs.
• Assumed we couldn’t alter the legacy code at all
• so decided to wrap rather than port them to PySpark
• i.e., legacy Python algorithm is called in parallel across the cluster.
To wrap or to port a legacy algorithm?

Port when…
• Performance is critical.
• The algorithm is small, simple and easy to test on Spark.
• The algorithm’s creator is comfortable working directly with Spark.
• Spark skills are available for the foreseeable future.

Wrap when…
• You wish to retain the ability to run, test and update the algorithm outside Spark (e.g., on a laptop, or in other Big Data frameworks).
• An auto-code generation tool is available for generating all the necessary wrapper code.
Initially tried wrapping with RDD.mapPartitions()…
• Call it after repartitioning the input data by engine id. This worked, but…
• Could get unexpected key-skew effects unless you experiment with the way your data is partitioned.
• The data for more than one asset (engine) at a time could be passed into the wrapped algorithm.
• OK if the algorithm can handle data for more than one asset; otherwise not.
• We really wanted to use @pandas_udf if Spark 2.3+ was available, since its ‘grouped map’ usage means that the data for only one asset gets passed to the algorithm.
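A minimal sketch of that first approach is shown below. The column name ('esn') and the legacy_algorithm module are illustrative, and the real wrapper also had to rebuild a Spark DataFrame from the results, as shown later.

import pandas as pd
import legacy_algorithm   # hypothetical module exposing execute(pandas_df) -> pandas_df

def run_partition(rows_iter):
    rows = list(rows_iter)
    if not rows:
        return iter([])
    # After repartitioning by engine id, a partition may still hold several engines' rows.
    pdf = pd.DataFrame(rows, columns=rows[0].__fields__)
    results = legacy_algorithm.execute(pdf)    # Pandas DataFrame in, Pandas DataFrame out
    return iter(results.to_dict('records'))    # hand plain rows back to Spark

results_rdd = input_df.repartition('esn').rdd.mapPartitions(run_partition)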
…so then we switched to RDD.groupByKey()
• Where key = asset (engine) id.
• So the data for only one asset gets passed to the algorithm.
• This more closely mirrors the behaviour of @pandas_udf, so this code should be
easier to convert to use @pandas_udf later on.
• And it will work with algorithms that can only cope with the data for one asset at a time.
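For reference, the grouped-map @pandas_udf equivalent on Spark 2.3+ looks roughly like the sketch below. The output schema and column name are illustrative; in practice the schema would be derived from the legacy algorithm's output.

from pyspark.sql.functions import pandas_udf, PandasUDFType
import legacy_algorithm   # hypothetical module exposing execute(pandas_df) -> pandas_df

# The output schema must be declared up front (illustrative columns).
result_schema = "esn string, predicted_removal_cycles double"

@pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
def run_for_one_asset(pdf):
    # pdf holds the rows for exactly one engine (one group).
    return legacy_algorithm.execute(pdf)

results_df = input_df.groupby('esn').apply(run_for_one_asset)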
Forecasting engine removals – solution components
• Multiple Digital Twin models – each one models the wear on a single engine part.
• Input Data Predictor – asks all the digital twin models what input data they need, and then predicts those values n years into the future.
• Aggregator – compares all the predictions to estimate when a given engine should be removed due to the wear of a particular part.
• → All of these were made into ML Pipeline Transformers…
(Pipeline diagram: Historic data → Input Data Predictor → PipelineModel of Digital Twin models → Aggregator PipelineModel → Persist results)
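A heavily simplified sketch of how such a scenario might be assembled as a single pipeline is shown below. The component names are taken from the slides, but their constructors, the input DataFrame and the output table name are illustrative, not the actual GE code.

from pyspark.ml import Pipeline

# Hypothetical transformer instances – each wraps one legacy algorithm.
stages = [
    InputDataPredictor(),                   # predicts the required input data n years ahead
    DigitalTwinEnginePartXEngineTypeP(),    # one wrapped digital twin per tracked part
    DigitalTwinEnginePartYEngineTypeP(),
    Aggregator(),                           # combines the per-part predictions
]

pipeline_model = Pipeline(stages=stages).fit(historic_df)   # fit() is a pass-through for pure Transformers
forecast_df = pipeline_model.transform(historic_df)
forecast_df.write.mode('overwrite').saveAsTable('engine_removal_forecast')   # illustrative table name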
Strategy taken
• Passed data to algorithms rather than have each algorithm fetch its own data.
• An algorithm shouldn’t have to know where the data comes from.
• Got representative digital twin models working in isolation, using a temporary table of predicted input data as input.
• Prototyped in a notebook environment (Apache Zeppelin).
• Eventually incorporated the pipeline into a spark-submit job using a new hierarchy of classes…
Class hierarchy
(Class diagram; abstract classes are marked as such on the original slide.)
• pyspark.ml: Params, Estimator (abstract) and Transformer (abstract).
• GE code: GeAnalytic (abstract), EngineWearModel (abstract), and its children GroupByKeyEngineWearModel (abstract) and HadoopMapReduceEngineWearModel (abstract).
• Code you write: concrete classes such as DigitalTwinEnginePartXEngineTypeP and DigitalTwinEnginePartYEngineTypeP.
• Sample mixin classes: HasEsnCol (esnCol), HasDatetimeCol (datetimeCol), HasFleetCol (fleetCol).
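For example, a HasEsnCol mixin might be written in the style of pyspark.ml's own shared-param classes. The sketch below is an assumption about how such a mixin could look, not the actual GE implementation.

from pyspark.ml.param import Param, Params, TypeConverters

class HasEsnCol(Params):
    """Mixin for algorithms that need to know which column holds the engine serial number."""
    esnCol = Param(Params._dummy(), "esnCol",
                   "name of the engine serial number column",
                   typeConverter=TypeConverters.toString)

    def __init__(self):
        super(HasEsnCol, self).__init__()
        self._setDefault(esnCol='esn')   # illustrative default column name

    def setEsnCol(self, value):
        return self._set(esnCol=value)

    def getEsnCol(self):
        return self.getOrDefault(self.esnCol)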
Zooming in…

EngineWearModel (abstract)
Abstract methods:
_transform()
_handleMissingData()
_processInputColumns()
_processResultsColumns()

GroupByKeyEngineWearModel (abstract)
Concrete methods:
_transform()
_runAnalyticForOneAsset()
_convertRddOfPandasDataFramesToSparkDataFrame()

DigitalTwinEnginePartXEngineTypeP
Concrete methods:
_handleMissingData()
_processInputColumns()
_processResultsColumns()
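With the reusable logic pushed up into GroupByKeyEngineWearModel, a concrete wrapper class only has to name its legacy module and massage column names. Below is a sketch of what such a class might look like; the module path, column names and the setExecuteModuleName() setter are assumptions for illustration, not the actual GE code.

class DigitalTwinEnginePartXEngineTypeP(GroupByKeyEngineWearModel):
    """Wraps the legacy part-X wear algorithm for engine type P."""

    def __init__(self):
        super(DigitalTwinEnginePartXEngineTypeP, self).__init__()
        # Name of the legacy module whose execute() will be imported on the executors.
        self.setExecuteModuleName('ge.wear.part_x_type_p')   # illustrative module path

    def _handleMissingData(self, dataset):
        # The legacy algorithm cannot cope with null EGT margins (illustrative rule).
        return dataset.dropna(subset=['egt_margin'])

    def _processInputColumns(self, dataset):
        # Rename platform columns to the names the legacy code expects.
        return dataset.withColumnRenamed('engine_serial_number', self.getEsnCol())

    def _processResultsColumns(self, dataset):
        # Prefix the outputs so they don't collide with other digital twins in the pipeline.
        return dataset.withColumnRenamed('wear', 'part_x_wear')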
_transform() method
Note: implement the _handleMissingData(), _processInputColumns() and _processResultsColumns() methods in each DigitalTwinXXX class.
def _transform(self, dataset):
    no_nulls_data = self._handleMissingData(dataset)
    data_with_processed_input_columns = self._processInputColumns(no_nulls_data)
    asset_col = self.getEsnCol()
    # Group the rows by asset (engine) id so each asset's data is processed together.
    grouped_input_rdd = (data_with_processed_input_columns
                         .rdd.map(lambda row: (row[asset_col], row))
                         .groupByKey().mapValues(list))
    # Run the wrapped legacy algorithm once per asset, in parallel across the cluster.
    results_rdd = grouped_input_rdd.mapValues(
        self._runAnalyticForOneAsset(self.getFailFast(), asset_col))
    results_df = self._convertRddOfPandasDataFramesToSparkDataFrame(results_rdd)
    processed_results = self._processResultsColumns(results_df)
    output_df = dataset.join(processed_results, asset_col, 'left_outer')
    return output_df
Invoking the legacy Pandas-based algorithm
# Assumes these imports at module level: importlib, pandas as pd, and datetime from datetime.
def _runAnalyticForOneAsset(self, failFast, assetCol):
    # Import the named legacy algorithm:
    pandas_based_analytic_module = importlib.import_module(
        self.getExecuteModuleName())  # A param set by each digital twin class.

    def _assetExecute(assetData):
        # Convert row data for asset into a Pandas DataFrame:
        rows = list(assetData)
        column_names = rows[0].__fields__
        input_data = pd.DataFrame(rows, columns=column_names)
        try:
            # Call the legacy algorithm; input_data and results are both Pandas DataFrames.
            results = pandas_based_analytic_module.execute(input_data)
        except Exception as e:
            asset_id = input_data[assetCol].iloc[0]
            ex = Exception("Encountered %s whilst processing asset id '%s'"
                           % (e.__class__.__name__, asset_id), e.args[0])
            if failFast:
                raise ex  # Fail immediately, report error to driver node.
            else:
                # Log error message silently in the Spark executor's logs:
                error_msg = "Silently ignoring this error: %s" % ex
                print(datetime.now().strftime("%y/%m/%d %H:%M:%S : ") + error_msg)
                return error_msg
        return results

    return _assetExecute
Converting the results back into a Spark DataFrame
# Assumes these imports at module level: pandas as pd, and SparkSession from pyspark.sql.
def _convertRddOfPandasDataFramesToSparkDataFrame(self, resultsRdd):
    # Anything that isn't a Pandas DataFrame is an error message from a failed asset.
    errors_rdd = resultsRdd.filter(lambda results: not isinstance(results[1], pd.DataFrame))
    if not errors_rdd.isEmpty():
        print("Possible errors: %s" % errors_rdd.collect())
    valid_results_rdd = resultsRdd.filter(lambda results: isinstance(results[1], pd.DataFrame))
    if valid_results_rdd.isEmpty():
        raise RuntimeError("ABORT! No valid results were obtained!")
    # Convert the Pandas DataFrames into lists of rows and flatten into one RDD of rows.
    flattened_results_rdd = valid_results_rdd.flatMapValues(
        lambda pdf: (r.tolist() for r in pdf.to_records(index=False))).values()
    # Create a Spark DataFrame, using a schema made from that of the first Pandas DataFrame.
    spark = SparkSession.builder.getOrCreate()
    first_pdf = valid_results_rdd.first()[1]  # Pandas DataFrame
    first_pdf_schema = spark.createDataFrame(first_pdf).schema
    return spark.createDataFrame(flattened_results_rdd, first_pdf_schema)
Algorithms before and after wrapping
(BEFORE = standalone legacy algorithm; AFTER = same algorithm wrapped for a PySpark ML Pipeline)

Hosting location
BEFORE: On a single node or laptop.
AFTER: In a platform that runs spark-submit jobs on a schedule.

Configuration
BEFORE: Held in a separate JSON config file for each algorithm.
AFTER: Stored in params of the ML Pipeline, which can be saved to and loaded from disk for the whole pipeline. Config is part of the pipeline itself.

Acquisition of input data
BEFORE: Each algorithm fetched its own input data: made a separate Hive query, wrote its input data to csv, then read it into a single in-memory Pandas DataFrame for all applicable engines.
AFTER: A PySpark spark.sql("SELECT …") statement for the data required by all the algorithms in the pipeline. Passed as a Spark DataFrame into the transform() method for the whole pipeline.

All asset (engine) data
BEFORE: Held in memory on a single machine.
AFTER: Spread across the executors of the Spark cluster.

Writing of results
BEFORE: Each algorithm wrote its output to csv, which was then loaded into Hive as a separate table.
AFTER: Each algorithm appends a column of output to the Spark DataFrame that’s passed from one transform() to the next in the pipeline.

Programming paradigm
BEFORE: Written as an execute() function which called other functions.
AFTER: Inherits from a specialised pyspark.ml.Transformer class.
But it wasn’t all a bed of roses! Challenges…
• The pipeline wasn’t really a simple linear pipeline:
• Digital twin models operate independently – so could really be run in parallel.
• Many digital twins need to query data that’s in a different shape to the data that’s passed into the transform() method for the whole ML Pipeline.
• Converting the Pandas DataFrames back into Spark DataFrames without hitting data-type conversion issues at runtime was tricky!
(Pipeline diagram: Historic data and Other data → Input Data Predictor → PipelineModel of Digital Twin models → Aggregator → Persist results)
More challenges
• Debugging can be tricky! Tips:
• failFast flag:
  – True: stop processing if any asset throws an exception. Useful when debugging.
  – False: silently log an error message for any asset that throws an exception, but continue processing for the other assets. Useful in production.
• Run with fewer engines and/or fleets when testing; gradually expand out.
• Even simple things have to be encoded as extra transformers in the pipeline or added as extra params.
• E.g., persisting data, when required, between different stages in the pipeline.
Benefits of this approach
• Much more reliable – don’t run out of memory any more!
• Will scale with the number of assets as the engine fleet grows.
• Whole forecasting scenario runs as a single ML PipelineModel – one per engine type/config.
• Consistent approach (and column names!) across the algorithms.
Key benefit
Data scientists who know little/nothing about Spark...
• can still develop and test their algorithm outside Spark on their own laptop, and…
• yet still have it deployed to Spark to scale with Big Data ☺.
You don’t have to rewrite each algorithm in PySpark to use the power of Spark.
Potential next steps
• Auto-generate the wrapper code for new Pandas-based algorithms; e.g., from a Data Science Workbench UI. Or, at the very least, create formal templates that encapsulate the lessons learned.
• Allow the same test data csv files on a laptop to be used unaltered for testing in the deployed Spark environment. Need to verify that the ported algorithms actually work!
• Switch to using @pandas_udf on later versions of Spark.
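For example, a single pair of CSV fixtures could drive both a laptop-level test of the raw algorithm and a test of the wrapped transformer on Spark. The sketch below assumes a pytest-style spark session fixture and reuses the illustrative legacy_algorithm module and DigitalTwinEnginePartXEngineTypeP wrapper from earlier; none of this is the actual GE test code.

import pandas as pd
from pandas.testing import assert_frame_equal

import legacy_algorithm                                   # hypothetical legacy module with execute()
from wrappers import DigitalTwinEnginePartXEngineTypeP    # hypothetical wrapper class from earlier

def test_legacy_algorithm_locally():
    # The same CSVs a data scientist uses on their laptop.
    input_pdf = pd.read_csv('tests/part_x_input.csv')
    expected_pdf = pd.read_csv('tests/part_x_expected.csv')
    results_pdf = legacy_algorithm.execute(input_pdf)
    assert_frame_equal(results_pdf.reset_index(drop=True), expected_pdf, check_dtype=False)

def test_wrapped_algorithm_on_spark(spark):
    # Identical fixture files, pushed through the wrapped transformer instead.
    input_df = spark.read.csv('tests/part_x_input.csv', header=True, inferSchema=True)
    expected_pdf = pd.read_csv('tests/part_x_expected.csv')
    results_pdf = DigitalTwinEnginePartXEngineTypeP().transform(input_df).toPandas()
    assert_frame_equal(results_pdf[list(expected_pdf.columns)].reset_index(drop=True),
                       expected_pdf, check_dtype=False)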
Potential next steps
• Look to optimize the entire pipeline, e.g., by removing Spark actions where possible, such as persisting intermediate results.
• Many existing ‘algorithms’ – especially the digital twin models – are themselves really codified workflows or pipelines of lower-level algorithms.
• So you could convert each algorithm into a pipeline of lower-level algorithms.
• What are different algorithms now would simply become different pipelines, or even the same pipeline of transformers that’s just configured for a different engine part.
Conclusions and recommendations
• Consider wrapping rather than porting to PySpark, especially if the Data Scientists want to develop/test outside Spark.
• ML Pipelines offer a useful paradigm for running workflows of algorithms and saving/reloading them.
• If an algorithm can handle more than one asset at a time then RDD.mapPartitions() might suffice. Otherwise use RDD.groupByKey() or @pandas_udf.
• Push reusable code into a class hierarchy so each concrete wrapper class needs very little code.
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...Databricks
 
Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowDatabricks
 
Databricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks
 
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Databricks
 
Applied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce SettingApplied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce SettingDatabricks
 
Building an AI-Powered Retail Experience with Delta Lake, Spark, and Databricks
Building an AI-Powered Retail Experience with Delta Lake, Spark, and DatabricksBuilding an AI-Powered Retail Experience with Delta Lake, Spark, and Databricks
Building an AI-Powered Retail Experience with Delta Lake, Spark, and DatabricksDatabricks
 
Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs
Automating Predictive Modeling at Zynga with PySpark and Pandas UDFsAutomating Predictive Modeling at Zynga with PySpark and Pandas UDFs
Automating Predictive Modeling at Zynga with PySpark and Pandas UDFsDatabricks
 
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli Spark Summit
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
 
Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...
Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...
Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...Databricks
 
How Spark Enables the Internet of Things- Paula Ta-Shma
How Spark Enables the Internet of Things- Paula Ta-ShmaHow Spark Enables the Internet of Things- Paula Ta-Shma
How Spark Enables the Internet of Things- Paula Ta-ShmaSpark Summit
 
Extending Spark Graph for the Enterprise with Morpheus and Neo4j
Extending Spark Graph for the Enterprise with Morpheus and Neo4jExtending Spark Graph for the Enterprise with Morpheus and Neo4j
Extending Spark Graph for the Enterprise with Morpheus and Neo4jDatabricks
 
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...Spark Summit
 
High-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-AlchemyHigh-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-AlchemyDatabricks
 
AI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionAI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionDatabricks
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Databricks
 
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Databricks
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
 
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...Databricks
 
Spark at Airbnb
Spark at AirbnbSpark at Airbnb
Spark at AirbnbHao Wang
 

Was ist angesagt? (20)

Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
 
Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflow
 
Databricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI Initiatives
 
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
 
Applied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce SettingApplied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce Setting
 
Building an AI-Powered Retail Experience with Delta Lake, Spark, and Databricks
Building an AI-Powered Retail Experience with Delta Lake, Spark, and DatabricksBuilding an AI-Powered Retail Experience with Delta Lake, Spark, and Databricks
Building an AI-Powered Retail Experience with Delta Lake, Spark, and Databricks
 
Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs
Automating Predictive Modeling at Zynga with PySpark and Pandas UDFsAutomating Predictive Modeling at Zynga with PySpark and Pandas UDFs
Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs
 
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 
Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...
Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...
Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...
 
How Spark Enables the Internet of Things- Paula Ta-Shma
How Spark Enables the Internet of Things- Paula Ta-ShmaHow Spark Enables the Internet of Things- Paula Ta-Shma
How Spark Enables the Internet of Things- Paula Ta-Shma
 
Extending Spark Graph for the Enterprise with Morpheus and Neo4j
Extending Spark Graph for the Enterprise with Morpheus and Neo4jExtending Spark Graph for the Enterprise with Morpheus and Neo4j
Extending Spark Graph for the Enterprise with Morpheus and Neo4j
 
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
 
High-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-AlchemyHigh-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-Alchemy
 
AI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionAI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat Detection
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...
 
Spark at Airbnb
Spark at AirbnbSpark at Airbnb
Spark at Airbnb
 

Ähnlich wie Bridging the Gap Between Data Scientists and Software Engineers – Deploying Legacy Python Algorithms to Apache Spark with Minimum Pain

Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Databricks
 
Scaling face recognition with big data - Bogdan Bocse
 Scaling face recognition with big data - Bogdan Bocse Scaling face recognition with big data - Bogdan Bocse
Scaling face recognition with big data - Bogdan BocseITCamp
 
Automated Time Series Analysis using Deep Learning, Ray and Analytics Zoo
Automated Time Series Analysis using Deep Learning, Ray and Analytics ZooAutomated Time Series Analysis using Deep Learning, Ray and Analytics Zoo
Automated Time Series Analysis using Deep Learning, Ray and Analytics ZooJason Dai
 
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataHow To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataTaro L. Saito
 
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkNear Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkAhsan Javed Awan
 
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Spark Summit
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to ProductionMostafa Majidpour
 
Introduction to Amazon EC2 F1 Instances
Introduction to Amazon EC2 F1 Instances Introduction to Amazon EC2 F1 Instances
Introduction to Amazon EC2 F1 Instances Amazon Web Services
 
OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020OpenACC
 
Greenplum for Kubernetes PGConf india 2019
Greenplum for Kubernetes PGConf india 2019Greenplum for Kubernetes PGConf india 2019
Greenplum for Kubernetes PGConf india 2019Goutam Tadi
 
Trends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systemsTrends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systemsIgor José F. Freitas
 
ArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning MetadataArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning MetadataArangoDB Database
 
Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...
Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...
Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...Aerospike
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsStavros Kontopoulos
 
P4_tutorial.pdf
P4_tutorial.pdfP4_tutorial.pdf
P4_tutorial.pdfPramodhN3
 
An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...
An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...
An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...Databricks
 
Leveraging open source for large scale analytics
Leveraging open source for large scale analyticsLeveraging open source for large scale analytics
Leveraging open source for large scale analyticsSouth West Data Meetup
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Ahsan Javed Awan
 
Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3DataWorks Summit
 

Ähnlich wie Bridging the Gap Between Data Scientists and Software Engineers – Deploying Legacy Python Algorithms to Apache Spark with Minimum Pain (20)

Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
 
Scaling face recognition with big data - Bogdan Bocse
 Scaling face recognition with big data - Bogdan Bocse Scaling face recognition with big data - Bogdan Bocse
Scaling face recognition with big data - Bogdan Bocse
 
Automated Time Series Analysis using Deep Learning, Ray and Analytics Zoo
Automated Time Series Analysis using Deep Learning, Ray and Analytics ZooAutomated Time Series Analysis using Deep Learning, Ray and Analytics Zoo
Automated Time Series Analysis using Deep Learning, Ray and Analytics Zoo
 
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataHow To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
 
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkNear Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark
 
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to Production
 
Introduction to Amazon EC2 F1 Instances
Introduction to Amazon EC2 F1 Instances Introduction to Amazon EC2 F1 Instances
Introduction to Amazon EC2 F1 Instances
 
OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020
 
Greenplum for Kubernetes PGConf india 2019
Greenplum for Kubernetes PGConf india 2019Greenplum for Kubernetes PGConf india 2019
Greenplum for Kubernetes PGConf india 2019
 
Trends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systemsTrends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systems
 
“Full Stack” Data Science with R for Startups: Production-ready with Open-Sou...
“Full Stack” Data Science with R for Startups: Production-ready with Open-Sou...“Full Stack” Data Science with R for Startups: Production-ready with Open-Sou...
“Full Stack” Data Science with R for Startups: Production-ready with Open-Sou...
 
ArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning MetadataArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning Metadata
 
Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...
Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...
Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutions
 
P4_tutorial.pdf
P4_tutorial.pdfP4_tutorial.pdf
P4_tutorial.pdf
 
An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...
An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...
An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...
 
Leveraging open source for large scale analytics
Leveraging open source for large scale analyticsLeveraging open source for large scale analytics
Leveraging open source for large scale analytics
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...
 
Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3
 

Mehr von Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Kürzlich hochgeladen

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 

Kürzlich hochgeladen (20)

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 

Bridging the Gap Between Data Scientists and Software Engineers – Deploying Legacy Python Algorithms to Apache Spark with Minimum Pain

  • 11. Working towards a solution • Studied representative legacy algorithms • structure • how do they process data – columns required, sorting of data rows • are any tests available?! E.g., csv files of inputs and expected outputs. • Assumed we couldn’t alter the legacy code at all • so decided to wrap rather than port them to PySpark • i.e., legacy Python algorithm is called in parallel across the cluster. 11GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
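Where csv files of inputs and expected outputs did exist, a small regression test let an unchanged legacy execute() function be checked on a laptop before (and after) wrapping. A minimal sketch only – the module name, file paths and comparison settings are assumptions, not the production harness:

  import pandas as pd
  import my_legacy_algorithm  # hypothetical: the unchanged legacy module with execute(pandas_df)

  def test_execute_matches_expected_output():
      input_data = pd.read_csv("tests/part_x_input.csv")           # hypothetical path
      expected = pd.read_csv("tests/part_x_expected_output.csv")   # hypothetical path
      results = my_legacy_algorithm.execute(input_data)
      # Compare ignoring column order and exact float representation.
      pd.testing.assert_frame_equal(results, expected, check_like=True, check_exact=False)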
  • 12. To wrap or to port a legacy algorithm? 12GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
  Port when… • Performance is critical. • Algorithm is small, simple and easy to test on Spark. • The algorithm’s creator is comfortable working directly with Spark. • Spark skills are available for foreseeable future.
  Wrap when… • You wish to retain the ability to run, test and update the algorithm outside Spark (e.g., on laptop, or in other Big Data frameworks). • An auto-code generation tool is available for generating all the necessary wrapper code.
  • 13. Initially tried wrapping with RDD.mapPartitions()… • Call it after repartitioning the input data by engine id. This worked but… • Could get unexpected key skew effects unless you experiment with the way your data is partitioned. • The data for more than one asset (engine) at a time could be passed into the wrapped algorithm. • ok if the algorithm can handle data for more than one asset; otherwise not. • We really wanted to use @pandas_udf if Spark 2.3+ was available, and its ‘grouped map’ usage means that the data for only one asset gets passed to the algorithm. 13GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
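As an illustration only (the module name and the 'esn' partitioning column are assumptions, not the production code), the mapPartitions() approach looked roughly like this: repartition by engine id, then run the unchanged legacy execute() once per partition, which may still contain several engines.

  import pandas as pd
  import my_legacy_algorithm  # hypothetical legacy module with execute(pandas_df)

  def run_partition(rows):
      rows = list(rows)
      if not rows:
          return iter([])
      # One partition may hold rows for more than one engine.
      pdf = pd.DataFrame(rows, columns=rows[0].__fields__)
      results = my_legacy_algorithm.execute(pdf)
      return (tuple(r) for r in results.itertuples(index=False))

  results_rdd = (input_df.repartition("esn")   # 'esn' = engine serial number column (assumed name)
                         .rdd
                         .mapPartitions(run_partition))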
  • 14. …so then we switched to RDD.groupByKey() • Where key = asset (engine) id. • So the data for only one asset gets passed to the algorithm. • This more closely mirrors the behaviour of @pandas_udf, so this code should be easier to convert to use @pandas_udf later on. • And it will work with algorithms that can only cope with the data for one asset at a time. 14GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
  • 15. Forecasting engine removals – solution components • Multiple Digital Twin models – each one models the wear on a single engine part. • Input Data Predictor - asks all the digital twin models what input data they need, and then predicts those values n years into the future. • Aggregator – compares all the predictions to estimate when a given engine should be removed due to the wear of a particular part. • → All of these were made into ML Pipeline Transformers… 15GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019 (Pipeline diagram: Historic data → Input Data Predictor → PipelineModel of Digital Twin models → Aggregator PipelineModel → Persist results.)
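For illustration, assembling and running such a pipeline might look like the following minimal sketch. The stage class names, the input query and the save path are hypothetical; the real pipeline has more stages and configuration.

  from pyspark.ml import Pipeline
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  historic_df = spark.sql("SELECT ...")  # placeholder: the data needed by all algorithms in the pipeline

  stages = [
      InputDataPredictor(),                 # predicts each twin's input values n years ahead
      DigitalTwinEnginePartXEngineTypeP(),  # one transformer per tracked part
      DigitalTwinEnginePartYEngineTypeP(),
      Aggregator(),                         # compares per-part predictions to estimate removal date
  ]
  # All stages are transformers, so fit() does no training; it just packages them as a PipelineModel.
  pipeline_model = Pipeline(stages=stages).fit(historic_df)
  results_df = pipeline_model.transform(historic_df)
  # Saving the whole configured pipeline requires the custom transformers to be writable
  # (e.g. DefaultParamsWritable); path is hypothetical.
  pipeline_model.write().overwrite().save("/models/engine_type_p_forecast")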
  • 16. Strategy taken • Passed data to algorithms rather than have each algorithm fetch its own data. • algorithm shouldn’t have to know where the data comes from. • Got representative digital twin models working in isolation, using temporary table of predicted input data as input. • Prototyped in notebook environment (Apache Zeppelin). • Eventually incorporated the pipeline into a spark-submit job using a new hierarchy of classes… 16GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
  • 17. Class hierarchy 17GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
  (Class diagram. Key: parent class → child class; ‘A’ = abstract class.)
  • pyspark.ml: Params; Estimator (A); Transformer (A)
  • GE code: GeAnalytic (A); EngineWearModel (A); GroupByKeyEngineWearModel (A); HadoopMapReduceEngineWearModel (A); AnEstimator
  • Code you write: DigitalTwinEnginePartXEngineTypeP; DigitalTwinEnginePartYEngineTypeP
  • Sample mixin classes: HasEsnCol (esnCol); HasDatetimeCol (datetimeCol); HasFleetCol (fleetCol)
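As an illustration of the mixin classes, here is a minimal sketch of HasEsnCol using the standard pyspark.ml.param pattern; the default column name is an assumption and the real GE class may differ:

  from pyspark.ml.param import Param, Params, TypeConverters

  class HasEsnCol(Params):
      """Mixin adding an 'esnCol' param naming the engine serial number column."""
      esnCol = Param(Params._dummy(), "esnCol",
                     "name of the engine serial number (asset id) column",
                     typeConverter=TypeConverters.toString)

      def __init__(self):
          super(HasEsnCol, self).__init__()
          self._setDefault(esnCol="esn")  # default column name is an assumption

      def setEsnCol(self, value):
          return self._set(esnCol=value)

      def getEsnCol(self):
          return self.getOrDefault(self.esnCol)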
  • 18. Zooming in… 18GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
  • EngineWearModel (A) – abstract methods: _transform(), _handleMissingData(), _processInputColumns(), _processResultsColumns()
  • GroupByKeyEngineWearModel (A) – concrete methods: _transform(), _runAnalyticForOneAsset(), _convertRddOfPandasDataFramesToSparkDataFrame()
  • DigitalTwinEnginePartXEngineTypeP – concrete methods: _handleMissingData(), _processInputColumns(), _processResultsColumns()
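To show how little code each concrete wrapper then needs, here is a hedged sketch of a digital twin class. The column names, the setExecuteModuleName() setter and the module path are assumptions for illustration, not the production class:

  class DigitalTwinEnginePartXEngineTypeP(GroupByKeyEngineWearModel, HasEsnCol, HasDatetimeCol):
      """Wraps the legacy part-X wear algorithm for engine type P (illustrative only)."""

      def __init__(self):
          super(DigitalTwinEnginePartXEngineTypeP, self).__init__()
          # Name the unchanged legacy module whose execute() will be called (hypothetical path):
          self.setExecuteModuleName("algorithms.part_x_engine_type_p")

      def _handleMissingData(self, dataset):
          # Drop rows lacking the columns this twin needs (column names are assumptions).
          return dataset.dropna(subset=["egt_margin", "cycles"])

      def _processInputColumns(self, dataset):
          # Rename columns to whatever the legacy algorithm expects (assumption).
          return dataset.withColumnRenamed("egt_margin", "EGT_MARGIN")

      def _processResultsColumns(self, results_df):
          # Tidy the algorithm's output columns before joining back (assumption).
          return results_df.withColumnRenamed("predicted_removal_date", "part_x_removal_date")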
  • 19. _transform() method 19GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
  Note: Implement these methods in each DigitalTwinXXX class.
  (GroupByKeyEngineWearModel concrete methods: _transform(), _runAnalyticForOneAsset(), _convertRddOfPandasDataFramesToSparkDataFrame().)

  def _transform(self, dataset):
      no_nulls_data = self._handleMissingData(dataset)
      data_with_processed_input_columns = self._processInputColumns(no_nulls_data)
      asset_col = self.getEsnCol()
      grouped_input_rdd = data_with_processed_input_columns \
          .rdd.map(lambda row: (row[asset_col], row)).groupByKey().mapValues(list)
      results_rdd = grouped_input_rdd.mapValues(
          self._runAnalyticForOneAsset(self.getFailFast(), asset_col))
      results_df = self._convertRddOfPandasDataFramesToSparkDataFrame(results_rdd)
      processed_results = self._processResultsColumns(results_df)
      output_df = dataset.join(processed_results, asset_col, 'left_outer')
      return output_df
  • 20. Invoking the legacy Pandas-based algorithm 20GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019

  def _runAnalyticForOneAsset(self, failFast, assetCol):
      # Import the named legacy algorithm:
      pandas_based_analytic_module = importlib.import_module(
          self.getExecuteModuleName())  # A param set by each digital twin class.

      def _assetExecute(assetData):
          # Convert row data for asset into a Pandas DataFrame:
          rows = list(assetData)
          column_names = rows[0].__fields__
          input_data = pd.DataFrame(rows, columns=column_names)
          try:
              results = pandas_based_analytic_module.execute(input_data)  # Call legacy algorithm.
          except Exception as e:
              asset_id = input_data[assetCol].iloc[0]
              ex = Exception("Encountered %s whilst processing asset id '%s'"
                             % (e.__class__.__name__, asset_id), e.args[0])
              if failFast:
                  raise ex  # Fail immediately, report error to driver node.
              else:
                  # Log error message silently in the Spark executor's logs:
                  error_msg = "Silently ignoring this error: %s" % ex
                  print(datetime.now().strftime("%y/%m/%d %H:%M:%S : ") + error_msg)
                  return error_msg
          return results

      return _assetExecute

  (Slide callouts: input_data is a Pandas DataFrame; results is also a Pandas DataFrame. Concrete method of GroupByKeyEngineWearModel.)
  • 21. (Repeats the _runAnalyticForOneAsset() code from slide 20, shown full-screen.)
  • 22. Converting the results back into a Spark DataFrame 22GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
  (Concrete method of GroupByKeyEngineWearModel.)

  def _convertRddOfPandasDataFramesToSparkDataFrame(self, resultsRdd):
      errors_rdd = resultsRdd.filter(lambda results: not (isinstance(results[1], pd.DataFrame)))
      if not (errors_rdd.isEmpty()):
          print("Possible errors: %s" % errors_rdd.collect())
      valid_results_rdd = resultsRdd.filter(lambda results: isinstance(results[1], pd.DataFrame))
      if valid_results_rdd.isEmpty():
          raise RuntimeError("ABORT! No valid results were obtained!")
      # Convert the Pandas dataframes into lists and flatten into one list.
      flattened_results_rdd = valid_results_rdd.flatMapValues(
          lambda pdf: (r.tolist() for r in pdf.to_records(index=False))).values()
      # Create Spark DataFrame, using a schema made from that of the first Pandas DataFrame.
      spark = SparkSession.builder.getOrCreate()
      first_pdf = valid_results_rdd.first()[1]  # Pandas DataFrame
      first_pdf_schema = spark.createDataFrame(first_pdf).schema
      return spark.createDataFrame(flattened_results_rdd, first_pdf_schema)
  • 23. Algorithms before and after wrapping 23GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
  (BEFORE = standalone legacy algorithm; AFTER = same algorithm wrapped for a PySpark ML Pipeline.)
  • Hosting location – BEFORE: on a single node or laptop. AFTER: in a platform that runs spark-submit jobs on a schedule.
  • Configuration – BEFORE: held in a separate JSON config file for each algorithm. AFTER: stored in params of the ML Pipeline, which can be saved and loaded from disk for the whole pipeline; config is part of the pipeline itself.
  • Acquisition of input data – BEFORE: each algorithm fetched its own input data: made a separate Hive query, wrote its input data to csv, then read it into a single in-memory Pandas DataFrame for all applicable engines. AFTER: a PySpark spark.sql(“SELECT …”) statement for the data required by all the algorithms in the pipeline; passed as a Spark DataFrame into the transform() method for the whole pipeline.
  • All asset (engine) data – BEFORE: held in-memory on a single machine. AFTER: spread across the executors of the Spark cluster.
  • Writing of results – BEFORE: each algorithm wrote its output to csv, which was then loaded into Hive as a separate table. AFTER: each algorithm appends a column of output to the Spark DataFrame that’s passed from one transform() to the next in the pipeline.
  • Programming paradigm – BEFORE: written as an execute() function which called other functions. AFTER: inherits from a specialised pyspark.ml Transformer class.
  • 24. But it wasn’t all a bed of roses! Challenges… • Pipeline wasn’t really a simple linear pipeline • Digital twin models operate independently – so could really be run in parallel. • Many digital twins need to query data that’s in a different shape to the data that’s passed into the transform() method for the whole ML Pipeline. • Converting the Pandas DataFrames back into Spark DataFrames without hitting data-type conversion issues at runtime was tricky! 24GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019 (Pipeline diagram: Historic data → Input Data Predictor → PipelineModel of Digital Twin models → Aggregator → Persist results, with ‘Other data’ as an additional input.)
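One way to side-step the data-type conversion issue mentioned above is to declare the result schema explicitly rather than inferring it from the first Pandas DataFrame. A hedged sketch only – the column names are hypothetical, and flattened_results_rdd and spark refer to the variables in the conversion method above:

  from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

  results_schema = StructType([
      StructField("esn", StringType(), True),                     # engine serial number (assumed name)
      StructField("predicted_removal_date", TimestampType(), True),
      StructField("wear_fraction", DoubleType(), True),
  ])

  # Pandas integer columns silently become floats when nulls appear, timestamps must be
  # datetime64[ns], etc., so coercing each results DataFrame to the declared schema before
  # building the Spark DataFrame avoids inference surprises at runtime.
  spark_results = spark.createDataFrame(flattened_results_rdd, results_schema)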
  • 25. More challenges • Debugging can be tricky! Tips: • failFast flag – True - stop processing if any asset throws an exception. Useful when debugging. – False - silently log an error message for any asset that throws an exception, but continue processing for other assets. Useful in production. • run with fewer engines and/or fleets when testing; gradually expand out. • Even simple things have to be encoded as extra transformers in the pipeline or added as extra params. • e.g., persisting data, when required, between different stages in the pipeline 25GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
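For example, when debugging we would restrict the input to a handful of engines before calling transform(); the esn values below are made up, and the setFailFast() setter name is an assumption based on the failFast param described above:

  from pyspark.sql.functions import col

  # Surface the first per-asset exception immediately while debugging:
  for stage in pipeline_model.stages:
      if hasattr(stage, "setFailFast"):
          stage.setFailFast(True)

  debug_df = historic_df.filter(col("esn").isin(["123456", "123457"]))  # a couple of test engines
  debug_results = pipeline_model.transform(debug_df)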
  • 26. Benefits of this approach • Much more reliable – don’t run out of memory any more! • Will scale with the number of assets as the engine fleet grows. • Whole forecasting scenario runs as a single ML PipelineModel - one per engine type/config. • Consistent approach (and column names!) across the algorithms. 26GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
  • 27. Key benefit Data scientists who know little/nothing about Spark... • can still develop and test their algorithm outside Spark on their own laptop, and… • yet still have it deployed to Spark to scale with Big Data☺. You don’t have to rewrite each algorithm in PySpark to use the power of Spark. 27GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
  • 28. Potential next steps • Auto-generate the wrapper code for new Pandas-based algorithms; e.g., from a Data Science Workbench UI. Or, at the very least, create formal templates that encapsulate the lessons learned. • Allow the same test data csv files on a laptop to be used unaltered for testing in the deployed Spark environment. Need to verify that the ported algorithms actually work! • Switch to using @pandas_udf on later versions of Spark. 28GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
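For reference, a @pandas_udf version of the wrapper could look like this minimal sketch in the Spark 2.3/2.4 grouped-map style (in Spark 3 the equivalent is groupby(...).applyInPandas). The result schema, column name and module name are assumptions:

  import pandas as pd
  from pyspark.sql.functions import pandas_udf, PandasUDFType
  import my_legacy_algorithm  # hypothetical legacy module with execute(pandas_df)

  results_schema = "esn string, predicted_removal_date timestamp, wear_fraction double"

  @pandas_udf(results_schema, PandasUDFType.GROUPED_MAP)
  def run_one_asset(pdf):
      # pdf holds the rows for exactly one engine, already as a Pandas DataFrame.
      return my_legacy_algorithm.execute(pdf)

  results_df = input_df.groupby("esn").apply(run_one_asset)

  # Spark 3.x equivalent, without the decorator:
  # results_df = input_df.groupby("esn").applyInPandas(
  #     lambda pdf: my_legacy_algorithm.execute(pdf), schema=results_schema)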
  • 29. Potential next steps • Look to optimize the entire pipeline, e.g., by removing Spark actions where possible, such as persisting intermediate results. • Many existing ‘algorithms’ – especially the digital twin models - are themselves really codified workflows or pipelines of lower-level algorithms. • so you could convert each algorithm into a pipeline of lower-level algorithms. • what are different algorithms now would simply become different pipelines; or even the same pipeline of transformers that’s just configured for a different engine part. 29GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
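For example, what is a separate ‘part X’ algorithm today might become one generic pipeline re-parameterised for each part. A speculative sketch with hypothetical lower-level transformer names:

  from pyspark.ml import Pipeline

  def build_wear_pipeline(part_name, wear_rate_per_cycle):
      """Same generic stages, configured for one tracked part (illustrative only)."""
      return Pipeline(stages=[
          SelectPartSensors(partName=part_name),          # hypothetical lower-level transformers
          EstimateWearRate(ratePerCycle=wear_rate_per_cycle),
          PredictRemovalDate(partName=part_name),
      ])

  part_x_pipeline = build_wear_pipeline("part_x", 1.2e-5)
  part_y_pipeline = build_wear_pipeline("part_y", 3.4e-5)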
  • 30. Conclusions and recommendations • Consider wrapping rather than porting to PySpark, especially if the Data Scientists want to develop/test outside Spark. • ML Pipelines offers a useful paradigm for running workflows of algorithms and saving/reloading them. • If algorithm can handle > 1 asset at a time then RDD.mapPartitions() might suffice. Otherwise use RDD.groupByKey() or @pandas_udf. • Push reusable code into a class hierarchy so each concrete wrapper class needs very little code. 30GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019