GE Aviation Digital
17 Oct 2019
Dr Lucas Partridge
Dr Peter Knight

Bridging the Gap Between Data Scientists and Software Engineers
Deploying legacy Python algorithms to Apache Spark with minimum pain
#UnifiedDataAnalytics #SparkAISummit
About us
Peter Knight (Data Scientist)
- predicts wear on aircraft engines to minimize unplanned downtime.
Lucas Partridge (Software Engineer)
- helps the Data Scientists scale up their algorithms for big data.
Outline
• About GE Aviation
• The problem
• Starting point
• Approach taken
• Some code
• Challenges
• Benefits
• Conclusions and recommendations
General Electric - Aviation
• 48k employees
• $30.6B revenue - 2018
• >33k commercial engines
“Every two seconds, an aircraft powered by GE technology takes off somewhere in the world”
General problem
• GE Aviation has 100s of data scientists and engineers developing
Python algorithms.
• But most develop and test their algorithms on their local machines and
don’t have the time to learn Spark.
• Spark = good candidate to make these algorithms scale as the engine
fleet grows.
• But how do we deploy these legacy algorithms to Spark as quickly as
possible?
Specific problem
• Forecasting when aircraft engines should be removed for maintenance.
• So we can predict what engine parts will be needed, where, and when.
• ‘Digital Twin’ model exists for each important part to be tracked.
• Tens of engine lines, tens of tracked parts → 100s of algorithms to scale!
Starting point – a typical legacy Python algorithm
def execute(input_data):   # input_data is a Pandas DataFrame!
    # Calculate results from input data
    # …
    return results         # results is also a Pandas DataFrame!
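For concreteness, a purely hypothetical algorithm in this style might look like the sketch below. The column names ('esn', 'cycles', 'egt_margin') and the wear formula are invented; the important part is the contract the wrappers rely on: a Pandas DataFrame of raw engine data in, a Pandas DataFrame of results out.

import pandas as pd

def execute(input_data):
    # input_data: Pandas DataFrame with (illustrative) columns 'esn', 'cycles', 'egt_margin'.
    df = input_data.sort_values('cycles')
    # Toy wear model: wear grows with accumulated cycles and with a shrinking EGT margin.
    df['wear'] = df['cycles'] / 1000.0 + (100.0 - df['egt_margin']) * 0.01
    results = df.groupby('esn')['wear'].max().reset_index()
    results['predicted_removal_cycles'] = 20000 - results['wear'] * 500
    return results  # one row per engine serial number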
The legacy Python algorithms
• Used Pandas DataFrames.
• Were run on laptops. Didn’t exploit Spark.
• Each algorithm was run independently.
• Each fetched its own data and read from, and wrote to, csv files.
• Some Java Hadoop Map-Reduce and R ones too - not covered here.
The legacy Python algorithms
• Often failed at runtime.
• Typically processed data for more than one asset (engine) at a time;
they often tried to process all engines!
• All the data would be read into a single Pandas DataFrame → ran out of memory!
The legacy Python algorithms
• Weren’t consistently written
• function arguments vs globals
• using different names for the same data column.
• Had complex config in JSON – hard to do what-if runs.
• Other considerations:
• The problem domain suggested the need for a pipeline of algorithms.
• Few data scientists and software engineers know about Spark, much less about
ML Pipelines!
Working towards a solution
• Studied representative legacy algorithms
• structure
• how do they process data – columns required, sorting of data rows
• are any tests available?! E.g., csv files of inputs and expected outputs.
• Assumed we couldn’t alter the legacy code at all
• so decided to wrap rather than port them to PySpark
• i.e., legacy Python algorithm is called in parallel across the cluster.
To wrap or to port a legacy algorithm?

Port when…
• Performance is critical.
• The algorithm is small, simple and easy to test on Spark.
• The algorithm’s creator is comfortable working directly with Spark.
• Spark skills are available for the foreseeable future.

Wrap when…
• You wish to retain the ability to run, test and update the algorithm outside Spark (e.g., on a laptop, or in other Big Data frameworks).
• An auto-code generation tool is available for generating all the necessary wrapper code.
Initially tried wrapping with RDD.mapPartitions()…
• Call it after repartitioning the input data by engine id. This worked, but…
• Could get unexpected key-skew effects unless you experiment with the way your data is partitioned.
• The data for more than one asset (engine) at a time could be passed into the wrapped algorithm.
• OK if the algorithm can handle data for more than one asset; otherwise not.
• We really wanted to use @pandas_udf if Spark 2.3+ was available, since its ‘grouped map’ usage means that the data for only one asset gets passed to the algorithm.
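A minimal sketch of that first approach is shown below. The column name ('esn') and the legacy_algorithm module are illustrative, and the real wrapper also had to rebuild a Spark DataFrame from the results, as shown later.

import pandas as pd
import legacy_algorithm   # hypothetical module exposing execute(pandas_df) -> pandas_df

def run_partition(rows_iter):
    rows = list(rows_iter)
    if not rows:
        return iter([])
    # After repartitioning by engine id, a partition may still hold several engines' rows.
    pdf = pd.DataFrame(rows, columns=rows[0].__fields__)
    results = legacy_algorithm.execute(pdf)    # Pandas DataFrame in, Pandas DataFrame out
    return iter(results.to_dict('records'))    # hand plain rows back to Spark

results_rdd = input_df.repartition('esn').rdd.mapPartitions(run_partition)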
…so then we switched to RDD.groupByKey()
• Where key = asset (engine) id.
• So the data for only one asset gets passed to the algorithm.
• This more closely mirrors the behaviour of @pandas_udf, so this code should be
easier to convert to use @pandas_udf later on.
• And it will work with algorithms that can only cope with the data for one asset at a time.
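For reference, the grouped-map @pandas_udf equivalent on Spark 2.3+ looks roughly like the sketch below. The output schema and column name are illustrative; in practice the schema would be derived from the legacy algorithm's output.

from pyspark.sql.functions import pandas_udf, PandasUDFType
import legacy_algorithm   # hypothetical module exposing execute(pandas_df) -> pandas_df

# The output schema must be declared up front (illustrative columns).
result_schema = "esn string, predicted_removal_cycles double"

@pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
def run_for_one_asset(pdf):
    # pdf holds the rows for exactly one engine (one group).
    return legacy_algorithm.execute(pdf)

results_df = input_df.groupby('esn').apply(run_for_one_asset)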
Forecasting engine removals – solution components
• Multiple Digital Twin models – each one models the wear on a single engine part.
• Input Data Predictor – asks all the digital twin models what input data they need, and then predicts those values n years into the future.
• Aggregator – compares all the predictions to estimate when a given engine should be removed due to the wear of a particular part.
• → All of these were made into ML Pipeline Transformers…
(Pipeline diagram: Historic data → Input Data Predictor → PipelineModel of Digital Twin models → Aggregator PipelineModel → Persist results)
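A heavily simplified sketch of how such a scenario might be assembled as a single pipeline is shown below. The component names are taken from the slides, but their constructors, the input DataFrame and the output table name are illustrative, not the actual GE code.

from pyspark.ml import Pipeline

# Hypothetical transformer instances – each wraps one legacy algorithm.
stages = [
    InputDataPredictor(),                   # predicts the required input data n years ahead
    DigitalTwinEnginePartXEngineTypeP(),    # one wrapped digital twin per tracked part
    DigitalTwinEnginePartYEngineTypeP(),
    Aggregator(),                           # combines the per-part predictions
]

pipeline_model = Pipeline(stages=stages).fit(historic_df)   # fit() is a pass-through for pure Transformers
forecast_df = pipeline_model.transform(historic_df)
forecast_df.write.mode('overwrite').saveAsTable('engine_removal_forecast')   # illustrative table name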
Strategy taken
• Passed data to algorithms rather than have each algorithm fetch its own data.
• An algorithm shouldn’t have to know where the data comes from.
• Got representative digital twin models working in isolation, using a temporary table of predicted input data as input.
• Prototyped in a notebook environment (Apache Zeppelin).
• Eventually incorporated the pipeline into a spark-submit job using a new hierarchy of classes…
Class hierarchy
(Class diagram; abstract classes are marked as such on the original slide.)
• pyspark.ml: Params, Estimator (abstract) and Transformer (abstract).
• GE code: GeAnalytic (abstract), EngineWearModel (abstract), and its children GroupByKeyEngineWearModel (abstract) and HadoopMapReduceEngineWearModel (abstract).
• Code you write: concrete classes such as DigitalTwinEnginePartXEngineTypeP and DigitalTwinEnginePartYEngineTypeP.
• Sample mixin classes: HasEsnCol (esnCol), HasDatetimeCol (datetimeCol), HasFleetCol (fleetCol).
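For example, a HasEsnCol mixin might be written in the style of pyspark.ml's own shared-param classes. The sketch below is an assumption about how such a mixin could look, not the actual GE implementation.

from pyspark.ml.param import Param, Params, TypeConverters

class HasEsnCol(Params):
    """Mixin for algorithms that need to know which column holds the engine serial number."""
    esnCol = Param(Params._dummy(), "esnCol",
                   "name of the engine serial number column",
                   typeConverter=TypeConverters.toString)

    def __init__(self):
        super(HasEsnCol, self).__init__()
        self._setDefault(esnCol='esn')   # illustrative default column name

    def setEsnCol(self, value):
        return self._set(esnCol=value)

    def getEsnCol(self):
        return self.getOrDefault(self.esnCol)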
Zooming in…

EngineWearModel (abstract)
Abstract methods:
_transform()
_handleMissingData()
_processInputColumns()
_processResultsColumns()

GroupByKeyEngineWearModel (abstract)
Concrete methods:
_transform()
_runAnalyticForOneAsset()
_convertRddOfPandasDataFramesToSparkDataFrame()

DigitalTwinEnginePartXEngineTypeP
Concrete methods:
_handleMissingData()
_processInputColumns()
_processResultsColumns()
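With the reusable logic pushed up into GroupByKeyEngineWearModel, a concrete wrapper class only has to name its legacy module and massage column names. Below is a sketch of what such a class might look like; the module path, column names and the setExecuteModuleName() setter are assumptions for illustration, not the actual GE code.

class DigitalTwinEnginePartXEngineTypeP(GroupByKeyEngineWearModel):
    """Wraps the legacy part-X wear algorithm for engine type P."""

    def __init__(self):
        super(DigitalTwinEnginePartXEngineTypeP, self).__init__()
        # Name of the legacy module whose execute() will be imported on the executors.
        self.setExecuteModuleName('ge.wear.part_x_type_p')   # illustrative module path

    def _handleMissingData(self, dataset):
        # The legacy algorithm cannot cope with null EGT margins (illustrative rule).
        return dataset.dropna(subset=['egt_margin'])

    def _processInputColumns(self, dataset):
        # Rename platform columns to the names the legacy code expects.
        return dataset.withColumnRenamed('engine_serial_number', self.getEsnCol())

    def _processResultsColumns(self, dataset):
        # Prefix the outputs so they don't collide with other digital twins in the pipeline.
        return dataset.withColumnRenamed('wear', 'part_x_wear')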
_transform() method
Note: implement the _handleMissingData(), _processInputColumns() and _processResultsColumns() methods in each DigitalTwinXXX class.
def _transform(self, dataset):
    no_nulls_data = self._handleMissingData(dataset)
    data_with_processed_input_columns = self._processInputColumns(no_nulls_data)
    asset_col = self.getEsnCol()
    # Group the rows by asset (engine) id so each asset's data is processed together.
    grouped_input_rdd = (data_with_processed_input_columns
                         .rdd.map(lambda row: (row[asset_col], row))
                         .groupByKey().mapValues(list))
    # Run the wrapped legacy algorithm once per asset, in parallel across the cluster.
    results_rdd = grouped_input_rdd.mapValues(
        self._runAnalyticForOneAsset(self.getFailFast(), asset_col))
    results_df = self._convertRddOfPandasDataFramesToSparkDataFrame(results_rdd)
    processed_results = self._processResultsColumns(results_df)
    output_df = dataset.join(processed_results, asset_col, 'left_outer')
    return output_df
Invoking the legacy Pandas-based algorithm
# Assumes these imports at module level: importlib, pandas as pd, and datetime from datetime.
def _runAnalyticForOneAsset(self, failFast, assetCol):
    # Import the named legacy algorithm:
    pandas_based_analytic_module = importlib.import_module(
        self.getExecuteModuleName())  # A param set by each digital twin class.

    def _assetExecute(assetData):
        # Convert row data for asset into a Pandas DataFrame:
        rows = list(assetData)
        column_names = rows[0].__fields__
        input_data = pd.DataFrame(rows, columns=column_names)
        try:
            # Call the legacy algorithm; input_data and results are both Pandas DataFrames.
            results = pandas_based_analytic_module.execute(input_data)
        except Exception as e:
            asset_id = input_data[assetCol].iloc[0]
            ex = Exception("Encountered %s whilst processing asset id '%s'"
                           % (e.__class__.__name__, asset_id), e.args[0])
            if failFast:
                raise ex  # Fail immediately, report error to driver node.
            else:
                # Log error message silently in the Spark executor's logs:
                error_msg = "Silently ignoring this error: %s" % ex
                print(datetime.now().strftime("%y/%m/%d %H:%M:%S : ") + error_msg)
                return error_msg
        return results

    return _assetExecute
Converting the results back into a Spark DataFrame
# Assumes these imports at module level: pandas as pd, and SparkSession from pyspark.sql.
def _convertRddOfPandasDataFramesToSparkDataFrame(self, resultsRdd):
    # Anything that isn't a Pandas DataFrame is an error message from a failed asset.
    errors_rdd = resultsRdd.filter(lambda results: not isinstance(results[1], pd.DataFrame))
    if not errors_rdd.isEmpty():
        print("Possible errors: %s" % errors_rdd.collect())
    valid_results_rdd = resultsRdd.filter(lambda results: isinstance(results[1], pd.DataFrame))
    if valid_results_rdd.isEmpty():
        raise RuntimeError("ABORT! No valid results were obtained!")
    # Convert the Pandas DataFrames into lists of rows and flatten into one RDD of rows.
    flattened_results_rdd = valid_results_rdd.flatMapValues(
        lambda pdf: (r.tolist() for r in pdf.to_records(index=False))).values()
    # Create a Spark DataFrame, using a schema made from that of the first Pandas DataFrame.
    spark = SparkSession.builder.getOrCreate()
    first_pdf = valid_results_rdd.first()[1]  # Pandas DataFrame
    first_pdf_schema = spark.createDataFrame(first_pdf).schema
    return spark.createDataFrame(flattened_results_rdd, first_pdf_schema)
Algorithms before and after wrapping
(BEFORE = standalone legacy algorithm; AFTER = same algorithm wrapped for a PySpark ML Pipeline)

Hosting location
BEFORE: On a single node or laptop.
AFTER: In a platform that runs spark-submit jobs on a schedule.

Configuration
BEFORE: Held in a separate JSON config file for each algorithm.
AFTER: Stored in params of the ML Pipeline, which can be saved to and loaded from disk for the whole pipeline. Config is part of the pipeline itself.

Acquisition of input data
BEFORE: Each algorithm fetched its own input data: made a separate Hive query, wrote its input data to csv, then read it into a single in-memory Pandas DataFrame for all applicable engines.
AFTER: A PySpark spark.sql("SELECT …") statement for the data required by all the algorithms in the pipeline. Passed as a Spark DataFrame into the transform() method for the whole pipeline.

All asset (engine) data
BEFORE: Held in memory on a single machine.
AFTER: Spread across the executors of the Spark cluster.

Writing of results
BEFORE: Each algorithm wrote its output to csv, which was then loaded into Hive as a separate table.
AFTER: Each algorithm appends a column of output to the Spark DataFrame that’s passed from one transform() to the next in the pipeline.

Programming paradigm
BEFORE: Written as an execute() function which called other functions.
AFTER: Inherits from a specialised pyspark.ml.Transformer class.
But it wasn’t all a bed of roses! Challenges…
• The pipeline wasn’t really a simple linear pipeline:
• Digital twin models operate independently – so could really be run in parallel.
• Many digital twins need to query data that’s in a different shape to the data that’s passed into the transform() method for the whole ML Pipeline.
• Converting the Pandas DataFrames back into Spark DataFrames without hitting data-type conversion issues at runtime was tricky!
(Pipeline diagram: Historic data and Other data → Input Data Predictor → PipelineModel of Digital Twin models → Aggregator → Persist results)
More challenges
• Debugging can be tricky! Tips:
• failFast flag:
  – True: stop processing if any asset throws an exception. Useful when debugging.
  – False: silently log an error message for any asset that throws an exception, but continue processing for the other assets. Useful in production.
• Run with fewer engines and/or fleets when testing; gradually expand out.
• Even simple things have to be encoded as extra transformers in the pipeline or added as extra params.
• E.g., persisting data, when required, between different stages in the pipeline.
Benefits of this approach
• Much more reliable – don’t run out of memory any more!
• Will scale with the number of assets as the engine fleet grows.
• Whole forecasting scenario runs as a single ML PipelineModel – one per engine type/config.
• Consistent approach (and column names!) across the algorithms.
Key benefit
Data scientists who know little/nothing about Spark...
• can still develop and test their algorithm outside Spark on their own laptop, and…
• yet still have it deployed to Spark to scale with Big Data ☺.
You don’t have to rewrite each algorithm in PySpark to use the power of Spark.
Potential next steps
• Auto-generate the wrapper code for new Pandas-based algorithms; e.g., from a Data Science Workbench UI. Or, at the very least, create formal templates that encapsulate the lessons learned.
• Allow the same test data csv files on a laptop to be used unaltered for testing in the deployed Spark environment. Need to verify that the ported algorithms actually work!
• Switch to using @pandas_udf on later versions of Spark.
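For example, a single pair of CSV fixtures could drive both a laptop-level test of the raw algorithm and a test of the wrapped transformer on Spark. The sketch below assumes a pytest-style spark session fixture and reuses the illustrative legacy_algorithm module and DigitalTwinEnginePartXEngineTypeP wrapper from earlier; none of this is the actual GE test code.

import pandas as pd
from pandas.testing import assert_frame_equal

import legacy_algorithm                                   # hypothetical legacy module with execute()
from wrappers import DigitalTwinEnginePartXEngineTypeP    # hypothetical wrapper class from earlier

def test_legacy_algorithm_locally():
    # The same CSVs a data scientist uses on their laptop.
    input_pdf = pd.read_csv('tests/part_x_input.csv')
    expected_pdf = pd.read_csv('tests/part_x_expected.csv')
    results_pdf = legacy_algorithm.execute(input_pdf)
    assert_frame_equal(results_pdf.reset_index(drop=True), expected_pdf, check_dtype=False)

def test_wrapped_algorithm_on_spark(spark):
    # Identical fixture files, pushed through the wrapped transformer instead.
    input_df = spark.read.csv('tests/part_x_input.csv', header=True, inferSchema=True)
    expected_pdf = pd.read_csv('tests/part_x_expected.csv')
    results_pdf = DigitalTwinEnginePartXEngineTypeP().transform(input_df).toPandas()
    assert_frame_equal(results_pdf[list(expected_pdf.columns)].reset_index(drop=True),
                       expected_pdf, check_dtype=False)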
Potential next steps
• Look to optimize the entire pipeline, e.g., by removing Spark actions where possible, such as persisting intermediate results.
• Many existing ‘algorithms’ – especially the digital twin models – are themselves really codified workflows or pipelines of lower-level algorithms.
• So you could convert each algorithm into a pipeline of lower-level algorithms.
• What are different algorithms now would simply become different pipelines, or even the same pipeline of transformers that’s just configured for a different engine part.
Conclusions and recommendations
• Consider wrapping rather than porting to PySpark, especially if the Data Scientists want to develop/test outside Spark.
• ML Pipelines offer a useful paradigm for running workflows of algorithms and saving/reloading them.
• If an algorithm can handle more than one asset at a time then RDD.mapPartitions() might suffice. Otherwise use RDD.groupByKey() or @pandas_udf.
• Push reusable code into a class hierarchy so each concrete wrapper class needs very little code.
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...Databricks
 
Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowDatabricks
 
Databricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks
 
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Databricks
 
Applied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce SettingApplied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce SettingDatabricks
 
Building an AI-Powered Retail Experience with Delta Lake, Spark, and Databricks
Building an AI-Powered Retail Experience with Delta Lake, Spark, and DatabricksBuilding an AI-Powered Retail Experience with Delta Lake, Spark, and Databricks
Building an AI-Powered Retail Experience with Delta Lake, Spark, and DatabricksDatabricks
 
Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs
Automating Predictive Modeling at Zynga with PySpark and Pandas UDFsAutomating Predictive Modeling at Zynga with PySpark and Pandas UDFs
Automating Predictive Modeling at Zynga with PySpark and Pandas UDFsDatabricks
 
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli Spark Summit
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
 
Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...
Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...
Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...Databricks
 
How Spark Enables the Internet of Things- Paula Ta-Shma
How Spark Enables the Internet of Things- Paula Ta-ShmaHow Spark Enables the Internet of Things- Paula Ta-Shma
How Spark Enables the Internet of Things- Paula Ta-ShmaSpark Summit
 
Extending Spark Graph for the Enterprise with Morpheus and Neo4j
Extending Spark Graph for the Enterprise with Morpheus and Neo4jExtending Spark Graph for the Enterprise with Morpheus and Neo4j
Extending Spark Graph for the Enterprise with Morpheus and Neo4jDatabricks
 
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...Spark Summit
 
High-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-AlchemyHigh-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-AlchemyDatabricks
 
AI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionAI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionDatabricks
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Databricks
 
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Databricks
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
 
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...Databricks
 
Spark at Airbnb
Spark at AirbnbSpark at Airbnb
Spark at AirbnbHao Wang
 

Was ist angesagt? (20)

Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
 
Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflow
 
Databricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI Initiatives
 
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
 
Applied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce SettingApplied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce Setting
 
Building an AI-Powered Retail Experience with Delta Lake, Spark, and Databricks
Building an AI-Powered Retail Experience with Delta Lake, Spark, and DatabricksBuilding an AI-Powered Retail Experience with Delta Lake, Spark, and Databricks
Building an AI-Powered Retail Experience with Delta Lake, Spark, and Databricks
 
Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs
Automating Predictive Modeling at Zynga with PySpark and Pandas UDFsAutomating Predictive Modeling at Zynga with PySpark and Pandas UDFs
Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs
 
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 
Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...
Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...
Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...
 
How Spark Enables the Internet of Things- Paula Ta-Shma
How Spark Enables the Internet of Things- Paula Ta-ShmaHow Spark Enables the Internet of Things- Paula Ta-Shma
How Spark Enables the Internet of Things- Paula Ta-Shma
 
Extending Spark Graph for the Enterprise with Morpheus and Neo4j
Extending Spark Graph for the Enterprise with Morpheus and Neo4jExtending Spark Graph for the Enterprise with Morpheus and Neo4j
Extending Spark Graph for the Enterprise with Morpheus and Neo4j
 
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
 
High-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-AlchemyHigh-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-Alchemy
 
AI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionAI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat Detection
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...
 
Spark at Airbnb
Spark at AirbnbSpark at Airbnb
Spark at Airbnb
 

Ähnlich wie Bridging the Gap Between Data Scientists and Software Engineers – Deploying Legacy Python Algorithms to Apache Spark with Minimum Pain

Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Databricks
 
Scaling face recognition with big data - Bogdan Bocse
 Scaling face recognition with big data - Bogdan Bocse Scaling face recognition with big data - Bogdan Bocse
Scaling face recognition with big data - Bogdan BocseITCamp
 
Automated Time Series Analysis using Deep Learning, Ray and Analytics Zoo
Automated Time Series Analysis using Deep Learning, Ray and Analytics ZooAutomated Time Series Analysis using Deep Learning, Ray and Analytics Zoo
Automated Time Series Analysis using Deep Learning, Ray and Analytics ZooJason Dai
 
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataHow To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataTaro L. Saito
 
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkNear Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkAhsan Javed Awan
 
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Spark Summit
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to ProductionMostafa Majidpour
 
Introduction to Amazon EC2 F1 Instances
Introduction to Amazon EC2 F1 Instances Introduction to Amazon EC2 F1 Instances
Introduction to Amazon EC2 F1 Instances Amazon Web Services
 
OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020OpenACC
 
Greenplum for Kubernetes PGConf india 2019
Greenplum for Kubernetes PGConf india 2019Greenplum for Kubernetes PGConf india 2019
Greenplum for Kubernetes PGConf india 2019Goutam Tadi
 
Trends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systemsTrends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systemsIgor José F. Freitas
 
ArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning MetadataArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning MetadataArangoDB Database
 
Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...
Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...
Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...Aerospike
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsStavros Kontopoulos
 
P4_tutorial.pdf
P4_tutorial.pdfP4_tutorial.pdf
P4_tutorial.pdfPramodhN3
 
An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...
An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...
An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...Databricks
 
Leveraging open source for large scale analytics
Leveraging open source for large scale analyticsLeveraging open source for large scale analytics
Leveraging open source for large scale analyticsSouth West Data Meetup
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Ahsan Javed Awan
 
Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3DataWorks Summit
 

Ähnlich wie Bridging the Gap Between Data Scientists and Software Engineers – Deploying Legacy Python Algorithms to Apache Spark with Minimum Pain (20)

Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
 
Scaling face recognition with big data - Bogdan Bocse
 Scaling face recognition with big data - Bogdan Bocse Scaling face recognition with big data - Bogdan Bocse
Scaling face recognition with big data - Bogdan Bocse
 
Automated Time Series Analysis using Deep Learning, Ray and Analytics Zoo
Automated Time Series Analysis using Deep Learning, Ray and Analytics ZooAutomated Time Series Analysis using Deep Learning, Ray and Analytics Zoo
Automated Time Series Analysis using Deep Learning, Ray and Analytics Zoo
 
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataHow To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
 
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkNear Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark
 
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to Production
 
Introduction to Amazon EC2 F1 Instances
Introduction to Amazon EC2 F1 Instances Introduction to Amazon EC2 F1 Instances
Introduction to Amazon EC2 F1 Instances
 
OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020
 
Greenplum for Kubernetes PGConf india 2019
Greenplum for Kubernetes PGConf india 2019Greenplum for Kubernetes PGConf india 2019
Greenplum for Kubernetes PGConf india 2019
 
Trends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systemsTrends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systems
 
“Full Stack” Data Science with R for Startups: Production-ready with Open-Sou...
“Full Stack” Data Science with R for Startups: Production-ready with Open-Sou...“Full Stack” Data Science with R for Startups: Production-ready with Open-Sou...
“Full Stack” Data Science with R for Startups: Production-ready with Open-Sou...
 
ArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning MetadataArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning Metadata
 
Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...
Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...
Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutions
 
P4_tutorial.pdf
P4_tutorial.pdfP4_tutorial.pdf
P4_tutorial.pdf
 
An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...
An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...
An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...
 
Leveraging open source for large scale analytics
Leveraging open source for large scale analyticsLeveraging open source for large scale analytics
Leveraging open source for large scale analytics
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...
 
Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3
 

Mehr von Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Kürzlich hochgeladen

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 

Kürzlich hochgeladen (20)

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 

Bridging the Gap Between Data Scientists and Software Engineers – Deploying Legacy Python Algorithms to Apache Spark with Minimum Pain

  • 11. Working towards a solution • Studied representative legacy algorithms • structure • how do they process data – columns required, sorting of data rows • are any tests available?! E.g., csv files of inputs and expected outputs. • Assumed we couldn’t alter the legacy code at all • so decided to wrap rather than port them to PySpark • i.e., legacy Python algorithm is called in parallel across the cluster. 11GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
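Where csv files of inputs and expected outputs did exist, a small regression test let an unchanged legacy execute() function be checked on a laptop before (and after) wrapping. A minimal sketch only – the module name, file paths and comparison settings are assumptions, not the production harness:

  import pandas as pd
  import my_legacy_algorithm  # hypothetical: the unchanged legacy module with execute(pandas_df)

  def test_execute_matches_expected_output():
      input_data = pd.read_csv("tests/part_x_input.csv")           # hypothetical path
      expected = pd.read_csv("tests/part_x_expected_output.csv")   # hypothetical path
      results = my_legacy_algorithm.execute(input_data)
      # Compare ignoring column order and exact float representation.
      pd.testing.assert_frame_equal(results, expected, check_like=True, check_exact=False)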
  • 12. To wrap or to port a legacy algorithm? 12GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
  Port when… • Performance is critical. • Algorithm is small, simple and easy to test on Spark. • The algorithm’s creator is comfortable working directly with Spark. • Spark skills are available for foreseeable future.
  Wrap when… • You wish to retain the ability to run, test and update the algorithm outside Spark (e.g., on laptop, or in other Big Data frameworks). • An auto-code generation tool is available for generating all the necessary wrapper code.
  • 13. Initially tried wrapping with RDD.mapPartitions()… • Call it after repartitioning the input data by engine id. This worked but… • Could get unexpected key skew effects unless you experiment with the way your data is partitioned. • The data for more than one asset (engine) at a time could be passed into the wrapped algorithm. • ok if the algorithm can handle data for more than one asset; otherwise not. • We really wanted to use @pandas_udf if Spark 2.3+ was available, and its ‘grouped map’ usage means that the data for only one asset gets passed to the algorithm. 13GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
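As an illustration only (the module name and the 'esn' partitioning column are assumptions, not the production code), the mapPartitions() approach looked roughly like this: repartition by engine id, then run the unchanged legacy execute() once per partition, which may still contain several engines.

  import pandas as pd
  import my_legacy_algorithm  # hypothetical legacy module with execute(pandas_df)

  def run_partition(rows):
      rows = list(rows)
      if not rows:
          return iter([])
      # One partition may hold rows for more than one engine.
      pdf = pd.DataFrame(rows, columns=rows[0].__fields__)
      results = my_legacy_algorithm.execute(pdf)
      return (tuple(r) for r in results.itertuples(index=False))

  results_rdd = (input_df.repartition("esn")   # 'esn' = engine serial number column (assumed name)
                         .rdd
                         .mapPartitions(run_partition))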
  • 14. …so then we switched to RDD.groupByKey() • Where key = asset (engine) id. • So the data for only one asset gets passed to the algorithm. • This more closely mirrors the behaviour of @pandas_udf, so this code should be easier to convert to use @pandas_udf later on. • And it will work with algorithms that can only cope with the data for one asset at a time. 14GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
  • 15. Forecasting engine removals – solution components • Multiple Digital Twin models – each one models the wear on a single engine part. • Input Data Predictor - asks all the digital twin models what input data they need, and then predicts those values n years into the future. • Aggregator – compares all the predictions to estimate when a given engine should be removed due to the wear of a particular part. • → All of these were made into ML Pipeline Transformers… 15GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019 (Pipeline diagram: Historic data → Input Data Predictor → PipelineModel of Digital Twin models → Aggregator PipelineModel → Persist results.)
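For illustration, assembling and running such a pipeline might look like the following minimal sketch. The stage class names, the input query and the save path are hypothetical; the real pipeline has more stages and configuration.

  from pyspark.ml import Pipeline
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  historic_df = spark.sql("SELECT ...")  # placeholder: the data needed by all algorithms in the pipeline

  stages = [
      InputDataPredictor(),                 # predicts each twin's input values n years ahead
      DigitalTwinEnginePartXEngineTypeP(),  # one transformer per tracked part
      DigitalTwinEnginePartYEngineTypeP(),
      Aggregator(),                         # compares per-part predictions to estimate removal date
  ]
  # All stages are transformers, so fit() does no training; it just packages them as a PipelineModel.
  pipeline_model = Pipeline(stages=stages).fit(historic_df)
  results_df = pipeline_model.transform(historic_df)
  # Saving the whole configured pipeline requires the custom transformers to be writable
  # (e.g. DefaultParamsWritable); path is hypothetical.
  pipeline_model.write().overwrite().save("/models/engine_type_p_forecast")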
  • 16. Strategy taken • Passed data to algorithms rather than have each algorithm fetch its own data. • algorithm shouldn’t have to know where the data comes from. • Got representative digital twin models working in isolation, using temporary table of predicted input data as input. • Prototyped in notebook environment (Apache Zeppelin). • Eventually incorporated the pipeline into a spark-submit job using a new hierarchy of classes… 16GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
  • 17. Class hierarchy 17GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
  (Class diagram. Key: parent class → child class; ‘A’ = abstract class.)
  • pyspark.ml: Params; Estimator (A); Transformer (A)
  • GE code: GeAnalytic (A); EngineWearModel (A); GroupByKeyEngineWearModel (A); HadoopMapReduceEngineWearModel (A); AnEstimator
  • Code you write: DigitalTwinEnginePartXEngineTypeP; DigitalTwinEnginePartYEngineTypeP
  • Sample mixin classes: HasEsnCol (esnCol); HasDatetimeCol (datetimeCol); HasFleetCol (fleetCol)
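As an illustration of the mixin classes, here is a minimal sketch of HasEsnCol using the standard pyspark.ml.param pattern; the default column name is an assumption and the real GE class may differ:

  from pyspark.ml.param import Param, Params, TypeConverters

  class HasEsnCol(Params):
      """Mixin adding an 'esnCol' param naming the engine serial number column."""
      esnCol = Param(Params._dummy(), "esnCol",
                     "name of the engine serial number (asset id) column",
                     typeConverter=TypeConverters.toString)

      def __init__(self):
          super(HasEsnCol, self).__init__()
          self._setDefault(esnCol="esn")  # default column name is an assumption

      def setEsnCol(self, value):
          return self._set(esnCol=value)

      def getEsnCol(self):
          return self.getOrDefault(self.esnCol)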
  • 18. Zooming in… 18GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
  • EngineWearModel (A) – abstract methods: _transform(), _handleMissingData(), _processInputColumns(), _processResultsColumns()
  • GroupByKeyEngineWearModel (A) – concrete methods: _transform(), _runAnalyticForOneAsset(), _convertRddOfPandasDataFramesToSparkDataFrame()
  • DigitalTwinEnginePartXEngineTypeP – concrete methods: _handleMissingData(), _processInputColumns(), _processResultsColumns()
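To show how little code each concrete wrapper then needs, here is a hedged sketch of a digital twin class. The column names, the setExecuteModuleName() setter and the module path are assumptions for illustration, not the production class:

  class DigitalTwinEnginePartXEngineTypeP(GroupByKeyEngineWearModel, HasEsnCol, HasDatetimeCol):
      """Wraps the legacy part-X wear algorithm for engine type P (illustrative only)."""

      def __init__(self):
          super(DigitalTwinEnginePartXEngineTypeP, self).__init__()
          # Name the unchanged legacy module whose execute() will be called (hypothetical path):
          self.setExecuteModuleName("algorithms.part_x_engine_type_p")

      def _handleMissingData(self, dataset):
          # Drop rows lacking the columns this twin needs (column names are assumptions).
          return dataset.dropna(subset=["egt_margin", "cycles"])

      def _processInputColumns(self, dataset):
          # Rename columns to whatever the legacy algorithm expects (assumption).
          return dataset.withColumnRenamed("egt_margin", "EGT_MARGIN")

      def _processResultsColumns(self, results_df):
          # Tidy the algorithm's output columns before joining back (assumption).
          return results_df.withColumnRenamed("predicted_removal_date", "part_x_removal_date")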
  • 19. _transform() method 19GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
  Note: Implement these methods in each DigitalTwinXXX class.
  (GroupByKeyEngineWearModel concrete methods: _transform(), _runAnalyticForOneAsset(), _convertRddOfPandasDataFramesToSparkDataFrame().)

  def _transform(self, dataset):
      no_nulls_data = self._handleMissingData(dataset)
      data_with_processed_input_columns = self._processInputColumns(no_nulls_data)
      asset_col = self.getEsnCol()
      grouped_input_rdd = data_with_processed_input_columns \
          .rdd.map(lambda row: (row[asset_col], row)).groupByKey().mapValues(list)
      results_rdd = grouped_input_rdd.mapValues(
          self._runAnalyticForOneAsset(self.getFailFast(), asset_col))
      results_df = self._convertRddOfPandasDataFramesToSparkDataFrame(results_rdd)
      processed_results = self._processResultsColumns(results_df)
      output_df = dataset.join(processed_results, asset_col, 'left_outer')
      return output_df
  • 20. Invoking the legacy Pandas-based algorithm 20GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019

  def _runAnalyticForOneAsset(self, failFast, assetCol):
      # Import the named legacy algorithm:
      pandas_based_analytic_module = importlib.import_module(
          self.getExecuteModuleName())  # A param set by each digital twin class.

      def _assetExecute(assetData):
          # Convert row data for asset into a Pandas DataFrame:
          rows = list(assetData)
          column_names = rows[0].__fields__
          input_data = pd.DataFrame(rows, columns=column_names)
          try:
              results = pandas_based_analytic_module.execute(input_data)  # Call legacy algorithm.
          except Exception as e:
              asset_id = input_data[assetCol].iloc[0]
              ex = Exception("Encountered %s whilst processing asset id '%s'"
                             % (e.__class__.__name__, asset_id), e.args[0])
              if failFast:
                  raise ex  # Fail immediately, report error to driver node.
              else:
                  # Log error message silently in the Spark executor's logs:
                  error_msg = "Silently ignoring this error: %s" % ex
                  print(datetime.now().strftime("%y/%m/%d %H:%M:%S : ") + error_msg)
                  return error_msg
          return results

      return _assetExecute

  (Slide callouts: input_data is a Pandas DataFrame; results is also a Pandas DataFrame. Concrete method of GroupByKeyEngineWearModel.)
  • 21. (Repeats the _runAnalyticForOneAsset() code from slide 20, shown full-screen.)
  • 22. Converting the results back into a Spark DataFrame 22GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
  (Concrete method of GroupByKeyEngineWearModel.)

  def _convertRddOfPandasDataFramesToSparkDataFrame(self, resultsRdd):
      errors_rdd = resultsRdd.filter(lambda results: not (isinstance(results[1], pd.DataFrame)))
      if not (errors_rdd.isEmpty()):
          print("Possible errors: %s" % errors_rdd.collect())
      valid_results_rdd = resultsRdd.filter(lambda results: isinstance(results[1], pd.DataFrame))
      if valid_results_rdd.isEmpty():
          raise RuntimeError("ABORT! No valid results were obtained!")
      # Convert the Pandas dataframes into lists and flatten into one list.
      flattened_results_rdd = valid_results_rdd.flatMapValues(
          lambda pdf: (r.tolist() for r in pdf.to_records(index=False))).values()
      # Create Spark DataFrame, using a schema made from that of the first Pandas DataFrame.
      spark = SparkSession.builder.getOrCreate()
      first_pdf = valid_results_rdd.first()[1]  # Pandas DataFrame
      first_pdf_schema = spark.createDataFrame(first_pdf).schema
      return spark.createDataFrame(flattened_results_rdd, first_pdf_schema)
  • 23. Algorithms before and after wrapping 23GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
  (BEFORE = standalone legacy algorithm; AFTER = same algorithm wrapped for a PySpark ML Pipeline.)
  • Hosting location – BEFORE: on a single node or laptop. AFTER: in a platform that runs spark-submit jobs on a schedule.
  • Configuration – BEFORE: held in a separate JSON config file for each algorithm. AFTER: stored in params of the ML Pipeline, which can be saved and loaded from disk for the whole pipeline; config is part of the pipeline itself.
  • Acquisition of input data – BEFORE: each algorithm fetched its own input data: made a separate Hive query, wrote its input data to csv, then read it into a single in-memory Pandas DataFrame for all applicable engines. AFTER: a PySpark spark.sql(“SELECT …”) statement for the data required by all the algorithms in the pipeline; passed as a Spark DataFrame into the transform() method for the whole pipeline.
  • All asset (engine) data – BEFORE: held in-memory on a single machine. AFTER: spread across the executors of the Spark cluster.
  • Writing of results – BEFORE: each algorithm wrote its output to csv, which was then loaded into Hive as a separate table. AFTER: each algorithm appends a column of output to the Spark DataFrame that’s passed from one transform() to the next in the pipeline.
  • Programming paradigm – BEFORE: written as an execute() function which called other functions. AFTER: inherits from a specialised pyspark.ml Transformer class.
  • 24. But it wasn’t all a bed of roses! Challenges… • Pipeline wasn’t really a simple linear pipeline • Digital twin models operate independently – so could really be run in parallel. • Many digital twins need to query data that’s in a different shape to the data that’s passed into the transform() method for the whole ML Pipeline. • Converting the Pandas DataFrames back into Spark DataFrames without hitting data-type conversion issues at runtime was tricky! 24GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019 (Pipeline diagram: Historic data → Input Data Predictor → PipelineModel of Digital Twin models → Aggregator → Persist results, with ‘Other data’ as an additional input.)
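One way to side-step the data-type conversion issue mentioned above is to declare the result schema explicitly rather than inferring it from the first Pandas DataFrame. A hedged sketch only – the column names are hypothetical, and flattened_results_rdd and spark refer to the variables in the conversion method above:

  from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

  results_schema = StructType([
      StructField("esn", StringType(), True),                     # engine serial number (assumed name)
      StructField("predicted_removal_date", TimestampType(), True),
      StructField("wear_fraction", DoubleType(), True),
  ])

  # Pandas integer columns silently become floats when nulls appear, timestamps must be
  # datetime64[ns], etc., so coercing each results DataFrame to the declared schema before
  # building the Spark DataFrame avoids inference surprises at runtime.
  spark_results = spark.createDataFrame(flattened_results_rdd, results_schema)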
  • 25. More challenges • Debugging can be tricky! Tips: • failFast flag – True - stop processing if any asset throws an exception. Useful when debugging. – False - silently log an error message for any asset that throws an exception, but continue processing for other assets. Useful in production. • run with fewer engines and/or fleets when testing; gradually expand out. • Even simple things have to be encoded as extra transformers in the pipeline or added as extra params. • e.g., persisting data, when required, between different stages in the pipeline 25GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
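For example, when debugging we would restrict the input to a handful of engines before calling transform(); the esn values below are made up, and the setFailFast() setter name is an assumption based on the failFast param described above:

  from pyspark.sql.functions import col

  # Surface the first per-asset exception immediately while debugging:
  for stage in pipeline_model.stages:
      if hasattr(stage, "setFailFast"):
          stage.setFailFast(True)

  debug_df = historic_df.filter(col("esn").isin(["123456", "123457"]))  # a couple of test engines
  debug_results = pipeline_model.transform(debug_df)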
  • 26. Benefits of this approach • Much more reliable – don’t run out of memory any more! • Will scale with the number of assets as the engine fleet grows. • Whole forecasting scenario runs as a single ML PipelineModel - one per engine type/config. • Consistent approach (and column names!) across the algorithms. 26GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
  • 27. Key benefit Data scientists who know little/nothing about Spark... • can still develop and test their algorithm outside Spark on their own laptop, and… • yet still have it deployed to Spark to scale with Big Data☺. You don’t have to rewrite each algorithm in PySpark to use the power of Spark. 27GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
  • 28. Potential next steps • Auto-generate the wrapper code for new Pandas-based algorithms; e.g., from a Data Science Workbench UI. Or, at the very least, create formal templates that encapsulate the lessons learned. • Allow the same test data csv files on a laptop to be used unaltered for testing in the deployed Spark environment. Need to verify that the ported algorithms actually work! • Switch to using @pandas_udf on later versions of Spark. 28GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
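For reference, a @pandas_udf version of the wrapper could look like this minimal sketch in the Spark 2.3/2.4 grouped-map style (in Spark 3 the equivalent is groupby(...).applyInPandas). The result schema, column name and module name are assumptions:

  import pandas as pd
  from pyspark.sql.functions import pandas_udf, PandasUDFType
  import my_legacy_algorithm  # hypothetical legacy module with execute(pandas_df)

  results_schema = "esn string, predicted_removal_date timestamp, wear_fraction double"

  @pandas_udf(results_schema, PandasUDFType.GROUPED_MAP)
  def run_one_asset(pdf):
      # pdf holds the rows for exactly one engine, already as a Pandas DataFrame.
      return my_legacy_algorithm.execute(pdf)

  results_df = input_df.groupby("esn").apply(run_one_asset)

  # Spark 3.x equivalent, without the decorator:
  # results_df = input_df.groupby("esn").applyInPandas(
  #     lambda pdf: my_legacy_algorithm.execute(pdf), schema=results_schema)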
  • 29. Potential next steps • Look to optimize the entire pipeline, e.g., by removing Spark actions where possible, such as persisting intermediate results. • Many existing ‘algorithms’ – especially the digital twin models - are themselves really codified workflows or pipelines of lower-level algorithms. • so you could convert each algorithm into a pipeline of lower-level algorithms. • what are different algorithms now would simply become different pipelines; or even the same pipeline of transformers that’s just configured for a different engine part. 29GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019
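For example, what is a separate ‘part X’ algorithm today might become one generic pipeline re-parameterised for each part. A speculative sketch with hypothetical lower-level transformer names:

  from pyspark.ml import Pipeline

  def build_wear_pipeline(part_name, wear_rate_per_cycle):
      """Same generic stages, configured for one tracked part (illustrative only)."""
      return Pipeline(stages=[
          SelectPartSensors(partName=part_name),          # hypothetical lower-level transformers
          EstimateWearRate(ratePerCycle=wear_rate_per_cycle),
          PredictRemovalDate(partName=part_name),
      ])

  part_x_pipeline = build_wear_pipeline("part_x", 1.2e-5)
  part_y_pipeline = build_wear_pipeline("part_y", 3.4e-5)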
  • 30. Conclusions and recommendations • Consider wrapping rather than porting to PySpark, especially if the Data Scientists want to develop/test outside Spark. • ML Pipelines offers a useful paradigm for running workflows of algorithms and saving/reloading them. • If algorithm can handle > 1 asset at a time then RDD.mapPartitions() might suffice. Otherwise use RDD.groupByKey() or @pandas_udf. • Push reusable code into a class hierarchy so each concrete wrapper class needs very little code. 30GE Aviation - Bridging the Gap between Data Scientists and Software Engineers | 17 Oct 2019