VerticaPy_original - Anritsu.pdf

  1. Machine Learning with Python & Vertica using VerticaPy. Matteo Monaldi, Data Scientist. Bring Analytics to the next level.
  2. Data Science Challenges
  3. Data Science Lifecycle. The flow runs: Business Understanding → Data Mining & Cleaning → Data Exploration → Data Preparation → Machine Learning → Model Evaluation → Data Visualization → Model Deployment & Maintenance, with decision gates along the way (Data Quality? Enough Data? Good Enough?) that loop back on False and move forward on True until the end of the project. The roles involved are the Business Analyst, Data Engineer, and Data Scientist, spanning the Data Visualization and Data Science processes.
  4. Vertica Machine Learning
  5. Vertica Supports the Entire Data Science Process, across the lifecycle stages Business Understanding, Data Analysis/Exploration, Data Preparation, Model Training, Model Evaluation, and Deployment/Management:
     - Data Analysis/Exploration: Statistical Summary, Sessionization, Pattern Matching, Date/Time Algebra, Window/Partition, Data Type Handling, Sequences, and more
     - Data Preparation: Outlier Detection, Normalization, Imbalanced Data Processing, Sampling, Test/Validation Split, Time Series, Missing Value Imputation, Filtering, Feature Selection, Correlation Matrices, and more
     - Model Training: K-Means, Support Vector Machines, Logistic/Linear/Ridge Regression, Naïve Bayes, Random Forests, XGBoost, Principal Component Analysis, Cross Validation, and more
     - Model Evaluation: Model-level Stats, ROC Tables, Error Rate, Lift Table, Confusion Matrix, R-Squared, MSE
     - Deployment/Management: In-Database Scoring, Speed at Scale, Security, Table-Like Management, Versioning, Authorization, PMML Import/Export, TensorFlow Import, and more
  6. Advantages of Vertica in-database Machine Learning: eliminating the overhead of data transfer; data security and provenance; model storage and management; serving concurrent users; highly scalable ML functionality; and avoiding the maintenance cost of a separate system. Every node in the cluster (Node 1 to Node N) holds schemas, tables, and models, connected over the network.
  7. Sampling vs. full dataset: training on large data results in better generalization. Downsampling leads to a lack of generalization and over-fitting, while the full dataset gives generalization and data-driven results across tasks such as anomaly detection, moving windows, sessionization, interpolation, missing-value handling, normalization, and supervised learning.
  8. Vertica Model Management:
     - View models: SELECT * FROM models;
     - Alter models (change name, owner, and schema):
         ALTER MODEL mymodel RENAME TO mykmeansmodel;
         ALTER MODEL mykmeansmodel OWNER TO user1;
         ALTER MODEL mykmeansmodel SET SCHEMA public;
     - Drop models: DROP MODEL myLinearModel;
     - Summarize models: SELECT SUMMARIZE_MODEL('LogisticRegModel');
     - Get model attributes: SELECT GET_MODEL_ATTRIBUTES(USING PARAMETERS model_name = 'LogisticRegModel');
     - Manage model security: GRANT USAGE ON MODEL myLinearModel TO user2;
     The models view reports schema_name, owner_name, model_name, model_type, model_size, and create_date for each model. The attribute output includes per-predictor coefficients with standard errors, z-values, and p-values, the regularization settings (type, lambda), the full call string, and additional info such as iteration count and accepted/rejected row counts.
  9. Multiple Ways of Solving Data Science Challenges. Vertica has a unique value in the Machine Learning space: in-database Machine Learning through a SQL front end, in-database Data Science in Python (VerticaPy), user-defined functions in four programming languages (C++, R, Python & Java), TensorFlow import, and PMML import & export.
  10. VerticaPy
  11. Bring Analytics to the next level with VerticaPy: a Python front-end with a SQL back-end (https://www.vertica.com/python/, https://github.com/vertica/VerticaPy). Jupyter/Python is the popular tool of choice for data scientists and analysts, while much of the heavy computation is done by Vertica. Models are stored and managed in Vertica, and the data stays in Vertica (security, integrity, scalability). VerticaPy is open-source: users can contribute, there is no added software cost, and the roadmap is constantly updated. Its vDataFrame and vModel objects provide dynamic data exploration, data preparation, aggregates, model building, and model evaluation, over a standard database connection from Jupyter notebooks.
  12. VerticaPy In-Database Data Science. Example: summary statistics. The client code
        from verticapy import vDataFrame
        # Object creation
        vdf = vDataFrame("public.iris", dsn = "VerticaDSN")
        # Describe
        vdf.describe()
      is executed as distributed SQL inside Vertica:
        SELECT SUMMARIZE_NUMCOL("SepalLengthCm", "PetalWidthCm", "PetalLengthCm", "SepalWidthCm")
        OVER () FROM "public"."iris"
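As a rough plain-Python illustration of what a describe() call pushes down (this is not VerticaPy code, and the sample values are made up), the per-column aggregates can be sketched as:

```python
import math

def describe(values):
    """Summary statistics for one numeric column: count, mean,
    sample standard deviation (n - 1 denominator, as SQL's STDDEV
    uses), min, and max."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return {"count": n, "mean": mean, "std": math.sqrt(var),
            "min": min(values), "max": max(values)}

# Hypothetical SepalLengthCm sample, for illustration only.
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0]
stats = describe(sepal_length)
```

In VerticaPy these aggregates are computed by the database, not in Python; only the small result table travels back to the client.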
  13. VerticaPy allows easy Data Preparation & Code Deployment.
        from verticapy.datasets import load_titanic
        titanic = load_titanic()
        # Doing some data preparation
        titanic.fillna()
        titanic["family_size"] = "parch + sibsp + 1"
        titanic.normalize()
        # Current vDataFrame relation
        display(titanic.current_relation())
      The generated relation imputes missing values with COALESCE and z-score normalizes the numeric columns, for example:
        SELECT ("fare" - 33.9637936739659) / (52.6247198802501) AS "fare", "sex", ...,
               ("family_size" - 1.88249594813614) / (1.58407574155133) AS "family_size"
        FROM (SELECT COALESCE("fare", 33.9637936739659) AS "fare", "sex", ...,
                     COALESCE("embarked", 'S') AS "embarked",
                     "parch" + "sibsp" + 1 AS "family_size"
              FROM "public"."titanic") VERTICAPY_SUBTABLE
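The two transformations that generated SQL applies, mean imputation (COALESCE with the column mean) and z-score normalization, can be sketched in plain Python (illustrative values, not the VerticaPy implementation):

```python
import math

def fillna_mean(column):
    """Replace None with the column mean, as COALESCE(col, avg) does."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]

def zscore(column):
    """Z-score normalization: (x - mean) / std, matching the
    (col - avg) / stddev expressions in the generated SQL."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / (len(column) - 1))
    return [(v - mean) / std for v in column]

# Hypothetical "age" column with missing values.
age = [22.0, None, 26.0, None, 30.0]
age_filled = fillna_mean(age)   # None replaced by the mean, 26.0
age_norm = zscore(age_filled)
```

In VerticaPy both steps run as SQL inside Vertica; nothing is materialized client-side.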
  14. VerticaPy is a complete statistical software.
     - Correlations: Pearson, Spearman, Kendall, Biserial Point, Cramér's V
     - Heteroscedasticity: Breusch-Pagan, Goldfeld-Quandt, White's Lagrange, Engle
     - Trend & Stationarity: Augmented Dickey-Fuller, Mann-Kendall
     - Normality: Normaltest
     - Joins: regular joins, time series joins, spatial joins
     - Linear models: Linear Regression, Logistic Regression, LinearSVC
     - Tree-based models: Random Forest, XGBoost
     - Clustering: KMeans, Bisecting KMeans
  15. Everything you need to visualize your data: comparison, distribution, relationship, trend, proportion, geospatial, and animated charts.
  16. Export your graphics in various formats: print the chart, download it as a PNG image, JPEG image, PDF document, or SVG vector image, or build an HTML page. Exporting to HTML produces code that can be embedded independently in many GUIs, as long as they support JavaScript: the generated page loads the Highcharts libraries and renders the chart's options and data series into a container div.
        # Highcharts graphics are objects with numerous methods
        class Highchart(builtins.object)
            buildcontainer()
            buildhtml()
            buildhtmlheader()
            save_file(filename = 'Chart')
  17. VerticaPy Delphi: Automated Machine Learning. Auto Data Preparation uses one-hot encoding, label encoding, missing-value imputation, and other data preparation techniques to preprocess the data. Auto Grid Search CV tests many parameter combinations across different grids and finds an optimal one. After the selection of the algorithm, Auto Variable Selection uses the stepwise algorithm to find a good set of features. Candidate models (Logistic Regression, XGBoost, Random Forest, SVM, Naive Bayes) are compared on score, runtime, and precision variance, ranging from efficient to performant.
  18. Geospatial
  19. Supported Spatial Objects.
     - GEOMETRY: a spatial object with coordinates expressed as (x, y) pairs, defined in the Cartesian plane. All calculations use Cartesian coordinates. Because it is a projection onto the plane, computations are cheaper than with the GEOGRAPHY type: use this type when you can!
     - GEOGRAPHY: a spatial object defined on the surface of a perfect sphere, i.e. in the WGS84 coordinate system. Coordinates are expressed as (longitude, latitude) angular values, measured in degrees; distance calculations are in meters.
     The maximum size of a GEOMETRY or GEOGRAPHY value is 10 MB, and you cannot modify the size or data type of a GEOMETRY or GEOGRAPHY column after creation.
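The practical difference between the two types can be sketched in plain Python (illustrative only, not Vertica's implementation): a GEOMETRY distance is planar Euclidean in coordinate units, while a GEOGRAPHY distance is measured along the sphere, here approximated with the haversine formula and a mean Earth radius:

```python
import math

def euclidean(p, q):
    """Planar distance in coordinate units, as GEOMETRY computations use."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def haversine_m(lon1, lat1, lon2, lat2, r=6_371_000):
    """Great-circle distance in meters on a sphere of radius r,
    roughly what GEOGRAPHY distance functions return."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# One degree of latitude is ~111 km on the sphere but just 1.0 in the plane.
plane = euclidean((0.0, 0.0), (0.0, 1.0))
sphere = haversine_m(0.0, 0.0, 0.0, 1.0)
```

This is why GEOMETRY is cheaper: no trigonometry per point pair, at the cost of projection distortion.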
  20. Spatial Joins.
      VerticaPy SQL Magic:
        %load_ext verticapy.sql
        %%sql
        SELECT STV_Create_Index(id, geometry
                                USING PARAMETERS index = 'world_polygons',
                                overwrite = true, max_mem_mb = 256)
        OVER() FROM pols;
        SELECT STV_Intersect(lat, lon USING PARAMETERS index = 'world_polygons');
      VerticaPy vDataFrame:
        from verticapy.geo import *
        from verticapy.datasets import load_world, load_cities
        cities = vDataFrame("map.cities")
        world = vDataFrame("map.world")
        # Creating the index
        create_index(world, "id", "geometry", "world_polygons", True)
        # Computing the intersections
        intersect(cities, "world_polygons", "id", x = "lat", y = "lon")
      The result maps each point_id to the polygon_id of the polygon it falls in.
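At the heart of such a spatial join is a point-in-polygon test. Purely as an illustration (this is not Vertica's STV_Intersect implementation), the classic ray-casting version can be sketched as:

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: cast a horizontal ray from (x, y) and count
    how many polygon edges it crosses; an odd count means inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the ray's y level can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# A 4x4 square polygon, listed counter-clockwise.
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
```

A spatial index like the one STV_Create_Index builds exists to avoid running this test against every polygon for every point.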
  21. Time Series
  22. Challenges with Irregular Time Series. The gap between records can vary wildly, e.g. Δt = 8 minutes between two readings, then Δt = 6h51m to the next.
     - Inconsistent models: time series models are most of the time auto-regressive, and an irregular series leads to wrongly defined lag variables.
     - Wrong aggregations: computing aggregations may become difficult, as the underlying variables are inconsistent.
     - Huge gaps: irregular time series may leave huge gaps between records, changing what a variable means.
     - Difficult joins: joins become difficult because they need an exact timestamp match; in Vertica, time series joins solve this problem.
  23. Challenges with Irregular Time Series: it is impossible to join the different data sources directly. One source records values at irregular times (09:00 = 5, 09:20 = 6, 09:26 = 3.5, 09:41 = 0, 16:32 = 3, 16:40 = 6, 17:00 = 5, 17:19 = 1) while the other is sampled every 30 minutes (09:00 = 10, 09:30 = 11, 10:00 = 12, 10:30 = 12.5, 11:00 = 13, 11:30 = 13.5, 12:00 = 14, 12:30 = 14.5); the timestamps almost never match exactly.
  24. Solution: Time Series Join, joining each row to the closest earlier timestamp.
      VerticaPy vDataFrame:
        from verticapy import vDataFrame
        sm_weather = vDataFrame("sm.weather")
        sm_consumption = vDataFrame("sm.consumption")
        sm_weather.join(sm_consumption,
                        how = "left",
                        on_interpolate = {"dateUTC": "dateUTC"},
                        expr2 = ["temperature", "humidity"])
      VerticaPy SQL Magic:
        SELECT * FROM "sm"."weather" AS x
        LEFT JOIN "sm"."consumption" AS y
        ON x."dateUTC" INTERPOLATE PREVIOUS VALUE y."dateUTC"
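The semantics of INTERPOLATE PREVIOUS VALUE are those of an "as-of" join: each left row is matched to the right row with the latest timestamp at or before it. A plain-Python sketch of that matching (the timestamps and values below are made up for illustration, expressed as minutes since midnight):

```python
import bisect

def asof_join(left, right):
    """For each (ts, value) row in left, attach the right-side value
    whose timestamp is the latest one <= ts (the behavior of
    INTERPOLATE PREVIOUS VALUE), or None if no earlier row exists."""
    right = sorted(right)
    keys = [ts for ts, _ in right]
    out = []
    for ts, v in left:
        i = bisect.bisect_right(keys, ts) - 1
        out.append((ts, v, right[i][1] if i >= 0 else None))
    return out

# 09:00 = 540, 09:20 = 560, ... (minutes since midnight)
weather = [(540, 5), (560, 6), (566, 3.5), (581, 0), (992, 3)]
consumption = [(540, 10), (570, 11), (600, 12), (630, 12.5), (660, 13), (990, 16.5)]
joined = asof_join(weather, consumption)
```

Because no exact timestamp match is required, every weather reading gets a consumption value, which is exactly what a plain equi-join on irregular series cannot deliver.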
  25. Solution: Time Series Slicing & Interpolation.
      VerticaPy vDataFrame:
        from verticapy import vDataFrame
        # Creating the vDataFrame
        sm_consumption = vDataFrame("sm.consumption")
        # Time series slicing & interpolation
        sm_consumption.asfreq(ts = "dateUTC",
                              rule = "30 minutes",
                              method = {"value": "linear"},
                              by = ["meterID"])
      VerticaPy SQL Magic:
        SELECT slice_time AS "dateUTC",
               "meterID",
               TS_FIRST_VALUE("value", 'linear') AS "value"
        FROM "sm"."consumption"
        TIMESERIES slice_time AS '30 minutes'
        OVER (PARTITION BY "meterID" ORDER BY "dateUTC")
      The irregular input (09:00 = 5, 09:20 = 6, 09:26 = 3.5, 09:41 = 0, 16:32 = 3) becomes a regular 30-minute series (09:00 = 5, 09:30 = 2.56666667, 10:00 = 0.13868613, 10:30 = 0.35766424, 11:00 = 0.576642336).
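The resampling above fills each 30-minute slice by linear interpolation between the surrounding raw points. A plain-Python sketch that reproduces the slide's numbers (timestamps as minutes since midnight; illustrative, not Vertica's implementation):

```python
def interpolate_linear(points, start, stop, step):
    """Resample (ts, value) points onto a regular grid, linearly
    interpolating between the surrounding raw points, in the spirit
    of TIMESERIES ... TS_FIRST_VALUE(value, 'linear')."""
    points = sorted(points)
    out = []
    for ts in range(start, stop + 1, step):
        prev = max((p for p in points if p[0] <= ts), default=None)
        nxt = min((p for p in points if p[0] >= ts), default=None)
        if prev is None or nxt is None:
            continue  # no surrounding points: slice stays empty
        if prev[0] == nxt[0]:
            out.append((ts, prev[1]))  # grid point hits a raw point
        else:
            frac = (ts - prev[0]) / (nxt[0] - prev[0])
            out.append((ts, prev[1] + frac * (nxt[1] - prev[1])))
    return out

# 09:00 = 540, 09:20 = 560, 09:26 = 566, 09:41 = 581, 16:32 = 992
raw = [(540, 5), (560, 6), (566, 3.5), (581, 0), (992, 3)]
sliced = interpolate_linear(raw, 540, 660, 30)  # 09:00 .. 11:00, every 30 min
```

Note how the value at 09:30 (2.5667) sits on the line between the 09:26 and 09:41 readings, matching the slide's output table.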
  26. Model Deployment
  27. Deploy your pipeline in Vertica. Any of these SQL scripts can then be run by a scheduler.
      Vertica built-in SQL:
        SELECT "age", "tenure",
               PREDICT_LOGISTIC_REG("age", "tenure" USING PARAMETERS model_name = 'telco.lr') AS "churn_score"
        FROM (SELECT APPLY_NORMALIZE("age", "tenure" USING PARAMETERS model_name = 'telco.normalizer')
              FROM "telco"."customers") x;
      Standard SQL:
        SELECT "age", "tenure",
               1 / (1 + EXP(- (0.5 * "age" - 0.3 * "tenure" - 0.9))) AS "churn_score"
        FROM (SELECT ("age" - 13) / 80 AS "age", ("tenure" - 5) / 48 AS "tenure"
              FROM "telco"."customers") x;
      PMML integration:
        SELECT "age", "tenure",
               PREDICT_PMML("age", "tenure" USING PARAMETERS model_name = 'telco.lr') AS "churn_score"
        FROM "telco"."customers";
      TensorFlow integration:
        SELECT PREDICT_TENSORFLOW("customer_id", "age", "tenure"
                                  USING PARAMETERS model_name = 'telco.lr', num_passthru_cols = 1)
        OVER(PARTITION BEST) FROM "telco"."customers";
      H2O integration:
        SELECT "age", "tenure",
               H2OModelScore("age", "tenure" USING PARAMETERS model_name = 'telco.lr') AS "churn_score"
        FROM "telco"."customers";
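The "Standard SQL" variant makes the point that a deployed pipeline is just arithmetic: rescale the inputs, then apply the logistic function. The same computation in plain Python, using the slide's illustrative coefficients (these are example values, not a real trained model):

```python
import math

def churn_score(age, tenure):
    """Rescale the inputs, then apply the logistic function,
    mirroring the 'Standard SQL' deployment query. The rescaling
    constants and coefficients are the slide's example values."""
    age_n = (age - 13) / 80
    tenure_n = (tenure - 5) / 48
    return 1 / (1 + math.exp(-(0.5 * age_n - 0.3 * tenure_n - 0.9)))

score = churn_score(45, 12)  # a probability strictly between 0 and 1
```

Being able to write the model out as plain SQL means it can be deployed on any database, even one without ML functions, which is why VerticaPy can export models this way.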
  28. User Defined Functions
  29. Bring your own lambda function. With its Python front-end and SQL back-end, VerticaPy can generate a Vertica Python UDF from a Python function. The generated UDFs process the data in blocks, provide scalable in-database computations, are stored on all the cluster's nodes, and keep the flexibility of a Python implementation. Did you know? UDFs are also available in Java, C++ & R.
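Purely as a shape illustration of "processing the data using blocks" (this is not the Vertica SDK API; the function and data are hypothetical), the core of a block-wise scalar UDF might look like:

```python
def process_block(rows):
    """Toy block-processing function: for each (x, y) input row,
    emit x + y. A real Vertica Python UDx wraps logic like this
    behind the SDK's block-processing interface; this sketch shows
    only the inner loop the engine would call once per block."""
    return [x + y for x, y in rows]

# The engine feeds the function one block of rows at a time.
blocks = [[(1, 2), (3, 4)], [(5, 6)]]
results = [process_block(b) for b in blocks]
```

Block-at-a-time execution is what lets the function run in parallel on every node that holds a slice of the data.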
  30. VerticaPy, the Python API for Vertica: Data Science at Scale. Machine Learning, Big Data, Time Series, Dynamic Charts. New videos every Wednesday; subscribe to our channel. www.github.com/vertica/VerticaPy, www.vertica.com/python/, www.linkedin.com/company/verticapy/
  31. Thank you. www.vertica.com