Advertising Fraud Detection at Scale at T-Mobile


Description

The development of big data products and solutions at scale brings many challenges to teams of platform architects, data scientists, and data engineers. While it is easy to end up working in silos, successful organizations collaborate intensively across disciplines so that the problem is understood by everyone and the proposed model and solution can be scaled and optimized on multiple terabytes of data.

Transcript

  1. Advertising Fraud Detection at Scale @ T-Mobile. Eric Yatskowitz, Data Scientist; Phan Chuong, Data Engineer
  2. Ad Tech Overview
  3. Ad Tech Industry. Why is this industry so rife with fraud? ▪ A lack of regulation is one reason ▪ The complexity of the ad tech industry is another
  4. Ad Tech Complexity
  5. What does typical fraudster behavior look like? ▪ Bot farms ▪ Domain spoofing
  6. How do we detect suspicious behavior? We need data: ▪ DMP data, bid request data, and data from SSPs and DSPs ▪ Device network data. We need a model that is adaptive and can detect different kinds of anomalies, which requires historical data. And we need to be able to scale the model to network data volumes of 4-10 TB per day.
  7. Building Data Science Products. [Architecture diagram: 3rd-party data and the T-Mobile data platform feed data science products; storage formats are CSV, ORC, and Parquet; processing runs on YARN/Mesos with MapReduce, Tez, Spark, and Storm.]
  8. Building Data Science Products: Working Pipeline. Read data (ORC, Parquet, CSV) → develop model → save model and outputs → visualization and business interpretation.
  9. Spark and Big Data ▪ When working with big data, Spark becomes a necessity ▪ Hive and SQL do not support machine learning ▪ Python and R cannot operate on large data sets (> 4 GB) on a single machine
  10. Spark Tuning
  11. Spark Tuning - Overview. Resource management: static allocation vs. dynamic allocation. Reading: partition sizing and split strategy. Joining and aggregating: maximizing parallelism and shuffling strategy. Writing: maximizing parallelism and shuffling strategy.
  12. Spark Tuning - Resource Management. spark.dynamicAllocation (s.d.) settings: s.d.enabled = true, s.d.initialExecutors, s.d.minExecutors, s.d.maxExecutors, s.d.executorIdleTimeout, s.d.cachedExecutorIdleTimeout.
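     A minimal sketch of how these settings could be applied when building a session; the app name and the executor counts and timeouts are illustrative, not from the talk:

        from pyspark.sql import SparkSession

        # Dynamic allocation grows and shrinks the executor pool with load.
        # On YARN it also needs the external shuffle service (or, on Spark 3+,
        # spark.dynamicAllocation.shuffleTracking.enabled = true).
        spark = (SparkSession.builder
                 .appName("fraud-detection")
                 .config("spark.dynamicAllocation.enabled", "true")
                 .config("spark.shuffle.service.enabled", "true")
                 .config("spark.dynamicAllocation.initialExecutors", "10")
                 .config("spark.dynamicAllocation.minExecutors", "2")
                 .config("spark.dynamicAllocation.maxExecutors", "100")
                 .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
                 .config("spark.dynamicAllocation.cachedExecutorIdleTimeout", "10min")
                 .getOrCreate())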
  13. Spark Tuning - Reading from HDFS. spark.files.openCostInBytes (oCIB), default = 4 MB; spark.files.maxPartitionBytes (mPB), default = 128 MB. Suggested setting: mPB = (DataBytes + NumberOfFiles × oCIB) / (Executors × Cores). [Chart: number of tasks vs. partition size in GB.]
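     A small sketch of that sizing rule in Python; the byte counts and cluster dimensions are made-up inputs, and the spark.sql.files.* variants shown are the settings the DataFrame reader honors:

        # Size read partitions so the whole cluster is busy in one wave:
        # mPB = (DataBytes + NumberOfFiles * oCIB) / (Executors * Cores)
        data_bytes = 5 * 1024**4        # ~5 TB of input (illustrative)
        number_of_files = 50_000
        open_cost = 4 * 1024**2         # oCIB default, 4 MB
        executors, cores = 100, 4

        mpb = (data_bytes + number_of_files * open_cost) // (executors * cores)
        spark.conf.set("spark.sql.files.maxPartitionBytes", str(mpb))
        spark.conf.set("spark.sql.files.openCostInBytes", str(open_cost))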
  14. Spark Tuning - Shuffling Strategy. spark.sql.shuffle.partitions, default = 200. Rule of thumb: ShufflePartitions = Executors × Cores.
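     Applied the same way, with the executor and core counts assumed as before:

        # One shuffle partition per available task slot keeps every core busy
        # without creating thousands of tiny tasks.
        executors, cores = 100, 4
        spark.conf.set("spark.sql.shuffle.partitions", str(executors * cores))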
  15. Spark Tuning - Writing Strategy. df.write…, df.coalesce(num).write…, or df.repartition(num).write…
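     A brief sketch of the trade-off between the two; the DataFrame, target file count, and output paths are illustrative:

        df = spark.range(100_000_000)

        # coalesce(n) merges existing partitions without a shuffle: cheap, but
        # it can leave skewed, uneven output files.
        df.coalesce(200).write.parquet("/tmp/out_coalesced")

        # repartition(n) performs a full shuffle: more expensive, but it
        # balances the output files and restores write parallelism.
        df.repartition(200).write.parquet("/tmp/out_repartitioned")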
  16. Demo
  17. From Python to PySpark
  18. From Python to PySpark ▪ Many data scientists are most comfortable coding in Python ▪ Spark can seem very intimidating to newcomers ▪ UDFs provide a useful way to run Python code in Spark ▪ But it is often still much more efficient to write PySpark code directly
  19. Python vs PySpark: pipeline stages side by side. PySpark stages: load data (data = spark.read.option().csv()), features (VectorAssembler), model (RandomForestClassifier), pipeline (Pipeline()), data split ((trainDF, testDF) = df.randomSplit()). Python stages: load data (data = pd.read_csv()), data split (train_test_split()), model (RandomForestRegressor).
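     A minimal runnable version of the PySpark side of that comparison; the input path and the column names (clicks, impressions, is_fraud) are invented for illustration:

        from pyspark.sql import SparkSession
        from pyspark.ml import Pipeline
        from pyspark.ml.feature import VectorAssembler
        from pyspark.ml.classification import RandomForestClassifier

        spark = SparkSession.builder.getOrCreate()
        data = (spark.read.option("header", "true")
                          .option("inferSchema", "true")
                          .csv("/tmp/ads.csv"))

        trainDF, testDF = data.randomSplit([0.8, 0.2], seed=42)

        # Assemble raw columns into the single vector column Spark ML expects.
        assembler = VectorAssembler(inputCols=["clicks", "impressions"],
                                    outputCol="features")
        rf = RandomForestClassifier(labelCol="is_fraud", featuresCol="features")

        model = Pipeline(stages=[assembler, rf]).fit(trainDF)
        predictions = model.transform(testDF)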
  20. Python to PySpark transition
  21. Python UDF vs PySpark. [Side-by-side code: reading and filtering variables with a Python UDF vs. native PySpark.]
  22. Python UDF vs PySpark (cont.). UDFs can serve as a useful stepping stone from Python to Spark, but converting to native PySpark will almost always be more efficient. [Side-by-side code: Python UDF vs. PySpark SQL.]
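     A hedged illustration of that point; the column name and threshold logic are invented, only the API calls are real:

        from pyspark.sql import SparkSession, functions as F
        from pyspark.sql.functions import udf
        from pyspark.sql.types import StringType

        spark = SparkSession.builder.getOrCreate()
        df = spark.range(1_000_000).toDF("clicks")

        # Python UDF: every row is serialized out to a Python worker and back.
        @udf(StringType())
        def label_udf(clicks):
            return "suspicious" if clicks > 500_000 else "normal"

        slow = df.withColumn("label", label_udf("clicks"))

        # Native PySpark: the same logic stays inside the JVM and the Catalyst
        # optimizer, avoiding per-row serialization entirely.
        fast = df.withColumn("label",
                             F.when(F.col("clicks") > 500_000, "suspicious")
                              .otherwise("normal"))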
  23. Normalized Entropy: The Algorithm. Shannon entropy: H = -Σᵢ P(xᵢ) log P(xᵢ), where P(xᵢ) is the probability that user xᵢ used the app, P(xᵢ) = C(xᵢ)/C(X); C(xᵢ) is the number of times the app showed up in user xᵢ's network, and C(X) is the number of times the app showed up in the entire network. Normalized Shannon entropy divides H by its maximum, log(N), so scores are comparable across apps. App counts per user and resulting scores:
      App               | TMO user 1 | TMO user 2 | TMO user 3 | … | TMO user 70M | Norm. entropy score
      Facebook          | 28         | 50         | 0          | … | 154          | 76 (normal)
      Netflix           | 287        | 340        | 78         | … | 0            | 54 (normal)
      Free weather app  | 0          | 0          | 1000       | … | 0            | 0 (stalker)
      Misc. banking app | 1          | 1          | 0          | … | 1            | 100 (spammer)
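     A minimal PySpark sketch of one plausible way to compute these scores; the input rows, column names (app, user_id, cnt), and the 0-100 scaling are assumptions, not the talk's exact implementation:

        from pyspark.sql import SparkSession, functions as F

        spark = SparkSession.builder.getOrCreate()

        # Hypothetical input: one row per (app, user) with the number of times
        # the app showed up in that user's network traffic.
        df = spark.createDataFrame(
            [("facebook", 1, 28), ("facebook", 2, 50), ("facebook", 4, 154),
             ("weather", 3, 1000)],
            ["app", "user_id", "cnt"])

        totals = df.groupBy("app").agg(
            F.sum("cnt").alias("total"),          # C(X): app total across the network
            F.count("user_id").alias("n_users"))  # users the app appeared for

        scores = (df.join(totals, "app")
                    .withColumn("p", F.col("cnt") / F.col("total"))   # P(x_i) = C(x_i)/C(X)
                    .withColumn("term", -F.col("p") * F.log2("p"))
                    .groupBy("app", "n_users")
                    .agg(F.sum("term").alias("H"))
                    # Scale to 0-100; an app seen for only one user scores 0.
                    .withColumn("score",
                                F.when(F.col("n_users") > 1,
                                       100 * F.col("H") / F.log2(F.col("n_users")))
                                 .otherwise(F.lit(0.0))))
        scores.show()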
  24. Normalized Entropy: Static Allocation. Config (default → optimized): spark.executor.instances 2 → 100; spark.executor.cores 1 → 4; spark.executor.memory 1g → 2g; spark.sql.shuffle.partitions 200 → 400. Completion time: 8 min (default) → 23 sec (optimized).
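     Static settings like these have to be in place before the application starts; a sketch applying the optimized column at session creation (values from the slide):

        from pyspark.sql import SparkSession

        # Static allocation: a fixed pool of 100 executors x 4 cores.
        # These must be set before the SparkContext is created.
        spark = (SparkSession.builder
                 .config("spark.executor.instances", "100")
                 .config("spark.executor.cores", "4")
                 .config("spark.executor.memory", "2g")
                 .config("spark.sql.shuffle.partitions", "400")
                 .getOrCreate())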
  25. Normalized Entropy: Dynamic Allocation. Config (default → optimized): spark.dynamicAllocation.maxExecutors infinity → 100; spark.executor.cores 1 → 4; spark.executor.memory 1g → 2g; spark.sql.shuffle.partitions 200 → 400. Completion time: 41 sec with the optimized settings, 12x faster than the default static configuration but 1.8x slower than the optimized static configuration.
  26. Productionization
  27. Build an end-to-end product
  28. Performance Tracking.
      import mlflow
      import mlflow.spark as mlsp

      mlflow.set_tracking_uri('http://tracking-server/')
      mlflow.set_experiment('datasource_0')

      # One row of metrics per date partition (the read itself is elided on the slide).
      data = spark.read.(...).collect()
      for i in range(len(data)):
          with mlflow.start_run(run_name=data[i]['date_part']) as run:
              mlflow.log_metrics({m: v for (m, v) in data[i].asDict().items()
                                  if m != 'date_part'})  # metric values must be numeric
  29. Performance Tracking (cont.)
  30. Demo
  31. References. Open-source references were used for describing the ad fraud scenarios:
      ▪ https://www.buzzfeednews.com/article/craigsilverman/google-banned-cootek-adware (slide 2)
      ▪ https://www.euronews.com/2018/11/28/feds-say-russian-cybercriminals-duped-u-s-companies-out-tens-n940946 (slide 4)
      ▪ https://www.bleepingcomputer.com/news/security/russian-methbot-operation-makes-up-to-5-million-per-day-from-click-fraud/ (slide 6)
      ▪ https://www.thedrum.com/news/2019/06/06/cost-global-ad-fraud-could-top-30bn (slide 4)
  32. 32. Q&A
  33. Feedback. Your feedback is important to us. Don't forget to rate and review the sessions.
