Confessions of a Data Scientist
1. 8 challenges that data
scientists have confessed
Salford Systems
http://www.salford-systems.com
2. #1 Not knowing when to STOP.
This can be challenging because there is always the
hope that your model and/or results can be improved a
bit more, and a bit more, and just a little bit more. The
point of diminishing returns is difficult to identify, and
much more time may be spent for a very marginal
benefit.
3. #2 Guilty of data torture.
"If you torture data long enough, it will confess." Any
effect can be 'detected' by looking at the data in a
certain, very specific way (even if there is no effect at
all).
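The "data torture" point can be made concrete with a small simulation (a minimal sketch, not part of the original deck): if you test enough random "features" against a random outcome, some of them will look statistically significant by chance alone. Everything here is pure noise; the ~0.361 threshold is the standard critical |r| for two-sided p < 0.05 with n = 30 observations.

```python
import math
import random

random.seed(0)

def pearson_r(x, y):
    """Plain Pearson correlation coefficient, computed from scratch."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

n_obs, n_features = 30, 200

# An outcome and 200 candidate "predictors" -- all of it pure random noise.
outcome = [random.gauss(0, 1) for _ in range(n_obs)]
features = [[random.gauss(0, 1) for _ in range(n_obs)]
            for _ in range(n_features)]

# Critical |r| for two-sided p < 0.05 at n = 30 is roughly 0.361.
CRITICAL_R = 0.361
hits = sum(1 for f in features if abs(pearson_r(f, outcome)) > CRITICAL_R)
print(f"'Significant' correlations found in pure noise: {hits} of {n_features}")
```

With 200 tests at the 5% level you should expect around ten spurious "discoveries", which is exactly the confession the quote is about.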
4. #3 Pretending there is a signal.
A big challenge is what to do when the signal is not
there, but the client expects it. Especially when there is
big $$$ at stake.
At this point your choices are rather grim:
tell the truth and lose the contract, keep
procrastinating and hope that the client will keep paying,
or massage your data to the point of seeing something
that can remotely be presented as a success.
5. #4 Being 'bossed' around.
When your boss gives you an assignment to prove that
he is right by doing some kind of data torture, it's time
to move on.
6. #5 Client communication
(or lack thereof).
How do you communicate to the client that the petabyte
of data assembled over the years does not contain a key
variable needed to answer his business question?
This is especially difficult when the client is the person
who has historically been in charge of all data collection
decisions.
7. #6 Modeling method dilemmas.
The challenge is to choose between a super-fast linear
regression solution available on a Hadoop cluster and
an ultra-slow Neural Net solution available on your
desktop. The former has access to all of the data but
does not take any advantage of it; the latter could be
extremely useful, but you will have a heck of a time
educating the IT person in charge on the merits of
sampling and how it culminates in the famous Central
Limit Theorem in statistics.
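That case for sampling can be demonstrated in a few lines (a sketch added here, not from the original deck): draw modest samples from a large, heavily skewed "population", and the sample means cluster tightly around the true mean, with spread shrinking like sigma/sqrt(n), just as the Central Limit Theorem promises.

```python
import random
import statistics

random.seed(1)

# A large, heavily skewed population (exponential, true mean 1.0) --
# a stand-in for the raw data sitting on the cluster.
population = [random.expovariate(1.0) for _ in range(100_000)]

def sample_means(data, sample_size, n_samples):
    """Means of repeated random samples drawn from the data."""
    return [statistics.fmean(random.sample(data, sample_size))
            for _ in range(n_samples)]

means = sample_means(population, sample_size=400, n_samples=1_000)

print(f"population mean : {statistics.fmean(population):.3f}")
print(f"mean of means   : {statistics.fmean(means):.3f}")
# Std of the sample means ~ sigma/sqrt(400) = 0.05, despite the skew.
print(f"std of means    : {statistics.stdev(means):.3f}")
```

A sample of 400 rows already pins down the mean to a couple of percent; you rarely need the whole petabyte to estimate it.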
8. #7 Being term-savvy.
It can be difficult to stay up-to-date on all of the
terminology people use these days to give new life
to frequency tables and descriptive statistics.
However, this is where the ultimate utility of Wikipedia
comes to the rescue, or even Google Scholar for the
more intrepid among us.
If all else fails, you may always invent your own term or
claim that in your domain the term mentioned has a
different meaning.
9. #8 Open source. 'Nuff said.
A big challenge is using open-source software as
much as possible and hoping that it actually works. Even
worse, spending hours learning how to use it only to
discover that it can't do what you want because of some
obscure memory limitation, a very bizarre bug that
occurs only on your workstation, or a run that takes
forever to complete. Well, at least you did not have to
pay for it, literally...
10. Like what you’ve read?
Subscribe to the blog:
http://info.salford-systems.com/subscribe-to-this-blog