Confessions of a Data Scientist
1. 8 challenges that data
scientists have confessed
Salford Systems
http://www.salford-systems.com
2. #1 Not knowing when to STOP.
This can be challenging because there is always the
hope that your model and/or results can be improved a
bit more, and a bit more, and just a little bit more. The
point of diminishing returns is difficult to identify, and
much more time may be spent for a very marginal
benefit.
3. #2 Guilty of data torture.
"If you torture data long enough, it will confess." Any
effect can be 'detected' by looking at the data in a
certain, very specific way (even if there is no effect at
all).
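The "data torture" point can be made concrete with a small simulation (a minimal sketch, not part of the original deck): if you test enough random "features" against a random outcome, some of them will look statistically significant by chance alone. Everything here is pure noise; the ~0.361 threshold is the standard critical |r| for two-sided p < 0.05 with n = 30 observations.

```python
import math
import random

random.seed(0)

def pearson_r(x, y):
    """Plain Pearson correlation coefficient, computed from scratch."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

n_obs, n_features = 30, 200

# An outcome and 200 candidate "predictors" -- all of it pure random noise.
outcome = [random.gauss(0, 1) for _ in range(n_obs)]
features = [[random.gauss(0, 1) for _ in range(n_obs)]
            for _ in range(n_features)]

# Critical |r| for two-sided p < 0.05 at n = 30 is roughly 0.361.
CRITICAL_R = 0.361
hits = sum(1 for f in features if abs(pearson_r(f, outcome)) > CRITICAL_R)
print(f"'Significant' correlations found in pure noise: {hits} of {n_features}")
```

With 200 tests at the 5% level you should expect around ten spurious "discoveries", which is exactly the confession the quote is about.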
4. #3 Pretending there is a signal.
A big challenge is what to do when the signal is not
there, but the client expects it. Especially when there is
big $$$ at stake.
At this point your choices are rather grim:
tell the truth and lose the contract, keep
procrastinating and hope that the client will keep paying,
or massage your data to the point of seeing something
that can remotely be presented as a success.
5. #4 Being 'bossed' around.
When your boss gives you an assignment to prove that
he is right by doing some kind of data torture, it's time
to move on.
6. #5 Client communication
(or lack thereof).
How do you communicate to the client that the petabyte
of data assembled over the years does not contain a key
variable needed to answer his business question?
This is especially difficult when the client is the person
who has historically been in charge of all data collection
decisions.
7. #6 Modeling method dilemmas.
The challenge is to choose between a super-fast linear
regression solution available on a Hadoop cluster and
an ultra-slow Neural Net solution available on your
desktop. The former has access to all of the data but
does not take any advantage of it; the latter could be
extremely useful, but you will have a heck of a time
educating the IT person in charge on the merits of
sampling and how it culminates in the famous Central
Limit Theorem in statistics.
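That case for sampling can be demonstrated in a few lines (a sketch added here, not from the original deck): draw modest samples from a large, heavily skewed "population", and the sample means cluster tightly around the true mean, with spread shrinking like sigma/sqrt(n), just as the Central Limit Theorem promises.

```python
import random
import statistics

random.seed(1)

# A large, heavily skewed population (exponential, true mean 1.0) --
# a stand-in for the raw data sitting on the cluster.
population = [random.expovariate(1.0) for _ in range(100_000)]

def sample_means(data, sample_size, n_samples):
    """Means of repeated random samples drawn from the data."""
    return [statistics.fmean(random.sample(data, sample_size))
            for _ in range(n_samples)]

means = sample_means(population, sample_size=400, n_samples=1_000)

print(f"population mean : {statistics.fmean(population):.3f}")
print(f"mean of means   : {statistics.fmean(means):.3f}")
# Std of the sample means ~ sigma/sqrt(400) = 0.05, despite the skew.
print(f"std of means    : {statistics.stdev(means):.3f}")
```

A sample of 400 rows already pins down the mean to a couple of percent; you rarely need the whole petabyte to estimate it.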
8. #7 Being term-savvy.
It can be difficult to stay up-to-date on all of the
terminology people use these days to give new life
to frequency tables and descriptive statistics.
However, this is where the ultimate utility of Wikipedia
comes to the rescue, or even Google Scholar for the
more intrepid among us.
If all else fails, you may always invent your own term or
claim that in your domain the term mentioned has a
different meaning.
9. #8 Open source. 'Nuff said.
A big challenge is using open-source software as
much as possible and hoping that it actually works. Even
worse, spending hours learning how to use it only to
discover that it can't do what you want because of some
obscure memory limitation, a very bizarre bug that
occurs only on your workstation, or a run that takes
forever to complete. Well, at least you did not have to
pay for it, literally...
10. Like what you’ve read?
Subscribe to the blog:
http://info.salford-systems.com/subscribe-to-this-blog