3. Hi, I’m Katie. I’m a big nerd, and I do data science.
This is me at my previous job, being
a graduate physics student working
at a particle collider.
It was fun, and I got good at science,
but there was a lot I had to learn at
my first data science job.
6. The next few slides are literal training slides from my
internal roadshow, introducing Rocky to its users.
I’m not proud of what I’m about to show you.
Please be gentle.
7. Civis Analytics | Proprietary and Confidential
Inspiration and Use Cases
“Can we put Civis Research together
with modeling? Like, build and score
models as part of the standard Civis
Research workflow”
“It would be dope if, when the
omnibus comes in each week,
we could just automatically
build models of all the
questions”
“I want to build a model for
every variable in GFK”
8. Step 1: Build DVSets
Names of your
dvsets will be
printed to the
logs
*note: you may also see a “credential_id” parameter. This should be kept at its default value, 5263
9. Backend: Setting up a dvset
kiwi tables are basically our
config files
kiwi.tables
- table id
- primary key
kiwi.depvars
- depvar id
- column name
- table id
kiwi.dvsets
- model type
- dvset name
- depvar id
insert (2541992, "voterbase_id") into kiwi.tables
### makes an API call to get the names of all the columns
insert ("romance", 2541992),
("comedy", 2541992),
("horror", 2541992)
into kiwi.depvars
### returns a list of auto-incremented depvar ids
insert
(dv_id_1, "movies_dvset_GBT", "gradient boosting classifier"),
(dv_id_2, "movies_dvset_GBT", "gradient boosting classifier"),
(dv_id_3, "movies_dvset_GBT", "gradient boosting classifier")
into kiwi.dvsets
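The three inserts above can be tried out locally. What follows is only a minimal sketch, assuming an in-memory SQLite database in place of the real warehouse. The table names (`kiwi_tables`, `kiwi_depvars`, `kiwi_dvsets`), the column names, and the helper `register_dvset` are hypothetical, chosen to mirror the field lists on the slide, and are not Rocky's actual schema.

```python
import sqlite3

# In-memory SQLite stands in for the real warehouse; table and column
# names are assumptions based on the field lists on the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE kiwi_tables  (table_id INTEGER PRIMARY KEY, primary_key TEXT);
    CREATE TABLE kiwi_depvars (depvar_id INTEGER PRIMARY KEY AUTOINCREMENT,
                               column_name TEXT, table_id INTEGER);
    CREATE TABLE kiwi_dvsets  (depvar_id INTEGER, dvset_name TEXT, model_type TEXT);
""")


def register_dvset(conn, table_id, primary_key, columns, dvset_name, model_type):
    """Register a table, its dependent variables, and one dvset over them."""
    conn.execute("INSERT INTO kiwi_tables VALUES (?, ?)", (table_id, primary_key))
    depvar_ids = []
    for col in columns:
        cur = conn.execute(
            "INSERT INTO kiwi_depvars (column_name, table_id) VALUES (?, ?)",
            (col, table_id),
        )
        depvar_ids.append(cur.lastrowid)  # the auto-incremented depvar id
    # One dvset row per dependent variable, all sharing a name and model type.
    conn.executemany(
        "INSERT INTO kiwi_dvsets VALUES (?, ?, ?)",
        [(dv_id, dvset_name, model_type) for dv_id in depvar_ids],
    )
    conn.commit()
    return depvar_ids


depvar_ids = register_dvset(
    conn, 2541992, "voterbase_id",
    ["romance", "comedy", "horror"],
    "movies_dvset_GBT", "gradient boosting classifier",
)
```

Keeping the dvset as one row per dependent variable means a whole set can later be looked up by its name alone.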
10. Step 2: Run a DVSet
dvset name here
training table here
make sure dvset table
and training table are on
the same cluster!
put in your username (for
finding your S3
credential)
*note: you may also see a “credential_id” parameter. This should be kept at its default value, 5263
11. Backend: running the dvset
kiwi tables are basically our
config files
kiwi.tables
- table id
- primary key
kiwi.depvars
- depvar id
- column name
- table id
kiwi.dvsets
- model type
- dvset name
- depvar id
### kiwi.dvsets → dependent variables → depvar table
### auto-generate SQL code:
create view public.rocky_train as select
depvar_table.comedy, depvar_table.romance, depvar_table.horror,
basefile.*
from ts.modeling_commercial basefile
join depvar_table
on basefile.voterbase_id = depvar_table.voterbase_id
### export the view, then train one model per dependent variable:
file_id = export_redshift_to_S3(public.rocky_train)
for dv in ("comedy", "romance", "horror"):
    mp = civis_model.ModelPipeline(
        depvar = dv,
        workflow = "gradient boosting classifier",
        excluded_cols = [all other dvs])
    mp.train(file_id = file_id)
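The auto-generation step can be sketched in plain Python. This is an illustration under stated assumptions, not the tool's real code: `build_train_view_sql` and `training_jobs` are hypothetical helpers, and the jobs are plain dictionaries rather than calls to any modeling library.

```python
def build_train_view_sql(depvars, depvar_table, basefile, key,
                         view="public.rocky_train"):
    """Return SQL for a view joining the dependent variables onto the basefile."""
    dv_cols = ", ".join(f"dv.{col}" for col in depvars)
    return (
        f"create view {view} as\n"
        f"select {dv_cols}, basefile.*\n"
        f"from {basefile} basefile\n"
        f"join {depvar_table} dv\n"
        f"  on basefile.{key} = dv.{key}"
    )


def training_jobs(depvars, workflow):
    """One job per dependent variable; the other dvs are excluded as features."""
    return [
        {
            "depvar": dv,
            "workflow": workflow,
            "excluded_cols": [other for other in depvars if other != dv],
        }
        for dv in depvars
    ]


depvars = ["comedy", "romance", "horror"]
sql = build_train_view_sql(
    depvars, "depvar_table", "ts.modeling_commercial", "voterbase_id"
)
jobs = training_jobs(depvars, "gradient boosting classifier")
```

Excluding the other dependent variables matters because they sit in the same training view, and leaving them in would let one survey answer leak into the model for another.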
12. What’s the right way to parallelize model-building: “map” step
[Diagram: tables keyed on voterbase_id, with dependent-variable columns freq_theatergoer, genre_comedy, genre_scifi, and genre_romantic alongside “the usual basefile stuff”.]
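One way to picture a "map" step is one independent training task per dependent variable, each seeing the same basefile columns. The sketch below is an assumption-laden illustration, not the approach the slides settle on: `train_one` is a placeholder that returns a label instead of a model, `map_model_builds` is a hypothetical helper, and the feature names `age` and `state` are invented stand-ins for the basefile columns.

```python
from concurrent.futures import ThreadPoolExecutor


def train_one(job):
    """Placeholder for one model build; returns a label instead of a model."""
    depvar, feature_cols = job
    return f"model for {depvar} on {len(feature_cols)} feature columns"


def map_model_builds(depvars, basefile_cols, max_workers=4):
    """Map step: one independent training task per dependent variable."""
    jobs = [(dv, basefile_cols) for dv in depvars]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip lines them up.
        return dict(zip(depvars, pool.map(train_one, jobs)))


results = map_model_builds(
    ["freq_theatergoer", "genre_comedy", "genre_scifi", "genre_romantic"],
    ["age", "state"],  # hypothetical stand-ins for "the usual basefile stuff"
)
```

Because each task depends only on its own dependent variable, the tasks can run in any order or on separate workers without coordinating.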
13. Step 5: Take a look at your models
14. Did you get all that?
Of course not. Those slides were terrible.
Even worse, nobody used it.
19. Principles
1. The design is based upon an explicit understanding of users, tasks
and environments.
2. Users are involved throughout design and development.
3. The design addresses the whole user experience.
4. The design is driven and refined by user-centered evaluation.
5. The process is iterative.
20. 1 / 5 Understand your users, tasks and environments
21. 2 / 5 Keep users involved throughout the process
32. Principle 1: the design is based upon an explicit understanding of
users, tasks, and environments
Our user: a data scientist creating models for a business user
to use
Their task: the business user wants to cut lists of people based on
modeled predictions
Can we build a tool to help with the list-cutting?
36. In conclusion...
If you’re a data scientist, you should care about people using the things you build,
and you will build things that people use if you’re user-centered in your mindset.
If you’re a business user, give problems, not solutions, and have a little empathy the
other way: be engaged, be patient, give thoughtful feedback, have fun.
You own the outcome together!