The document discusses obtaining labeled data and introduces weak supervision as an alternative to full manual labeling. It notes that weak supervision uses labeling functions to generate noisy training labels at scale, which can then be combined using a generative model to infer true labels. The document also briefly mentions Snorkel, a system for creating labeling functions, and Snuba, its successor which focuses on scaling to very large datasets.
2. About Me
● Google Engineer (2007-11)
● Cloudera’s Director of Data Science (2011-15)
● Slack’s Director of Data Engineering (2015-2017)
● Slack Engineer (2017-
11. Search Problems: A Comparison
1. Corpus/queries are public.
2. Lots of head queries.
3. Web pages want to be found.
1. Corpus/queries are private.
2. Almost no head queries.
3. Messages don’t care about being found.
There is no “what’s next,” although, consulting the trusty Silicon Valley hierarchy of needs chart, I see a number of Medium thinkpieces in my not-too-distant future.
So what we’re talking about today:
My personal life
My startup
And, at the after party, I will be happy to give you my unique and contrarian take on WeWork.
The highly clichéd (but essentially accurate) desire of people who leave successful companies is to start another company that implements the one key feature they thought would make the company, but could never actually convince it to invest in before they left. Because, let’s be honest, if they had convinced the company to do it, they would still be working there.
Unfortunately, I have several such ideas. And I sort of need to get them out of my system, because the point of this time off is to clear my head and get myself ready for what’s next. And so that’s what we’re going to talk a bit about today.
If you don’t know what Slack is, this is Slack. There are these things called channels, and people can subscribe to them and then send messages to one another. It’s sort of like Kafka, but for people.
My first terrible idea: Slack, but for Jupyter notebooks.
My other class of startup ideas is all about search, largely because I spent a good solid year rebuilding Slack search, which you can see my colleague John Gallagher and me talking about here: https://www.youtube.com/watch?v=EQ336PTZfhU
The good news is that there are already a number of startups in this space, and I know this because many of them have tried to hire me. So this talk is my way of giving them all the exact same advice about how I think they should approach the hardest part of doing a really good job of enterprise search, especially the ones coming from a large-scale search background at, say, Google or a large e-commerce company.
Slack search is only really good at one thing: finding something when a) you know it already exists (possibly because you wrote it yourself), and b) you have a pretty good memory of what terms were involved, what channel it was in, etc. This is often very useful when it is paired with a DevOps culture that involves posting pretty much any ad hoc command you run into a channel so that the knowledge of the magic can be distributed far and wide.
But no one should mistake this for Google search, or think that the relevance problem in enterprise search is even remotely solved like it is for the web.
A bit about why Slack search is hard and why Google actually has it pretty easy.
The blessing and the curse of Slack search: you can always ask someone who knows.
The problem we have now is that Google’s position on Maslow’s hierarchy of needs is so far removed from the reality of an enterprise search startup that it leads us to think that the bells and whistles are what matter and we no longer see what all of the infrastructure is built on: high-quality click data.
Spelling correction, learning-to-rank algorithms, synonym detection, etc., etc. all rest on the strong signal from one core mechanism: the query-click pairing.
And this foundation is easy to take for granted; we rarely actually talk about it because all of our sophisticated machinery is predicated on its existence. It’s the elephant in the room, the water that fish swim in, the air we breathe.
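To make the query-click pairing a bit more concrete, here is a minimal sketch of how a raw click log might be rolled up into the kind of per-query relevance signal that the fancier machinery consumes. Everything in it is an assumption for illustration: the log entries, the message IDs, and the function names `aggregate_clicks` and `click_share_labels` are all made up, and a real system would also have to deal with position bias, impressions without clicks, and so on.

```python
from collections import Counter, defaultdict


def aggregate_clicks(log):
    """Count how often each document was clicked for each query."""
    counts = defaultdict(Counter)
    for query, doc_id in log:
        counts[query][doc_id] += 1
    return counts


def click_share_labels(counts):
    """Turn raw counts into (query, doc_id, share_of_clicks) rows.

    The share is a crude relevance label: the fraction of a query's
    clicks that landed on a given document.
    """
    rows = []
    for query, docs in counts.items():
        total = sum(docs.values())
        for doc_id, n in docs.items():
            rows.append((query, doc_id, n / total))
    return rows


# Hypothetical click log: (query, clicked document) pairs.
log = [
    ("deploy script", "msg_17"),
    ("deploy script", "msg_17"),
    ("deploy script", "msg_42"),
    ("deploy scirpt", "msg_17"),  # a misspelled query, same click
]

counts = aggregate_clicks(log)
labels = click_share_labels(counts)

for row in labels:
    print(row)
```

Note that the misspelled query ends up pointing at the same message as the correctly spelled one. That overlap is the sort of signal a spelling corrector or synonym detector could lean on, and it only exists if you have enough clicks per query to begin with.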
If for no other reason than it gives agency to our users.
I get it: labeling data is terrible. You don’t want to do it. You even feel bad conning your interns into doing it for you. Good for you! It shows you have a conscience.