Deep learning has evolved not linearly but through a series of step functions: sudden, unexpected bursts of capability that fundamentally changed the envelope of what computers are able to do. At TwentyBN, we have built spatio-temporal video models and a data infrastructure that allowed us to grow a dataset of approximately one million labeled videos showing everyday common-sense scenes and situations, many of them extremely subtle. This allowed us to successfully train neural networks end-to-end on a wide range of action understanding tasks that neither hand-engineering nor neural networks had appeared anywhere near solving just a few months ago. I will show how these recognition tasks now drive commercial value at TwentyBN, and how they drive our long-term AI agenda for learning common-sense world knowledge through video.