Speech is generally considered to have three parts: the visual, the aural, and the social construct. Although the field has moved at a dramatic pace in recent years, progress is being made in silos. The primary reason is that practitioners and researchers alike treat speech as "spoken text". Most open-source datasets, because of their distance from real-world conditions, reinforce this false impression. Under this view, it is not surprising that common and important features of speech, such as intonation and disfluency, go uncaptured. This tutorial aims to provide an appreciation of the "full stack" of speech: the aural, the visual, and the textual (or social construct) parts, with special emphasis on aspects that may have significance for current and future research.