Hello GPT, over. Copy that, how can I help you? Over.

Technology has mostly relied on “push-to-talk” interactions – you explicitly tell the computer when it’s your turn. That was GPT’s first voice mode. That’s Alexa or Siri. That’s walkie-talkie talk.

It works, but do we like it? It’s clearly not how we usually interact with other people, and we want systems that understand us the way other humans do.

It turns out that turn-taking is a highly complex human skill. One way of seeing conversations is as a stream of multiple “interruptible moments” when turn-taking is subtly negotiated. It’s like passing a potato – at each of these interruptible moments, the current speaker can pass the potato to the listener; the current speaker can offer the potato, but the listener might not take it; or the listener can grab the potato “by force,” even if the speaker hasn’t handed it over.

If you’re a computer listening to me…

… and I say “uuuuhm,”

I might have forgotten a word and would like you to help by interrupting me, as in: “I met your previous Spanish teacher, uuuhmm (trying to remember the name, help me)…”

But I might also be filling a pause to avoid being interrupted while I work out the next thing I want to say, as in: “I feel like having a burger, but I had that yesterday… uuhm (don’t interrupt me, I’m thinking), maybe I should have pizza tonight.”

… now I say, “I’m heading up to my parents’ cabin.”

I want to continue with “next weekend.” Yet the sentence I just said is already syntactically complete, so syntax alone isn’t enough to signal that we’re done talking. Look at how the conversation might go on – each “/” marks a point where what has been said so far is syntactically complete, so the listener could plausibly take the turn there:

B: oh / that sounds nice / for how long / do you plan to stay?

A: maybe the entire weekend / I’m not sure / yet /

B: let me know / I might join you / if that’s okay /

… and if I pause?

Well, it turns out that silence is one of the worst ways to predict turn-taking. Pauses within one speaker’s turn are, on average, longer than the gaps between the turns of two speakers. Observe this yourself in your next conversation; it’s real.

And this varies across languages: the average gap length ranges from 7ms in Japanese to 469ms in Danish.
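To see why a plain silence timer struggles, here is a minimal sketch of one. Everything in it is hypothetical: the `Silence` record, the function names, and especially the durations, which are invented for illustration and not taken from any study. The only thing it borrows from the text is the pattern that pauses inside a turn can be longer than the gaps between turns.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Silence:
    duration_ms: int
    speaker_done: bool  # True = gap between turns, False = pause inside a turn


def count_errors(silences: List[Silence], threshold_ms: int) -> Tuple[int, int]:
    """Naive endpointing: take the turn whenever silence reaches threshold_ms.

    Returns (interruptions, missed_turns) for the given threshold.
    """
    interruptions = 0  # system spoke although the speaker was only pausing
    missed_turns = 0   # speaker was done, but the system stayed silent
    for s in silences:
        takes_turn = s.duration_ms >= threshold_ms
        if takes_turn and not s.speaker_done:
            interruptions += 1
        if s.speaker_done and not takes_turn:
            missed_turns += 1
    return interruptions, missed_turns


# Invented durations: the within-turn pauses are longer than the gaps.
silences = [
    Silence(700, speaker_done=False),  # "uuhm", still thinking
    Silence(200, speaker_done=True),   # end of turn
    Silence(450, speaker_done=False),  # mid-sentence pause
    Silence(250, speaker_done=True),   # end of turn
]

for threshold_ms in (150, 300, 800):
    print(threshold_ms, count_errors(silences, threshold_ms))
```

With these made-up numbers, no threshold gets everything right: a short one interrupts the thinking pauses, and a long one sits through the real handovers.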

In fact, because our response time is so short, we must start preparing our responses before the other person has finished speaking – otherwise, we’d need somewhere between 600ms and 1500ms to respond, instead of the roughly 200ms we usually take.

A system must continuously check for cues, constantly prepare a response, judge whether it is the right moment to interrupt, and learn how to negotiate overlaps – all in real time. Quite hard. Kids take about six years to learn it; before that, according to researchers, their interaction patterns differ, and they take longer to recognize when it’s their turn.
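As a rough picture of what “all in real time” means, here is a minimal sketch of such a loop. It is an assumption-laden toy, not how any real voice assistant works: `run_turn_taking_loop`, `draft_response`, and `should_take_turn` are hypothetical names, and the input is a list of words rather than live audio. The point is only the shape: a reply is kept ready after every word, and the decision to speak is re-evaluated each time.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def run_turn_taking_loop(
    words: Iterable[str],
    draft_response: Callable[[List[str]], str],
    should_take_turn: Callable[[List[str]], bool],
) -> Optional[Tuple[int, str]]:
    """Listen word by word, always keeping a reply ready.

    Returns (number of words heard, reply) at the moment the system takes
    the turn, or None if it never finds a moment to speak.
    """
    heard: List[str] = []
    for word in words:
        heard.append(word)
        reply = draft_response(heard)   # prepare before the turn is ours
        if should_take_turn(heard):     # re-judged at every step
            return len(heard), reply
    return None


# Toy stand-ins (hypothetical): reply by echoing, speak after a full stop.
utterance = ["I'm", "heading", "up", "to", "my", "parents'", "cabin",
             "next", "weekend."]
result = run_turn_taking_loop(
    utterance,
    draft_response=lambda heard: "Reply to: " + " ".join(heard),
    should_take_turn=lambda heard: heard[-1].endswith("."),
)
print(result)
```

The full stop is a cheat, of course: spoken language does not come with punctuation, which is exactly why the decision function is the hard part.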

We use a variety of cues for turn-taking. Yes, syntax and speech gaps, but also prosody, breathing, gaze, gestures, all combined. And we likely learn each individual’s unique interaction style and adapt to them, building a repertoire of interaction patterns with each person we know.
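One way to picture “all combined” is a single score that several cues push up or down. The sketch below is purely illustrative: the `Cues` fields, the `turn_yield_score` function, and every weight are assumptions made up for this example, not values from the literature, and a real system would have to learn them (and probably per language and per person).

```python
from dataclasses import dataclass


@dataclass
class Cues:
    syntactically_complete: bool
    silence_ms: int
    turn_final_prosody: bool  # language-dependent pitch/volume pattern
    filled_pause: bool        # "uuhm"
    gaze_at_listener: bool
    gesturing: bool


def turn_yield_score(cues: Cues) -> float:
    """Combine cues into a 0..1 score; higher means 'the turn is on offer'.

    All weights are invented for illustration.
    """
    score = 0.0
    if cues.syntactically_complete:
        score += 0.25
    # Silence helps only a little, and stops adding evidence after 400 ms.
    score += 0.25 * min(cues.silence_ms, 400) / 400
    if cues.turn_final_prosody:
        score += 0.2
    if cues.gaze_at_listener:
        score += 0.2
    if cues.filled_pause:
        score -= 0.3  # speaker is holding the turn
    if cues.gesturing:
        score -= 0.3  # listeners tend not to take the turn mid-gesture
    return max(0.0, min(1.0, score))


still_thinking = Cues(True, 300, False, True, False, False)
handing_over = Cues(True, 150, True, False, True, False)

print(turn_yield_score(still_thinking), turn_yield_score(handing_over))
```

Note that the “still thinking” case has the longer silence and still scores lower, because the filled pause and the averted gaze outweigh it.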

In some languages, but not all, sentences end with a rising pitch. In English, speakers often lower their volume when finishing a turn, but what about other languages?

Audible breathing can signal a desire to speak. Remember when you prepared to interrupt, took a deep breath, missed the timing, and then exhaled heavily?

When we begin speaking, we often look away and then shift our gaze back to the listener. It sounds odd, but I’ve been observing this, and it happens surprisingly often. During a turn, we oscillate between looking at the listener and looking away, but when we want to finish, we typically re-establish eye contact.

Gestures also play a role. Listeners tend not to “steal” the turn when the speaker is performing certain types of gestures.

Turn-taking is a skill, and there’s so much happening in every interaction. Voice Mode is actually pretty impressive, considering it can only rely on our voices =O

Skantze, G. (2021). Turn-taking in Conversational Systems and Human-Robot Interaction: A Review. Computer Speech & Language, 67, 101178. https://doi.org/10.1016/j.csl.2020.101178

Sacks, H., Schegloff, E., & Jefferson, G. (1974). A simplest systematics for the organization of turn-taking for conversation. Language, 50(4), 696–735.

Stivers, T., Enfield, N. J., Brown, P., Englert, C., Hayashi, M., Heinemann, T., Hoymann, G., Rossano, F., de Ruiter, J.-P., Yoon, K.-E., & Levinson, S. C. (2009). Universals and cultural variation in turn-taking in conversation. Proceedings of the National Academy of Sciences, 106(26), 10587–10592. https://doi.org/10.1073/pnas.0903616106

BEATRIZ ZANFORLIN