Video-mediated interpreting has become a prominent form of interpreting; however, the diverse configurations that have emerged in the post-pandemic era present new challenges and call for further investigation. This study examines a video conference involving four primary interlocutors and an interpreter connected from distributed locations, with an audience observing the interaction, in order to elucidate how turn-transition (TT) is managed by multiple remote participants. The paper first reviews previous studies on TT and video-based interaction from a multimodal perspective and outlines the methodology applied. To contextualise the analysis, background information about the event and its participants is then provided, with particular attention to the interactional ecology and technical setup. Drawing on the multimodal approach in Conversation Analysis, the study analyses TT instances with a focus on intra-turn gaps and overlapping turn-beginnings. Findings show that lengthy gaps are common, while overlaps are rare. This pattern reflects participants’ controlled behaviour, which is shaped by the multiparty and multi-sited configuration. The presence of the audience adds a formal dimension that also affects participants’ behaviour. Overall, the study contributes to a more comprehensive understanding of diversified technology-based interpreting practices.