A system that has access to both the microphone and the headphones could coordinate the two based on past experience (e.g., by learning the typical delay between them). I don't know how much data you would need to make that reliable, though.
If you know how the data is going to be transmitted, and assume little network delay, recording yourself before a video session might be of great help. It's probably the same problem that TV broadcasters have. It's hard work to get quality real-time content.
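As a rough illustration of what a pre-session recording could give you, here is a minimal sketch that estimates the delay between a known test signal and what the microphone recorded, using cross-correlation. This is my assumption about one way to do it, not an established recipe: the function name `estimate_delay` and the synthetic signals are hypothetical, and a real setup would need actual captured audio and would have to cope with noise and echo.

```python
import numpy as np

def estimate_delay(reference, recorded, sample_rate):
    """Estimate how many seconds `recorded` lags behind `reference`.

    Hypothetical helper: assumes both signals share the same sample rate
    and that `recorded` contains a delayed copy of `reference`.
    """
    reference = np.asarray(reference, dtype=float)
    recorded = np.asarray(recorded, dtype=float)
    # Full cross-correlation; output index i corresponds to
    # a lag of i - (len(reference) - 1) samples.
    corr = np.correlate(recorded, reference, mode="full")
    lag_samples = int(np.argmax(corr)) - (len(reference) - 1)
    return lag_samples / sample_rate

if __name__ == "__main__":
    # Synthetic example: a noise burst that shows up 240 samples late.
    rng = np.random.default_rng(0)
    sample_rate = 8000
    test_signal = rng.standard_normal(1000)
    mic_capture = np.zeros(2000)
    mic_capture[240:1240] = test_signal
    print(estimate_delay(test_signal, mic_capture, sample_rate))
```

Once you have a delay estimate like this, you could shift one stream by that amount before the session starts. It only holds as long as the delay stays roughly constant, which is the "little network delay" assumption above.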