CLUE: ControLling mUltiple streams for TElepresence
MUSCLE: MUltiple Stream Control for teLepresencE
MUSCAT: MUlti-Stream Control Applied to Telepresence
SCULPTS: Semantic Control for mULtiPLe Telepresence Streams

Although current telepresence systems are based on open standards such as RTP, SIP, H.264, and the H.323 suite, they cannot easily interoperate with each other without operator assistance and expensive additional equipment that translates from one vendor's system to another's. A major factor in the inability of telepresence systems to work with each other is that there is no standard description of the multiple streams of audio and video that comprise the media flows.

There are no standardized ways to exchange semantic information about what each media stream represents. For example, when using a video conferencing system with two screens and two cameras on each end, there is no standardized way to indicate which of the video streams corresponds to the left or right camera.

In a multiple screen conference, the video and audio streams sent from remote participants must be understood by receivers so that they can be presented in a coherent and life-like manner. This includes the ability to present the remote participants at their true size for their apparent distance, while maintaining correct eye contact and gestural cues, and simultaneously providing a spatial audio sound stage consistent with the video presentation. The receiving device that decides how to display the incoming information needs to understand a number of variables, such as the spatial position of the speaker, the field of view of the cameras, the camera zoom, and which media stream is related to each of the displays.

This working group is chartered to specify a way for one endpoint to indicate to another endpoint or conference bridge information about its media streams, including:

* Spatial relationships of cameras, displays, microphones, and speakers in relation to each other and to the likely position of participants
* Camera viewpoint, field of view, and depth of field, both for fixed cameras and for cameras where these are adjustable (such as pan-tilt-zoom units)
* Aspect ratio of cameras and displays (is this really needed?)
* The "role" label of a media stream, such as whether the media represents a presentation, a presenter, a participant, a moderator, or an observer

The goal is to communicate enough information about each media stream that each receiving system can make reasonable decisions about selecting and displaying media streams. This enables systems to make display choices that optimize the "just like being there" experience.

The working group will define the semantics, language, and transport mechanism necessary for communicating this information. It will consider whether the existing signaling mechanism (SDP) can be extended, or whether another messaging method should be used.

The scope includes both systems that provide a fully immersive experience and systems that interwork with them and therefore need to understand the same multiple stream semantics. The focus of this work is on multiple audio and video streams. Other media types may be considered; however, development of methodologies for them is not within the scope of this work.

Interoperation with standards-compliant systems, such as SIP-based video conferencing systems, is required. However, backwards compatibility with existing non-standards-compliant telepresence systems is not required.
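
To make the kind of per-stream information listed above more concrete, the following is a minimal, non-normative sketch in Python. It is not a proposed format: the working group has yet to define the semantics, language, and transport. All names here (Role, StreamDescription, position_m, assign_to_displays, and the room coordinate convention with x increasing from left to right) are assumptions made purely for illustration. The sketch shows how a receiver could use spatial information and a role label to decide which video stream belongs on the left or right display.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Role(Enum):
    # Role values taken from the charter text; the enum itself is hypothetical.
    PRESENTATION = "presentation"
    PRESENTER = "presenter"
    PARTICIPANT = "participant"
    MODERATOR = "moderator"
    OBSERVER = "observer"


@dataclass
class StreamDescription:
    """Hypothetical description of one media stream (all field names assumed)."""
    stream_id: str
    media: str                                   # "audio" or "video"
    role: Role
    # Capture device position in metres, in an assumed room coordinate
    # system where x increases from left to right.
    position_m: Tuple[float, float, float]
    horizontal_fov_deg: Optional[float] = None   # camera field of view
    aspect_ratio: Optional[Tuple[int, int]] = None


def assign_to_displays(streams: List[StreamDescription],
                       display_count: int) -> Dict[int, str]:
    """Map display index (0 = leftmost) to a participant video stream id.

    Streams are ordered by the x coordinate of their camera, so the
    leftmost camera is shown on the leftmost display.
    """
    videos = [s for s in streams
              if s.media == "video" and s.role is Role.PARTICIPANT]
    ordered = sorted(videos, key=lambda s: s.position_m[0])
    return {i: s.stream_id for i, s in enumerate(ordered[:display_count])}


# Example: a two-camera endpoint announcing its streams in arbitrary order.
streams = [
    StreamDescription("cam-right", "video", Role.PARTICIPANT,
                      (0.5, 0.0, 1.2), 60.0, (16, 9)),
    StreamDescription("mic-centre", "audio", Role.PARTICIPANT,
                      (0.0, 0.0, 0.8)),
    StreamDescription("cam-left", "video", Role.PARTICIPANT,
                      (-0.5, 0.0, 1.2), 60.0, (16, 9)),
]
layout = assign_to_displays(streams, display_count=2)
```

With descriptions of this kind, the left/right ambiguity in the two-screen example no longer depends on stream ordering or vendor convention; the receiver derives the layout from the announced positions.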

This working group is not currently chartered to work on issues of continuous conference control, including far end camera control, indication of fast frame update for video codecs or other rapid switches, floor control, conference roster, or active speaker information.

Reuse of existing protocols and backwards compatibility with existing systems are important factors for the working group to consider. The working group will coordinate closely with the appropriate areas and working groups, including the OPS Area, AVT, MMUSIC, MEDIACTRL, XCON, and SIPCORE.

Milestones

Nov 2010  Submit informational draft to IESG on use cases and requirements
Nov 2011  Submit standards track specification to IESG on indicating spatial relationships of screens, cameras (including variable field of view and orientation), speakers, and microphones. This includes semantics, language, and transport.
Apr 2011  Submit standards track specification to IESG on indicating the "usage" of a stream as defined in the charter.