Copyright © 2012 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
This document collates the target scenarios for the Media Capture task force. Scenarios represent the set of expected functionality that may be achieved by the use of the MediaStream Capture API. A set of unsupported scenarios may also be documented here.
This document builds on the assumption that the mechanism for obtaining fundamental access to local media capture device(s) is navigator.getUserMedia (name/behavior subject to revision by this task force), and that the vehicle for delivery of the content from the local media capture device(s) is a MediaStream. Hence the title of this note.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document is intended to represent the consensus of the media capture task force on the set of scenarios supported by the MediaStream Capture API. It will eventually be released as a Note.
This document was published by the Device APIs Working Group and Web Real-Time Communications Working Group as a First Public Working Draft. If you wish to make comments regarding this document, please send them to public-media-capture@w3.org (subscribe, archives). All feedback is welcome.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures (Device APIs Working Group, Web Real-Time Communications Working Group) made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This section is non-normative.
One of the goals of the joint task force between the Device APIs (DAP) working group and the Web Real-Time Communications (WebRTC) working group is to bring media capture scenarios from both groups together into one unified API that can address all relevant use cases.
The capture scenarios from WebRTC are primarily driven by real-time-communication scenarios, such as capturing live chats, teleconferences, and other media streamed over the network from potentially multiple sources.
The capture scenarios from DAP represent "local" capture scenarios that provide access to a user agent's camera and other related experiences.
Both groups include overlapping chartered deliverables in this space: the DAP charter specifies a recommendation-track deliverable in this area, and the WebRTC charter scope describes enabling real-time communications between web browsers, which will require specific client-side technologies.
Note that the scenarios described in this document specifically exclude declarative capture scenarios, such as those where media capture can be obtained and submitted to a server entirely without the use of script. Such scenarios generally involve the use of a UA-specific app or mode for interacting with the capture device, altering settings and completing the capture. Such scenarios are currently captured by the DAP working group's HTML Media Capture specification.
The scenarios contained in this document are specific to scenarios in which web applications require direct access to the capture device, its settings, and the capture mechanism and output. Such scenarios are crucial to building applications that can create a site-specific look-and-feel to the user's interaction with the capture device, as well as utilize advanced functionality that may not be available in a declarative model.
Some of the scenarios described in this document may overlap existing usage scenarios defined by the IETF RTCWEB Working Group. This document is specifically focused on the capture aspects of media streams, while the linked document is geared toward networking and peer-to-peer RTC scenarios.
In this section, scenarios are presented first as a story that puts the scenario into perspective, and then as a list of specific capture scenarios included in the story.
Every Wednesday at 6:45pm, Adam logs into his video podcast web site for his scheduled 7pm half-hour broadcast "commentary on the US election campaign". These podcasts are available to all his subscribers the next day, but a few of his friends tune in at 7 to listen to the podcast live. Adam selects the "prepare podcast" option, is notified by the browser that he previously approved access to his webcam and microphone, and situates himself in front of the webcam, using the "self-view" video window on the site. While waiting for 7pm to arrive, the video podcast site indicates that two of his close friends are now online. He approves their request to listen live to the podcast. Finally, at 7pm he selects "start podcast" and launches into his commentary. While capturing locally, Adam switches between several tabs in his browser to quote from web sites representing differing political views. A half-hour later, he wraps up his concluding remarks, and opens the discussion up for comments. One of his friends has a comment, but has requested anonymity, since the comments on the show are also recorded. Adam enables the audio-only setting for that friend and directs him to share his comment. In response to the first comment, another of Adam's friends wants to respond. This friend has not requested anonymity, and so Adam enables the audio/video mode for that friend, and hears the rebuttal. After a few back-and-forths, Adam sees that his half-hour is up, thanks his audience, and clicks "end podcast". A few moments later the site reports that the podcast has been uploaded.
TBD
Alice is finishing up a college on-line course on image processing, and for the assignment she has to write code that finds a blue ball in each video frame and draws a box around it. She has just finished testing her code in the browser using her webcam to provide the input and the canvas element to draw the box around each frame of the video input. To finish the assignment, she must upload a video to the assignment page, which requires uploads to have a specific encoding (to make it easier for the TA to review and grade all the videos) and to be no larger than 50MB (small camera resolutions are recommended) and no longer than 30 seconds. Alice is now ready; she enables the webcam, a video preview (to see herself), changes the camera's resolution down to 320x200, starts a video capture, and holds up the blue ball, moving it around to show that the image-tracking code is working. After recording for 30 seconds, Alice uploads the video to the assignment upload page using her class account.
TBD
Albert is on vacation in Italy. He has a device with a front and rear webcam, and a web application that lets him document his trip by way of a video diary. After arriving at the Coliseum, he launches his video diary app. There is no internet connection to his device. The app asks Albert which of his microphones and webcams he'd like to use, and he activates both webcams (front and rear). Two video elements appear side-by-side in the app. Albert uses his device to capture a few still shots of the Coliseum using the rear camera, then starts recording a video, selecting the front-facing webcam to begin explaining where he is. While talking, he selects the rear-facing webcam to capture a video of the Coliseum (without having to turn his device around), and then switches back to the front-facing camera to continue checking in for his diary entry. Albert has a lot to say about the Coliseum, but before finishing, his device warns him that the battery is about to expire. At the same time, the device shuts down the cameras and microphones to conserve battery power. Later, after plugging in his device at a coffee shop, Albert returns to his diary app and notes that his recording from the Coliseum was saved.
Albert's day job is a sports commentator. He works for a local television station and records the local hockey games at various schools. Albert uses a web-based front-end on custom hardware that allows him to connect three cameras covering various angles of the game and a microphone with which he is running the commentary. The application records all of these cameras at once. After the game, Albert prepares the game highlights. He likes to highlight great plays by showing them from multiple angles. The final composited video is shown on the evening news.
While still on his Italy vacation, Albert hears that the Pope might make a public appearance at the Vatican. Albert arrives early to claim a spot, and starts his video diary. He activates both front and rear cameras so that he can capture both himself and the camera's view. He then sets up the view in his video diary so that the front-facing camera displays in a small frame contained in one corner of the larger rear-facing camera's view rectangle (picture-in-picture). Albert excitedly describes the sense of the crowd around him while simultaneously capturing the Pope's appearance. Afterward, Albert is happy that he didn't miss the moment by having to switch between cameras.
As part of a routine business video conference call, Amanda initiates a connection to the five other field agents in her company via the company's video call web site. Amanda is the designated scribe and archivist; she is responsible for keeping the meeting minutes and also saving the associated meeting video for later archiving. As each field agent connects to the video call web site, and after granting permission, their video feed is displayed on the site. After the five other field agents check in, Amanda calls the meeting to order and starts the meeting recorder. The recorder captures all participants' audio, and selects a video channel to record based on dominance of the associated video channel's audio input level. As the meeting continues, several product prototypes are discussed. One field agent has created a draft product sketch that he shows to the group by sending the image over his video feed. This image spurs a fast-paced debate and Amanda misses several of the participants' discussion points in the minutes. She calls for a point of order, and requests that the participants wait while she catches up. Amanda pauses the recording, rewinds it by thirty seconds, and then re-plays it in order to catch the parts of the debate that she missed in the minutes. When done, she resumes the recording and the meeting continues. Toward the end of the meeting, one field agent leaves early and his call is terminated.
During the video conference call, Amanda invites a member of the product development team to demonstrate a new visual design editor for the prototype. The design editor is not yet finished, but has the UI elements in place. It currently only compiles on that developer's computer, but Amanda wants the field agents' feedback since they will ultimately be using the tool. The developer is able to select the screen as a local media source and send that video to the group as he demonstrates the UI elements.
While visiting a manufacturer's web site in order to download drivers for his new mouse, Austin unexpectedly gets prompted by his browser to allow access to his device's webcam. Thinking that this is strange (why is the page trying to use my webcam?), Austin denies the request. Several weeks later, Austin reads an article in the newspaper in which the same manufacturer is being investigated by a business-sector watchdog agency for poor business practice. Apparently this manufacturer was trying to discover how many visitors to their site had webcams (and other devices) from a competitor. If that information could be discovered, then the site would subject those users to slanderous advertising and falsified "webcam tests" that made it appear as if their competitor's devices were broken in order to convince users to purchase their own brand of webcam.
TBD
TBD
This section describes some terminology and concepts that frame an understanding of the design considerations that follow. It is helpful to have a common understanding of some core concepts to ensure that the prose is interpreted uniformly.
MediaStream vs. "media stream" or "stream": MediaStream refers to the interface as currently defined in the WebRTC spec, while "media stream" or "stream" refers to the general concept. Generally, a stream can be conceptually understood as a tube or conduit between sources (the stream's generators) and destinations (the sinks). Streams don't generally include any type of significant buffer, that is, content pushed into the stream from a source does not collect into any buffer for later collection. Rather, content is simply dropped on the floor if the stream is not connected to a sink. This document assumes the non-buffered view of streams as previously described.
MediaStream format: the content of a MediaStream is not in any particular underlying format: "The data from a MediaStream object does not necessarily have a canonical binary form; for example, it could just be 'the video currently coming from the user's video camera'. This allows user agents to manipulate media streams in whatever fashion is most suitable on the user's platform." An open question is how this opaque view of MediaStream content interacts with the potential use of the Streams API.
A shared device (in this document) is a media device (camera or microphone) that is usable by more than one application at a time. When considering sharing a device (or not), an operating system must evaluate whether applications consuming the device will have the ability to manipulate the state of the device. A shared device with manipulatable state has the side-effect of allowing one application to make changes to a device that will then affect other applications who are also sharing.
To avoid these effects and unexpected state changes in applications, operating systems may virtualize a device. Device virtualization (in a simplistic view) is an abstraction of the actual device, so that the abstraction is provided to the application rather than providing the actual device. When an application manipulates the state of the virtualized device, changes occur only in the virtualized layer, and do not affect other applications that may be sharing the device.
Audio devices are commonly virtualized. This allows many applications to share the audio device and manipulate its state (e.g., apply different input volume levels) without affecting other applications.
Video virtualization is more challenging and not as common. For example, the Microsoft Windows operating system does not virtualize webcam devices, and thus chooses not to share the webcam between applications. As a result, in order for an application to use the webcam either 1) another application already using the webcam must yield it up or 2) the requesting application may be allowed to "steal" the device.
A web application must be able to initiate a request for access to the user's webcam(s) and/or microphone(s). Additionally, the web application should be able to "hint" at specific device characteristics that are desired by the particular usage scenario of the application. User consent is required before obtaining access to the requested stream.
When the media capture devices have been obtained (after user consent), they must be associated with a MediaStream object, be active, and populated with the appropriate tracks. The active capture devices will be configured according to user preference; the user may have an opportunity to configure the initial state of the devices, select specific devices, and/or elect to enable/disable a subset of the requested devices at the point of consent or beyond (the user remains in control).
It is recommended that the active MediaStream be associated with a browser UX in order to ensure that the user remains aware of, and in control of, the active capture devices. An active MediaStream can be stopped via stop().
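As a minimal sketch (assuming the callback-style navigator.getUserMedia(constraints, success, error) signature under discussion at the time; the handler contents and the showSelfView helper are illustrative), a page might request device access like this:
navigator.getUserMedia(
  { video: true, audio: true },               // hint at the kinds of devices desired
  function (stream) {
    // Consent granted: the stream is active and populated with the appropriate tracks.
    showSelfView(stream);                      // hypothetical helper that previews the stream
  },
  function (error) {
    // Consent denied or no suitable device available; report and degrade gracefully.
    console.log('Capture request failed:', error);
  }
);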
Specific information about a given webcam and/or microphone must not be available until after the user has granted consent. Otherwise "drive-by" fingerprinting of a UA's devices and characteristics can be obtained without the user's knowledge—a privacy issue.
In addition, care must be taken that webcam and audio devices are not able to record and stream data without the user's knowledge. Explicit permission should be granted for a specific activity of a limited duration. Configuration controls should be possible to enable age-limits on webcam use or other similar techniques.
An open issue is the scope of the permission granted via getUserMedia. For example, does "video" permission mean that the user grants permission to any and all video capture devices? Similarly with "audio"? Is it a specific device only, and if so, which one? Given the privacy point above, the recommendation here is that "video" permission represents permission to all possible video capture devices present on the user's device, therefore enabling switching scenarios (among video devices) to be possible without re-acquiring user consent; the same applies to "audio" and to combinations of the two.
After requesting (and presumably gaining access to media capture devices) it is entirely possible for one or more of the requested devices to stop or fail (for example, if a video device is claimed by another application, or if the user unplugs a capture device or physically turns it off, or if the UA shuts down the device arbitrarily to conserve battery power). In such a scenario it should be reasonably simple for the application to be notified of the situation, and for the application to re-request access to the stream.
Additional information might also be useful, either in terms of MediaStream state such as an error object, or additional events like an error event (or both).
A related question is whether re-requesting access should produce a new MediaStream, or whether an "ended" media stream can be quickly revived. Reviving a local media stream makes more sense in the context of the stream representing a set of device states than it does when the stream represents a network source. The WebRTC editors are considering moving the "ended" event from the MediaStream to the MediaStreamTrack to help clarify these potential scenarios.
The application should be able to connect a media stream (representing active media capture device(s)) to one or more sinks in order to use/view the content flowing through the stream. In nearly all digital capture scenarios, "previewing" the stream before initiating the capture is essential to the user in order to "compose" the shot (for example, digital cameras have a preview screen before a picture or video is captured; even in non-digital photography, the viewfinder acts as the "preview"). This is particularly important for visual media, but applies to non-visual media like audio as well.
Note that media streams connected to a preview output sink are not in a "capturing" state as the media stream has no default buffer (see the Stream definition in section 4). Content conceptually "within" the media stream is streaming from the capture source device to the preview sink after which point the content is dropped (not saved).
The application should be able to effect changes to the media capture device(s) settings via the media stream and view those changes happen in the preview.
Today, the MediaStream object can be connected to several "preview" sinks in HTML5, including the video and audio elements. (This support should also extend to the source elements of each as well.) The connection is accomplished via URL.createObjectURL. For RTC scenarios, MediaStreams are connected to PeerConnection sinks.
An implementation should not limit the number or kind of sinks that a MediaStream is connected to (including sinks for the purpose of previewing).
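A minimal sketch of connecting a stream to a preview sink via URL.createObjectURL (the element id is illustrative):
var preview = document.getElementById('selfView');    // a <video> element used for self-view
preview.src = URL.createObjectURL(stream);            // the active MediaStream acts as the source
preview.play();
// The same stream may simultaneously feed other sinks, e.g., a PeerConnection for RTC.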
End-users need to feel in control of their devices. Likewise, it is expected that developers using a media stream capture API will want to provide a mechanism for users to stop their in-use device(s) via the software (rather than using hardware on/off buttons which may not always be available).
Stopping or ending a media stream source device(s) in this context implies that the media stream source device(s) cannot be re-started. This is a distinct scenario from simply pausing the video/audio tracks of a given media stream.
Pre-processing scenarios are a group of scenarios that perform processing on the "raw" or "internal" characteristics of the media stream for the purpose of reporting information that would otherwise require processing in a known format (i.e., at the media stream sink, like Canvas, or via capturing and post-processing), significant computationally expensive scripting, etc.
Pre-processing scenarios will require the UAs to provide an implementation (which may be non-trivial). This is required because the media stream's internal format should be opaque to user-code. Note, if a future specification described an interface to allow low-level access to a media stream, such an interface would enable user-code to implement many of the pre-processing scenarios described herein using post-processing techniques (see next section).
Pre-processing scenarios provide information that is generally desired before a stream need be connected to a sink or captured.
Pre-processing scenarios apply to both real-time-communication and local capture scenarios. Therefore, the specification of various pre-processing requirements may likely fall outside the scope of this task force. However, they are included here for scenario-completeness and to help ensure that a media capture API design takes them into account.
Post-processing scenarios are the group of scenarios that can be completed after either capturing the media stream into a known format, or connecting the media stream to a sink such as the video or audio elements.
Post-processing scenarios will continue to expand and grow as the web platform matures and gains capabilities. The key to understanding the available post-processing scenarios is to understand the other facets of the web platform that are available for use.
Note: Depending on convenience and scenario usefulness, the post-processing scenarios in the toolbox below could be implemented as pre-processing capabilities (for example, via the Web Audio API). In general, this document views pre-processing scenarios as those provided by the MediaStream and post-processing scenarios as those that consume a MediaStream.
The common post-processing capabilities for media stream scenarios are built on a relatively small set of web platform capabilities. The capabilities described here are derived from current W3C draft specifications, many of which have widely-deployed implementations:
HTML5 media elements: HTML5 defines the video and audio tags. These elements are natural candidates for media stream output sinks. Additionally, they provide an API (see HTMLMediaElement) for interacting with the source content. Note: in some cases, these elements are not well-specified for stream-type sources; this task force may need to drive some stream-source requirements into HTML5.
Canvas: HTML5 also defines the canvas element and the Canvas 2D context. The canvas element employs a fairly extensive 2D drawing API and will soon be extended with audio capabilities as well. Canvas' drawing API allows for drawing frames from a video element, which is the link between the media capture sink and the effects made possible via Canvas.
File API: The File API defines the Blob, which, put simply, is a read-only structure with a MIME type and a length. The File API integrates with many other web APIs such that the Blob can be used uniformly across the entire web platform. For example, XMLHttpRequest, form submission in HTML, message passing between documents and web workers (postMessage), and Indexed DB all support Blob use.
Streams API: The Stream is another general-purpose binary container. The primary differences between a Stream and a Blob are that the Stream is read-once and has no length. The Stream API includes a mechanism to buffer from a Stream into a Blob, and thus all Stream scenarios are a super-set of Blob scenarios.
Typed Arrays: Typed arrays allow script to open up a binary container (such as a Blob) and read/write its contents using the numerical data types already provided by JavaScript.
Of course, post-processing scenarios made possible after sending a media stream or captured media stream to a server are unlimited.
Some post-processing scenarios are time-sensitive, especially those that involve processing large amounts of data while the user waits. Other post-processing scenarios are long-running and can have a performance benefit if started before the end of the media stream segment is known (for example, applying a low-pass filter to a video).
These scenarios generally take two approaches: manipulating content from a sink connected to the media stream (for example, drawing video frames into a canvas), or capturing the media stream and processing the captured result. Both approaches are valid for different types of scenarios.
The first approach is the technique described in the current WebRTC specification for the "take a picture" example.
The second approach is somewhat problematic from a time-sensitivity/performance perspective given that the captured content is only provided via a Blob today. A more natural fit for post-processing scenarios that are time- or performance-sensitive is to supply a Stream as output from a capture. Thus time- or performance-sensitive post-processing applications can immediately start processing the (unfinished) capture, and non-sensitive applications can use the Stream API's StreamReader to eventually pack the full Stream into a Blob.
canvas.toDataURL('image/jpeg', 0.6); // or canvas.toBlob(function(blob) {}, 'image/jpeg', 0.2);
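The one-line snippet above shows only the export step; a fuller sketch of the first approach (drawing the current frame of a previewing video element into a canvas and exporting it; element ids and the uploadPhoto helper are illustrative) might look like this:
var video  = document.getElementById('preview');       // <video> element showing the live stream
var canvas = document.getElementById('snapshot');      // <canvas> used as the post-processing surface
canvas.width  = video.videoWidth;
canvas.height = video.videoHeight;
canvas.getContext('2d').drawImage(video, 0, 0, canvas.width, canvas.height);
canvas.toBlob(function (blob) {
  uploadPhoto(blob);                                    // e.g., send via XMLHttpRequest or a form
}, 'image/jpeg', 0.8);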
This task force should evaluate whether some extremely common post-processing scenarios should be included as pre-processing features.
A particular user agent may have zero or more devices that provide the capability of audio or video capture. In consumer scenarios, this is typically a webcam with a microphone (which may or may not be combined), and a "line-in" and/or microphone audio jack. Enthusiast users (e.g., audio recording enthusiasts) may have many more available devices.
Device selection in this section is not about the selection of audio vs. video capabilities, but about selection of multiple devices within a given "audio" or "video" category (i.e., "kind"). The term "device" and "available devices" used in this section refers to one or a collection of devices of a kind (e.g., that provide a common capability, such as a set of devices that all provide "video").
Providing a mechanism for code to reliably enumerate the set of available devices enables programmatic control over device selection. Device selection is important in a number of scenarios. For example, the user selected the wrong camera (initially) and wants to change the media stream over to another camera. In another example, the developer wants to select the device with the highest resolution for capture.
Depending on how stream initialization is managed in the consent user experience, device selection may or may not be a part of the UX. If not, then it becomes even more important to be able to change device selection after media stream initialization. The requirements of the user-consent experience will likely be out of scope for this task force.
Device selection should be driven by a mechanism for exposing device capabilities, which informs the application of which device to select. In order for the user to make an informed decision about which device to select (if at all), the developer's code would need to make some sort of comparison between devices; such a comparison should be done based on device capabilities rather than a guess, hint, or special identifier (see related issue below).
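For reference, the device-enumeration mechanism that was later standardized (navigator.mediaDevices.enumerateDevices()) takes roughly this shape; note that device labels are typically withheld until the user has granted consent, consistent with the privacy concerns above:
navigator.mediaDevices.enumerateDevices().then(function (devices) {
  devices.forEach(function (device) {
    if (device.kind === 'videoinput') {
      console.log('Camera:', device.label || '(label hidden until consent is granted)');
    }
  });
});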
Capture capabilities are an important decision-making point for media capture scenarios. However, capture capabilities are not directly correlated with individual devices, and as such should not be mixed with the device capabilities. For example, the capability of capturing audio in AAC vs. MP3 is not correlated with a given audio device, and therefore not a decision making factor for device selection.
In addition to selecting a device based on its capabilities, individual media capture devices may support multiple modes of operation. For example, a webcam often supports a variety of resolutions which may be suitable for various scenarios (previewing or capturing a sample whose destination is a web server over a slow network connection, capturing archival HD video for storing locally). An audio device may have a gain control, allowing a developer to build a UI for an audio blender (varying the gain on multiple audio source devices until the desired blend is achieved).
A media capture API should support a mechanism to configure a particular device dynamically to suit the expected scenario. Changes to the device should be reflected in the related media stream(s) themselves.
If a device supports sharing (providing a virtual version of itself to an app), any changes to the device's manipulatable state should be isolated to the application requesting the change. For example, if two applications are using a device, changes to the device's configuration in one app should not affect the other one.
Changes to a device capability should be made in the form of requests (async operations rather than synchronous commands). Change requests allow a device time to make the necessary internal changes, which may take a relatively long time, without blocking other script. Additionally, script code can be written to change device characteristics without careful error-detection (because devices without the ability to change the given characteristic would not need to throw an exception synchronously). Finally, a request model makes sense even in RTC scenarios, if one party of the teleconference wants to issue a request that another party mute their device (for example). The device change request can be propagated over the PeerConnection to the sender asynchronously.
In parallel, changes to a device's configuration should provide a notification when the change is made. This allows web developer code to monitor the status of a media stream's devices and report statistics and state information without polling the device (especially when the monitoring code is separate from the author's device-control code). This is also essential when the change requests are asynchronous; to allow the developer to know at which point the requested change has been made in the media stream (in order to perform synchronization, or start/stop a capture, for example).
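As an illustration of this request-plus-notification model, the constraint-application mechanism that later emerged looks roughly like the sketch below (the requested resolution values are illustrative):
var videoTrack = stream.getVideoTracks()[0];
videoTrack.applyConstraints({ width: 320, height: 200 })    // asynchronous change request
  .then(function () {
    // The device (or its virtualized view) now reflects the requested change.
    console.log('New settings:', videoTrack.getSettings());
  })
  .catch(function (err) {
    // The device could not satisfy the request; no other script was blocked while trying.
    console.log('Change request rejected:', err);
  });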
In some scenarios, users may want to initiate capture from multiple devices at one time in multiple media streams. For example, in a home-security monitoring scenario, an application may want to capture 10 unique video streams representing various locations being monitored. The user may want to collect all 10 of these videos into one capture, or capture all 10 individually (or some combination thereof).
While such scenarios are possible and should be supported (even if they are a minority of the typical web scenarios), it should be noted that many devices (especially portable devices) support media capture by way of dedicated encoder hardware, and such hardware may only be able to handle one stream at a time. Implementations should be able to provide a failure condition when multiple video sources are attempting to begin capture at the same time.
In its most basic form, capturing a media stream is the process of converting the media stream into a known format during a bracketed timeframe.
Local media stream captures are common in a variety of sharing scenarios, such as those described in the scenario stories earlier in this document (podcasting, video diaries, assignment uploads, and meeting archives).
There are other offline scenarios that are equally compelling, such as usage in native-camera-style apps, or web-based capturing studios (where tracks are captured and later mixed).
The core functionality that supports most capture scenarios is a simple start/stop capture pair.
Ongoing captures should report progress either via the user agent, or directly through an API to enable developers to build UIs that pass this progress notification along to users.
A capture API should be designed to gracefully handle changes to the media stream, and should also report (and perhaps even attempt to recover from) failures at the media stream source during capture.
Uses of the captured information are covered in the post-processing scenarios described previously. An additional consideration is the possibility of default save locations. For example, by default a UA may store temporary captures (those captures that are in progress) in a temporary (hidden) folder. It may be desirable to be able to specify (or hint at) an alternate default capture location, such as the user's common file location for videos or pictures.
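A sketch of such a start/stop capture pair, expressed with the MediaRecorder interface that later standardized this functionality (the chosen MIME type and the saveOrUpload helper are illustrative):
var recorder = new MediaRecorder(stream, { mimeType: 'video/webm' });
var chunks = [];
recorder.ondataavailable = function (event) {
  chunks.push(event.data);                      // captured data arrives incrementally (progress)
};
recorder.onerror = function (event) {
  console.log('Capture failed:', event.error);  // e.g., the source device failed mid-capture
};
recorder.onstop = function () {
  var capture = new Blob(chunks, { type: 'video/webm' });
  saveOrUpload(capture);                        // hypothetical post-processing / upload step
};
recorder.start(1000);                           // request data roughly every second
// ... later, when the user ends the capture ...
recorder.stop();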
Increasingly in the digital age, the ability to pause, rewind, and "go live" for streamed content is an expected scenario. While this scenario applies mostly to real-time communication scenarios (and not to local capture scenarios), it is worth mentioning for completeness.
The ability to quickly "rewind" can be useful, especially in video conference scenarios, when you may want to quickly go back and hear something you just missed. In these scenarios, you either started a capture from the beginning of the conference and you want to seek back to a specific time, or you were only streaming it (not saving it) but you allowed yourself some amount of buffer in order to review the last X minutes of video.
To support these scenarios, buffers must be introduced (because the media stream is not implicitly buffered for this scenario). In the capture scenario, as long as the UA can access previous parts of the capture (without terminating it) then this scenario could be possible.
In the streaming case, this scenario could be supported by adding a buffer directly into the media stream itself, or by capturing the media stream as previously mentioned. Given the complexities of integrating a buffer into the MediaStream proposal, using capture to accomplish this scenario is recommended.
The record API (as described in early WebRTC drafts) implicitly supports overlapping captures of a single media stream by simply calling record() twice. In the case of separate media streams (see previous section) overlapping recording makes sense. In either case, initiating multiple captures should not be so easy as to be accidental.
All post-processing scenarios for captured data require a known [standard] format. It is therefore crucial that the media capture API provide a mechanism to specify the capture format. It is also important to be able to discover if a given format is supported.
Most scenarios in which the captured data is sent to the server for upload also have restrictions on the type of data that the server expects (one size doesn't fit all).
It should not be possible to change capture formats on-the-fly without consequences (i.e., a stop and/or re-start or failure). It is recommended that the mechanism for specifying a capture format not make it too easy to change the format (e.g., setting the format as a property may not be the best design).
HTMLMediaElement supports an API called canPlayType which allows developers to probe the given UA for support of specific MIME types that can be played by audio and video elements. A capture format checker could use this same approach.
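A capture format checker along those lines might look like the following sketch; it uses the HTMLMediaElement probe described above, plus the MediaRecorder.isTypeSupported() check that appeared in later recording specifications:
var probe = document.createElement('video');
// canPlayType returns "", "maybe", or "probably" for a given MIME type / codec string.
console.log(probe.canPlayType('video/webm; codecs="vp8, vorbis"'));
// The later-standardized recording API exposes a direct capture-format check:
if (window.MediaRecorder && MediaRecorder.isTypeSupported('video/webm')) {
  // Safe to request a WebM capture.
}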
As mentioned in the introduction, declarative use of a capture device is out-of-scope. However, there are some potentially interesting uses of a hybrid programmatic/declarative model, where the configuration of a particular media stream is done exclusively via the user (as provided by some UA-specific settings UX), but the fine-grained control over the stream as well as the stream capture is handled programmatically.
In particular, if the developer doesn't want to guess the user's preferred settings, or if there are specific settings that may not be available via the media capture API standard, they could be exposed in this manner.
A common usage scenario of local device capture is to simply "take a picture". The hardware and optics of many camera-devices often support video in addition to photos, but can be set into a specific "camera mode" where the possible capture resolutions are significantly larger than their maximum video resolution.
The advantage to having a photo-mode is to be able to capture these very high-resolution images (versus the post-processing scenarios that are possible with still-frames from a video source).
Capturing a picture is strongly tied to the "video" capability because a video preview is often an important component to setting up the scene and getting the right shot.
Because photo capabilities are somewhat different from those of regular video capabilities, devices that support a specific "photo" mode should likely provide their "photo" capabilities separately from their "video" capabilities.
Many of the considerations that apply to video capture also apply to taking a picture.
An open question is how a dedicated photo mode should be requested via getUserMedia.
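For reference, the direction later standardization took (the MediaStream Image Capture specification) layers photo capabilities on top of an existing video track rather than defining a separate getUserMedia mode; a sketch (the displayOrUpload helper is illustrative):
var track = stream.getVideoTracks()[0];
var imageCapture = new ImageCapture(track);
imageCapture.getPhotoCapabilities().then(function (caps) {
  // Photo capabilities are reported separately from the track's video settings.
  console.log('Maximum photo width:', caps.imageWidth.max);
  return imageCapture.takePhoto();            // resolves with a Blob containing the photo
}).then(function (photoBlob) {
  displayOrUpload(photoBlob);                 // hypothetical helper
});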
Another common scenario for media streams is to share photos via a video stream. For example, a user may want to select a photo and attach the photo to an active media stream in order to share that photo via the stream. In another example, the photo can be used as a type of "video mute" where the photo can be sent in place of the active video stream when a video track is "disabled".
Special thanks to the following who have contributed to this document: Harald Alvestrand, Robin Berjon, Stefan Hakansson, Frederick Hirsch, Randell Jesup, Bryan Sullivan, Timothy B. Terriberry, Tommy Widenflycht.
No normative references.
No informative references.