Fast Forward with your VCR: Visualizing Single-Video Viewing Statistics for Navigation and Sharing

Online video viewing has seen explosive growth, yet simple tools to facilitate navigation and sharing of the large video space have not kept pace. We propose the use of single-video viewing statistics as the basis for a visualization of video called the View Count Record (VCR). Our novel visualization utilizes variable-sized thumbnails to represent the popularity (or affectiveness) of video intervals, and provides simple mechanisms for fast navigation, informed search, video previews, simple sharing and summarization. The viewing statistics are generated from an individual's video consumption, or crowd-sourced from many people watching the same video; both provide different scenarios for application (e.g. implicit tagging of interesting events for an individual, and quickly navigating to others' most-viewed scenes for crowd-sourced). A comparative user study evaluates the effectiveness of the VCR by asking participants to share previously-seen affective parts within videos. Experimental results demonstrate that the VCR outperforms the state-of-the-art in a search task, and has been welcomed as a recommendation tool for clips within videos (using crowd-sourced statistics). It is perceived by participants as effective, intuitive and strongly preferred to current methods.


INTRODUCTION
Consuming video online, on mobile devices or on home computers is now a well-accepted form of communication and entertainment, as shown by the rapid growth of providers such as YouTube™. Despite the volume of video available, methods for efficient navigation and sharing of in-video content have not provided users with the ease of use or level of personalization required to accommodate their needs. Constraints such as limiting the length of videos (e.g. six seconds on Vine™ and 15 seconds on Instagram™) can simplify the problem; however, these do not address the challenges of unconstrained video.
* {abira, mfong, gregor, ssfels}@ece.ubc.ca
Part of the problem is that the 3D spatio-temporal representation of video complicates relatively simple actions such as search or selection. Video search often taxes human memory by requiring memorization of a large quantity of previously-seen content. In particular, finding and selecting interesting parts is poorly supported by current navigation and search tools. We propose that the addition of a single-video visualization mechanism using viewing statistics will overcome some of these difficulties.
We investigate the usefulness of visualizing prior viewing by either single or multiple users to support fast navigation (to popular or unseen parts), search and direct previewing of content, without interrupting normal playback. We envision that users will watch videos differently when they have a visualization of their personal navigation: they can implicitly tag segments of video by re-viewing them (thereby increasing the view count); the visualization would also capture their natural behaviour, such as watching a funny section multiple times in a lengthy video. This non-linear viewing behaviour is already evident, for example in YouTube audience retention graphs: videos have peaks in the graphs, implying that users watch different content and likely seek to find interesting parts (unfortunately these graphs are not generally public, and require voluntary publication by video owners). Viewing graphs from crowd-sourced data often show a shallow negative exponential curve (i.e. the cold-start problem), which can be very simply filtered to highlight the most popular content. Likewise, viewing statistics can be used to filter out videos where only the first few seconds are watched.

RELATED WORK
Researchers have proposed various navigation techniques to simplify access and improve efficiency. Simple linear video navigation can be accomplished through representative thumbnails (e.g. a filmstrip metaphor), such that selecting a thumbnail directly positions the main video at the specific time corresponding to that thumbnail [10]. Swifter claimed faster video navigation by displaying a grid of thumbnails instead of a single thumbnail when scrubbing a video [15]. Map-based storyboards emerged as a video navigation tool, where different intervals are directly visualized on a map [19]. Davis [8] proposed an iconic video representation language of human actions in video. Using these iconic representations allows users to visualize, browse, annotate, and retrieve video content. Others proposed playback mechanisms to help users rapidly skim through uninteresting content and watch interesting parts at a normal pace [4]. The timeline can also be controlled by directly manipulating video content (i.e. using object motion) [9].
Figure 2: The VCR is a new navigation tool based on viewing statistics which intuitively visualizes seen and unseen content. Each viewed interval is recorded and added to the view count for visualization. Playback is controlled either with the usual tools (play, pause, seek), or by using the VCR. For unwatched video, the view count is equal everywhere and the VCR gracefully becomes the familiar Filmstrip (Figure 3a).
Video summarization aims to shorten videos to emphasize the important content and reduce the time needed for viewing. For instance, Correa et al. proposed a system for generating dynamic narratives from videos where a compact, coherent and interactive poster is created for each video [6]. They use a series of spatio-temporal masks to improve the output quality of stitching foreground and background regions of video frames. Daniel et al. [7] and Nguyen et al. [18] applied computer vision algorithms to develop a 3D volume-based interface for video summarization and navigation. It enabled users to understand the content and easily navigate to the intended parts of the video. Video summarization is represented by a large body of work [1,17,22]; however, in general these methods do not take advantage of any implicit information gathered as users consume video, which may be used to personalize the user experience.
Video navigation histories and social navigation provide potential interaction techniques for fast navigation, event search and video summarization. People in real scenarios rewatch parts of videos that are important, interesting, affective, or hard to understand [20]. This can be seen most simply in YouTube's or Vimeo's feature for sharing a video from a specific start time, providing an exact use case of what we propose. This watching behaviour leaves digital footprints on the video frames, creating a non-flat video histogram that emphasizes the interest of each part of the video. Researchers have effectively employed this data for different purposes. For example, Yu et al. have used video browsing histories to rank different scenes within a video, then used the rankings to extract clips to generate video summaries [24]. Mertens et al. have used the video timeline itself to visualize users' viewing 'footprints' using different brightness levels, which lets users quickly navigate to the most viewed scenes [16]. However, this does not supply a visualization of the video, which inhibits search (when searching for a previously seen event, users need to remember its approximate location), and intervals are not defined as a whole, which prevents sharing directly from the visualization. Al-Hajri et al. offered users access to their video browsing histories by visualizing each interval a user watched as a separate thumbnail [2,3]. Users were able to find previously-seen content more quickly than when their history was not available. However, rewatched intervals were visualized multiple times, which increased the search space and may hamper the search task.
Video navigation history can play an important role in user-based information retrieval from videos on the web. Shamma et al. [21] and Yew et al. [23] have proposed a shift from content-based techniques to user-based analysis because it provides a more promising basis for indexing media content in ways that satisfy user needs. Leftheriotis et al. used viewing statistics as a tool for extracting a representative image or thumbnail based on the most-viewed frame within a video [14]. Hwang et al. applied the viewing statistics to place the most-viewed video content in provider networks rather than using the complete video to reduce their storage and network utilization [13]. User-based viewing statistics were demonstrated to be at least as effective at detecting video events as content-based analysis techniques (i.e. computer vision) [12]. Fong et al. [11] employed users' navigation behaviour to propose a new casual video editing interaction technique. They showed that, using their technique, participants took at most two-thirds of the time taken by the conventional method to select video segments and create playlists.

SINGLE VIDEO NAVIGATION
The objective of our research is to design a visualization that supports fast in-video navigation (play most popular or unseen parts), search (seek within intervals with prior knowledge e.g. 'seen the event before' or 'never seen the event'), preview, and instant sharing (share a single interval directly). To accomplish this, we use viewing statistics (personal or crowd-sourced) as the basis for a modification to the well-known filmstrip visualization [5], to create the View Count Record (VCR). The VCR uses a variable thumbnail size (and variable interval length) to reflect the relative popularity of intervals. This is similar in concept to timeline footprints [16]; however, timeline footprints do not allow direct navigation (we do not require seeking), in-place preview or direct interval sharing. We used size instead of colour since we are representing intervals using thumbnails, where colour discrimination may be confused with the thumbnail content and would be difficult to differentiate for some videos. We also piloted the visualization using a coloured frequency bar attached to each thumbnail to indicate the popularity; while the information was welcomed, the visualization was reported as cluttered. The VCR applies a histogram visualization using thumbnails, where the height of each thumbnail corresponds to the height of a histogram bar.
Whenever a segment of video is played, the video ID and timestamps for the interval's start and end are recorded. An accumulated view count is maintained for the video at a given resolution (e.g. 15 samples per second of video). The VCR, shown in Figure 3, consists of a fixed number of video segments (described below). The duration and size of each segment are based on how often its corresponding interval has been viewed. If no viewing statistics are available, the VCR appears as a normal filmstrip as shown in Figure 3(a).
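The accumulation step can be illustrated with a minimal sketch. It assumes that each recorded interval is a (start, end) pair in seconds and that the view count is held in a flat list with one entry per sample; the function name `accumulate_view_counts` and the rounding of timestamps to the nearest sample are illustrative assumptions, not details given in the text.

```python
from typing import List, Tuple

SAMPLES_PER_SECOND = 15  # example resolution mentioned in the text


def accumulate_view_counts(
    duration_s: float,
    intervals: List[Tuple[float, float]],
    rate: int = SAMPLES_PER_SECOND,
) -> List[int]:
    """Return one view count per sample for a video of the given duration.

    Each (start, end) pair is one played interval in seconds (assumed format).
    """
    n_samples = int(round(duration_s * rate))
    counts = [0] * n_samples
    for start_s, end_s in intervals:
        # Clamp to the video bounds; the end index is exclusive.
        lo = max(0, int(round(start_s * rate)))
        hi = min(n_samples, int(round(end_s * rate)))
        for i in range(lo, hi):
            counts[i] += 1
    return counts


# A 2-second video sampled twice per second, with two overlapping plays.
example_counts = accumulate_view_counts(2.0, [(0.0, 1.0), (0.5, 2.0)], rate=2)
```

With no recorded intervals every sample stays at zero, which corresponds to the flat view count of an unwatched video.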

VCR Construction
The construction of the VCR starts by gathering intervals of time in which consecutive frames have equal view counts. While there are intervals shorter than a set threshold, the algorithm attempts to merge these intervals with one of their neighbouring intervals. The neighbour to merge with is determined by two criteria: first by the difference in view counts, and, if the differences in view counts are equal, then by the duration of the neighbouring intervals. The merging process chooses the neighbour with the smallest difference in view counts, or, in a tie, the one with the smallest duration. This process repeats until the durations of all intervals are greater than the preset threshold.
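The merging procedure described above can be sketched as follows. Several details are assumptions made only for illustration, since the text does not specify them: intervals are [start, end, count] triples over sample indices, the shortest interval is processed first, a merged interval takes the duration-weighted mean of the two view counts, and an interval whose length equals the threshold is left alone. The function names are hypothetical.

```python
from itertools import groupby


def runs_of_equal_counts(counts):
    """Group consecutive samples with equal view counts into [start, end, count]."""
    runs, pos = [], 0
    for value, group in groupby(counts):
        length = len(list(group))
        runs.append([pos, pos + length, float(value)])
        pos += length
    return runs


def merge_short_intervals(runs, min_len):
    """Merge intervals shorter than min_len samples into a neighbour."""
    runs = [list(r) for r in runs]
    while len(runs) > 1:
        # Assumption: handle the shortest interval first.
        idx = min(range(len(runs)), key=lambda i: runs[i][1] - runs[i][0])
        start, end, count = runs[idx]
        if end - start >= min_len:
            break  # every remaining interval is long enough
        neighbours = [j for j in (idx - 1, idx + 1) if 0 <= j < len(runs)]
        # Smallest view-count difference first, then shortest neighbour.
        best = min(
            neighbours,
            key=lambda j: (abs(runs[j][2] - count), runs[j][1] - runs[j][0]),
        )
        lo, hi = min(idx, best), max(idx, best)
        a, b = runs[lo], runs[hi]
        len_a, len_b = a[1] - a[0], b[1] - b[0]
        # Assumption: the merged count is the duration-weighted mean.
        merged_count = (a[2] * len_a + b[2] * len_b) / (len_a + len_b)
        runs[lo:hi + 1] = [[a[0], b[1], merged_count]]
    return runs


sample_counts = [0, 0, 0, 0, 3, 2, 2, 2, 2, 0, 0, 0, 0]
merged = merge_short_intervals(runs_of_equal_counts(sample_counts), min_len=3)
```

In this example the single-sample interval with count 3 is merged with its right-hand neighbour (count 2) rather than its left-hand neighbour (count 0), because the difference in view counts is smaller.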
Upon the completion of the merging process, the VCR contains a set of intervals with duration that are greater than the preset threshold. However, since the number of items in the visualization component is limited, we must reduce the set of intervals to match. Thus we look at the peaks of the view count graph, keep the highest peaks, and merge the other intervals until we get to the desired resolution. Conversely, if we do not have enough intervals, we linearly sample and split intervals until we have enough.
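One possible reading of this reduction step is sketched below; the paper does not give the exact rules, so every choice here is an assumption. When there are too many intervals, the sketch merges the adjacent pair whose larger view count is lowest, so that intervals containing the highest peaks are touched last; when there are too few, it halves the longest interval. Intervals are (start, end, count) triples over sample indices, and `fit_segment_count` is a hypothetical name.

```python
def fit_segment_count(intervals, target):
    """Merge or split intervals until there are `target` of them, if possible."""
    segs = [list(s) for s in intervals]

    # Too many intervals: merge low-count neighbours first (assumed heuristic).
    while len(segs) > target and len(segs) > 1:
        i = min(
            range(len(segs) - 1),
            key=lambda k: max(segs[k][2], segs[k + 1][2]),
        )
        a, b = segs[i], segs[i + 1]
        len_a, len_b = a[1] - a[0], b[1] - b[0]
        # Assumption: the merged count is the duration-weighted mean.
        mean_count = (a[2] * len_a + b[2] * len_b) / (len_a + len_b)
        segs[i:i + 2] = [[a[0], b[1], mean_count]]

    # Too few intervals: halve the longest one (assumed splitting rule).
    while len(segs) < target:
        i = max(range(len(segs)), key=lambda k: segs[k][1] - segs[k][0])
        start, end, count = segs[i]
        if end - start < 2:
            break  # a single sample cannot be split further
        mid = (start + end) // 2
        segs[i:i + 1] = [[start, mid, count], [mid, end, count]]
    return segs


reduced = fit_segment_count(
    [(0, 10, 1), (10, 20, 5), (20, 30, 0), (30, 40, 0), (40, 50, 9)], target=3
)
expanded = fit_segment_count([(0, 8, 2)], target=3)
```

Under this heuristic the interval with the highest view count (9) survives unmerged, while the lower peak (5) is averaged with its neighbour; a different rule for protecting peaks would give a different result.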
We then create our visualization by using a Video Segment component (described below) for each interval. The size of each segment is based on a ratio of the current segment's view count to the maximum view count for the video. The VCR updates automatically when the video is paused, based on the latest viewing statistics (it does not update while viewing so as not to distract from the main video). It illustrates to the user how they consumed the video and which parts were viewed the most/least. This provides a simple mechanism to find or navigate back to these segments when needed.
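The size computation can be sketched as a simple linear mapping. The ratio of a segment's view count to the maximum view count comes from the description above; the minimum and maximum heights, and the choice to draw all segments at full height when nothing has been viewed, are assumptions chosen so that an unwatched video falls back to a uniform filmstrip.

```python
def segment_heights(view_counts, max_height=90.0, min_height=30.0):
    """Map per-segment view counts to thumbnail heights (units are arbitrary)."""
    peak = max(view_counts, default=0)
    if peak == 0:
        # No viewing statistics: every segment has the same size (filmstrip).
        return [max_height] * len(view_counts)
    return [
        min_height + (max_height - min_height) * count / peak
        for count in view_counts
    ]


heights = segment_heights([1, 2, 4], max_height=100.0, min_height=20.0)
```

A non-zero minimum height is used so that rarely viewed segments remain visible and selectable; whether the actual interface enforces such a floor is not stated in the text.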

VCR Visualization
Each interval (i.e. video segment) in the VCR is represented as a thumbnail displaying the first frame. The temporal location of the interval with respect to the video is visualized as a red line on a grey background beneath the thumbnail (the location bar) to help users spatially contextualize the temporal location of intervals within the complete video. Video segments support: seeking using a popup overlay which expands from the red line in the location bar, and the thumbnail's image updates to reflect the seeked content (see Figure 4); playing the video directly, without interrupting the main player; playing the video in the main player, either by clicking the top right arrow or dragging the interval to the main player (dragging with the seek bar active plays the interval from that specific time); and finally, sharing of the entire interval by drag-and-drop.

VCR Scalability
The VCR is not affected by the length or duration of the video being visualized, since the algorithm, as described in Section 3.1, merges (or linearly samples and splits) intervals until the VCR reaches the desired resolution (i.e. the required number of segments). However, due to the limited space and the fixed number of video segments, some medium-height peaks may be diminished and not easily viewed in the VCR. To alleviate this problem, the interface supports a zoom feature (via mouse wheel) where the selected video segment expands and is represented by its own VCR with the same number of segments. When a segment is zoomed in, the VCR updates to visualize segments within that zoomed segment only and hides any other segments. Thus the VCR always uses the same number of video segments.

EVALUATION
We designed a comparative study to investigate: 1) if our visualization of video navigation provides faster search for user-specific affective intervals, and if users prefer our visualization for this task; 2) if crowd-sourced histories provide good summaries of video. Subjects were asked to find and share their favourite intervals using either the VCR or Filmstrip visualizations. We compared against the filmstrip design instead of timeline footprints [16] for several reasons: footprints do not easily let a user directly select or share a full interval; video cannot be previewed inside the footprints visualization (VCR and filmstrip can both directly preview without seeking); and VCR and footprints could be used together, so we believe a comparison against Filmstrip is more informative.

Participants
Ten paid volunteers, 6 female and 4 male, participated in the experiment. Participants ranged in age from 19 to 35. Three of the participants were undergraduate students while the rest (i.e. 7 participants) were from the general public (non-academic). Each participant worked on the task individually. All participants were experienced computer users and had normal or corrected-to-normal vision. Seven participants watch online videos on a daily basis and the other three watch videos 3-5 times a week. Five of the participants watch 1-3 videos on average per day, while three watch 3-5 videos per day and two watch more than 10 videos per day on average.

Design and Procedure
Two different navigation modes were tested: Filmstrip and VCR. The case where no history was available was represented by the state-of-the-art Filmstrip, shown in Figure 3(a). Each participant tried both modes to navigate and share their preferred parts of the video. Participants were divided equally into two groups, where Group 1 used Filmstrip first and VCR second, and Group 2 had the order reversed. Participants freely watched a set of 5 different videos (Disney short animations) between 3 and 5 minutes long. Video length does not affect the VCR, as mentioned in Section 3.3; however, due to the time constraints of the experiment, short videos were tested.
To ensure all participants had seen equivalent video, the tasks began after all videos were viewed. For each video, participants were asked to list five intervals they would like to share; these were recorded by the researcher. The researcher then chose an event from those provided by the participant, which the participant had to find: the search task began by clicking 'Find' and choosing a video from a grid of thumbnails, after which the navigation layout for the current mode was displayed; the participant used this to find an interval representing the event. The interval was submitted for consideration by playing it: if approved by the researcher as correct, the task was complete.
Each participant performed a total of 14 search tasks (2 modes × 7 intervals); they were asked to perform as quickly as possible. For each task, the completion time, the number of previews and the number of zoom events were recorded. The completion time was measured from when the participant clicked on a 'Find' button until the moment they found the correct interval (confirmed by the researcher). The navigation behaviour and statistics were recorded during the viewing phase. The participants were also asked to rank each mode based on speed, ease and preference.
Upon the completion of the sharing tasks, participants started the second task where they were shown a short version of each video, automatically created from crowd-sourced histories (described below). Participants were asked if they thought the shortened version was a good summarization and whether each segment in the crowd-sourced version matched their own affective segments; the experiment ended when participants had ranked all 5 shortened videos and their corresponding segments. The final task was to fill out a questionnaire to rank the modes, and provide feedback on the interface, its features and their experience. The experiment lasted
Table 1: Results of the comparative study for the interval retrieval task, showing a significant advantage using our method (VCR) in terms of completion time. Note: SD = standard deviation; completion time measured in seconds. * p < 0.03

Crowd-Sourced Data Collection
Six graduate students (2 female, 4 male, aged 24 to 37) completely separate from participants in this study, voluntarily participated in the crowd-sourced data collection. Participants were invited prior to the experiment to freely watch and navigate the same set of videos while their viewing statistics were recorded. Their data was then aggregated and visualized using the VCR. At least 9 peaks existed for each video. However, due to the experiment time constraints (one hour), we decided to use only the highest 5 peaks of each video in the shortened videos that were tested.

Results and Discussions
Most participants commented that they enjoyed their time using the interface and they can imagine seeing its features applied, especially in social networking websites. They foresaw its applicability as a navigation aid for un-watched videos where social navigation can be leveraged for the benefit of future viewers, as well as a summarization tool for their own videos. Participants were impressed by how closely the crowd-sourced popular intervals matched their own preferences for best intervals, confirming that in most cases this would provide an effective tool for navigating new video.

Search Task
The main task in the experiment was to search for previously-seen preferred intervals: each participant was able to complete each search task in less than one minute (for all 14 trials). A paired-samples t-test analysis determined the significance of the results in terms of the average completion time per search and the average number of previews per search. The analysis, shown in Table 1, demonstrated that the search task using Filmstrip took significantly more time than with the VCR. Participants were asked to rank the different modes for preference, ease and speed: they ranked the VCR as the most liked, easiest and fastest mode, which coincides with the quantitative results. This indicates that having access to the user's personal navigation record is useful for finding previously-seen content within video, and that our visualization cues (e.g. size) for the most-watched segments helped users to quickly and easily navigate to the correct intervals.
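For readers unfamiliar with the paired-samples t-test, the statistic can be computed as in the sketch below, which uses only the Python standard library. The input values are made-up numbers chosen for illustration and are not the measurements from this study; the function name `paired_t_statistic` is likewise hypothetical.

```python
import math
import statistics


def paired_t_statistic(first, second):
    """Return (t, degrees_of_freedom) for two paired samples."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [a - b for a, b in zip(first, second)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-participant mean completion times in seconds
# (illustrative only, NOT the data reported in Table 1).
filmstrip_times = [12.0, 14.0, 16.0]
vcr_times = [10.0, 10.0, 10.0]
t_value, dof = paired_t_statistic(filmstrip_times, vcr_times)
```

The resulting t value would then be compared against the t distribution with n - 1 degrees of freedom to obtain a p-value; that lookup is omitted here because it requires a statistics library beyond the standard one.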
In terms of the average number of previews, the results revealed no significant difference between the two modes, which we did not anticipate. This may be because many view count peaks can exist within a single video segment of the VCR, and because some segments ended up much smaller in size, which made them harder to navigate. When analyzing the participants' navigation history, we found that participants created 11 history segments on average per video. This means that, after the merging heuristics were applied, some VCR segments contained more than one peak, since there are only 6 segments in the VCR. As mentioned in Section 3.3, we added the zoom functionality to mitigate this. However, participants rarely used the zoom feature and preferred to navigate through these segments instead, which explains the large number of previews.

Agreement With Crowd-Source
All participants agreed that the shortened video (created automatically using the crowd-sourced information) was an effective summary of the video content. Before using the interface, participants were asked whether they would use others' recommendations as a tool for navigating unseen videos; we were most interested in discovering if participants' views would change after using our interface. Most said they would not use recommendations; however, after using the interface and viewing the shortened video they expressed surprise at the quality of the summary. Participants mentioned that having the crowd-sourced VCR would save time, especially for long videos, since they can decide whether to watch the entire video or just the summary, or even just parts of the summary.
For each video, participants were asked to rank each segment derived from the crowd-sourced data. At least 8 participants out of 10 agreed that each segment represented something they liked or illustrated an affective clip. Out of a total of 25 segments, 7 segments were liked by 8 participants, 8 segments were liked by 9 participants, while the remaining 8 were liked by all participants. The negative ranking of segments by some participants was due either to religious beliefs or perceived violent content, while other participants considered these segments to be funny. We expected the variation between participants; however, we did not predict the generally high level of agreement. This suggests implicit tagging of video from many users may serve as a valuable navigation tool for online video.

Ranking of Visualization's Features
From the aggregated results of the questionnaire (measuring ease-of-use and usefulness), the average ranking across all components and features was 5.82 out of 7. All features were ranked above 5 except for three items: getting started (M = 4.5), remembering how to use the interface (M = 4.6), and using the zoom (M = 4.3). The zoom scored slightly lower due to the mouse wheel sensitivity being reported as too high, which led to some participants becoming confused or frustrated. This could also explain the low usage of this feature while performing the tasks, where only 2 participants used it for 4 tasks out of 140 tasks (10 participants × 2 modes × 7 tasks) when searching for events. This has been taken into account for future versions of the interface. Overall, participants appreciated the zoom since it enabled them to get a more detailed view of the video's content.

Participants' Feedback
Participants offered overwhelmingly positive impressions and comments about the interface. One participant commented that "I definitely see how this would be really helpful for long videos because I will not have to waste my time watching the whole video again to get to the important stuff. I could directly use my previous history to navigate to these intervals." Others said "I would love to see this implemented within social websites. I could see how it would save my time when viewing new videos"; "It is really cool and easy to use. When are you going to apply this to online video websites?"; and finally "I didn't expect others' history would be useful, but, you showed me it is."

CONCLUSION
We have presented a new way to visualize and navigate a video space using the View Count Record (VCR), which provides simple navigation, search, preview and sharing of video intervals. Our comparative study, based on a use case of searching and sharing, found significant quantitative results in favour of our method, which was also positively perceived by the participants. Using crowd-sourced data as a tool for recommending segments within videos (i.e. social navigation) was appreciated, and we confirmed that the summaries generated from crowd-popular segments were effective at communicating the content of video. The VCR was rated highly by users, who recommended integrating this mechanism into online video websites.
We will investigate how this navigation mechanism may be extended to multiple videos, to provide users with an intuitive and fast navigation mechanism for their video collection, as well as for serendipitous discovery of new video based on crowd-sourced information. We intend to explore how users respond to these mechanisms, in conjunction with the presented VCR, via a field study utilizing online video; extensive data will help determine general users' current viewing behaviour for all types of video, and how it changes when given a VCR and other methods based on viewing statistics.