VideoStickers is a tool that allows viewers to quickly extract information from videos as 'motion stickers' for active note-taking.
It implements automated object detection and tracking, linking objects to transcribed narration, and supports expressive queries to generate stickers across space, time, and events of interest.
VideoStickers supports various video topics and note-taking needs.


Videos are an effective tool for knowledge sharing. Unlike linear text in which learners need to mentally connect concepts one line at a time, videos offer an integrated graphic representation of objects and relationships over space and time. For instance, students can watch different transformations throughout the metamorphosis of a butterfly.

Such graphical representations have been demonstrated to support dual coding and improve recall. They also align with the representation used in the video itself: capturing a graphical image in the notes (e.g., of a cell in a video of mitosis) that corresponds to its depiction in the video can enhance retention and provide a useful study tool.

However, learners can find it difficult to take notes from video content. Effective note-taking requires recruiting individual objects (e.g., the sun) from videos, writing down their understanding of the relationships between those objects, and correctly recording event sequences over time. Current practices require repeatedly pausing and rewinding videos, taking static screenshots of individual video frames, and manually transcribing narration. All of this increases the time and effort required for note-taking and makes learning from videos less effective.

We propose VideoStickers, a tool designed to support visual note-taking by extracting expressive content and narratives from videos as stickers.

Note-taking using VideoStickers: While watching the video about Phases of the Moon, the viewer (a) first captures the video frame for 'new-moon' as a frame sticker. Later the viewer (b) extracts frame objects for Earth and Moon as object stickers, and (c) expands the object sticker for the revolution of the moon around the earth and annotates key phases.

How does VideoStickers work?

We propose a two-pass approach that allows viewers to capture high-level information uninterrupted and later extract specific details.

Active Viewing
Users can use the frame marker button to quickly capture salient points in the video while viewing. When the user clicks the frame marker button, VideoStickers generates a frame sticker, capturing a screenshot of the current frame, and adds it to the diagram canvas.
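A minimal sketch of this capture step, assuming a simple record type and a fixed frame rate (both the `FrameSticker` record and the 30 fps default are illustrative assumptions, not the system's actual API):

```python
from dataclasses import dataclass

@dataclass
class FrameSticker:
    frame_index: int   # which video frame was captured
    timestamp: float   # playback time in seconds, used to navigate back later

def capture_frame_sticker(timestamp: float, fps: float = 30.0) -> FrameSticker:
    """Map the current playback time to a frame index and record it."""
    return FrameSticker(frame_index=int(round(timestamp * fps)),
                        timestamp=timestamp)

sticker = capture_frame_sticker(12.4)
print(sticker)  # FrameSticker(frame_index=372, timestamp=12.4)
```

Keeping the timestamp alongside the frame index is what later lets a sticker navigate the player back to its source point in the video.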

Visual Note-taking
In this stage, users extract elements from videos and integrate them into coherent notes. The whole process breaks down into the following steps:

  • Navigate to Point

    By clicking on the pointer button associated with a frame sticker, the video player navigates to the corresponding point in the video. Alternatively, VideoStickers offers the flexibility to bring up stickers on any frame by simply pausing the video on that frame.

  • Sticker Extraction

    As shown in (h), object stickers are overlaid directly on top of the media player. The user clicks on the desired object (in case a frame contains multiple stickers) to bring up a local time bar with start-end time ranges for the duration the object is visible. To further facilitate sticker extraction, VideoStickers automatically detects interest points such as collisions and displays them as markers on the local timeline. Once satisfied, the user can click the add sticker button, which adds the sticker to the diagram canvas.

  • Sticker Editing

    On the diagram canvas, users can edit the label and text for an object sticker by double-clicking on the sticker to trigger an editing panel, as shown in (j).

  • Frame-by-frame view

    VideoStickers offers frame-level control over object stickers to allow flexible note-taking. By clicking the sticker expand button, the user can bring up visuals for the individual frames composing the object sticker (k). The user can adjust the span of all frames along the x axis and the frame interval with the sliders.

  • Navigation:

    Users can easily trace back to the video context by double-clicking stickers.

  • Diagramming:

    VideoStickers provides an interface for diagramming over stickers.
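Two of the steps above lend themselves to small sketches: deriving the local time bar's start-end ranges from per-frame visibility (Sticker Extraction), and sampling frames at an interval for the expanded view (Frame-by-frame view). Both functions below are illustrative assumptions about how such logic could look, not the system's actual code.

```python
def visibility_ranges(visible):
    """Collapse a per-frame visibility list into (start, end) frame spans."""
    ranges, start = [], None
    for i, v in enumerate(visible):
        if v and start is None:
            start = i                      # a visible span begins here
        elif not v and start is not None:
            ranges.append((start, i - 1))  # span ended on the previous frame
            start = None
    if start is not None:                  # object still visible at video end
        ranges.append((start, len(visible) - 1))
    return ranges

def sample_frames(start, end, interval):
    """Frame indices shown in the expanded frame-by-frame view."""
    return list(range(start, end + 1, interval))

print(visibility_ranges([False, True, True, False, True]))  # [(1, 2), (4, 4)]
print(sample_frames(100, 130, 10))        # [100, 110, 120, 130]
```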

Any Examples?

Different Use Cases

Educational videos, how-to videos, and even entertainment ones!

Educational Video

For educational videos, learners first generate sticker notes with labels and related narration, then diagram over the motion graphics to create a comprehensive diagram that connects the concepts shown in the stickers. Here is an example for a video introducing neutron stars.

Let's generate a diagram:

Instructional videos: How to ..

Besides educational videos, VideoStickers can also be applied to step-by-step videos like recipes or workout instructions, using the same process as illustrated in the section above. The generated diagrams and motion stickers can be easily used in blogs or posts for knowledge sharing.

Entertainment Videos: Creating Your Own Memes

Memes are widely used in online social communities. VideoStickers can also be applied to entertainment videos to create memes with a few clicks. In the editing mode, there is an 'A' button beside the slider that allows users to add custom text on top of the sticker. The 'A' button changes to a 'trash' button and a 'download' button while editing the text. Users can remove the text or download the sticker with text as a meme.

And more!

The representational power

Stickers are multi-dimensional!

As graphic representations of concepts, the stickers created by VideoStickers encode three different dimensions to facilitate content understanding:


Shape / Contour

The stickers are generated based on exact masks, rather than fixed bounding boxes. This allows the stickers to capture both the region and contour information of each object. The contour information is useful for showing transformations of objects, and the region information gives an intuition of the relative size of objects by comparison.

Spatial Arrangement / Positioning

Spatial information for each sticker is stored. From the bounding boxes, we can easily calculate the position in the video frame, relative positions between objects, motion paths, and potential interactions.
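For illustration, here is a sketch of the quantities recoverable from per-frame bounding boxes; the (x, y, w, h) box format is an assumption, since the page only states that positions, relative positions, and motion paths can be derived:

```python
def centroid(box):
    """Center point of an (x, y, w, h) bounding box."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def motion_path(boxes):
    """Per-frame centroid positions of one tracked object."""
    return [centroid(b) for b in boxes]

def relative_offset(box_a, box_b):
    """Vector from object A's centroid to object B's centroid."""
    (ax, ay), (bx, by) = centroid(box_a), centroid(box_b)
    return (bx - ax, by - ay)

print(motion_path([(0, 0, 10, 10), (5, 0, 10, 10)]))  # [(5.0, 5.0), (10.0, 5.0)]
print(relative_offset((0, 0, 10, 10), (20, 0, 10, 10)))  # (20.0, 0.0)
```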

Timing of Animation

The timing information is encoded in the name of the sticker's .gif file, so we can easily extract the duration and order of animation. This dimension is especially useful for measuring speed; as shown in the notes for the workout exercise, it is important to know the speed and intensity of each movement.
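A sketch of recovering timing from a sticker's filename. The naming scheme `<label>_<start_ms>_<end_ms>.gif` used here is an assumption made for illustration; the page only says timing is encoded in the .gif name:

```python
import re

def sticker_timing(filename: str):
    """Return (start_s, end_s, duration_s) parsed from an assumed
    <label>_<start_ms>_<end_ms>.gif naming scheme."""
    m = re.match(r".*_(\d+)_(\d+)\.gif$", filename)
    if m is None:
        raise ValueError(f"no timing found in {filename!r}")
    start_ms, end_ms = int(m.group(1)), int(m.group(2))
    return start_ms / 1000.0, end_ms / 1000.0, (end_ms - start_ms) / 1000.0

print(sticker_timing("moon_12000_15500.gif"))  # (12.0, 15.5, 3.5)
```

With start and end times in hand, ordering stickers on a timeline or estimating the speed of a movement reduces to simple arithmetic over these tuples.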

Here are some of the example stickers that illustrate the above concepts!

Technical Details

The system architecture:

VideoStickers implements a zero-shot object detection algorithm based on salient object detection. We adapt the saliency map generation network from PoolNet. The detection algorithm schema is shown below:
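As a rough illustration of this detection step, thresholding a saliency map (such as PoolNet's output) into a binary mask and bounding box might look like the following; the threshold value and the NumPy-only pipeline are simplifying assumptions:

```python
import numpy as np

def mask_and_bbox(saliency: np.ndarray, threshold: float = 0.5):
    """Binarize a saliency map and return (mask, (x, y, w, h)) for the
    salient region, or (mask, None) when nothing exceeds the threshold."""
    mask = saliency >= threshold
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return mask, None                  # no salient object detected
    x0, x1 = xs.min(), xs.max()
    y0, y1 = ys.min(), ys.max()
    return mask, (int(x0), int(y0), int(x1 - x0 + 1), int(y1 - y0 + 1))

sal = np.zeros((6, 6))
sal[2:4, 1:5] = 0.9                        # a bright salient region
_, bbox = mask_and_bbox(sal)
print(bbox)  # (1, 2, 4, 2)
```

Keeping the mask (not just the box) is what lets the system cut stickers along object contours, as described in the Shape / Contour dimension above.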

For object tracking, we implemented a hybrid method based on the median flow algorithm. Interest points are detected within the range of the target object's appearance, allowing us to detect points of transformations and interactions. We compared our hybrid tracking method with other algorithms and highlight the detected interest points in the following graph:
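One way collision-style interest points could be flagged is by checking when two tracked objects' bounding boxes begin to overlap; the (x, y, w, h) box format and the IoU-based test below are assumptions for illustration, not the system's actual hybrid tracker:

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def collision_frames(track_a, track_b, thresh=0.0):
    """Frames where two tracks transition from disjoint to overlapping."""
    hits, prev = [], False
    for i, (a, b) in enumerate(zip(track_a, track_b)):
        now = iou(a, b) > thresh
        if now and not prev:
            hits.append(i)                 # overlap begins on this frame
        prev = now
    return hits

ta = [(0, 0, 4, 4), (3, 0, 4, 4), (6, 0, 4, 4)]   # object moving right
tb = [(8, 0, 4, 4)] * 3                           # stationary object
print(collision_frames(ta, tb))  # [2]
```

Frames returned by such a check would then appear as markers on the sticker's local timeline.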

The technical details are written in the working draft of our paper. If you are interested, please check out the .pdf file provided at the bottom of the page.

The system video

My Contributions

Brainstorming the idea
Developing the algorithms
Designing the user interface
Building the system
Writing up the story


This is a research project in the DataMaze lab at the University of Michigan.
Thank you to Professor Eytan Adar and Hari Subramonyam for guidance!

Resource Links

You can find more information here:
[Document]: IUI'22