Day 1 - June 07
Poster Competition: Procedure-Aware Pretraining for Instructional Video Understanding
Instructional videos show people demonstrating how to perform multi-step tasks such as cooking or repairing. Building good video representations from instructional videos is challenging because few video annotations are available. This scarcity makes it hard to extract procedural knowledge such as the identity of the task (e.g., ‘make latte’) and its steps (e.g., ‘pour milk’). Our insight is that instructions for procedures depict sequences of steps that repeat between instances of the same or different tasks, and that this structure can be well represented by a Procedural Knowledge Graph, where nodes are discrete steps and edges connect steps that occur sequentially in the instructional activities. This graph can then be used to generate pseudo labels to train a video representation that encodes the procedural knowledge in a more accessible form, so that it generalizes to multiple procedure understanding tasks. We call this Procedural Knowledge Graph based pre-training method, and the resulting model, Paprika: Procedure-Aware PRe-training for Instructional Knowledge Acquisition. We evaluate Paprika on COIN and CrossTask for procedure understanding tasks such as task recognition, step recognition, and step forecasting. Paprika yields a video representation that improves over the state of the art, with accuracy gains of up to 11.23% across 12 evaluation settings.
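To make the graph idea concrete, here is a minimal Python sketch of a step graph built from ordered step sequences. It is an illustration under stated assumptions, not Paprika's implementation: the abstract does not say how nodes, edges, or pseudo labels are actually computed. The sketch assumes each demonstration has already been reduced to a list of discrete step names, and the function names (`build_step_graph`, `next_step_pseudo_labels`) and the example demonstrations are hypothetical.

```python
from collections import defaultdict


def build_step_graph(sequences):
    """Count directed edges between steps that occur consecutively.

    Assumption: every demonstration is already an ordered list of
    discrete step names. Returns {(source_step, target_step): count}.
    """
    edges = defaultdict(int)
    for steps in sequences:
        for src, dst in zip(steps, steps[1:]):
            edges[(src, dst)] += 1
    return dict(edges)


def next_step_pseudo_labels(edges, step):
    """List the successors of `step`, most frequent first (ties by name).

    Hypothetical stand-in for graph-derived pseudo labels: the steps
    that tend to follow a given step across all demonstrations.
    """
    successors = [(dst, n) for (src, dst), n in edges.items() if src == step]
    return [dst for dst, _ in sorted(successors, key=lambda p: (-p[1], p[0]))]


# Hypothetical demonstrations; different tasks share some steps.
demos = [
    ["grind beans", "pull espresso", "steam milk", "pour milk"],
    ["pull espresso", "steam milk", "pour milk"],
    ["heat cocoa", "steam milk", "pour milk"],
    ["pull espresso", "add water"],
]

graph = build_step_graph(demos)
labels = next_step_pseudo_labels(graph, "pull espresso")
print(labels)
```

Because ‘steam milk’ and ‘pour milk’ recur across different tasks, their shared edge accumulates counts from all of them, which is the kind of cross-task structure the abstract attributes to the graph.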