Videos from the World Wide Web provide a rich source of information that robots could use to acquire knowledge about manipulation tasks. Previous work has focused on generating action sequences from unconstrained videos for a single robot performing manipulation tasks by itself. However, robots operating in the same physical space as people must not only perform actions autonomously but also coordinate seamlessly with their human counterparts. This often requires representing and executing collaborative manipulation actions, such as handing over a tool or holding an object for the other agent. We present a system for acquiring knowledge of collaborative manipulation action plans that outputs commands to the robot in the form of visual sentences. We evaluate the system on 12 unlabeled action clips taken from collaborative cooking videos on YouTube. We view this as a first step toward extracting collaborative manipulation action sequences from unconstrained, unlabeled online videos.