Towards Modeling Collaborative Task Oriented Multimodal Human-Human Dialogues