Towards Multi-Modal Interactive Systems that Connect Audio, Vision and Beyond