Visual-semantic embedding networks for cross-modal learning and information retrieval with search engine integration