More Than A Feeling?: Reliability And Robustness Of High-Level Music Classifiers