Detecting Violent Content in Hollywood Movies and User-Generated Videos

Authors: Esra Acar Celik, Melanie Irrgang, Dominique Maniry, Frank Hopfgartner
Source: Smart Information Systems

Detecting violent scenes in videos is an important content understanding functionality, e.g., for providing automated youth protection services. The key issues in designing violence detection algorithms are the choice of discriminative features and the learning of effective models. We employ low- and mid-level audio-visual features and evaluate their discriminative power within the context of the MediaEval Violent Scenes Detection (VSD) task. The audio-visual cues are fused at the decision level. As audio features we use Mel-Frequency Cepstral Coefficients (MFCC); as visual features we use dense histograms of oriented gradients (HoG), histograms of oriented optical flow (HoF), Violent Flows (ViF), and affect-related color descriptors. We partition the feature space of the violence training samples through k-means clustering and train a separate model for each cluster. These models, which are two-class support vector machines (SVMs), are then used to predict the violence level of videos. The experimental results on Hollywood movies and short web videos show that mid-level audio features are more discriminative than the visual features, and that performance is further enhanced by fusing the audio-visual cues at the decision level.
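
The prediction step described in the abstract can be illustrated with a minimal sketch. It assumes that each trained per-cluster model is available as a callable returning a violence score in [0, 1]. The abstract does not specify how the per-cluster outputs are combined or which fusion rule is used, so taking the maximum over cluster models and using a weighted sum for decision-level fusion are assumptions made here for illustration. The names `modality_score`, `fuse_decisions`, and `audio_weight` are hypothetical.

```python
from typing import Callable, Sequence

# A "model" is any callable mapping a feature vector to a violence score
# in [0, 1]. In the described system these would be two-class SVMs, one per
# k-means cluster of the violent training samples.
Model = Callable[[Sequence[float]], float]


def modality_score(models: Sequence[Model], features: Sequence[float]) -> float:
    """Combine the per-cluster models of one modality (audio or visual).

    Assumption: take the highest score over the cluster models, so that a
    segment counts as violent if it matches any one cluster of violence.
    """
    if not models:
        raise ValueError("at least one model is required")
    return max(model(features) for model in models)


def fuse_decisions(audio_score: float, visual_score: float,
                   audio_weight: float = 0.5) -> float:
    """Decision-level (late) fusion as a weighted sum of modality scores.

    Assumption: linear fusion; audio_weight is a hypothetical parameter.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * audio_score + (1.0 - audio_weight) * visual_score


# Toy stand-ins for trained models: each returns a fixed score.
audio_models = [lambda x: 0.25, lambda x: 0.75]
visual_models = [lambda x: 0.5]

audio = modality_score(audio_models, [0.0])    # 0.75
visual = modality_score(visual_models, [0.0])  # 0.5
fused = fuse_decisions(audio, visual, audio_weight=0.5)
print(fused)  # 0.625
```

Because fusion operates on scores rather than on features, each modality can use its own feature dimensionality and its own number of clusters; only the final scores need to be on a comparable scale.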