Movie genre classification is an active research area in machine learning; however, the content of movies can vary widely within a single genre label. We expand these ‘coarse’ genre labels by identifying ‘fine-grained’ contextual relationships within the multi-modal content of videos. By leveraging pretrained ‘expert’ networks, we learn the influence of different combinations of modes for multi-label genre classification. We then continue to fine-tune this ‘coarse’ genre classification network in a self-supervised manner to sub-divide the genres based on the multi-modal content of the videos. We demonstrate our approach on MMX-Trailer-20, a new multi-modal dataset of 8,800 movie trailers (37,866,450 frames) that includes pre-computed audio, location, motion, and image embeddings.
@inproceedings{Fish:ICIP:2021,
AUTHOR = "Fish, Ed and Weinbren, Jan and Gilbert, Andrew",
TITLE = "Rethinking genre classification with fine grained semantic clustering",
BOOKTITLE = "IEEE International Conference on Image Processing (ICIP), 2021",
YEAR = "2021",
}