Classification of environmental sounds plays a key role in applications such as surveillance systems and crime detection, since sounds in a real environment carry significant information. Deep learning models, such as convolutional neural networks (CNNs), have proven very useful for environmental sound classification (ESC). Recent work has shown that Vision Transformer (ViT) models can achieve comparable or even superior performance to CNNs on image classification tasks. In this paper, an environmental sound classification method based on the Vision Transformer is proposed. We represent sound files as images, namely log-Mel spectrogram images, and train a Vision Transformer model on these image representations. The method obtains an average classification accuracy of 94.6633%. This result indicates that the proposed approach achieves good ESC accuracy.
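As a rough illustration of the sound-to-image step described above, the following minimal sketch computes a log-Mel spectrogram with NumPy only and rescales it to an 8-bit grayscale image. All function names and parameter values (`n_fft=1024`, `hop=512`, `n_mels=64`, the HTK-style mel formula, the 10·log10 scaling) are assumptions chosen for illustration, not the settings used in the paper.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (an assumed choice; other variants exist)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose edges are evenly spaced on the mel scale
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(y, sr, n_fft=1024, hop=512, n_mels=64):
    # Hann-windowed frames -> power spectrum -> mel energies -> decibels
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[t * hop : t * hop + n_fft] * window for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (n_frames, n_fft//2 + 1)
    mel = mel_filterbank(sr, n_fft, n_mels) @ power.T     # (n_mels, n_frames)
    return 10.0 * np.log10(np.maximum(mel, 1e-10))        # floor avoids log(0)

def to_uint8_image(log_mel):
    # Min-max rescale to 0..255 so the spectrogram can be stored as an image
    lo, hi = log_mel.min(), log_mel.max()
    scaled = (log_mel - lo) / max(hi - lo, 1e-12)
    return (255 * scaled).astype(np.uint8)

if __name__ == "__main__":
    sr = 22050                                  # assumed sample rate
    t = np.arange(sr) / sr
    y = np.sin(2 * np.pi * 440.0 * t)           # synthetic 1-second test tone
    img = to_uint8_image(log_mel_spectrogram(y, sr))
    print(img.shape, img.dtype)
```

In a full pipeline, an image like `img` would typically be resized to the input resolution expected by the chosen ViT implementation and split into fixed-size patches by the model itself; that step is omitted here because it depends on the library used.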