Music emotion classification for Turkish songs using lyrics

Abstract


Introduction
Music has grown into an important part of people's daily lives. As we move further into the digital age, a large collection of music is created daily and made easily accessible, leading people to spend more time on activities that involve music. Everyone may encounter music throughout most routine daily activities such as waking up, eating, working, jogging, swimming, driving, and so forth [1]. As the amount of musical content continues to explode, conventional approaches that manage music pieces based on bibliographic information such as titles, artist names, and genres are no longer sufficient. Hence, music information organization and retrieval has to evolve to meet the demand for easy and effective information access [1], [2].
Music classification is an essential process for improving music information retrieval (MIR) systems in media platforms such as Spotify and Last.fm, the two most widely known music platforms, both of which have extensive music catalogues. Several approaches, such as content-based, context-based, and audio-based ones, are used to generate recommendations for listeners [3].
Today, several music services provide large-scale music datasets for information extraction, and most musical content is easily accessible [4]. These datasets are widely used to classify music into predefined categories such as genres and moods. However, it is essential to assign correct metadata in order to provide the best search results or correct recommendations of multimedia resources [5].
Music listening is a very situational behavior [6]. Lehtiniemi and Ojala [4] stated that the emotional state of the listener is essential for selecting the type of the music for listening. They also argued that specifying the mood of the listener and classifying emotions based on the preferred music by that listener are difficult tasks to accomplish well automatically.
Recently, this demand has led to an increasing interest in the research community in proposing and developing tools and algorithms for efficient music organization and retrieval by emotion. It is generally believed that music cannot be composed, performed, or listened to without considering the affection involved. Music information behavior studies have also identified emotion as an important criterion used by people in music seeking and organization [1], [2]. According to a study of social tagging on the popular music website Last.fm, after genre and locale tags, the mood tag is the third most frequent type of tag assigned to music pieces by online users [7].
As West et al. [8] stated in their study, human beings often use "contextual or cultural labels" for music. The cultural references behind how people describe a music piece are changeable. Furthermore, one song can often be described with more than a single tag, genre, or emotion; defining it with a single label can be a further limitation.
In one music mood detection study, music mood is classified by asking users to select one mood picture from a set of options rather than a label. This concept was found to be successful and was stated to add novel experiences to music listening [4]. According to the study, it is a good way to receive music recommendations from real users based on their interpretations of mood pictures.
In addition to the importance of emotional state in music listening, it is argued that music listening is a very personal behavior [6].
In system development and evaluation, the need to consider human factors such as preference, activity, and emotion is strongly emphasized [9]. Users' involvement and their contribution through non-message-based interactions have become a major force behind successful online communities. Recognition of this new type of user participation is crucial to understanding the interests of the mass population [10]. For these reasons, many researchers have called attention to user-centered design for tagging music. In the literature, various crowdsourcing tasks are used for human assessment of music mood. Urbano et al. [11] stated that crowdsourcing is a perfectly viable alternative for evaluating music systems without the need for experts.
Users assign rich meanings to music emotion queries, whereas a music classification algorithm can only derive them from computational results, which are shallow from the users' perspective. Consequently, in this study crowdsourcing was used to obtain emotion tags for various songs. The concept of music emotions from the end-user's perspective was investigated by asking users to choose one emotion cluster from a set of options for various songs. Afterwards, we tried to formulate a text mining model that automatically recognizes the emotion of a given music piece from its lyrics. As a side note, the terms "mood" and "emotion" are used interchangeably in this study.
The contributions of this paper are twofold. First, to the best of our knowledge, this is the first study that attempts to extract emotions/moods from Turkish music lyrics. Second, the current study considers the use of n-gram features for automatically extracting emotions/moods from lyrics, and a comparative analysis is conducted to understand the effects of different stemming methods proposed for the Turkish language, term-weighting approaches, and classification algorithms on classification performance.

Music Mood Recognition
The widespread usage of smartphones and personal computers for accessing music, and the exploding amount of digital music content available to people, necessitate the development of novel algorithms and tools for easy and effective music retrieval. As almost every music piece is created to convey emotion, music organization and retrieval by emotion is a reasonable way of accessing music information [1], [2]. A significant amount of work has been done on music mood recognition based solely on audio, lyrics, or crowdsourced tags, as well as on multi-modal approaches in which audio, lyrics, and tags are used together to obtain more accurate and reliable mood classifiers.
Early work on music mood recognition started as a special case of music tagging, using categorical labels such as happy or sad [12]. Feng and his colleagues [12] used an approach named Computational Media Aesthetics (CMA) to classify music emotion. In their approach, they assume that composers choreograph expectation to arouse emotion, and performers convert musical intention into musical language to arouse emotion. Thus, they analyzed music mood from the viewpoint of how music is made. In their scheme, the music database is indexed with four music mood labels, namely "happiness", "sadness", "anger" and "fear", and three features (relative tempo, and the mean and standard deviation of the average silence ratio, capturing articulation) are used to classify mood with a backpropagation neural network.
Some web services provide audio decoding of musical features, which are then used as a base for automatic music emotion detection tasks. Echo Nest [13] has offered a web service that provides users a set of musical features, like timbre, pitch, and rhythm. Similarly, the MIR Group of the Vienna University of Technology [14] also made a web service available that returns a set of musical features for a given song such as rhythm patterns, statistical spectrum descriptors and rhythm histograms, and allows the training of self-organizing music maps [15].
In the study by Liu et al. [9], LiveJournal dataset is used to predict user mood. This dataset contains blog articles from the social blogging website LiveJournal. Instead of being collected in a controlled environment, data is contributed by users spontaneously during their regular daily lives. The study offers insights into the role of music in mood regulation and demonstrates how LiveJournal dataset with two-million (LJ2M) articles can contribute to studies on real world music listening behavior [9]. Moreover, a million-scale music-listening dataset was obtained from music related Twitter hashtags in another study [16].
A lyrics-based classification technique using n-gram features is proposed by Fell and Sporleder [20]. The novelty of their approach is the varied dimensionality of the lyrics features, such as style, song structure and orientation towards the world, beyond vocabulary and semantics. To determine the style of a song, a rhyme detection tool is used. The Regressive Imagery Dictionary method is applied for semantic evaluation, to find imageries of the lyrics such as conceptual thought and primordial thought.
Hu and Downie [17] present a study comparing music classification techniques using lyrics features and audio features. In their study, in order to find effective features for each specific mood, the accuracy of selected audio and lyrics features, including a psycholinguistic lexicon, is evaluated for each mood. The most promising accuracy results are achieved using context word (CW) lyrics features, with an average accuracy of 61.7%. Precision and recall values are not provided in the study. In conclusion, lyrics features are found to be the most effective ones in classifying the majority of the moods.

Music classification in non-English languages and text mining of Turkish lyrics
Text mining is a special form of data mining which includes searching in and interpreting retrieved textual information. Text mining of song lyrics is a widely used method in MIR and classification. Text mining is a process which generally comprises text-preprocessing, term-by-document matrix generation and knowledge extraction steps.
Earlier works on lyric analysis for languages other than English rely on lexicon-based methods. For instance, Cho and Lee [18] used a manually built lexicon in Korean to extract emotion vectors and recognized moods accordingly. Logan and Salomon [19] categorized stemmed words taken from news and lyrics. The aim of their study was to evaluate artist similarities of songs using lyrics, and they measured similarities based on categorized stems.
Kim and Kwon [21] proposed a method whose claimed strength is a feature extraction approach adapted to retrieve emotion with respect to the specialties of the Korean language. Such features are measured by emotion condition change, negative word combination, time of emotion and interrogative sentence existence. Howard et al. [22] conducted another study focusing on lyrics in languages other than English for the music genre classification problem. In their study, a multilingual setting is considered where songs are written in Spanish and Portuguese. Claiming that traditional text preprocessing techniques may not be suitable for multilingual texts, they ran experiments to examine the use of stemming and stopword extraction. As a result, they reported that stopword removal decreases accuracy in all classification algorithms.
Türkmenoğlu and Tantuğ [23] performed sentiment analysis of Turkish social media to compare Lexicon-based and Machine Learning (ML) based methods. In their study, they found that the ML-based method performs better than the Lexicon-based method on both short and long informal texts. In another study, Vural et al. [24] presented a framework for unsupervised sentiment analysis of Turkish text documents. In their work, the authors customized the SentiStrength sentiment analysis library by translating its lexicon into Turkish and used it for classifying the polarity of Turkish movie reviews. They achieved 76% accuracy with their proposed technique, which is unsupervised and not specific to the studied problem domain. A more general framework called SentiTurkNet is proposed by Dehkharghani et al. [25], where three polarity scores are assigned to each synset in the Turkish WordNet to indicate its level of positivity, negativity, and objectivity. Using these polarity scores, they achieved 66.7% accuracy for ternary classification of movie reviews.
Zemberek is a Turkish text-preprocessing tool and the most widely used software library in Turkish text analysis. Although Zemberek provides extensive functionalities for performing the different phases of text mining as a whole, such as a diacritic restorer for Turkish (deASCIIfier) and a part-of-speech tagger, we focused mainly on its word stemming operations.
In the literature, several stemming methods have been proposed to find the stems of Turkish words. Tunalı and Bilgin [26] compared the Affix Stripping, Fixed Prefix and Zemberek stemming methods and their performances in stemming Turkish texts. In conclusion, they stated that the Zemberek and Fixed Prefix 5 methods are preferable due to their reduction rate [26]. As part of their work, they developed a software program named PRETO which provides several text-preprocessing functionalities, among which we have utilized the different stemming approaches and assessed their effect on music emotion classification.
In text mining, after text-preprocessing, a term-by-document matrix is generated based on the distribution and occurrences of terms within a set of documents, which are the song lyrics in our case. For the representation of the indices used in this matrix, the widely used term frequencies and tf-idf values were considered. The tf-idf score is computed as the multiplication of two measures: tf (term frequency) and idf (inverse document frequency). Here, tf represents the frequency of the term within a document (a single song lyric), whereas idf indicates how rare the term is across the whole document set (all song lyrics in the dataset) [27].
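To make the scoring concrete, the following minimal Python sketch computes tf-idf for a term over a toy corpus of tokenized lyrics. The function name, the toy data, and the plain log(N/df) idf variant are illustrative assumptions; libraries such as scikit-learn use smoothed variants of the same idea.

```python
import math

def tf_idf(term, doc, corpus):
    """Compute the tf-idf score of `term` in `doc` given the whole corpus.

    tf  = frequency of the term within the document (a single song lyric),
    idf = log(N / df), where df is the number of lyrics containing the term.
    """
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# Toy corpus of tokenized "lyrics": a term appearing in every lyric gets idf = 0.
corpus = [["ask", "sev", "kalp"], ["ask", "yalan"], ["ask", "git", "kal"]]
print(tf_idf("ask", corpus[0], corpus))   # common term -> 0.0
print(tf_idf("kalp", corpus[0], corpus))  # rare term -> positive score
```

Note that a term occurring in every lyric receives an idf of zero, which is exactly why such terms cannot help differentiate songs from each other.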

Classification algorithms used for music emotion detection
In the following, we elaborate on the algorithms used in the literature for genre and emotion classification of music.
Using Gaussian mixture models and diagonal covariance matrices, Tzanetakis and Cook [28] achieved 61% classification accuracy with ten genres. The three features they used for classification were timbre texture, rhythmic content, and pitch content. Hamel and Eck [29] proposed a system that can automatically extract relevant features from audio for a given task. They obtained a classification accuracy of 84.3% on the dataset of Tzanetakis and Cook [28] by using deep belief networks and a non-linear Support Vector Machine (SVM) classifier. McKay and Fujinaga [30] used feedforward neural networks and k-nearest neighbour classifiers to classify recordings by genre using features based on instrumentation, texture, rhythm, dynamics, pitch statistics, melody and chords. Consequently, for a hierarchical taxonomy consisting of 9 leaf genres, classification accuracies of 98% and 90% were obtained for root genres and leaf genres, respectively [30].
In Music Emotion Retrieval (MER), emotions are categorized into a number of classes (such as happy, angry, sad, and relaxed), and then selected machine learning techniques are applied to create an emotion classifier [31]. In this respect, several machine learning algorithms have been applied to learn the relationship between music features and emotion labels, such as neural networks [12], support vector machines [32], [33], fuzzy c-means classifier [34], and k-nearest neighbor [35]. Subsequently, models generated through the application of these techniques are used to identify the emotion of a music piece given as the input.
Weninger et al. [36] found that recurrent neural networks outperform both support vector regression (SVR) and feedforward neural networks in both continuous-time and static music mood regression, achieving an R² of up to 0.70 and 0.50 for arousal and valence annotations, respectively.
Liu et al.'s [9] study offers insights into the role of music in mood regulation and demonstrates how LJ2M (LiveJournal 2 million) can contribute to studies on real-world music listening behavior. They employed MER models trained from a Last.fm dataset of 31,427 songs, which considers a total of 190 music emotion classes. In their study, they adopted the 12-D EchoNest timbre descriptor as the underlying feature representation and used a support vector machine with the radial basis function kernel. The average performance of the 190 binary classifiers is 73.9% in terms of area under the curve, according to cross-validation results obtained from the Last.fm dataset.
Measuring the similarity of songs or artists using lyrics has also attracted a lot of attention in the field of text mining. Logan and Salomon [19] used the Probabilistic Latent Semantic Analysis method for text analysis of lyrics. Kim and Kwon [21] proposed a lyrics-based emotion classification system based on Partial Syntactic Analysis, which reported 58.8% accuracy with their improved emotion features. Bag of Words is another method used for feature extraction from lyrics [22]. However, many studies show that combining text analysis and acoustic analysis provides better results for the music classification problem [37].

Methodology
In this section, the methodology followed in this research is explained. First, we give details on how we gathered the necessary data for the selected 300 songs, and then we elaborate on the emotion tagging process. Thereafter, we clarify the text mining analysis process employed in building a classifier that categorizes given songs with respect to their perceived/exhibited emotion. The process of our analysis is illustrated in Figure 1; each step is elucidated in the following subsections.

Data gathering and preparation
In this phase, 45 Turkish popular music artists were selected from the 282 artists enlisted on the Turkish Wikipedia page [38]. Thereafter, we selected 10 to 20 songs from each artist, and the corresponding lyrics of these songs were collected from the web with custom code.
Several problems were encountered with the collected data, and we had to perform some elimination over it. Some songs were in languages other than Turkish, so we removed them from the data set. Besides, only original versions of songs were intended to be included in the data set; therefore, if a randomly retrieved song was a remix, acoustic, or otherwise adapted version of the original, it was also removed. Another set of eliminated songs consists of emotionally confusing songs for which it is hard to decide which mood category they belong to. All songs were tagged for perceived emotions by 3 people. If at least 2 of them agreed on the mood category of a song, it was kept in the data set; otherwise, the song was identified as noisy data and excluded. In the end, we were left with 300 songs in the data set, where each of the four mood categories contains 75 songs. An equal number of songs per category was selected and tagged in order to avoid the imbalanced learning problem; this approach has been adopted in similar studies [22].
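The majority-vote filter described above can be sketched as follows; the function name and the example annotations are hypothetical, not the authors' actual code.

```python
from collections import Counter

def majority_mood(tags):
    """Return the agreed mood if at least 2 of the 3 annotators concur,
    otherwise None (the song is treated as noisy and dropped)."""
    mood, count = Counter(tags).most_common(1)[0]
    return mood if count >= 2 else None

# Hypothetical annotations for two songs:
print(majority_mood(["happy", "happy", "calm"]))  # "happy" (2 of 3 agree)
print(majority_mood(["sad", "angry", "calm"]))    # None -> excluded
```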

Mood labelling (Emotion tagging)
Russell [39] proposed the circumplex model of affect based on a two-dimensional model in which the dimensions are "pleasant-unpleasant" and "arousal-sleep". There are 28 affect words in Russell's circumplex model, shown in Figure 2. Several researchers have adopted a subset of Russell's taxonomy. Each of the selected Turkish songs was tagged by at least 3 different volunteer human taggers, all of whom come from the same socio-economic background and education; each tagger was a PhD student in the Department of Management Information Systems. They worked independently and assigned each song to one of the 4 mood categories. In the end, more than 400 songs were tagged by the human annotators, and among those for which at least two of the three annotators agreed, 300 were selected so as to obtain class balance between the mood categories, as given in Table 2.

Reliability analysis
Based on the annotator agreement results, at least two of the three human annotators could agree upon a single mood category for only 76% of the songs assigned to them, just based on their lyrics. Besides, among the songs for which a single mood was agreed upon, only 49% were labeled as belonging to the same mood category by all three annotators. This shows the difficulty of classifying a given song into a single mood category.
Moreover, the inter-annotator agreement for the mood annotation task was examined in order to assess interrater reliability. Based on Cohen's kappa [42], we obtained a highest pairwise inter-annotator agreement of 0.61. Besides, inter-annotator agreement based on Fleiss' kappa is moderate, at the 0.55 level [43], [44].
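As an illustration, pairwise Cohen's kappa can be computed with scikit-learn's `cohen_kappa_score`; the annotator labels below are invented for the sketch (0-3 standing for the four mood categories).

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical mood labels from two annotators over eight songs:
annotator_a = [0, 0, 1, 2, 3, 1, 2, 3]
annotator_b = [0, 0, 1, 2, 3, 1, 3, 2]

# Cohen's kappa corrects the observed agreement (here 6/8) for the
# agreement expected by chance given each annotator's label distribution.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(kappa)
```

Fleiss' kappa, which generalizes agreement to three or more raters, would require a different routine (e.g. `fleiss_kappa` from statsmodels).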

Preprocessing of the Turkish song lyrics and feature extraction
Lyrics can provide valuable information about the mood of a song. In this respect, to classify the songs into four emotion categories based only on their lyrics, we first extracted lyrics from the database of the music lyrics website "songlyrics.com" [45]. This database provides a Java-based application programming interface for downloading lyrics with keywords in the form of "track name" + "artist name". Lyrics that we could not find on this site were found by querying the Google search engine.
In order to accomplish the preprocessing of the Turkish song lyrics, we utilized the PRETO tool, designed by Tunalı and Bilgin [26]. This tool was used for the Turkish word stemming and stopword elimination parts. With PRETO, one can apply word filtering, such as removal of words containing fewer than 3 letters, and/or automatically perform de-asciification of words in song lyrics. For stopword elimination, we enhanced a Turkish stopword dictionary obtained from the Python package repository [46]. As a result, 223 stopwords in total were excluded from the lyrics in our analysis.
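Since PRETO itself is not reproduced here, a rough Python stand-in for the filtering steps just described (minimum word length of 3 and stopword removal) might look like the following; the tiny stopword set is illustrative, and Turkish-specific casing (e.g. dotless "ı") is ignored for simplicity.

```python
# Illustrative subset of a Turkish stopword dictionary (the study uses 223 entries).
STOPWORDS = {"ve", "bir", "bu", "da", "de", "ki"}

def preprocess(lyric):
    """Tokenize a lyric, drop words shorter than 3 letters, remove stopwords."""
    tokens = lyric.lower().split()
    return [t for t in tokens if len(t) >= 3 and t not in STOPWORDS]

print(preprocess("Bu bir ask ve yalan"))  # -> ['ask', 'yalan']
```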
The PRETO tool includes several approaches for stemming Turkish words. We analyzed and compared the methods available in PRETO; in their original work, the Zemberek method provided the highest quality results in grouping Turkish words [26].
In our study, Unigram, Bigram and Trigram bag-of-words features are extracted from the lyrics after stopword removal, both in their original (non-stemmed) forms and after applying the available stemming procedures. Term-by-document matrices created using both original and stemmed words of the lyrics (the Affix Stripping, Fixed Prefix, Zemberek and Zemberek Long stemming methods [26] are considered) are fed into supervised learning algorithms to generate the corresponding mood detection models. In this respect, both term frequencies and term frequency-inverse document frequency (tf-idf) scores are employed as index values, to compare which one best fits the Turkish song mood detection task.
To generate the term-by-document matrix, we first extracted word stems using the PRETO tool. During the extraction phase, we solved some issues we encountered by integrating some Java code into the source code provided by the authors. Terms that occur in more than 95% of the lyrics were eliminated from the analysis, since they cannot differentiate the songs from each other.
As mentioned before, we compare two different representations of the indices used in the term-by-document matrix: term frequencies and tf-idf scores. Unigram, Bigram and Trigram features are all considered in this study. First, we generate classifiers using only UNIGRAM word features and then compare the results with classifiers generated using BIGRAM+ (Unigram and Bigram words together) and TRIGRAM+ (Unigram, Bigram and Trigram words together) word features. The term-by-document matrices generated by these six possible combinations of representations and n-gram features were fed into different classification algorithms in order to build a classification model for Turkish song moods. Finally, these models were cross-validated based on their accuracies in order to obtain the best mood detection model.
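The six representation/feature combinations can be sketched with scikit-learn's vectorizers. The toy preprocessed lyrics are invented, and `max_df=0.95` mirrors the 95% document-frequency cutoff described above; this is a sketch of the combination grid, not the study's exact pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy, already-preprocessed lyrics standing in for the real 300-song corpus:
lyrics = ["ask sev kalp", "yalan git kal", "ask yalan agla"]

# Six combinations: {term frequency, tf-idf} x {UNIGRAM, BIGRAM+, TRIGRAM+}.
matrices = {}
for feat_name, max_n in [("UNIGRAM", 1), ("BIGRAM+", 2), ("TRIGRAM+", 3)]:
    for rep_name, Vectorizer in [("tf", CountVectorizer), ("tfidf", TfidfVectorizer)]:
        vec = Vectorizer(ngram_range=(1, max_n),
                         max_df=0.95)  # drop terms in >95% of the lyrics
        matrices[(rep_name, feat_name)] = vec.fit_transform(lyrics)

# Each entry is a (songs x terms) term-by-document matrix.
print(matrices[("tf", "UNIGRAM")].shape)
```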

Classification model building and testing
For the classification model building and testing step, we employed the scikit-learn Python library [47]. To obtain better classification performance, we performed mood detection with different classification methods. The algorithms considered in this study for mood detection are: support vector machines (SVM) with a linear kernel, for which the libsvm-based implementation called SVC is used; the k-Nearest Neighbor method (k-NN), where the best k value is found with the GridSearch method provided in scikit-learn; Multinomial Naïve Bayes; a Random Forest classifier containing 100 trees; and Logistic Regression.
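A minimal sketch of instantiating the five classifiers with scikit-learn follows; the grid of candidate k values for k-NN is an assumption, as the concrete grid is not listed in the text.

```python
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

classifiers = {
    "SVM": SVC(kernel="linear"),  # libsvm-based SVC with linear kernel
    "k-NN": GridSearchCV(KNeighborsClassifier(),
                         {"n_neighbors": [1, 3, 5, 7, 9]}),  # assumed grid of k
    "MultinomialNB": MultinomialNB(),
    "RandomForest": RandomForestClassifier(n_estimators=100),  # 100 trees
    "LogisticRegression": LogisticRegression(max_iter=1000),
}
```

Each estimator exposes the same `fit`/`predict` interface, so the term-by-document matrices from the previous step can be passed to all five uniformly.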
To obtain reliable accuracy estimates for these models, and to avoid overfitting, a 10-fold cross-validation procedure is employed. For model comparisons, precision and recall values are considered as the performance metrics.
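The evaluation step might be sketched as below, using a synthetic 4-class, 300-sample dataset in place of the real term-by-document matrix (an assumption for illustration) and macro-averaged precision and recall under 10-fold cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate

# Synthetic 4-class data standing in for the 300-song term-by-document matrix:
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=4, random_state=0)
X = np.abs(X)  # MultinomialNB expects non-negative feature values

scores = cross_validate(MultinomialNB(), X, y, cv=10,
                        scoring=["precision_macro", "recall_macro"])
print(scores["test_precision_macro"].mean(),
      scores["test_recall_macro"].mean())
```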

Results
In this section, the accuracy performances of the models created for music mood detection based on song lyrics are compared and explained.
The best accuracy performance is achieved by feeding the Multinomial Naïve Bayes classifier with the term-by-document matrix generated from the unigram stemmed words obtained by applying the Zemberek Long stemming method, with term frequencies used as the index representation. In Table 3, summary results are given for each stemming method considered, alongside the result obtained using the original words. Here, only the results for the combinations that achieve the best accuracy performances, sorted by recall scores, are given.
As seen in Table 3, Zemberek Long, Zemberek and Fixed Prefix show close accuracy performances, with the Zemberek Long stemming method leading to the best precision and recall values.
In Table 4, the top 10 accuracy scores obtained from the generated models are given irrespective of the combination of methods used, again sorted by recall values. It can be inferred from Table 4 that the stemming procedure has a significant effect on the performance of the generated classification model: the Zemberek and Zemberek Long methods achieved 9 of the top 10 accuracy scores. Besides, only three classification methods, Multinomial Naïve Bayes, SVM and Logistic Regression, give rise to the top 10 scores, whereas the accuracy scores obtained with the other two algorithms, k-NN and Random Forest, are significantly lower. Table 5 shows the accuracy results obtained for each musical mood category by the selected classifiers when the Zemberek Long stemming method is applied to the lyric words. From Table 5, we can conclude that while high accuracy values are obtained consistently for the "Happy" category, the best recall scores achieved depend mostly on the classification algorithm used and the index representation utilized.
Exceptionally high recall values are obtained for the "Happy" class when the k-NN method is used with the term-by-document matrix created using either BIGRAM+ (Unigram and Bigram words together) or TRIGRAM+ (Unigram, Bigram and Trigram words together) words, with word frequencies used as the index values.
In terms of precision, the lowest values are always obtained for the "Calm" mood class, whereas the best precision scores are obtained for the "Happy" and "Angry" classes, respectively. Since the lowest values are consistently obtained for the "Calm" category, future efforts should focus on understanding the reasons for these low accuracy scores and on improving the results for this category.
To conclude this section, the most frequent (unigram, bigram and trigram) words encountered in the lyrics are given in the following tables. In Table 6, the 10 most frequent unigram words (stems) in the song lyrics are given for each category.
Highly ranked content words seem to have meaningful connections to the categories, such as "aşk/love", "sev/like", "kalbi/from heart" in Happy songs. The categories Sad and Angry include similar words which have negative meaning like "ağla/cry", "kal/stay", "git/go away" and "yalan/lie". On the other hand, almost none of the words of "Calm" category carry an emotional meaning except "bırak/leave". When we compare the success rates of Unigram, Bigram and Trigram text feature based classification models, we conclude that Unigram words convey more information to the classification algorithms. Table 7 and Table 8 show the 10 most frequent Bigram and Trigram words, respectively.

Conclusion
In this study, mood detection of songs via text mining of song lyrics through a bag-of-words approach is investigated. To this end, several classification algorithms were examined with various textual features, including Unigram, Bigram and Trigram features. Besides, to impose structure on the text, we generate term-by-document matrices utilizing both term frequencies and tf-idf scores and try to understand which representation and feature set fits the mood detection problem best. Recall and precision scores were used as the performance measures. The PRETO tool and the scikit-learn library were employed for preprocessing the lyrics and creating the classifiers, respectively. Three learning algorithms, namely Multinomial Naïve Bayes, SVM and Logistic Regression, fit the mood detection task best out of the five chosen classification algorithms, which also included k-NN and Random Forest. Besides, the Zemberek Long stemming method achieved the best accuracy results in terms of both recall and precision values. When we consider n-gram word features, that is, when the success rates of Unigram, Bigram and Trigram text feature based classification models are compared, we can conclude that Unigram words convey more information to the classification algorithms. The subjectivity of emotion annotation might be a limitation of our study and of related mood detection studies. To mitigate this problem, the songs in this study were annotated by 3 people in the same period of time.
Although the annotators share the same socio-economic profile and thus cannot represent the whole community, the resulting labels can be considered consistent within themselves.
Therefore, utilizing crowdsourcing for emotion tagging with more taggers is recommended to achieve more reliable labelling of the songs.
As a second limitation, we have only benefited from pure n-gram text features. The accuracy performances of the models might be improved by utilizing other syntactic as well as semantic properties of the Turkish language, such as using emotionally representative words, handling negative expressions in Turkish and utilizing Part-of-Speech tags. In addition, the accuracy of the framework can also be improved by incorporating techniques that consider word orderings.
Moreover, a single mood label was chosen for each song, and songs not labeled the same by at least two of the three annotators were eliminated. Therefore, in order not to lose precious data and to improve accuracy performance, the problem can be redefined as a multilabel classification problem.
Finally, the different detection models could be applied in different Valence-Arousal Quadrants [48] to get results that are more accurate in estimating a song's emotion. Besides, in a future study, we are planning to integrate text-based features with acoustic features in order to improve classifier performances.