Keyword extraction and dataset filtering
We started by extracting a subset of quotes ("quotes") from the Quotebank dataset according to a keyword list. This list was built with a logistic regression model trained on two external datasets, "Wiki_train.tsv" and "train_1-tsv", and contains the words most indicative of climate-related quotes. We then constructed a new dataset ("speakers") by cleaning and filtering "wikidata_labels_descriptions_quotebank.csv". For further analysis, we created a "one_hot.bz2" file containing one-hot encodings of our categorical features of interest, restricted to the 20 most frequent values of each feature (column).
To visualize the temporal evolution of the environmental topic, we plotted the distribution of climate-change-related quotes over months and years, using data from the "quotes" dataframe.
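The monthly and yearly counts behind these plots can be computed as in the sketch below (column names such as "date" and "quotation" are assumptions about the schema; the bar plot itself additionally requires matplotlib).

```python
import pandas as pd

# Toy stand-in for the filtered "quotes" dataframe.
quotes = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-05", "2017-01-20", "2017-02-10", "2018-02-11"]),
    "quotation": ["q1", "q2", "q3", "q4"],
})

# Number of climate-related quotes per month and per year.
per_month = quotes.groupby(quotes["date"].dt.to_period("M")).size()
per_year = quotes.groupby(quotes["date"].dt.year).size()

print(per_month)
print(per_year)
# per_year.plot(kind="bar") would draw the distribution with matplotlib installed.
```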
Climate speaker profile characterization
To obtain the 20 most frequent values in each column of the "speakers" dataframe, we exploded the list-valued attributes and plotted the resulting counts as bar plots.
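The explode-and-count step looks roughly like this (the "occupation" column and its contents are illustrative assumptions about the "speakers" schema):

```python
import pandas as pd

# Toy "speakers" frame where a cell can hold several values per speaker.
speakers = pd.DataFrame({
    "occupation": [["politician", "lawyer"], ["scientist"], ["politician"]],
})

# One row per (speaker, value), then count and keep the 20 most frequent values.
top = (speakers["occupation"]
       .explode()
       .value_counts()
       .head(20))
print(top)
# top.plot.barh() would draw the bar plot with matplotlib installed.
```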
We used VADER sentiment analysis to measure the proportion of positive and negative emotion in the quotes from the "quotes" dataframe over time. We then extracted and visualized the main words of the most positively and most negatively rated quotes, which let us sanity-check the overall sentiment analysis. To investigate specific groups more precisely, we took speakers from the two main US parties in the "speakers" dataframe and analyzed their overall sentiment.
We performed classification tasks on the dataframe containing the categorical data (extracted from the one-hot file). First, we fitted a baseline model to establish a reference point. Then, after comparing PCA with standardization, we fitted a new logistic regression model. Finally, we tested the importance of each value of each feature by fitting several logistic regression models and selected the best one according to its ROC AUC score.
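The baseline-versus-pipeline comparison can be sketched as follows. This uses synthetic data in place of our one-hot features, and the particular models (a prior-based dummy baseline, standardized logistic regression, PCA plus logistic regression) are illustrative choices evaluated with ROC AUC.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the one-hot feature matrix and target.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "baseline": DummyClassifier(strategy="prior"),
    "scaled logistic": make_pipeline(StandardScaler(),
                                     LogisticRegression(max_iter=1000)),
    "PCA logistic": make_pipeline(PCA(n_components=10),
                                  LogisticRegression(max_iter=1000)),
}

# Fit each candidate and score it on held-out data with ROC AUC.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(scores)
```

The same loop structure extends to the per-feature importance test: refit the logistic regression while dropping one group of one-hot columns at a time and compare the resulting ROC AUC scores.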