“Quotations will tell the full measure of meaning, if you have enough of them.”
—James Murray—

Quotebank dataset

Quotebank is a corpus of quotations attributed to the speakers who uttered them, extracted from 162 million English news articles published between 2015 and 2020. The quotations were extracted with Quobert, a machine learning framework for extracting and attributing quotations from a corpus of news articles.

Clean and filtered Quotebank Climate-related dataset

We worked with a subset of Quotebank containing the quotations that we identified as related to climate change. It was obtained by cleaning the full Quotebank dataset and then filtering it with a list of keywords. The keyword list was extracted using logistic regression from two labeled datasets (see below). The filtered datasets are referred to as "clean_quotes-year.bz2".
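The keyword filtering step can be sketched roughly as follows. This is only an illustration, not our exact pipeline: it assumes each input file is bz2-compressed JSON Lines with the quote text in a field named "quotation", and the function names and example keywords are hypothetical.

```python
import bz2
import json
import re


def build_keyword_pattern(keywords):
    """Compile one case-insensitive regex that matches any keyword as a whole word."""
    # Longer keywords first, so multi-word phrases are tried before their parts.
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def is_climate_related(text, pattern):
    """Return True if the text contains at least one keyword."""
    return pattern.search(text) is not None


def filter_quotes(in_path, out_path, pattern, text_field="quotation"):
    """Stream a bz2 JSON Lines file and keep only the matching records.

    The field name "quotation" is an assumption about the input schema.
    Returns the number of records written.
    """
    kept = 0
    with bz2.open(in_path, "rt", encoding="utf-8") as src, \
            bz2.open(out_path, "wt", encoding="utf-8") as dst:
        for line in src:
            record = json.loads(line)
            if is_climate_related(record.get(text_field, ""), pattern):
                dst.write(json.dumps(record) + "\n")
                kept += 1
    return kept


# Hypothetical keywords, for illustration only.
pattern = build_keyword_pattern(["climate change", "global warming"])
```

Streaming the file line by line keeps memory usage flat, which matters because a single year of quotations is too large to load comfortably at once.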

Logistic regression training datasets

"train1.tsv" and "Wiki_train.tsv" are the two datasets used to train and test the logistic regression model from which we extracted our list of keywords. They come from the article by Varini et al. Each dataset consists of sentences labeled as climate-related or not, obtained by applying Active Learning to previously existing datasets.

Speaker dataset

In addition to the climate-related Quotebank subset, we used a list of all the speakers that appear in the Quotebank dataset, together with their characteristics, such as nationality, gender, education, political party, date of birth, ethnic group and religion. We enriched this data with a boolean column indicating whether the speaker talks about climate: the value is True if at least one quote from the speaker is found in the climate-related dataset. For further analyses, we kept the 20 most frequent values in each column and turned them into one-hot columns ("one_hot.bz2"). This enabled us to handle the categorical data.
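The two enrichment steps described above could look roughly like this with pandas. It is a minimal sketch under simplifying assumptions: each speaker is taken to have a single value per attribute, and the column names ("id", "gender", "talks_about_climate") and function names are hypothetical.

```python
import pandas as pd


def add_climate_flag(speakers, climate_speaker_ids, id_col="id"):
    """Add a boolean column that is True for speakers with at least one climate quote."""
    out = speakers.copy()
    out["talks_about_climate"] = out[id_col].isin(list(climate_speaker_ids))
    return out


def one_hot_top_values(df, column, top_n=20):
    """One-hot encode the top_n most frequent values of a column.

    Rows holding a rarer (or missing) value get zeros in every generated column.
    """
    top_values = df[column].value_counts().head(top_n).index
    # Replace everything outside the top values with NaN so it yields no column.
    kept = df[column].where(df[column].isin(top_values))
    return pd.get_dummies(kept, prefix=column)


# Toy data, for illustration only.
speakers = pd.DataFrame({
    "id": ["Q1", "Q2", "Q3", "Q4", "Q5"],
    "gender": ["female", "male", "female", "male", "non-binary"],
})
flagged = add_climate_flag(speakers, ["Q2"])
gender_one_hot = one_hot_top_values(speakers, "gender", top_n=2)
```

Capping each attribute at its most frequent values keeps the number of one-hot columns manageable, at the cost of grouping all rarer values into an implicit "other" category.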

All our datasets can be found here.