The research field of Natural Language Processing (NLP) puts heavy emphasis on empirical results. Models appear to reach state-of-the-art, even “super-human,” performance on language understanding tasks almost daily, thanks to large datasets and powerful models with billions of parameters. However, existing evaluation methodologies often lack rigor, leaving the field susceptible to erroneous claims. In this talk, I will describe efforts to build a solid framework for evaluation and experiment design across diverse NLP tasks.
To begin, I will describe how NLP researchers currently conduct experiments and measure model performance, highlighting some important limitations. I will then present our work, in which we propose statistical analyses for comparing models that allow researchers to make credible and statistically valid claims in many experimental settings in NLP, such as experimenting with deep neural network models or reporting model performance across multiple languages.
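The abstract does not name the specific statistical analyses; one widely used approach for comparing two NLP systems is paired bootstrap resampling over per-example scores. A minimal sketch (the function name and the toy score lists are hypothetical, for illustration only):

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Estimate how often system A's mean score beats system B's when
    test examples are resampled with replacement (paired bootstrap)."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw a bootstrap sample of example indices, shared by both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples

# Hypothetical per-example scores for two systems on the same test set.
a = [0.80, 0.70, 0.90, 0.60, 0.75, 0.85, 0.70, 0.80]
b = [0.70, 0.70, 0.80, 0.60, 0.70, 0.80, 0.70, 0.70]
print(paired_bootstrap(a, b))  # fraction of resamples where A wins
```

If this fraction is close to 1, the difference is unlikely to be an artifact of test-set sampling; pairing the resamples is what makes the comparison valid for two systems evaluated on the same examples.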
Then, I will discuss challenges in evaluating text-generation tasks, such as machine translation and automatic summarization. Evaluating text-generation models is not straightforward because different texts can convey the same meaning; comparing a model’s output against a human-written reference may therefore not reflect the true quality of the output. Researchers have designed metrics to evaluate text-generation systems automatically, but none of them measures all of the aspects we wish to evaluate. I will propose methods for estimating the quality of these evaluation metrics and identify limitations of one family of techniques known as reference-free evaluation. Finally, I will discuss future research directions and challenges in estimating the true performance of NLP models.
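To see why comparing generated text against a single reference can mislead, consider a toy surface-overlap metric (a stand-in for metrics like BLEU or ROUGE; the function and example sentences are illustrative, not from the talk):

```python
def unigram_f1(candidate, reference):
    """Toy reference-based metric: F1 over sets of unigram tokens."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    if not cand or not ref:
        return 0.0
    overlap = len(cand & ref)
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
verbatim = "the cat sat on the mat"
paraphrase = "a feline was sitting on a rug"  # same meaning, different words

print(unigram_f1(verbatim, reference))    # 1.0: exact surface match
print(unigram_f1(paraphrase, reference))  # low score despite equivalent meaning
```

The paraphrase conveys the same meaning yet scores near zero, illustrating why surface comparison against one reference can understate output quality.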
This talk is based on papers published in TACL (2017, 2021), ACL (2018, 2019), NAACL 2022, and EMNLP 2022, and on a book published by Morgan & Claypool Publishers in 2020.