The #BenderRule: On Naming the Languages We Study and Why It Matters


High Resource Languages vs Low Resource Languages

Progress in the field of Natural Language Processing (NLP) depends on the existence of language resources: digitized collections of written, spoken or signed language, often with gold standard labels or annotations reflecting the intended output of the NLP system for the task at hand (e.g. the gold standard text for a speech recognition system or gold standard user intent labels in a dialogue system such as Siri, Alexa or Google Home). Unsupervised, weakly supervised, semi-supervised, or distantly supervised machine learning techniques reduce the overall dependence on labeled data, but even with such approaches, there is a need for both sufficient labeled data to evaluate system performance and typically much larger collections of unlabeled data to support the very data-hungry machine learning techniques.

This has led to a digital divide in the field of NLP between high resource and low resource languages. High resource languages constitute a short list starting with English, (Mandarin) Chinese, Arabic and French,[1] and then possibly also including German, Portuguese, Spanish and Finnish. These languages have large, accessible[2] collections of digitized text, large collections of recorded speech (these are all spoken, not signed, languages), much of which has been transcribed, as well as annotated resources such as treebanks and evaluation sets for a large number of NLP tasks, and NLP tools such as off-the-shelf parsers, morphological analyzers, named entity recognizers, etc. As of August 2019, the LRE Map lists 961 resources for English and another 121 for American English, 216 for German, 180 for French, 130 for Spanish, 103 for Mandarin Chinese, and 103 for Japanese. The only other languages with more than 50 resources listed there are Portuguese, Italian, Dutch, Standard Arabic, and Czech.[3] The remainder of the world’s ~7000 languages have far fewer resources.

Not unrelatedly, the bulk of the research published at major NLP conferences, by researchers working in countries all around the world, focuses on the high resource languages, and disproportionately on English. Robert Munro, Sabrina Mielke and I have all done surveys of the languages reflected in major NLP conferences, the results of which I have summarized in the following table.[4][5]

| Conference | % English | Next most common language(s) | % next most common language(s) | Source |
|---|---|---|---|---|
|ACL 2004|87|Chinese|9|Mielke 2016[6]|
|ACL 2008|63|German, Chinese|4|Bender 2009[7]|
|ACL 2008|87|Chinese|16|Mielke 2016[6:1]|
|EACL 2009|55|German|7|Bender 2011[8]|
|ACL 2012|86|Chinese|23|Mielke 2016[6:2]|
|ACL 2015|75|Chinese|5|Munro 2015[9]|
|ACL 2016|90|Chinese|13|Mielke 2016[6:3]|

Though English and Mandarin Chinese are widely spoken, both as first and second languages,[10] clearly a world in which advanced language technology exists only for these two languages is undesirable. The promise of language technology includes pro-social applications spanning a wide range from biomedical applications (e.g. matching patients to research studies or automatically flagging patients for time-sensitive tests based on physician notes), through machine translation of documents available on the web, to interactive tutors for language learning and other learning scenarios, and more. These benefits should be available to all.[11] Furthermore, the existence of even the most basic language technology (keyboards or input systems supporting the writing system, spell checkers, web search) builds up the value of a language, which can be an important factor in self-esteem and educational outcomes for speakers of minoritized languages and can contribute to the maintenance of languages under threat of displacement by local majority languages (see e.g. Bamgbose 2011[12]).

And yet, the field of NLP is caught in a negative feedback loop that hinders the expansion of the languages we work on. Work on languages other than English is often considered “language specific” and thus reviewed as less important than equivalent work on English. Reviewers for NLP conferences often mistake the state of the art for English on a given task for the state of the art on that task in general, and if a paper doesn’t compare to that, they can’t tell whether it’s “worthy”.[13] I believe that a key underlying factor here is the misconception that English is a sufficiently representative language and that therefore work on (just) English isn’t also language-specific. This misconception is abetted by the habit of failing to name the language studied when it is English.

The History of the #BenderRule

In 2009, Tim Baldwin and Valia Kordoni organized a workshop of invited talks at EACL entitled “The Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous?” At that time, machine learning (pre-deep learning) for NLP was very much in vogue, and much of the rhetoric around it claimed that machine learning approaches to NLP were more economical because they required less input from linguistic experts than the previously dominant paradigm of rule-based NLP. A particularly strident form of this rhetoric (present in some but not all NLP papers of the day) asserted that NLP systems which don’t encode any specific linguistic knowledge are therefore “language independent”. In my paper at the EACL 2009 workshop (entitled “Linguistically Naïve != Language Independent: Why NLP Needs Linguistic Typology”[7:1]), I pushed back against this notion, arguing that if we only work on English (or English plus a small handful of other languages), we can’t tell whether the systems we build are in fact well-adapted to language in general: a feedback loop of overfitting sets in as we search for systems that do ever better on English test sets. The fact that no specific linguistic knowledge about English is directly encoded does not entail that the model will generalize across all languages.[14] Furthermore, if the goal is language-independent or cross-linguistically applicable systems, we are better advised to take advantage of linguistic knowledge. In particular, we should take advantage of the results of the field of linguistic typology, which studies the range of variation across the world’s languages and the limits on that variation.[15]

In Bender 2011 (“On Achieving and Evaluating Language-Independence in NLP”[8:1]), an extended version of the 2009 workshop paper, I include an etiquette-book-style list of “dos and don’ts” for language-independent NLP. It includes this early statement of what later came to be called the Bender Rule:

> Do state the name of the language that is being studied, even if it's English. Acknowledging that we are working on a particular language foregrounds the possibility that the techniques may in fact be language specific. Conversely, neglecting to state that the particular data used were in, say, English, gives [a] false veneer of language-independence to the work. (Bender 2011:18)

It wasn’t until 2019, however, that this message really caught on. In November 2018, while writing about language resources available for computational semantics and pragmatics, I once again found myself frustrated that even papers presenting language resources for English can fail to clearly state that English is the language in question, and I said as much in a tweet.

In March and then later in May 2019, Nathan Schneider,[16] Yuval Pinter,[17] Robert Munro,[18] and Andrew Caines[19] all independently coined “Bender Rule” or “Bender Clauses” as, variously, the practice of naming the languages studied, the practice of asking, as a reviewer, which languages were studied, or the practice of being skeptical of claims of language independence when only one test language was used. Eventually, the statement of the Bender Rule coalesced into “always name the language you’re working on”.

At NAACL 2019 and ACL 2019 and their associated workshops, there were several posters that directly mentioned the #BenderRule in the context of naming their languages. I suspect this was at least in part because appealing to an external rule helps when going against local social norms: in this case, the norm of treating the direct naming of English as redundant (and therefore silly) because English is the default, or because it’s obvious it must be English since the examples are in English, or because “everyone knows” that the resources used are English resources.[20]

The principle may seem obvious, even trivial, yet I am happy to lend my name to it, because I feel strongly that the field of NLP must broaden its purview beyond English and the handful of other well-studied languages, and I believe that we won’t get there unless we stop treating English as the default language and stop pretending that work on English and only English isn’t “language-specific”.

English is Neither Synonymous with Nor Representative of Natural Language

NLP is an interdisciplinary field, building on work in (at least) linguistics, computer science, statistics, and electrical engineering. One thing that linguists in particular bring to this enterprise is a focus on the phenomenon of language itself, as opposed to the information or communicative intent encoded in or communicated with specific language behavior. In my recent talk at Widening NLP 2019,[21] I likened this to a rain-spattered window. People working on e.g. information extraction are interested in the information encoded in digitized language, analogous to peering at the scene outside the window. People working in linguistics, on the other hand, are interested in the structures and patterns of language and how they relate to communicative intent, analogous to the patterns of the raindrops and how they affect how we see the scene outside the window.

[Image: Rain splashing down a window]

Stretching this metaphor a bit further, one can think of each language, including English, as a specific window with a specific pattern of raindrops, i.e. its own idiosyncrasies. Here’s a quick list of ways in which English fails to represent all languages, that is, properties of English that are not broadly shared, even among the world’s widely used languages:

  1. It’s a spoken language, and not a signed language.
    Right off the bat, if we take only English, we’ve restricted our attention away from an important class of languages.
  2. It has a well-established, long-used, roughly phone-based orthographic system.
    Phone-based means that the letters correspond to individual sounds, a principle which English orthography only approximates. Other languages such as Spanish have much more transparently phone-based orthographies, still others represent only consonants (e.g. Hebrew and Arabic, traditionally) or have symbols which represent syllables rather than single sounds (e.g. Malayalam, Korean, or the Japanese kana), or use logographic systems (e.g. Chinese, or the sinographs borrowed into Japanese as kanji; see Handel 2019[22]). And of course, many of the world’s languages are not written, or are written but don’t have a long tradition of being written and/or don’t have standardized orthographies. We routinely underestimate how much standardization simplifies the task of NLP for English.
  3. The standardized orthography for English provides a standardized notion of “word” indicated by whitespace.
    This isn’t true for all languages, even those with standardized orthographies. Many NLP systems for Chinese, Japanese, Thai and other languages have to start with the problem of word tokenization (illustrated in the sketch just after this list).[23]
  4. English writing uses (mostly) only lower-ASCII characters found on every computer.
    For the most part, we don’t have to worry about rarer character encodings, unsupported Unicode ranges, etc. when working with English.
  5. English has relatively little inflectional morphology and thus fewer forms for each word.
    Many kinds of NLP technology suffer from data sparsity problems, which are only exacerbated when one and the same word shows up in many different forms in a highly inflecting language; the sketch after this list illustrates the contrast. (Character-n-gram-based deep learning models mitigate this problem somewhat, but it remains an important difference between English and many of the world’s languages.)
  6. English has relatively fixed word order.
    Compared to many languages in the world, English is rigid in its word order, insisting on subject-verb-object in most circumstances, adjectives before nouns but relative clauses after, etc. Without testing on more flexible word order languages, how can we know the extent to which systems rely on this property of English?
  7. English forms might ‘accidentally’ match database field names, ontology entries, etc.
    Many language technologies achieve task-specific goals by mapping strings in the input language, or transformations of those strings into syntactic or semantic representations, to external knowledge bases. When the input strings and the field names or entries in the knowledge base are in the same language, processing shortcuts become available. But for how many languages is this true?
  8. English has massive amounts of training data available (like the 3.3B tokens used to train BERT (Devlin et al 2019[24])).
    If we focus all of our attention on methodologies which rely on amounts of training data that simply aren’t available for most of the world’s languages, how are we going to build systems that work for those other languages? Similarly, if we only value work that uses those technologies (e.g. in conference reviewing), how can we expect to make any progress on cross-linguistically useful NLP?
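To make items 3 and 5 above concrete, here is a minimal Python sketch. The example sentences, the Finnish word forms, and the mention of the jieba segmenter are my own illustrative assumptions, not material from the sources cited in this article:

```python
# Item 3: whitespace tokenization presupposes an orthography that marks
# word boundaries. English marks them; Chinese does not.
english = "The cat sat on the mat"
chinese = "猫坐在垫子上"  # roughly "the cat sat on the mat", written without spaces

print(english.split())  # ['The', 'cat', 'sat', 'on', 'the', 'mat']
print(chinese.split())  # ['猫坐在垫子上'] -- the whole sentence comes back as one "word"

# Chinese NLP pipelines therefore typically begin with a word segmentation
# step, e.g. using a segmenter such as jieba (one possible choice, named
# here only for illustration):
#   import jieba
#   print(list(jieba.cut(chinese)))

# Item 5: inflectional sparsity. An English noun has only a couple of
# written forms, while a Finnish noun has many case forms (a few singular
# ones shown here), so "the same word" is scattered across many distinct
# strings in a corpus, and each individual string is correspondingly rarer.
english_forms = {"house", "houses"}
finnish_forms = {"talo", "talon", "taloa", "talossa", "talosta",
                 "taloon", "talolla", "talolta", "talolle"}
print(len(english_forms), len(finnish_forms))  # 2 vs. 9, and that's not all of them
```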

Naming the Language is Just the First Step

I’m heartened that the field is starting to take up the point that we should name the language, even when it’s obviously English. However, as the field begins to grapple with the ethical implications of our work and the ways in which language technology has the capacity to negatively impact both users and bystanders (see e.g. Hovy & Spruit 2016[25], Speer 2017[26], Grissom II 2019[27]), it has become clear that there is much more we need to be saying about the data we use to train and test models.

The first thing to consider is variation within languages: all languages are constantly changing, and except in cases of very small speaker populations, there is always going to be wide variation across different varieties of a language (e.g. Labov 1966[28], Eckert and Rickford 2001[29]). This includes variation across different regions as well as variation associated with different social groups and social identities. Models trained on speech/text/sign from one specific population won’t necessarily work for others, even among speakers of what’s considered the same language.

The second concern is that models trained on running text will pick up biases from that text, based on how the authors of the text view and talk about the world (e.g. Bolukbasi et al 2016[30], Speer 2017[26:1]). In order to position ourselves to address the potential for harm raised by both of these cases, Batya Friedman and I (in Bender & Friedman 2018[31]) propose “data statements”, a practice for clearly documenting datasets used in NLP systems.[32] We suggest that all NLP systems should be accompanied by detailed information about the training data, including the specific language varieties involved, the curation rationale (how was the data chosen and why?), demographic information about the speakers and annotators, and more. This information alone won’t solve the problems of bias, of course, but it opens up the possibility of addressing them.
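To give a feel for what this might look like in practice, here is a rough sketch of a data statement attached to a hypothetical dataset as structured metadata. This is my own illustration, not the schema from Bender & Friedman 2018; every field name and value below is invented:

```python
# A hypothetical data statement, with fields following the elements
# mentioned above: language variety, curation rationale, and demographic
# information about speakers and annotators. All values are invented.
data_statement = {
    # Name the variety, not just the language.
    "language_variety": "English (en-US), informal written web-forum posts",
    # How and why the data was chosen.
    "curation_rationale": "Public forum threads sampled from 2015-2018 "
                          "to study question-answering style",
    # Who produced the language in the data.
    "speaker_demographics": "Self-reported: mostly US-based adults, ages 18-45",
    # Who produced the labels.
    "annotator_demographics": "Six university-student annotators, all native "
                              "speakers of US English",
}

def render(dataset_name: str, statement: dict) -> str:
    """Render a data statement as a short human-readable report."""
    lines = [f"Data statement for {dataset_name}:"]
    lines += [f"  {field}: {value}" for field, value in statement.items()]
    return "\n".join(lines)

print(render("HypotheticalForumQA", data_statement))
```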


  1. This list is produced impressionistically based on the range of literature discussing these languages. ↩︎

  2. Some language resources are only available under very restrictive licenses or at high cost, limiting the potential for research based on them. ↩︎

  3. The LRE Map (Calzolari et al 2012) is an initiative of ELRA, the European Language Resources Association, and is built up as authors submitting papers to participating conferences create entries for the language resources their papers are presenting or building on. ↩︎

  4. Each of these surveys had its own methodology and so the numbers aren’t directly comparable, but the overall trend is very clear. See the original sources for details on the methodology and for further information collected in each survey. The differing values for ACL 2008 arise because my counts reflect the percentage of studies that looked at a given language, whereas for Mielke, the denominator is papers. ↩︎

  5. Most work on Chinese languages doesn’t specify which Chinese language is concerned, especially if the variety is Mandarin. This table, like the underlying sources, just reports “Chinese”, but this is likely exclusively Mandarin. ↩︎

  6. Mielke, S. J. (2016). Language diversity in ACL 2004 - 2016. (Blog post, available at https://sjmielke.com/acl-language-diversity.htm, accessed 6 August 2019) ↩︎ ↩︎ ↩︎ ↩︎

  7. Bender, E. M. (2009, March). Linguistically naïve != language independent: Why NLP needs linguistic typology. In Proceedings of the EACL 2009 workshop on the interaction between linguistics and computational linguistics: Virtuous, vicious or vacuous? (pp. 26–32). Athens, Greece: Association for Computational Linguistics. Available from http://www.aclweb.org/anthology/W09-0106 ↩︎ ↩︎

  8. Bender, E. M. (2011). On achieving and evaluating language independence in NLP. Linguistic Issues in Language Technology, 6, 1–26. Available from http://journals.linguisticsociety.org/elanguage/lilt/article/download/2624/2624-5403-1-PB.pdf ↩︎ ↩︎

  9. Munro, R. (2015). Jungle light speed: Languages at ACL this year. (Blog post, available at http://www.junglelightspeed.com/languages-at-acl-this-year/, accessed July 25, 2019) ↩︎

  10. Ethnologue.com estimates 379 million first language (L1) speakers of English and 753 million second language (L2) speakers. For Mandarin, they estimate 918 million L1 and 199 million L2. ↩︎

  11. At the same time, it is worth remembering that not all applications of NLP are in fact beneficial and that already minoritized or marginalized populations are more likely to bear the brunt of negative impacts of e.g. surveillance technology built on NLP (see Grissom II 2019). In other words, there may be ways in which it is beneficial to be “left out” of progress in language technology. ↩︎

  12. Bamgbose, A. (2011). African languages today: The challenge of and prospects for empowerment under globalization. In Selected proceedings of the 40th annual conference on African linguistics (pp. 1–14). ↩︎

  13. There’s a whole other conversation to be had about the over-emphasis on leaderboards and chasing the state of the art in our field, to the neglect of careful analysis of what is and isn’t working and why. For thoughtful discussion of this, see Heinzerling 2019. ↩︎

  14. For a clear demonstration of this effect in the context of classifying medical SMSes, see Munro & Manning 2010. ↩︎

  15. This theme was taken up by the ACL 2019 workshop Typology for Polyglot NLP, organized by Haim Dubossarsky, Arya D. McCarthy, Edoardo M. Ponti, Ivan Vulić, and Ekaterina Vylomova. ↩︎

  16. On Facebook ↩︎

  17. On Twitter. ↩︎

  18. On Twitter. ↩︎

  19. On Twitter. ↩︎

  20. This isn’t nearly as obvious as one might think. For example, the English Penn Treebank, usually just called “the Penn Treebank” (Marcus et al 1993), is but one resource of its type: there are also the Penn Chinese Treebank (Xue et al 2005) and the Penn Arabic Treebank (Maamouri et al 2004). ↩︎

  21. Video of the talk is available online. ↩︎

  22. Handel, Z. (2019). Sinography: The borrowing and adaptation of the Chinese script. Leiden, Boston: Brill. ↩︎

  23. Even in English, the problem of tokenization is both more important and more subtle than is usually appreciated (Dridan and Oepen 2012). ↩︎

  24. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019, June). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers) (pp. 4171–4186). Minneapolis, Minnesota: Association for Computational Linguistics. Available from https://www.aclweb.org/anthology/N19-1423 ↩︎

  25. Hovy, D., & Spruit, S. L. (2016, August). The social impact of natural language processing. In Proceedings of the 54th annual meeting of the association for computational linguistics (volume 2: Short papers) (pp. 591–598). Berlin, Germany: Association for Computational Linguistics. Available from http://anthology.aclweb.org/P16-2096 ↩︎

  26. Speer, R. (2017). Conceptnet numberbatch 17.04: better, less-stereotyped word vectors. (Blog post, https://blog.conceptnet.io/2017/04/24/conceptnet-numberbatch-17-04-better-less-stereotyped-word-vectors/, accessed 6 July 2017) ↩︎ ↩︎

  27. Grissom II, A. (2019). Thinking about how NLP is used to serve power: Current and future trends. (Presentation at Widening NLP 2019; slides available at https://github.com/acgrissom/presentations/blob/master/winlp_tech_dom_marp.md, video available at https://www.livecongress.it/aol/index.php?id=1EA9D7D3) ↩︎

  28. Labov, W. (1966). The social stratification of English in New York City. Washington, DC: Center for Applied Linguistics. ↩︎

  29. Eckert, P., & Rickford, J. R. (Eds.). (2001). Style and sociolinguistic variation. Cambridge: Cambridge University Press. ↩︎

  30. Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems 29 (pp. 4349–4357). Curran Associates, Inc. Available from http://papers.nips.cc/paper/6228-man-is-to-computer-programmer-as-woman-is-to-homemaker-debiasing-word-embeddings.pdf ↩︎

  31. Bender, E. M., & Friedman, B. (2018). Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6, 587–604. Available from https://doi.org/10.1162/tacl_a_00041 ↩︎

  32. Other similar proposals, looking at machine learning more generally, include Datasheets (Gebru et al 2018), Nutrition Labels (Holland et al 2018), and Model Cards (Mitchell et al 2019). ↩︎





Author Bio
Emily M. Bender is the Howard and Frances Nostrand Endowed Professor in the Department of Linguistics and an Adjunct Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington. She is also the Faculty Director of the UW's Professional MS in Computational Linguistics (CLMS). Her research interests include multilingual grammar engineering, computational semantics, and ethics and NLP. She is the author of Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax (2013) and, with Alex Lascarides, of the forthcoming companion to that work, Linguistic Fundamentals for Natural Language Processing II: 100 Essentials from Semantics and Pragmatics.


Acknowledgments
I'd like to express my appreciation to Nathan Schneider, Yuval Pinter, Robert Munro, and Andrew Caines both for the original naming of the #BenderRule and for comments on this article. Thanks also to Hugh Zhang, Adithya Ganesh and Stanley Xie for the invitation to write it and further comments.


Citation
For attribution in academic contexts or books, please cite this work as

Emily M. Bender, "The #BenderRule: On Naming the Languages We Study and Why It Matters", The Gradient, 2019.

BibTeX citation:

@article{bender2019rule,
  author = {Bender, Emily M.},
  title = {The {\#}BenderRule: On Naming the Languages We Study and Why It Matters},
  journal = {The Gradient},
  year = {2019},
  howpublished = {\url{https://thegradient.pub/the-benderrule-on-naming-the-languages-we-study-and-why-it-matters/}},
}


If you enjoyed this piece and want to hear more, subscribe to the Gradient and follow us on Twitter.