Brian Roark's Homepage

Contact Info: Google, Inc., 555 SW Morrison St., Ste. 500, Portland, OR 97204; email: roarkbr AT SYMBOL g m a i l DOT c o m

Before joining Google as a research scientist in May 2013, I was a faculty member for 9 years in the Center for Spoken Language Understanding (CSLU) at the Oregon Health & Science University (OHSU) – part of what used to be the Oregon Graduate Institute (OGI). Before that, I was in the Speech Algorithms Department at AT&T Labs - Research from 2001 to 2004. I finished my Ph.D. in the Department of Cognitive and Linguistic Sciences at Brown University in 2001. There I was part of the Brown Laboratory for Linguistic Information Processing.

I am a computational linguist working on various topics in natural language processing. My research interests include: transliteration and text normalization; language identification; language modeling for automatic speech recognition, text entry and other applications; weighted transducers and grammars; supervised and unsupervised learning of language models; pronunciation modeling; text entry, accessibility and augmentative & alternative communication (AAC); syntactic parsing of text and speech; statistical models of human language processing; spoken language processing for diagnosis of neurodevelopmental and neurodegenerative disorders.

A few recent publications:

Adrian Benton, Alexander Gutkin, Christo Kirov and Brian Roark. 2026. Mining Naturally Romanized Seed Corpora without Romanizations. In Proceedings of LREC, pp. 2996–3012.
Brian Roark, Richard Sproat and Su-Youn Yoon. 2026. Tools of the Scribe: How Writing Systems, Technology, and Human Factors Interact To Affect the Act of Writing. Springer Nature, Cham, Switzerland. (See link to review in Nature below.)
Adrian Benton, Alexander Gutkin, Christo Kirov and Brian Roark. 2025. Improving Informally Romanized Language Identification. In Proceedings of EMNLP, pp. 2318–2336. preprint

Other publications; Google Scholar profile; Google Research page; Semantic Scholar page; CV.

Some recent-ish activities, links and/or resources:

Here's a nice review of Tools of the Scribe from January 2026 in the journal Nature from Andrew Robinson, in which he labels it "a stimulating and original, if technical, book". Aw, shucks...
I gave a talk on "Empirical methods in context-aware transliteration" at the Eugene Charniak Memorial Symposium at the CS Department of Brown University in November 2024.
I co-organized (w/Kyle Gorman, Emily Prud’hommeaux and Richard Sproat) the Second Workshop on Computation and Written Language (CAWL 2024), held in conjunction with LREC-COLING in Torino, Italy, on May 21, 2024. The workshop was sponsored by the newly formed ACL Special Interest Group on Writing Systems and Written Language (SIGWrit), which I helped establish in 2023. The previous year, we organized the First ACL Workshop on Computation and Written Language (CAWL), which was held at ACL 2023 in Toronto on July 14, 2023.
I was co-Editor-in-Chief of the Transactions of the Association for Computational Linguistics (TACL) from 2018 to 2022. The journal's 2021 impact factor was the topic of an MIT Press blog post.
Here's the site of the Dakshina dataset, an open-source collection of romanized and native script Wikipedia in 12 South Asian languages that I helped put together.
Here's a 2021 Google Research blog post about some work my team was involved in, transliterating geo entity names into Brahmic scripts. And here's an earlier (2017) post about some related work I contributed to, providing transliteration keyboards in 20+ South Asian languages.