STATISTICS IN CORPUS LINGUISTICS

Academic Year 2022/2023 - Teacher: MARCO VENUTI

Expected Learning Outcomes

In line with the Dublin descriptors, by the end of the course students will demonstrate:
1) Knowledge and understanding: Understanding basic concepts in Corpus Linguistics; understanding basic concepts in statistics in relation to quantitative approaches to language; knowing how to use the LancsBox software; knowing how to use the online Stats Tools.
2) Applying knowledge and understanding: Being able to identify and use the tools and statistical measures that best fit the quantitative analyses required.
3) Making judgements: Being able to collect, analyse and interpret textual data within a complex context, and to formulate efficient solutions, justifying the choices made.
4) Communication skills: Being able to explain and support the various stages of the analysis and its findings in a structured and clear way, making use of an appropriate register and style.
5) Learning skills: Starting from the knowledge, understanding, and competence developed in the module, being able to continue the graduate programme in a largely self-directed and autonomous way.

Course Structure

Lectures and practical sessions

Required Prerequisites

B2 competence in English; basic notions in linguistics

Attendance of Lessons

Compulsory attendance

Detailed Course Content

After introducing basic notions about corpora and corpus construction, the module and its continuous hands-on sessions will focus on the use of software for the statistical analysis of the English language, including: a) the analysis of lexis, with particular emphasis on collocation, keywords and lexical density; b) a lexico-grammatical approach to language description; c) the analysis of language variation, focusing on registers; d) sociolinguistic and stylistic studies; e) diachronic comparisons.
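As an illustration of the kind of collocation measure listed under point a), the following is a minimal Python sketch of a mutual information (MI) score. It is not the implementation used by LancsBox: the function name, the toy corpus and the window-size correction in the expected frequency are assumptions made for this example only.

```python
import math
from collections import Counter


def mutual_information(tokens, node, collocate, span=1):
    """Return an MI score for `collocate` occurring within `span` tokens
    to the left or right of `node` (hypothetical helper for illustration)."""
    corpus_size = len(tokens)
    freq = Counter(tokens)

    # Observed frequency: count the collocate inside each window around the node.
    observed = 0
    for i, token in enumerate(tokens):
        if token != node:
            continue
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        observed += window.count(collocate)

    if observed == 0:
        return float("-inf")  # the two words never co-occur in the window

    # Expected frequency under independence, scaled by window size
    # (assumption: 2 * span positions per node occurrence).
    expected = freq[node] * freq[collocate] * (2 * span) / corpus_size
    return math.log2(observed / expected)


# Toy corpus invented for this example.
toy_corpus = (
    "strong tea and strong coffee but powerful computers and strong tea"
).split()

mi_tea = mutual_information(toy_corpus, "strong", "tea")
mi_computers = mutual_information(toy_corpus, "strong", "computers")
print(round(mi_tea, 3), mi_computers)
```

A positive score suggests that the two words co-occur more often than chance would predict; on a corpus this small the value is only illustrative.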

Textbook Information

Brezina, Vaclav, 2018, Statistics in Corpus Linguistics: A Practical Guide, Cambridge University Press, pp. 296.
Baker, Paul, 2010, “Will Ms ever be as frequent as Mr? A corpus-based comparison of gendered terms across four diachronic corpora of British English”. In: Gender and Language, 4/1, pp. 125-149.
Teaching materials for the practical workshop on corpus querying software will be available on STUDIUM.
Please remember that, under art. 171 of Law no. 633 of 22 April 1941 and its amendments, it is illegal to copy entire books or journals; only 15% of their content may be copied.
For further information on sanctions and regulations concerning photocopying please refer to the regulations on copyright (Linee Guida sulla Gestione dei Diritti d’Autore) provided by AIDRO - Associazione Italiana per i Diritti di Riproduzione delle opere dell’ingegno (the Italian Association on Copyright).
All the books listed in the programs can be consulted in the Library.

Course Planning

No. | Subjects | Text References
1 | Lecture 1 introduces basic principles of statistical thinking that are necessary for informed application of statistical procedures to corpus data. It explains the role of statistics in scientific research in general and corpus linguistics in particular. Topics such as the creation of corpora, types of research design, basic statistical terminology, as well as data exploration and visualization will be discussed. | Brezina, Vaclav, 2018, Statistics in Corpus Linguistics, chapter 1
2 | Computer lab session with exercises and Lancaster Stats Tools online | Brezina, Vaclav, 2018, Statistics in Corpus Linguistics, chapter 1
3 | Lecture 2 introduces simple statistical measures that help describe the occurrence of words in texts and corpora. It focuses on word frequencies and distributions, both of which are crucial for a meaningful description of patterns of language use. | Brezina, Vaclav, 2018, Statistics in Corpus Linguistics, chapter 2
4 | Computer lab session with exercises and Lancaster Stats Tools online | Brezina, Vaclav, 2018, Statistics in Corpus Linguistics, chapter 2
5 | Lecture 3 explores meanings of words in context, which is an area important to both linguistic and social analyses. Topics discussed are collocations, keywords and manual coding of concordance lines; these play a key role both in the study of semantics (‘dictionary’ meanings of words) and in discourse analysis. | Brezina, Vaclav, 2018, Statistics in Corpus Linguistics, chapter 3
6 | Computer lab session with exercises and Lancaster Stats Tools online | Brezina, Vaclav, 2018, Statistics in Corpus Linguistics, chapter 3
7 | Lecture 4 focuses on the statistical analysis of lexico-grammatical features in language such as articles, passive constructions or modal expressions. The lecture shows how lexico-grammatical variation can be summarised using cross-tabulation and what statistical measures can be computed based on cross-tabulation summary tables. These measures range from simple percentages to the chi-squared test and logistic regression. | Brezina, Vaclav, 2018, Statistics in Corpus Linguistics, chapter 4
8 | Computer lab session with exercises and Lancaster Stats Tools online | Brezina, Vaclav, 2018, Statistics in Corpus Linguistics, chapter 4
9 | Lecture 5 discusses a group of methods that can be used for the simultaneous analysis of a large number of linguistic variables that characterise different texts and registers. First, we look at the relationship between two linguistic variables by means of correlation. Both Pearson’s and the non-parametric Spearman’s correlations are explained. Next, we explore the classification of words, texts, registers etc. using the technique of hierarchical agglomerative clustering. | Brezina, Vaclav, 2018, Statistics in Corpus Linguistics, chapter 5
10 | Computer lab session with exercises and Lancaster Stats Tools online | Brezina, Vaclav, 2018, Statistics in Corpus Linguistics, chapter 5
11 | Lecture 6 discusses different statistical procedures available for the analysis of stylistic and sociolinguistic variation in corpora. It reviews different approaches to variation, pointing out the common connection to the notion of ‘style’ understood as a particular way of speaking and using language. | Brezina, Vaclav, 2018, Statistics in Corpus Linguistics, chapter 6
12 | Computer lab session with exercises and Lancaster Stats Tools online | Brezina, Vaclav, 2018, Statistics in Corpus Linguistics, chapter 6
13 | Investigation of the frequency and context of usage of gender-marked language in four equal-sized and equivalently sampled corpora of British English | Baker, Paul, 2010, “Will Ms ever be as frequent as Mr?”
14 | Lecture 7 discusses statistical procedures that can be used to explore historical or diachronic data. First, specific features of diachronic studies are outlined and techniques that provide effective visualizations of diachronic change are introduced. Second, the lecture focuses on the statistical comparison of two time periods using a procedure called bootstrapping. Next, the diachronic application of cluster analysis is discussed. | Brezina, Vaclav, 2018, Statistics in Corpus Linguistics, chapter 7
15 | Computer lab session with exercises and Lancaster Stats Tools online | Brezina, Vaclav, 2018, Statistics in Corpus Linguistics, chapter 7
16 | Lecture 8 brings together the statistical knowledge discussed in this course. It then discusses the important topic of replication and introduces a statistical technique called meta-analysis, which provides a statistical (quantitative) summary of studies dealing with the same research question(s) (topic). Finally, common effect size measures are reviewed and a guide for their interpretation is provided. | Brezina, Vaclav, 2018, Statistics in Corpus Linguistics, chapter 8
17 | Computer lab session with exercises and Lancaster Stats Tools online | Brezina, Vaclav, 2018, Statistics in Corpus Linguistics, chapter 8
18 | Definition of students' research project |
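
As an illustration of the cross-tabulation measures covered in Lecture 4, the following minimal Python sketch computes row percentages and a Pearson chi-squared statistic for a 2x2 table. The counts, the register labels and the function name are invented for this example; in the course itself such calculations are carried out with Lancaster Stats Tools online.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a table given as a list of rows
    (hypothetical helper for illustration)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Invented counts: rows = two registers, columns = passive vs. active clauses.
counts = [
    [30, 70],  # register A
    [10, 90],  # register B
]

# Simple percentages: share of passives in each register.
passive_pct = [100 * row[0] / sum(row) for row in counts]

chi2 = chi_squared(counts)

# 3.841 is the critical value for 1 degree of freedom at p = 0.05.
significant = chi2 > 3.841
print(passive_pct, chi2, significant)
```

With these invented counts the statistic exceeds the critical value, so the difference in passive use between the two registers would be reported as statistically significant at the 0.05 level. An effect size measure should normally accompany such a result.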

Learning Assessment

Learning Assessment Procedures

- Practical assignment: students will build a corpus that they will query with the tools introduced in the module. Three in-session tasks will take place, one every six lessons; students will be assessed according to: a) criteria for corpus selection, b) corpus creation and tagging/annotation, c) appropriateness of the research question.
- Written assignment: at the end of the course, students will write a report (3,000-4,000 words) in which they present and discuss the results of the analysis carried out on their own corpora. Students will be assessed according to the detail of the analysis and the results, taking into account their choices in corpus compilation and analysis.
- Oral exam: discussion of the submitted report; elements that were missing or not clearly spelled out will be discussed.

Examples of frequently asked questions and/or exercises

Criteria for corpus compilation
Choice of most appropriate tools
Choice of most appropriate statistical tests
Research question
