STATISTICS IN CORPUS LINGUISTICS

Academic year 2022/2023 - Lecturer: MARCO VENUTI

Expected learning outcomes

According to the Dublin Descriptors, by the end of the course students are expected to acquire:
1) Knowledge and understanding: understand the basics of Corpus Linguistics; understand the basics of statistics as applied to the quantitative analysis of language; know how to use the LancsBox software; know how to use the online Stats tools.
2) Applying knowledge and understanding: identify and use the statistical tools and measures appropriate to different types of quantitative analysis, starting from the specific objectives identified.
3) Making judgements: demonstrate the ability to collect, analyse and interpret textual data from a complex context and to adopt the most efficient project solution, appropriately justifying all the choices required to carry out the project.
4) Communication skills: explain and argue clearly the different stages of the analysis carried out, using appropriate terminology.
5) Learning skills: through the concepts, notions and skills acquired during the course, undertake further studies with a high degree of autonomy.

Teaching methods

Lectures and practical sessions

Prerequisites

English at B2 level; basic notions of general linguistics, philosophy of language, or semiotics

Attendance

Attendance is compulsory

Course content

After an introduction to corpora and to the basic notions of corpus creation, the course will introduce, through regular practical sessions, the use of software for the statistical analysis of the English language with regard to: a) lexical analysis, through the concepts of collocation, keywords and lexical density; b) a lexico-grammatical approach to linguistic description; c) the analysis of variation in terms of linguistic register; d) sociolinguistic and stylistic studies; e) diachronic comparisons.
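As a rough illustration of one of the lexical measures named above, the sketch below estimates lexical density as the share of tokens that are not function words. It is not taken from the course materials: the function names, the tiny function-word list and the regular-expression tokenizer are all assumptions made for the example, and a real study would normally rely on part-of-speech tagging instead.

```python
import re

# Hypothetical, deliberately tiny list of English function words (an assumption
# for illustration only; real studies usually use part-of-speech tagging).
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "in", "on", "at", "to",
    "is", "are", "was", "were", "be", "it", "this", "that", "with", "for",
}


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters as tokens.

    Note: this crude tokenizer splits forms such as "don't" into "don" and "t".
    """
    return re.findall(r"[a-z]+", text.lower())


def lexical_density(text):
    """Share of tokens that are not in FUNCTION_WORDS (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    content_tokens = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content_tokens) / len(tokens)


# Six tokens, three of which ("cat", "sat", "mat") are not function words.
print(lexical_density("The cat sat on the mat"))  # prints: 0.5
```

The value depends heavily on how tokens and function words are defined, so figures are only comparable across texts when the same definitions are applied throughout.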

Reference texts

Brezina, Vaclav, 2018, Statistics in Corpus Linguistics. A Practical Guide, Cambridge University Press, pp. 296.
Baker, Paul, 2010, “Will Ms ever be as frequent as Mr? A corpus-based comparison of gendered terms across four diachronic corpora of British English”. In: Gender and Language, 4/1, pp. 125-149.
Materials for the practical exercises with corpus query software will be made available via STUDIUM.

Please note that, under Article 171 of Law no. 633 of 22 April 1941 and subsequent provisions, photocopying books in print in excess of 15% of the volume or journal issue is a criminal offence.
For further information on the restrictions and penalties concerning the unlawful use of photocopies, see the guidelines on copyright management in universities (Linee guida sulla gestione dei diritti d’autore nelle università), edited by the Associazione Italiana per i Diritti di Riproduzione delle opere dell’ingegno (AIDRO).
The reference texts can be consulted in the Library.

Course schedule

	Topics	Text references
1	Lecture 1 introduces basic principles of statistical thinking that are necessary for informed application of statistical procedures to corpus data. It explains the role of statistics in scientific research in general and corpus linguistics in particular. Topics such as the creation of corpora, types of research design, basic statistical terminology, as well as data exploration and visualization will be discussed.	Brezina, Vaclav 2018 Statistics in Corpus Linguistics, chapter 1
2	Computer lab session with exercises and Lancaster Stats Tools online	Brezina, Vaclav 2018 Statistics in Corpus Linguistics, chapter 1
3	Lecture 2 introduces simple statistical measures that help describe the occurrence of words in texts and corpora. It focuses on word frequencies and distributions, both of which are crucial for a meaningful description of patterns of language use.	Brezina, Vaclav 2018 Statistics in Corpus Linguistics, chapter 2
4	Computer lab session with exercises and Lancaster Stats Tools online	Brezina, Vaclav 2018 Statistics in Corpus Linguistics, chapter 2
5	Lecture 3 explores meanings of words in context, which is an area important to both linguistic and social analyses. Topics discussed are collocations, keywords and manual coding of concordance lines; these play a key role both in the study of semantics (‘dictionary’ meanings of words) and in discourse analysis.	Brezina, Vaclav 2018 Statistics in Corpus Linguistics, chapter 3
6	Computer lab session with exercises and Lancaster Stats Tools online	Brezina, Vaclav 2018 Statistics in Corpus Linguistics, chapter 3
7	Lecture 4 focuses on the statistical analysis of lexico-grammatical features in language such as articles, passive constructions or modal expressions. The lecture shows how lexico-grammatical variation can be summarised using cross-tabulation and what statistical measures can be computed based on cross-tabulation summary tables. These measures range from simple percentages to the chi-squared test and logistic regression.	Brezina, Vaclav 2018 Statistics in Corpus Linguistics, chapter 4
8	Computer lab session with exercises and Lancaster Stats Tools online	Brezina, Vaclav 2018 Statistics in Corpus Linguistics, chapter 4
9	Lecture 5 discusses a group of methods that can be used for the simultaneous analysis of a large number of linguistic variables that characterise different texts and registers. First, we look at the relationship between two linguistic variables by means of correlation. Both Pearson’s and the non-parametric Spearman’s correlations are explained. Next, we explore the classification of words, texts, registers etc. using the technique of hierarchical agglomerative clustering.	Brezina, Vaclav 2018 Statistics in Corpus Linguistics, chapter 5
10	Computer lab session with exercises and Lancaster Stats Tools online	Brezina, Vaclav 2018 Statistics in Corpus Linguistics, chapter 5
11	Lecture 6 discusses different statistical procedures available for the analysis of stylistic and sociolinguistic variation in corpora. It reviews different approaches to variation, pointing out the common connection to the notion of ‘style’ understood as a particular way of speaking and using language.	Brezina, Vaclav 2018 Statistics in Corpus Linguistics, chapter 6
12	Computer lab session with exercises and Lancaster Stats Tools online	Brezina, Vaclav 2018 Statistics in Corpus Linguistics, chapter 6
13	Investigation of the frequency and context of use of gender-marked language in four equal-sized and equivalently sampled corpora of British English	Baker, Paul, 2010, “Will Ms ever be as frequent as Mr?”
14	Lecture 7 discusses statistical procedures that can be used to explore historical or diachronic data. First, specific features of diachronic studies are outlined and techniques that provide effective visualizations of diachronic change are introduced. Second, the lecture focuses on the statistical comparison of two time periods using a procedure called bootstrapping. Next, the diachronic application of cluster analysis is discussed.	Brezina, Vaclav 2018 Statistics in Corpus Linguistics, chapter 7
15	Computer lab session with exercises and Lancaster Stats Tools online	Brezina, Vaclav 2018 Statistics in Corpus Linguistics, chapter 7
16	Lecture 8 brings together the statistical knowledge discussed in this course. It then discusses the important topic of replication and introduces a statistical technique called meta-analysis, which provides a statistical (quantitative) summary of studies dealing with the same research question(s) or topic. Finally, common effect size measures are reviewed and a guide for their interpretation is provided.	Brezina, Vaclav 2018 Statistics in Corpus Linguistics, chapter 8
17	Computer lab session with exercises and Lancaster Stats Tools online	Brezina, Vaclav 2018 Statistics in Corpus Linguistics, chapter 8
18	Definition of students' research project
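
To give a flavour of the cross-tabulation measures listed for Lecture 4, the following minimal sketch computes the chi-squared statistic for a table of observed counts. It is an illustration under stated assumptions, not the procedure used by the course software: the function names are hypothetical, the counts are invented for the example, and the p-value helper is valid only for tables with one degree of freedom (2x2).

```python
import math


def chi_squared(table):
    """Return (statistic, degrees_of_freedom) for a table of observed counts.

    `table` is a list of rows, each a list of non-negative counts.
    Assumes every row total and column total is greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def p_value_df1(statistic):
    """Upper-tail p-value, valid only for 1 degree of freedom (2x2 tables)."""
    return math.erfc(math.sqrt(statistic / 2))


# Invented counts for illustration only (not real corpus data):
# rows = two registers, columns = passive vs. active clauses.
observed = [[30, 70],
            [10, 90]]
stat, dof = chi_squared(observed)
print(round(stat, 2), dof, p_value_df1(stat) < 0.05)  # prints: 12.5 1 True
```

With these made-up counts every expected cell is either 20 or 80, so the statistic can be checked by hand: 5 + 1.25 + 5 + 1.25 = 12.5.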

Assessment

Assessment methods

- Practical test: before the end of the course, students will build a corpus to be queried with the tools learned in class. In three in-course tests, each at the end of a cycle of 6 lessons, the following will be assessed: a) the text selection criteria, b) the collection of the textual data and the accuracy and, where applicable, the level of detail of the annotation, c) the identification of a valid research hypothesis.
- Written test: at the end of the course, students submit a report (3,000-4,000 words) presenting and discussing the results of the analysis carried out on the corpus they built. The level of detail of the analysis and its results will be assessed, also in light of the choices made when building the corpus.
- Oral test: discussion of the submitted report; elements that did not emerge, or were not sufficiently clear, in the report will be assessed.

The final assessment will take into account the candidate's command of the content and of the skills acquired, linguistic accuracy and appropriate use of vocabulary, and the argumentative ability demonstrated.

Examples of frequent questions and/or exercises

Corpus construction criteria
Choice of the most appropriate tool
Choice of the most appropriate statistical approach
Research hypothesis

Master's degree programme in

Scienze del testo per le professioni digitali