The making of the CorDis Corpus: compilation and mark-up

VENUTI, MARCO;
2009-01-01

Abstract

The CorDis Corpus is a large multimodal, multigenre collection of political and media discourse on the 2003 Iraqi conflict. It was generated from subcorpora previously assembled by various research groups for diverse discourse-analytical purposes. A more detailed description of its composition can be found in the introduction. A significant portion of our work was devoted to turning the subcorpora into a unified, homogeneously encoded corpus that could be interrogated using Xaira. Initially the corpus was only lightly encoded by each research group on the basis of specific research objectives and hypotheses. The heterogeneity of the data, the specificity of the genres and the variety of methods adopted had led to a wide range of coding strategies for making textual and meta-textual information retrievable with the available concordance software. It was clear from the outset that marking up the corpus as a whole would entail various levels of pre-encoded and pre-existing interpretation. The main purpose of this paper is to show the process of standardization and integration whereby a loose collection of texts became a stable architecture. The TEI Guidelines proved a valid instrument, providing for a hierarchical organization of metadata which makes mark-up part and parcel of the corpus. We will underline that it is precisely the mark-up which gives the corpus a sound structure, favouring the replicability and enhancing the reliability of research. In discussing some examples we will deal with issues such as conformity and validity, and we will examine the constraints that the methodological framework adopted imposes on data handling. In particular, we will argue that the crucial role of annotation leads to a reconsideration of the definition of corpus itself, one that treats mark-up as the backbone of the corpus rather than a superimposed accessory.
Finally, the fact that mark-up involves a substantial amount of human intervention on machine-processed data has crucial implications for corpus-assisted discourse studies (CADS), since it permits the combination of qualitative and quantitative research approaches. There is a tendency to distinguish between ‘mark-up’ and ‘annotation’ (McEnery, Xiao, Tono 2006: 29), using the first term for contextual information (i.e. editorial and descriptive metadata) and the second for ‘interpretative linguistic information’. Here we use the two terms interchangeably, since for the purposes of our description both notions share the same salient qualities: both are added value and both carry interpretative information.
Year: 2009
ISBN: 9780415871372
Keywords: Mark-up; XML; TEI Guidelines; XAIRA; harmonization; consistency; flexibility; reliability; reusability
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11769/71159