QLVL-bibliography-presentations_abstracts


This publication list was last updated on 16-06-2015.
Noreillie, Ann-Sophie; Kestemont, Britta; Peters, Elke; Desmet, Piet; Heylen, Kris (2015)
VALILEX - Theoretical and empirical validation of lexical competence in English and French within the Common European Framework of Reference
The language-neutral Common European Framework of Reference for Languages (CEFR) (Council of Europe, 2001) is probably one of the most influential documents for educational language policy in Europe. It describes language tasks and linguistic competences for any European language and links them to six levels of language proficiency (from A1 to C2). However, the CEFR and its language-specific Reference Level Descriptions (RLDs) have been criticized for lacking a theoretical and empirical basis (Alderson, 2007; Hulstijn, 2014; Kuiken et al., 2010). Therefore, two PhD projects – one focusing on English, the other on French – aim to contribute to an empirical and theoretical validation of the CEFR by combining a corpus-based and an expert-judged approach (Bardel et al., 2012) and by taking into account Hulstijn's theory of language proficiency (2011). Specifically, we will first determine the shared vocabulary for listening and speaking among native speakers for communicative settings described by the A1 and B1 CEFR levels. Next, we aim to determine the lexical competence (= vocabulary size and lexical items) needed for learners of English and French to successfully perform listening and speaking tasks at these two CEFR levels. Finally, we aim to determine whether the communicative activities at these two levels share a common vocabulary in two typologically different languages.

Tummers, José; Deveneyns, Annelies (2014)
Lexicale rijkdom in het professioneel hoger onderwijs: Aanzet tot sociolinguïstische staalkaart
In this contribution, we study the lexical richness of Dutch texts written by students in professional higher education. The observed differences are then related to sociodemographic characteristics of the students. At KHLeuven, a sample of 350 first-year students was asked to write an argumentative text of 500 words in Dutch in which they convince the reader of their view on social media. Participants had one hour and were allowed to use a computer and any aids they deemed necessary. The students were selected by means of a cluster sample across the 13 professional bachelor programmes at KHLeuven (ranging from teacher training, social work and nursing to informatics and business studies). The analysis starts from the following research questions: To what extent are differences in lexical richness in the students' texts linked to their sociodemographic profile? Are there differences in lexical richness between the various word classes, in particular between verbs, nouns, adjectives and function words? For each text, two measures of lexical richness are computed: the type-token ratio to measure lexical variation and the inverse document frequency to measure the specificity of the vocabulary used. These measures are then linked to the following characteristics of the student and of the programme he or she is enrolled in: student characteristics: gender, secondary-education track, prior history in higher education, age at obtaining the secondary-education diploma, mother's educational level, home language; programme characteristics: programme, number of ECTS credits of (Dutch) communication in the first year of the programme at KHLeuven.
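A minimal sketch of the two lexical richness measures described above (illustrative only, not the authors' code; the toy texts are invented):

import math
from collections import Counter
def type_token_ratio(tokens):
    # Lexical variation: number of distinct word types divided by number of tokens.
    return len(set(tokens)) / len(tokens) if tokens else 0.0
def mean_idf(tokens, doc_freq, n_docs):
    # Lexical specificity: average inverse document frequency of the tokens,
    # where doc_freq maps a word to the number of texts containing it.
    idfs = [math.log(n_docs / doc_freq.get(t, 1)) for t in tokens]
    return sum(idfs) / len(idfs) if idfs else 0.0
texts = [
    "sociale media zijn een zegen voor studenten".split(),
    "sociale media zijn een vloek voor de maatschappij".split(),
    "studenten gebruiken media vooral om te communiceren".split(),
]
doc_freq = Counter(word for text in texts for word in set(text))
for i, text in enumerate(texts):
    print(i, round(type_token_ratio(text), 2), round(mean_idf(text, doc_freq, len(texts)), 2))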

Steurs, Frieda (2014)
Medische terminologie: uitdagingen voor medische professionals en tolken
This presentation addresses a number of challenges in the use and the correct translation or interpreting of medical language for specific purposes. This specialised language is characterised by a very high level of abstraction, with reference to clinically complex concepts. Medical terminology is therefore typified by compound terms, often with borrowings from Greek and Latin. The different registers within medical communication, however, make it necessary to consider the terminology at different levels of abstraction as well, depending on the target audience. In this respect, the concept of 'health literacy' is of interest: to what extent do patients have health literacy skills? These skills are of a cognitive and social nature. In other words: does the patient understand the specialised communication? Should the physician or medical caregiver switch to a different register? In the multilingual situation, when an interpreter is involved, the same question applies: how scientific can or may the medical communication be, and to what extent should there be room for explaining certain difficult terms? Adequate medical terminology databases will therefore take into account the different registers that can be used.

Steurs, Frieda (2014)
Machine Translation as a tool
In the battle between man and machine, man seems to be on the losing side. Since Henry Ford invented the assembly line, machines rule in most fields where stamina, brute force or precision are required. Luckily, more intellectual challenges were still won by man. Until Deep Blue beat Kasparov at chess and Watson won at Jeopardy… Now the machines are at the gate of one of man's final strongholds: natural language. Fight or flee! Or should we consider opening the gate and co-existing peacefully? More than ever, Machine Translation (MT) is a controversial topic. Translators without thorough knowledge of MT often feel threatened. All too often, this leads to emotional pleas to save their trade rather than a rational discussion based on economic factors. This workshop is an attempt to rationalise the debate: we consider MT to be just another tool, just like computers, electronic dictionaries and translation memories. We will set up a small-scale experiment to see whether MT increases productivity. A translation assignment is split into three parts: the first part is translated from scratch, the second part with output from Google Translate and the third part with a custom MT engine. This workshop is based on an experiment at KU Leuven in which postgraduate students (http://www.arts.kuleuven.be/home/opleidingen/manamas/emt/index) translated a similar (larger) project. Time and quality of the three parts were compared. The results will be shown during the workshop.

Kockaert, Hendrik; Scarpa, Federica; Segers, Winibert; Steurs, Frieda (2014)
Qualetra: The implications of the transposition of Directive 64/2010 for the training and assessment of legal translators and practitioners
Between 2000 and 2007, the EU registered significant increases in criminal proceedings involving a non-national (± 10%), resulting in rising costs of translation. According to estimates made by DG Justice in the Impact Assessment for the Directive on the right to interpretation and translation in criminal proceedings {COM(2009) 338 final}{SEC(2009) 916, pp. 18-19}, the need for fair and cost-efficient legal translations will increase significantly. Qualetra aims at anticipating some serious challenges EU Member States will have to deal with after the transposition of Directive 2010/64/EU by proposing deliverables that are expected to cater for the training and assessment needs experienced by legal translators specialising in the translation of European Arrest Warrants and by legal practitioners working with translators. With an EU-wide format of online training and testing, the project contributes towards facilitating transparent, cost-efficient criminal proceedings in EU courts, guaranteeing the rights of suspected and accused persons as stipulated in Directive 2010/64/EU. In practice, the project focuses on developing translation memories and multilingual terminology relevant for the translation of European Arrest Warrants, and on developing EU-wide training programmes and testing procedures for legal translators and practitioners. This will have a positive impact on the training of legal translators and practitioners because they will have to interact efficiently with beneficiaries of legal translation services such as police, prosecutors, court staff, judges, lawyers and professionals providing victim support. This paper focuses in particular on the training and assessment work streams of the project, and will present research-based core curricula and training materials for legal translators and legal practitioners, testing, evaluation and assessment procedures, and materials for professionals in specific working conditions related to the translation of European Arrest Warrants and Essential Documents.

Zenner, Eline; Speelman, Dirk; Geeraerts, Dirk (2014)
A sociolinguistic analysis of borrowing in weak contact situations: English loanwords and phrases in expressive utterances in a Dutch reality TV show
In this paper, we present a quantitative corpus-based variationist analysis of the English insertions used by Belgian Dutch and Netherlandic Dutch participants in the reality TV show Expeditie Robinson. The data consist of manual transcriptions of 35 hours of recordings for 46 speakers from 3 seasons of the show. Focusing on the expressive utterances in the corpus, we present a mixed-effects logistic regression analysis to determine which of a variety of speaker-related and context-related features can help explain the occurrence of pragmatic English insertions (such as shit, oh my God) in Dutch. Results show a strong impact of typical variationist variables such as gender, age and location, but more situational features like emotional charge and topic of the conversation also prove relevant. Overall, in its combined focus on (a) oral corpora of spontaneous language use, (b) social patterns in the use of English, and (c) inferential statistical modeling, this paper presents new perspectives on the study of anglicisms in weak contact settings.
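A hedged sketch of the kind of mixed-effects logistic regression described above (the authors more plausibly used R's lme4; shown here as a Python approximation via statsmodels' Bayesian mixed GLM, with a hypothetical data file and column names):

import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM
# One row per expressive utterance; 'english_insertion' is 1 if the utterance
# contains a pragmatic English insertion, 0 otherwise (columns are assumptions).
df = pd.read_csv("expeditie_robinson_utterances.csv")
model = BinomialBayesMixedGLM.from_formula(
    "english_insertion ~ gender + age + location + emotional_charge + topic",
    vc_formulas={"speaker": "0 + C(speaker)"},  # random intercept per speaker
    data=df,
)
result = model.fit_vb()  # variational Bayes estimation
print(result.summary())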

Szmrecsanyi, Benedikt (2014)
Typological profiling: analyticity versus syntheticity between Middle English and Present-Day English
No one interested in typological change in the history of English will manage to avoid the terms ANALYTIC and SYNTHETIC, terminology that goes back to August Wilhelm von Schlegel (Schlegel 1818). The textbook view is that English is supposed to have changed from a rather synthetic language – i.e. one that relies heavily on inflections to code grammatical information – in Old English times into a rather analytic language that draws on word order and function words to convey grammatical information. The wholesale loss of nominal and verbal inflections that started towards the end of the Old English period, so the textbook story goes, has set in motion a long-term drift towards analyticity that is still in operation today. By way of a reality check, we adopt terminology, concepts, and ideas developed in quantitative morphological typology (cf. Greenberg 1960, Szmrecsanyi 2009) to empirically investigate the coding of grammatical information in English diachrony. Specifically, we utilize a quantitative, language-internal measure of OVERT GRAMMATICAL ANALYTICITY, defined as the text frequency of free grammatical markers, and a measure of OVERT GRAMMATICAL SYNTHETICITY, defined as the text frequency of bound grammatical markers. We subsequently apply these measures to the Penn Parsed Corpora of Historical English series, which covers the period between circa AD 1100 and AD 1900, and demonstrate that this time slice does not, in fact, exhibit a steady drift from synthetic to analytic. Rather, analyticity was on the rise until the end of the Early Modern English period, but declined subsequently; the reverse is true for syntheticity. That said, the historical variability in English in all the historical periods we investigate is not particularly dramatic. Compared to languages like Italian, German, Bulgarian and Russian, English scores consistently low on syntheticity in all these periods. An analysis of frequency fluctuation in individual markers further reveals that while in the big picture, twentieth-century English is quantitatively almost back to the analyticity-syntheticity coordinates defining twelfth-century English, modern analyticity and syntheticity seem qualitatively different from their Early English counterparts.
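A minimal sketch of how such analyticity and syntheticity indices can be computed (an illustration under assumed annotations, not the paper's implementation; the tag set and toy sample are invented):

FUNCTION_TAGS = {"DET", "PREP", "CONJ", "PRON", "AUX"}  # free grammatical markers
def analyticity_index(tagged_tokens, per=1000):
    # Text frequency of free grammatical markers (function words) per `per` words.
    free = sum(1 for _, pos, _ in tagged_tokens if pos in FUNCTION_TAGS)
    return per * free / len(tagged_tokens)
def syntheticity_index(tagged_tokens, per=1000):
    # Text frequency of bound grammatical markers (overt inflections) per `per` words.
    bound = sum(n_infl for _, _, n_infl in tagged_tokens)
    return per * bound / len(tagged_tokens)
# Toy sample: (word, POS tag, number of overt inflectional markers on the word).
sample = [("the", "DET", 0), ("king", "N", 0), ("ruled", "V", 1),
          ("over", "PREP", 0), ("many", "DET", 0), ("lands", "N", 1)]
print(analyticity_index(sample), syntheticity_index(sample))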

Tummers, José; Janssens, Kim (2014)
Rethinking analyses of crossed effects experiments in marketing communications research
Although Repeated Measures ANOVA (RM ANOVA) is often used to analyze experimental designs, this method does not suffice to describe all variance in a crossed effects experiment: responses are generated by the same subjects and, at the same time, those responses are collected for the same stimuli, which compromises the independence of the results. We address this methodological concern by fitting a mixed-effects model to reanalyze the outcomes of an experiment in which an RM ANOVA was used to analyze the impact of a condition and a treatment factor on the recall of products displayed for a short time on a computer screen, with the within-subject variance as a random effect (Janssens et al., 2011). Although there was no major impact on the fixed effects (the interaction between the experimental condition and the treatment remained significant), the mixed-effects model with two random-effect terms outperforms an RM ANOVA with only one random-effect term for subject. It significantly reduces the overall variance and significantly improves the predictive power of the model, measured by the index of concordance. Additionally, the intra-class correlation reveals that the random-effect term for the stimuli explains 49.14% of the variance compared to only 7.93% for the subjects.
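A small sketch of the intra-class correlation computation behind the figures reported above (the variance components are made-up values chosen only to illustrate the arithmetic, not the study's estimates):

def icc(component, components):
    # Share of the total variance attributed to one variance component.
    return component / sum(components)
var_stimuli, var_subjects, var_residual = 2.0, 0.3, 1.8  # hypothetical estimates
components = [var_stimuli, var_subjects, var_residual]
print(f"ICC stimuli:  {icc(var_stimuli, components):.2%}")
print(f"ICC subjects: {icc(var_subjects, components):.2%}")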

Heylen, Kris; Bond, Stephen; De Hertog, Dirk; Kockaert, Hendrik J.; Steurs, Frieda; Vulić, Ivan (2014)
TermWise: Leveraging Big Data for Terminological Support in Legal Translation
Increasingly, large bilingual document collections are being made available online, especially in the legal domain. This type of Big Data is a valuable resource that specialized translators exploit to search for informative examples of how domain-specific expressions should be translated. However, general-purpose search engines are not optimized to retrieve previous translations that are maximally relevant to a translator. In this paper, we report on the TermWise project, a cooperation of terminologists, corpus linguists and computer scientists that aims to leverage big online translation data for terminological support to legal translators at the Belgian Federal Ministry of Justice. The project developed dedicated knowledge extraction algorithms and a server-based tool to provide translators with the most relevant previous translations of domain-specific expressions relative to the current translation assignment. In the paper, we give an overview of the system, give a demo of the user interface and then discuss, more generally, the possibilities of mining big data to support specialized translation.

Tummers, José; Speelman, Dirk; Heylen, Kris; Geeraerts, Dirk (2014)
Beyond the textual company of words: What corpus settings tell us about lexical collocability
The study of lexical collocations occupies a central position in corpus linguistic research. Lexical restrictions on a word's combinatorial possibilities are often an integral part of corpus linguistic analyses and are applied in various domains (e.g. lexicography, language teaching). However, if a corpus is considered a sample of spontaneously realized language use by a linguistic community in (a) given setting(s), it is rather surprising that the settings of actual language use have received little attention in traditional corpus linguistics. In this contribution, we will focus on the impact of the usage settings on the linguistic properties of the language use in a corpus. We will investigate whether lexical collocability is subject to extra-linguistic constraints. Based on a variational case study, viz. the inflectional variation of attributive adjectives in Dutch, it will be demonstrated that the collocational strength of the adjective-noun (AN) pair is significantly modified by register, region and their interaction. Furthermore, the impact of the collocational strength of the AN pair on the adjectival inflection is constrained by register, region and their interaction, as well as by individual speakers' idiolectal language properties. Based on those results, we will argue for a systematic integration of usage settings in corpus linguistic research.
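A small sketch of how collocational strength could be computed separately per usage setting (the association measure, pointwise mutual information, is an assumption here, as are the subcorpus names and counts):

import math
def pmi(pair_freq, adj_freq, noun_freq, corpus_size):
    # PMI of an adjective-noun pair from raw frequencies in one subcorpus.
    p_pair = pair_freq / corpus_size
    p_adj, p_noun = adj_freq / corpus_size, noun_freq / corpus_size
    return math.log2(p_pair / (p_adj * p_noun))
# Hypothetical counts for one AN pair in two register/region subcorpora.
subcorpora = {
    "newspaper-BE": dict(pair_freq=120, adj_freq=2400, noun_freq=5100, corpus_size=10_000_000),
    "internet-NL": dict(pair_freq=15, adj_freq=900, noun_freq=2200, corpus_size=4_000_000),
}
for name, counts in subcorpora.items():
    print(name, round(pmi(**counts), 2))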

Heylen, Kris; Bond, Stephen; De Hertog, Dirk; Vulić, Ivan; Kockaert, Hendrik J. (2014)
TermWise: A CAT-tool with Context-Sensitive Terminological Support
In this paper and the accompanying poster and demo, we present TermWise, a Computer Assisted Translation (CAT) tool that offers additional terminological support for domain-specific translations. Compared to existing CAT-tools, TermWise has an extra database, a Term&Phrase Memory, that provides context-sensitive suggestions of translations for individual terms and domain-specific expressions. The Term&Phrase Memory has been compiled by applying newly developed statistical knowledge acquisition algorithms to large parallel corpora. Although these algorithms are language- and domain-independent, the tool was developed in a project with translators from the Belgian Federal Justice Department (FOD Justitie/SPF Justice) as end-user partners. Therefore the tool is demonstrated in a case study of bidirectional Dutch-French translation in the legal domain. In this paper, we first describe the specific needs that our end-user group expressed and how we translated them into the new Term Memory functionality. Next, we summarize the term extraction and term alignment algorithms that were developed to compile the Term Memory from large parallel corpora. Section 4 describes how the Term&Phrase Memory functions as a server database that is integrated with a CAT user-interface to provide context-sensitive terminological support. Section 5 concludes with a short description of the evaluation scheme.

Heylen, Kris; Bond, Stephen; De Hertog, Dirk; Vulić, Ivan; Kockaert, Hendrik J. (2014)
TermWise: A Computer Assisted Translation Tool with Context-Sensitive Terminological Support
This poster with demo presents TermWise, a prototype for a Computer Assisted Translation (CAT) tool that offers additional terminological support for domain-specific translations. Compared to existing CAT-tools, TermWise has an extra database, a Term&Phrase Memory, that provides context-sensitive suggestions of translations for individual terms and domain-specific expressions. The Term&Phrase Memory has been compiled by applying newly developed statistical knowledge acquisition algorithms to large parallel corpora. Although the algorithms are language- and domain-independent, the tool was developed in a project with translators from the Belgian Federal Justice Department (FOD Justitie/SPF Justice) as end-user partners. Therefore the tool is demonstrated in a case study of bidirectional Dutch-French translation in the legal domain. On the poster, we first describe the specific needs that our end-user group expressed and how we translated them into the new Term&Phrase Memory functionality. Next, we summarize the term extraction and term alignment algorithms that were developed to compile the Term&Phrase Memory from large parallel corpora. In our case study, we worked on the official Belgian Journal (Belgisch Staatsblad/Moniteur Belge), which is available online. The poster then describes the server-client architecture that integrates the Term&Phrase Memory's server database with a CAT user-interface to provide context-sensitive terminological support. In conclusion we also discuss the evaluation scheme that was set up with two end-user groups, viz. students of Translation Studies at KU Leuven (campus Antwerp) and the professional translators at the Belgian Federal Justice Department. The demo will show the use of the TermWise tool for the translation of Belgian legal documents from Dutch to French and vice versa.

Wielfaert, Thomas; Heylen, Kris; Daems, Jocelyne; Speelman, Dirk; Geeraerts, Dirk (2014)
Towards a Lexicologically Informed Parameter Evaluation of Distributional Modelling in Lexical Semantics
Distributional models of semantics have become the mainstay of large-scale modelling of word meaning in statistical NLP (see Turney and Pantel 2010 for an overview). In a Word Sense Disambiguation task, identifying semantic structure is usually seen as a clustering problem where occurrences of a polysemous word have to be assigned to the 'correct' sense. As linguists, however, we are not interested solely in performance evaluation against some gold standard; rather, we want to investigate the precise relation between a word's distributional behaviour and its meaning. Given that distributional models are extremely parameter-rich, we want to assess how well and in which way a specific model can capture a lexicological description of semantic structure. In this presentation, we discuss three tools we are developing for a lexicological assessment of distributional models. Firstly, we are creating our own lexicologically informed 'gold standard' of disambiguated noun occurrences, based on the ANW (Algemeen Nederlands Woordenboek) and a random sample from two large-scale Belgian (1.3G) and Netherlandic (500M) Dutch newspaper corpora. Secondly, we are developing a visualisation tool to analyse the impact of parameter settings on the semantic structure captured by a distributional model. Thirdly, we have adapted a clustering quality measure (McClain & Rao 1975) to assess how well a manual disambiguation is captured by a distributional model, independently of a specific clustering algorithm. Similar to Lapesa and Evert's (2013) parameter sweep for a type-level model on semantic priming data, we are striving towards a large-scale parameter evaluation for token-level models on sense-annotated occurrences.
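A sketch of the kind of clustering quality measure mentioned above (assuming the McClain & Rao index as the ratio of mean within-cluster to mean between-cluster distance, computed here over hypothetical token vectors with cosine distance; lower values indicate that the manual sense annotation is better separated in the model):

import numpy as np
from scipy.spatial.distance import cosine
def mcclain_rao(vectors, labels):
    # Mean within-cluster distance divided by mean between-cluster distance.
    within, between = [], []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            d = cosine(vectors[i], vectors[j])
            (within if labels[i] == labels[j] else between).append(d)
    return float(np.mean(within) / np.mean(between))
# Toy token vectors for six occurrences manually annotated with two senses.
rng = np.random.default_rng(0)
tokens = np.vstack([rng.normal(0, 1, (3, 50)), rng.normal(3, 1, (3, 50))])
senses = ["sense_a"] * 3 + ["sense_b"] * 3
print(round(mcclain_rao(tokens, senses), 3))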

Szmrecsanyi, Benedikt (2013)
On linguistic complexity (in varieties of English and beyond)
Linguistic complexity is one of the currently most hotly debated notions in linguistics. Questions addressed in the literature include: Do complexity differentials exist at all, or are all languages (or language varieties) equally complex? How can we measure linguistic complexity? Are there complexity trade-offs between different linguistic levels and/or subsystems? Against this backdrop, the paper sketches ways to measure, and explain, morphosyntactic complexity variance in varieties of English and beyond. We will specifically explore (1) ad-hoc complexity ratings of survey features, (2) analyticity-syntheticity variation within and across varieties and languages, and (3) an information-theoretic measure of linguistic complexity.

van der Vliet, Hennie; Wermuth, Cornelia; Oosterhof, Albert (2013)
Experts over terminologie: waar het naartoe moet (Enquête NL-Term)
The association NL-Term aims to represent the interests of professional users of Dutch-language terminology. In order to do so as well as possible, the association has started an investigation into its target group and their needs. At the TiNT-day 2012, a survey was conducted among the attendees. The results of the survey are discussed in Hennie van der Vliet's article in the TiNT volume that will be distributed at the TiNT-day 2013. In addition, several interviewers (Cornelia Wermuth, Albert Oosterhof and Hennie van der Vliet) consulted some twenty professionals selected by the association. These people are well regarded in their field and are strongly involved in terminology. In this presentation, two of the interviewers report on their findings.

Heylen, Kris; Wielfaert, Thomas; Speelman, Dirk (2013)
Mapping Semantic Space in Comparable Corpora. Token-level semantic vector spaces as an analysis tool for lexical variation.
Conceptual space can be carved up linguistically in different ways. The mapping between a set of related concepts and a set of forms need not be one-to-one and can differ both between varieties of the same language and between different languages. Recently, a number of studies have combined quantitative corpus analysis with visualization techniques to study form-meaning mappings on the exemplar level, both cross-linguistically and within one language: Wälchli (2010) used distributional similarity in parallel corpora and Multi-Dimensional Scaling to visualize how the exemplars of local phrase markers divide up the semantic space between themselves in different languages. Levshina (2011) coded exemplars of Dutch causative constructions for many different features in comparable corpora of different varieties and then used MDS to visualize how they carve up the causativity space. In this study, we present such an exemplar-level analysis and visualization for referentially rich lexical categories, rather than the less referential, grammatical categories studied by Wälchli and Levshina. We argue that the rich semantics of full lexical categories can be captured in a bottom-up, automatic way by token-level Semantic Vector Spaces (Turney & Pantel 2010; Heylen, Speelman & Geeraerts 2012) and we visualize how the individual occurrences of a set of near-synonyms carve up their concept's semantic space in a comparable corpus of different language varieties. As a case study, we look at all the occurrences of lexemes used to refer to the concept IMMIGRANT in a 1.3 million word corpus of Dutch and Belgian newspapers from 1999 to 2005. A token-level Semantic Vector Space (Heylen, Speelman & Geeraerts 2012) is then used to structure these occurrences semantically based on the similarity of their contextual usage. Multi-Dimensional Scaling allows us to represent these contextual similarities in a 2-dimensional semantic space. With an interactive visualization, we can analyze the different dimensions in the semantic space and their contextual realization, as well as the differences in form-meaning mapping between the Netherlands and Belgium and between different newspapers. We also look at the change in the space and in the form-meaning mappings during the period 1999-2005.

Tummers, José; Speelman, Dirk; Geeraerts, Dirk (2013)
Lectal conditioning of lexical collocations
In this contribution, we will focus on the lectal conditioning of lexical collocations. First, we will analyze how register and national variety modify the distributional properties of AN collocations in Dutch. Next, we will analyze how those lectal variables alter the impact of lexical collocations on the alternation between two inflectional variants of the adjective in Dutch definite NPs with a singular neuter head noun.

Heylen, Kris; Wielfaert, Thomas; Speelman, Dirk (2013)
Tracking Immigration Discourse through Time: A Semantic Vector Space Approach to Discourse Analysis
In previous work (Peirsman, Heylen & Geeraerts 2010), we have shown how Semantic Vector Spaces can serve as an explorative tool to investigate, in large data collections, which wording is used to construe attitudes towards religions and how these change over time. In this paper we extend this work and use Semantic Vector Spaces to identify construal patterns in immigration discourse, and we analyze the construal on the more fine-grained level of specific utterances. As a case study, we look at immigration discourse in Belgium and the Netherlands in the period that both countries experienced a break-through of political parties with a strong anti-immigration platform (Vlaams Blok in Belgium in 1999 and Lijst Pim Fortuyn in the Netherlands in 2002). In a 1.3 million word corpus of Dutch and Belgian newspapers from 1999 to 2005, we collect the occurrences of lexemes referring to the concept IMMIGRANT in Dutch: allochtoon, vreemdeling, (im)migrant, buitenlander, nieuwkomer. We use Semantic Vector Spaces both on a type-level and a token-level (Turney & Pantel 2010, Heylen, Speelman & Geeraerts 2012) to structure the meaning relations between the lexemes and between their occurrences based on the similarity of their contextual usage. Multi-Dimensional Scaling allows us to represent these contextual similarities in a 2-dimensional semantic space. Using an interactive visualization, we can then analyze the different construal patterns and their contextual realizations, the differences between the Netherlands and Belgium and between different newspapers, as well as the change in the construal patterns from 1999 to 2005.

Pizarro Pedraza, Andrea (2013)
Women's Stuff: The Effect of Embodiment in the Sociolinguistic Variation of Sexed Concepts
Background and research question: Menstruation or women's genitalia are considered widespread taboos that surpass cultural boundaries (Douglas 1966). In the general theory of linguistic taboo, that would imply that in some situations speakers avoid those concepts, or convey their meanings through euphemisms (Allan and Burridge 1991; 2006). Now, taking "an experiential view of meaning" (Geeraerts and Kristiansen 2012), it seems pertinent to reflect on the effects of the speakers' gender on the semantic variation of those concepts in use. Our hypothesis is that embodiment thwarts the effect of taboo, which is reflected in the onomasiological variation of sexed concepts across genders (probably in interaction with other variables such as age, education, district and stance, as gender is socio-culturally constructed). Empirical data: We work with a corpus of 54 face-to-face recorded interviews in Spanish, which was designed to indirectly elicit sexual concepts. It was collected ad hoc in two districts of Madrid, controlling for the social background of the speakers (gender, age, education, etc.), in order to account as accurately as possible for the sociolinguistic reality of sexual expressions. For this study, we have manually extracted a subset of expressions referring to concepts belonging to men's and women's biological specificities (body parts, physiological processes). Analytic methods: Assuming the importance of semantic vagueness as a euphemistic strategy (Grondelaers and Geeraerts 1998), we propose to work on the taxonomic level in order to elucidate whether gender (and other social variables) influences the (under)specification of the concepts. A data matrix has been built in which each token is coded for the level of specification and the social background of the speaker. Preliminary results: An initial analysis shows variation across genders. For example, expressions like bleed ("sangrar") for menstruate or belly ("barriga"/"tripa") for pregnancy are mostly used by women in our data. The qualitative approach will be complemented with a quantitative analysis in order to measure whether gender significantly affects taxonomic shifts related to an embodied relation of the speakers towards particular realities.

Wielfaert, Thomas; Heylen, Kris; Speelman, Dirk (2013)
Interactive visualizations of Semantic Vector Spaces for lexicological analysis
Within Computational Linguistics, distributional models of semantics have become the mainstay of large-scale modelling of lexical semantics. Distributional modelling also holds a large potential for research in Linguistics proper: it allows linguists to base their analysis on large amounts of usage data, thus vastly extending their empirical basis, and makes it possible to detect potentially interesting semantic patterns. However, so far, there have been relatively few applications, mainly because of the technical complexity and the lack of a linguist-friendly interface to explore the output. In this paper, we propose an interactive visualization of a distributional similarity matrix based on Multi-Dimensional Scaling. We present our prototype for a visualization tool built in Processing, which opens up new possibilities for the visual analysis of token-based models, and apply it to a small case study of a Dutch polysemous word.

Heylen, Kris; Wielfaert, Thomas; Speelman, Dirk (2013)
Tracking change in word meaning. A dynamic visualization of diachronic distributional semantics
Within Computational Linguistics, distributional models of semantics have become the mainstay of large-scale modeling of lexical semantics (see Turney and Pantel 2010 for an overview). Distributional modeling also holds a large potential for research in Linguistics proper: it allows linguists to base their analysis on large amounts of usage data, thus vastly extending their empirical basis, and makes it possible to detect potentially interesting semantic patterns. However, so far, there have been relatively few applications, mainly because of the technical complexity and the lack of a linguist-friendly interface to explore the output. To address this issue, Heylen et al. (2012) proposed an interactive visualization of a distributional similarity matrix based on Multi-Dimensional Scaling for synchronic data. In this paper, we extend this approach to diachronic data and propose a dynamic visualization of distributional semantic change through motion charts. As a case study, we look at the meaning changes that 17 positive evaluative adjectives have undergone in the Corpus of Historical American English (COHA, Davies 2012) between 1860 and 2000. Visualization of diachronic distributional data has been proposed previously by a.o. Rohrdantz et al. (2011), but these representations were static. In this paper, we use a dynamic visualization of linguistic change, first proposed by Hilpert (2011) for manually coded data sets, and extend it here to large-scale, unsupervised distributional models. For a set of adjectives that express positive evaluation (a.o. brilliant, magnificent, fantastic, terrific, superb), we investigate how they carve up this semantic space and how this changes over time. From COHA, we extracted a word-by-context co-occurrence vector, using a window of 4 words left and right, for each adjective in each of the 14 decades between 1860 and 2000. Next, we calculated the cosine similarity between all adjective/decade vectors and used non-parametric MDS to represent these similarities in 2 dimensions. The MDS solution with adjective and decade information was then visualized with the R package googleVis, an interface between R and the Google Visualization API. The resulting dynamic motion chart is available online at https://perswww.kuleuven.be/~u0038536/magnificent/Magnificent3D.html. The chart shows adjectives as clickable bubbles with a time slider to move between decades. 'Playing' the chart shows dynamically how the semantic distances between the adjectives change over time. In the center, adjectives like splendid, magnificent or great represent the core of the concept and remain relatively stable over time. However, figure 1 shows that terrific was in 1860 still quite far removed from the center, probably because it was still predominantly used in its literal sense of FRIGHTENING. Only around 1950 does terrific start to move to the core and acquire its positive evaluative meaning. Since distributional models are a completely automatic technique with a multitude of possible parameter settings, this particular solution is probably not yet optimal. An important next step is therefore the evaluation of the automatically induced patterns against a manually coded and interpreted dataset.
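A compact sketch of the pipeline described above (a Python approximation for illustration; the study itself used R and googleVis, and the decade slices below are toy data standing in for COHA):

from collections import Counter
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics.pairwise import cosine_distances
from sklearn.manifold import MDS
ADJECTIVES = {"brilliant", "magnificent", "fantastic", "terrific", "superb"}
WINDOW = 4  # context window of 4 words left and right
def context_vectors(decade_tokens):
    # decade_tokens: {decade: list of tokens}; returns {(adjective, decade): Counter}.
    vectors = {}
    for decade, tokens in decade_tokens.items():
        for i, tok in enumerate(tokens):
            if tok in ADJECTIVES:
                ctx = tokens[max(0, i - WINDOW):i] + tokens[i + 1:i + 1 + WINDOW]
                vectors.setdefault((tok, decade), Counter()).update(ctx)
    return vectors
decade_tokens = {  # hypothetical toy slices
    1860: "a terrific storm frightened the magnificent city".split(),
    2000: "a terrific show and a magnificent performance delighted everyone".split(),
}
vecs = context_vectors(decade_tokens)
keys = list(vecs)
X = DictVectorizer(sparse=False).fit_transform([vecs[k] for k in keys])
mds = MDS(n_components=2, dissimilarity="precomputed", metric=False, random_state=0)
coords = mds.fit_transform(cosine_distances(X))
for key, (x, y) in zip(keys, coords):
    print(key, round(float(x), 2), round(float(y), 2))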

Pizarro Pedraza, Andrea (2013)
Factores sociolingüístico-cognitivos en la variación semántica de los conceptos sexuados
Menstruation and the female genitalia are widespread taboos whose linguistic effect is a tendency towards euphemism. However, the theoretical adoption of "an experiential view of meaning" (Geeraerts and Kristiansen 2012) suggests the possibility that people with different bodies and physiologies conceptualise differently the realities that are specific to them. On this basis, we study the effect of speakers' gender on the onomasiological variation of sexed concepts, with the hypothesis that embodied experience counteracts the effect of taboo. From a corpus of 54 interviews on sexuality with informants from Madrid, expressions related to the body and the physiological processes of men and women were extracted. Assuming the importance of semantic vagueness as a euphemistic strategy, we work at the level of taxonomic variation, in order to analyse quantitatively whether the speakers' gender has an effect on the degree of specificity of the linguistic expressions of sexed concepts. Preliminary results show that expressions such as "sangrar" (to bleed) for menstruating and "barriga" (belly) for pregnancy are used mainly by women. This qualitative approach will be complemented with a statistical analysis to measure whether gender (together with other social variables) has a significant effect on taxonomic shifts in the expression of physically experienced realities.

Levshina, Natalia; Heylen, Kris (2012)
Construction Grammar meets semantic vector spaces: A radically data-driven approach to semantic classification of slot fillers.
In this paper we propose an innovative, radically corpus-driven methodology for the analysis of syntactic constructions. It is based on Semantic Vector Space models, a data-driven distributional approach to lexical semantics widely applied in Computational Linguistics (Lin 1998, Heylen et al. 2008). This approach is integrated in a novel way with well-known statistical methods such as regression analysis and hierarchical clustering to create a fully bottom-up and objective classification of constructional slot fillers (words). In a case study of the Dutch causative constructions with doen and laten, based on a large richly annotated corpus, we show how our method can be applied to the modelling of constructional near-synonymy. The results suggest that the method can be used to formulate linguistic hypotheses about the use of constructions, providing a quick and robust estimation of large amounts of data.

Wielfaert, Thomas; Heylen, Kris; Speelman, Dirk; Geeraerts, Dirk (2012)
Token Spaces: a visual tool for analysing word use
Investigating the different uses of a word in texts and corpora is a research activity in several fields of the humanities. Within linguistics, lexicology is the subdiscipline that analyses the semantic structure of words in terms of polysemy, vagueness and meaning relations like metaphor or metonymy. Historical linguists study how these uses developed through time (see Geeraerts 2010 for an overview). Lexicographers record this semantic structure of words in dictionaries. Yet also in disciplines for which language is not the research object per se, scholars analyse the different meanings and uses of words: in literary studies, researchers look at how writers develop themes throughout their works using specific words. Historians, legal scholars and theologians analyse how concepts have been construed by looking at specific word uses in a historic body of texts. Traditionally, such analyses have been done by sorting through concordances of words, i.e. corpus attestations of a word in context. Although many software packages are available to extract concordances and collocations and annotate them, the actual semantic analysis still has to be done manually by the researcher. He or she has to go through a concordance list and organize the attestations in terms of which uses are similar and constitute a separate meaning or typical usage. However, in Computational Linguistics, so-called Semantic Vector Spaces (SVS) have been developed that can detect usage patterns and semantic structure automatically, based on a quantitative, statistical analysis of large corpora. More specifically, SVSs model word meaning in terms of frequency distributions of words over co-occurring context words (Turney and Pantel, 2010 for an overview). Unfortunately, these models are largely black boxes that contain purely mathematical representations of meaning, and hence they are not easily accessible to humanities scholars. However, we have argued (Heylen et al., 2012) that by visualizing these Semantic Vector Spaces, we can attain a double goal: on the one hand, SVSs can become a supporting tool for lexicologists and other humanities scholars to investigate word meaning and usage on a larger scale and in a more data-driven fashion. On the other hand, the SVS models themselves become amenable to evaluation by human specialists. In this study, we use a token-based SVS that models the semantic distances between individual occurrences of a word in terms of their contextual usage. To visualize the output of the SVS and make it accessible to human experts, we use statistical dimension reduction techniques to create two-dimensional scatter plots. In these plots, so-called token clouds become visible and make it possible to distinguish a word's different meanings and usages. As a case study, we analyse the usage of a set of Dutch near-synonyms in a large corpus of Belgian and Netherlandic Dutch. The near-synonyms, i.e. beeldscherm, computerscherm, monitor, and display, all refer to the same concept of COMPUTER SCREEN. In Belgian Dutch however, monitor can also be used to refer to a type of youth leader, for instance speelpleinmonitor (playground monitor). This specific usage in Belgian Dutch is clearly distinguishable in the Token Space. We have made an interactive implementation of the COMPUTER SCREEN scatter plot with both Google Visualization and R using the method developed by Heylen et al. (2012).
The Google Visualization implementation is annotated with manual semantic disambiguations, which helps to visually identify the clusters by means of colour codes. In the R version, on the other hand, the context words are annotated with their weights, which shows how much each context word contributed to the solution. The goal is to make a visualization in which both annotations and weights are combined.
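A minimal sketch of such a token space (illustrative only; the toy type vectors, context words and sense labels are invented, and real token vectors would be built from corpus co-occurrence counts rather than random numbers):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
rng = np.random.default_rng(1)
type_vectors = {w: rng.normal(size=20) for w in
                ["pixel", "display", "resolutie", "jeugd", "speelplein", "leider"]}
tokens = [  # (context words of one occurrence of 'monitor', manual sense label)
    (["pixel", "display", "resolutie"], "COMPUTER SCREEN"),
    (["display", "resolutie"], "COMPUTER SCREEN"),
    (["jeugd", "speelplein", "leider"], "YOUTH LEADER"),
    (["speelplein", "leider"], "YOUTH LEADER"),
]
token_vecs = np.array([np.sum([type_vectors[w] for w in ctx], axis=0)
                       for ctx, _ in tokens])
coords = PCA(n_components=2).fit_transform(token_vecs)
for label in sorted({lab for _, lab in tokens}):
    idx = [i for i, (_, lab) in enumerate(tokens) if lab == label]
    plt.scatter(coords[idx, 0], coords[idx, 1], label=label)
plt.legend()
plt.title("Token space for 'monitor' (toy example)")
plt.savefig("monitor_token_space.png")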

Heylen, Kris; Speelman, Dirk; Geeraerts, Dirk (2012)
Exploring Semantic Space. Word Space Models as a Research Tool for Lexical Semantics
Word Space Models (WSMs) are a statistical-computational technique to compare the collocational behaviour of words in corpora on a large scale (see Turney & Pantel 2010 for an introduction). They are typically used to find similarities or differences in meaning between words, based on their shared or diverging contextual usage. Although primarily a computational technique, WSMs have been applied in Linguistics in diachronic lexical studies (Sagi et al., Peirsman et al. 2010b) and in the study of regional variation (Peirsman et al. 2010a). In this paper, we want to show how WSMs can further aid the linguistic analysis of lexical semantics, provided that they are made accessible to lexicologists through a visualization of their underlying collocational similarity matrices.

Asnaghi, Costanza; Grieve, Jack (2012)
An Analysis of Regional Lexical Variation in California English using Site-Restricted Web Searches
Research question: The main goal of this study is to describe regional variation in California English through a quantitative, corpus-based analysis of written newspaper language. Approach: The innovative approach of this study lies in its contrast with traditional methods of data collection in dialectology. Traditional methods include linguistic interviews in the form of postal questionnaires (e.g. Davis 1948), fieldworker interviews (e.g. Kurath 1939-43), and telephone interviews (e.g. Labov et al. 2006); some rare cases of corpus-based data collection are also recorded (e.g. Grieve 2009). However, a lexical study requires an extremely large corpus or a huge amount of interviews. Method: In order to overcome the obstacle of the quantity of data, a new method of data collection was developed. The method aims to identify patterns of regional lexical variation using site-restricted web searches. For each variant of a lexical alternation, the number of pages containing that variant in a series of city newspaper websites is counted. A Perl (LWP) script was used to automatically query online search engines and extract the number of hits from the html source code of the results page. Given these results, the alternation is then measured quantitatively as a proportion. This method has been validated in the US as a whole (Grieve and Asnaghi 2011). Thanks to the quantity of the data and to advanced statistics, it is possible to find regional patterns despite the noise in the data collected through site-restricted web searches. Raw maps show the results of the research: each alternation is measured quantitatively as the proportion of the first form relative to the second form, and then mapped. Local spatial autocorrelation statistics are used to smooth the raw data and cut through the noise (Ord and Getis 1995; Grieve 2011). Autocorrelated maps identify significant patterns of spatial clustering, the result being similar to an isogloss drawing. A multivariate spatial analysis will be conducted to identify common patterns of regional variation and dialect regions (Grieve et al 2011). Data: A list of 422 Californian newspapers from 336 Californian cities was collected. A list of 130 word alternations was also collected: variables were chosen both following previous dialectology studies (Vaux's Harvard Survey of North American Dialects; Kurath's A Word Geography of the Eastern United States, 1949; Cassidy's Dictionary of American Regional English, 1985-2002; Grieve's A Corpus-Based Regional Dialect Survey of Grammatical Variation in Written Standard American English, 2009) and from a convergence/divergence project on The Brown University Standard Corpus of Present-Day American English (Ruette et al in preparation). Expected results: It will be possible to compare maps plotted from this new study to previous California English maps (Bright 1971), Spanish in California maps, settlement maps and travel time maps, in order to identify the significance of these predictors. Also, North/South and inland/coastal distinctions, if applicable, will be considered.
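A small sketch of the core quantitative steps (a simplification for illustration: the counts and coordinates are invented, and a plain inverse-distance-weighted mean stands in for the Getis-Ord local spatial autocorrelation statistic used in the study):

import math
cities = {  # city: (latitude, longitude, hits for variant A, hits for variant B)
    "Sacramento": (38.58, -121.49, 140, 60),
    "San Francisco": (37.77, -122.42, 210, 190),
    "Fresno": (36.74, -119.79, 80, 120),
    "Los Angeles": (34.05, -118.24, 300, 450),
    "San Diego": (32.72, -117.16, 120, 200),
}
proportion = {c: a / (a + b) for c, (_, _, a, b) in cities.items()}
def distance(c1, c2):
    (y1, x1), (y2, x2) = cities[c1][:2], cities[c2][:2]
    return math.hypot(y1 - y2, x1 - x2)
def smoothed(city):
    # Inverse-distance-weighted mean of the proportions of all other cities.
    weights = {o: 1 / distance(city, o) for o in cities if o != city}
    return sum(w * proportion[o] for o, w in weights.items()) / sum(weights.values())
for city in cities:
    print(f"{city:14} raw={proportion[city]:.2f} smoothed={smoothed(city):.2f}")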

Pizarro Pedraza, Andrea (2012)
Stance, Identity and the Lexical-Semantic Variation of Taboos: The Abortion Debate in Spanish Online Newspapers’ Comments and Face-to-Face Interviews in contrast
Verbal taboo is at a complex crossroads where many disciplines meet. This has resulted in some theoretical and methodological diffusion, and some pertinent questions have been left quite unattended, or only partially resolved; namely, those concerning the indexical power of variation in the expression of taboo concepts, which requires a sociolinguistic perspective. As has been demonstrated in Sociolinguistics in recent years, linguistic features are tools at hand for speakers to build their identities in discourse, and their variation is meaningful ('Third Wave', Eckert 2005). This study defends that taboo concepts are extremely revealing in this perspective, because they participate in a complex interplay of social, moral and emotional, deeply-rooted regimes (Irvine 2011), manifested in discourse in a variety of ways. In this paper, we compare the results of two studies on the concept of abortion in contemporary Spanish. The first is based on a corpus of readers' comments on online newspaper articles from the day of the approval of the new Law of Abortion in Spain, in March 2010. The second is a corpus of interviews on sexuality that we collected ad hoc in Madrid. We focus on a subset of questions based on the Law of Abortion. Our aim is to analyze how opposed discourses on abortion put the concept into words, and how they do so in different contexts (written vs. oral, anonymous vs. face-to-face, etc.). In order to cope with the lack of analytical solutions for the study of lexical-semantic variation in Sociolinguistics, we base our method on Cognitive Semantics. We consider that lexical-semantic choices are strategies contributing to the construction of social identities based on differences in conceptualizations (Kristiansen and Dirven 2008). The results present variation corresponding to different stances, roughly Pro-life and Pro-choice discourses. Both stances are better represented in the anonymous comment corpus, where there is consequently more variation than in the interviews. The very consideration of abortion as a taboo or not is an ideological statement; therefore, we find contrasting tendencies in the use of the literal abortion vs. non-literal semantic variants (metaphors, metonymies, etc.). Within these, lexical variation (eliminate, murder… vs. decide, voluntary interruption of pregnancy…) reflects a complex matrix of intertextual, cultural, and historical references that determine the local shape of an international debate. This mixed method copes with the traditional difficulties of the sociolinguistic analysis of lexical-semantic variation. It makes it possible to analyze differences at the lexical-semantic level in the discursive construction of opposed stances and, furthermore, to show how these stances are performed differently under the circumstances of contexts like online written comments and oral interviews. The analysis of how taboo concepts are put into words is extremely revealing of how identities are constructed in discourse at this level, because they take on very local, social meanings.

Tummers, José; Deveneyns, Annelies (2012)
Learner Corpora in Use: A Taxonomy of Flemish Students’ Errors in Written Dutch
The language proficiency of youngsters is deteriorating, which is a pressing problem, especially in (higher) education, and has teachers raising the alarm. At Leuven University College, a Flemish university college of 6,500 students, a project is running to analyse students' written language proficiency in their mother tongue, Dutch. A corpus of 346 texts was gathered by asking students of all programmes to write a 500-word persuasive text. Starting from James' broad definition of an error as "an unsuccessful bit of language", all errors in the corpus were identified and coded using a scheme that, in line with learner corpus research, combines linguistic and error information. The following research questions were posed: (i) What are the most frequently made errors? (corpus frequency) (ii) What are the most typical errors? (document frequency) A quantified error taxonomy sheds light on the corpus and document frequency, which combined give us an insight into the distribution of errors, as well as the extent to which those errors recur. Errors of textual grammar (especially referential coherence), syntax, punctuation and lexical use are the most frequent and widespread. Those results are the starting point for a usage-based remediation process of students' written language proficiency.
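A minimal sketch of the two frequency notions used in the research questions above (the error codes and per-text annotations are invented for illustration):

from collections import Counter
texts_errors = [  # error codes annotated per student text (made-up data)
    ["PUNCT", "REF_COH", "PUNCT", "LEX"],
    ["SYNTAX", "REF_COH"],
    ["PUNCT", "LEX", "LEX"],
]
corpus_freq = Counter(code for errors in texts_errors for code in errors)    # total occurrences
doc_freq = Counter(code for errors in texts_errors for code in set(errors))  # texts containing the error
for code in corpus_freq:
    print(f"{code:8} corpus={corpus_freq[code]} documents={doc_freq[code]}")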

De Hertog, Dirk; Heylen, Kris; Kockaert, Hendrik J.; Speelman, Dirk (2012)
The prevalence of multiword term candidates in a legal corpus.
Many approaches to term extraction focus on the extraction of multiword units, assuming that multiword units comprise the majority of terms in most subject fields. However, this supposed prevalence of multiword terms has gone largely untested in the literature. In this paper, we perform a quantitative corpus-based analysis of the claim that multiword units are more technical than single word units, and that multiword units are more widespread in specialized domains. As a case study, we look at Dutch terminology from the Belgian legal domain. First, the relevant units are extracted using linguistic filters and an algorithm to identify Dutch compounds and multiword units. In a second step, we calculate for all units an association measure that captures the degree to which a linguistic unit belongs to the domain. Thirdly, we analyze the relationship between the units' technicality, frequency and their status as a simplex, compound or multiword unit.
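A small sketch of one way such a domain association measure could be computed (an assumption for illustration, not necessarily the measure used in the paper; the counts and corpus sizes are invented):

import math
def domain_specificity(unit, domain_freq, ref_freq, domain_size, ref_size):
    # Log ratio of the unit's relative frequency in the legal corpus to its
    # relative frequency in a general reference corpus, with add-one smoothing.
    rel_domain = (domain_freq.get(unit, 0) + 1) / domain_size
    rel_ref = (ref_freq.get(unit, 0) + 1) / ref_size
    return math.log2(rel_domain / rel_ref)
domain_freq = {"voorlopige hechtenis": 420, "rechtbank": 5100, "tafel": 12}
ref_freq = {"voorlopige hechtenis": 8, "rechtbank": 900, "tafel": 4500}
for unit in domain_freq:
    score = domain_specificity(unit, domain_freq, ref_freq, 2_000_000, 50_000_000)
    print(f"{unit:22} {score:6.2f}")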

Heylen, Kris; De Hertog, Dirk (2012)
A distributional corpus analysis of Dutch endo- and exocentric compounds
In Dutch, like in other Germanic languages, compounding is a highly productive strategy to form new words. The most common pattern is to simply glue together two existing nouns (possibly with a binding morpheme) into one new compound noun, e.g. appel+taart (apple pie) or regering+s+beslissing (government decision). Most noun-noun compounds are endocentric: the right-hand noun is the head of the compound and the left-hand noun is the modifier, so that the compound as a whole is in a type-of relation with the head. An appeltaart is a type of taart (pie). However, some compounds are not in a type-of relation to their head and are then called exocentric. For example, a grapjas (lit. "joke coat", fig. joke-cracker) is not a type of coat and a seksbom (sex bomb) is not a type of bomb. In this study, we investigate whether distributional corpus frequency statistics can differentiate between endo- and exocentric compounds.
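One possible operationalisation, sketched here purely as an illustration (it is not necessarily the statistic used in the study, and the vectors are random stand-ins for corpus-derived distributional vectors): an endocentric compound should be distributionally more similar to its head noun than an exocentric one.

import numpy as np
def cos_sim(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
rng = np.random.default_rng(2)
taart = rng.normal(size=50)
vectors = {
    "taart": taart,
    "appeltaart": taart + rng.normal(scale=0.3, size=50),  # endocentric: close to its head
    "jas": rng.normal(size=50),
    "grapjas": rng.normal(size=50),                         # exocentric: unrelated to its head
}
for compound, head in [("appeltaart", "taart"), ("grapjas", "jas")]:
    print(compound, "vs", head, round(cos_sim(vectors[compound], vectors[head]), 2))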

Heylen, Kris; Speelman, Dirk; Geeraerts, Dirk (2012)
Looking at word meaning. An interactive visualization of Semantic Vector Spaces for Dutch synsets
In statistical NLP, Semantic Vector Spaces (SVS) are the standard technique for the automatic modeling of lexical semantics. However, it is largely unclear how these black-box techniques exactly capture word meaning. To explore the way an SVS structures the individual occurrences of words, we use a non-parametric MDS solution of a token-by-token similarity matrix. The MDS solution is visualized in an interactive plot with the Google Chart Tools. As a case study, we look at the occurrences of 476 Dutch nouns grouped in 214 synsets.

Deveneyns, Annelies; Tummers, José (2011)
Zoek de fou(d)t: een taxonomie van de fouten in teksten van professionele bachelors
There is a growing consensus in society that the written language proficiency of youngsters is deteriorating, and in (higher) education the alarm is being raised ever more often. At KHLeuven, a research project is under way that maps the written language proficiency of professional bachelor students. Texts written by students were analysed for structure, lexical richness and error load. The resulting error taxonomy can subsequently be used for empirically grounded remediation programmes.

Heylen, Kris; Geeraerts, Dirk (2011)
A Journey through Word Space. Semantic Vector Space Models as an Analysis Tool for Lexicologists and Lexicographers
Linguistics is increasingly becoming a data-driven science. On the one hand, there is a growing awareness that descriptive studies and theoretical claims can no longer rely on just a handful of examples, but should be based on a thorough analysis of empirical data. As a consequence, a growing number of studies use advanced statistical techniques to corroborate their hypotheses. On the other hand, there simply is more and more data available. Corpora are getting bigger and more diverse. Next to falsifying existing hypotheses, this creates the need for methods to explore these large amounts of data, find meaningful patterns and generate new hypotheses. In this regard, lexical semantics presents a specific challenge: word meaning is inherently complex and multifaceted. A word's meaning can be structured along many different semantic dimensions and the relevant dimensions can differ drastically from word to word. Moreover, a word's semantic structure can only be described in sufficient detail by looking at a large number of examples. Similarly, the semantic relations between words can be quite diverse and multiple relations can hold simultaneously. It is clear that a lexicologist or lexicographer seeking to give an empirically adequate analysis of lexical semantics would benefit greatly from a tool to explore, organize and find patterns in the large and complex data set that he or she is presented with. In this talk we will discuss the potential of Semantic Vector Space Models, also known as Word Spaces, to function as such a tool. Semantic vector representations were originally developed in Computational Linguistics as a way to model semantic similarity quantitatively. They have been applied to a wide variety of computational tasks, involving both the automatic disambiguation of individual polysemous words and the discovery of taxonomic relations between multiple words. This functionality opens up the perspective of providing a lexicologist with a preliminary organization of his or her data set in terms of clusters of similar words or of similar uses of a word, which can then be a starting point for further analysis. However, in most computational applications, Semantic Vector Spaces are used as a black-box technique: they output large, unwieldy matrices that are not readily interpretable by a language expert. Moreover, they may show that two words are semantically similar or that a given instance is likely to belong to one predefined sense rather than another, but they do not tell the researcher WHY this is the case. In this presentation, we will discuss various methods that our research group has developed to visualize Semantic Vector Spaces and make their output accessible to linguists. Additionally, we will also discuss a number of applications to historical data that have been recently proposed in the literature.

Heylen, Kris; Ruette, Tom; Speelman, Dirk; Geeraerts, Dirk (2011)
Degrees of Semantic Control in Measuring Lexical Distances
Aggregated and quantitative measurements of linguistic differentiation have been conducted at varying levels of generalization, from the macro-level of typology to the micro-level of dialectology. This presentation is situated in the tradition of stylometric and regiolectal studies that analyze linguistic differences at the intermediate level of supraregional varieties (e.g. British vs. American English) and registers (e.g. academic writing vs. informal speech) within a single language. Although all these traditions rely on different data types, one issue they have all been dealing with is how semantic differences can be measured in a way that goes beyond mere frequencies of lexemes. This presentation will compare three methods to measure differences between language varieties as represented by text corpora, based on aggregation over lexical variables. The methods all originate in quantitative corpus linguistics, but differ in the extent to which they model and control the semantics of the underlying lexical variables. The research question we address is how much of this semantic control is necessary for detecting the differences between varieties of a language, in this case stylistically and regionally defined varieties of Dutch.
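
As a point of reference for the least controlled end of that scale, the sketch below computes a purely frequency-based distance between two varieties by aggregating over lexical variables without any semantic control; all lexemes and counts are invented for illustration and the sketch does not reproduce the three methods compared in the presentation.

# Minimal sketch: a lexical distance between two varieties with no semantic
# control, i.e. aggregation over raw lexeme frequencies (all counts invented).
import numpy as np

# Hypothetical frequencies per million words in two corpora.
freqs = {
    "magnetron": {"NL": 12.0, "BE": 3.0},
    "microgolf": {"NL": 0.5, "BE": 9.0},
    "jus_d_orange": {"NL": 7.0, "BE": 1.0},
    "sinaasappelsap": {"NL": 4.0, "BE": 11.0},
}

lexemes = sorted(freqs)
nl = np.array([freqs[w]["NL"] for w in lexemes])
be = np.array([freqs[w]["BE"] for w in lexemes])

# Normalise to proportions and take half the city-block distance between profiles.
nl_p, be_p = nl / nl.sum(), be / be.sum()
distance = np.abs(nl_p - be_p).sum() / 2  # between 0 (identical) and 1
print(f"frequency-only distance NL vs BE: {distance:.3f}")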

Pizarro Pedraza, Andrea (2011)
The Sociolinguistics of Spanish Sexual Metaphors in Speech
Notwithstanding some particular situations, sexuality is still a taboo in many societies, which means that there is a social interdiction upon it, whose origins are ancient and have been extensively studied by anthropologists and psychologists (for an overview, see Allan and Burridge, ch. 1). The lexical items from this field behave as taboo words, defined as “words and expressions which are supposed not to be used, and which are shocking, offensive, blasphemous or indecent when they are used” (Trudgill: 133). Consequently, the reference to such elements, particularly in interaction, is usually not direct, and they are replaced by non-offensive expressions known as ‘euphemistic substitutes’. The linguistic means for the formation of these substitutes are multiple (Crespo Fernández: 108), but at the lexical-semantic level they are very often based on the use of figurative language. Metaphor is not only considered the preferred tool for their expression, but even their hyperonym, in line with cognitive theory (Chamizo Domínguez): the euphemistic metaphor thus has a target, the taboo concept, and a source, the figurative substitute. My main interest is the lexical-semantic variation of the source domains, as it shows a wide heterogeneity – and creativity – in the metaphorical expression of sexuality in Spanish, as can be seen in Spanish taboo-word dictionaries (Cela). The study of the lexical choices reveals the concepts that are related to sexual categories in contemporary Spanish. Now, considering, with Cognitive Sociolinguistics, that our social environment in its broadest sense determines our understanding of the world (Kristiansen and Dirven), I address a central research question: what is the relation between the social categories of the speakers and the choice of a source domain for the creation of sexual euphemistic metaphors? In my study, I investigate lexical-semantic variation in the taboo field of sexuality, in interaction. I collect my data through sociolinguistic interviews, with a stratified sample from two socially differentiated districts in Madrid. My approach to the data combines sociolinguistic and cognitive metaphorical analysis. I expect the results to show a relation between the lexical-semantic variation of the source domain and the stance (Jaffe) of the speaker, as an indirect index of other social factors such as gender, age, education and socioeconomic class, which would demonstrate that these metaphors are, on the one hand, a window for observing the weight of external factors in the conceptualization of sexuality and, on the other, a crucial element in the construction of identity in social interaction.

Grieve, Jack; Asnaghi, Costanza (2011)
The Analysis of Regional Lexical Variation using Site-Restricted Web Searches
This paper presents a novel method for the analysis of regional lexical variation using site-restricted web searches. In total, 39 binary lexical alternation variables whose regional distribution in American English is known from previous research were analyzed using this method, including low-frequency content word alternations (e.g. frosting/icing, cemetery/graveyard, sunset/sundown) and high-frequency function word alternations (e.g. though/although, among/amongst, backward/backwards). Data was collected using the following procedure. First, a list of 2061 newspapers from across the contiguous United States was harvested from refdesk.com. Second, the Internet search engine bing.com was used to count the number of web pages hosted by each of these 2061 newspaper websites that contain tokens of each of the 78 lexical variants, using site-restricted searches (e.g. frosting site:www.latimes.com). Third, counts were combined for newspapers from the same city, and cities with low counts were then deleted, leaving 822 cities in the final dataset. Finally, for each of the 39 lexical alternation variables a proportion was calculated for each of the 822 cities by dividing the number of hits for the first variant by the number of hits for both variants. These proportions were then mapped across the cities in the corpus and subjected to a multivariate spatial analysis (Grieve et al. 2011) in order to identify patterns of regional linguistic variation in the dataset. The results of this analysis were then compared to the results of previous American dialect surveys. In almost every case the regional pattern identified by the web-based analysis agreed with the results of previous dialect surveys. Based on these comparisons, it is argued that this web-based approach is both a valid and an efficient method for gathering data on regional lexical variation.
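
The querying itself depends on the search engine, so the sketch below starts from already collected hit counts (invented numbers) and only illustrates the aggregation step described above: combining newspapers per city, dropping cities with low counts and computing the proportion of the first variant.

# Sketch of the aggregation step for one variable (frosting vs. icing).
# Hit counts per newspaper site are invented; collecting them would require
# issuing site-restricted queries (e.g. 'frosting site:www.latimes.com').
from collections import defaultdict

hits = [
    # (city, newspaper site, hits for 'frosting', hits for 'icing')
    ("Los Angeles", "www.latimes.com", 210, 95),
    ("Los Angeles", "www.dailynews.com", 80, 40),
    ("Atlanta", "www.ajc.com", 60, 180),
    ("Boston", "www.bostonglobe.com", 3, 2),   # too few hits, will be dropped
]

MIN_TOTAL = 20  # hypothetical minimum combined hits for a city to be kept

per_city = defaultdict(lambda: [0, 0])
for city, _site, first, second in hits:
    per_city[city][0] += first
    per_city[city][1] += second

for city, (first, second) in sorted(per_city.items()):
    total = first + second
    if total < MIN_TOTAL:
        continue  # delete cities with low counts
    print(f"{city:12s} proportion 'frosting' = {first / total:.2f}  (n={total})")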

Heylen, Kris; Speelman, Dirk (2011)
Quantifying Lexical Variation: A Corpus-Based Analysis of Semasiological Divergence in Dutch.
Lexical variation has received considerable attention in linguistics. However, most studies focus on specific case studies. In this paper, we will introduce so-called distributional models of lexical semantics that were originally developed in Computational Linguistics and that allow a large-scale analysis of meaning differences on the basis of corpus data. We will show how these models can be integrated in the research programme of Cognitive Sociolinguistics (Kristiansen & Dirven 2008) by providing a usage-based way of identifying semantic classes and analysing their internal structure while at the same time taking into account lectal variation. Distributional models of lexical semantics (also known as vector spaces or word space models) try to capture the meaning of a word on the basis of the contexts in which it appears. Models of the word-based type count the frequencies of the context words that often occur together with the target, while syntax-based methods look at the syntactic relations in which the target word takes part. Target words that occur in similar contexts are then taken to be semantically related. For instance, the semantic similarity between stroke and caress can be derived from the fact that they both often co-occur with words like lovingly and hand, or that they both often appear with an object like cat. Based on this contextual information, semantically similar words can then be clustered into semantic classes. In this paper we will look at a sample of 10,000 nouns, adjectives and verbs in a large newspaper corpus of Netherlandic and Belgian Dutch (1.5 billion words). In a first step, we cluster the Netherlandic verbs into semantic classes based on their selectional preferences, i.e. the specific lexemes they take as arguments (subjects, objects) and modifiers (adverbs). For a number of semantic classes, we interpret the conceptual relations within the class and we analyze the internal structure in further detail by looking at the arguments and modifiers that are most typical of that semantic class. In a second step, we do a similar analysis for the Belgian Dutch verbs and we compare the composition of the semantic verb classes in the two varieties. More specifically, we zoom in on verbs that belong to different semantic classes in the two varieties and we analyze which differences in selectional preferences cause the diverging classification.
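
To make the two-step procedure concrete, the sketch below clusters a handful of verbs by invented selectional-preference counts in one variety and then checks where the other variety's vectors land; the verbs, features, counts and clustering settings are toy assumptions, not those of the actual study.

# Toy sketch: cluster verbs by selectional preferences in one variety and
# check where the other variety's usage lands (all counts are invented).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

verbs = ["eten", "drinken", "proeven", "ervaren", "beleven"]
features = ["obj:brood", "obj:wijn", "obj:sfeer", "obj:cultuur"]

nl = np.array([   # Netherlandic Dutch co-occurrence counts (invented)
    [30,  5,  0,  0],
    [ 2, 35,  0,  0],
    [ 8, 20,  2,  1],
    [ 0,  0, 15, 20],
    [ 0,  0, 25, 10],
], dtype=float)
be = np.array([   # Belgian Dutch counts; 'proeven' shifted for the sake of example
    [28,  6,  0,  0],
    [ 3, 33,  0,  0],
    [ 2,  5, 18, 12],
    [ 0,  0, 16, 19],
    [ 0,  0, 24, 12],
], dtype=float)

nl_vecs, be_vecs = normalize(nl), normalize(be)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(nl_vecs)
be_labels = km.predict(be_vecs)

for verb, nl_cl, be_cl in zip(verbs, km.labels_, be_labels):
    marker = "  <- diverges" if nl_cl != be_cl else ""
    print(f"{verb:10s} NL class {nl_cl}  BE class {be_cl}{marker}")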

Levshina, Natalia; Heylen, Kris; Geeraerts, Dirk (2011)
Corpus-based analysis of near-synonymous constructions: Larger, faster and more objective
In this paper we propose a radically corpus-driven methodology for studies of near-synonymous constructions. It is based on the quantitative method of semantic space models (Lin 1998, Heylen et al. 2008), which is used to create a fully bottom-up and objective classification of constructional slot fillers. On the material of Dutch causative constructions with doen and laten we show how this method can be applied to popular statistical models of constructional near-synonyms, such as regression models and collostructional analysis, and discuss the advantages of the method, as well as its future challenges.
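
The following toy sketch illustrates the general idea of such a pipeline under invented data: constructional slot fillers are first classified bottom-up from a small vector space, and the induced class is then entered as a predictor in a simple regression model of the doen/laten choice. The vectors, observations and model settings are assumptions for illustration only and do not reproduce the study's models.

# Toy pipeline: (1) cluster infinitive slot fillers bottom-up from a tiny,
# invented "semantic space", (2) use the induced class as a predictor in a
# logistic regression for the doen/laten alternation. Purely illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Step 1: hypothetical distributional vectors for infinitives filling the
# effected-predicate slot (rows are invented, not real corpus vectors).
infinitives = ["denken", "geloven", "vallen", "struikelen"]
vectors = np.array([
    [0.9, 0.1, 0.0],   # mental predicates
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.8],   # physical predicates
    [0.0, 0.8, 0.9],
])
classes = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
class_of = dict(zip(infinitives, classes))

# Step 2: invented corpus observations of the causative construction.
observations = [("denken", "doen"), ("geloven", "doen"), ("denken", "doen"),
                ("vallen", "laten"), ("struikelen", "laten"), ("vallen", "laten"),
                ("geloven", "laten"), ("struikelen", "doen")]
X = np.array([[class_of[inf]] for inf, _ in observations])
y = np.array([1 if aux == "laten" else 0 for _, aux in observations])

model = LogisticRegression().fit(X, y)
print("induced classes:", class_of)
print("P(laten) per class:", model.predict_proba([[0], [1]])[:, 1].round(2))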

Heylen, Kris; Speelman, Dirk; Geeraerts, Dirk (2011)
A Semantic Vector Space for Modelling Word Meaning in Context
Semantic vector spaces have become the mainstay of the modelling of word meaning in statistical NLP. They encode the semantics of words through high-dimensional vectors that record the co-occurrence of those words with context features in a large corpus. Vector comparison then allows for the calculation of e.g. semantic similarity between words. Most semantic vector spaces represent word meaning on the type (or lemma) level, i.e. their vectors generalize over all occurrences of a word. However, the meaning of words can differ considerably between contexts due to polysemy or vagueness. Therefore, many applications, like Word Sense Disambiguation (WSD) or Textual Entailment, require that word meaning be modelled on the token level, i.e. the level of individual occurrences. In this paper, we present a semantic vector space model that represents the meaning of word tokens by taking the word type vector and reweighting it based on the words observed in the token's immediate vicinity. More specifically, we give a bigger weight to the context features in the original type vector that are semantically similar to the context features observed around the token. This semantic similarity between context features is calculated based on the original word-type-by-context-feature matrix. We explore the performance of this model in a WSD task by visualizing how well the model separates the different meanings of polysemous words in a Multi-Dimensional Scaling solution. We also compare our model to other token-level semantic vector spaces as proposed by Schütze (1998) and Erk & Padó (2008).
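
The reweighting idea can be illustrated with a minimal numpy sketch on invented counts: each context feature of the type vector is upweighted according to its similarity, read off the same type-level matrix, to the words actually observed around the token. The words, counts and the use of a simple maximum over context words are illustrative assumptions rather than the model's actual settings.

# Minimal sketch of the token-level reweighting idea (all counts invented):
# a token vector is the target's type vector with each context feature
# upweighted by its similarity to the words observed around the token.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

features = ["geld", "lenen", "rivier", "water"]   # context features (columns)
types = {                                          # word-type-by-context-feature matrix
    "bank":   np.array([4.0, 3.0, 3.0, 2.0]),      # polysemous target
    "geld":   np.array([6.0, 4.0, 0.0, 1.0]),
    "lenen":  np.array([5.0, 3.0, 0.0, 0.0]),
    "rivier": np.array([1.0, 0.0, 6.0, 5.0]),
    "oever":  np.array([0.0, 0.0, 5.0, 6.0]),
    "water":  np.array([0.0, 0.0, 5.0, 7.0]),
}

def token_vector(target, context_words):
    # Features similar to the observed context words (similarity taken from the
    # same type-level matrix) receive a higher weight.
    weights = np.array([
        max(cosine(types[f], types[c]) for c in context_words)
        for f in features
    ])
    return types[target] * weights

print(token_vector("bank", ["geld", "lenen"]).round(2))    # 'financial institution' use
print(token_vector("bank", ["rivier", "oever"]).round(2))  # 'riverbank' use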

Heylen, Kris; Peirsman, Yves; Geeraerts, Dirk (2010)
The bottom-up identification of semantic verb classes in corpora: Combining contextual and lectal information.
One focus of research within Cognitive Semantics is the way in which the lexicon reflects the conceptual organization of our world knowledge. Lexical items have been described as being structured into larger, conceptually motivated wholes like Idealized Cognitive Models (Lakoff 1987), Frames (Fillmore 1985) or, more generally, semantic classes. Following the trend towards usage-based and more empirical approaches in Cognitive Linguistics, recent studies have analysed the internal structure of these semantic classes through the statistical analysis of corpus data (e.g. Divjak & Gries 2006 on Russian verbs of beginning) and some have also taken into account lectal variation within semantic class structure (e.g. Glynn 2008 on verbs of annoying in different varieties of English). However, most of these studies looked at a limited number of verbs selected from a predefined semantic class. In this paper, we will introduce so-called distributional models of lexical semantics that were originally developed in Computational Linguistics and that make it possible to induce semantic classes in a fully bottom-up fashion from corpus data. We will show how these models can be integrated in the research programme of Cognitive Sociolinguistics (Kristiansen & Dirven 2008) by providing a usage-based way of identifying semantic classes and analysing their internal structure while at the same time taking into account lectal variation.

Heylen, Kris (2010)
Distributional models of verb meaning: syntactic versus lexical contexts.
Over the last decade or so, distributional methods have become the mainstay of semantic modelling in Computational Linguistics. As such, they have also been applied to the automatic modelling of verb meaning. However, more than with other lexical categories, the research into verb semantics has taken its inspiration from the idea that a verb's meaning is strongly linked to its syntactic behaviour and, more specifically, to its selectional preferences. Depending on how they use these selectional preferences, distributional models of verb meaning come in two flavours. The first approach has its historical origins in the linguistic research tradition of verb valency and frame semantics and is in principle purely syntactic in nature. A verb's semantic category is said to be inferable from its distribution over subcategorization (subcat) frames, i.e. the possible combinations of syntactic verb arguments like subject, direct object, indirect object etc. Additionally, this purely syntactic information can be extended with some high-level semantic information like the animacy of the verb arguments (see Schulte im Walde 2006 for an overview). Whereas this first, syntax-oriented approach is specifically geared towards verbs, the second approach is more generally applicable to all lexical categories and is a direct implementation of the ideas of Harris (1954). These so-called word space models use other words as context features, with a specific implementation using only those context words that co-occur in a given dependency relation with the target word (see Padó and Lapata 2007 for an overview). In the first approach, one context feature is a possible combination of syntactic arguments that a verb can govern. In the second approach, one specific context feature corresponds to one lexeme plus its syntactic relation to the target verb. Whereas the first approach is mostly used to automatically induce Levin-style verb classes, the second approach is typically applied to retrieve semantic equivalents for specific verbs (but see Li and Brew 2008 for a comparison of the two methods on the task of inducing Levin-style classes). In this presentation we will take a closer look at the kind of semantic information that is captured by these two distinct types of distributional methods for verb meaning. For a sample of 1000 frequent Dutch verbs we construct the two basic models described above from an automatically parsed corpus of Dutch newspapers. In a first step, we use all of the verb-specific dependency relations covered by the parser to calculate distributional similarities between the verbs. In a second step we reduce the number of dependency relations to include only core arguments (excluding so-called complements). In a first general evaluation, we look at the overall correlation between the verb similarities produced by the models to gauge the extent to which they contain different information. We show that they, at least partially, capture comparable semantic distances. In a second evaluation, we analyse the models' performance on the task of finding semantically related verbs as recorded in Dutch WordNet and find that Word Space Models outperform Subcat Models. Finally, a third evaluation focuses on a subset of verbs that are the Dutch cognates of the German verbs in Schulte im Walde (2006). We compare how well the models' similarity matrices allow clustering of the verbs into a number of Levin-type classes. In this case, the semantically enriched Subcat Model gives the best results.
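
The contrast between the two feature types can be sketched schematically as follows, with invented counts: one matrix records a verb's distribution over subcategorization frames, the other records lexeme-plus-relation context features, and the resulting pairwise verb similarities are correlated in the spirit of the first evaluation described above. All verbs, features and counts are illustrative assumptions.

# Schematic contrast of the two feature types on invented counts: a subcat model
# (verb-by-frame distributions) vs. a dependency-based word space model
# (verb-by-"relation:lexeme" features), compared via the correlation of the
# pairwise similarities they produce.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics.pairwise import cosine_similarity

verbs = ["eten", "drinken", "geven", "sturen"]

# Model 1: distribution over subcategorization frames (invented counts).
frames = ["su", "su+obj1", "su+obj1+obj2"]
subcat = np.array([
    [10, 40,  0],
    [12, 35,  0],
    [ 1, 20, 30],
    [ 2, 25, 28],
], dtype=float)

# Model 2: lexeme plus dependency relation as context features (invented counts).
dep_feats = ["obj1:brood", "obj1:koffie", "obj1:cadeau", "obj2:kind"]
word_space = np.array([
    [30,  2,  0,  0],
    [ 1, 35,  0,  0],
    [ 0,  0, 25, 20],
    [ 0,  0, 18, 15],
], dtype=float)

sim_subcat = cosine_similarity(subcat)
sim_wspace = cosine_similarity(word_space)

# Correlate the two similarity matrices over the verb pairs (upper triangle).
iu = np.triu_indices(len(verbs), k=1)
rho, p = spearmanr(sim_subcat[iu], sim_wspace[iu])
print(f"Spearman correlation between the two models' verb similarities: {rho:.2f}")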

Heylen, Kris (2009)
Distributional Models of Verb Meaning: Syntactic versus Lexical Contexts
Over the last decade or so, distributional methods have become the mainstay of semantic modelling in Computational Linguistics. As such, they have also been applied to the automatic modelling of verb meaning. However, more than with other lexical categories, the research into verb semantics has taken its inspiration from the idea that a verb's meaning is strongly linked to its syntactic behaviour and, more specifically, to its selectional preferences. Depending on how they use these selectional preferences, distributional models of verb meaning come in two flavours. The first approach has its historical origins in the linguistic research tradition of verb valency and frame semantics and is in principle purely syntactic in nature. A verb's semantic category is said to be inferable from its distribution over subcategorization (subcat) frames, i.e. the possible combinations of syntactic verb arguments like subject, direct object, indirect object etc. Additionally, this purely syntactic information can be extended with some high-level semantic information like the animacy of the verb arguments (see Schulte im Walde 2006 for an overview). Whereas this first, syntax-oriented approach is specifically geared towards verbs, the second approach is more generally applicable to all lexical categories and is a direct implementation of the ideas of Harris (1954). These so-called word space models use other words as context features, with a specific implementation using only those context words that co-occur in a given dependency relation with the target word (see Padó and Lapata 2007 for an overview). In the first approach, one context feature is a possible combination of syntactic arguments that a verb can govern. In the second approach, one specific context feature corresponds to one lexeme plus its syntactic relation to the target verb. Whereas the first approach is mostly used to automatically induce Levin-style verb classes, the second approach is typically applied to retrieve semantic equivalents for specific verbs (but see Li and Brew 2008 for a comparison of the two methods on the task of inducing Levin-style classes). In this presentation we will take a closer look at the kind of semantic information that is captured by these two distinct types of distributional methods for verb meaning. For a sample of 1000 frequent Dutch verbs we construct the two basic models described above from an automatically parsed corpus of Dutch newspapers. In a first step, we use all of the verb-specific dependency relations covered by the parser to calculate distributional similarities between the verbs. In a second step we reduce the number of dependency relations to include only core arguments (excluding so-called complements). In a first general evaluation, we look at the overall correlation between the verb similarities produced by the models to gauge the extent to which they contain different information. We show that they, at least partially, capture comparable semantic distances. In a second evaluation, we analyse the models' performance on the task of finding semantically related verbs as recorded in Dutch WordNet and find that Word Space Models outperform Subcat Models. Finally, a third evaluation focuses on a subset of verbs that are the Dutch cognates of the German verbs in Schulte im Walde (2006). We compare how well the models' similarity matrices allow clustering of the verbs into a number of Levin-type classes. In this case, the semantically enriched Subcat Model gives the best results.