Job announcement: Computational Linguistics, Sociolinguistics, Text/Corpus Linguistics, Postdoc, University of Leuven, Belgium.
University or Organization: University of Leuven
Department: Department of Linguistics
Web Address: http://wwwling.arts.kuleuven.be/qlvl/
Job Rank: Post Doc
Specialty Areas: Computational Linguistics; Sociolinguistics; Text/Corpus Linguistics
The research unit Quantitative Lexicology and Variational Linguistics (http://wwwling.arts.kuleuven.be/qlvl/) at the University of Leuven invites applications for a three-year post-doctoral position (01.01.2009 - 31.12.2011) in the research project 'Sociolectometry and lexical variation'. The project aims to contribute to the development of a quantitative, corpus-based lexical sociolectometry. On the basis of large (available) corpora of Dutch, the project will analyse to what extent different language varieties exhibit different word choice preferences in referring to specific concepts. Doing this for many concepts makes it possible to get a systematic overview of lexical variation in Dutch and to measure the divergence in word use between varieties, hence the term sociolectometry. However, such a large-scale investigation will require an automated approach to analysing word choices. Therefore, a key aspect of the project is the incorporation of novel methods of automatic semantic analysis into the lectometric framework previously developed by the research unit. Computational semantic techniques, like semantic vector space models, are already used in the research unit for exploring semantic relations between words. The successful applicant will co-ordinate and spearhead the integration of these techniques into the lectometric framework and thus develop tools for large-scale lexical variation research. The applicant will co-operate closely with team members in comparing the linguistic and extra-linguistic lexical knowledge captured by semantic vector space models with experimental psycholinguistic findings.
The basic principle behind the lectometric approach is quite simple: Define the set of synonyms that can refer to a given concept and then count how often the language varieties under investigation use these synonyms. The difference in the relative frequency of synonym use can then be regarded as a measure of divergence in word choice preferences between two varieties. As an example, take the concept of 'a large strong motor vehicle for transporting goods', and the two synonyms lorry and truck that can refer to it in English. Suppose representative corpora show that British English uses lorry 85% of the time and truck only 15%, whereas Australian English uses lorry in only 10% of the cases and truck in 90%. This means there is an overlap of 25% of the cases where both varieties use the same word to refer to the concept (10% for lorry plus 15% for truck). The complement of this overlap, here 75%, can then be used as a distance measure between British and Australian English. Calculating this distance for many concepts and aggregating over them can provide an image of the general lexical divergence between the two varieties. This basic framework has been supplemented with additional statistical machinery like concept weighting or more sophisticated distance measures and it has proven its great value for studying lexical variation in Dutch (Geeraerts et al. 1999) and Portuguese (Soares Da Silva 2005). However, an important issue for the approach is scalability. The manual definition of synonym sets and their disambiguation in corpora is very time-consuming and needs to be automated in order to implement the framework on a really large scale. That's where computational semantics comes in. Automatic synonymy extraction and word sense disambiguation can bypass this bottleneck. Within the research unit, semantic vector space models are already used for the large-scale automated modelling of semantic relations in an effort to gain a better understanding of the relations between context and word meaning.
The successful applicant can take advantage of this in-house expertise not just for scaling up the lectometric framework but also for updating it and exploring new avenues of lectometric research that are opened up by incorporating automated methods.
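The basic lectometric calculation described above can be illustrated with a minimal Python sketch. It is not the research unit's implementation: the function names are hypothetical, the counts are invented so that they reproduce the lorry/truck percentages of the example, the distance is taken to be one minus the overlap, and the aggregation is a plain unweighted mean, whereas the actual framework is said to use concept weighting and more sophisticated distance measures.

```python
from typing import Dict, List, Tuple

Counts = Dict[str, int]


def relative_frequencies(counts: Counts) -> Dict[str, float]:
    """Turn raw synonym counts for one concept into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations for this concept")
    return {word: n / total for word, n in counts.items()}


def overlap(a: Counts, b: Counts) -> float:
    """Share of cases in which both varieties use the same synonym."""
    fa, fb = relative_frequencies(a), relative_frequencies(b)
    words = set(fa) | set(fb)
    return sum(min(fa.get(w, 0.0), fb.get(w, 0.0)) for w in words)


def distance(a: Counts, b: Counts) -> float:
    """Assumption: distance is one minus the overlap."""
    return 1.0 - overlap(a, b)


def mean_distance(concepts: List[Tuple[Counts, Counts]]) -> float:
    """Assumption: unweighted mean over concepts (no concept weighting)."""
    return sum(distance(a, b) for a, b in concepts) / len(concepts)


# Invented counts matching the percentages in the example above.
british = {"lorry": 85, "truck": 15}
australian = {"lorry": 10, "truck": 90}

print(round(overlap(british, australian), 2))   # overlap for this concept
print(round(distance(british, australian), 2))  # distance for this concept
```

For a single concept the result depends only on the relative frequencies, so corpora of different sizes can be compared directly. Passing a list of (variety A, variety B) count pairs to mean_distance gives one aggregate figure over several concepts.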
The project is primarily concerned with fundamental research into lexicology. However, it does touch upon issues in neighbouring domains and the insights gained in the project are likely to advance research there too. Therefore, applicants with different backgrounds are encouraged to apply.
Close ties exist to other branches of linguistics that analyse variation. Lexical variation is of course a legitimate research object in Sociolinguistics, but it has received relatively little attention there. A methodology based on the statistical analysis of large text corpora can fill that gap.
Lexical variation has been studied intensively and quantitatively within dialectometry. However, this tradition has focused on historical data of local dialects (as opposed to the standard language). The project offers an extension of this work in that it studies contemporary varieties on a cline from regional through substandard to standard.
Stylometry in the tradition of Biber et al. has studied the differences between situationally defined registers in a statistically advanced and corpus-based way, but the approach has not paid much attention to lexical variables, which are also strongly indicative of certain registers. The same holds for studies into authorship identification and forensic linguistics, where independently determined lexical differences between varieties could help identify specific speakers. The project offers the possibility to investigate such lexical variables.
Apart from these 'variationist' fields of research, there are also domains that are methodologically related because they are the prime users of computational semantic techniques, but for which the relevance of lexical variation research might seem less obvious at first sight. However, they too could greatly benefit from a better insight into lexical variation.
Semantic vector space models are primarily used in Computational Linguistics for tasks related to information retrieval. Basically, these approaches try to find texts about the same topic on the basis of the similar words that appear in them. From that perspective, the fact that one concept can be expressed by multiple words introduces 'noise' that has to be circumvented by using latent semantic structures or ontology-based expansion of search terms. However, this 'noise' can be highly informative of a text's genre, register or variety. Exploiting this information may lead to genre-specific search or aid with the 'translation' of scientific prose into popularized versions.
In Cognitive Psychology, semantic vector spaces are used in categorization research to model word associations and semantic memory. Although contextual effects are increasingly taken into account in these studies, the register or sociolinguistic properties of words, to which speakers are often very sensitive, have been largely neglected. The project offers a way to factor in these effects. This link with Cognitive Psychology is particularly important for the current project. It ties in with a parallel project conducted within the research unit, in which we study categorization processes - like prototype effects and basic level phenomena - both from a corpus-based and an experimental, psycholinguistic point of view.
Although the successful applicant will focus primarily on the development of lectometric tools, he or she will be given the opportunity to relate the results to his or her original field of interest.
Candidates must have completed all requirements for their PhD degree by the time of appointment. The ideal candidate should have one of the following profiles:
Candidates with experience in the statistical analysis of natural language and/or the use of scripting languages are especially encouraged to apply.
Because the successful candidate will work as a member of a project group, team spirit is indispensable.
Knowledge of Dutch is not required at the time of application, but non-native candidates will be asked to acquire a working knowledge of Dutch in the first year of their employment.
Funding is guaranteed as of October 1, 2008, but depending on the availability of the candidate, employment may start between October 1, 2008 and January 1, 2009.
Please send applications to both Dirk Geeraerts and Dirk Speelman.
Closing date: 1 December 2008