Task: Adding interactivity to our virtual course "Applied Computational Linguistics"

Context: Our virtual course "Applied Computational Linguistics" is centered around a tool for intelligent lexical lookup. Students of the course should understand the basic linguistic and computational tasks that have to be solved within such a software package. We have some tools available (tagger, chunk parser, tokenizer) and want to add further tools, which should become available via interfaces on our web site. The idea of the virtual course is that students can play with the tools and get fast access to the outcome of their manipulations.

Requirements: The candidate should have at least basic knowledge of Perl and CGI. Some experience in Java programming would be welcome but is not strictly necessary.

Benefits: The student would gain deeper insight into the design of on-line teaching materials for computational linguistics.

Contact: Lothar Lemnitzer (lemnitzer@sfs.nphil.uni-tuebingen.de)

--------------------------------------------------------------------------

Task: Reducing full word forms to lemmas

Context: German Reference Corpus project (DeReko)

The aim of this task is to find the lemma of each word form in a given text. A growing number of XML texts is available in which all tokens are already marked up. These tokens should be reduced to their lemmas, so that, in the end, each token in the original text is annotated with the corresponding lemma. Promising steps in this direction include
1. building a lexicon of full forms from the corpus
2. using on-line German lexicons
3. analysing morphology with existing tools
4. building and implementing your own set of rules, or using machine learning methods (generic software for maximum entropy modelling / memory-based learning is available)

Skills likely to be useful:
- Perl or any other programming language supporting easy text processing
- Unix tools (sed, grep, sort, uniq etc.)
- Basic knowledge of German (esp.
morphology)

Contact: Tylman Ule (ule@sfs.nphil.uni-tuebingen.de)

--------------------------------------------------------------------------

Task: Editor for POS tags

Context: German Reference Corpus project (DeReko)

The automatic POS (part-of-speech) taggers used to tag German text are trained on data of which a fair amount is already available. There is, however, still a need to produce more training data. Because manual supervision is expensive, an editor for POS tags should offer an environment that facilitates manual correction of the automatic taggers' output. The editor should
- consider tags chosen by several automatic taggers
- use statistics available from training data (frequency of token/tag combinations etc.)
- offer a list of the most likely tags to the human annotator

Skills likely to be useful:
- Perl or any other programming language supporting easy text processing
- XML processing (I am happy to explain this for Perl)
- Basic knowledge of GUI programming (e.g. Perl/Tk)

Contact: Tylman Ule (ule@sfs.nphil.uni-tuebingen.de)

--------------------------------------------------------------------------

Task: Voting for POS tagging

Context: German Reference Corpus project (DeReko)

Four different machine learning techniques will be employed to assign POS tags to German text. Software implementing these techniques will be available. Applying the taggers results in XML text in which each token is marked up with a number of POS tags. The purpose of the present task is to implement a voting scheme following [Halteren98:combin] in order to increase tagging precision.
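To give an idea of what such a voting scheme involves, here is a minimal sketch of unweighted majority voting for a single token. It is only an illustration under assumptions, not the method prescribed by the project or the implementation from the cited paper: the tagger names, the function name `vote`, and the tie-breaking rule (prefer the tagger listed first in a priority list) are all hypothetical, and it is written in Python rather than Perl purely for brevity.

```python
from collections import Counter


def vote(tags, priority=None):
    """Pick one POS tag for a token from the proposals of several taggers.

    tags:     dict mapping a (hypothetical) tagger name to its proposed tag.
    priority: tagger names, most trusted first; used only to break ties.
              This tie-breaking rule is an assumption made for this sketch.
    """
    counts = Counter(tags.values())
    best = max(counts.values())
    # All tags that received the highest number of votes.
    winners = {tag for tag, n in counts.items() if n == best}
    if len(winners) == 1:
        return next(iter(winners))
    # Tie: take the proposal of the most trusted tagger among the winners.
    for name in priority or sorted(tags):
        if tags.get(name) in winners:
            return tags[name]
    # Fallback if the priority list names none of the tied taggers.
    return sorted(winners)[0]


# Clear majority: three of four taggers agree.
print(vote({"a": "NN", "b": "NN", "c": "NE", "d": "NN"}))
# Two-against-two tie, resolved in favour of tagger "b".
print(vote({"a": "NN", "b": "NE", "c": "NE", "d": "NN"},
           priority=["b", "a", "c", "d"]))
```

A weighted variant would replace the plain vote counts with per-tagger weights estimated on held-out training data; which weighting works best for the German data is exactly what the task would need to find out.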
Skills likely to be useful:
- Perl or any other programming language supporting easy text processing
- XML processing (I am happy to explain this for Perl)
- Basic knowledge of German (to understand the meaning of the POS tags)

@InProceedings{Halteren98:combin,
  author   = {Hans van Halteren and Jakub Zavrel and Walter Daelemans},
  title    = {Improving Data Driven Wordclass Tagging by System Combination},
  crossref = {acl98},
  url      = {ftp://ilk.kub.nl/pub/papers/coling98.ps.gz},
  urldate  = {22.12.1999},
  note     = {use for tagging},
}

Contact: Tylman Ule (ule@sfs.nphil.uni-tuebingen.de)

--------------------------------------------------------------------------

Task: Automatic Syntactic Validation of Corpora

Automatic validation of bracketed corpora. The aim of this work is to design software that supports the (semi-)automatic validation and correction of corpora that have been annotated manually or automatically. The goal is to check the internal consistency of the corpus annotation. Checking will take place at the morphosyntactic level (word tagging) and at the syntactic level (bracketing). The corpora used will be in English and German.

Programming language: Java/C++/Perl

Contact: Hervé Déjean (dejean@sfs.nphil.uni-tuebingen.de)

--------------------------------------------------------------------------

Task: Implementation of different heuristics for the tree construction component of the chunk parser

Context: The work will be part of Verbmobil, a speech-to-speech translation system. The tree construction component builds complete trees out of chunks (partial analyses) by comparing them with trees in a treebank. The component makes characteristic mistakes in both languages (English and German); for example, it cuts off trees after a preposition. Tagging errors could also be corrected in the chunk context.

Skills: good programming knowledge in C, knowledge of German and/or English, basic knowledge of syntax

Contact: Sandra Kuebler (kuebler@sfs.nphil.uni-tuebingen.de)