Task: Adding interactivity to our virtual course "Applied Computational
Linguistics"

Context: Our virtual course "Applied Computational Linguistics" is
centered around a tool for intelligent lexical lookup. Students of the
course should understand the basic linguistic and computational tasks
that have to be solved within such a software package. We already have
some tools available (tagger, chunk parser, tokenizer) and want to add
further tools, which should become available via interfaces on our web
site. The idea of the course is that students can play with the tools
and get fast access to the outcome of their manipulations.

Requirements: The candidate should have at least basic knowledge of
Perl and CGI. Some experience in Java programming would be welcome but
is not strictly necessary.

Benefits: The student would get a deeper insight into the design of
on-line teaching materials for computational linguistics.

Contact: Lothar Lemnitzer (lemnitzer@sfs.nphil.uni-tuebingen.de)

--------------------------------------------------------------------------

Task: Reducing full word forms to lemmas
Context: German Reference Corpus project (DeReko)

The aim of this task is to find the lemma of each word form in a given
text. There is a growing number of XML texts available with all tokens
already marked up. These tokens should be reduced to their lemmas, so
that, in the end, each token in the original text is annotated with
the corresponding lemma. Promising steps in this direction include:
1. building a lexicon of full forms from the corpus
2. using on-line German lexicons
3. analysing morphology with existing tools
4. building and implementing your own set of rules, or using machine
   learning methods (generic software for maximum entropy modelling /
   memory-based learning is available)

Skills likely to be useful:
- Perl or any other programming language supporting easy text processing
- Unix tools (sed, grep, sort, uniq etc.)
- Basic knowledge of German (esp. morphology)

Contact: Tylman Ule (ule@sfs.nphil.uni-tuebingen.de)

--------------------------------------------------------------------------

Task: Editor for POS tags
Context: German Reference Corpus project (DeReko)

The automatic POS (part-of-speech) taggers used to tag German text are
trained on data of which a fair amount is already available. There is,
however, still a need to produce more training data, and because manual
supervision is expensive, an editor for POS tags should offer an
environment that facilitates manual correction of the automatic
taggers' output. The editor should:
- consider tags chosen by several automatic taggers
- use statistics available from training data (frequency of token/tag
  combinations etc.)
- offer a list of the most likely tags to the human annotator
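One possible way to combine the three points above is sketched below in
Python, purely for illustration. The scoring formula, the function name
rank_tags, the weight parameter and the sample counts are assumptions
made for this sketch; the tag names are only examples.

```python
from collections import Counter

# Hypothetical token/tag frequencies taken from training data.
TRAINING_COUNTS = {
    "der": Counter({"ART": 950, "PRELS": 40, "PDS": 10}),
    "sein": Counter({"PPOSAT": 300, "VAINF": 200}),
}


def rank_tags(token, tagger_outputs, training_counts, tagger_weight=1.0):
    """Return candidate tags for `token`, most likely first.

    Each candidate is scored by its relative frequency in the training
    data plus `tagger_weight` for every automatic tagger that proposed it.
    """
    counts = training_counts.get(token, Counter())
    total = sum(counts.values())
    votes = Counter(tagger_outputs)

    scores = {}
    for tag in set(counts) | set(votes):
        freq = counts[tag] / total if total else 0.0
        scores[tag] = freq + tagger_weight * votes[tag]
    # Sort by descending score; break ties alphabetically for stable output.
    return sorted(scores, key=lambda tag: (-scores[tag], tag))


# Three taggers proposed ART, ART and PRELS for the token "der".
print(rank_tags("der", ["ART", "ART", "PRELS"], TRAINING_COUNTS))
```

The editor could present this ranked list as a menu, so that the
annotator usually only has to confirm the first entry.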

Skills likely to be useful:
- Perl or any other programming language supporting easy text processing
- XML processing (I am happy to explain this for Perl)
- basic knowledge of GUI programming (e.g. Perl/Tk)

Contact: Tylman Ule (ule@sfs.nphil.uni-tuebingen.de)

--------------------------------------------------------------------------

Task: Voting for POS tagging
Context: German Reference Corpus project (DeReko)

Four different machine learning techniques will be employed to assign
POS tags to German text. There will be software available implementing
these techniques. Applying the taggers results in XML text that has
each token marked up with a number of POS tags. The purpose of the
present task is to implement a voting scheme following
[Halteren98:combin] in order to increase tagging precision.
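As a starting point, plain majority voting can be sketched as follows
(in Python, purely for illustration). This is probably the simplest
combination scheme and is not meant to reproduce the schemes evaluated
in the cited paper; the tagger names, the tie-breaking order and the
function name majority_vote are hypothetical.

```python
from collections import Counter


def majority_vote(tags, priority):
    """Pick the tag proposed by most taggers for one token.

    `tags` maps a tagger name to the tag it assigned to the token.
    Ties are broken in favour of the tagger listed first in `priority`
    (a hypothetical ordering, e.g. by accuracy on held-out data).
    """
    votes = Counter(tags.values())
    best = max(votes.values())
    winners = {tag for tag, n in votes.items() if n == best}
    for tagger in priority:
        if tags.get(tagger) in winners:
            return tags[tagger]
    raise ValueError("no tagger in `priority` proposed a winning tag")


priority = ["tagger_a", "tagger_b", "tagger_c", "tagger_d"]
token_tags = {"tagger_a": "NN", "tagger_b": "NE", "tagger_c": "NE", "tagger_d": "NN"}
# Two votes each for NN and NE: the tie goes to tagger_a's choice.
print(majority_vote(token_tags, priority))
```

Weighted schemes can be obtained from this by replacing the vote count
with a per-tagger or per-tag weight estimated on held-out data.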

Skills likely to be useful:
- Perl or any other programming language supporting easy text processing
- XML processing (I am happy to explain this for Perl)
- Basic knowledge of German (to understand the meaning of the POS
  tags)

@InProceedings{Halteren98:combin,
  author =	 {Hans van Halteren and Jakub Zavrel and Walter
                  Daelemans},
  title =	 {Improving Data Driven Wordclass Tagging by System
                  Combination},
  crossref =	 {acl98},
  url =		 {ftp://ilk.kub.nl/pub/papers/coling98.ps.gz},
  urldate =	 {22.12.1999},
  note =	 {use for tagging},
}

Contact: Tylman Ule (ule@sfs.nphil.uni-tuebingen.de)

--------------------------------------------------------------------------

Task: Automatic Syntactic Validation of Corpora
      Automatic validation of bracketed corpora.

The aim of this work is to design a software tool that supports the
(semi-)automatic validation and correction of corpora that have been
annotated manually or automatically. The goal is to check the internal
consistency of the corpus annotation. Checking will take place at the
morphosyntactic level (tagging of words) and at the syntactic level
(bracketing).
The corpora used will be in English and German.
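
Two very simple consistency checks, one per level, are sketched below in
Python purely for illustration. The sketch assumes round brackets for
the syntactic annotation and (word, tag) pairs for the tagging; the
function names and the 10% threshold are hypothetical, and a real tool
would need much richer checks.

```python
from collections import Counter, defaultdict


def first_bracket_error(line):
    """Return the index of an unmatched bracket, or None if balanced."""
    open_positions = []
    for i, ch in enumerate(line):
        if ch == "(":
            open_positions.append(i)
        elif ch == ")":
            if not open_positions:
                return i  # closing bracket without a matching opener
            open_positions.pop()
    # Any opener still on the stack was never closed.
    return open_positions[0] if open_positions else None


def rare_tags(tagged_tokens, max_share=0.1):
    """List (word, tag, count) triples whose tag is unusually rare for the word."""
    by_word = defaultdict(Counter)
    for word, tag in tagged_tokens:
        by_word[word][tag] += 1
    suspects = []
    for word, tags in by_word.items():
        total = sum(tags.values())
        for tag, n in tags.items():
            if n / total <= max_share:
                suspects.append((word, tag, n))
    return sorted(suspects)


print(first_bracket_error("(S (NP (DT the) (NN dog)) (VP (VBZ barks)))"))
print(first_bracket_error("(S (NP (DT the) (NN dog) (VP (VBZ barks)))"))
tokens = [("the", "DT")] * 19 + [("the", "NN")]
print(rare_tags(tokens))
```

Rare word/tag combinations are only candidates for correction, which is
why the validation is described as (semi-)automatic.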

Language: Java/C++/Perl

Contact: Hervé Déjean (dejean@sfs.nphil.uni-tuebingen.de)

----------------------------------------------------------------------

Task: Implementation of different heuristics for the tree construction
component of the chunk parser.

Context: The work will be part of Verbmobil, a speech-to-speech
translation system. The tree construction component builds complete
trees out of chunks (partial analyses) by comparing them with trees in a
treebank. For both languages (English and German), the tree construction
makes characteristic mistakes, e.g. it cuts off trees after a
preposition. Tagging errors could also be corrected in the chunk context.

Skills: good programming knowledge in C, knowledge of German and/or
English, basic knowledge of syntax

Contact: Sandra Kuebler (kuebler@sfs.nphil.uni-tuebingen.de)