Summer project "Abbreviations in large text corpora"

The tasks

1. Design and implement a method for identifying abbreviations and
   their definitions in a broad variety of textual sources.

   A definition typically appears once per document.  Often, the
   abbreviation is given in parentheses after the technical term it
   abbreviates, but other forms may occur.
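
   As a starting point, the parenthesis pattern could be matched along
   the following lines.  This is only a minimal sketch: the regular
   expression, the function name find_definitions and the rule "one
   preceding word per abbreviation letter" are assumptions made for
   illustration, and they only cover plain initialisms.

```python
import re

# Assumed pattern: a short token starting with an upper-case letter, in parentheses.
PAREN = re.compile(r"\(([A-Z][A-Za-z0-9]{1,9})\)")

def find_definitions(text):
    """Return (abbreviation, long_form) pairs for 'long form (ABBR)' patterns."""
    pairs = []
    for match in PAREN.finditer(text):
        abbr = match.group(1)
        words = text[:match.start()].split()
        # Assumption: one preceding word per letter of the abbreviation.
        candidate = words[-len(abbr):]
        initials = "".join(word[0] for word in candidate).upper()
        if len(candidate) == len(abbr) and initials == abbr.upper():
            pairs.append((abbr, " ".join(candidate)))
    return pairs

print(find_definitions("We train a hidden Markov model (HMM) on the corpus."))
```

   Definitions that skip function words or use several letters per
   word would need a looser alignment than this one.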

   Abbreviations can be identified in several ways:
    - form: upper-case characters, no vowels, trailing dot
    - membership in an abbreviation list
    - absence from a general dictionary
    - typical part-of-speech sequences in the context
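
   The form and list cues above could be combined roughly as follows.
   This is a sketch under assumptions: the names form_cues and
   is_candidate, the threshold min_cues and the toy word lists are
   hypothetical, and the part-of-speech cue is left out because it
   needs a tagger.

```python
VOWELS = set("aeiouAEIOU")

def form_cues(token, abbreviation_list=frozenset(), dictionary=frozenset()):
    """Return one boolean per cue for a single token."""
    core = token.rstrip(".")
    letters = [c for c in core if c.isalpha()]
    return {
        "all_upper": len(letters) >= 2 and all(c.isupper() for c in letters),
        "no_vowels": bool(letters) and not any(c in VOWELS for c in letters),
        "trailing_dot": token.endswith(".") and bool(core),
        "in_abbreviation_list": token in abbreviation_list,
        # Only counts as a cue when a dictionary was actually supplied.
        "not_in_dictionary": bool(dictionary) and core.lower() not in dictionary,
    }

def is_candidate(token, min_cues=2, **resources):
    """Flag a token when at least min_cues cues fire (threshold is an assumption)."""
    return sum(form_cues(token, **resources).values()) >= min_cues

toy_dictionary = frozenset({"the", "cat", "sat"})
for tok in ["NATO", "Dr.", "cat", "cat."]:
    print(tok, is_candidate(tok, dictionary=toy_dictionary))
```

   Requiring two cues keeps a sentence-final "cat." from being flagged
   on the trailing dot alone; a real system would tune or learn this
   combination.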

   Condition of success: the system should achieve high precision and
   recall on texts from a variety of sources.
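
   Precision and recall could be measured against a hand-annotated
   gold standard as sketched below.  The function name
   precision_recall and the example pairs are hypothetical; exact
   matching of (abbreviation, definition) pairs is an assumption, and
   partial matches may deserve credit in practice.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall over sets of (abbreviation, definition) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Made-up example data, for illustration only.
gold = {
    ("HMM", "hidden Markov model"),
    ("FST", "finite state transducer"),
    ("POS", "part of speech"),
    ("NLP", "natural language processing"),
}
predicted = {
    ("HMM", "hidden Markov model"),
    ("NLP", "language processing"),
}
print(precision_recall(predicted, gold))
```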


2. Typology: Find and investigate existing schemes of classifying
   abbreviations.

   Bowden, Lindsay, Halstead '98 distinguish acronym, initialism, and
   shortening, but there may be other types, or a more fine-grained
   distinction may be useful.

   Condition of success: Definition of a typology that covers all the
   cases we are interested in.


3. Build some simple methods for generating abbreviations, given the
   expanded form and some additional constraints.  Look for
   implementations based on (weighted?) finite-state transducers.
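
   To make the task concrete, a few generation rules are sketched
   below.  These are plain string rules, not a finite-state
   transducer; the function names, the vowel-dropping rule and the
   max_length constraint are assumptions chosen for illustration, and
   a non-empty expanded form is assumed.

```python
VOWELS = set("aeiou")

def initialism(expansion):
    """First letter of every word, upper-cased."""
    return "".join(word[0].upper() for word in expansion.split())

def shortening(expansion, length=4):
    """Truncate the first word to `length` characters and add a trailing dot."""
    first = expansion.split()[0]
    return first[:length] + "." if len(first) > length else first

def drop_vowels(expansion):
    """Keep each word's first letter and drop all later vowels."""
    return " ".join(
        word[0] + "".join(c for c in word[1:] if c.lower() not in VOWELS)
        for word in expansion.split()
    )

def generate(expansion, max_length=None):
    """Return candidate abbreviations, optionally filtered by a length constraint."""
    candidates = {
        "initialism": initialism(expansion),
        "shortening": shortening(expansion),
        "vowel_dropping": drop_vowels(expansion),
    }
    if max_length is not None:
        candidates = {k: v for k, v in candidates.items() if len(v) <= max_length}
    return candidates

print(generate("finite state transducer", max_length=5))
```

   In a transducer-based version, each rule would correspond to arcs
   that keep or delete characters, with weights ranking the competing
   outputs.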

   Condition of success: The automatically generated abbreviations look
   plausible and cover a large part of the different types found in
   subtask 2.


Required profile

- Good knowledge of English (supervision and most of the data to work on
  will be in English).

- Programming experience with Perl or similar languages.

- Interest in data-intensive research methodologies.