Summer project "Abbreviations in large text corpora"

The tasks

1. Design and implement a method for identifying abbreviations and their definitions in a broad variety of textual sources. A definition typically appears once per document. Often the abbreviation is given in parentheses after the technical term it abbreviates, but other forms may occur.

Abbreviations can be identified in several ways:
- by form: upper-case characters, no vowels, trailing dot
- by being contained in an abbreviation list
- by not being in a general dictionary
- by typical part-of-speech sequences in the context

Condition of success: The system should achieve high precision and recall on texts from a variety of sources.

2. Typology: Find and investigate existing schemes for classifying abbreviations. Bowden, Lindsay, Halstead '98 distinguish acronym, initialism, and shortening, but there may be other classes, or a more fine-grained distinction may be useful.

Condition of success: Definition of a typology that covers all the cases we are interested in.

3. Build some simple methods for generating abbreviations, given the expanded form and some additional constraints. Find implementations based on (weighted?) finite-state transducers.

Condition of success: The automatically generated abbreviations look plausible and cover a large part of the different types found in subtask 2.

Required profile

- Good knowledge of English (supervision and most of the data to work on will be in English)
- Programming experience with Perl or similar languages
- Interest in data-intensive research methodologies
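
As an illustration of subtask 1, the "abbreviation in parentheses after the term" pattern combined with the form-based cues could be sketched roughly as follows. This is only a minimal heuristic under stated assumptions, not the method the project is expected to deliver: the function names `looks_like_abbreviation` and `find_definitions`, the length limits in the pattern, and the 0.5 upper-case threshold are all hypothetical choices made for the example.

```python
import re

# Candidate short form in parentheses: 2 to 10 characters, no spaces.
# The length limits are an assumption made for this sketch.
PAREN = re.compile(r"\(([A-Za-z0-9][A-Za-z0-9.\-]{1,9})\)")


def looks_like_abbreviation(token):
    """Form-based cues: mostly upper case, no vowels, or a trailing dot."""
    letters = [c for c in token if c.isalpha()]
    if not letters:
        return False
    upper_ratio = sum(c.isupper() for c in letters) / len(letters)
    has_vowel = any(c.lower() in "aeiou" for c in letters)
    return upper_ratio >= 0.5 or not has_vowel or token.endswith(".")


def find_definitions(text):
    """Return (abbreviation, long form) pairs for the pattern 'long form (ABBR)'.

    A pair is accepted only if the words directly before the parenthesis
    start with the characters of the abbreviation, one word per character.
    """
    results = []
    for match in PAREN.finditer(text):
        short = match.group(1)
        if not looks_like_abbreviation(short):
            continue
        initials = [c.lower() for c in short if c.isalnum()]
        # Words preceding the parenthesis; hyphenated terms are split.
        words = re.findall(r"[A-Za-z0-9]+", text[:match.start()])
        candidate = words[-len(initials):]
        if len(candidate) == len(initials) and all(
            w[0].lower() == i for w, i in zip(candidate, initials)
        ):
            results.append((short, " ".join(candidate)))
    return results


if __name__ == "__main__":
    sample = "We use a finite state transducer (FST) for generation."
    print(find_definitions(sample))
```

Matching one word per character is deliberately strict: it will miss definitions that skip function words or take several letters from one word, so it favours precision over recall. The other cues listed under subtask 1 (abbreviation lists, dictionary lookup, part-of-speech context) are not covered by this sketch.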