Lexicography vs. Terminology
Martin Volk
Computational Linguistics, Stockholm University
volk@ling.su.se

Overview
- Thesauri and related terms
- Manual thesaurus construction
- Lexicography vs. Terminography
- Using a monitor corpus for lexicography

Thesaurus definition
from http://www.collectionscanada.ca/8/4/r4-282-e.html
A thesaurus is a tool used for vocabulary control. Using a thesaurus improves search results.
A thesaurus is a sub-set of the language we use in daily life. It includes information about the relationships of words and phrases (i.e. broader terms, narrower terms, preferred terms, non-preferred terms, or related terms).
A thesaurus is normally restricted to a specific subject field (e.g. health, education, government documents).

Thesaurus-related terms
from http://www.willpower.demon.co.uk/thesbibl.htm
Ontologies appear to be a development of a thesaurus, with a greater variety of relationships between concepts, intended to be useful to implementation aspects of the "semantic web".
Topic maps are a way of structuring such an ontology with the addition of links between concepts and information resources.

Controlled vocabulary
from http://www.collectionscanada.ca/8/4/r4-282-e.html
A controlled vocabulary is an established list of standardized terminology for use in indexing and retrieval of information.
An example of a controlled vocabulary is subject headings used to describe library resources.
A controlled vocabulary ensures that a subject will be described using the same preferred term each time it is indexed, and this will make it easier to find all information about a specific topic during the search process.
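
The effect of a controlled vocabulary on indexing can be illustrated with a minimal Python sketch. All vocabulary entries and the function names `preferred_term` and `index_document` are invented for this illustration; they are assumptions, not taken from any of the cited sources.

```python
from typing import List, Optional

# Illustrative controlled vocabulary (all entries are invented for this sketch):
# each variant, in normalized lower-case form, maps to one preferred term.
PREFERRED = {
    "automobiles": "AUTOMOBILES",
    "cars": "AUTOMOBILES",
    "motor cars": "AUTOMOBILES",
    "physicians": "PHYSICIANS",
    "doctors": "PHYSICIANS",
}


def preferred_term(term: str) -> Optional[str]:
    """Return the preferred term for a variant, or None if it is not in the vocabulary."""
    key = " ".join(term.lower().split())  # normalize case and whitespace
    return PREFERRED.get(key)


def index_document(candidate_terms: List[str]) -> List[str]:
    """Map candidate terms to preferred terms, dropping unknown terms and duplicates."""
    descriptors: List[str] = []
    for term in candidate_terms:
        preferred = preferred_term(term)
        if preferred is not None and preferred not in descriptors:
            descriptors.append(preferred)
    return descriptors


print(index_document(["Cars", "motor  cars", "Doctors", "bicycles"]))
# prints ['AUTOMOBILES', 'PHYSICIANS']
```

Because every variant resolves to the same preferred term, a search for AUTOMOBILES also finds the items whose text only mentions "cars".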

Ontology
from http://www.collectionscanada.ca/8/4/r4-282-e.html
Used in philosophy for centuries, ontology is the study of the nature and relation of being.
The term is now used in the fields of information science and artificial intelligence to mean the hierarchical structuring of knowledge using a set of concepts that are specified in order to create an agreed-upon vocabulary for exchanging information.

What is a thesaurus?
http://www.bayside-indexing.com/Milstead/about.htm
For writers, it is a tool like Roget's: one with words grouped and classified to help select the best word to convey a specific nuance of meaning.
For indexers and searchers, it is an information storage and retrieval tool: a listing of words and phrases authorized for use in an indexing system, together with relationships, variants and synonyms, and aids to navigation through the thesaurus.

Introductory Tutorial on (Manual) Thesaurus Construction
based on a web tutorial by Tim Craven at http://instruct.uwo.ca/gplis/677/thesaur/main00.htm

Sources of Terms
Sources from which terms can be collected include:
- existing lists of terms
  - other thesauri, indexes, dictionaries, glossaries, etc.
- texts from which terms can be extracted
  - titles, abstracts, or full texts of indexed items
  - queries by patrons
- people
  - subject specialists, etc.

What Kinds of Terms to Collect?
Where possible, terms in a thesaurus should be nouns or noun phrases.
A term should be general enough that it might be used to index a number of items. For example, a thesaurus usually does not include proper names.
But a term should not be so general that it might be used to index too many of the items in the thesaurus' subject area.
For example, the term "NEWS" would not be much use in a thesaurus for indexing news items.

Standardizing the Form of Words
see http://instruct.uwo.ca/gplis/677/thesaur/main03.htm

Equivalent terms
- synonyms (incl. spelling variants such as "AESTHETICS" and "ESTHETICS")
- quasi-synonyms
- terms with overlapping meanings
- terms whose scope is included in another term. For example, "STEEL" might be treated as equivalent to "METAL".

Choosing preferred terms
see http://instruct.uwo.ca/gplis/677/thesaur/main04.htm

Semantic Categories of the Related-To References
see http://instruct.uwo.ca/gplis/677/thesaur/main06.htm

Thesaurus criteria
- Exhaustivity: all the themes, objects and concepts dealt with by the document are to be found in the index.
- Selectivity: only information of interest to users has been selected.
- Specificity: the description represents the contents of the document as accurately as possible and avoids over-general or over-precise descriptors where specific or less precise terms would be more appropriate.
- Consistency: another indexer or a user would normally describe the same document, or documents on the same subject, in the same way.

Lexicography vs. Terminography

Structure of a lexicon entry
Representation
- Lemma
- Pronunciation, stress, hyphenation
- Special word forms
- Orthographic variants
- Classification as short form, abbreviation etc.

Structure of a lexicon entry
Explication
- Descriptive or iconic representation
- Grammatical information
- Paradigmatic information
- Syntagmatic information
- Phrase usage
- Stylistic, areal, domain-specific, dialectal classification
- Word formation
- Word history (meaning change, foreign language influence, etymology)
Demonstration
- Usage examples
- Statistical information
- Bibliographical references

Goals

Procedure

Type of description

Types of entries

Structure of a terminology entry
- Source language term
- Definition
- Usage example
- Source
- Synonyms
- Cross references
- Grammatical information
- (Multiple) target language expression(s)
- General info: date, author, status

Example entry:
<Source> IEC
<Source-Doc> PART 351: Automatic control
<EN> controlled variable
<EN-Def> An output variable of the controlled system which is intended to be acted upon by one or more of the manipulated variables. (see figure 1)
<DE> Regelgrösse
<FR> variable commandée
<FR-Syn> variable réglée

Computer use in terminology

Using a monitor corpus
A monitor corpus is a constantly evolving corpus, rather than a fixed corpus (like SUC).

The "Wortwarte" project
at the University of Tübingen (Lothar Lemnitzer, Tylman Ule)
http://www.sfs.nphil.uni-tuebingen.de/~lothar/nw/index.html
- a monitor corpus project for German
- daily download of newspapers: Die Zeit, Die Welt, Financial Times Deutschland, Rheinische Post etc.
- comparison of the words contained in these papers against a word list of the "Deutsches Referenzkorpus", a corpus of German that consists of 120 million tokens (~ 2.3 million types)

The "Wortwarte" project
What is found:
- misspelled words
- words with unusual spelling (e.g. Betel-Nuss instead of Betelnuss), partly because of the orthography reform
- regular words that are not in the reference corpus by accident
- words from spoken language without fixed spelling (boahh, iiiih etc.)
- new words!

The "Wortwarte" project
New words are seldom really new. They are
- based on product names ("Nogger Dir einen"),
- derivations ("faxen"),
- often compounds ("rüberfaxen"),
- loan words (mostly from English),
- abbreviations.

The "Wortwarte" project
Reasons for new words
- new things are being introduced (Handy, 'mobile phone')
- new words sound better than their predecessors (Banker vs. Bankangestellter, 'bank employee')

Summary
- Thesauri and ontologies are closely related.
- A term collection (terminology) also describes domain knowledge, but with a focus on translation.
- Lexicography and terminography differ substantially in corpus, methods and entries.
- A monitor corpus can be used to find new lexical items.
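
The word-list comparison described for the "Wortwarte" project can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual implementation: the tokenizer, the function names `tokenize` and `new_word_candidates`, the tiny reference word list, and the sample sentence are all invented here.

```python
import re
from collections import Counter
from typing import Dict, List, Set

# Assumed tokenizer: runs of letters, optionally joined by hyphens (e.g. "Betel-Nuss").
TOKEN = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def tokenize(text: str) -> List[str]:
    """Split text into word tokens; punctuation and digits are dropped."""
    return TOKEN.findall(text)


def new_word_candidates(daily_text: str, reference_types: Set[str]) -> Dict[str, int]:
    """Return every token type (with its frequency) that is absent from the reference word list."""
    counts = Counter(tokenize(daily_text))
    return {word: n for word, n in counts.items() if word not in reference_types}


# Invented stand-ins for the reference word list and one day's newspaper text.
reference = {"Wir", "ein", "Dokument", "faxen"}
today = "Wir rüberfaxen ein Dokument. Wir faxen."

print(new_word_candidates(today, reference))
# prints {'rüberfaxen': 1}
```

The result is only a candidate list. As noted above, it will also contain misspellings, unusual spellings and words missing from the reference corpus by accident, so the genuinely new words still have to be picked out in a separate step.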

