psmp3 anna smrz


Published on November 21, 2007

Author: Simo


Word Association Thesaurus as a Resource for Extending Semantic Networks

Anna Sinopalnikova (1, 2), Pavel Smrz (1)
(1) Faculty of Informatics, Masaryk University, Botanicka 68a, 602 00 Brno, Czech Republic
(2) Saint-Petersburg State University, Universitetskaya 11, Saint-Petersburg, Russia
{anna, smrz}

Overview:
- Motivation
- Word Association and other notions of psycholinguistics
- WAT vs. Corpus
- Semantic Information from WAT: core concepts, semantic primitives, syntagmatic and paradigmatic relations, domain information

Types of Semantic Resources used in NLP:

Motivation:
- There is still a need for an empirical basis for semantic network construction.
- Semantic Web initiatives.
- WATs are available for many languages.
- Nobody knows what they are good for and how to use them.

Word Association and other notions of psycholinguistics:
- Word Association
- Word Association Test
- Word Association Norms
- Word Association Thesaurus

Example:
Needle stimulates: -> thread: 41, pin: 13, sharp: 6, sew: 5, cotton: 2, dressmaker: 1, fix: 1, prick: 1, sewing: 1, sow: 1, spring: 1, stitch: 1, etc.

WATs explored:
- RAT: Russian WAT by Karaulov et al. (1994-1998): 8000 stimuli, 23000 words covered, 1000 subjects
- EAT: Edinburgh WAT by Kiss et al. (1972): 8400 stimuli, 54000 words covered, 1000 subjects
- Czech WAN (Novak et al., 1996): 150 stimuli, 4000 words covered, 250 subjects
Experience gained in projects:
- RussNet (a wordnet-like database for Russian linking lexical semantics with derivational morphology)
- Czech part of the BalkaNet project (a multilingual wordnet-like network for 5 Balkan languages and Czech)

WAT vs. Corpus:
History: Church & Hanks, 1990; Wettler & Rapp, 1993; Willners, 2001
- Bokrjonok 3.0: balanced corpus for Russian (16 mln words)
- BNC: British National Corpus (112 mln words)
- CNC: Czech National Corpus (160 mln words) and its unbalanced version (630 mln words)
Research procedure: 5000 pairs (e.g. cheese – mouse, dark – alley) were extracted at random from each WAN and then searched for in the corpora. The window span was fixed at -10; +10 words.

WAN vs. Corpus: Russian:
Quantitative analysis (Sinopalnikova, 2004):
- 64% of word associations do not occur in the corpus
- 49% when unique associations (those with absolute frequency = 1) are excluded
Qualitative analysis:
- a high proportion of syntagmatic associations is absent
- for verbs this figure was up to 84%

WAN vs. Corpus: Russian (2):

WAN vs. Corpus: English:
Quantitative analysis:
- 31% of word associations do not occur in the BNC
Qualitative analysis:
PARADIGMATIC  57.1
SYNTAGMATIC    8.4
DOMAIN        21.7
OTHER         12.8

WAN vs. Corpus: English (2):
- acquiring synonymy and hyponymy, e.g. sex – fornicate (archaic or humorous), ire (poetic) – anger, cowardly – yellow (slang)
- acquiring information about low-frequency words, e.g. perambulate (N_BNC = 3), fornicate (N_BNC = 6); cf. EAT: perambulate – walk: 30, pram: 17, baby: 9, push: 8, about: 1, dawdle: 1, move: 1, promenade: 1, slowly: 1, stroll: 1, through: 1, wander: 1, etc.
- acquiring domain relations; the portion of them absent from the corpus was surprisingly large for a corpus such as the BNC, e.g. ink-pot – pen: 24, non-violence – peace: 29, offside – soccer: 2

WAN vs. Corpus: Czech:
Quantitative analysis:
- 514 associations missing (10.28%)
Qualitative analysis:
- the proportion of syntagmatic and paradigmatic associations among them was similar to that for English

Extracting semantic information from WAT:
Associations:
- by form: 10% (e.g. know – no, yellow – mellow)
- by meaning: 90% (e.g. needle – sew, yellow – sun)
Information to extract: core concepts, semantic primitives, syntagmatic and paradigmatic relations, domain information

Core concepts:
In a WAT, some words can be observed to have an above-average number of direct links to other words.
- Russian: человек, мир, дом, жизнь, есть, думать, жить, идти, большой, хорошо, плохо, нет (не), новый, дерево etc. (295 words with more than 100 relations)
- English: man, sex, no (not), love, house; work, eat, think, go, live; good, old, small etc. (586 words with more than 100 relations)
- Czech: člověk, dům, strom; jíst, jít, myslet; moc, starý, velký, bílý, hezký etc.
These words determine the fundamental concepts of a particular language system, and thus should be incorporated into an ontology as its core components (e.g., SUMO upper concepts or EWN Base Concepts).

Semantic primitives:
A WAT could also provide a list of basic concepts associated with each separate word, thus revealing the semantics of a word (situation) as a list of semantic constituents, i.e. separate pieces of information.
Abstract words (verbs, adjectives or nouns denoting complex situations or emotional states) are difficult to decompose by means of logic and intuition.
E.g. depression could be reduced to:
- its constituents: sad 7, low 5, black 4, manic 4, sadness 3, bored 3, misery 2, tiredness 2, despair 1, gloom 1, grey 1, hopelessness 1, monotony 1, sick 1, mood 1, nerves 1, etc.
- its probable causes: rain 3, guilt 1, pain 1, unemployment 1
- its probable effects: suicide 1
- its antipodes: elation 3, fun 1, happiness 1, etc.

Syntagmatic and paradigmatic relations:
“Linguistic substitutes for reality”: WAs reflect the order of events in reality, the way objects are organized in space, and the way human beings experience them.
Associations by contiguity, e.g. cry – baby, may be treated as a manifestation of the syntagmatic relation between a verb and its subject, while take – hand manifests a ROLE_INSTRUMENT relation.
Generalization! E.g. drink – water, beer, milk, ale, Coca-cola, coffee, juice, etc. found in a WAT should be generalized as a drink ROLE_OBJECT beverage relation and incorporated into the semantic network in that form.

Syntagmatic and paradigmatic relations (2):
The law of contiguity cannot explain all associations.
Law of similarity, e.g.:
- inanimate – dead: 39 (SYNONYMY)
- seek – find: 56 (CAUSE relation)
- buy – sell: 56 (CONVERSIVE relation)
One of the main benefits of WAT: paradigmatic relations are given explicitly, as opposed to other sources of empirical data (e.g. text corpora).

Domain information:
WATs explicitly present the way common words are grouped together according to the fragments of reality they describe. E.g., hospital -> nurse, doctor, pain, ill, injury, load…
Types of domain relations:
- name of domain (situation) – domain member, e.g. hospital – nurse: 8, finance – money: 61, football – player: 4, marriage – husband: 2
- participant – participant, e.g. pepper – salt: 58, tamer – lion: 69, needle – thread: 41, mouse – cat: 22
- participant – circumstance, e.g. umbrella – rain: 58, actor – stage: 23
- participant – pointer to its function/role in the situation, e.g. larder – food: 58, envelope – letter: 60, actor – play: 15, etc.
Should the types of domain relations be differentiated within the semantic network, or included as a uniform IS_ASSOCIATED_TO relation?

Conclusions:
Advantages of using WAT in constructing a semantic network:
- Simplicity of data acquisition.
- Broad variety of semantic information to acquire.
- Empirical nature of the data extracted (as opposed to theoretical data, cf. conventional ontologies, taxonomies or classification schemes, which rely on the researcher’s introspection and intuition and hence lead to over- and under-estimation of the phenomena under consideration).
- Probabilistic nature of the data presented (the data reflect the relative rather than absolute relevance of semantic relations in each particular case).

Thank you...
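
A note on the research procedure from the "WAT vs. Corpus" slide: searching association pairs in a corpus with a window span of -10; +10 words could be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: the slides do not say how tokens were matched, so the sketch assumes exact matching on pre-tokenized text with no lemmatization, and the names `cooccurs` and `missing_share` are hypothetical.

```python
def cooccurs(tokens, stimulus, response, window=10):
    """Return True if `response` occurs within `window` tokens
    to the left or right of some occurrence of `stimulus`.

    Assumption: exact string matching on a flat token list.
    """
    for i, tok in enumerate(tokens):
        if tok != stimulus:
            continue
        left = tokens[max(0, i - window):i]        # up to `window` tokens before
        right = tokens[i + 1:i + 1 + window]       # up to `window` tokens after
        if response in left or response in right:
            return True
    return False


def missing_share(pairs, tokens, window=10):
    """Share of (stimulus, response) pairs never found in the corpus window."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    missing = [p for p in pairs if not cooccurs(tokens, p[0], p[1], window)]
    return len(missing) / len(pairs)


# Toy corpus and pairs, invented for illustration only.
tokens = "the mouse ate the cheese in a dark alley".split()
pairs = [("cheese", "mouse"), ("dark", "alley"), ("needle", "thread")]
share = missing_share(pairs, tokens)
```

On a real corpus one would probably match lemmas rather than surface forms, especially for Russian and Czech, which would likely lower the share of pairs reported as missing.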
