Knoblock p123


Published on November 15, 2007

Author: Nikita

Source: authorstream.com

An Automatic Approach to Semantic Annotation of Unstructured, Ungrammatical Sources: A First Look
Matthew Michelson & Craig A. Knoblock
University of Southern California / Information Sciences Institute

Unstructured, Ungrammatical Text
[Example slides showing classified-ad posts, with the car model and car year highlighted.]

Semantic Annotation
Post: "02 M3 Convertible .. Absolute beauty!!!"
Annotation: <Make>BMW</Make> <Model>M3</Model> <Trim>2 Dr STD Convertible</Trim> <Year>2002</Year>
The goal is to "understand" and query the posts (one can query on BMW even though "BMW" never appears in the post). Note that this is not extraction: the values are not pulled out of the post; they are implied.

Reference Sets
- Annotation/extraction is hard: we can't rely on structure (wrappers) or on grammar (NLP).
- Reference sets are the key (IJCAI 2005).
- Match posts to reference set tuples: the tuples give a clue to the attributes in the posts and provide normalized attribute values when matched.

Reference Sets
- Collections of entities and their attributes: relational data!
- Example: scrape the make, model, trim, and year for all cars from 1990-2005.

Contributions
A new unsupervised approach in two steps, drawing on a reference set repository that grows over time, increasing coverage. Posts feed into (1) an unsupervised reference set chooser and then (2) unsupervised record linkage, yielding unsupervised semantic annotation.

Choosing a Reference Set
Vector space model: the set of posts is one document, and each reference set is one document. Select the reference set most similar to the set of posts.
Example posts: "FORD Thunderbird - $4700"; "2001 White Toyota Corrolla CE Excellent Condition - $8200"
Similarities: Cars 0.7, Hotels 0.4, Restaurants 0.3 (average 0.47).
PD(Cars, Hotels) = 0.75 > T, so the split falls after Cars; PD(Hotels, Restaurants) = 0.33 < T. Cars is chosen.

Choosing Reference Sets
- Similarity: Jensen-Shannon distance and TF-IDF are used in the experiments in the paper.
- Percent difference is the splitting criterion: a relative measure with a "reasonable" threshold (we use 0.6 throughout).
- A chosen set's score must also exceed the average: small scores with small changes can produce a large percent difference, but such sets are not better, only relatively so.
- If two or more reference sets are selected, annotation runs iteratively.
- If two reference sets have the same schema, use the one with the higher rank; this eliminates redundant matching.

Vector Space Matching for Semantic Annotation
Choosing reference sets compares the set of posts against each whole reference set; vector space matching compares each post against each reference set record.
Modified Dice similarity: if the Jaro-Winkler score of two tokens exceeds 0.95, the pair is put in the intersection (p ∩ r). This captures spelling errors and abbreviations.

Why Dice?
- TF-IDF with cosine similarity gives "City" more weight than "Ford" in the reference set. For the post "Near New Ford Expedition XLT 4WD with Brand New 22 Wheels!!! (Redwood City - Sale This Weekend !!!) $26850", the TF-IDF match (score 0.20) is {VOLKSWAGEN, JETTA, 4 Dr City Sedan, 1995}.
- Jaccard similarity [(p ∩ r)/(p ∪ r)] discounts shorter strings, and many posts are short.
- Dice boosts the numerator: the post above correctly matches {FORD, EXPEDITION, 4 Dr XLT 4WD SUV, 2005} with Dice 0.32 versus Jaccard 0.19. When the intersection is small, the denominator of Dice is almost the same as Jaccard's, so the numerator matters more.

Vector Space Matching for Semantic Annotation
Example posts: "new 2007 altima"; "02 M3 Convertible .. Absolute beauty!!!"; "Awesome car for sale! It's an accord, I think…"
Candidate matches and scores:
{BMW, M3, 2 Dr STD Convertible, 2002} -> 0.5
{NISSAN, ALTIMA, 4 Dr 3.5 SE Sedan, 2007} -> 0.36
{NISSAN, ALTIMA, 4 Dr 2.5 S Sedan, 2007} -> 0.36
{HONDA, ACCORD, 4 Dr LX, 2001} -> 0.13 (below Avg. Dice = 0.33, so discarded)
The average score splits matches from non-matches, eliminating false positives. The threshold thus comes from the data; using the average assumes there are both good and bad matches (we see this in the data).

Vector Space Matching: Attributes in Agreement
A set of matches can be ambiguous in its differing attributes: for "new 2007 altima", both Altima records above share the maximum score. Which is better? We say neither, and throw away the attributes on which they differ. Why not union them? In the real world, not all posts have all attributes; "new 2007 altima" specifies no trim.

Experimental Data Sets
[Tables describing the reference sets and the post sets.]

Results: Choosing Reference Sets (Jensen-Shannon)
T = 0.6. [Results table.]

Results: Semantic Annotation
The supervised machine learning baseline has a notion of matches and non-matches in its training data, and has "in agreement" issues of its own. [Results table.]

Related Work
- Semantic annotation: rule- and pattern-based methods assume that structure repeats, which is what makes rules and patterns useful. In our case, unstructured data disallows such assumptions.
- SemTag (Dill et al. 2003) looks up tokens in a taxonomy and disambiguates them one token at a time. We disambiguate using all the posts during reference set selection, so we avoid ambiguities such as "is jaguar a car or an animal?" (the reference set would tell us!). We also do not require a carefully formed taxonomy, so we can easily exploit widely available reference sets.
- Information extraction using reference sets: CRAM does unsupervised extraction but is given the reference set and labels all tokens (no junk allowed!); Cohen & Sarawagi 2004 do supervised extraction, whereas ours is unsupervised.
- Resource selection in distributed IR (the "hidden web") [survey: Craswell et al. 2000] requires probe queries to estimate coverage, since those systems lack full access to the data. Since we have full access to our reference sets, we do not need probe queries.

Conclusions
- Unsupervised semantic annotation: the system can accurately query noisy, unstructured sources without human intervention, e.g., aggregate queries (average Honda price?) without reading all the posts.
- Unsupervised selection of reference sets: the repository grows over time, increasing coverage. This is necessary to exploit newly collected reference sets automatically, and it allows large-scale annotation over time without user intervention.
- Unsupervised annotation is competitive with a machine learning approach, but without the burden of labeling matches.

Future Work
- Unsupervised extraction
- Collect reference sets and manage them with an information mediator
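The reference-set chooser described in the slides can be sketched in a few lines. This is a hedged reconstruction, not the authors' implementation: it assumes the similarity of the whole post collection to each reference set (computed elsewhere, e.g. by TF-IDF or Jensen-Shannon as in the paper) is already in hand, and it applies the percent-difference split with T = 0.6 plus the above-average requirement.

```python
def choose_reference_sets(scores, t=0.6):
    """Pick the reference set(s) most similar to the post collection.

    scores maps reference-set name -> similarity of the post collection
    to that reference set (assumed positive). Walk down the ranking and
    stop at the first large *relative* drop (percent difference > t);
    a chosen set must also beat the average score, since a big percent
    difference between two tiny scores does not make the larger one good.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    avg = sum(scores.values()) / len(scores)
    chosen = [ranked[0][0]] if ranked[0][1] > avg else []
    for (_, prev), (name, score) in zip(ranked, ranked[1:]):
        if (prev - score) / score > t:  # the split point: discard the rest
            break
        if score > avg:
            chosen.append(name)
    return chosen
```

On the slide's example (Cars 0.7, Hotels 0.4, Restaurants 0.3, average 0.47), PD(Cars, Hotels) = 0.75 > 0.6, so only Cars survives the split.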
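The per-record matching step can likewise be sketched. This is a minimal reconstruction under stated assumptions, not the authors' code: it implements the modified Dice score (token pairs whose Jaro-Winkler similarity exceeds 0.95 count toward the intersection, absorbing typos and abbreviations) and the "attributes in agreement" rule for tied top matches; any token weighting from the paper is omitted, and the greedy first-match pairing of tokens is a simplification.

```python
def jaro(s: str, t: str) -> float:
    """Plain Jaro similarity between two strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):  # count matching characters within the window
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    k = transpositions = 0  # count matched characters that are out of order
    for i in range(len(s)):
        if s_hit[i]:
            while not t_hit[k]:
                k += 1
            if s[i] != t[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    m = matches
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3


def jaro_winkler(s: str, t: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro-Winkler: Jaro boosted by the length of the common prefix."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1.0 - j)


def modified_dice(post_tokens, record_tokens, jw_threshold=0.95):
    """Dice similarity where near-identical tokens (Jaro-Winkler above the
    threshold) also count toward the intersection (p ∩ r)."""
    claimed = set()  # record-token indices already paired with a post token
    inter = 0
    for pt in post_tokens:
        for j, rt in enumerate(record_tokens):
            if j not in claimed and (pt == rt or jaro_winkler(pt, rt) > jw_threshold):
                claimed.add(j)
                inter += 1
                break
    return 2.0 * inter / (len(post_tokens) + len(record_tokens))


def agreed_attributes(tied_matches):
    """For reference-set records tied at the top score, keep only the
    attributes on which all of them agree and discard the rest."""
    first, rest = tied_matches[0], tied_matches[1:]
    return {k: v for k, v in first.items() if all(r.get(k) == v for r in rest)}
```

With this sketch, the misspelled token "corrolla" pairs with the reference token "corolla" (Jaro-Winkler above 0.95), and the two tied Altima records collapse to the make, model, and year they agree on, dropping the conflicting trims.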
