Stalker

50 %
50 %
Information about Stalker
Entertainment

Published on November 1, 2007

Author: Garrick

Source: authorstream.com

A Hierarchical Approach to Wrapper Induction [Muslea et al, In Proceedings of Agents ‘99]:  A Hierarchical Approach to Wrapper Induction [Muslea et al, In Proceedings of Agents ‘99] Acknowledgements: Weijun Lou Dr. Graig Knoblock Dr. Ion Muslea Dr. Steve Minton Dr. Subbarao Kambhampati Outline of the presentation:  Outline of the presentation Main Contents in the paper ECT – Embedded Catalog Tree Extraction Rules Extraction Rules as Finite Automata Rules Learning – Stalker Experimental Results Strength vs Weakness Q/A Slide3:  Name: KFC Cuisine: Fast Food Locations: Venice (310) 123-4567, (800) 888-4412. L.A. (213) 987-6543. Encino (818) 999-4567, (888) 727-3131. RESTAURANT Name List ( Locations ) Cuisine City List (PhoneNumbers) AreaCode Phone The Embedded Catalog Tree (ECT) The Embedded Catalog Tree (ECT) (continuous):  The Embedded Catalog Tree (ECT) (continuous) Common conventions for structuring HTML page: The information on a page often exhibits some hierarchy Semistructured information is often presented in the form of lists of tuples The formalism (ECT) can describe the structure of a wide-wage of semistructured documents The Embedded Catalog Tree (ECT) (continuous):  The Embedded Catalog Tree (ECT) (continuous) The formalism: The leaves are the items of interest for user The internal nodes (include root) represent lists of k-tuples Each item in the k-tuple can be either a leaf l or another list L (called an embedded list) k indicates the number of items in the tuple Outline of the presentation :  Outline of the presentation Main Contents in the paper ECT – Embedded Catalog Tree Extraction Rules Extraction Rules as Finite Automata Rules Learning – Stalker Experimental Results Strength vs Weakness Q/A Extraction Rules:  Extraction Rules Why need rules: Given the ECT together with an extraction rule attached each edge and a iteration rule associated with each list node, the problem of extracting any item of interest by simply determining the path P from root to the corresponding leaf and by successively extracting each node x ∈P from its parent p. Slide8:  Training Example: 1: <p> Name: <b> Yala </b> <p>Cuisine: Thai <p> <i> 2: 4000 Colfax, Phoenix, AZ 85258 (602) 508-1570 3: </i><br><i> 4: 523 Vernon, Las Vegas, NV 89104 (702) 578-2293 5: </i><br><i> 6: 403 Pico, LA, CA 90007 (213) 789-0008 7: </i> Example of Rule Induction To extract the name, apply R1: SkipTo(<b>) start rule and R2: SkipTo</b> end rule to the parent node to identify the prefix and suffix of the item. Alternative rules to R1 can be R3: SkipTo(Name) SkipTo(<b>) or R4: SkipTo(Name Symbol HtmlTag). - match correctly. To extract list node, apply start rule SkipTo(<p><i>) and end rule skipTo(</i>). To break the list into individual tuples, apply SkipTo(<i>) and end rule SkipTo(</i>) iterately to the list node. Extraction rules:  Extraction rules How to express extraction rules ? – a sequence of landmarks: Landmark – a group of consecutive tokens (e.g. words, numbers, HTML tags, substrings, etc), eg. <b> Wildcard – a class of tokens, eg. Number, Sign, HtmlTag, AllCaps, etc. Function-like expression which take a landmark or a wildcard as arguments, eg. SkipTo(:<b>) Cascade functions, eg. SkipTo(AllCaps) NextLandmark(Number) Disjunctive rule, eg. either SkipTo(<b>) or SkipTo(<i>) Slide10:  Training Example: 1: <p> Name: <b> Yala </b> <p>Cuisine: Thai <p> <i> 2: 4000 Colfax, Phoenix, AZ 85258 (602) 508-1570 3: </i><br><i> 4: 523 Vernon, Las Vegas, NV 89104 (702) 578-2293 5: </i><br><i> 6: 403 Pico, LA, CA 90007 (213) 789-0008 7: </i> Example of Rule Induction (cont’s) To extract area-code, apply SkipTo(‘(‘) and end rule SkipTo(‘)’) iterately to each tuple To find beginning of ZIP code, apply R5: SkipTo(,) SkipUntil(Number) or R6: SkipTo(AllCaps) NextLandmark(Number) Extraction Rules (cont’s):  Extraction Rules (cont’s) Advantages Hierarchical extraction based on the ECT formalism allows to wrap info sources that have arbitrary many levels Each node is extracted independently of its siblings, allows missing items or items ordering variation in info sources. In general, using inductive algorithm (Stalker) with hierarchical extraction (ECT) turns an extremely hard problem into several simple ones: rather than finding a single rule, find a number of simpler rules which deal with the easier task of extracting one item from its ECT Outline of the presentation:  Outline of the presentation Main Contents in the paper ECT – Embedded Catalog Tree Extraction Rules Extraction Rules as Finite Automata Rules Learning – Stalker Experimental Results Strength vs Weakness Q/A Extraction Rules as Finite Automata:  Extraction Rules as Finite Automata Landmark automata: a group of function-like SkipTo()s that must be applied in a pro-established order represents a landmark automaton. Any extraction rule in previous slides is a landmark automaton. Linear landmark: landmark described by a sequence of tokens and wildcards. Linear landmarks are sufficiently expressive to allow efficient navigation within ECT Linear landmarks can be generated and refined in a simple way during learning Extraction Rules as Finite Automata (cont’s):  Extraction Rules as Finite Automata (cont’s) Training Examples: E1: 513 Pico, <b>Venice</b>, Phone: 1-<b>800</b>-555-1515 E2: 90 Colfax, <b>Palms</b>, Phone: (818)508-1570 E3: 523 1st St., <b>LA </b?, Phone: 1-<b>888</b>-578-2293 E4: 403 Vernon, <b>Watts</b>, Phone: (310)798-0008 Rule for start of the area code: either Skipto(‘(‘) or SkipTo (Phone) SkipTo (<b>) An SLG for the start of the area code Landmark automata (LA) are nondeterministic finite automata in which Si->Sj (i=j) is labeled by a landmark li,j; i.e. the transition takes place iff the input in the state is a string that is accepted by the landmark li,j. Simple Landamark Grammars (SLGs) are the class of Las that correspond to the disjuctive rules Extraction Rules as Finite Automata (cont’s):  Extraction Rules as Finite Automata (cont’s) The properties of SLG: The initial state S0 has a branching –factor of k It has exactly k accepting states (one per branch) All k branches that leave the S0 are sequential Las Linear Landmarks label each non loop transitions; All looping transitions have the meaning “consume all tokens until you encounter the linear landmark that leads to the next state s3 s0 s1 s2 ( Phone <b> s0 Outline of the presentation:  Outline of the presentation Main Contents in the paper ECT – Embedded Catalog Tree Extraction Rules Extraction Rules as Finite Automata Rules Learning – Stalker Experimental Results Strength vs Weakness Q/A Example of Extraction Task:  Example of Extraction Task NAME Casablanca Restaurant STREET 220 Lincoln Boulevard CITY Venice PHONE (310) 392-5751 The Wrapper Architecture:  Extraction Rules Query Data Information Extractor EC Tree The Wrapper Architecture Learning the Extraction Rules:  Inductive Learning System Extraction Rules Labeled Pages Learning the Extraction Rules GUI The STALKER Algorithm:  The STALKER Algorithm STALKER(Examples) let RetVal be an empty SLG WHILE Examples =0 aDisjunct = LearnDisjunct(Examples) remove all examples covered by aDisjunct add aDisjunct to RetVal return RetVal LearnDisjunct(Examples) Terminals = Wildcards U GetTokens(Examples) Candidates = GetInitailCandidateds(Examples) WHILE candidates ≠0 DO let D = BestDisjunct (Candidates) IF D is a perfect Disjunct THEN return D FOR EACH t ∈Terminals DO Candidates = Candidates U Refine(D, t) remove D from Candidates return best disjunct Slide21:  Example of Rule Induction Training Examples: E1: 513 Pico, <b>Venice</b>, Phone: 1-<b>800</b>-555-1515 E2: 90 Colfax, <b>Palms</b>, Phone: (818)508-1570 E3: 523 1st St., <b>LA </b?, Phone: 1-<b>888</b>-578-2293 E4: 403 Vernon, <b>Watts</b>, Phone: (310)798-0008 Initial candidates in the first iteration R1 R2 R3 R4 Slide22:  Example of Rule Induction Training Examples: E1: 513 Pico, <b>Venice</b>, Phone: 1-<b>800</b>-555-1515 E3: 523 1st St., <b>LA </b?, Phone: 1-<b>888</b>-578-2293 Initial candidates and Refinements in the second iteration Slide23:  STALKER (cont’s) Stalker is a typical sequential covering algorithm Stalker can be easily extended to uses SkipUntil() and NextLandMark() feature No change on the rule refining process GenerateInitCandidates() need to change to also generate SkipTo()SkipUntil() and SkipTo()NexLandmark() Outline of the presentation:  Outline of the presentation Main Contents in the paper ECT – Embedded Catalog Tree Extraction Rules Extraction Rules as Finite Automata Rules Learning – Stalker Experimental Results Strength vs Weakness Q/A Experimental Results:  Experimental Results Illustrative information sources Comparison between WIEN and STALKER Correctness levels for S1 and S2 Results Analysis:  Results Analysis Stalker usually requires less than 10 examples to obtain a 97% average accuracy over 500 trials Termination criteria Stalker: either reach the 97% accuracy threshold or consumed 10 examples WIEN: 100% accuracy The most impressiveness lies in WIEN uses two orders of magnitude more example than Stalker While Stalker allows both missing items and various item ordering, accuracy ranges between 85% and 100% Stalker is reasonably fast Stalker: less than 20 sec per experiment for S1 and S2; less than 40 sec per item for S2 and S4. WIEN: 5 sec per experiment for S1 and 83 sec for S2 For easier task like S2 which has extremely regular structure, except for list extraction rule, all other rules have a 100% accuracy based on a single example! The learning curves for the list extraction and list iteration for all four sources show that these two rules can be extracted relative independently of the extraction difficulty of different source, they all converges quick and get accuracy level above 95% - From another perspective, it proves the usefulness of EC formalism Outline of the presentation:  Outline of the presentation Main Contents in the paper ECT – Embedded Catalog Tree Extraction Rules Extraction Rules as Finite Automata Rules Learning – Stalker Experimental Results Strength vs Weakness Q/A Strength vs Weakness:  Strength vs Weakness Strength: Powerful extraction language EC Formalism: turns hard problem into a problem that is extremely easy in practice. Stalker: allows miss items and various item orderings One hard-to-extract item does not affect others Disadvantage: Requires hand-craft labelling, though usually 10 examples are sufficient. Man-made thresholds in termination criteria. For some hard tasks like S4, the accuracy obtained is still not satisfactory. Outline of the presentation:  Outline of the presentation Main Contents in the paper ECT – Embedded Catalog Tree Extraction Rules Extraction Rules as Finite Automata Rules Learning – Stalker Experimental Results Strength vs Weakness Q/A

Add a comment

Related presentations

Related pages

Stalking – Wikipedia

Stalking, juristisch Nachstellung ist das willentliche und wiederholte (beharrliche) Verfolgen oder Belästigen einer Person, deren physische oder ...
Read more

Stalking » Hilfe & Informationen für Stalking Opfer

Stalking - Informationen für Stalking Opfer. Sie werden von einem Stalker belästigt, verfolgt und terrorisiert? Dann helfen Ihnen die Informationen für ...
Read more

STALKER

S.T.A.L.K.E.R.: Shadow of Chernobyl, official site of new FPS computer game from GSC Game World.
Read more

Stalker - Film 1979 - FILMSTARTS.de

Stalker, Ein Film von Andreï Tarkovski mit Alexandre Kaidanovski, Alissa Feindikh. Übersicht und Filmkritik. Die Handlung spielt in einer nicht näher ...
Read more

Stalker - Serie - SAT.1

Stalker: Die Thrillerserie aus L.A. - Sendetermine, Infos, Bilder und Videos zur Crime-Serie mit Dylan McDermott und Maggie Q.
Read more

Stalker (TV Series 2014–2015) - IMDb

Created by Kevin Williamson. With Dylan McDermott, Maggie Q, Victor Rasuk, Mariana Klaveno. A team of detectives investigates stalkers in Los Angeles.
Read more

Stalking: "Der Stalker sucht eine Beziehung zum Opfer ...

Wie wird ein Mensch zum Stalker? Wo verläuft die Grenze zu Mobbing? Der Psychotherapeut Wolf Ortiz-Müller über die Tätermotive und die Versuchung ...
Read more

Stalker (Film) – Wikipedia

Stalker entstand in den Jahren 1978/79 als fünfter Spielfilm des sowjetischen Regisseurs Andrei Tarkowski. Das von Mosfilm produzierte Werk gilt als ...
Read more

Stalker 2 (PC) - Test, Download, Systemanforderungen ...

Auf GameStar.de erfahren Sie alles zum Actionspiel Stalker 2 von GSC Gameworld: Test, News, Wertung, Download, Systemanforderungen, Release ...
Read more

Stalking - Wikipedia

Stalking is unwanted or obsessive attention by an individual or group towards another person. Stalking behaviors are related to harassment and intimidation ...
Read more