advertisement

Tries - Tree Based Structures for Strings

100 %
0 %
advertisement
Information about Tries - Tree Based Structures for Strings
Technology

Published on February 28, 2014

Author: amrinderarora

Source: slideshare.net

Description

Trie (aka radix tree or prefix tree), is an ordered tree data structure where the keys are usually strings. Tries have tremendous applications from all sorts of things like dictionary to
advertisement

CS 6213 – Advanced Data Structures

 Instructor Prof. Amrinder Arora amrinder@gwu.edu Please copy TA on emails Please feel free to call as well  TA Iswarya Parupudi iswarya2291@gwmail.gwu.edu CS 6213 - Advanced Data Structures - Arora L6 - Tries 2

Record / Struct Basics Arrays / Linked Lists / Stacks / Queues Graphs / Trees / BSTs Trie, B-Tree CS 6213 Advanced Splay Trees Union Find Spatial Databases Applications String In Memory CS 6213 - Advanced Data Structures - Arora L6 - Tries 3

 Michael T. Goodrich and Roberto Tamassia Data Structures and Algorithms in Java (4th edition) John Wiley & Sons, Inc. ISBN: 0-471-73884-0  Haim Kaplan, Tel Aviv University  Jörg Liebeherr, University of Toronto CS 6213 - Advanced Data Structures - Arora L6 - Tries 4

Naïve, brute force for searching a text of size n and a pattern of size m requires O(nm) time. CS 6213 - Advanced Data Structures - Arora Preprocessing the pattern speeds up pattern matching queries. E.g., KMP algorithm performs pattern matching in time proportional to the text size: O(n) If the text is large, immutable and searched often (e.g., Shakespeare) , we may want to preprocess the text itself. Want to perform the searching in O(m) time. L6 - Tries 5

 A trie is a compact data structure for representing a set of strings, such as all the words in a text. A trie supports pattern matching queries in time proportional to the pattern size: O(m) CS 6213 - Advanced Data Structures - Arora L6 - Tries 6

Standard Tries Compressed Tries Compact Representation Suffix Trie CS 6213 - Advanced Data Structures - Arora L6 - Tries 7

 The standard trie for a set of strings S is an ordered tree such that:  Each node but the root is labeled with a character  The children of a node are alphabetically ordered  The paths from the root to the leaves yield the strings of S  Example: set of strings S = { bear, bell, bid, bull, buy, sell, stock, stop } b e s i a l r d l CS 6213 - Advanced Data Structures - Arora u l l e y t l o l c k p L6 - Tries 8

 A standard trie uses O(n) space and supports searches, insertions and deletions in time O(dm), where: total size of the strings in S size of the string parameter of the operation size of the alphabet n m d b e s i a l r d l CS 6213 - Advanced Data Structures - Arora u l l e y t l o l c k p L6 - Tries 9

 We insert the words of the text into a trie  Each leaf a e s e e a r l 6 s e l l s t o c k ! b u l l ? b u y s t o c k ! 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 s t o c k ! b i d s t o c k ! 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 h e a r t h e b e l l ? s t o p ! 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 b h i l b e a r ? 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 b i d stores the occurrences of the associated word in the text a s e e d 47, 58 78 CS 6213 - Advanced Data Structures - Arora u l s e y 36 a l r 30 69 e e t l 0, 24 l 12 o c k 17, 40, 51, 62 p 84 L6 - Tries 10

 A compressed trie has b internal nodes of degree at least two e  It is obtained from standard trie by compressing chains ar of “redundant” nodes id ll l r l d ell y to ck p s i a u ll b e s u l l CS 6213 - Advanced Data Structures - Arora e y t l o l c k p L6 - Tries 11

 Compact representation of a compressed trie for an array of strings:  Stores at the nodes ranges of indices instead of substrings  Uses O(s) space, where s is the number of strings in the array  Serves as an auxiliary index structure 0 1 2 3 4 0 1 2 3 S[4] = S[1] = s e e b e a r S[2] = s e l l S[3] = s t o c k S[0] = S[7] = S[5] = b u l l b u y S[8] = h e a r b e l l S[6] = b i d S[9] = s t o p 1, 0, 0 1, 1, 1 1, 2, 3 0 1 2 3 4, 1, 1 6, 1, 2 8, 2, 3 CS 6213 - Advanced Data Structures - Arora 0, 0, 0 7, 0, 3 4, 2, 3 0, 1, 1 5, 2, 2 0, 2, 2 3, 1, 2 2, 2, 3 3, 3, 4 9, 3, 3 L6 - Tries 12

 Begins with: where name like „x%‟  Ends with: where name like „%x‟  Substring: where name like „%x%‟ CS 6213 - Advanced Data Structures - Arora L6 - Tries 13

 The suffix trie of a string X is the compressed trie of all the suffixes of X m i n i m i z e 0 1 2 3 4 5 6 7 e mize i nimize CS 6213 - Advanced Data Structures - Arora mi ze nimize nimize ze ze L6 - Tries 14

 Compact representation of the suffix trie for a string X of size n from an alphabet of size d  Uses O(n) space  Supports arbitrary pattern matching queries in X in O(dm) time, where m is the size of the pattern  Can be constructed in O(n) time m i n i m i z e 0 1 2 3 4 5 6 7 7, 7 4, 7 1, 1 2, 7 CS 6213 - Advanced Data Structures - Arora 0, 1 6, 7 2, 7 2, 7 6, 7 6, 7 L6 - Tries 15

 Auto complete: User types “Rob” and you can type with all words that begin with Rob, or all contacts that begin with Rob, etc.  Sequence Assembly in Genetics Sequences  Sorting of Large Sets of Strings: BurstSort  Big Data: See “TeraSort.java” source code CS 6213 - Advanced Data Structures - Arora L6 - Tries 16

Packets of Fun CS 6213 - Advanced Data Structures - Arora L6 - Tries 17

Routing Table Output Scheduling Switch Fabric Routing Decision Routing Table Forwarding Decision Routing Table Forwarding Decision CS 6213 - Advanced Data Structures - Arora L6 - Tries 18

 A standardized exterior gateway protocol designed to exchange routing and reachability information between autonomous systems (AS) on the Internet.  Makes routing decisions based on paths, network policies and/or rule-sets configured by a network administrator.  Plays a key role in the overall operation of the Internet and is involved in making core routing decisions.  [Itself uses TCP to exchange its own data.] CS 6213 - Advanced Data Structures - Arora L6 - Tries 19

Source: Geoff Huston, APNIC CS 6213 - Advanced Data Structures - Arora L6 - Tries 20

128.143.71.21 With CIDR, there can be multiple matches for a destination address in the routing table Destination address 10.0.0.0/8 128.143.0.0/16 Longest Prefix Match: Search for the routing table entry that has the longest match with the prefix of the destination IP address (Most Specific Router): Search for a match on all 32 bits 2. Search for a match for 31 bits ….. 32. Search for a match on 0 bits 1. Needed: Data structure that supports a FAST longest prefix match lookup! CS 6213 - Advanced Data Structures - Arora 128.143.64.0/20 Next hop R1 R2 R3 128.143.192.0/20 R3 128.143.71.0/24 R4 128.143.71.55/32 R3 Default R5 The longest prefix match for 128.143.71.21 is with 128.143.71.0/24  Datagram will be sent to R4 L6 - Tries 21

 The following algorithms are suitable for Longest Prefix Match routing table lookups  Tries  Path-Compressed Tries  Disjoint-prefix binary Tries  Multibit Tries  Binary Search on Prefix  Prefix Range Search CS 6213 - Advanced Data Structures - Arora L6 - Tries 22

 A trie is a tree-based data t p t e o te n ten p o to a tea po o top t pot CS 6213 - Advanced Data Structures - Arora structure for storing strings:  There is one node for every common prefix  The strings are stored in extra leaf nodes  Prefixes are not only stored at leaf nodes but also at internal nodes L6 - Tries 23

Structure  Each leaf contains a possible address  Prefixes in the table are marked (dark) Search  Traverse the tree according to destination address  Most recent marked node is the current longest prefix  Search ends when a leaf node is reached CS 6213 - Advanced Data Structures - Arora L6 - Tries 24

Update  Search for the new entry  Search ends when 1  If there is no branch z 1010* 0 z CS 6213 - Advanced Data Structures - Arora a leaf node is reached to take, insert new node(s) L6 - Tries 25

d  Goal: Eliminate long sequences of 1-child nodes  Path compression  collapses 1-child branches  Path Compression:  Requires to store additional information with nodes Bit number field is added to node  Bit string of prefixes must be explicitly stored at nodes  Need to make comparison when searching the tree CS 6213 - Advanced Data Structures - Arora L6 - Tries 26

d  Search: “010110”  Root node: Inspect 1st bit and move left  “a” node:  Check with prefix of a (“0*”) and find a match  Inspect 3rd bit and move left  “b” node:  Check with prefix of b (“01000*”) and determine that there is no match  Search stops. Longest prefix match is with a CS 6213 - Advanced Data Structures - Arora L6 - Tries 27

 Multiple matches in longest prefix rule require backtracking of search  Goal: Transform tree as to avoid multiple matches  Disjoint prefix:  Nodes are split so that there is only one match for each prefix (“Leaf pushing”)  Consequence: Internal nodes do not match with prefixes  Results:  a (0*) is split into: a1 (00*), a3 (010*), a2 (01001*)  d (1*) is represented as d1 (101*) CS 6213 - Advanced Data Structures - Arora L6 - Tries 28

 Goal: Accelerate lookup by inspecting more than one bit at a time  “Stride”: number of bits inspected at one time  With k-bit stride, node has up to 2k child nodes  2-bit stride:  1-bit prefix for a (0*) is split into 00* and 01*  1-bit prefix for d (1*) is split into 10* and 11*  3-bit prefix for c has been expanded to two nodes  Why are the prefixes for b and e not expanded? CS 6213 - Advanced Data Structures - Arora L6 - Tries 29

 Bounds are expressed for  Look-up time: What is the longest lookup time?  Update time: How long does it take to change an entry?  Memory: How much memory is required to store the data structure?  W: length of the address (32 bits)  N: number of prefix in the routing table Scheme Lookup Update Memory Binary trie O(W) O(W) O(NW) Path-compressed trie O(W) O(W) O(NW) k-stride multibit trie O(W/k) O(W/k+2k) O(2kNW/k) CS 6213 - Advanced Data Structures - Arora L6 - Tries 30

 Excellent data structure for managing Strings  Supports prefix and suffix kind of lookups  Extremely fast – After the Trie has been built, the search time is O(m) where m is the size of the pattern.  Can be used to build indexes  Various applications in areas that use Strings (Literature/Dictionary/Content, as well as Networks and Bioinformatics) CS 6213 - Advanced Data Structures - Arora L6 - Tries 31

Add a comment

Related presentations

Presentación que realice en el Evento Nacional de Gobierno Abierto, realizado los ...

In this presentation we will describe our experience developing with a highly dyna...

Presentation to the LITA Forum 7th November 2014 Albuquerque, NM

Un recorrido por los cambios que nos generará el wearabletech en el futuro

Um paralelo entre as novidades & mercado em Wearable Computing e Tecnologias Assis...

Microsoft finally joins the smartwatch and fitness tracker game by introducing the...

Related pages

Trie - Wikipedia

... the keys are usually strings. Unlike a binary search tree, ... character-based trie except that ... Data Structures: Trie; Tries by Lloyd ...
Read more

HAT-trie: A Cache-conscious Trie-based Data Structure for ...

HAT-trie: A Cache-conscious Trie-based Data ... Tries are the fastest tree-based data structures for ... Based on the success of copying strings to ...
Read more

Tries | Brilliant Math & Science Wiki

Tries (also known as radix ... are tree-based data structures that are typically used to store ... The unbalanced tree, would clip off unnecessary strings ...
Read more

Tries – Tree Based Structures for Strings | Software Journal

Search for: Blog Topics. Algorithms; Analytics & BI; Education; Healthcare; HR; Product Management
Read more

Figure 1 from HAT-Trie: A Cache-Conscious Trie-Based Data ...

Tries are the fastest tree-based data structures for managing strings in-memory, but are space-intensive. The burst-trie is almost as fast but reduces ...
Read more

HAT-Trie: A Cache-Conscious Trie-Based Data Structure For ...

Tries are the fastest tree-based data structures for managing strings in-memory, but are space-intensive. The burst-trie is almost as fast but reduces ...
Read more

B-tries for disk-based string management

B-tries for disk-based string management ... large set of strings. The B-tree was proposed by Bayer and ... of highly dense tree structures.
Read more

The Trie Data Structure: Examples in Java | Toptal

The Trie: A Neglected Data Structure ... i.e. the sum of the length of the strings in the dictionary. ... Tries are neglected data structures, ...
Read more

Burst Tries: A Fast, E cient Data Structure for String Keys

... E cient Data Structure for String Keys ... of search tree that maintain strings in sorted order. Tries [4 ... tree structures for tasks such as ...
Read more