Information about Combinatorial Algorithms for Nearest Neighbors, Near-Duplicates and Small...

Published on January 7, 2009

Author: yurylifshits

Source: slideshare.net

Similarity Search: an Example
Input: Set of objects. Task: Preprocess it.
Query: New object. Task: Find the most similar one in the dataset.
Yury Lifshits, Shengyu Zhang, Combinatorial Nearest Neighbors, 2 / 29

Similarity Search
Search space: object domain $U$, similarity function $\sigma$
Input: database $S = \{p_1, \ldots, p_n\} \subseteq U$
Query: $q \in U$
Task: find $\arg\max_{p_i \in S} \sigma(p_i, q)$
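The task statement can be made concrete with a linear-scan baseline. This is a minimal sketch, not a method from the slides: the names `most_similar` and `jaccard`, and the toy documents, are assumptions chosen only for illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def most_similar(database: Sequence[T], query: T,
                 sigma: Callable[[T, T], float]) -> T:
    """Return the p in database that maximizes sigma(p, query) by linear scan."""
    if not database:
        raise ValueError("database must be non-empty")
    return max(database, key=lambda p: sigma(p, query))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Hypothetical similarity function: Jaccard overlap of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy database of character sets (illustrative data only).
docs = [frozenset("abc"), frozenset("abd"), frozenset("xyz")]
best = most_similar(docs, frozenset("abde"), jaccard)
```

The scan evaluates the similarity function once per database object, which is the cost that preprocessing is meant to avoid.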

Nearest Neighbors in Theory
Orchard's Algorithm, k-d-B tree, Sphere Rectangle Tree, Geometric near-neighbor access tree, Excluded middle vantage point forest, mvp-tree, Fixed-height fixed-queries tree, Vantage-point tree, AESA, LAESA, R*-tree, Burkhard-Keller tree, BBD tree, Navigating Nets, Voronoi tree, Balanced aspect ratio tree, M-tree, vps-tree, Metric tree, Locality-Sensitive Hashing, SS-tree, R-tree, Spatial approximation tree, mb-tree, Cover tree, Multi-vantage point tree, Bisector tree, Generalized hyperplane tree, Hybrid tree, Slim tree, k-d tree, X-tree, Spill Tree, Fixed queries tree, Balltree, Quadtree, Octree, Post-office tree

Revision: Basic Assumptions
In theory: triangle inequality; doubling dimension is $o(\log n)$.
Typical web dataset has separation effect: for almost all $i, j$: $1/2 \le d(p_i, p_j) \le 1$.
Classic methods fail: branch and bound algorithms visit every object; doubling dimension is at least $\log n / 2$.

Contribution
Navin Goyal, YL, Hinrich Schütze, WSDM 2008: combinatorial framework, a new approach to data mining problems that does not require triangle inequality; nearest neighbor algorithm.
This work: better nearest neighbor search; detecting near-duplicates; navigability design for small worlds.

Outline
1 Combinatorial Framework
2 New Algorithms
3 Combinatorial Nets
4 Directions for Further Research

1 Combinatorial Framework

Comparison Oracle
Dataset $p_1, \ldots, p_n$
Objects and distance (or similarity) function are NOT given.
Instead, there is a comparison oracle answering queries of the form: Who is closer to A: B or C?
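One way to picture the oracle is as a function that hides the distance and exposes only comparisons. The sketch below is an illustration under assumptions: the signature `closer(a, b, c)`, the helper names, and the use of numbers with absolute difference as the hidden distance are all hypothetical.

```python
from typing import Callable, Hashable, Sequence

# Assumed oracle signature for this sketch:
# closer(a, b, c) is True when b is closer to a than c is.
Oracle = Callable[[Hashable, Hashable, Hashable], bool]


def make_oracle(distance: Callable[[Hashable, Hashable], float]) -> Oracle:
    """Wrap a hidden distance so that callers only ever see comparisons."""
    def closer(a, b, c) -> bool:
        return distance(a, b) < distance(a, c)
    return closer


def nearest_by_oracle(dataset: Sequence, query, closer: Oracle):
    """Baseline linear scan that uses len(dataset) - 1 oracle calls."""
    best = dataset[0]
    for candidate in dataset[1:]:
        if closer(query, candidate, best):
            best = candidate
    return best


# Illustrative data: the algorithm never sees abs(x - y) directly.
closer = make_oracle(lambda x, y: abs(x - y))
points = [10, 3, 7, 25]
nearest = nearest_by_oracle(points, 8, closer)
```

Any algorithm in this framework has to be expressible in terms of such comparison calls alone.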

Disorder Inequality
Sort all objects by their similarity to $p$; $\mathrm{rank}_p(r)$ and $\mathrm{rank}_p(s)$ are the positions of $r$ and $s$ in that order.
Then sort by similarity to $r$; $\mathrm{rank}_r(s)$ is the position of $s$.
Dataset has disorder $D$ if $\forall p, r, s$: $\mathrm{rank}_r(s) \le D(\mathrm{rank}_p(r) + \mathrm{rank}_p(s))$
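For a small dataset the disorder constant can be found by brute force over all triples. The following is a sketch under assumptions: ranks start at 0 for the object itself (consistent with the combinatorial ball definition later in the talk), points are distinct and hashable, and the function names and sample data are hypothetical.

```python
from itertools import product


def rank_table(points, distance):
    """ranks[x][y] = position of y when points are sorted by distance to x.

    The point x itself gets rank 0; ties are broken by input order.
    """
    ranks = {}
    for x in points:
        ordered = sorted(points, key=lambda y: distance(x, y))
        ranks[x] = {y: i for i, y in enumerate(ordered)}
    return ranks


def disorder_constant(points, distance):
    """Smallest D with rank_r(s) <= D * (rank_p(r) + rank_p(s)) for all p, r, s."""
    ranks = rank_table(points, distance)
    worst = 0.0
    for p, r, s in product(points, repeat=3):
        denom = ranks[p][r] + ranks[p][s]
        # denom is 0 only when p == r == s, where both sides of the
        # inequality are 0, so that triple imposes no constraint.
        if denom > 0:
            worst = max(worst, ranks[r][s] / denom)
    return worst


# Illustrative data: five points on a line with absolute difference.
line = [0, 1, 3, 7, 15]
D = disorder_constant(line, lambda x, y: abs(x - y))
```

The triple loop is cubic in the dataset size, so this is only useful for inspecting toy examples.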

Combinatorial Framework = Comparison oracle + Disorder inequality
Comparison oracle: Who is closer to A: B or C?
Disorder inequality: $\mathrm{rank}_r(s) \le D(\mathrm{rank}_p(r) + \mathrm{rank}_p(s))$

Combinatorial Framework: FAQ
Disorder of a metric space?
Disorder of $\mathbb{R}^k$?
In what cases is disorder relatively small?
Experimental values of $D$ for some practical datasets?

Disorder vs. Others
If expansion rate is $c$, disorder constant is at most $c^2$.
Doubling dimension and disorder dimension are incomparable.
Disorder inequality implies combinatorial form of “doubling effect”.

Combinatorial Framework: Pro & Contra
Advantages:
Does not require triangle inequality for distances
Applicable to any data model and any similarity function
Requires only comparative training information
Limitation: worst-case form of disorder inequality

2 New Algorithms

Nearest Neighbor Search
Assume $S \cup \{q\}$ has disorder constant $D$.
Theorem: There is a deterministic and exact algorithm for nearest neighbor search:
Preprocessing: $O(D^7 n \log^2 n)$
Search: $O(D^4 \log n)$

Near-Duplicates
Assume the comparison oracle can also tell us whether $\sigma(x, y) > T$ for some similarity threshold $T$.
Theorem: All pairs with over-$T$ similarity can be found deterministically in time $\mathrm{poly}(D)(n \log^2 n + |\mathrm{Output}|)$
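For contrast with the bound in the theorem, here is the quadratic baseline that simply asks the threshold oracle about every pair. It is not the algorithm from the talk; the helper names, the Jaccard similarity over words, and the sample titles are assumptions made for the sketch.

```python
from itertools import combinations


def make_threshold_oracle(sigma, threshold):
    """Expose only the yes/no answer to: is sigma(x, y) > threshold?"""
    return lambda x, y: sigma(x, y) > threshold


def near_duplicate_pairs(items, above_threshold):
    """Quadratic baseline: one threshold-oracle call per unordered pair."""
    return [(x, y) for x, y in combinations(items, 2) if above_threshold(x, y)]


def word_jaccard(a: str, b: str) -> float:
    """Hypothetical similarity: Jaccard overlap of the word sets of two strings."""
    wa, wb = set(a.split()), set(b.split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 1.0


# Illustrative data only.
titles = ["a b c d", "a b c e", "x y z"]
pairs = near_duplicate_pairs(titles, make_threshold_oracle(word_jaccard, 0.5))
```

The baseline makes n(n - 1)/2 oracle calls regardless of how many pairs are reported, whereas the stated bound grows with the output size.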

Visibility Graph
Theorem: For any dataset $S$ with disorder $D$ there exists a visibility graph with:
$\mathrm{poly}(D)\, n \log^2 n$ construction time
$O(D^4 \log n)$ out-degrees
Naïve greedy routing deterministically reaches the exact nearest neighbor of the given target $q$ in at most $\log n$ steps
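Greedy routing itself is simple to state in code: from the current node, move to the out-neighbor that the oracle says is closest to the target, and stop when no neighbor improves. The sketch below shows only the routing step. The graph is a hand-built stand-in (power-of-two offsets on a line), not the visibility graph from the theorem, and all names are hypothetical.

```python
def greedy_route(graph, start, target, closer):
    """Follow out-edges greedily until no neighbor is closer to the target.

    graph:  dict mapping node -> list of out-neighbors
    closer: closer(a, b, c) is True when b is closer to a than c is
    Returns the list of visited nodes, starting with `start`.
    """
    path = [start]
    current = start
    while True:
        best = current
        for neighbor in graph[current]:
            if closer(target, neighbor, best):
                best = neighbor
        if best == current:  # local optimum: no out-neighbor improves
            return path
        current = best
        path.append(current)


# Hypothetical graph on the integers 0..16: each node links to the
# nodes at offsets of 1, 2, 4 and 8 in both directions.
graph = {
    u: [u + sign * step
        for step in (1, 2, 4, 8)
        for sign in (-1, 1)
        if 0 <= u + sign * step <= 16]
    for u in range(17)
}


def closer(a, b, c):
    """Comparison oracle backed by a hidden absolute-difference distance."""
    return abs(a - b) < abs(a - c)


path = greedy_route(graph, 0, 11, closer)
```

On an arbitrary graph this procedure can stop at a local optimum; the theorem's claim is that a suitably constructed graph rules that out.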

[Figure: greedy routing steps $p_1, p_2, p_3, p_4$ toward the target $q$]

3 Combinatorial Nets

Combinatorial Ball
$B(x, r) = \{y : \mathrm{rank}_x(y) < r\}$
In other words, it is a subset of the dataset $S$: the object $x$ itself and its $r - 1$ nearest neighbors.
[Figure: the ball $B(x, 10)$ around $x$]
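A direct reading of the definition, as a minimal sketch: sort the dataset by distance to x and keep the first r entries. The function name and the sample points are assumptions, and a distance function stands in for the comparison oracle.

```python
def combinatorial_ball(points, x, r, distance):
    """B(x, r) = {y : rank_x(y) < r}: x itself plus its r - 1 nearest neighbors."""
    # After sorting, the index of y in `ordered` is rank_x(y); x has rank 0.
    ordered = sorted(points, key=lambda y: distance(x, y))
    return ordered[:r]


# Illustrative data: points on a line with absolute difference.
points = [0, 1, 3, 7, 15]
ball = combinatorial_ball(points, 3, 3, lambda a, b: abs(a - b))
```

The ball always has exactly r members (when the dataset has at least r objects), no matter how spread out the data is.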

Combinatorial Net
A subset $R \subseteq S$ is called a combinatorial $r$-net iff the following two properties hold:
Covering: $\forall y \in S, \exists x \in R$ s.t. $\mathrm{rank}_x(y) < r$.
Separation: $\forall x_i, x_j \in R$: $\mathrm{rank}_{x_i}(x_j) \ge r$ OR $\mathrm{rank}_{x_j}(x_i) \ge r$.
How to construct a combinatorial net? What upper bound on its size can we guarantee?
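A naive greedy pass already yields a set with both properties: scan the objects and promote each one that no existing center covers. This is an illustrative sketch, not the fast construction from the talk, and it says nothing about the size of the resulting net. The function names and data are hypothetical.

```python
def rank(points, x, y, distance):
    """rank_x(y): position of y when points are sorted by distance to x."""
    ordered = sorted(points, key=lambda z: distance(x, z))
    return ordered.index(y)


def greedy_net(points, r, distance):
    """Add y as a new center whenever no earlier center x has rank_x(y) < r."""
    centers = []
    for y in points:
        if all(rank(points, x, y, distance) >= r for x in centers):
            centers.append(y)
    return centers


def is_net(points, centers, r, distance):
    """Check the covering and separation properties of a combinatorial r-net."""
    covering = all(
        any(rank(points, x, y, distance) < r for x in centers) for y in points
    )
    separation = all(
        rank(points, xi, xj, distance) >= r or rank(points, xj, xi, distance) >= r
        for i, xi in enumerate(centers)
        for xj in centers[i + 1:]
    )
    return covering and separation


# Illustrative data: points on a line with absolute difference.
points = [0, 1, 3, 7, 15]
dist = lambda a, b: abs(a - b)
net = greedy_net(points, 2, dist)
```

Covering holds because every skipped object was covered by some center (and each center covers itself for r of at least 1); separation holds because a later center was, by construction, not within rank r of any earlier one.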

Basic Data Structure
Combinatorial nets: for every $0 \le i \le \log n$, construct an $\frac{n}{2^i}$-net.
Pointers, pointers, pointers:
Direct & inverted indices: links between centers and members of their balls
Cousin links: for every center keep pointers to close centers on the same level
Navigation links: for every center keep pointers to close centers on the next level

Fast Net Construction
Theorem: Combinatorial nets can be constructed in $O(D^7 n \log^2 n)$ time

Up’n’Down Trick
Assume you have a $2r$-net for the dataset.
To compute an $r$-ball around some object $p$:
1 Take a center $p'$ of a $2r$-ball that covers $p$
2 Take all centers of $2r$-balls near $p'$
3 For all of them, write down all members of their $2r$-balls
4 Sort all written objects with respect to $p$ and keep the $r$ most similar ones
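The four steps can be sketched as follows. This is an illustration under assumptions: the slide does not say how close a center must be to count as nearby, so the cutoff `near` (number of closest centers inspected) is a hypothetical parameter, and a distance function stands in for the comparison oracle. The 4-net `[0, 15]` used in the example was checked by hand for this toy dataset.

```python
def rank_order(points, x, distance):
    """Objects sorted by distance to x; the index of y is rank_x(y)."""
    return sorted(points, key=lambda y: distance(x, y))


def r_ball_via_2r_net(points, centers, p, r, distance, near):
    """Compute the r-ball around p from a 2r-net (`centers`).

    near: how many centers closest to the covering center are inspected.
          This cutoff is an assumption of the sketch, not a value from the talk.
    """
    # Step 1: a center whose 2r-ball covers p (exists if centers is a 2r-net).
    covering = next(
        c for c in centers if p in rank_order(points, c, distance)[: 2 * r]
    )
    # Step 2: centers of 2r-balls near the covering center.
    nearby = rank_order(centers, covering, distance)[:near]
    # Step 3: write down all members of their 2r-balls.
    candidates = set()
    for c in nearby:
        candidates.update(rank_order(points, c, distance)[: 2 * r])
    # Step 4: sort candidates with respect to p and keep the r most similar.
    return rank_order(candidates, p, distance)[:r]


# Illustrative data: all pairwise distances from any point are distinct.
points = [0, 1, 3, 7, 15]
dist = lambda a, b: abs(a - b)
four_net = [0, 15]  # a 4-net of `points`, i.e. a 2r-net for r = 2
ball = r_ball_via_2r_net(points, four_net, 3, 2, dist, near=2)
```

With `near` equal to the number of centers the candidate set is the whole dataset, so the result is exact by the covering property; the point of the trick is that a much smaller cutoff suffices when the disorder constant is small.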

4 Directions for Further Research

Future of Combinatorial Framework
Other problems in combinatorial framework: low-distortion embeddings, closest pairs, community discovery, linear arrangement, distance labelling, dimensionality reduction
What if disorder inequality has exceptions?
Insertions, deletions, changing metric
Experiments & implementation
Unification challenge: disorder + doubling = ?

Summary
Combinatorial framework: comparison oracle + disorder inequality
New algorithms: nearest neighbor search, deterministic detection of near-duplicates, navigability design
Thanks for your attention! Questions?

Links
http://yury.name
http://simsearch.yury.name
Tutorial, bibliography, people, links, open problems
Yury Lifshits and Shengyu Zhang. Combinatorial Algorithms for Nearest Neighbors, Near-Duplicates and Small-World Design. http://yury.name/papers/lifshits2008similarity.pdf
Navin Goyal, Yury Lifshits, Hinrich Schütze. Disorder Inequality: A Combinatorial Approach to Nearest Neighbor Search. http://yury.name/papers/goyal2008disorder.pdf
Benjamin Hoffmann, Yury Lifshits, Dirk Nowotka. Maximal Intersection Queries in Randomized Graph Models. http://yury.name/papers/hoffmann2007maximal.pdf

Yury Lifshits, Shengyu Zhang: Combinatorial algorithms for nearest neighbors, near-duplicates and small-world design. SODA 2009: 318-326.
