Published on February 5, 2014
Addressing scalability challenges in peer-to-peer search
PhD seminar, 4 Feb 2014
Harisankar H, PhD scholar, DOS Lab, Dept. of CSE
Advisor: Prof. D. Janakiram
http://harisankarh.wordpress.com
Outline
• Issues with centralized search
  – Can peer-to-peer search help?
• Scalability challenges in peer-to-peer search
• Proposed architectural extensions
  – Two-layered architecture for peer-to-peer concept search
  – Cloud-assisted approach to handle query spikes
Centralized search scenario
• Scenario
  – Search engines crawl available content, index it and maintain it in data centers
  – User queries are directed to data centers, processed internally, and results sent back
  – Centrally managed by a single company
(Diagram: content, data centers, end users)
Some issues with centralized search
– Privacy concerns
  • All user queries accessible from a single location
– Centralized control
  • Individual companies decide what to (or not to) index, how to rank, etc.
– Transparency
  • Complete details of ranking, pre-processing, etc. not made publicly available
  • Concerns of censorship and doctoring of results
Some issues with centralized search (contd.)
• Uses mostly syntactic search techniques
  – Based on words or multi-word phrases
  – Low quality of results due to the ambiguity of natural language
• Issues with centralized semantic search
  – Difficult to capture the long tail of users' niche interests
    • Requires rich human-generated knowledge bases in numerous niche areas
Peer-to-peer search approach
• Edge nodes in the internet participate in providing and using the search service
• Search as a collaborative service
• Crawling, indexing and search distributed across the peers
How could peer-to-peer search help?
• Each user query can be sent to a different peer among millions
  – Obtaining query logs in a single location becomes difficult
  – Reduced privacy concerns
• Control distributed across numerous peers
  – Avoids centralized control
• Search application available with all peers
  – Better transparency in ranking etc.
• Background knowledge of peers can be utilized for effective semantic search
  – Can help improve quality of results
• Has led to a lot of academic research in the area as well as real-world p2p search engines*
* e.g., faroo.com, yacy.net; YacyPi Kickstarter project
Realizing peer-to-peer search
• Distribution of the search index: term partitioning
  – Responsibility for individual terms assigned to different peers
    • E.g., peer 1 is currently responsible for the term "computer"
  – Term-to-peer mapping achieved through a structured overlay (e.g., a DHT)
Image src: http://wwarodomfr.blogspot.in/2008/09/chord-routing.html
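The term-to-peer assignment above can be sketched with a Chord-like successor rule. This is a minimal illustration, not the deck's implementation; the 16-bit identifier space and the peer names are assumptions made only for readability.

```python
import hashlib
from bisect import bisect_left

def key(s: str) -> int:
    # Hash a term or peer name into the DHT identifier space
    # (a 16-bit space, chosen only for readability).
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % 2 ** 16

def responsible_peer(term: str, peer_ids: list) -> int:
    # Chord-style successor rule: the first peer id clockwise from
    # the term's key owns that term's posting list.
    ring = sorted(peer_ids)
    i = bisect_left(ring, key(term))
    return ring[i % len(ring)]

# Hypothetical ring of 8 peers.
peers = [key(f"peer{i}") for i in range(8)]
owner = responsible_peer("computer", peers)
```

Because the mapping depends only on the hash, any peer can locally compute which peer is responsible for "computer" and route the query there.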
Scalability challenges in peer-to-peer search
• Peers share only idle resources
• Peers join/leave autonomously
• Limited individual resources
• No SLA
These lead to:
– Peer bandwidth bottleneck during query processing
  • Particularly for queries involving multiple terms (index transfer between multiple peers)
– Instability during query spikes
• Knowledge management issues at large scale
  – Difficult to reach consensus at large scale
  – Needs wide understanding and has to meet the requirements of a large, diverse group
Two-layered architecture for peer-to-peer concept search*
• Peers organized into communities based on common interest
• Each community maintains its own background knowledge for use in semantic search
  – Maintained in a distributed manner
• A global layer with aggregate information facilitates search across communities
• Background knowledge bases extend from minimal universally accepted knowledge in the upper layer
• Search, indexing and knowledge management proceed independently in each community
* Joint work with Prof. Fausto Giunchiglia and Uladzimir Kharkevich, Univ. of Trento
Two-layered architecture for peer-to-peer concept search
(Diagram: a global layer holding the community index and universal knowledge (UK); communities 1-3 below it, each with its own background knowledge (BK-1 to BK-3) and document index)
Two-layered architecture
• Global layer
  – Retrieves the relevant communities for a query based on universal knowledge
• Community layer
  – Retrieves the relevant documents for a query based on the community's background knowledge
Overcoming the shortcomings of single-layered approaches
• Search can be scoped to only the relevant communities for a query
  – Results in fewer bandwidth-related issues
• Two layers make knowledge management scalable and interoperable
  – Niche interests supported by community-level background knowledge bases
  – Minimal universal knowledge for interoperability
• Search within a community is based on the community's background knowledge
  – The focused interest of the community helps in better term-to-concept mapping
Two-layered approach
• Index partitioning
  – Uses partition-by-term
    • Posting list for each term stored on a different peer
  – Uses a Distributed Hash Table (DHT) to realize dynamic term-to-peer mapping
    • O(log N) hops for each lookup
• Overlay network
  – Communities and global layer maintained using a two-layered overlay
    • Based on our earlier work on computational grids*
  – O(log N) hops for lookup even with two layers
* M.V. Reddy, A.V. Srinivas, T. Gopinath, and D. Janakiram, "Vishwa: A reconfigurable P2P middleware for Grid Computations," in ICPP'06
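The O(log N) lookup cost can be illustrated by simulating Chord-style finger routing. This is a toy model under stated assumptions (16-bit identifier space, 64 hypothetical peers, greedy forwarding to the farthest finger that does not pass the key's owner), not the Vishwa overlay itself.

```python
import hashlib
from bisect import bisect_left

M = 16                      # identifier bits; an illustrative choice
SPACE = 2 ** M

def h(s: str) -> int:
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % SPACE

def successor(ring, x):
    # First node clockwise from identifier x (ring is a sorted list of ids)
    i = bisect_left(ring, x % SPACE)
    return ring[i % len(ring)]

def lookup_hops(ring, start, key):
    # Greedy Chord-style routing: each node forwards to the finger
    # (successor of cur + 2^i) that gets closest to the key's owner
    # without passing it; this is what yields the O(log N) hop bound.
    owner = successor(ring, key)
    cur, hops = start, 0
    while cur != owner:
        fingers = [successor(ring, cur + 2 ** i) for i in range(M)]
        legal = [f for f in fingers
                 if 0 < (f - cur) % SPACE <= (owner - cur) % SPACE]
        cur = max(legal, key=lambda f: (f - cur) % SPACE)
        hops += 1
    return hops

ring = sorted({h(f"peer{i}") for i in range(64)})
hops = lookup_hops(ring, ring[0], h("computer"))
```

Each hop at least halves the remaining clockwise distance to the owner, so the hop count stays logarithmic in the ring size.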
Two-layered approach
• Community management
  – Similar to public communities in Flickr, Facebook etc.
• Search within a community
  – Uses Concept Search* as the underlying semantic search scheme
    • Extends syntactic search with available knowledge to realize semantic search
    • Falls back to syntactic search when no knowledge is available
* Fausto Giunchiglia, Uladzimir Kharkevich, Ilya Zaihrayeu, "Concept search", ESWC 2009
Two-layered approach
• Knowledge representation
  – Term-to-concept mapping
  – Concept hierarchy
    • Concept relations expressed as subsumption relations
• Concepts in documents/queries extracted
  – By analyzing words and natural language phrases
  – Noun phrases translated into conjunctions of atomic concepts (complex concepts)
    • E.g., small-1 ∧ dog-2
• Documents/queries represented as enumerated sequences of complex concepts
  – E.g., 1: small-1 ∧ dog-2, 2: big-1 ∧ animal-3
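The representation above can be sketched as plain data: a complex concept is an unordered conjunction of "word-sense" atoms, and a document is an enumerated sequence of such conjunctions. The `complex_concept` helper and the sense numbers are illustrative assumptions, not part of the Concept Search API.

```python
def complex_concept(words_with_senses):
    # e.g. [("small", 1), ("dog", 2)] -> the conjunction small-1 ∧ dog-2;
    # a frozenset suffices because a conjunction is unordered.
    return frozenset(f"{w}-{s}" for w, s in words_with_senses)

# A document (or query) becomes an enumerated sequence of complex concepts,
# one per noun phrase, mirroring the example on the slide.
doc = {
    1: complex_concept([("small", 1), ("dog", 2)]),
    2: complex_concept([("big", 1), ("animal", 3)]),
}
```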
Two-layered approach
• Relevance model
  – Documents having concepts more specific than the query concepts are considered relevant
    • E.g., poodle-1 is relevant when searching for dog-2
  – Ranking done by extending the tf-idf relevance model
    • Also incorporates term-concept and concept-concept similarities
• Distributed knowledge maintenance
  – Each atomic concept indexed on the DHT with an id
  – The node responsible for each atomic concept id also stores the ids of
    • All immediately more specific atomic concepts
    • All concepts on the path to the root from the atomic concept
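The "more specific is relevant" rule reduces to a subsumption check over the concept hierarchy. The sketch below uses a toy child-to-parent map built from the slide's poodle/dog example; the real system's matching also involves similarity scores, which are omitted here.

```python
# Toy concept hierarchy, child -> parent (subsumption edges); entries
# are illustrative, extending the poodle/dog example from the slide.
PARENT = {"poodle-1": "dog-2", "dog-2": "animal-3", "cat-1": "animal-3"}

def subsumed_by(concept, ancestor):
    # True if `concept` equals `ancestor` or is more specific than it,
    # i.e. `ancestor` lies on the path from `concept` to the root.
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False

def matches(doc_concept, query_concept):
    # A document complex concept (set of atoms) matches a query complex
    # concept when every query atom has an equal-or-more-specific doc atom.
    return all(any(subsumed_by(d, q) for d in doc_concept)
               for q in query_concept)
```

So a document about poodle-1 matches a query for dog-2, but a document about cat-1 does not.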
Two-layered approach
• Document indexing and search
  – Concepts mapped to peers using the DHT
  – Queries routed to the peers responsible for the query concepts and related concepts
  – Results from multiple peers combined to give the final results
• Global search
  – The popularity (document frequency) of each concept indexed in the upper layer
  – Tf-idf extended with universal knowledge to search for communities
  – Combined score of a doc = (score of community) × (score of doc within community)
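The combined-score formula on the slide is a simple product; the candidate tuples and score values below are hypothetical, just to show how the product reorders results across communities.

```python
def combined_score(community_score, doc_score):
    # Per the slide: final score of a document = relevance of its community
    # (from the global layer) * its score within that community.
    return community_score * doc_score

# Hypothetical candidates: (doc id, community score, within-community score)
candidates = [("d1", 0.8, 0.5), ("d2", 0.6, 0.9)]
ranked = sorted(candidates,
                key=lambda c: combined_score(c[1], c[2]),
                reverse=True)
```

Note that a document from a slightly less relevant community can still rank first when its within-community score is high enough (0.6 × 0.9 beats 0.8 × 0.5).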
Experiments
• Single layer, syntactic vs semantic: TREC ad-hoc, TREC-8 (simulated with 10,000 peers)
  – WordNet as knowledge base
• Single vs two layers
  – 18 communities (docs: categories in DMOZ*)
    • 18 × 1000 = 18,000 peers simulated
  – UK = domain-independent concepts and relations from WordNet
  – BK = UK + WordNet Domains + YAGO
  – BK mapped to communities
  – Queries selected as the directory path to a specific subdirectory
  – Standard result: documents in that subdirectory
* http://www.dmoz.org/
Experiments
• Tools
  – GATE (NLP), Lucene (search library), PeerSim (peer-to-peer system simulator)
• Performance metrics
  – Quality
    • Precision@10, precision@20
    • Mean average precision (MAP)
  – Network bandwidth
    • Average number of postings transferred
  – Response time
    • s-postings, s-hops
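The quality metrics above are standard; as a quick reference, precision@k and average precision (whose mean over queries gives MAP) can be computed as follows. The ranked list and relevance judgments are made-up examples.

```python
def precision_at_k(ranked, relevant, k):
    # Fraction of the top-k ranked documents that are relevant
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    # Average of precision@i over the ranks i where a relevant doc appears;
    # MAP is this value averaged over a set of queries.
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / max(len(relevant), 1)

# Made-up example: relevant docs appear at ranks 1 and 3
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
```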
Results (1 layer, syntactic vs semantic)
• Quality improved
• But cost also increased
Results (1 layer vs 2 layers)
• Quality improved
• Cost decreased
  – 94% decrease in posting transfer for the optimized case
Two-layered approach: results
• The proposed approach gives better quality and performance than single-layered approaches
  – Performance can be further improved using optimizations like early termination
• But the issue of query spikes remains
Query spikes in peer-to-peer search
• Query spikes can lead to instability
  – Replication/caching insufficient due to high document creation rate*
* The rate of queries related to "Bin Laden" increased by 10,000 times within one hour on Google on May 1, 2011, after Operation Geronimo.
Some background
• Term-partitioned search
  – Responsibility for a term/popular query assigned to an individual peer
    • Updates and queries are sent to the responsible peer, which processes them
  – Term-to-peer mapping done using a Distributed Hash Table (DHT)
(Diagram: responsible peer maintaining the top-k result list of query q)
Cloud-assisted p2p search (CAPS)
• Offload responsibilities for spiking queries to a public cloud
Issues in realizing CAPS
• Maintaining a full index copy in the cloud is very expensive
  – Storage alone would cost more than 5 million dollars per month*
• Approach: transfer only the relevant index portion to the cloud
  – Needs to be performed fast, considering the effect on user experience (result quality, response time)
• Effect on the desirable properties of peer-to-peer search
  – Privacy, transparency, decentralized control etc.
CAPS components
• Switching decision maker
  – Decides when to switch
  – Simple example: "switch when the query rate increases by X% within the last Y seconds"
• Switching implementor
  – Switching algorithm to seamlessly transfer the index partition
  – Dynamic creation of cloud instances
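The example switching rule can be sketched as a small rate monitor. This is a hypothetical illustration of the "X% within Y seconds" rule only; the class name, thresholds and window sizes are assumptions, and the deck does not specify the actual decision logic.

```python
from collections import deque

class SpikeDetector:
    # Hypothetical sketch: switch to the cloud when the query count grows
    # by more than X% between consecutive Y-second windows.
    def __init__(self, threshold_pct=100.0, window_secs=10.0):
        self.threshold = threshold_pct
        self.window = window_secs
        self.events = deque()

    def record(self, t):
        # Record a query arrival at time t; keep two windows of history.
        self.events.append(t)
        while self.events and self.events[0] < t - 2 * self.window:
            self.events.popleft()

    def should_switch(self, now):
        # Compare the query count in the current window with the previous one.
        current = sum(1 for t in self.events if t > now - self.window)
        previous = len(self.events) - current
        if previous == 0:
            return False  # no baseline yet
        growth_pct = (current - previous) / previous * 100
        return growth_pct > self.threshold

det = SpikeDetector(threshold_pct=100, window_secs=10)
for t in range(10):                  # baseline: about 1 query/second
    det.record(float(t))
for t in range(10, 20):              # spike: about 5 queries/second
    for k in range(5):
        det.record(t + k / 10)
```

After the simulated spike, `det.should_switch(20.0)` reports that the rule has fired and the responsible peer can begin offloading to the cloud.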
CAPS switching algorithm
• Ensures that result quality is not affected
• Controlled bandwidth usage at the peer
Addressing additional concerns
• Transparency
  – Index resides both among the peers and in the cloud
• Centralized control
  – Queries can be switched back to the peers or to other clouds
• Privacy
  – Only spiking queries (less revealing) are forwarded to the cloud
• Cost
  – Cloud used only transiently, for spiking queries
  – Cloud payment model: anonymous keyword-based advertising model*
CAPS evaluation
• Experimental setup
  – Target system consists of millions of peers
  – Implemented the relevant components in a realistic network
    • Responsible peer, preceding peers, cloud instance
• Datasets
  – Real datasets on queries and corresponding update rates are not publicly available
  – Used synthetic queries and updates with the expected query/update rates and ratio
Experimental setup
• 6 heterogeneous workstations with 4-6 cores and 8-16 GB RAM used
Experiments
• Two sets of experiments
  1. Demonstrate the effect of a query spike with and without cloud assistance
  2. Effect of switching on user experience
     • Response time and result quality
     • Switching time
Results 1
(Charts: query handling with vs. without cloud assistance)
Results 2 (effect of switching on user experience)
• Result freshness
• Response time
Conclusions
• Peer-to-peer search has many advantages by design compared to centralized search
• But peer-to-peer search approaches have scalability issues
• A two-layered approach can improve the efficiency and result quality of peer-to-peer search
• Offloading queries to the cloud can be an effective method to handle query spikes
  – Desirable properties of p2p systems are not lost
Publications
• Janakiram Dharanipragada and Harisankar Haridas, "Stabilizing peer-to-peer systems using public cloud: A case study of peer-to-peer search", in the 11th International Symposium on Parallel and Distributed Computing (ISPDC 2012), Munich, Germany.
• Janakiram Dharanipragada, Fausto Giunchiglia, Harisankar Haridas and Uladzimir Kharkevich, "Two-layered architecture for peer-to-peer concept search", in the 4th International Semantic Search Workshop, co-located with the 20th Int. World Wide Web Conference (WWW 2011), Hyderabad, India.
• Harisankar Haridas, Sriram Kailasam, Prateek Dhawalia, Prateek Shrivastava, Santosh Kumar and Janakiram Dharanipragada, "Vcloud: A peer-to-peer video storage-compute cloud", in the 21st International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC 2012), Delft, The Netherlands [Poster].
THANK YOU Questions/Suggestions harisankarh[ at ]gmail.com