Strategies for Processing and Explaining Distributed Queries on Linked Data

50 %
50 %
Information about Strategies for Processing and Explaining Distributed Queries on Linked Data

Published on March 7, 2014

Author: rakebhasan

Source: slideshare.net

Description

Strategies for Processing and Explaining Distributed Queries on Linked Data

Strategies for Processing and Explaining Distributed Queries on Linked Data Rakebul Hasan Wimmics Inria Sophia Antipolis

Research Theme • Distribute Query Processing – Optimization techniques for querying Linked Data • Distributed Query Explanation – Query plan explanation • How query solving works – Query result explanation • Why • Where 1

DISTRIBUTE QUERY PROCESSING 2

Querying Linked Data • Bottom-up strategies – Discover sources during query processing by following links between sources • Top-down strategies – Sources are known 3

Querying Linked Data FedBench Query CD5: Find the director and the genre of movies directed by Italians SELECT ?film ?director ?genre WHERE { ?film dbpedia-owl:director ?director. ?director dbpedia-owl:nationalitydbpedia:Italy . ?xowl:sameAs ?film . ?xlinkedMDB:genre ?genre } 4

Querying Linked Data FedBench Query CD5: Find the director and the genre of movies directed by Italians SELECT ?film ?director ?genre WHERE { SERVICE <http://dbpedia.org/sparql>{ ?film dbpedia-owl:director ?director. ?director dbpedia-owl:nationalitydbpedia:Italy . ?xowl:sameAs ?film . } SERVICE <http://data.linkedmdb.org/sparql> { ?xlinkedMDB:genre ?genre } } SPARQL 1.1 SERVICE clause Need knowledge of which part of the query should be solved by which endpoint 5

• Top-down approaches – Data warehousing approach – Virtual integration approach 6

Querying Linked Data • Data warehousing approach – Collect the data in a central triple store – Process queries on that triple store • Disadvantages • Expensive preprocessing (data collection + integration) and maintenance • Data not up to date 7

Querying Linked Data • Virtual integration approach – A query federation middleware • Parse and split into subqueries • Source selection for subqueries • Evaluate the subqueries against corresponding sources directly • Merge the results – Advantages • no preprocessing and maintenance • up to date data 8

Running Example Query LS6: Find KEGG drug names of all drugs in Drugbank belonging to category Micronutrient drugbank http://www4.wiwiss.fu-berlin.de/drugbank/sparql SELECT ?drug ?title WHERE { ?drug drugbank:drugCategorydrugbank-category:micronutrient . ?drug drugbank:casRegistryNumber ?id . ?keggDrugrdf:typekegg:Drug . ?keggDrug bio2rdf:xRef ?id . ?keggDrugpurl:title ?title . } KEGG http://cu.bioportal.bio2rdf.org/sparql DBpedia http://data.linkedmdb.org/sparql 9

Parsing and Source Selecting drugbank Query LS6: Find KEGG drug names of all drugs in Drugbank belonging to category Micronutrient http://www4.wiwiss.fu-berlin.de/drugbank/sparql ?drug drugbank:drugCategorydrugbank-category:micronutrient ?drug drugbank:casRegistryNumber ?id SELECT ?drug ?title WHERE { ?drug drugbank:drugCategorydrugbank-category:micronutrient . ?drug drugbank:casRegistryNumber ?id . ?keggDrugrdf:typekegg:Drug . ?keggDrug bio2rdf:xRef ?id . ?keggDrugpurl:title ?title . } KEGG http://cu.bioportal.bio2rdf.org/sparql ?keggDrugrdf:typekegg:Drug ?keggDrug bio2rdf:xRef ?id Send ask queries to all the sources ASK {?drug drugbank:drugCategorydrugbank-category:micronutrient } ASK {?drug drugbank:casRegistryNumber ?id } ASK {?keggDrugrdf:typekegg:Drug } ASK {?keggDrug bio2rdf:xRef ?id } ASK {?keggDrugpurl:title ?title } DBpedia http://data.linkedmdb.org/sparql ?keggDrugpurl:title ?title 10

Evaluating subqueries • Two options – All triple patterns are individually evaluated – Nested loop join (NLJ): evaluate iteratively pattern by pattern 11

Evaluating subqueries • Example of NLJ SELECT ?drug ?title WHERE { ?drug drugbank:drugCategorydrugbank-category:micronutrient . ?drug drugbank:casRegistryNumber ?id . ?keggDrugrdf:typekegg:Drug . ?keggDrug bio2rdf:xRef ?id . ?keggDrugpurl:title ?title . } 12

Optimization techniques • Source Selection – Indexing: characterization of RDF graphs – Statistics based catalogue – Caching 13

Optimization techniques • Exclusive grouping SELECT ?drug ?title WHERE { ?drug drugbank:drugCategorydrugbank-category:micronutrient . ?drug drugbank:casRegistryNumber ?id . ?keggDrugrdf:typekegg:Drug . ?keggDrug bio2rdf:xRef ?id . ?keggDrugpurl:title ?title . } 14

Optimization techniques • Bound join – Effective for NLJ – Mappings of variable values from the intermediate result to the next subquery – SPARQL 1.1 VALUES clause 15

Optimization techniques • Hash join – Hash table for intermediate mappings 16

Optimization techniques • FILTER optimization – Evaluating the subqueries with corresponding FILTERS – Reduces the number of intermediate results 17

Optimization techniques • Parallelization – Effective for individual subquery evaluation approach 18

Optimization techniques • Join order – Selectivity estimation: join order heuristic based on the number of bound and free variables [1] – Statistics: based on cardinalities of the triple patterns [2] [1] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL basic graph pattern optimization using selectivity estimation. In Proceedings of the 17th international conference on World Wide Web (WWW '08) [2] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K. Sattler, and J. Umbrich. Data summaries for on-demand queries over linked data. In Proceedings of the 19th international conference on World wide web (WWW '10) 19

Selectivity Estimation • Ideal for the Linked Data scenario – No need for statistics about the underlying data – Estimations can be wrong however BGP Number of results Selectivity estimation ?v1 p1 ?v2 ?v1 p2 ?v3 5 2 ?v4 p3 c1 ?v4 p4 ?v1 10 1 20

Query Performance Prediction • Learn query performance from already executed queries • Statistics about the underlying data not required • Applications – Query optimization: join order – Workload management/scheduling 21

Regression Learn a mapping function f(X) = y X = feature vector, vector representation of SPARQL query y = performance metric (e.g. latency, number of results) Support vector machine with nu-SVR k-nearest neighbors regression 22

Feature Extraction • How can we represent SPARQL queries as vectors? 23

SPARQL • SPARQL algebra features • Graph pattern features 24

SPARQL Algebra http://www.w3.org/TR/sparql11-query/#sparqlQuery 25

SPARQL Algebra Features 26

Graph Pattern Features 27

Graph Pattern Features • Clustering Training Queries – K-mediods clustering algorithm with approximated edit distance as distance function • Selects data points as cluster centers • Arbitrary distance function 28

Graph Pattern Features • Graph Edit Distance – Minimum amount of distortion needed to transform one graph to another – Compute similarity by inversing distance 29

Graph Pattern Features • Graph Edit Distance – Usually computed using A* search • Exponential running time – Bipartite matching based approximated graph edit distance with • Previous research shows accurate results with classification problems 30

Experiment Setup • Triple store and hardware – Jena TDB 1.0.0 – 16 GB memory – 2.53 GHz CPU – 48 GB system RAM – Linux 2.6.32 operating system 31

Experiment Setup • Datasets – Training, validation, test datasets generated from 25 DBPSB query templates • 1260 training queries • 420 validation queries • 420 test queries – RDF data: DBpedia 3.5.1 with 100% scaling factor from DBPSB framework 32

Prediction Models • Support Vector Machine (SVM) with the nuSVR kernel for regression • k-nearest neighbors (k-NN) regression – Euclidean distance as the distance function – k-dimensional tree (k-d tree) data structure to compute the nearest neighbors 33

Evaluation Measures • Coefficient of determination • Root mean squared error (RMSE) 34

Predicting Query Execution Time • SPARQL algebra features 35

Predicting Query Execution Time • SPARQL algebra and graph pattern features 36

What’s Next • Apply QPP in join order optimization • Benchmarking – FedBench: a benchmarking framework for federated query processing • Systematic generation of training queries – Bootstrapping – Refining training queries from query logs 37

Statistical Analysis of Query Logs • Approach to Systematic generation of training queries Mario Arias, Javier D. Fernández, Miguel A. Martínez-Prieto, Pablo de la Fuente: An Empirical Study of Real-World SPARQL Queries, 1st International Workshop on Usage Analysis and the Web of Data, co-located with the 20th International World Wide Web Conference (WWW2011) 38

Summary • Distributed query processing • Optimization techniques • Query performance prediction – Join order optimization 39

DISTRIBUTED QUERY EXPLANATION 40

Query Explanation • Query plan explanation • Query result explanation • Motivation – Understanding – Transparency – Trust 41

Query Explanation in the Semantic Web Query plan explanation Query result explanation Jena Sesame Virtuoso * BigOWLIM + * Explanation of SPARQL to SQL translation + a debugging feature on the query engine side 42

Distributed Query Explanation in the Semantic Web Query plan explanation Query result explanation FedX DARQ SemWIQ ADERIS Anapsid 43

ADERIS Query Plan Explanation 44

Related Work • Database: Why, how, where provenance • Semantic Web: – Inference explanation • Provenance • Generating justifications 45

Distributed Query Explanation What We Provide • Query plan explanation – Work in progress – Prior to query execution • Includes predicted performance metrics – After query execution with performance metrics • Query result explanation – Why provenance – Where provenance 46

Query Result Explanation (RDF model, Query, Result) Result Explainer Plug-in Can be any RDF model (RDF graph, SPARQL endpoint) We generate result explanations by querying this model for why provenance 47

Why Provenance • Triples in virtual model from which the result is derived <http://example.org/book/book8> <http://purl.org/dc/elements/1.1/creator> <http://www-sop.inria.fr/members/Alice> ; <http://purl.org/dc/elements/1.1/title> "Distributed Query Processing for Linked Data" . <http://www-sop.inria.fr/members/Charlie> <http://xmlns.com/foaf/0.1/name> "Charlie" . <http://www-sop.inria.fr/members/Alice> <http://xmlns.com/foaf/0.1/knows> <http://www-sop.inria.fr/members/Charlie> ; <http://xmlns.com/foaf/0.1/name> "Alice" . 48

Where Provenance • Keep the provenance of sources in the virtual model fedqld:source1 { <http://example.org/book/book8> <http://purl.org/dc/elements/1.1/creator> <http://www-sop.inria.fr/members/Alice> ; <http://purl.org/dc/elements/1.1/title> "Distributed Query Processing for Linked Data" . } fedqld:source2 { <http://www-sop.inria.fr/members/Charlie> <http://xmlns.com/foaf/0.1/name> "Charlie" . Where the triples in this graph come from <http://www-sop.inria.fr/members/Alice> <http://xmlns.com/foaf/0.1/knows> <http://www-sop.inria.fr/members/Charlie> ; <http://xmlns.com/foaf/0.1/name> "Alice" . } fedqld:prov { fedqld:source1 void:sparqlEndpoint<http://localhost:3030/books/query> . fedqld:source2 void:sparqlEndpoint<http://localhost:3031/person/query> . } 49

What’s next • Explanation user interfaces • Evaluating the impacts of our explanations 50

References • • • Andreas Schwarte, Peter Haase, KatjaHoose, Ralf Schenkel, and Michael Schmidt. Fedx: A federation layer for distributed query processing on linked open data. In ESWC, 2011 Shinji Kikuchi, Shelly Sachdeva, SubhashBhalla, Steven Lynden, Isao Kojima, AkiyoshiMatono, and Yusuke Tanimura. Adaptive integration of distributed semantic web data. Databases in Networked Information Systems Michael Schmidt, Olaf Görlitz, Peter Haase, Günter Ladwig, Andreas Schwarte, and Thanh Tran. 2011. FedBench: a benchmark suite for federated semantic data query processing. In Proceedings of the 10th international conference on The semantic web - Volume Part I (ISWC'11) 51

Add a comment

Related presentations

Related pages

Predicting SPARQL Query Performance and Explaining Linked Data

Predicting SPARQL Query Performance and Explaining Linked Data? ... distributed scenario, consumers of these data may need ... Web query processing, ...
Read more

Distributed Query Architecture

... OLE DB data sources in Transact ... name to an OLE DB data source. Objects in these linked servers ... OLE DB providers. Distributed queries can ...
Read more

FedX: Optimization Techniques for Federated Query ...

... Optimization Techniques for Federated Query Processing on Linked Data. Look ... Especially in distributed settings that require joining data ...
Read more

Distributed Queries

Distributed queries access data from ... Queries. Distributed queries in distributed ... for distributed queries and linked ...
Read more

Distributed Database Concepts - Oracle Help Center

Distributed Database Concepts. ... system that can appear to applications as a single data source. Distributed processing ... Distributed Query ...
Read more

Advantages of Distributed Data Processing | Chron.com

Business Planning & Strategy | ... Distributed data processing considerably lowers the ... Individual computers that comprise a distributed network are ...
Read more

Learning from the History of Distributed Query Processing ...

... for query processing over Linked Data. Being based upon techniques originally developed for distributed ... Distributed Query Processing ...
Read more

An Architecture for Fast and General Data Processing on ...

An Architecture for Fast and General Data Processing on Large Clusters Matei Zaharia Electrical Engineering and Computer Sciences University of California ...
Read more