Published on January 9, 2009
A New Enterprise Data Management Strategy for the US Government: Support for the Semantic Web : 1
Brand Niemann, Senior Enterprise Architect, US EPA
Co-chair, Federal Semantic Interoperability Community of Practice (SICoP)
Position Paper for the W3C Workshop on RDF Access to Relational Databases
Hosted by Novartis, Cambridge, MA
October 25-26, 2007

Overview : 2
1. U.S. Government Data
2. Federal Enterprise Architecture Data Reference Model
3. SICoP White Papers for the Federal CIO Council
4. SICoP White Paper Updates for the Federal Community
5. Federal Statistical Data System
6. U.S. EPA Report on the Environment 2007
7. DRM 3.0 and the Semantic Web
8. RDF from Data Tables and Relational Databases
9. Recommendations
10. Post Script

1. U.S. Government Data : 3
Not readily accessible to search engines and reuse projects:
  See June 2007 W3C / WSRI Workshop.
Major data projects need an enterprise architecture for funding:
  See Federal Enterprise Architecture.
Working on both of these problems:
  Federal Sitemaps Initiative.
  Position Paper for This Workshop.

2. Federal Enterprise Architecture Data Reference Model : 4
Source: Expanding E-Government, Improved Service Delivery for the American People Using Information Technology, December 2005, pp. 2-3. http://www.whitehouse.gov/omb/budintegration/expanding_egov_2005.pdf with annotations by the author.
Figure annotations: DRM 1.0 SICoP Ontologies All Three DRM 3.0 Unify

3. SICoP White Papers for the Federal CIO Council : 5
SICoP White Paper Series Module 1 (February 16, 2005): Introducing Semantic Technologies and the Vision of the Semantic Web ("DRM of the Future"):
  W3C Semantic Web and DARPA DAML Program/SICoP Semantic Web Applications for National Security (SWANS) Conference April 7-8, 2005 (40 exhibits).
DRM 2.0 Implementation Guide Version 1.0 (October 15, 2005) and DRM 2.0 Education Pilot.
SICoP White Paper Series Module 2 (January 6, 2006): Semantic Wave 2006 - Executive Guide to the Business Value of Semantic Technologies:
  Semantic Wave 2007 Update at the 2007 Semantic Technology Conference. Also see Four SICoP Contributions to the 2007 Semantic Technology Conference.
SICoP White Paper Series Module 3 (June 18, 2007): Operationalizing the Semantic Web/Semantic Technologies: Advanced Intelligence Community R&D Meets the Semantic Web! (ARDA AQUAINT Program):
  A roadmap for agencies on how they can take advantage of semantic technologies and begin to develop Semantic Web implementations.
  Semantic Interoperability – Yes!

4. SICoP White Paper Updates for the Federal Community : 6
SICoP is working on updates to each of its three White Papers as follows:
1. Semantic Interoperability Data Management Strategy: Net-Centric Operations Industry Consortium (NCOIC) and Others, Brand Niemann, US EPA (September 2007 Draft):
  Semantic Interoperability: The What, Why, Who, and How.
  Semantic Interoperability: My NCOIC Roadmap.
2. Semantic Wave 2008: Industry Roadmap to Web 3.0, Mills Davis, Project 10X (October 2007 Draft):
  Semantic Social Computing, Web 2.0 Summit Brief, and Semantic Desktop Pilot (TWINE from Radar Networks).
3. Semantic Interoperability with Relational Databases (e.g. Data marts and Data warehouses): Solving the Schema Mismatch Problem with Ontology, Lucian Russell, Private Consultant (December 2007 Draft):
  In conjunction with the Interoperable Knowledge Representation for Intelligence Support (IKRIS) Program.

5. Federal Statistical Data System : 7
About 200 programs in 70 agencies!
Decennial Census - moving towards more frequent and detailed surveys (e.g. American Community Survey).
Annual Statistical Abstract - most popular government data publication (about 40 chapters in PDF & 1500 data tables in Excel).
FedStats pilot of federation of databases (distributed content network using XML for the data and XML for distributed queries).
Repurposed documents and databases and recombined data and metadata to support information sharing and reuse.
2003 Annual Statistical Abstract Data Table Example (presentation) and (XML database with metadata)

5. Federal Statistical Data System : 8
Like to do for government data tables in Excel:
  MindSwap Utility: The ConvertToRDF tool is designed to take plain-text delimited files, like .csv files dumped from Microsoft Excel, and convert them to RDF.
Like to do for a few selected government relational databases:
  Digital Harbor Composite Applications Pilots with Business Ontology (Voting and Census Data)
Like to do for lots of selected government relational databases:
  Tried in 1999 without the benefit of RDF/OWL and newer technologies.

6. U.S. EPA Report on the Environment 2007 : 9
Spent lots of time and money on peer review, production of comprehensive metadata, and electronic publication.
Specifically, EPA's 2007 Report on the Environment contains thorough documentation and standard metadata templates for the 86 indicators selected using six criteria based on EPA’s Information Quality Guidelines and a Peer Review Process described in Appendix B of the report.
Basis for showing a New Enterprise Data Management Strategy for the US EPA.
Want to use RDF and reason over this data and metadata.

6. U.S. EPA Report on the Environment 2007 : 10
The Summary Statistics of the Data Asset Database
* One question without an indicator.

6. U.S. EPA Report on the Environment 2007 : 11
The individual data tables with their elements and attributes were compiled into 5 multi-sheet spreadsheets, one for each of the 5 topics in the 2007 EPA Report on the Environment.
The multi-sheet spreadsheet for “water” is shown for the index (table of contents) and the Exhibit 5-2 indicator data tables.
Question: Is this the right thing to tell people to do to get ready for RDF/OWL?

6. U.S. EPA Report on the Environment 2007 : 12

6. U.S. EPA Report on the Environment 2007 : 13

7. DRM 3.0 and the Semantic Web : 14
Knowledgebases are defined as:
  A semantic model = ontology(s) + the database of instances built as a social contract between those who know how to build them and those who need them (business partners).
An ontology is a formal description of the meaning of the information used by software systems.
Just as relational databases use SQL as a query language, ontologies developed using Semantic Web standards are queried with a query language called SPARQL.
SPARQL is a simple yet powerful language. A single SPARQL query can combine selection criteria based on the data values as well as their meaning.
Unlike relational databases and SQL, which are tightly bound to a specific data model, ontologies are highly flexible, making it possible to: (1) easily accommodate changes in the data model, and (2) create generic queries that work in multiple situations and don't need changing when the data model must change.

7. DRM 3.0 and the Semantic Web : 15
Building DRM 3.0 Knowledgebases: Where Do the Semantics Come From?:
  Interoperable Knowledge Representation for Intelligence Support (IKRIS) has now produced the ISO Common Logic Standard (ISO/IEC 24707:2007).
Building DRM 3.0 for the Federal Community, February 6, 2007:
  Free Text (unstructured) - Language Computer Corporation - extract about 40 semantic relationships and build an ontology.
  Databases (structured) - Princeton WordNet
  Knowledgebase (reasoning) - Open CYC
Data Modeling and OWL: Two Ways to Structure Data, David Hay, Essential Strategies, Inc. See next slides.

7. DRM 3.0 and the Semantic Web : 16
Data Modeling and OWL: Two Ways to Structure Data, David Hay, Essential Strategies, Inc.:
  Objectives of a Data Model:
    Capture the semantics of an organization.
    Communicate these to the business without requiring technical skills.
    Provide an architecture to use as the basis for database design and system design.
    Now: Provides the basis for designing Service Oriented Architectures.
  See http://www.semantic-conference.com/2007/handouts/2-UpBW/Hay_David_2_2UpBW.pdf

7. DRM 3.0 and the Semantic Web : 17
Data Modeling and OWL: Two Ways to Structure Data, David Hay, Essential Strategies, Inc. (continued):
  Synopsis:
    Both data modeling and ontology languages represent the structure of business data (ontologies).
    Data modeling represents data being collected, and filters it according to the rules.
    Ontology languages represent data being used, with the ability to have a computer make inferences.
  Comment from Lucian Russell (SICoP White Paper 3 Author): So ontology can improve data quality in legacy systems (David Hay agreed) and solve the Schema Mismatch Problem (recall slide 6).

8. RDF from Data Tables and Relational Databases : 18
At the June 2007 W3C/WSRI Workshop entitled “Toward More Transparent Government on eGovernment and the Web”, SICoP suggested that a clear message about the role of RDF in data exchange and a series of pilots using government data sources would help educate and demonstrate the value of the Semantic Web (aka the Data Web) to the Federal Government.
Consumers and potential consumers of RDF data will provide use cases and goals (e.g. SICoP).
The W3C has a new Semantic Web Layer Cake (see next slide) in which RDF has moved into the XML space and has been expanded with query and rules!

8. RDF from Data Tables and Relational Databases : 19

9. Recommendations : 20
A New Enterprise Data Management Strategy for the US Government Based on:
  The premise of reusing the data and information rather than changing the data systems themselves:
    Putting the business and technical rules, logic, etc. into the data itself using markup languages.
  The concepts and standards of the Semantic Web:
    Also called the Data Web or Web 3.0.
The most important tenets of the reuse are:
  Bring the data and the metadata back together.
  Bring the structured and unstructured data and information back together.
  Bring the data and information description and context back together.
Looking for partners to work with Federal Government and US EPA data and metadata sources.

10. Post Script : 21
Google 2.0 Embraces Semantic Web: Google's new Programmable Search Engine might require more work from agency Webmasters, though increased site visibility may result from the effort. Government Computer News, May 18, 2007.
Equity research firm Bear, Stearns & Co. report concludes that Google can become the Semantic Web because of:
  Recent patents filed by Ra Guha, co-creator of RDF.
  Supporting infrastructure – 400,000 servers in 100 data centers.
  Lack of interest/focus by Microsoft.
Ra Guha denied the accuracy of this story and report at the 2007 Semantic Technology Conference, May 22, 2007, SDForum Meeting.

10. Post Script : 22
Common Barriers to Web Search Engine Crawling
EPA Web Sites with Uncrawlable Databases
Strategies for Access to EPA Databases Now Closed to Search Engine Crawlers

Common Barriers to Web Search Engine Crawling : 23
What can make a site effectively invisible to search engine users:
  Content “hidden” behind search forms
  Non-HTML links
  Outdated robots.txt crawling restrictions
  Server errors (crawler times out when fetching content)
  Orphaned URLs
  Rich media: audio, video
  Premium content
Source: J.L. Needham (Google), Ensuring government is only one search away: Implementing the Sitemap protocol

EPA Web Sites with Uncrawlable Databases : 24
Sample list of EPA Web sites with uncrawlable databases: http://spreadsheets.google.com/pub?key=pUb62ZKHnzgqEoGF4LFf3Gw
Total: 27

Strategies for Access to EPA Databases Now Closed to Search Engine Crawlers : 25
Get Web database vendors (Oracle, IBM, etc.) to support automatic generation of the Sitemaps Protocol Files.
Convert to HTML.
Repurpose to XML.
Repurpose to Semantic Knowledgebases (RDF/OWL).
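
As an illustration of the first strategy, the sketch below shows one way a Sitemaps Protocol file could be generated from the keys of a database whose records sit behind a search form. It is a minimal sketch, not a vendor feature or an EPA tool: the function name build_sitemap, the record keys, and the example.gov URL pattern are all hypothetical. Only the urlset/url/loc element names and the namespace come from the Sitemaps protocol itself.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol (sitemaps.org), version 0.9.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(record_ids, url_template):
    """Return a Sitemap XML string with one <url> entry per database record.

    record_ids   -- iterable of record keys (hypothetical; in practice these
                    would come from a query against the database)
    url_template -- URL pattern containing "{id}" that resolves to the public
                    page for one record (hypothetical)
    """
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for record_id in record_ids:
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        # ElementTree escapes characters such as "&" when serializing,
        # which the protocol requires for URLs.
        loc.text = url_template.format(id=record_id)
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical record keys and URL pattern, for illustration only.
sitemap_xml = build_sitemap(
    ["1001", "1002"],
    "https://example.gov/records?id={id}",
)
print(sitemap_xml)
```

A crawler that reads such a file can reach each record page directly, without having to submit the search form that otherwise hides the content.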