Published on April 17, 2008
Challenges in Building a Strategic Information Infrastructure: Laura Haas, Distinguished Engineer and Manager, Information Integration Architecture.
Agenda: The integration challenge; two motivating examples; technologies for information integration; areas where invention is needed.
The Integration Challenge: Complex and heterogeneous environments (many different types of systems, many inter-related applications). Escalating needs (variety, velocity, volume). People are expensive. The world produces 250 MB of information every year for every man, woman and child on earth.
The Challenge Continued…: Sources: IBM & Industry Studies, Customer Interviews, Forrester. 40% of IT budgets may be spent on integration. 30% of people’s time is spent searching for relevant information. The average billion dollar company has 48 disparate financial systems and 2.7 ERP systems. 42% of transactions are still paper-based. 85% of information is unstructured. Diagram labels: transactions, documents, reports, e-mails, media, customers, employees, partners, databases, organizations, financials, products, Web content. 79% of companies have more than two repositories and 25% have more than 15. More than 60% of CEOs say they need to do a better job capturing and understanding information rapidly in order to make swift business decisions. Only one third of CFOs believe that the information is easy to use, tailored, cost effective or integrated. 30-50% of application design time is spent on copy management.
Taikang Life Insurance: Background: 4th largest Chinese insurance company; 8,000 employees and 150,000 agents; 3.5 million customers; 28 branches and 170 sub-branches. Technical challenge: data in DB2 UDB, Informix, Oracle, SQL Server, XML, e-mail, CRM and Portal applications. Business challenge (goals): up-to-the-minute status for executives; increased employee productivity; better customer service.
Taikang Information Platform – Before Integration: Diagram labels: channels (Phone, Fax, SMS, E-mail, Web, Store Front, Letter); systems (CSC, Personal Life Insurance Systems, Group Life, Banking Insurance, Financials); separate Client Data stores. Channels not effectively integrated; client data dispersed and not effectively shared. Multiple application systems, multiple application development tools.
Taikang Integrated Information Platform Architecture: Diagram labels: Channels (Phone, Fax, SMS, Email, Web, Store Front, Mail, Agents, Financial Planner); Application Platform; Information Integration Platform (Mapping (nicknames), Integrated Information, Data Service, XML, SQL, Web Services); Core Systems (Group & Banking, CSC, Personal Life, Financials).
Creating Enterprise Reference Information: Diagram labels: Web Hierarchy and Sub Category; Marketing Benefits; Cross-Sell & Up-sell; Promo.; Price; Sizes; Colors; Images.
Challenges in Integrating Information: Structured and unstructured data. Diversity of data sources (content repos, pricing application, databases, …). Coming up with the model of how information fits together: understanding what info exists; finding related pieces; creating a common format. Deciding how to access and transform data: what should be materialized, what accessed in real-time, how maintained; what pre-defined paths, what unplanned (navigation vs.
search). Configuring the appropriate software. Accessing information in the application. Monitoring the system and understanding usage, problems, etc.
Multiple Technologies Are Needed: Discovery and preparation (metadata and information registries; exploration, analysis, cleansing). Transformation within and across models (e.g., record -> record, relational -> XML). Integration (consolidation, federation). Connection to the applications (push and pull; interfaces appropriate to the tasks; services). Systems management (maintenance, monitoring, fault tolerance, …).
A Platform for Information Integration: Multiple access paradigms; multiple integration disciplines.
Complementary Information Integration Approaches: Consolidate (“place”) data for local access when: access performance or availability requirements demand centralized data; currency requirements demand point-in-time consistency, e.g. close of business; complex transformation is required to achieve semantically consistent data. Typical targets: production applications, data warehouses, operational data stores. Typically managed by ETL (Extract, Transform, and Load) or replication technologies. Federate data for integrated access to distributed sources when: access performance and load on source systems can be traded for overall lower cost implementation; currency requirements demand a fresh copy of the data; data security, licensing restrictions, or industry regulations restrict data movement; combining mixed format data, e.g. customer ODS with related contract documents or images; query requires real-time data, e.g.
stock quote, on-hand inventory. Search provides a third option: search a local index, create a local result set (hit list); distributed access to live data is possible through result set links. No single approach is the best for all scenarios.
Consolidation prepares the data in advance: Scripts and hand-written applications. Extract, Transform, Load (ETL) tools: uses include building warehouses, data marts, operational data stores, …; typically include a data flow editor to define jobs, code generation, libraries of functions for doing transformations, and connectors to many information sources. Replication: uses include high availability, warehouse maintenance, application integration, …; includes changed data capture (log readers, triggers, application programs), transport/storage for changes, and logic to apply changes to the replica.
SQL Federation: Diagram labels: Federated Database Server; Data; Relational Data Source; Global Catalog; SQL API (JDBC/ODBC); Wrappers. Sample records: 00001|SONY|Television|... 00002|RCA|VideoPlayer|.. 00004|SONY|DVDPlayer 00003|SONY|VideoRecorder .......
Database Application. Application query: SELECT I.man, count(*) FROM transactions T, items I WHERE I.id=T.item_id AND I.category='Television' AND YEAR(T.tran_date)=2001 GROUP BY I.man; Subquery on TRANSACTIONS: SELECT tran_date, item_id FROM transactions T WHERE YEAR(T.tran_date)=2001. Tables: ITEMS, TRANSACTIONS. Task: list the number of TV sales per manufacturer in 2001. Desired properties: transparency, heterogeneity, extensibility, high function, autonomy, performance.
Search versus Query: Search: the user doesn’t need to know where the information is, the structure of the information (schema), or precisely how the information is expressed; the user can use native language; “search” typically focuses on returning documents, but could become “information finding”. Query: the information need can be (must be) expressed precisely; information can be combined and summarized in powerful ways. Both provide great value in integration scenarios. Query fragment shown on the slide: SELECT I.man, count(*) FROM transactions T, items I
Invention Needed for Information Integration: Semantic integration (metadata; discovery and design tools). Virtualizing large-scale systems (information integration in grid environments; data placement). Precise queries over unstructured information (text analytics).
Metadata Landscape (not exhaustive).
Metadata-driven Design Across Integration Disciplines: Diagram labels: Build These; Using These; New Business Process; New Integrated View; New DataFlow; Web Service; Legacy and packaged apps; Relational databases; XML documents; WBI; II; ETL. Integration tasks: find and visualize related information; connect it together; generate useful information or artifacts; remember what you discovered and share it.
Clio: Schema Discovery and Mapping for Integration: Find it: Discovery. Use ontologies and graph algorithms to find similar objects (for mapping, e.g.)
Connect it: Mapping algorithms. Using mapping composition to handle schema evolution; inverse mapping; advanced features in mapping semantics (conditional mapping, “nested” mapping, ETL-like procedural constructs); round trip support between mappings and generated queries; mapping-based data lineage in the context of query execution. Generate it: Transformations. XML transformation engine; schema integration.
Grid Computing: Diagram labels: Storage; Applications; Processing; Operating System; Data; I/O. Distributed computing over heterogeneous resources, using open standards to provide virtual services.
Information Integration in a Large-Scale Grid: Dynamic, distributed data access: directory service and/or p2p discovery protocols; logical specification of data desired; handle dynamic arrival and departure of data sources; automated data transformations without human intervention. Graceful degradation/fault tolerance, to handle data source failures, missing data sources and performance issues: use of redundancy to mask failures; partial result ("friendly") delivery when failures can't be hidden. Automatic placement of data for performance, scalability, availability: policy- and workload-driven; quality of service and data guarantees; economic model for location, placement? Defining and working with data quality: What characteristics matter? What’s a “good” answer? How does quality compose across sources? Across characteristics? For different activities?
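The graceful-degradation bullets above (redundancy to mask failures, partial results when a failure cannot be hidden) can be illustrated with a minimal Python sketch. Everything in it is hypothetical and not from the talk: the `PartialResult` type, the `query_with_degradation` function, the source names, and the use of `ConnectionError` to stand in for a data source failure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence

Row = dict
Fetch = Callable[[], List[Row]]  # hypothetical: one callable per replica


@dataclass
class PartialResult:
    rows: List[Row] = field(default_factory=list)
    # Logical sources for which no replica could be reached.
    missing: List[str] = field(default_factory=list)

    @property
    def complete(self) -> bool:
        return not self.missing


def query_with_degradation(sources: Dict[str, Sequence[Fetch]]) -> PartialResult:
    """Map each logical source name to its replicas, tried in order."""
    result = PartialResult()
    for name, replicas in sources.items():
        for fetch in replicas:
            try:
                result.rows.extend(fetch())
                break  # a replica answered: the failure (if any) is masked
            except ConnectionError:
                continue  # try the next replica
        else:
            # No replica answered: report the gap instead of failing the query.
            result.missing.append(name)
    return result


def _down() -> List[Row]:
    raise ConnectionError("source unavailable")


demo = query_with_degradation({
    "customers": [_down, lambda: [{"id": 1, "name": "A"}]],  # second replica masks the failure
    "contracts": [_down],                                    # no working replica
})
print(demo.rows, demo.missing, demo.complete)
```

The caller receives whatever rows could be fetched plus an explicit list of what is missing, which is one possible reading of "friendly" partial delivery; a real grid would also need timeouts and data quality annotations, which this sketch leaves out.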
Tesla.
Data Placement: Goal: most critical data accesses are local, subject to constraints on space and other resources; best blend of federation and consolidation for the workload. How: policy-based advice on data caching and data movement, driven by workload; highest priority applications see the best performance. Diagram labels: DB2 II. Workload: Gold customers who bought expensive products; recent transactions involving expensive products; Gold customers in the 95120 zipcode.
Extending enterprise search with analytics: Distinguish between different semantics of the same term: rock (stone), rock (music), rock (to move back and forth). Search for information about higher level concepts that are not directly expressed in text tokens: named entities (drug, gene, person); relationships (inhibits, causes, is CEO of). Find answers, not just documents: who is the CEO of IBM? Support advanced applications such as patent mining, repair record analysis for early detection of problems, and drug discovery.
Unstructured Information Management Architecture (UIMA): A framework for integrating advanced text analytics technologies: natural language, machine learning, information retrieval, Bayesian statistics. In combination these can give higher accuracy results than individually. Encouraging reuse and sharing across organizations. UIMA is being developed in collaboration with the academic community and government sponsored organizations. Being applied to problems in Life Sciences, Compliance, Finance, etc. Pipeline steps: Identify Language; Find Words; Find Word Roots; Add Synonyms; Find parts of speech; Named-entity extraction (drugs, people, etc.);
Find Relationships; Index Documents. A well-disciplined architecture for natural-language content analysis and integration.
3 Research Paradigms for Exploring Text with Data: The SCORE approach: automatically determine a ‘context’ for queries and use it to select documents relevant to that query’s results. Request: the 5 poorest performing stocks in my portfolio over the last 6 months. The inferred ‘context’ will potentially fetch analyst reports about their common sector. Business Insights Workbench: examine characteristics of a collection of business documents; select a subset for more detailed text mining. Special UIMA annotators extract chemical names from patents. A special chemical converter takes names to SMILES strings. SMILES strings are used to retrieve chemical and physical properties. Chemical and physical properties are combined with patent metadata and patents to augment the warehouse cube. Competitive positioning analyses can all now be run against this more precise and data rich warehouse. The AvatarBI approach: enrich a Business Intelligence cube with ‘qualitative’ information extracted and quantified from business documents; probabilistic OLAP. Example: from which regions, and for which products, are we getting angry service calls? Diagram labels: UIMA Annotators; Text Search.
Summary: Information integration is a challenging task: structured and unstructured data from many diverse sources; variety, velocity, volume; application needs vary widely. Several technologies are needed, with no silver bullet: consolidation, federation, search; metadata, cleansing, transformation; need a unified framework for their use. Exciting research opportunities: metadata and tools; large-scale grids; text analytics.
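As a footnote to the SQL Federation slide, the TV-sales query can be sketched in Python to show how a federated server might split the work: push the date predicate to the relational source, read the items file through a wrapper, then join and group locally. This is an illustration under assumptions, not the DB2 II implementation: the transaction rows are invented, the item records are shortened to three fields, `items_wrapper` and `tv_sales_per_manufacturer` are hypothetical names, and SQLite's `strftime` stands in for `YEAR()`.

```python
import sqlite3
from collections import Counter

# Source 1: a relational source holding TRANSACTIONS (rows are invented).
rel = sqlite3.connect(":memory:")
rel.execute("CREATE TABLE transactions (tran_date TEXT, item_id TEXT)")
rel.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("2001-03-01", "00001"), ("2001-07-15", "00001"),
     ("2001-08-02", "00002"), ("2000-12-31", "00001")],
)

# Source 2: a pipe-delimited ITEMS file (id|manufacturer|category),
# simplified from the sample records on the slide.
ITEMS_FILE = """00001|SONY|Television
00002|RCA|VideoPlayer
00003|SONY|VideoRecorder
00004|SONY|DVDPlayer"""


def items_wrapper(text):
    """Hypothetical wrapper: expose the flat file as rows."""
    for line in text.splitlines():
        item_id, man, category = line.split("|")
        yield {"id": item_id, "man": man, "category": category}


def tv_sales_per_manufacturer(year):
    # Pushed down to the relational source: only the date predicate.
    pushed = rel.execute(
        "SELECT tran_date, item_id FROM transactions "
        "WHERE strftime('%Y', tran_date) = ?",
        (str(year),),
    ).fetchall()
    # Evaluated at the federated server: category filter, join, GROUP BY.
    tvs = {i["id"]: i["man"] for i in items_wrapper(ITEMS_FILE)
           if i["category"] == "Television"}
    counts = Counter(tvs[item_id] for _, item_id in pushed if item_id in tvs)
    return dict(counts)


print(tv_sales_per_manufacturer(2001))  # {'SONY': 2}
```

The application sees a single function call, which corresponds to the transparency property on the slide; where each predicate runs is a decision a real optimizer would make from the global catalog rather than hard-code.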