Introduction to Databases in Bioinformatics


Published on December 3, 2013

Author: raniashok

Source: authorstream.com

Introduction to Databases in Bioinformatics
Mrs. Rani Ashok, Associate Professor of Zoology, Lady Doak College, Madurai – 2
Email: eaarani@gmail.com

Databases:
- A database is a collection of data, typically describing the activities of one or more related organizations.
- A database is a repository for a collection of computerized data files.

Databases typically support the following operations:
- Retrieval
- Insertion
- Updating
- Deletion

Goals:
- Identify basic data acquisition tasks
- Review fundamental database concepts
- Identify issues with bioinformatic data
- Survey bioinformatic data formats
- Introduce issues of data management

Databases, continued:
For researchers to benefit from the data stored in a database, two additional requirements must be met:
- Easy access to the information
- A method for extracting only the information needed to answer a specific biological question

Functions of databases:
- Make biological data available to scientists
- Make biological data available in computer-readable form
- Make a particular type of information available in one single place (book, site, database)
- Published data can be difficult to find or access
- Collecting data from the literature is very time-consuming
- Not all data is actually published explicitly in an article (genome sequences!)
Availability of the data in computer-readable form (rather than printed on paper) is a necessary first step for the analysis of biological data, which always involves computers.

First biological sequence database:
The first biological sequence database was a book: the "Atlas of Protein Sequence and Structure" by Margaret Dayhoff and colleagues, first published in 1965. It contained the protein sequences determined at the time, and new editions of the book were published well into the 1970s. Its data became the foundation for the PIR database.

Storage and distribution of databases:
The computer became the storage medium of choice as soon as it was accessible to ordinary scientists. Databases were distributed on tape, and later on various kinds of disks. Once universities and academic institutes were connected to the Internet or its precursors (national computer networks), databases could be distributed over the network. Since the beginning of the 1990s, the World Wide Web (WWW, based on the Internet protocol HTTP) has been the standard method of communication and access for nearly all biological databases.

Accession codes:
An accession code (or number) is a number (possibly with a few characters in front) that uniquely identifies an entry in its database. For example, the accession code for KRAF_HUMAN in SWISS-PROT is P04049.
The accession code is supposed to be stable, and it is also called the primary key for the entry. Once issued, an accession code always refers to that entry (or to the entries that succeed it), even after large changes have been made to the entry.
When two entries are merged into a single one, the new entry carries both accession codes: one becomes the primary and the other the secondary accession code.
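The merge rule for accession codes can be illustrated with a minimal sketch. The `Entry` class, the helper functions, and the codes `X00001` and `X00002` are hypothetical names invented for this illustration; they do not describe how SWISS-PROT or any other database is actually implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Entry:
    """A database entry with one primary and any number of secondary codes."""
    primary: str
    secondary: List[str] = field(default_factory=list)
    description: str = ""


def merge_entries(kept: Entry, absorbed: Entry) -> Entry:
    """Merge two entries into one.

    The kept entry's code stays primary; every code of the absorbed
    entry becomes a secondary code, so no issued code is ever lost.
    """
    secondary = kept.secondary + [absorbed.primary] + absorbed.secondary
    return Entry(kept.primary, secondary, kept.description)


def build_lookup(entries: Iterable[Entry]) -> Dict[str, Entry]:
    """Map every primary and secondary code to the entry that carries it."""
    lookup: Dict[str, Entry] = {}
    for entry in entries:
        for code in [entry.primary, *entry.secondary]:
            lookup[code] = entry
    return lookup


# Hypothetical codes, for illustration only.
merged = merge_entries(Entry("X00001", description="kept"), Entry("X00002"))
lookup = build_lookup([merged])
# A query with the old code still reaches the merged entry.
found = lookup["X00002"]
```

The point of the design is that a lookup by any previously issued code keeps working after the merge, which is what makes accession codes safe to cite.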
When an entry is split into two, both new entries get new accession codes, but they also keep the old accession code as a secondary code.

Identifiers:
An identifier ("locus" in GenBank, "entry name" in SWISS-PROT) is a string of letters and digits that a human can generally interpret in some meaningful way, for instance as a recognizable abbreviation of the full protein or gene name.
SWISS-PROT uses a system where the entry name consists of two parts: the first denotes the protein and the second denotes the species it is found in. For example, KRAF_HUMAN is the entry name for the Raf-1 oncogene from Homo sapiens.
Unlike an accession code, an identifier can change: the database curators change it when they decide that it is no longer appropriate for the entry. However, identifiers change so rarely that this is not really a big problem.

Systems for searching, indexing and cross-referencing:
- The usefulness of a database can be increased enormously if it is easy to find entries that satisfy certain search criteria.
- The databases themselves may contain all necessary information, but some software system must be used to actually perform the right kind of search.
- Systems for searching and indexing: SRS, Entrez

SRS (Sequence Retrieval System):
- Developed by Thure Etzold
- A system for integrating heterogeneous databases
- Based on premade indexes of the items (words, entries, data fields, text, ...) found in a set of documents (database files)
- The indexing procedure requires a grammar (Icarus) that describes what different words in the data files mean, how they are to be indexed, and how they cross-reference to other items in other databases

SRS (Sequence Retrieval System), continued:
- A web-oriented system located on a server, accessed through HTML pages and CGI scripts
- Started as an academic project and later became a commercial system, which used to be developed and marketed by LION Bioscience AG
- EBI runs an SRS service which can be used by anyone. It indexes a large number of databases, and it also provides a well-defined web interface which allows programs or web sites to create links that query SRS at EBI

Entrez:
- Developed and accessible at the NCBI Entrez site
- Provides search facilities for a large number of databases, and provides links between them
- Provides a well-defined web interface which allows programs or web sites to define links that will query Entrez
- Cannot be installed on one's own server
- Purely a system for accessing and searching the databases at NCBI
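As a rough illustration of how a program might define links that query Entrez, the sketch below builds query URLs for NCBI's E-utilities. The slides do not name a specific interface, so the choice of E-utilities, the helper names `esearch_url` and `efetch_url`, and the example parameter values are assumptions made for this example. The code only constructs URLs; it does not contact NCBI.

```python
from typing import List
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (assumed here as the programmatic
# route into Entrez; the slides do not specify one).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build a search link: which records in `db` match `term`?"""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def efetch_url(db: str, ids: List[str], rettype: str = "fasta",
               retmode: str = "text") -> str:
    """Build a retrieval link for one or more record identifiers."""
    query = urlencode({
        "db": db,
        "id": ",".join(ids),
        "rettype": rettype,
        "retmode": retmode,
    })
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


# Example links using the entry name and accession code from the slides.
search_link = esearch_url("protein", "KRAF_HUMAN")
fetch_link = efetch_url("protein", ["P04049"])
```

Keeping URL construction separate from the actual download makes the links easy to log, which helps when recording where data came from.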
Characterization of databases

Bioinformatic databases:
Different kinds of bioinformatic databases:
- General purpose
- Data type specific
- Organism specific
- Pathway information
- Specialized data
- ...

Characterization of biological databases based on different properties:
- Data entry and quality control
- Type of data
- Technical design
- Maintainer status
- Primary or derived data
- Availability

Type of data:
- Nucleotide sequences
- Protein sequences
- Protein sequence patterns or motifs
- Macromolecular 3D structure
- Gene expression data
- Metabolic pathways

Data entry and quality control:
- Scientists (teams) deposit data directly
- Appointed curators add and update data
- Are erroneous data removed or marked?
- Type and degree of error checking: consistency, redundancy, conflicts, updates

Primary or derived data:
- Primary databases: experimental results are entered directly into the database
- Secondary databases: results of the analysis of primary databases
- Aggregates of many databases: links to other data items, combination of data, consolidation of data

Technical design:
- Flat files
- Relational database (SQL)
- Object-oriented database (e.g. CORBA, XML)

Maintainer status:
- Large, public institution (e.g. EMBL, NCBI)
- Quasi-academic institute (e.g. Swiss Institute of Bioinformatics, TIGR)
- Academic group or scientist
- Commercial company

Availability:
- Publicly available, no restrictions
- Available, but with copyright
- Accessible, but not downloadable
- Academic, but not freely available
- Proprietary, commercial; possibly free for academics

Data models

What is a Data Model?
Definition: a precise description of the data content in a system
Types of data models:
- Conceptual: describes WHAT the system contains
- Logical: describes HOW the system will be implemented, regardless of the DBMS
- Physical: describes HOW the system will be implemented using a specific DBMS

Why do we need to create data models?
- To aid in the development of a sound database design that does not allow anomalies or inconsistencies
- Goal: to create database tables that do not contain duplicate data values that can become inconsistent

Database models:
A database model defines how the data is organized.
- Relational: entities and relationships stored in tables (Oracle, DB2, MySQL, PostgreSQL); predefined schema
- Object-oriented / object-relational: abstract data types combining data and operations; structured types (arrays, lists, sequences, etc.); inheritance of attributes
- Hierarchical / semistructured: implicit schema; flexible

Creating an Entity-Relationship Model:
- Identify entities
- Identify entity attributes and primary keys
- Specify relationships

Data entities:
- Entity: a "thing" about which you want to store data in an application
- Multiple examples (instances) of the entity must exist
- Goal: store data about each entity in a separate table
- Do not store duplicate data in multiple tables or records
- Examples: DNA, Protein

Data model naming conventions:
- Entity names are short, descriptive, compound-word singular nouns
- Example: K-RAF - HUMAN

Formats:
- Data is stored and presented in a variety of formats: FASTA, GenBank, SwissProt, ASN.1, XML
- When considering a format for retrieval, ask: What is easy to parse? What format do the tools need? What information is needed?

Uses of databases

Typical bioinformatic project:
- Pose hypothesis
- Read relevant papers
- Identify relevant information
- Identify relevant data sources
- Retrieve data
- Store data in a local database
- Analyze data
- Visualize results
- Publish results
This is an iterative process. You may loop back at almost any step.

Data acquisition:
Pose hypothesis:
- Usually suggested by the researcher
- Important to state clearly and precisely
- Motivates the choice of data
- E.g. selective pressure is different in disease-related genes
Identify relevant literature:
- Usually suggested by the researcher; you read it as well
- Identifies important genes, proteins, pathways, etc.
- Record these as you read!
Identify relevant data:
- In collaboration with researchers
- What is needed? Gene sequences, amino acid sequences, structure information, literature, etc.
- What quality is required? No errors, some errors
Identify relevant databases:
- Which databases contain the data you are interested in?
- Which ones have the required quality?
- Do you need general or specific data (or both)?
- Look in the January issue of NAR (Nucleic Acids Research), do a Google search, ask other researchers in the field
Retrieve data:
- What format is required?
- Track where and when data is retrieved
- Do you need to update the data regularly?
- How do updates impact your analysis?
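One possible way to track where and when data was retrieved is to store a small provenance record next to every downloaded file. The sketch below is an assumption about how this could be done, not a method given in the slides; the function names, the record fields, and the example data are all hypothetical.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


def provenance_record(source: str, accession: str, data: bytes,
                      retrieved_at: Optional[datetime] = None) -> dict:
    """Describe one retrieval: where the data came from, when, and a checksum."""
    when = retrieved_at or datetime.now(timezone.utc)
    return {
        "source": source,
        "accession": accession,
        "retrieved_at": when.isoformat(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
    }


def data_changed(old_record: dict, new_data: bytes) -> bool:
    """Report whether a fresh download differs from the recorded one."""
    return old_record["sha256"] != hashlib.sha256(new_data).hexdigest()


# Placeholder content standing in for a downloaded record (not real data).
downloaded = b">example\nACGT\n"
record = provenance_record(
    "SWISS-PROT", "P04049", downloaded,
    retrieved_at=datetime(2013, 12, 3, tzinfo=timezone.utc),
)
needs_reanalysis = data_changed(record, b">example\nACGA\n")
```

Comparing checksums on each update gives a cheap answer to the question of whether an update can affect the analysis at all: if nothing changed, nothing needs to be rerun.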
Store data in a local database:
- Data management is a fundamental piece of every project
- Use a DBMS rather than flat files for projects: flexible queries, remote access, etc.
- Consider how data from multiple sources will be integrated
- Examples of successes and failures surround you. Pay attention to the databases you use!

Knowledge Discovery in Databases (KDD):
[Diagram of the KDD process. Labels: data cleaning and integration, data warehouse, selection and transformation, prepared data, data mining, patterns, evaluation and visualization, knowledge, knowledge base]
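To make the advice on storing data in a local database concrete, here is a minimal sketch that uses SQLite through Python's standard `sqlite3` module and walks through the four basic operations (insertion, retrieval, updating, deletion). The table layout, the column names, and the placeholder sequence strings are assumptions made for this example; only the accession code P04049 and the entry name KRAF_HUMAN come from the slides.

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a local database with one table of protein entries."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS protein (
               accession  TEXT PRIMARY KEY,  -- stable key, e.g. P04049
               entry_name TEXT NOT NULL,     -- human-readable identifier
               organism   TEXT NOT NULL,
               sequence   TEXT NOT NULL
           )"""
    )
    return conn


conn = open_db()

# Insertion (the sequence is a made-up placeholder, not the real protein).
conn.execute(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    ("P04049", "KRAF_HUMAN", "Homo sapiens", "ACDEFGHIKL"),
)

# Retrieval by primary key.
row = conn.execute(
    "SELECT entry_name, organism FROM protein WHERE accession = ?",
    ("P04049",),
).fetchone()

# Updating the stored sequence (again a placeholder value).
conn.execute(
    "UPDATE protein SET sequence = ? WHERE accession = ?",
    ("ACDEFGHIKLMN", "P04049"),
)
updated = conn.execute(
    "SELECT sequence FROM protein WHERE accession = ?", ("P04049",)
).fetchone()[0]

# Deletion.
conn.execute("DELETE FROM protein WHERE accession = ?", ("P04049",))
remaining = conn.execute("SELECT COUNT(*) FROM protein").fetchone()[0]
conn.close()
```

Using the accession code as the primary key follows the idea that it is the stable handle for an entry, while the entry name is kept as an ordinary column because it may change.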
