Bioinformatics and supercomputing. Reasons to become a bioinformatician at the UMA

Health & Medicine

Published on April 27, 2014

Author: MGonzaloClaros

Source: slideshare.net

Description

What is bioinformatics? How can I specialise in it? Where? Supercomputing capacity at the UMA. Recent bioinformatics achievements related to medicine and to science in general, many of them made by teams at the UMA.

Bioinformatics and supercomputing
M. Gonzalo Claros Díaz
Dept. of Molecular Biology and Biochemistry
Plataforma Andaluza de Bioinformática
Centro de Bioinnovación
http://about.me/mgclaros/ @MGClaros

http://www.scbi.uma.es
Let us start with some words that are not mine
http://everydaylife.globalpost.com/medical-schools-bioinformatics-37686.html
Bioinformatics is a new and very attractive scientific field at the interface between computer science, biology and mathematics, which seeks to discover new information about diseases and the human body.
Bioinformatics uses biology and computer science to discover how living beings and their diseases work.

Bioinformatics is not applied only to humans
http://mscbioinformatics.uab.cat/base/base3.asp?sitio=msbioinformatics
But I understand that, for a Health Engineer, the interest in humans comes before everything else.

Bioinformatics is ESSENTIAL nowadays
http://bioinformatics.biol.ntnu.edu.tw/sher/Teaching.html

How did bioinformatics arise?
Margaret Oakley Dayhoff
Order had to be brought to... 65 proteins!

After one database comes another
1975: 12 structures!

Calling them databases is almost an insult to a computer scientist
HEADER LEUCINE ZIPPER 15-JUL-93 1DGC   1DGC 2
COMPND GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC   1DGC 3
COMPND 2 ATF/CREB SITE DNA   1DGC 4
SOURCE GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC   1DGC 5
AUTHOR T.J.RICHMOND   1DGC 6
REVDAT 1 22-JUN-94 1DGC 0   1DGC 7
JRNL AUTH P.KONIG,T.J.RICHMOND   1DGC 8
JRNL TITL THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO   1DGC 9
JRNL TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA   1DGC 10
JRNL TITL 3 FLEXIBILITY   1DGC 11
JRNL REF J.MOL.BIOL. V. 233 139 1993   1DGC 12
JRNL REFN ASTM JMOBAK UK ISSN 0022-2836 0070   1DGC 13
REMARK 1   1DGC 14
REMARK 2   1DGC 15
REMARK 2 RESOLUTION. 3.0 ANGSTROMS.   1DGC 16
REMARK 3   1DGC 17
REMARK 3 REFINEMENT.   1DGC 18
REMARK 3 NUMBER OF PROTEIN ATOMS 456   1DGC 30
REMARK 3 NUMBER OF NUCLEIC ACID ATOMS 386   1DGC 31
REMARK 4   1DGC 32
REMARK 4 GCN4: TRANSCRIPTIONAL ACTIVATOR OF GENES ENCODING FOR AMINO   1DGC 33
REMARK 4 ACID BIOSYNTHETIC ENZYMES.   1DGC 34
REMARK 5   1DGC 35
REMARK 5 AMINO ACIDS NUMBERING (RESIDUE NUMBER) CORRESPONDS TO THE   1DGC 36
REMARK 5 281 AMINO ACIDS OF INTACT GCN4.   1DGC 37
REMARK 6   1DGC 38
REMARK 6 BZIP SEQUENCE 220 - 281 USED FOR CRYSTALLIZATION.   1DGC 39
REMARK 7   1DGC 40
REMARK 7 MODEL FROM AMINO ACIDS 227 - 281 SINCE AMINO ACIDS 220 -   1DGC 41
REMARK 7 226 ARE NOT WELL ORDERED.   1DGC 42
REMARK 8   1DGC 43
REMARK 8 RESIDUE NUMBERING OF NUCLEOTIDES:   1DGC 44
REMARK 8 5' T G G A G A T G A C G T C A T C T C C   1DGC 45
REMARK 8 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 1 2 3 4 5 6 7 8 9   1DGC 46
REMARK 9   1DGC 47
REMARK 9 THE ASYMMETRIC UNIT CONTAINS ONE HALF OF PROTEIN/DNA   1DGC 48
REMARK 9 COMPLEX PER ASYMMETRIC UNIT.   1DGC 49
REMARK 10   1DGC 50
REMARK 10 MOLECULAR DYAD AXIS OF PROTEIN DIMER AND PALINDROMIC HALF   1DGC 51
REMARK 10 SITES OF THE DNA COINCIDES WITH CRYSTALLOGRAPHIC TWO-FOLD   1DGC 52
REMARK 10 AXIS. THE FULL PROTEIN/DNA COMPLEX CAN BE OBTAINED BY   1DGC 53
REMARK 10 APPLYING THE FOLLOWING TRANSFORMATION MATRIX AND   1DGC 54
REMARK 10 TRANSLATION VECTOR TO THE COORDINATES X Y Z:   1DGC 55
REMARK 10   1DGC 56
REMARK 10 0 -1 0 X 117.32 X SYMM   1DGC 57
REMARK 10 -1 0 0 Y + 117.32 = Y SYMM   1DGC 58
REMARK 10 0 0 -1 Z 43.33 Z SYMM   1DGC 59
SEQRES 1 A 62 ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG   1DGC 60
SEQRES 2 A 62 ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG   1DGC 61
SEQRES 3 A 62 LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU   1DGC 62
SEQRES 4 A 62 GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL   1DGC 63
SEQRES 5 A 62 ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG   1DGC 64
SEQRES 1 B 19 T G G A G A T G A C G T C   1DGC 65
SEQRES 2 B 19 A T C T C C   1DGC 66
HELIX 1 A ALA A 228 LYS A 276 1   1DGC 67
CRYST1 58.660 58.660 86.660 90.00 90.00 90.00 P 41 21 2 8   1DGC 68
ORIGX1 1.000000 0.000000 0.000000 0.00000   1DGC 69
ORIGX2 0.000000 1.000000 0.000000 0.00000   1DGC 70
ORIGX3 0.000000 0.000000 1.000000 0.00000   1DGC 71
SCALE1 0.017047 0.000000 0.000000 0.00000   1DGC 72
SCALE2 0.000000 0.017047 0.000000 0.00000   1DGC 73
SCALE3 0.000000 0.000000 0.011539 0.00000   1DGC 74
ATOM 1 N PRO A 227 35.313 108.011 15.140 1.00 38.94   1DGC 75
ATOM 2 CA PRO A 227 34.172 107.658 15.972 1.00 39.82   1DGC 76
ATOM 842 C5 C B 9 57.692 100.286 22.744 1.00 29.82   1DGC 916
ATOM 843 C6 C B 9 58.128 100.193 21.465 1.00 30.63   1DGC 917
TER 844 C B 9   1DGC 918
MASTER 46 0 0 1 0 0 0 6 842 2 0 7   1DGC 919
END   1DGC 920
FORTRAN was king
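To illustrate why such a file is closer to a punched-card deck than to a database, here is a minimal Python sketch that reads one ATOM record by column position, the way a FORTRAN format statement would. The column ranges follow the published fixed-column PDB layout as I understand it; the names `Atom`, `parse_atom` and `example_line` are hypothetical. Because this transcript has lost the original column spacing, the example line is rebuilt from the values of the first ATOM record of 1DGC shown on the slide.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom(line: str) -> Atom:
    """Parse one ATOM record by fixed column positions (0-based slices)."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain=line[21],                # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


# Rebuild the first ATOM record of the slide with its column widths
# (assumption: the standard layout; the transcript collapsed the spacing).
example_line = (
    "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
    "{:>8.3f}{:>8.3f}{:>8.3f}{:>6.2f}{:>6.2f}"
).format("ATOM", 1, " N", "", "PRO", "A", 227, "",
         35.313, 108.011, 15.140, 1.00, 38.94)

if __name__ == "__main__":
    print(parse_atom(example_line))
```

There are no field separators, keys or indexes here: the meaning of each value depends only on its position in the line, which is why the slide calls these "databases" with some irony.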

1977: the turning point
Proc. Natl. Acad. Sci. USA, Vol. 74, No. 12, pp. 5463-5467, December 1977. Biochemistry
DNA sequencing with chain-terminating inhibitors (DNA polymerase/nucleotide sequences/bacteriophage φX174)
F. Sanger, S. Nicklen, and A. R. Coulson
Medical Research Council Laboratory of Molecular Biology, Cambridge CB2 2QH, England
Contributed by F. Sanger, October 3, 1977
ABSTRACT A new method for determining nucleotide sequences in DNA is described. It is similar to the "plus and minus" method [Sanger, F. & Coulson, A. R. (1975) J. Mol. Biol. 94, 441-448] but makes use of the 2',3'-dideoxy and arabinonucleoside analogues of the normal deoxynucleoside triphosphates, which act as specific chain-terminating inhibitors of DNA polymerase. The technique has been applied to the DNA of bacteriophage φX174 and is more rapid and more accurate than either the plus or the minus method.
The "plus and minus" method (1) is a relatively rapid and simple technique that has made possible the determination of the sequence of the genome of bacteriophage φX174 (2). It depends on the use of DNA polymerase to transcribe specific regions of the DNA under controlled conditions. Although the method is considerably more rapid and simple than other available techniques, neither the "plus" nor the "minus" method is completely accurate, and in order to establish a sequence both must be used together, and sometimes confirmatory data are necessary. W. M. Barnes (J. Mol. Biol., in press) has recently developed a third method, involving ribo-substitution, which has certain advantages over the plus and minus method, but this has not yet been extensively exploited. Another rapid and simple method that depends on specific chemical degradation of the DNA has recently been described by Maxam and Gilbert (3), and this has also been used exten[...]
(The right-hand column of the page is cropped on the slide; only fragments of it are legible.)

And one month «earlier», the first bioinformatics suite
Nucleic Acids Research, Volume 4, Number 11, November 1977
Sequence data handling by computer
R. Staden, MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK
Received 10 October 1977
ABSTRACT The speed of the new DNA sequencing techniques has created a need for computer programs to handle the data produced. This paper describes simple programs designed specifically for use by people with little or no computer experience. The programs are for use on small computers and provide facilities for storage, editing and analysis of both DNA and amino acid sequences. A magnetic tape containing these programs is available on request.
INTRODUCTION The development of rapid DNA sequencing techniques (1, 2) now enables large amounts of sequence data to be accumulated in a short period of time. The complete sequence of bacteriophage φX174 has recently been published (3) and the sequences of other, similarly sized molecules are near to completion. During the sequencing of φX174 DNA it became necessary to develop computer programs to process the large amounts of data produced. Some of the programs are specific to DNA sequences but many are equally applicable to amino acid sequences. These programs are designed for small computers in common use, such as the PDP 11/45, and are simplified so that they can be used by people with little or no experience of computers. This paper describes some of the programs currently being used in this laboratory. They provide facilities for (1) storage and editing of a sequence, (2) producing copies of the sequence in various forms, e.g. in single or double stranded form, (3) translation into the amino acid sequence coded by the DNA [...]

The Staden Package is freely available today
http://staden.sourceforge.net

And then sequence databases appear
1980: 563 sequences. 1983. 1988.

They were «text» databases too

We start to need comparison algorithms
J. Mol. Biol. (1981) 147, 195-197
Identification of Common Molecular Subsequences
The identification of maximally homologous subsequences among sets of long sequences is an important problem in molecular sequence analysis. The problem is straightforward only if one restricts consideration to contiguous subsequences (segments) containing no internal deletions or insertions. The more general problem has its solution in an extension of sequence metrics (Sellers 1974; Waterman et al., 1976) developed to measure the minimum number of "events" required to convert one sequence into another. These developments in the modern sequence analysis began with the heuristic homology algorithm of Needleman & Wunsch (1970) which first introduced an iterative matrix method of calculation. Numerous other heuristic algorithms have been suggested including those of Fitch (1966) and Dayhoff (1969). More mathematically rigorous algorithms were suggested by Sankoff (1972), Reichert et al. (1973) and Beyer et al. (1979) but these were generally not biologically satisfying or interpretable. Success came with Sellers (1974) development of a true metric measure of the distance between sequences. This metric was later generalized by Waterman et al. (1976) to include deletions/insertions of arbitrary length. This metric represents the minimum number of "mutational events" required to convert one sequence into another. It is of interest to note that Smith et al. (1980) have recently shown that under some conditions the generalized Sellers metric is equivalent to the original homology algorithm of Needleman & Wunsch (1970). In this letter we extend the above ideas to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology). The similarity measure used here allows for arbitrary length deletions and insertions.
Algorithm. The two molecular sequences will be A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_m. A similarity s(a,b) is given between sequence elements a and b. Deletions of length k are given weight W_k. To find pairs of segments with high degrees of similarity, we set up a matrix H. First set [...]
Proc. Natl. Acad. Sci. USA, Vol. 80, pp. 726-730, February 1983. Biochemistry
Rapid similarity searches of nucleic acid and protein data banks (global homology/optimal alignment)
W. J. Wilbur and David J. Lipman
Mathematical Research Branch, National Institute of Arthritis, Diabetes, and Digestive and Kidney Diseases, National Institutes of Health, Building 31 Room 4B-54, Bethesda, Maryland 20205
Communicated by Maxine Singer, November 8, 1982
ABSTRACT With the development of large data banks of protein and nucleic acid sequences, the need for efficient methods of searching such banks for sequences similar to a given sequence has become evident. We present an algorithm for the global comparison of sequences based on matching k-tuples of sequence elements for a fixed k. The method results in substantial reduction in the time required to search a data bank when compared with prior techniques of similarity analysis, with minimal loss in sensitivity. The algorithm has also been adapted, in a separate implementation, to produce rigorous sequence alignments. Currently, using the DEC KL-10 system, we can compare all sequences in the entire Protein Data Bank of the National Biomedical Research Foundation with a 350-residue query sequence in less than 3 min and carry out a similar analysis with a 500-base query sequence against large banks of sequences. We shall describe here a global algorithm for comparing two nucleic acid or two amino acid sequences. This algorithm involves the construction of an optimal alignment that is useful in its own right. The algorithm also requires a computation time on the order of N × M, where N and M are the lengths of the sequences being compared, but, for given sequences, the computation is many times faster than the above-mentioned methods. Results obtained by the method and its limitations and advantages are discussed. METHODS Computational Methods and Data Sources. All computing [...]
They are good, but slow. Then FASTA appears.
Bioinformatics is a science that poses problems and looks for solutions to them.
Bioinformatics is a science because it seeks to discover information.
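The matrix H of the 1981 excerpt can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the similarity s(a,b) is reduced to a hypothetical match/mismatch score, and the general deletion weight W_k is simplified to a linear penalty (k times `gap`). The function name `smith_waterman` and all default values are mine.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of a and b; returns (score, segment_a, segment_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # First row and first column stay at 0, so a segment can start anywhere.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new segment here
                H[i - 1][j - 1] + s,  # extend with a (mis)match
                H[i - 1][j] + gap,    # deletion in b (linear gap assumption)
                H[i][j - 1] + gap,    # deletion in a (linear gap assumption)
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest cell until a zero is reached.
    i, j = best_pos
    seg_a, seg_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            seg_a.append(a[i - 1])
            seg_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            seg_a.append(a[i - 1])
            seg_b.append("-")
            i -= 1
        else:
            seg_a.append("-")
            seg_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(seg_a)), "".join(reversed(seg_b))


if __name__ == "__main__":
    print(smith_waterman("GGGACGTCCC", "TTACGTAA"))  # (8, 'ACGT', 'ACGT')
```

Every cell of the matrix is visited, so the cost grows with the product of the two lengths. That is the "good, but slow" of the slide, and the reason why k-tuple methods were attractive for searching whole data banks.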

More sequences accumulate, so more efficient comparisons are needed
The algorithm is improved, not the computer: BLAST arrives

The cost of sequencing falls, thanks to the engineers

Lower cost: more sequencing, more data and more databases

Databases are ESSENTIAL for bioinformaticians today

But Moore's law is unforgiving
Information accumulates faster than processor speed increases.
[Chart: number of transistors in Intel processors versus growth of data in the databases]
Computer engineers: HELP!

The «info» cannot keep up with the «bio»

If resources do not increase, more people will have to be assigned to analysing the data

Bioinformaticians are needed in spite of (thanks to?) the crisis
http://www.indeed.com/jobtrends?q=molecular+biology,+bioinformatics,+biomedical+engineering&l=&relative=1

In short, there is work for bioinformaticians

Every day there are new requests for bioinformaticians
30 Dec 2013

And in Spain and Europe too
http://www.eurosciencejobs.com/jobs/bioinformatics

If what you want is to make money, that is possible too
[Advertising page of La Marea, the monthly magazine of the MásPúblico cooperative, April 2014, no. 15]
Genoma4u: "Knowing your genome and that of your children is the key to personalised medicine." www.genoma4u.com

They are well paid, at least abroad
Linux and OS X skills are better paid than Windows skills
http://www.r-bloggers.com/r-skills-attract-the-highest-salaries/
R is taught in the bioinformatics track of the Health Engineering degree

Did you know that, after databases, R is what is used most in bioinformatics?
Databases are used the most, and then R

And that there are job offers for bioinformaticians who know R?
http://www.r-bloggers.com/r-jobs-march-24th/

You have this world within your reach at the UMA
http://www.uma.es/grado-en-ingenieria-de-la-salud

There are always specialisation courses as well

And books! Like Teruel, they exist too

A bioinformatician can work in many ways
• As an engineer
  • Making difficult or tedious tasks easier
  • Workflows and automation
• As a computer scientist
  • Improving existing algorithms
  • Creating new algorithms
  • For example, sequence assembly
• As a scientist
  • Discovering biological information with the computer
  • For example, relating apparently unconnected diseases

The competencies of the bioinformatician are being defined
Message from ISCB
Bioinformatics Curriculum Guidelines: Toward a Definition of Core Competencies
Lonnie Welch (1, *), Fran Lewitter (2), Russell Schwartz (3), Cath Brooksbank (4), Predrag Radivojac (5), Bruno Gaeta (6), Maria Victoria Schneider (7)
1 School of Electrical Engineering and Computer Science, Ohio University, Athens, Ohio, United States of America
2 Bioinformatics and Research Computing, Whitehead Institute, Cambridge, Massachusetts, United States of America
3 Department of Biological Sciences and School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
4 European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
5 School of Informatics and Computing, Indiana University, Bloomington, Indiana, United States of America
6 School of Computer Science and Engineering, The University of New South Wales, Sydney, New South Wales, Australia
7 The Genome Analysis Centre, Norwich Research Park, Norwich, United Kingdom
Introduction
Rapid advances in the life sciences and in related information technologies necessitate the ongoing refinement of bioinformatics educational programs in order to maintain their relevance. As the discipline of bioinformatics and computational biology expands and matures, it is important to characterize the elements that contribute to the success of professionals in this field. These individuals work in a wide variety of settings, including bioinformatics core facilities, biological and medical research laboratories, software development organizations, pharmaceutical and instrument development companies, and institutions that provide education, service, and [...]
The skill sets required for success in the field of bioinformatics are considered by several authors: Altman [2] defines five broad areas of competency and lists key technologies; Ranganathan [3] presents highlights from the Workshops on Education in Bioinformatics, discussing challenges and possible solutions; Yale's interdepartmental PhD program in computational biology and bioinformatics is described in [4], which lists the general areas of knowledge of bioinformatics; in a related article, a graduate of Yale's PhD program reflects on the skills needed by a bioinformatician [5]; Altman and Klein [6] describe the Stanford Biomedical Informatics (BMI) Training Program, presenting observed trends among BMI students; the American Medical Informatics Association defines competencies in the related field of biomedical informatics in [7]; and the approaches used in several German universities to implement bioinformatics education are described in [8].
Several approaches to providing bioin[...] life sciences curricula. Pevzner and Shamir [11] propose that undergraduate biology curricula should contain an additional course, "Algorithmic, Mathematical, and Statistical Concepts in Biology." Wingren and Botstein [12] present a graduate course in quantitative biology that is based on original, pathbreaking papers in diverse areas of biology. Johnson and Friedman [13] evaluate the effectiveness of incorporating biological informatics into a clinical informatics program. The results reported are based on interviews of four students and informal assessments of bioinformatics faculty.
The challenges and opportunities relevant to training and education in the context of bioinformatics core facilities are discussed by Lewitter et al. [14]. Relatedly, Lewitter and Rebhan [15] provide guidance regarding the role of a bioinformatics core facility in hiring biologists and in furthering their education in bioinformatics. Richter and Sexton [16] describe [...] and educate bioinformaticians. The previous report of the task force summarized a survey that was conducted to gather input regarding the skill set needed by bioinformaticians [1]. The current article details a subsequent effort, wherein the task force broadened its perspectives by examining bioinformatics career opportunities, surveying directors of bioinformatics core facilities, and reviewing bioinformatics education programs.
The bioinformatics literature provides valuable perspectives on bioinformatics education by defining skill sets needed by bioinformaticians, presenting approaches for providing informatics training to biologists, and discussing the roles of bioinformatics core facilities in training and education. [...] of the "-omics" era. They define a requisite skill set by analyzing responses to questions about the knowledge, skills, and abilities that biologists should possess. The authors in [10] present examples of strategies and methods for incorporating bioinformatics content into undergraduate [...]
This manuscript expands the body of knowledge pertaining to bioinformatics curriculum guidelines by presenting the results from a broad set of surveys (of core facility directors, of career opportunities, and of existing curricula). Although there is some overlap in the findings of the [...]
Citation: Welch L, Lewitter F, Schwartz R, Brooksbank C, Radivojac P, et al. (2014) Bioinformatics Curriculum Guidelines: Toward a Definition of Core Competencies. PLoS Comput Biol 10(3): e1003496. doi:10.1371/journal.pcbi.1003496
Published March 6, 2014
Copyright: © 2014 Welch et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: No specific funding was received for writing this article.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: welch@ohio.edu
PLOS Computational Biology | www.ploscompbiol.org | March 2014 | Volume 10 | Issue 3 | e1003496
[...] database management languages (e.g., Oracle, PostgreSQL, and MySQL), and also desirable for a bioinformatician to have modeling experience or background in one [...]
Preliminary Survey of Existing Curricula
Table 1. Summary of the skill sets of a bioinformatician, identified by surveying bioinformatics core facility directors and examining bioinformatics career opportunities.
Skill category: specific skills
General: time management, project management, management of multiple projects, independence, curiosity, self-motivation, ability to synthesize information, ability to complete projects, leadership, critical thinking, dedication, ability to communicate scientific concepts, analytical reasoning, scientific creativity, collaborative ability
Computational: programming, software engineering, system administration, algorithm design and analysis, machine learning, data mining, database design and management, scripting languages, ability to use scientific and statistical analysis software packages, open source software repositories, distributed and high-performance computing, networking, web authoring tools, web-based user interface implementation technologies, version control tools
Biology: molecular biology, genomics, genetics, cell biology, biochemistry, evolutionary theory, regulatory genomics, systems biology, next generation sequencing, proteomics/mass spectrometry, specialized knowledge in one or more domains
Statistics and Mathematics: application of statistics in the contexts of molecular biology and genomics, mastery of relevant statistical and mathematical modeling methods (including experimental design, descriptive and inferential statistics, probability theory, differential equations and parameter estimation, graph theory, epidemiological data analysis, analysis of next generation sequencing data using R and Bioconductor)
Bioinformatics: analysis of biological data; working in a production environment managing scientific data; modeling and warehousing of biological data; using and building ontologies; retrieving and manipulating data from public repositories; ability to manage, interpret, and analyze large data sets; broad knowledge of bioinformatics analysis methodologies; familiarity with functional genetic and genomic data; expertise in common bioinformatics software packages, tools, and algorithms
doi:10.1371/journal.pcbi.1003496.t001
http://www.ploscompbiol.org/article/info:doi%2F10.1371%2Fjournal.pcbi.1003496#pcbi-1003496-g002

The engineer, the scientist and the user
http://www.ploscompbiol.org/article/info:doi%2F10.1371%2Fjournal.pcbi.1003496#pcbi-1003496-g002

The profile of an Australian bioinformatician
http://www.ebi.edu.au/news/braembl-community-survey-report-2013
Where do they work? Who is the bioinformatician?
This is the bioinformatician. This is a bio-user. Another bio-user. And so is this one.

The bioinformatician has no mobility problems

When do bioinformaticians rest?
NCBI is the most heavily used site in biomedicine. Why?
[Chart: NCBI web traffic, 1997-2006]
722,000 unique IPs a day; 91 million web hits a day; 3,200 peak web hits a second; 1.5 terabytes of FTP a day; 1.8 million unique users a day

There are always things a computer scientist will do better 38 10-04-13

We now know what is expected of a bioinformatician. Let us look at some examples as real as life itself 39

Workflows that automate repetitive tasks 40 Data mining, Microarray, «Wet» side, «Dry» side, Assembling

Two examples «made in Málaga» 41 SeqTrim, FullLengtherNEXT. Diagram labels: Raw sequences; SeqTrimNEXT (pre-processing); Assembly; Annotation with Maker; Mining with FullLengtherNEXT; GENOMICS; TRANSCRIPTOMICS

Why were these tools needed? 42
[Chart, FullLengtherNEXT: unigenes, number of orthologs for unigenes, complete unigenes with orthologs, and unique complete unigenes with orthologs, compared for the OLC, De Bruijn and OLC+De Bruijn+CAP3 assemblies]
[Charts, SeqTrimNEXT: number of contigs and N50 for BAC1, BAC2 and BAC3, comparing Newbler with SeqTrimNext + Newbler]
Fewer contigs, higher N50. The more complete genes there are, the better the assembly.
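The slide judges assemblies by contig count and N50 without defining the latter. As a minimal sketch (the contig lengths below are invented for illustration and are not the BAC data from the slide), N50 can be computed as the length of the contig at which the running total of lengths, taken from longest to shortest, first reaches half of the assembly size:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that takes the running total to half of the
        # assembly size defines the N50.
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Two hypothetical assemblies of the same 7000 bases.
fragmented = [2000, 1500, 1000, 800, 700, 500, 300, 200]
contiguous = [4000, 2500, 500]

for name, contigs in (("fragmented", fragmented), ("contiguous", contiguous)):
    print(name, "contigs:", len(contigs), "N50:", n50(contigs))
```

With the same total number of bases, the assembly made of fewer, longer contigs gets the higher N50, which is the pattern the slide summarises as "fewer contigs, higher N50".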

There is bioinformatics for transcriptomics at the UMA 43 DATABASE Open Access EuroPineDB: a high-coverage web database for maritime pine transcriptome Noé Fernández-Pozo1, Javier Canales1, Darío Guerrero-Fernández2, David P Villalobos1, Sara M Díaz-Moreno1, Rocío Bautista2, Arantxa Flores-Monterroso1, M Ángeles Guevara3, Pedro Perdiguero4, Carmen Collada3,4, M Teresa Cervera3,4, Álvaro Soto3,4, Ricardo Ordás5, Francisco R Cantón1, Concepción Avila1, Francisco M Cánovas1 and M Gonzalo Claros1,2* Abstract Background: Pinus pinaster is an economically and ecologically important species that is becoming a woody gymnosperm model. Its enormous genome size makes whole-genome sequencing approaches hard to apply. Therefore, the expressed portion of the genome has to be characterised and the results and annotations have to be stored in dedicated databases. Description: EuroPineDB is the largest sequence collection available for a single pine species, Pinus pinaster (maritime pine), since it comprises 951 641 raw sequence reads obtained from non-normalised cDNA libraries and high-throughput sequencing from adult (xylem, phloem, roots, stem, needles, cones, strobili) and embryonic (germinated embryos, buds, callus) maritime pine tissues. Using open-source tools, sequences were optimally pre-processed, assembled, and extensively annotated (GO, EC and KEGG terms, descriptions, SNPs, SSRs, ORFs and InterPro codes). As a result, a 10.5× P. pinaster genome was covered and assembled in 55 322 UniGenes. A total of 32 919 (59.5%) of P. pinaster UniGenes were annotated with at least one description, revealing at least 18 466 different genes. The complete database, which is designed to be scalable, maintainable, and expandable, is freely available at: http://www.scbi.uma.es/pindb/. 
It can be retrieved by gene libraries, pine species, annotations, UniGenes and microarrays (i.e., the sequences are distributed in two-colour microarrays; this is the only conifer database that provides this information) and will be periodically updated. Small assemblies can be viewed using a dedicated visualisation tool that connects them with SNPs. Any sequence or annotation set shown on-screen can be downloaded. Retrieval mechanisms for sequences and gene annotations are provided. Conclusions: The EuroPineDB with its integrated information can be used to reveal new knowledge, offers an easy-to-use collection of information to directly support experimental work (including microarray hybridisation), and provides deeper knowledge on the maritime pine transcriptome. 1 Background Conifers (Coniferales), the most important group of gymnosperms, represent 650 species, some of which are the largest, tallest, and oldest non-clonal terrestrial Given that trees are the great majority of conifers, they provide a different perspective on plant genome biology and evolution taking into account that conifers are separated from angiosperms by more than 300 million years Fernández-Pozo et al. BMC Genomics 2011, 12:366 http://www.biomedcentral.com/1471-2164/12/366
Research Article De novo assembly of maritime pine transcriptome: implications for forest breeding and biotechnology Javier Canales1†, Rocio Bautista2†, Philippe Label3†, Josefa Gomez-Maldonado1, Isabelle Lesur4,5,6, Noe Fernandez-Pozo2, Marina Rueda-Lopez1, Dario Guerrero-Fernandez2, Vanessa Castro-Rodríguez1, Hicham Benzekri2, Rafael A. Cañas1, María-Angeles Guevara7, Andreia Rodrigues8, Pedro Seoane2, Caroline Teyssier9, Alexandre Morel9, François Ehrenmann4,5, Gregoire Le Provost4,5, Celine Lalanne4,5, Celine Noirot10, Christophe Klopp10, Isabelle Reymond11, Angel García-Gutierrez1, Jean-François Trontin11, Marie-Anne Lelu-Walter9, Celia Miguel8, María Teresa Cervera7, Francisco R. Canton1, Christophe Plomion4,5, Luc Harvengt11, Concepcion Avila1,2, M. Gonzalo Claros1,2 and Francisco M. Canovas1,2* 1 Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Universidad de Málaga, Málaga, Spain 2 Plataforma Andaluza de Bioinformática, Edificio de Bioinnovación, Parque Tecnológico de Andalucía, Málaga, Spain 3 INRA, Université Blaise Pascal, Aubière Cedex, France 4 INRA, Cestas, France 5 Université de Bordeaux, Talence, France 6 HelixVenture, Mérignac, France 7 Plant Biotechnology Journal (2013), pp. 1–14 doi: 10.1111/pbi.12136
Microarrays; Databases; Tools and algorithms…; Genomics, proteomics, metabolomics; Biotechnology

First, the data are collected 44 homology was found, respectively, confirming that most assembled unigenes were pine transcripts. In fact, 4608 unigenes had a homologue EST in the Pine Gene Index 9.0 database (http://compbio.dfci.harvard.edu/cgi-bin/tgi/
Table 1 Description of samples used for DNA sequencing
Gene library | Sequencing platform | Sampled plant material | Experimental conditions | SRA code
EuroPineDB | Sanger/454 | Bud, xylem, phloem, stem, needles, roots, stem, embryos, callus, cone, male and female strobili | ESTs and SSH libraries from different tissues and conditions as described by Fernandez-Pozo et al., 2011 | SRS479769
Biogeco1 | 454 | Xylem, bud and needle | ESTs from differentiating xylem, swelling bud and young needles | SRX032960, SRX032961, SRX032962, SRX032963
Biogeco2 | 454 | Bud | EST from quiescent buds harvested on 2-year-old maritime pine (low growing family) in well-watered or drought-stress conditions | SRX031546
Biogeco3 | 454 | Bud | EST from quiescent buds harvested on 2-year-old maritime pine (fast growing family) in well-watered or drought-stress conditions | SRX031589
UAGPF1 | 454 | Embryome | ESTs from developing, immature embryos (1-week maturation) | SRX022618
INIA_PPIN | 454 | Bud | ESTs from buds | PRJNA221139
U_root | 454 | Root | ESTs from roots (1-month-old seedlings) | SRS480239
U_tip | 454 | Root tips | ESTs from root tips (1-month-old seedlings) | SRS480265
U_H | 454 | Hypocotyl | ESTs from hypocotyl (1-month-old seedlings) | SRS480236
U_N | 454 | Needle | ESTs from needles (1-month-old seedlings) | SRS480237
U_Cot_Os | 454 | Cotyledon | ESTs from cotyledons grown under dark conditions | SRS479771
U_H_Os | 454 | Hypocotyl | ESTs from hypocotyl grown under dark conditions | SRS480236
U_R_6 | 454 | Roots | ESTs from roots (6-month-old seedlings) | SRS480238
U_S_8 | 454 | Stem | ESTs from stem (8-month-old seedlings) | SRS480261
UAGPF2 | Illumina | Somatic embryo | Paired-end ESTs from developing, immature embryos (1 week maturation) | SRR609713
BIOGECO4 | Illumina | Bud | ESTs from young and aged buds | SRX031587
BIOGECO5 | Illumina | Root | ESTs from drought-stressed and control roots in hydropony | SRX031592, SRX031590
BIOGECO6 | Illumina | Bud | ESTs from young and aged buds | SRX031594
IBET | Illumina | Zygotic embryo | Paired-end ESTs from embryos | SRS481044
The maritime pine transcriptome 3

Next, the workflow is designed 45
[Flow chart of the assembly leading to UNIGENES S. senegalensis v4]
S. senegalensis long reads: SeqTrimNext (pre-processing); MIRA (pre-assembling); EULER-SR (pre-assembling); CAP3 (reconciliation)
Short reads: SeqTrimNext (pre-processing); Oases (pre-assembling), kmer 23 & 47, paired-end + single; CD-HIT 99%
Further steps: Full-LengtherNext v3; BOWTIE 2 (mapping test); mis-assembly rejection
Outputs labelled on the chart (#1 to #12): contigs, debris, rejected, mapped contigs, unmapped contigs, coding unmapped contigs, non-coding, missassemblies
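Designing a workflow like the one on this slide amounts to declaring which step depends on which and letting a tool derive a valid execution order. The following is only a sketch: the step names are hypothetical labels inspired by the slide, and the exact wiring between them is an assumption, since the original diagram cannot be recovered from the transcript.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a step; its value is the set of steps that must finish first.
# Step names and dependencies are illustrative assumptions, not the
# authors' actual pipeline definition.
DEPENDENCIES = {
    "preprocess_long_reads": set(),
    "preprocess_short_reads": set(),
    "assemble_long_reads": {"preprocess_long_reads"},
    "assemble_short_reads": {"preprocess_short_reads"},
    "remove_redundancy": {"assemble_short_reads"},
    "reconcile_contigs": {"assemble_long_reads", "remove_redundancy"},
    "classify_unigenes": {"reconcile_contigs"},
}


def execution_order(dependencies):
    """Return one order in which every step runs after its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
for position, step in enumerate(order, start=1):
    print(position, step)
```

Steps with no dependency between them (here, the two pre-processing branches) may appear in either order, which is also what makes them candidates for running in parallel.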

Workflows are increasingly important 46 Genes 2012, 3, 545-575; doi:10.3390/genes3030545 genes ISSN 2073-4425 www.mdpi.com/journal/genes Article Next Generation Sequence Analysis and Computational Genomics Using Graphical Pipeline Workflows Federica Torri 1,2, Ivo D. Dinov 2,3, Alen Zamanyan 3, Sam Hobel 3, Alex Genco 3, Petros Petrosyan 3, Andrew P. Clark 4, Zhizhong Liu 3, Paul Eggert 3,5, Jonathan Pierce 3, James A. Knowles 4, Joseph Ames 2, Carl Kesselman 2, Arthur W. Toga 2,3, Steven G. Potkin 1,2, Marquis P. Vawter 6 and Fabio Macciardi 1,2,* 1 Department of Psychiatry and Human Behavior, University of California, Irvine, CA 92617, USA; E-Mails: ftorri@uci.edu (F.T.); sgpotkin@uci.edu (S.G.P.) 2 Biomedical Informatics Research Network (BIRN), Information Sciences Institute, University of Southern California, Los Angeles, CA 90292, USA; E-Mails: ivo.dinov@loni.ucla.edu (I.D.D.); jdames@uci.edu (J.A.); carl@isi.edu (C.K.); toga@loni.ucla.edu (A.W.T.) 3 Laboratory of Neuro Imaging (LONI), University of California, Los Angeles, CA 90095, USA; E-Mails: Alen.Zamanyan@loni.ucla.edu (A.Z.); shobel87@gmail.com (S.H.); alexgenco@gmail.com (A.G.); Petros.Petrosyan@loni.ucla.edu (P.P.); zhizhong.liu@loni.ucla.edu (Z.L.); eggert@cs.ucla.edu (P.E.); jonathan.pierce@loni.ucla.edu (J.P.) 4 Zilkha Neurogenetic Institute, USC Keck School of Medicine, Los Angeles, CA 90033, USA; E-Mails: clarkap@usc.edu (A.P.C.); knowles@med.usc.edu (J.A.K.) 5 Department of Computer Science, University of California, Los Angeles, CA 90095, USA 6 Functional Genomics Laboratory, Department of Psychiatry And Human Behavior, School of Medicine, University of California, Irvine, CA 92697, USA; E-Mail: mvawter@uci.edu * Author to whom correspondence should be addressed; E-Mail: fmacciar@uci.edu; Tel.: +1-949-824-4559; Fax: +1-949-824-2072. 
Received: 6 July 2012; in revised form: 15 August 2012 / Accepted: 15 August 2012 / Published: 30 August 2012 Abstract: Whole-genome and exome sequencing have already proven to be essential and powerful methods to identify genes responsible for simple Mendelian inherited disorders.
Table 1. Review of the most used software in next-generation sequencing (NGS) data analysis. Which includes two major computational macro-processes: (1) a primary step related to mapping and assembling, with alignment quality control, quality score re- regions of the genome; and (2) secondary, advanced steps focused on variant (single nucleotide polymorphisms (SNPs), insertions-deletions (Indels) and copy number variations (CNVs)) calling and annotation. These macro-processes are briefly reviewed to provide a background for the software algorithms embedded in DNA-Seq analysis.
Process | Software & Algorithms | Website
Preprocessing step | homemade script | (N/A)
(1.1) Alignment | MAQ | http://maq.sourceforge.net
(1.1) Alignment | BWA | http://bio-bwa.sourceforge.net/bwa.shtml
(1.1) Alignment | BWA-SW (SE only) | http://bio-bwa.sourceforge.net/bwa.shtml
(1.1) Alignment | PERM | http://code.google.com/p/perm/
(1.1) Alignment | BOWTIE | http://bowtie-bio.sourceforge.net
(1.1) Alignment | SOAPv2 | http://soap.genomics.org.cn
(1.1) Alignment | MOSAIK | http://bioinformatics.bc.edu/marthlab/Mosaik
(1.1) Alignment | NOVOALIGN | http://www.novocraft.com/
(1.2) De novo Assembly | VELVET | http://www.ebi.ac.uk/%7Ezerbino/velvet
(1.2) De novo Assembly | SOAPdenovo | http://soap.genomics.org.cn
(1.2) De novo Assembly | ABYSS | http://www.bcgsc.ca/platform/bioinfo/software/abyss
(1.3) Basic QC | SAMTOOLS | http://sourceforge.net/projects/SAMtools/files/
(1.3) Basic QC | PICARD | http://picard.sourceforge.net/command-line-overview.shtml
(1.4) Advanced QC | GATK | http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
(1.4) Advanced QC | PICARD | http://picard.sourceforge.net/
(1.4) Advanced QC | SAMTOOLS | http://sourceforge.net/projects/SAMtools/files/
(1.4) Advanced QC | IGVtools | http://www.broadinstitute.org/igv/igvtools
(2.1a) Variant Calling and annotation, Sequence Variant Analyzer v1.0, for hg18 annotations | SVA | http://www.svaproject.org/
(2.1a) Variant Calling and annotation, Sequence Variant Analyzer v1.0, for hg18 annotations | SAMTOOLS | http://sourceforge.net/projects/SAMtools/files/
(2.1a) Variant Calling and annotation, Sequence Variant Analyzer v1.0, for hg18 annotations | ERDS | http://www.duke.edu/~mz34/erds.htm
(2.1a) SAMTOOLS and ANNOVAR for annotation | SAMTOOLS | http://sourceforge.net/projects/SAMtools/files/
(2.1a) SAMTOOLS and ANNOVAR for annotation | ANNOVAR | http://www.openbioinformatics.org/annovar/
(2.1a) UnifiedGenotyper and ANNOVAR for annotation | GATK | http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
(2.1a) UnifiedGenotyper and ANNOVAR for annotation | ANNOVAR | http://www.openbioinformatics.org/annovar/
(2.1b) CNVs, CNVseq | CNVseq | http://tiger.dbs.nus.edu.sg/cnv-seq/
(2.1b) CNVs, CNVseq | R | http://www.r-project.org/
(2.1b) CNVs, SAMTOOLS/ERDS/Sequence variant analyzer v1.0 (ERDS) | SVA | http://www.svaproject.org/
(2.1b) CNVs, SAMTOOLS/ERDS/Sequence variant analyzer v1.0 (ERDS) | SAMTOOLS | http://sourceforge.net/projects/SAMtools/files/
(2.1b) CNVs, SAMTOOLS/ERDS/Sequence variant analyzer v1.0 (ERDS) | ERDS | http://www.duke.edu/~mz34/erds.htm
(2.1b) CNVs, CNVer | CNVer | http://compbio.cs.toronto.edu/CNVer/
(2.1b) CNVs, CNVer | BOWTIE | http://bowtie-bio.sourceforge.net
(2.1b) CNVs, CNVer | SAVANT | http://compbio.cs.toronto.edu/savant/
Simulated data generation tool | dwgsim | http://sourceforge.net/projects/dnaa/

Then it is run, and parallelised as much as possible 47 Fewer transcripts for genes encoding enzymes of ammonium spruce. In contrast, two genes encode Fd-GOGAT and NADH- Fig. 2 Flow chart showing preprocessing into useful reads, assembly into contigs and overlap-based reconciliation into final unigenes of sequenced data from 5 (591 174 069 short reads, Illumina) or 14 (6 381 011 long reads, 454) cDNA libraries in maritime pine. The maritime pine transcriptome 5 Here «many» sequences were assembled. And here, 10X more.

Now we design a database 48 With tables for the annotations and meta-information we find
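To make "tables for the annotations" concrete, here is a minimal sketch using an in-memory SQLite database. The table names, column names and demo rows are hypothetical and chosen only for illustration; they are not the actual EuroPineDB schema, which the slides do not show.

```python
import sqlite3

# Hypothetical two-table layout: one row per unigene, and any number of
# annotation rows (GO, EC, KEGG, InterPro...) pointing back to it.
SCHEMA = """
CREATE TABLE unigene (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE,
    sequence TEXT NOT NULL
);
CREATE TABLE annotation (
    id         INTEGER PRIMARY KEY,
    unigene_id INTEGER NOT NULL REFERENCES unigene(id),
    source     TEXT NOT NULL,
    term       TEXT NOT NULL
);
"""


def build_demo_db():
    """Create the schema in memory and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO unigene (id, name, sequence) VALUES (?, ?, ?)",
        (1, "demo_unigene_1", "ATGGCC"),
    )
    conn.executemany(
        "INSERT INTO annotation (unigene_id, source, term) VALUES (?, ?, ?)",
        [(1, "GO", "demo GO term"), (1, "KEGG", "demo KEGG term")],
    )
    conn.commit()
    return conn


def annotations_for(conn, unigene_name):
    """Return (source, term) pairs for one unigene, sorted by source."""
    rows = conn.execute(
        "SELECT a.source, a.term FROM annotation AS a "
        "JOIN unigene AS u ON u.id = a.unigene_id "
        "WHERE u.name = ? ORDER BY a.source",
        (unigene_name,),
    )
    return rows.fetchall()


conn = build_demo_db()
print(annotations_for(conn, "demo_unigene_1"))
```

Keeping annotations in their own table means a new annotation type becomes a new `source` value rather than a schema change, which fits a database that is meant to be expandable.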

… and we give it a web interface for the scientific community 49 gene library and pine species, and can be accessed using the ‘Assemblies’ tab. Each assembly can be inspected in detail, showing a paged list of UniGenes and a summary description. The detailed view of every UniGene means of GO term filtering. 3.1.2 Database retrieval In addition to a guided browsing, EuroPineDB contents can be retrieved by means of text search or sequence
[Figure 3 labels: Home; Gene libraries; 96-Well plates; 384-Well plates; Microarrays; Assemblies; BLAST; Search; Each library; All sequences; Each 96w_plate; Each clone/sequence; Each 384w_plate; Each microarray block; Each UniGene; External links; Annotations; List of assemblies; Descriptions; GO; EC; KEGG; InterPro; SNP; SSR; ORF]
Figure 3 Navigating through EuroPineDB. Arrowheads indicate the direction of navigation. Green boxes correspond to available views from all pages (thus, no incoming arrowhead is specified). Violet text indicates the option of downloading sequences in FASTA format.

Now we can discover biological information 50 A total of 5974 putative simple-sequence repeats (SSRs) were found, with trinucleotide repeats (3309) being the most common, and dinucleotide repeats (479) the least abundant. This is in agreement with previously published P. pinaster SSR abundance (Fernandez-Pozo et al., 2011). Discussion Maritime pine transcriptome assembly Long-read sequence data sets are required for transcriptome assembly in nonmodel species for which a reference genome is not available. In conifers, 454 sequencing has been recently used to generate well-defined transcriptomes in several species of ecological and economic interest, that is, Pinus contorta (Parchman et al., 2010), P. glauca (Rigault et al., 2011), P. pinaster (Fernandez-Pozo et al., 2011), Pinus taeda and 11 other conifers (Lorenz et al., 2012). In the present work, we used a combination of 454 and Illumina sequencing to define a minimal reference transcriptome for maritime pine (P. pinaster). A similar approach was recently used to characterize, for example, the globe artichoke transcriptome (Scaglione et al., 2012). The nonredundant transcriptome resulting from the assembly contains 26 020 unique transcripts with orthologue ID in public databases, a number very close to the 27 720 unique cDNA clusters reported for the P. glauca transcriptome (Rigault et al., 2011) and higher than the 17 000 unique coding genes obtained in the assembly of P. contorta transcriptome (Parchman et al., 2010). The number of unique transcripts in maritime pine is also close to the number of genes (28 354) resulting from the draft assembly of the 20-gigabase genome of P. abies (Nystedt et al., 2013). Considering all the available data, an elevated coverage of the maritime pine transcriptome is estimated. † MYB family of TF. ‡ Dof family of TF. § NAC family of TF.
Fig. 4 Distribution of unique transcripts corresponding to TF gene families in Pinus pinaster and comparison to other plant transcriptomes. The number of different encoded transcripts with the conserved DNA-binding domain of each family is represented. The distribution of TF gene families in P. pinaster, Picea glauca, Picea abies, Populus trichocarpa and Arabidopsis thaliana is compared.
annotation, comparative analysis with other conifer species and also for functional analysis of relevant genes associated to maritime pine growth, development and response to environmental changes. Furthermore, this genomic resource will greatly facilitate protein identification as well as protein–protein interaction studies through proteomics approaches (Canovas et al., 2004). For all these reasons, it was of paramount importance to (Figure 5) present in maritime pine (this work) or spruce genomes (Birol et al., 2013; Nystedt et al., 2013; Rigault et al., 2011) were of similar or even lower size compared with angiosperm species (P. trichocarpa, A. thaliana and V. vinifera). Meanwhile, the existence of large gene families in conifers coding for enzymes of secondary metabolism has been reported (Martin et al., 2004), there are other families in primary metabolism that contain
Fig. 5 Comparison of gene families for relevant enzymes in Pinus pinaster, Picea abies, Populus trichocarpa and Arabidopsis thaliana. The following databases were used in addition to SustainpineDB: P. abies v1.0, P. trichocarpa v3.0, A. thaliana TAIR 10. The maritime pine transcriptome 9
The pine genome is 10X the size of the human genome, but its gene families are smaller than in other plants
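The excerpt above counts dinucleotide and trinucleotide simple-sequence repeats (SSRs) but does not show how such repeats are located. As a rough sketch only (the paper's actual SSR tool and thresholds are not given in the slides, so the regular expression, the `min_repeats` threshold and the test sequence are all assumptions), a tandem repeat of a short unit can be found with a back-reference:

```python
import re


def find_ssrs(sequence, unit_len, min_repeats):
    """Find tandem repeats of a unit of `unit_len` bases.

    Returns (start, unit, repeat_count) tuples. Homopolymer runs such as
    AAAAAA are skipped, since they are not di- or trinucleotide repeats.
    """
    # ([ACGT]{n}) captures one unit; \1{m,} demands at least m more copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            continue
        hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return hits


# Made-up sequence containing one (AC)n and one (GAT)n repeat.
demo = "TTGACACACACACGGATGATGATGATGCC"
print("dinucleotide:", find_ssrs(demo, unit_len=2, min_repeats=4))
print("trinucleotide:", find_ssrs(demo, unit_len=3, min_repeats=4))
```

Running the same function over every unigene and tallying hits by `unit_len` would give counts of the kind quoted in the excerpt, although the figures depend heavily on the minimum repeat number chosen.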

Didn't I just mention «parallelisation»? 51 Hindawi Publishing Corporation Computational Biology Journal Volume 2013, Article ID 707540, 12 pages http://dx.doi.org/10.1155/2013/707540 Research Article SCBI_MapReduce, a New Ruby Task-Farm Skeleton for Automated Parallelisation and Distribution in Chunks of Sequences: The Implementation of a Boosted Blast+ Darío Guerrero-Fernández,1 Juan Falgueras,2 and M. Gonzalo Claros1,3 1 Supercomputación y Bioinformática-Plataforma Andaluza de Bioinformática (SCBI-PAB), Universidad de Málaga, 29071 Málaga, Spain 2 Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, 29071 Málaga, Spain 3 Departamento de Biología Molecular y Bioquímica, Universidad de Málaga, 29071 Málaga, Spain Correspondence should be addressed to M. Gonzalo Claros; claros@uma.es Received 21 June 2013; Revised 18 September 2013; Accepted 19 September 2013 Academic Editor: Ivan Merelli Copyright © 2013 Darío Guerrero-Fernández et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Current genomic analyses often require the managing and comparison of big data using desktop bioinformatic software that was not developed regarding multicore distribution. The task-farm SCBI_MapReduce is intended to simplify the trivial parallelisation and distribution of new and legacy software and scripts for biologists who are interested in using computers but are not skilled programmers. In the case of legacy applications, there is no need of modification or rewriting the source code. It can be used from multicore workstations to heterogeneous grids. Tests have demonstrated that speed-up scales almost linearly and that distribution in small chunks increases it. 
It is also shown that SCBI_MapReduce takes advantage of shared storage when necessary, is fault-tolerant, allows for resuming aborted jobs, does not need special hardware or virtual machine support, and provides the same results as a parallelised, legacy software. The same is true for interrupted and relaunched jobs. As proof-of-concept, distribution of a compiled version of Blast+ in the SCBI Distributed Blast gem is given, indicating that other blast binaries can be used while maintaining the same SCBI Distributed Blast code. Therefore, SCBI_MapReduce suits most parallelisation and distribution needs in, for example, gene and genome studies. 1. Introduction The study of genomes is undergoing a revolution: the production of an ever-growing amount of sequences increases year by year at a rate that outpaces computing performance [1]. This huge amount of sequences needs to be processed with the well-proven algorithms that will not run faster in new computer chips since around 2003 chipmakers discovered that they were no longer able to sustain faster sequential execution except for generating the multicore chips [2, 3]. Therefore, the only current way to obtain results in a timely manner Sequence alignment and comparison are the most important topics in bioinformatic studies of genes and genomes. It is a complex process that tries to optimise sequence homology by means of sequence similarity using the algorithm of Needleman-Wunsch for global alignment, or the one of Smith-Waterman for local alignments. Blast and Fasta [4] are the most widespread tools that have implemented them. Paired sequence comparison is inherently a parallel process in which many sequence pairs can be analysed at the same time by means of functions or algorithms that are iteratively performed over sequences. This is impelling the par-
picasso; Programming fundamentals
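The abstract above describes a task farm: sequences are cut into chunks, the chunks are handed to a pool of workers, and the results are gathered. SCBI_MapReduce itself is a Ruby library; the sketch below is not its API but a generic illustration of the same pattern in Python, with GC content standing in for the real per-sequence job (for example, a Blast+ run). It uses threads only to stay self-contained; a real deployment would use separate processes or cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_size):
    """Cut a list into consecutive chunks of at most `chunk_size` items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def gc_content(sequence):
    """Stand-in for the real per-sequence job: fraction of G and C bases."""
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence.upper()) / len(sequence)


def process_chunk(chunk):
    """Work done by one worker on one chunk."""
    return [gc_content(sequence) for sequence in chunk]


def task_farm(sequences, chunk_size=2, workers=4):
    """Distribute chunks to a worker pool and merge results in input order."""
    chunks = split_into_chunks(sequences, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the order the chunks were submitted,
        # regardless of which worker finishes first.
        results = pool.map(process_chunk, chunks)
        return [value for chunk_result in results for value in chunk_result]


reads = ["GGCC", "ATAT", "GCAT", "", "GGGA"]
print(task_farm(reads))
```

The chunk size is the tuning knob the abstract alludes to: small chunks keep all workers busy until the end, at the price of more scheduling overhead per sequence.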

SCBI_MapReduce: for parallelising and distributing 52 Efficient. Robust. Improves Blast performance.

So bioinformatics is not at odds with supercomputing 53 Red Española de Supercomputación, Picasso
Picasso: 2310 cores, 700 TB disk
7 FAT nodes of shared memory: 80 cores, 2 TB RAM, >25 GB/core
Computing nodes: 984 cores, 4 TB RAM, 4 GB/core
«Thin» nodes: 768 cores, 3 TB RAM, 8 GB/core
GPU nodes: 32 GPU, 1 TB RAM, 8 GB/core

Picasso: data centre for supercomputing and bioinformatics 54 Hard disks, FAT nodes, Computing nodes, THIN nodes, More disks, GPU nodes

Why data-centre infrastructures are good 55
• Providing solid infrastructure for software and hardware
• More cost-efficient for large-scale projects
• Cost-effective (licenses, computers...)
• Including expensive software and multi-user licenses
• Specialization
• Collaboration with other research groups outside UMA
Editorial The Need for Centralization of Computational Biology Resources Fran Lewitter1*, Michael Rebhan2*, Brent Richter3*, David Sexton4* 1 Bioinformatics and Research Computing, Whitehead Institute for Biomedical Research, Cambridge, Massachusetts, United States of America, 2 Novartis Institutes for BioMedical Research, Basel, Switzerland, 3 Enterprise Research IS and Informatics, Brigham and Women’s Hospital, Massachusetts General Hospital, and Partners Healthcare, Boston, Massachusetts, United States of America, 4 Center for Human Genetics Research, Computational Genomics Core, Vanderbilt University, Nashville, Tennessee, United States of America
Biomedical research is benefiting from the wealth of new data generated in the laboratory through new instrumentation, greater computational resources, and massive repositories of public domain data. Using these data to make scientific discoveries is sometimes straightforward, but can be complicated by the number and breadth of public sources available to the researcher as well as by the plethora of tools from which to choose. Complex searches, analyses, or even storage needs require more computational expertise than that available within an individual laboratory. As biomedical researchers develop more computational skills, this may change over time. Having a centralized group of experts in “core facility”, “platform”, etc., and different responsibilities for the group based on size and organization. For the purposes of this Editorial and the accompanying Perspectives (doi:10.1371/journal.pcbi.1000368 and doi:10.1371/journal.pcbi.1000369), we use the term “Bioinformatics Core Facility” to refer to these centralized resources. No matter what name is used, the primary focus of the centralized resource will be to support the investigators with their computational needs. Below, we highlight some of the most important reasons we see for centralizing these resources.
Providing Infrastructure On the software side, it can be economical to purchase multi-user, concurrent, or site licenses rather than individual licenses. This also helps with support of the software as purchasers of the larger licenses will likely be better prepared to field questions and offer training opportunities about installation and use of the software. In addition, the Bioinformatics Core Facility may be in a position to purchase expensive software that is used only occasionally by researchers, thus being able to provide more options for individuals to address important research needs. Many researchers in an institution may have the same needs for custom software. A person working in a centralized facility can
Why Centralize? Different institutions will have different names for these centralized resources
* E-mail: lewitter@wi.mit.edu (FL); michael.rebhan@novartis.com (MR); brichter@partners.org (BR); sexton@chgr.mc.vanderbilt.edu (DS) The order of authors is alphabetic; each author has contributed equally to the development and writing of this Editorial. PLoS Computational Biology | www.ploscompbiol.org 1 June 2009 | Volume 5 | Issue 6 | e1000372

How is it accessed? 56 Web tools, Command line, Web interface, Web server, Virtual machines, Database, Home, Files, Virtual machine, File transfer

Bioinformatics is not limited to sequences and databases 57

Applications of bioinformatics and supercomputing 58

Discovering new drugs «used to be» extremely expensive 59 Classical method: every compound has to be synthesised and tested in animals. Bioinformatic method: only the candidates are synthesised, which saves on synthesis, time and animals. Ligand database

It earned the Nobel Prize in Chemistry in 2013 60 For the development of computational models to understand and predict chemical processes. Theoretical chemist, Biophysicist, Biochemist http://blogs.plos.org/biologue/2013/10/18/the-significance-of-the-2013-nobel-prize-in-chemistry-and-the-challenges-ahead/ This Nobel Prize is the first given to work in computational biology, indicating that the field has matured and is on a par with experimental biology! The blog of PLOS Computational Biology

Systems biology reveals the keys 61 Cellular regulation becomes more intricate as the complexity of the organism increases

It tells us which proteins are best left untouched 62
allow the formation of supramolecular activator or inhibitory complexes, depending on their components and possible combinations. Transcription factors (TFs) are an essential subset of interacting proteins responsible for the control of gene expression. They interact with DNA regions and tend to form transcriptional regulatory complexes. Thus, the final effect of one of these complexes is determined by its TF composition. The number of TFs varies among organisms, although it appears to be linked to the organism’s complexity. Around 200–300 TFs are predicted for Escherichia coli [18] and Saccharomyces [19,20]. By contrast, comparative analysis in multicellular organisms shows that the predicted number of TFs reaches 600–820 in C. elegans and D. melanogaster [20,21], and 1500–1800 in Arabidopsis (1200 cloned sequences) [20–22]. For humans, around 1500 TFs have been documented [21] and it is estimated that there are 2000–3000 [21,23]. Such an increase in the number of TFs is associated with higher control of gene regulation [24]. Interestingly, such an increase is based on the use of the same structural types of proteins. Human transcription factors are predominantly Zn fingers, followed by homeobox and basic helix–loop–helix [21]. Phylogenetic studies have shown that the amplification and shuffling of protein domains determine the
Fig. 1. Human transcription factor network built from data extracted from the TRANSFAC 8.2 database. Numbered black filled nodes are the highest connected transcription factors. 1, TATA-binding protein (TBP); 2, p53; 3, p300; 4, retinoid X receptor α (RXRα); 5, retinoblastoma protein (pRB); 6, nuclear factor NFκB p65 subunit (RelA); 7, c-jun; 8, c-myc; 9, c-fos. Human transcription factor network topology C. Rodriguez-Caso et al.
Topology, tinkering and evolution of the human transcription factor network Carlos Rodriguez-Caso1,2, Miguel A. Medina2 and Ricard V. Solé1,3 1 ICREA-Complex Systems Laboratory, Universitat Pompeu Fabra, Barcelona, Spain 2 Department of Molecular Biology and Biochemistry, Faculty of Sciences, Universidad de Málaga, Spain 3 Santa Fe Institute, Santa Fe, New Mexico, USA
Living cells are composed of a large number of different molecules interacting with each other to yield complex spatial and temporal patterns. Unfortunately, this reality is seldom captured by traditional and molecular biology approaches. A shift from molecular to modular biology seems unavoidable [1] as biological systems are Early topological studies of cellular networks revealed that genomic, proteomic and metabolic maps share characteristic features with other real-world networks [8–12]. Protein networks, also called interactomes, were studied thanks to a massive two-hybrid system screening in unicellular Saccharomyces cerevisiae
Keywords human; molecular evolution; protein interaction; tinkering; transcription factor network Correspondence Ricard V. Solé, ICREA - Complex System Laboratory, Universitat Pompeu Fabra, Dr Aiguader 80, 08003 Barcelona, Spain Fax: +34 93 221 3237 Tel: +34 93 542 2821 E-mail: ricard.sole@upf.edu (Received 5 August 2005, revised 25 October 2005, accepted 31 October 2005) doi:10.1111/j.1742-4658.2005.05041.x
Patterns of protein interactions are organized around complex heterogeneous networks. Their architecture has been suggested to be of relevance in understanding the interactome and its functional organization, which pervades cellular robustness. Transcription factors are particularly relevant in this context, given their central role in gene regulation. Here we present the first topological study of the human protein–protein interacting transcription factor network built using the TRANSFAC database. We show that the network exhibits scale-free and small-world properties with a hierarchical and modular structure, which is built around a small number of key proteins. Most of these proteins are associated with proliferative diseases and are typically not linked to each other, thus reducing the propagation of failures through compartmentalization. Network modularity is consistent with common structural and functional features and the features are generated by two distinct evolutionary strategies: amplification and shuffling of interacting domains through tinkering and acquisition of specific interacting regions. The function of the regulatory complexes may have played an active role in choosing one of them.
Abbreviations ER, Erdős-Rényi; HTFN, human transcription factor network; SF, scale free; SW, small world; TF, transcription factor. FEBS Journal 272 (2005) 6423–6434 © 2005 The Authors Journal compilation © 2005 FEBS
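The "highest connected transcription factors" in Fig. 1 are the nodes with the most interaction partners, that is, the highest degree. A minimal sketch of that idea follows; the node labels and edges are invented for illustration and are not TRANSFAC data.

```python
from collections import Counter


def degrees(edges):
    """Count interaction partners per node in an undirected edge list."""
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts


def hubs(edges, top=2):
    """Return the `top` most connected nodes, most connected first."""
    return [node for node, _ in degrees(edges).most_common(top)]


# Hypothetical toy network: TF_A and TF_B interact with many partners,
# while the rest have one or two interactions each.
toy_edges = [
    ("TF_A", "TF_B"),
    ("TF_A", "TF_C"),
    ("TF_A", "TF_D"),
    ("TF_A", "TF_E"),
    ("TF_B", "TF_C"),
    ("TF_B", "TF_F"),
]

print(degrees(toy_edges))
print(hubs(toy_edges))
```

In a scale-free network most nodes look like TF_D or TF_E, with a single partner, and a handful look like TF_A. Removing such a hub disconnects many other proteins at once, which is the reasoning behind the slide title.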
