Published on March 5, 2014
Human Protein Reference Database An analysis of the technology powering the database and website, and how it was developed. Kiran Jonnalagadda
Facts About HPRD • HPRD is a database of all disease-causing proteins in the human body. • It is the most comprehensive database of its kind in the world today. • Unlike most other biological databases, HPRD is protein-centric, not gene-centric.
Factors Leading to Choice of DB • The biologists hadn’t settled on what information was to be stored and therefore the data type definitions changed often. • Several data types were fairly similar to others but not the same. • Future extensions had to be built by tech-savvy biologists with minimal assistance from programmers.
What We Used • The Zope application server, comprising: – The Web publishing object framework. – ZODB, the object database storage system. – ZCatalog, the indexing and search system. – ZEO, the stand-alone database server for multiple front-end Web servers.
Why an RDBMS Was Not Suited • Data type definitions changed frequently. In an RDBMS, this would have meant redefining tables every week. • The code currently has about forty data classes. Imagine having that many data tables, plus tables for relationships between them, all under frequent revision.
How Zope Handled These Issues • Zope is built on Python, which offers dynamic data structures. • ZODB uses this ability to make the entire database look like one large data structure, transparently swapping unused parts to disk and recovering them as needed. • ZCatalog indexes data for searching.
At Zope’s Core is Python • Python is a dynamic language. • When I say dynamic, I mean everything is dynamic! • Code, variables, classes, modules, everything can be modified at run-time. • Most of Zope is built around this ability. Zope could not have been implemented in another language.
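As a rough illustration of what "everything can be modified at run-time" means in practice, here is a minimal plain-Python sketch. The `Protein` and `Interaction` classes and their fields are hypothetical names chosen for the example; they are not taken from the HPRD code.

```python
class Protein:
    """A hypothetical data class with a single field."""

    def __init__(self, name):
        self.name = name


p = Protein("TP53")

# Add a new field to one existing instance at run time.
p.organism = "Homo sapiens"


# Define a function and attach it to the class as a method;
# instances created earlier pick it up immediately.
def describe(self):
    return f"{self.name} ({getattr(self, 'organism', 'unknown')})"


Protein.describe = describe

# Build a brand-new class at run time with type().
Interaction = type("Interaction", (object,), {"kind": "binding"})

print(p.describe())
print(Interaction().kind)
```

Nothing here needed a recompile or a schema change, which is the property the slides credit for Zope's flexibility.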
Data Storage in Zope • In Zope, data is stored in instances of a data class. • The data class has variables, which are like fields, and methods, which manipulate data. • Instances of a data class (objects) are stored in the ZODB, making the database. • Objects can contain other objects, forming hierarchies.
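The object model described above can be sketched in plain Python. This is only an in-memory illustration under assumed names (`Folder`, `Protein`, `add_alias`); in ZODB such classes would typically derive from a persistent base class and be attached to the database root, which is omitted here so the example runs without Zope installed.

```python
class Folder:
    """A container object: holds other objects by name."""

    def __init__(self):
        self.children = {}

    def add(self, name, obj):
        self.children[name] = obj
        return obj


class Protein:
    """A data class: variables act as fields, methods manipulate them."""

    def __init__(self, name):
        self.name = name
        self.aliases = []

    def add_alias(self, alias):
        # Keep the alias list free of duplicates.
        if alias not in self.aliases:
            self.aliases.append(alias)


# Objects contain other objects, forming a hierarchy.
root = Folder()
proteins = root.add("proteins", Folder())
p53 = proteins.add("TP53", Protein("TP53"))
p53.add_alias("p53")
p53.add_alias("p53")  # ignored, already present

print(root.children["proteins"].children["TP53"].aliases)
```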
Components of Zope • ZServer (formerly Medusa) – Handles incoming requests. – Does HTTP, FTP, WebDAV, XML-RPC; soon SOAP. • ZPublisher – Maps URLs to objects and handles security. • ZODB (Zope Object DataBase) – Stores objects on disk in a transactional DB. • ZEO (Zope Enterprise Objects) – ZODB server for multiple Zope front-end servers.
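To make "maps URLs to objects" concrete, the following is a simplified sketch of path traversal over an object hierarchy. It is not ZPublisher's actual implementation; `Node` and `traverse` are hypothetical names, and security checks are left out.

```python
class Node:
    """A hypothetical object in the hierarchy; children are looked up by name."""

    def __init__(self, title, **children):
        self.title = title
        self.children = children


def traverse(root, path):
    """Walk a URL path one segment at a time; return None if a segment is missing."""
    obj = root
    for segment in path.strip("/").split("/"):
        if not segment:
            continue  # tolerate "/" and doubled slashes
        obj = obj.children.get(segment)
        if obj is None:
            return None
    return obj


root = Node("root", proteins=Node("Proteins", TP53=Node("Tumor protein p53")))

print(traverse(root, "/proteins/TP53").title)
print(traverse(root, "/proteins/missing"))
```

Each URL segment names a child of the object found so far, so the URL structure mirrors the object hierarchy.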
Security in Zope • Security is fine grained. • Security is defined around four concepts: – Users, Roles, Permissions and Hierarchies. • A user is assigned one or more roles. • A role is assigned a set of permissions. • This set can be reassigned at different positions in the hierarchy.
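The four concepts can be sketched as follows. This is a deliberately simplified model, not Zope's security machinery: `SecuredNode`, `allowed`, and the role and permission names are assumptions made for the example.

```python
class SecuredNode:
    """A hypothetical node in the hierarchy with optional local role settings."""

    def __init__(self, parent=None, role_permissions=None):
        self.parent = parent
        # Mapping of role name -> set of permissions, present only where set here.
        self.role_permissions = role_permissions or {}

    def permissions_for(self, role):
        """Use the nearest setting for this role, walking up towards the root."""
        node = self
        while node is not None:
            if role in node.role_permissions:
                return node.role_permissions[role]
            node = node.parent
        return set()


def allowed(user_roles, node, permission):
    """A user is allowed if any of their roles carries the permission at this node."""
    return any(permission in node.permissions_for(role) for role in user_roles)


root = SecuredNode(role_permissions={
    "Anonymous": {"View"},
    "Curator": {"View", "Edit"},
})
# The drafts folder reassigns Anonymous to an empty permission set.
drafts = SecuredNode(parent=root, role_permissions={"Anonymous": set()})

users = {"guest": ["Anonymous"], "asha": ["Curator"]}

print(allowed(users["guest"], root, "View"))
print(allowed(users["guest"], drafts, "View"))
print(allowed(users["asha"], drafts, "Edit"))
```

The same role therefore grants different permissions depending on where in the hierarchy the object sits.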
Security Outside Zope • Zope’s security mechanism is limited to the Web front. • It is applied only to objects that directly interface with the end-user. • Code written in a module in the filesystem has no security restrictions. It can do anything.
Limitations in Zope • The API for creating extensions (called Products) is complicated and poorly documented. • The Property Manager interface is too primitive. It only handles the very basic data types such as strings, integers, boolean fields, selection lists and multi-line text.
Our Extensions to Zope • A framework for separating Zope specifics from our data types, making it much simpler to add new data types. • An extended property management system that could handle changes in data type definitions and automatically migrate data.
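The slides do not show how the extended property system worked. One way such a system could be structured is sketched below: each data type declares a schema, and stored data is brought in line with the current schema whenever it is loaded. All names (`PropertyDef`, `Record`, `migrate`) are hypothetical and the approach is an assumption, not the HPRD implementation.

```python
class PropertyDef:
    """A hypothetical property definition: name, type and default value."""

    def __init__(self, name, type_, default):
        self.name = name
        self.type_ = type_
        self.default = default


class Record:
    """Base class for data types; subclasses only declare a schema."""

    schema = ()

    def __init__(self, **values):
        self.data = dict(values)
        self.migrate()

    def migrate(self):
        """Bring stored data in line with the current schema."""
        names = {prop.name for prop in self.schema}
        for prop in self.schema:
            if prop.name not in self.data:
                # Field added to the schema: fill in the default.
                self.data[prop.name] = prop.default
            else:
                # Field type may have changed: coerce the stored value.
                self.data[prop.name] = prop.type_(self.data[prop.name])
        for stale in set(self.data) - names:
            # Field removed from the schema: drop the stored value.
            del self.data[stale]


class Protein(Record):
    schema = (
        PropertyDef("name", str, ""),
        PropertyDef("length", int, 0),
    )


old = Protein(name="TP53", length="393", obsolete_note="x")
print(old.data)

# The definition changes later: a new field appears.
Protein.schema += (PropertyDef("localization", str, "unknown"),)
old.migrate()
print(old.data)
```

With this shape, adding a data type means writing a short schema declaration rather than touching Zope-specific code, which matches the stated goal of letting biologists extend the system.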
Part II User Interface The rationale behind decisions affecting how a user experiences the database.
User Interface Design • We started by exposing Zope’s hierarchy as the public user interface. • But there were some elements such as the category browser and the
Templates for the Web UI • Choice of DTML and ZPT for templates. • ZPT for templating system.
Part III Project Management Lessons What we learnt about managing a project across continents and distant time zones.
Project Management Issues 1 • We learnt the hard way that a project manager’s place is with his team, not with the client. • Productivity suffers in the absence of an effective collaboration tool. • E-mail and instant messengers are not effective collaboration tools.
Project Management Issues 2 • Collaboration over e-mail imposes the burden of articulation on the communicator, which many dislike and therefore avoid. • Instant messaging prevents collecting thoughts before presenting them and is therefore a poor planning tool.
Collaboration Tools • We experimented with several collaboration systems, with varying effectiveness: – Phone calls. – Instant messengers. – Wikis. – Issue tracking software. – Mailing lists.
Phone Calls • Next best thing to face-to-face discussions. • But they only connect two people unless nonstandard equipment is used. • International calls are usually too expensive for the resulting gain.
Instant Messengers • Provide critical communication between geographically distributed team members. • But the pressure of maintaining continuity in a conversation hinders pausing to gather thoughts. • Typing is much slower than talking. Therefore little else gets done alongside.
Wikis • The easy hyperlinking system of a wiki combined with structured text makes presenting information a snap. • With a little code thrown in, Wikis could make a wonderful project management tool. • A changed page notification system is needed or changes go unnoticed.
Issue Tracking Software • We use Bugzilla to track issues. • But in eight months of use, only 30 issues have been reported through it. • The other few hundred were reported over e-mail, instant messengers and in person. • Clearly, the problem is with Bugzilla’s usability. The search for a new system is on.
Mailing Lists • E-mail is push media: the latest is always on top of your inbox. • E-mail makes an effective to-do list in an interface the user is comfortable with. • Mailing lists are e-mail in broadcast mode. • Mailing lists have been the most effective collaboration tool we’ve used so far.
Issues With Programmers • Programmer skill levels and attitudes vary. • C programmers tend to write C code in Python. • PHP programmers tend to write PHP code in Python. • Learning Python is easy but thinking in Python takes a long time.
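A small, made-up example of the difference between writing C in Python and thinking in Python. Both versions produce the same result; the list contents and variable names are illustrative only.

```python
names = ["kinase", "ligase", "protease"]

# C-style: index arithmetic and manual accumulation.
long_names_c = []
i = 0
while i < len(names):
    if len(names[i]) > 6:
        long_names_c.append(names[i].upper())
    i = i + 1

# Idiomatic Python: iterate over the values directly.
long_names_py = [name.upper() for name in names if len(name) > 6]

print(long_names_c == long_names_py)
```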
Programming Tools We Used • CVS for source control. • ViewCVS for a Web front-end to CVS. • Vim in GUI mode for source editing (preferred editor of everyone in the team). • The print statement for debugging.
Tools We Should Have Used • WingIDE is a $35 piece of software that provides an interactive Python debugger usable with Zope. • A few minutes of use would have more than paid for it, given the hours of programmer time we instead spent debugging with the print statement.
Part IV Things Needing Fixing Mistakes we made during development, how they affect things now, and how they can be fixed.
Naming Conventions • We started by assuming HPRD was gene-centric and named several things GeneSomething. • In code, this can be considered just an identifier. • But in a URL, such names can confuse users and need to be changed.
Reusable Modules • All of the code currently sits in one directory. • Several important pieces have nothing to do with how they are being used. • These modules could be separated and contributed independently to the open source code pool.
Data in Code • There are bits of implementation-specific data embedded in code in some places, particularly related to graph generation. • These were introduced as quick patches for a temporary problem but have remained in place for months now. • These need to be taken out so that the code is truly reusable.
Documentation • DocStrings needed in code. • Consistent language in DocStrings. • HTML documentation files to be distributed with code.