20090921 Art Databanken Agosti Final

Education

Published on October 7, 2009

Author: agosti

Source: slideshare.net

Description

Lecture presented at the ArtDataBanken meeting in Stockholm on September 21, 2009

Literature and XML: or How to Have More Time to Think

Donat Agosti, Plazi

ArtDataBanken, Stockholm, Sept 21, 2009

Who is this? What do I know about her? Where does she live? Who are you? What do you do? Where are you from?

The answers are in several hundred million pages of printed species descriptions in our libraries, including the descriptions and redescriptions of an estimated 1.8M species, and an estimated 50K new (re-)descriptions annually.

Taxonomists at work …… T. E. Lawrence: Seven Pillars of Wisdom – a triumph. 1st published for general circulation, 1935: p. 535

The traditional flux of information … … a more or less closed, intransient system

What has this to do with XML, semantic, enhanced documents?

Access

Scanning, PDF conversion (WWW)

Before antbase.org, Harvard's Museum of Comparative Zoology could claim to be the only location with a complete set of ant systematics publications from 1758 to the present. Through antbase.org's digital library, this body of literature is accessible worldwide, and it is actively used (>10,000 visits in a single month).

The Biodiversity Heritage Library is currently digitizing and making accessible >100 million pages, most of them out of copyright, i.e. older than 1925. ........ to be finished in 2048...

Access to ant taxonomic publications through antbase.org / Smithsonian Institution, currently including the entire body of non-copyrighted publications since 1758 (>4,000 publications or 85,000 pages)

Can taxonomic work be copyrighted? Copyright legislation is national but is based on the Berne Convention for the Protection of Literary and Artistic Works which defines a minimal standard. This international copyright standard does not require the recognition of treatments, the building stones of taxonomic publications, as works.

“Work” does not mean “text”, does not mean “data”, does not mean “information”. A “work” is something more. That kind of something more has many different definitions in the various legislations, but it is always there: It may be called originality, individuality, creation, personal expression, creative shaping or anyhow else, but it is a condition for qualifying a product as a work: “Work” is an intellectual product that is in a certain sense particular, individual, original, new. (Egloff: EDIT IPR and Copyright, 2008)

Taxonomic treatments are highly structured and homogeneous, part of a global >100 million page corpus growing at a rate of ca. 20,000 new species descriptions per year, not counting 5 times more redescriptions. Their structure is tightly controlled by a peer review process enforcing standards and a domain-specific vocabulary; they are written not as poems or in flowery language but in scientific jargon. Treatments do not qualify as works. The publications including the treatments might. (Egloff: EDIT IPR and Copyright, 2008)

It is about digesting millions of pages:

>>100 M pages taxonomic literature

25M scientific publications / year

25K journals

>1K with taxonomic descriptions

20K descriptions of new species / year

Is this the access we need?!

No, we need open access to content, not the PDF per se.

It is about machines (not us) doing a great deal of the work for us: extracting data, formulating hypotheses, ....

It is about data and information in context

“Nothing makes sense in biology except in the light of treatments.”

An example from the Neurocommons text mining pilot:

PubMed abstracts: > 16,000,000

CNS classified abstracts: 874,727

text mining recognized: 368,688

text mining processed: 94,381

extracted graph of 30,000+ relationships and 5,500 genes and proteins

“protein-protein interaction networks” (John Wilbanks, Neurocommons)

In a semantic Web environment (where machines talk to each other and do most of our work), data need to be able to talk to each other.

[Figure: “protein-protein interaction networks”, John Wilbanks, Neurocommons. Node labels: 27,266 papers; 4,563 papers; 41,985 papers; 10,365 papers; 128,437 papers]

Relational to Ontological Mapping

[Diagram. Entities: Drug, Neuron, Pathological Agent, Receptor, Channel, Agent, Neuronal Property, Pathological Change, Compartment. Relations: inhibits, involves, has, is_located_in. Slide courtesy of Kei Chung, Yale]

It will open up scientific literature for data mining.

[Figure: “protein-protein interaction networks”, John Wilbanks, Neurocommons]

TREATMENT: Cremastogaster mimosae

Likely diagnostically related to: Cremastogaster tricolor

Likely diagnostically related to: Cremastogaster tricolor

Likely diagnostically related to: Cremastogaster tricolor

Likely diagnostically related to: Cremastogaster amabilis

Likely diagnostically related to: Cremastogaster tricolor

Likely diagnostically related to: Cremastogaster amabilis

Associated with: Acacia sienocarpa

Living in: Mombasa

Living in: Tanga
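The relations listed on this slide can be held as simple subject-predicate-object statements, which is roughly the form a semantic Web tool would work with. The following is a minimal sketch, not Plazi's actual data model: the predicate names and the helper `index_by_predicate` are made up for illustration, and repeated statements are collapsed because the index uses sets.

```python
from collections import defaultdict

# Statements copied from the slide. The triple layout and the predicate
# names are assumptions made for this illustration.
TRIPLES = [
    ("Cremastogaster mimosae", "likely_diagnostically_related_to", "Cremastogaster tricolor"),
    ("Cremastogaster mimosae", "likely_diagnostically_related_to", "Cremastogaster amabilis"),
    ("Cremastogaster mimosae", "likely_diagnostically_related_to", "Cremastogaster tricolor"),
    ("Cremastogaster mimosae", "associated_with", "Acacia sienocarpa"),
    ("Cremastogaster mimosae", "living_in", "Mombasa"),
    ("Cremastogaster mimosae", "living_in", "Tanga"),
]


def index_by_predicate(triples):
    """Group objects by (subject, predicate); duplicates collapse into one entry."""
    index = defaultdict(set)
    for subject, predicate, obj in triples:
        index[(subject, predicate)].add(obj)
    return index


index = index_by_predicate(TRIPLES)
localities = sorted(index[("Cremastogaster mimosae", "living_in")])
related = sorted(index[("Cremastogaster mimosae", "likely_diagnostically_related_to")])
print(localities)
print(related)
```

Once statements are in this shape, a question such as "where does this species live?" becomes a lookup rather than a reading task.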

It is more: it is about access to the original or source data

The semantically enhanced treatments, extracted, stored on Plazi.org, and served in a human readable form, are linked to the underlying data: Fisher & Smith, 2008, PLoS ONE.

Semantic, enhanced treatments do the job ...

... and XML is one way to go.

XML

XML stands for EXtensible Markup Language

XML is a markup language much like HTML

XML was designed to carry data, not to display data

XML tags are not predefined. You must define your own tags (schema)

XML is designed to be self-descriptive

XML is a W3C Recommendation

XML

Being open and non-proprietary, XML is an optimal archival format for the treatment/publication

Being a stable and rich data format, XML can be repurposed for a variety of uses

XML

XML application design is an art in itself .... and thus cannot be explained in 15 minutes

Plenty of resources to dive into XML are on the Web, e.g. http://www.w3schools.com/, etc.

This means developing a schema that models the logical content (e.g. TaxonX) and inserting tags that define what a word means, so that a computer can understand it as well. To ensure that everybody talks about the same species, the name can be linked to a reference name server.

In the TaxonX schema, Azteca instabilis would then read like:

    <tax:name>
      <tax:xmldata>
        <dc:Genus>Azteca</dc:Genus>
        <dc:Species>instabilis</dc:Species>
      </tax:xmldata>
      Azteca instabilis
    </tax:name>

Annotations on the slide: external schema; normalization of data.

This can also be applied to entire sections of text, such as the treatment of a species and its parts.

    <tax:treatment>
      <tax:nomenclature>
        <tax:name>
          <tax:xid source="HNS" identifier="193329"/>
          <tax:xmldata>
            <dc:Genus>Mystrium</dc:Genus>
            <dc:Species>leonie</dc:Species>
          </tax:xmldata>
          Mystrium leonie
        </tax:name>
        <tax:status>n. sp.</tax:status>
        Fig 1 D - F
      </tax:nomenclature>
      <tax:div type="description">
        <tax:p>HOLOTYPE WORKER: TL 3.95, HL 1.02, HW 0.95, CI 93, SL 1.30, SI 137, PW 0.73, ML 0.38. Mandible outer margin strongly curving to a sharp apical tooth, the apex parallel to the anterior clypeal margin. (Holotype with material in mandibles, so mandibles and anterior clypeus are described below from paratypes.) Median clypeus ....</tax:p>
      </tax:div>
    </tax:treatment>
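To show what "a computer can understand" means in practice, here is a minimal sketch that reads a treatment marked up in this style and pulls out the normalized name and its external identifier. It uses Python's standard `xml.etree.ElementTree`. The namespace URIs are placeholders (the slides do not give them), the sample is shortened, and `extract_names` is a hypothetical helper, not part of any Plazi tool.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs: an assumption, since the slides only show prefixes.
NS = {"tax": "http://example.org/taxonx", "dc": "http://example.org/dc"}

# Shortened version of the treatment on the slide, with namespaces declared.
SAMPLE = """
<tax:treatment xmlns:tax="http://example.org/taxonx" xmlns:dc="http://example.org/dc">
  <tax:nomenclature>
    <tax:name>
      <tax:xid source="HNS" identifier="193329"/>
      <tax:xmldata>
        <dc:Genus>Mystrium</dc:Genus>
        <dc:Species>leonie</dc:Species>
      </tax:xmldata>
      Mystrium leonie
    </tax:name>
    <tax:status>n. sp.</tax:status>
  </tax:nomenclature>
  <tax:div type="description">
    <tax:p>HOLOTYPE WORKER: TL 3.95, HL 1.02, HW 0.95</tax:p>
  </tax:div>
</tax:treatment>
"""


def extract_names(xml_text):
    """Return one record per tax:name element found in the document."""
    root = ET.fromstring(xml_text)
    records = []
    for name in root.iterfind(".//tax:name", NS):
        xid = name.find("tax:xid", NS)
        records.append({
            "genus": name.findtext("tax:xmldata/dc:Genus", namespaces=NS),
            "species": name.findtext("tax:xmldata/dc:Species", namespaces=NS),
            "source": xid.get("source") if xid is not None else None,
            "identifier": xid.get("identifier") if xid is not None else None,
        })
    return records


records = extract_names(SAMPLE)
print(records)
```

The same few lines would work on any number of treatments, which is the point of tagging the name instead of leaving it as plain text.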

Globally unique identifiers (e.g. LSID) link up data

LSID for scientific publications

LSID for treatments

LSID for names (ZooBank / HNS ..)

LSID for specimens

LSID for DNA sequences / characters (ontologies)

LSID for repositories

GPS fixes for locations

Azteca instabilis would then read like:

    <tax:name>
      <tax:xid source="LSID" identifier="urn:lsid:biosci.ohio-state.edu:osuc_concepts:13452"/>
      <tax:xmldata>
        <dc:Genus>Azteca</dc:Genus>
        <dc:Species>instabilis</dc:Species>
      </tax:xmldata>
      Azteca instabilis
    </tax:name>

Annotations on the slide: the tax:xid element is the link to the external database; tax:xmldata holds the normalization of data.
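An LSID is a URN of the form `urn:lsid:authority:namespace:object`, optionally followed by `:revision`. As a minimal sketch of how a program might split such an identifier before handing it to a resolver, the following uses an invented identifier on `example.org`; the `LSID` tuple and `parse_lsid` are hypothetical names for this illustration only.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts."""
    parts = text.strip().split(":")
    well_formed = (
        len(parts) in (5, 6)
        and parts[0].lower() == "urn"
        and parts[1].lower() == "lsid"
        and all(parts[2:5])  # authority, namespace and object must be non-empty
    )
    if not well_formed:
        raise ValueError(f"not a well-formed LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Invented identifier, used only to exercise the parser.
example = parse_lsid("urn:lsid:example.org:names:13452")
print(example)
```

A check like this catches malformed identifiers early, before they are written into a marked-up document.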

We need XML-schemas, tools to convert and expose semantically enhanced documents.

Plazi workflow: overview

Plazi deliverables

TaxonX XML schema

GoldenGate

DSpace application

eXist application

SRS

Exchange protocols (SPM, TAPIR, REST)

Plazi workflow: GoldenGate mark up as an example

- Get LSID from Hymenoptera Name Server for names; ZooBank?

- Add new names

- Get bibliographic metadata from HNS (MODS)

- Get bibliographic GUIDs from bioguid (or EDIT?)

- Get geographic long/lat from geonames.org

- Get GUIDs for CBOL, NCBI, specimen, images, .....

Plazi Search and Retrieval Server: access to data (TAPIR, SPM)

[Diagram: "You", as human or machine, accessing the server]

Materials examined from literature in GBIF

Plazi workflow: content

11,000 descriptions online

500 publications

4,500 publications

Handle, SPM and TAPIR services

Feeds into HNS and ZooBank (soon)

Is harvested by GBIF, EOL

Support from GBIF, EOL, US-NSF, DFG

Does the retro mark-up process scale up to the millions of pages that need to be processed? Only partially: mark-up takes about 5 min/page; for 100 M pages = 700 man years (but it is only a first tool...); wizards can reduce the time by several factors. But: how much does it cost to digitize specimens, and what is its quality?
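The effort estimate on this slide is a simple product of pages and minutes per page, divided by the hours in a working year. The sketch below makes that arithmetic explicit. The 2,000 working hours per person-year and the speed-up factor of 5 are assumptions that the slide does not state, so the result need not reproduce the slide's figure; changing either assumption changes the answer proportionally.

```python
def markup_person_years(pages, minutes_per_page, hours_per_person_year):
    """Total manual mark-up effort, expressed in person-years."""
    total_hours = pages * minutes_per_page / 60
    return total_hours / hours_per_person_year


PAGES = 100_000_000            # from the slide
MINUTES_PER_PAGE = 5           # from the slide
HOURS_PER_PERSON_YEAR = 2_000  # assumption: roughly one full-time working year
WIZARD_SPEEDUP = 5             # assumption: "several factors" taken as 5

baseline = markup_person_years(PAGES, MINUTES_PER_PAGE, HOURS_PER_PERSON_YEAR)
with_wizards = markup_person_years(
    PAGES, MINUTES_PER_PAGE / WIZARD_SPEEDUP, HOURS_PER_PERSON_YEAR
)
print(f"baseline: {baseline:,.0f} person-years")
print(f"with wizards: {with_wizards:,.0f} person-years")
```

Whatever the exact inputs, the order of magnitude supports the slide's conclusion that manual retro mark-up alone scales only partially.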

The cost of converting legacy publications can be avoided by producing marked-up publications up-front

NLM/TaxonX schema allows publishers to maintain richly encoded articles whose data can be distributed and presented in multiple formats for a variety of uses.
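The idea of one encoded source feeding several presentations can be sketched in a few lines. The element names below are simplified stand-ins rather than the real NLM/TaxonX vocabulary, and `load_record`, `to_html`, `to_text` and `to_triples` are hypothetical helpers written for this illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Simplified stand-in markup (an assumption); real NLM/TaxonX documents are richer.
NAME_XML = "<name><genus>Azteca</genus><species>instabilis</species></name>"


def load_record(xml_text):
    """Read the child elements of the root into an ordered dict of tag -> text."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


def to_html(record):
    """Species-page style rendering."""
    name = escape(f"{record['genus']} {record['species']}")
    return f"<p><i>{name}</i></p>"


def to_text(record):
    """Plain rendering, e.g. for print."""
    return f"{record['genus']} {record['species']}"


def to_triples(record):
    """Statement-style rendering, e.g. as input for an RDF export."""
    subject = to_text(record)
    return [(subject, key, value) for key, value in record.items()]


record = load_record(NAME_XML)
print(to_html(record))
print(to_text(record))
print(to_triples(record))
```

Each output is derived from the same record, so a correction made once in the source propagates to every format.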

[Diagram, built up over several slides: one NLM/TaxonX XML document feeds Print, PDF, HTML, SPM/RDF, a database, HTML species pages (e.g. EOL, Scratchpads), an LSID resolver, and Google, data mining, ...]

Semi-automatically generated semantic, enhanced e-publications are the only way to describe the missing 10 M species, and to deal with an increasing flood of data.

The future of publications: the publication is semi-automatically generated

[Diagram labels: analysis & ms preparation; ms submission ("Taxon-x-version"); new ms alert; posting for review; edited ms; revised ms; accepted ms; publication: PDF; publication: hard copy; publication database ("taxon-x-version"); new taxon alert; new data; feedback. Connected resources: ontology, bibliography, ZooBank / NS, Taxon DB, Character DB, Specimen DB, Description DB, Distribution DB, Char. Matrix DB, Phyl. Tree DB, Char-state Im., Specimen Im., Habitat Image, Leg. Publicat.]

Journal authoring and production workflow: What do we miss?

[Diagram labels: Word, MS DB, input forms (author); export, export, convert; NLM taxpub, InDesign, NLM taxpub (publisher). Legend: available, prototypes, to be developed]

Where do we stand? 2008 LSIDs, external links

Where do we stand? 2008 LSIDs, external links, XML

Where do we stand? 2008

Where do we stand? 2009 LSIDs, external links, external data via doi, export services

Where do we stand? 2009 LSIDs, external links

Recommendations:

Individual level

Assure that all you do is open access

Understand copyright – be not afraid of copyright

Self archive (the Green Road)

Create content for the Web

Self archive (the Green Road): UNIZ as one of the global leaders in self archiving

Recommendations:

Individual level

Assure that all you do is open access

Understand copyright – be not afraid of copyright

Self archive (the Green Road)

Don't sign any contracts giving away rights

Talk to your scientific societies and museums to adopt a policy to at least allow self archiving

Demonstrate the power of access through innovative research projects and data: research will be the only motivation to change law and build up infrastructure

OECD Declaration for Access to Research Data from Public Funding (Spring 2007) How to implement this?

Recommendations:

Assure that all you do is open access

Understand, adopt and propagate an adequate copyright policy

Talk to your scientific societies and museum to adopt a policy to at least allow self archiving

Talk to your publishers to move into XML publishing

Support the emergence of standards and transfer protocols

Recommendations: (ctd.)

Science policy has to change to build and maintain the necessary cyberinfrastructure, similarly to the building of libraries

Prospective publications must be structured to allow machines to read and understand them.

Copyright must be adjusted to accommodate ingenious and best usage of our shared knowledge, such as using Creative Commons licenses or applying the principles of the Conservation Commons.

Sharing data has to become standard practice between scientists

antbase.org: Free access as the basis…

http://plazi.org Thank you very much! Donat Agosti
