Defrosting the Digital Library: A survey of bibliographic tools for the next generation web


Published on March 16, 2009

Author: dullhunk

Source: slideshare.net

Description

After centuries with little change, scientific libraries have recently experienced massive upheaval. From being almost entirely paper-based, most libraries are now almost completely digital. This information revolution has all happened in less than 20 years and has created many novel opportunities and threats for scientists, publishers and libraries.

Today, we are struggling with an embarrassing wealth of digital knowledge on the Web. Most scientists access this knowledge through some kind of digital library; however, these libraries can be cold, impersonal, isolated, and inaccessible places. Many libraries are still clinging to obsolete models of identity, attribution, contribution, citation and publication.

Based on a review published in PLoS Computational Biology (http://pubmed.gov/18974831), this talk will discuss the current chilly state of digital libraries for biologists, chemists and informaticians, including PubMed and Google Scholar. We highlight problems with, and solutions to, the coupling and decoupling of publication data and metadata, using a tool called citeulike (http://www.citeulike.org). This software tool exploits the Web to make digital libraries “warmer”: more personal, sociable, integrated, and accessible places.

Finally, issues that will help or hinder the continued warming of libraries in the future, particularly the accurate identification of authors and their publications, are briefly introduced. These are discussed in the context of the BBSRC-funded REFINE project at the National Centre for Text Mining (NaCTeM.ac.uk), which is linking biochemical pathway data with evidence for pathways from the PubMed database.

Defrosting the Digital Library: A survey of bibliographic tools for the next generation Web

Duncan Hull

Faculty of Life Sciences (1992-6): BSc

Computer Science (2002-2007): MSc, PhD

Chemistry (2008-date): Postdoc

It’s all Casey’s fault! Dr. Casey Bergman, Lecturer Faculty of Life Sciences I s Citeulike.org! http://ukpmc.ac.uk/

http://pubmed.gov/19060304


Defrosting the Digital Library (in one slide)

There are lots of digital libraries out there for scientists!

ACM, IEEE, PubMed, DBLP, Scopus, ISI-WoK, Google Scholar, arXiv

But they have some fundamental problems with their data

Identity crisis: identifying people accurately

Identity crisis: identifying publications accurately

Keeping data and metadata coupled together

Impersonal, unsociable, difficult to use: “Cold”

Some new tools exist to make things better: “warmer”

Citeulike, Mendeley, Zotero, Papyro, Papers etc

BUT Fundamental problems with identity and data need to be fixed before the tools will get any better

Metawhat?

From the Greek μετά (meta) meaning after

metadata is not just data about data

metadata is data after data

data first

metadata second

Reversible reaction (“round-tripping”)

Diagram labels: getMetadata (“Tell me more? What is it about?”) and getData (“Where did it come from?”)

Title: Defrosting the digital library; Authors: Duncan Hull, Steve Pettifer and Douglas Kell; Published: 2008; Journal: PLoS Computational Biology
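The round trip described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any real library's API: the `Library` class and its `get_metadata` and `get_data` methods are hypothetical names modelled on the getMetadata / getData labels, and the record reuses the title, authors, year and journal given on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metadata:
    title: str
    authors: tuple
    year: int
    journal: str


class Library:
    """Hypothetical toy library that keeps data (a URI) coupled to its metadata."""

    def __init__(self):
        self._metadata_by_uri = {}
        self._uri_by_title = {}

    def add(self, uri, metadata):
        self._metadata_by_uri[uri] = metadata
        self._uri_by_title[metadata.title.lower()] = uri

    def get_metadata(self, uri):
        """Data -> metadata: what is it about?"""
        return self._metadata_by_uri[uri]

    def get_data(self, metadata):
        """Metadata -> data: where did it come from?"""
        return self._uri_by_title[metadata.title.lower()]


library = Library()
paper_uri = "http://dx.doi.org/10.1371/journal.pcbi.1000204"
library.add(
    paper_uri,
    Metadata(
        title="Defrosting the digital library",
        authors=("Duncan Hull", "Steve Pettifer", "Douglas Kell"),
        year=2008,
        journal="PLoS Computational Biology",
    ),
)

record = library.get_metadata(paper_uri)
# The reaction is reversible only if the metadata leads back to the same data.
round_trip_ok = library.get_data(record) == paper_uri
```

Looking up by title is a deliberate simplification; the later slides on identity explain why titles alone are a weak key.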

Metadata in: Chemistry (Science of Matter) Biology (Science of Life) Informatics (Science of Information) Cheminformatics Biochemistry Bioinformatics Science! www.mib.ac.uk nactem.ac.uk/refine www.citeulike.org

Representing Evidence For Interacting Network Elements (REFINE) www.sbml.org from www.biomodels.net database at the EBI.ac.uk

Example from Glycolysis in Yeast reactant reactant product product modifier This is just one reaction, there are at least another 1700+ in Yeast

Synonyms from Pedro Mendes B-Net Database http://www.comp-sys-bio.org/yeastnet/

Name: Synonyms

D-Glucose: dextrose; D-Glucose; D-(+)-glucose; D(+)-glucose; grape sugar; Traubenzucker

ATP: Adenosine 5'-triphosphate; Adenosine triphosphate; H4atp

Hexokinase: Hexokinase-1; Hexokinase-A; Hexokinase PI; YFR053C

ADP: 5'-adenylphosphoric acid; Adenosine 5'-diphosphate; H3adp

Glucose-6-phosphate: Robison ester, D-Glucose 6-phosphate

Chemistry Biology Informatics Cheminformatics Biochemistry Bioinformatics

For more info. www.nactem.ac.uk/refine One of the biggest challenges is getting hold of accurate metadata from libraries and databases

But first…

Before getting into the paper…

Some lessons I learnt while working in industrial informatics for a small startup company called CSW Informatics Ltd

Ford and BBC

How business and governments manage metadata

Ford Focus (launched 1998)

6 million+ “units” sold worldwide to date: America, Europe, Middle East, Africa, Australasia

Lots of data, metadata and money!

Diagram labels: getMetadata, getData, Owner’s handbook, “Tell me more? What is it about?”

Final solution: Web XSLT Print

Summary: Lessons from Ford

Data is often the tip of the iceberg

If the data doesn’t sink you, the metadata will

Businesses like Ford spend $ £ € keeping data and metadata together

Data is often worthless without its metadata

Can’t sell data (cars) without metadata (manuals)

Don’t just “make cars”

Diagram labels: DATA, METADATA

BBC Spooks?

Open Source Intelligence (OSINT)

Overt not covert espionage: 370 journalists, 24-7, ~100 languages

Caversham, Reading

Keeping an eye on people around the world since 1939

Winston Churchill

“Big British Castle” (BBC)

I hate powerpoint Radio MS Word TV

How do they stay in business? Broadcasting House, London Foreign governments, e.g. U.S.A. etc

Word: Not the best way to manage data and metadata

Getting Rid of Word database XML schema Web & Intranet Printed documents XSLT

A solution that worked! getMetadata getData Who is Thabo Mbeki? These documents are all about Thabo Mbeki Thabo Mbeki

Summary: Lessons from the BBC

Important decisions are made on the basis of metadata

Crucial that metadata is accurate, high quality and trustworthy

Identifying people properly is crucial (100%)

You know what data is about (getMetadata)

You know where it came from (getData)

Looked after properly (this can be expensive)

Businesses built on buying/selling metadata:

How have libraries managed metadata?

On paper since 300 B.C. (Library of Alexandria)

Organised in physical space

In buildings made from bricks and mortar

Expensive and slow to distribute

Only ever read by humans

Filled with content bought from publishers, locked up with copyright

Image via http://en.wikipedia.org/wiki/Library_of_Alexandria

From ~1824 until ~1989 Photos via dpicker http://www.flickr.com/photos/dpicker/3107856991/ and pit yacker http://www.flickr.com/photos/78825653@N00/131611136 JRULM (Main Library) Joule Library Mostly “private” only available to an elite (e.g. University of Manchester Students and Staff)

Metadata (after) Data

Tightly bound (literally)

Rarely separated

First published 1687, over 300 years old

Data and metadata were like this for centuries! Until…

+ Tim Berners-Lee 1989

Timeline: Unchanged for centuries but… 20 years ÷ 2309 years = <1%
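The arithmetic behind this slide can be checked directly. The sketch below assumes the 2309 years run from roughly 300 B.C. (the Library of Alexandria slide) to 2009 (the year the talk was published), and that the 20 years run from 1989 (the Tim Berners-Lee slide) to 2009; the variable names are illustrative only.

```python
# Assumed endpoints, taken from dates that appear elsewhere in the talk.
LIBRARY_START_BC = 300   # "On paper since 300 B.C."
WEB_START_YEAR = 1989    # "+ Tim Berners-Lee 1989"
TALK_YEAR = 2009         # "Published on March 16, 2009"

# Simple sum of B.C. and A.D. years; the missing year zero is ignored.
total_years = LIBRARY_START_BC + TALK_YEAR
web_years = TALK_YEAR - WEB_START_YEAR
web_share_percent = 100 * web_years / total_years

print(f"{web_years} / {total_years} = {web_share_percent:.2f}%")
# prints: 20 / 2309 = 0.87%
```

So the digital era covers a little under 1% of the timeline, which is what the slide's "<1%" refers to.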

Everything’s Gone Digital! www.scopus.com www.pubmed.gov http://ukpmc.ac.uk www.isiknowledge.com scholar.google.com

Digital Utopia?

Alexander Griekspoor www.mekentosj.com

Bits and bytes 1010100101000001101010 (not paper)

In pervasive cyberspace (not physical space)

Databases and/or Web identified by URIs: (not buildings)

Cost of distribution fallen by orders of magnitude

Read and indexed by machines like Googlebot et al (not just humans)

Increasingly public, available to everyone via Open-Access publishing (less private, less restrictive copyright)

Everything is great?

Welcome to Digital Dystopia

Isolation

each discipline has its own data silo

Impersonal and unsociable

“Who the hell are you?”

Where are “my” papers? (authored by me, or of interest to me)

What are my friends and colleagues reading?

What are the experts reading? What is popular this week / month / year?

“Cold”: Identity of publications and authors is inadequate

Data divorced from its metadata

GetMetadata / GetData unreliable

Therefore can be difficult to tell what data is about, or where metadata came from

Obsolete models of publication, not everything fits publication-sized holes

Micro-attribution

Mega-attribution

Digital contributions (databases, software, wikis/blogs?)

Isolated publication silos Chemistry Informatics Biology impersonal, isolated, unsociable, Generally rubbish

Identity Crisis part 1: Which publication?

http://pubmed.gov/18974831

http://www.ncbi.nlm.nih.gov/pubmed/18974831

http://ukpmc.ac.uk/articlerender.cgi?accid=pmcA2568856

http://ukpmc.ac.uk/picrender.cgi?artid=1687256&blobtype=pdf

http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000204

http://www.dbkgroup.org/Papers/hull_defrost_ploscb08.pdf

http://dx.doi.org/10.1371/journal.pcbi.1000204

One paper, many URIs. Disambiguation algorithms rely on getting metadata for each

Big problem for libraries is these redundant duplicates

Matching can be done by Digital Object Identifier (DOI) and PubMed ID (PMID); these are frequently absent < 5% (Kevin Emamy, citeulike)
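To make the "one paper, many URIs" problem concrete, here is a rough Python sketch that tries to pull a DOI or PMID out of each of the seven URIs listed on this slide and groups the URIs by whatever identifier it finds. The regular expressions and function names are assumptions made for illustration; this is not how citeulike or any particular library performs matching.

```python
import re
from collections import defaultdict
from urllib.parse import unquote

# Illustrative patterns only; real DOIs and PubMed URLs have more variants.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s?#&]+")
PMID_PATTERN = re.compile(r"pubmed(?:\.gov)?/(\d+)")

URIS = [
    "http://pubmed.gov/18974831",
    "http://www.ncbi.nlm.nih.gov/pubmed/18974831",
    "http://ukpmc.ac.uk/articlerender.cgi?accid=pmcA2568856",
    "http://ukpmc.ac.uk/picrender.cgi?artid=1687256&blobtype=pdf",
    "http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000204",
    "http://www.dbkgroup.org/Papers/hull_defrost_ploscb08.pdf",
    "http://dx.doi.org/10.1371/journal.pcbi.1000204",
]


def identifier_for(uri):
    """Return 'doi:...' or 'pmid:...' if one is visible in the URI, else None."""
    decoded = unquote(uri)  # turns %3A and %2F back into ':' and '/'
    doi = DOI_PATTERN.search(decoded)
    if doi:
        return "doi:" + doi.group(0).lower()
    pmid = PMID_PATTERN.search(decoded)
    if pmid:
        return "pmid:" + pmid.group(1)
    return None


def group_by_identifier(uris):
    """Group URIs that expose the same identifier; None collects the rest."""
    groups = defaultdict(list)
    for uri in uris:
        groups[identifier_for(uri)].append(uri)
    return dict(groups)


groups = group_by_identifier(URIS)
for identifier, members in groups.items():
    print(identifier, len(members))
```

Even on this small list the sketch leaves three URIs with no identifier at all, and it cannot tell that the DOI group and the PMID group are the same paper without fetching extra metadata that maps one to the other.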

Identity crisis part 2: Who are you?

Who, who … who, who?

Douglas Kell

Doug Kell

Douglas B Kell

Kell, D

Kell, D.B.

Douglas Bruce Kell

Druglas Kell (typo)

“Attribution would seem to be a simple process and yet it represents a major, unsolved problem for information science.” Neil Smalheiser and Vetle Torvik http://tinyurl.com/authorid
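A naive way to see why these variants are hard is to try to normalise them. The Python sketch below reduces each name to a (surname, first initial) key; `name_key` is a hypothetical helper written for this example, not a published disambiguation method.

```python
def name_key(name):
    """Reduce a personal name to a lower-cased (surname, first initial) pair."""
    cleaned = name.replace(".", " ").strip()
    if "," in cleaned:
        # "Kell, D B" style: surname comes first
        surname, given = (part.strip() for part in cleaned.split(",", 1))
    else:
        # "Douglas B Kell" style: surname comes last
        parts = cleaned.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    return (surname.lower(), given[:1].lower())


VARIANTS = [
    "Douglas Kell",
    "Doug Kell",
    "Douglas B Kell",
    "Kell, D",
    "Kell, D.B.",
    "Douglas Bruce Kell",
    "Druglas Kell",  # the typo from the slide
]

keys = {name_key(variant) for variant in VARIANTS}
print(keys)
```

All seven variants collapse to the same key, but so would any other author whose surname is Kell and whose first name starts with D, which is exactly why name strings alone cannot establish identity.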

Identity crisis part 3: Mistaken Identity

Google Scholar thinks I’m Maurice Wilkins

Diagram labels: Dr. Duncan Hull, humble postdoc; article with title “DNA mania”; authored-by (wrong!) http://tinyurl.com/mistakenid

Can’t get metadata (decoupled from data): PDF getMetadata getData Title: defrosting the digital library Authors: Duncan Hull, Steve Pettifer and Douglas Kell Published: 2008 Tell me more Don’t know, Try google Don’t know, Title might be “ defrosting…” Where did this come from?

Can’t get metadata (decoupled from data): PDF

MP3 music file in iTunes: data is tightly coupled to its metadata

getMetadata / getData: Artist: The Who; Title: Who Are You?; Recorded: 1978; Album: Who Are You

“Why can't I manage academic papers like MP3s?” James Howison, Carnegie Mellon University http://tinyurl.com/mp3vpdf

Can’t get metadata (decoupled from data): PDF Peter Murray-Rust Hamburger (unstructured data) PDF is a hamburger, and we're trying to turn it back into a cow. http://tinyurl.com/pdfhamburger Cow (structured data) publishing text-mining

Can’t get metadata (decoupled from data): HTTP

Arbitrary URI (not just pubmed, but any scientific paper) http://www.ncbi.nlm.nih.gov/pubmed/18974831

Can’t get metadata (decoupled from data): HTTP

Fundamental problem with the way the web is built using HTTP, can’t change it now…

“One of the Web's distinguishing features is that there's a big gaping hole where the metadata ought to be.” Tim Bray, Sun Microsystems http://tinyurl.com/nometadata

I’ll stop moaning now

Isolation

Can’t identify people

Can’t identify publications

Metadata gets divorced from its data

But what are the solutions?

www.citeulike.org Richard Cameron Kevin Emamy Picture from http://network.nature.com/people/mfenner/blog/2009/01/30/interview-with-kevin-emamy and http://www.citeulike.org/faq/faq.adp The reason I wrote the site [citeulike.org] was, after recently coming back to academia, I was slightly shocked by the quality of some of the tools available to help academics do their job. I found it preferable to start writing proper tools for my own use than to use existing software.

Why should you care about citeulike?

Could save you time

But also like Green Fluorescent Protein…

All references in one place

Click Post to Citeulike

Tag it (optional)

Citeulike: Recoupling data and metadata

Wouldn’t be a problem if the publishers hadn’t decoupled it in the first place!

Citegeist = Citeulike + Zeitgeist

allegedly 2,243,177 ~2,000 /day variable 674,076 2,880 /day 2 papers / min Linear growth ~500,000

Where will citeulike break?

The more people that use “social software”, the better they get

Citeulike is one of the leading ones, but there is plenty of competition

Parsers are fragile, easily (and deliberately) broken by publishers

ISI WOK and Scopus

Each publisher has its own parser (euuuggh!)

Privacy and competition

“I don’t want to share any of my data before publication”

“It’s nobody’s business but mine” (basic human right to privacy)

Closer integration with Word (and LaTeX tools)

Might go bust? Why put all my precious data in the hands of a commercial company?

Why should you bother with citeulike?

Organisation and time saving

Searching

Browsing

Managing references while writing papers

Quick and efficient sharing of data before publication

e.g. tag “defrost” when writing this paper

http://www.citeulike.org/tag/defrost

Serendipity

Casey Bergman story

Casey Bergman story: “I was importing papers on Solexa and 454 genome assembly and came across the following paper: http://www.citeulike.org/user/cisevol/article/1465689 which was a real find in terms of convincing me that light shotgun sequence data is worth analysing. I nicked this from a PhD student's library in Brazil http://www.citeulike.org/profile/GustavoLacerda” Wouldn’t have found this any other way (e.g. keyword searching or following citation trails)

Many different solutions e.g. Papyro: Steve Pettifer http://utopia.cs.manchester.ac.uk/

And the rest… www.mendeley.com www.zotero.org www.connotea.org www.mekentosj.com www.hubmed.org www.2collab.com www.refworks.com Re-couple metadata that has been de-coupled from data: “iTunes for PDF files”

There is still lots more metadata

How many times has http://pubmed.gov/19060304 been cited?

Who has cited http://pubmed.gov/19060304?

Give me all the references that cite this one

Give me all the references cited by http://pubmed.gov/19060304

Who the hell is Doug Kell? Steve Pettifer? Duncan Hull?

What is Doug Kell’s h-index?

Remember: Machines ask these questions, not just humans

Notify me whenever Steve Pettifer publishes a paper

Notify me whenever someone cites http://pubmed.gov/19060304

Impact factor?
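The h-index question shows how much depends on identity. The standard definition (the largest h such that an author has h papers with at least h citations each) is easy to compute, as in the Python sketch below, but only once a machine can reliably say which papers belong to which author. The citation counts here are made-up numbers for illustration, not data for any real person.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical citation counts for one author's six papers.
example_counts = [25, 8, 5, 3, 3, 0]
print(h_index(example_counts))  # prints 3
```

If one heavily cited paper is wrongly attributed to, or missing from, an author's list, the result changes, so the metric is only as good as the author and publication identity underneath it.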

Digital Identity would solve some of these problems Give yourself a URI, you deserve it! Tim Berners-Lee http://www.w3.org/People/Berners-Lee/card#i see http://dig.csail.mit.edu/breadcrumbs/node/71

URIs for Douglas Kell

http://blogs.bbsrc.ac.uk

http://www.chemistry.manchester.ac.uk/aboutus/staff/showprofile.php?id=194

http://dbkgroup.org/kell.htm

http://douglaskell.myopenid.com

http://dx.doi.org/10.1371/journal.pcbi.1000204

“Contributor identifier” from www.myopenid.com www.openid.net (also note researcher-id from Thomson)

http://pubmed.gov/19112480 Phil Bourne


John Ziman, Physicist Science is public knowledge http://tinyurl.com/publicknowledge


Conclusions: What hasn’t changed

The Web has revolutionised libraries in just 20 short years but…

Still takes time for humans to read and digest: We can get more papers but there are still only 24 hours in a day, 7 days in a week, 52 weeks in a year

We need help from machines (and the people that build them)

Need to make metadata more machine-friendly

Conclusions: Publication metadata matters

Managed to convince you metadata matters (and why)

People make important decisions based on metadata

Funding

Hiring (and Firing)

Publishing

Who to collaborate with

Yet our current libraries can’t even accurately identify crucial metadata

Individual people - digital identity needed

Publications - disambiguation

Everything else…

Conclusions: Scientists are too blasé about metadata!

Leave it to stamp collectors, dusty-librarians, informaticians, database administrators (yawn!), “biocurators” http://biocurator.org/

Boring, unscientific, not cutting-edge innovation?

Everyone wants to use good metadata but few people want to spend time curating and cleaning metadata

Like a clean toilet

We ignore metadata at our peril “not my job”

We leave it to publishers, who then mess it up, and charge us for their services, we should be getting better value for money

We waste precious time organising metadata

We waste precious time searching for metadata

Data is more valuable with better metadata

Have a look at citeulike (and other tools)

Conclusions: Do us a favour!

Acknowledgements

Refine project: Sophia Ananiadou, Jun'ichi Tsujii, Pedro Mendes, Steve Pettifer, Yoshimasa Tsuruoka, Douglas Kell www.nactem.ac.uk/refine

BBSRC grant code BB/E004431/1

CSW Informatics Ltd.: John Chelsom, Mavis Cournane, Niki Dinsey www.csw.co.uk BBC Monitoring, Ford Motor Company

School of Chemistry, MIB (now) www.mib.ac.uk

Faculty of Life Sciences (a long long time ago) and Casey Bergman, Jean-Marc Schwartz (now)

School of Computer Science (not so long ago) Information Management Group http://img.cs.man.ac.uk/

Any Questions?
