eScience: A Transformed Scientific Method


Published on March 20, 2009

Uploaded by: dullhunk

Source: slideshare.net

Description

Presentation by Jim Gray on eScience

eScience -- A Transformed Scientific Method

Jim Gray, eScience Group, Microsoft Research
http://research.microsoft.com/~Gray

In collaboration with Alex Szalay, Dept. of Physics & Astronomy, Johns Hopkins University
http://www.sdss.jhu.edu/~szalay/

Talk Goals

- Explain eScience (and what I am doing)
- Recommend that the CSTB foster tools for:
  - data capture (lab info management systems)
  - data curation (schemas, ontologies, provenance)
  - data analysis (workflow, algorithms, databases, data visualization)
  - data + doc publication (active docs, data-doc integration)
  - peer review (editorial services)
  - access (doc + data archives and overlay journals)
  - scholarly communication (wikis for each article and dataset)

eScience: What Is It?

- A synthesis of information technology and science.
- Science methods are evolving (tools).
- Science is being codified/objectified: how do we represent scientific information and knowledge in computers?
- Science faces a data deluge: how do we manage and analyze the information?
- Scientific communication is changing:
  - publishing data & literature (curation, access, preservation)

Science Paradigms

- A thousand years ago, science was empirical: describing natural phenomena.
- In the last few hundred years, a theoretical branch: using models and generalizations.
- In the last few decades, a computational branch: simulating complex phenomena.
- Today, data exploration (eScience) unifies theory, experiment, and simulation:
  - data are captured by instruments or generated by simulators,
  - processed by software,
  - and stored as information/knowledge in computers;
  - the scientist analyzes the databases/files using data management and statistics.

X-Info

The evolution of X-Info and Comp-X for each discipline X. The generic problems (facts and answers flowing among experiments & instruments, simulations, literature, and other archives):

- How to codify and represent our knowledge
- Data ingest
- Managing a petabyte
- Common schema
- How to organize it
- How to reorganize it
- How to share it with others
- Query and visualization tools
- Building and executing models
- Integrating data and literature
- Documenting experiments
- Curation and long-term preservation

Experiment Budgets: ¼ to ½ Is Software

Software for:
- Instrument scheduling
- Instrument control
- Data gathering
- Data reduction
- Database
- Analysis
- Modeling
- Visualization

Millions of lines of code, repeated for experiment after experiment, with not much sharing or learning.

CS can change this: build generic tools:
- Workflow schedulers
- Databases and libraries
- Analysis packages
- Visualizers
- ...

Action item: Foster tools and foster tool support.

Project Pyramids

In most disciplines there are a few "giga" projects, several "mega" consortia, and then many small labs. Often some instrument creates the need for a giga- or mega-project: a polar station, accelerator, telescope, remote sensor, genome sequencer, or supercomputer. Tier 1, 2, and 3 facilities (international, multi-campus, single lab) use the instrument and its data.

Pyramid Funding

- Giga projects need giga funding: Major Research Equipment grants.
- Need projects at all scales. Computing example: supercomputers + departmental clusters + lab clusters.
- Technical + social issues.
- Fully fund giga projects; fund half of each smaller project, so that they get matching funds from other sources.

"Petascale Computational Systems: Balanced Cyber-Infrastructure in a Data-Centric World," IEEE Computer, Vol. 39, No. 1, pp. 110-112, January 2006.

Action item: Invest in tools at all levels.

Need Lab Info Management Systems (LIMS)

- Pipeline instrument + simulator data to archive and publish to the web.
- NASA data levels: Level 0 (raw), Level 1 (calibrated), Level 2 (derived).
- Needs a workflow tool to manage the pipeline.
- Build prototypes. Examples: SDSS, LifeUnderYourFeet, the MBARI Shore Side Data System.
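The Level 0 → 1 → 2 flow above can be sketched as a tiny workflow. This is a minimal illustration only; the stage functions, the linear calibration, and the summary product are all made up for the sketch, not taken from any real LIMS:

```python
# A minimal sketch (all names hypothetical) of a NASA-style data-level
# pipeline: raw instrument readings (Level 0) are calibrated (Level 1)
# and then reduced to a derived product (Level 2).

def calibrate(raw, gain=2.0, offset=0.5):
    """Level 0 -> Level 1: apply a (made-up) linear calibration."""
    return [gain * x + offset for x in raw]

def derive(calibrated):
    """Level 1 -> Level 2: reduce to a derived summary product."""
    return {"n": len(calibrated),
            "mean": sum(calibrated) / len(calibrated)}

def pipeline(raw):
    """Run the whole Level 0 -> 1 -> 2 workflow, keeping every stage
    so each level can be archived and published separately."""
    level1 = calibrate(raw)
    level2 = derive(level1)
    return {"level0": raw, "level1": level1, "level2": level2}

archive = pipeline([1.0, 2.0, 3.0])
```

A real workflow tool adds scheduling, retries, and provenance on top of exactly this shape: named stages with archived intermediate products.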

Action item: Foster generic LIMS.

Science Needs Info Management

Simulators and experiments both produce lots of data. Standard practice:

- each simulation run produces a file
- each instrument-day produces a file
- each processing step produces a file
- files have descriptive names
- files have similar formats (described elsewhere)

Projects have millions of files (or soon will), and there is no easy way to manage or analyze the data.
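To make the "millions of files with descriptive names" problem concrete, here is a sketch of one common remedy: parse the metadata out of the file names into a queryable table. The naming convention, file names, and table schema below are hypothetical:

```python
# Sketch: per-run files whose only metadata is a descriptive name.
# Loading that metadata into a database (an in-memory SQLite table
# here) makes the collection queryable instead of grep-able.
import sqlite3

filenames = [
    "sim_run042_temp300K.dat",   # hypothetical naming convention:
    "sim_run043_temp310K.dat",   # <kind>_run<id>_temp<T>K.dat
    "sim_run044_temp300K.dat",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (run INTEGER, temp_k INTEGER, name TEXT)")
for name in filenames:
    parts = name.split("_")                  # parse metadata from the name
    run = int(parts[1].removeprefix("run"))
    temp = int(parts[2].removesuffix("K.dat").removeprefix("temp"))
    conn.execute("INSERT INTO runs VALUES (?, ?, ?)", (run, temp, name))

# "Which runs were at 300 K?" becomes a one-line query, not a shell loop.
rows = conn.execute(
    "SELECT name FROM runs WHERE temp_k = 300 ORDER BY run").fetchall()
```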

Data Analysis

- Looking for needles in haystacks: the Higgs particle.
- Haystacks: dark matter, dark energy.
- Needles are easier than haystacks.
- Global statistics have poor scaling:
  - correlation functions are O(N²), likelihood techniques O(N³);
  - we can only afford O(N log N).
- Must accept approximate answers: new algorithms.
- Requires a combination of statistics and computer science.
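The scaling point above can be illustrated with pair counting, the kernel of a two-point correlation function. The grid method below is my own illustrative sketch (not an algorithm from the talk): bucketing points into cells of side r means a qualifying pair can only span adjacent cells, so the count drops from O(N²) toward O(N) for roughly uniform data:

```python
# Naive O(N^2) pair counting vs. a near-O(N) grid-bucketed count.
from collections import defaultdict
from itertools import combinations
import random

def naive_pairs_within(points, r):
    """O(N^2): count unordered pairs of 2-D points closer than r."""
    return sum(1 for p, q in combinations(points, 2)
               if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 < r * r)

def grid_pairs_within(points, r):
    """Near O(N): a pair closer than r can only span adjacent cells of
    side r, so each point is compared against 9 cells, not all N points."""
    cells = defaultdict(list)
    for p in points:
        cells[(int(p[0] // r), int(p[1] // r))].append(p)
    total = 0
    for p in points:
        cx, cy = int(p[0] // r), int(p[1] // r)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for q in cells.get((cx + dx, cy + dy), ()):
                    if q is not p and \
                       (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 < r * r:
                        total += 1
    return total // 2  # each unordered pair was seen from both ends

random.seed(0)
sample = [(random.random() * 3, random.random() * 3) for _ in range(200)]
```

Tree codes and approximate estimators used in practice follow the same idea: spend work only where neighbors can actually be.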

Analysis and Databases

Much statistical analysis deals with:

- creating uniform samples (data filtering)
- assembling relevant subsets
- estimating completeness
- censoring bad data
- counting and building histograms
- generating Monte Carlo subsets
- likelihood calculations
- hypothesis testing

Traditionally these are performed on files. They are better done in a structured store, with indexing, aggregation, parallelism, and query, analysis, and visualization tools.
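Two of the tasks above, censoring bad data and building a histogram, become one declarative statement in a structured store. The table, column names, and quality flag below are hypothetical, and SQLite stands in for a real scientific database:

```python
# Sketch: filtering + histogramming as a single query the engine can
# index and parallelize, instead of a hand-written pass over files.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (mag REAL, flag INTEGER)")
conn.executemany("INSERT INTO obs VALUES (?, ?)",
                 [(17.2, 0), (17.9, 0), (18.4, 1), (18.6, 0), (19.1, 0)])

# Censor bad data (flag != 0) and histogram magnitudes in 1-mag bins.
hist = conn.execute("""
    SELECT CAST(mag AS INTEGER) AS bin, COUNT(*) AS n
    FROM obs WHERE flag = 0
    GROUP BY bin ORDER BY bin
""").fetchall()
```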

Data Delivery: Hitting a Wall

FTP and GREP are not adequate:

- You can GREP 1 MB in a second, and FTP 1 MB in a second.
- You can GREP 1 GB in a minute, and FTP 1 GB per minute (~$1/GB).
- You can GREP 1 TB in 2 days, and FTP it in 2 days and $1K.
- You can GREP 1 PB in 3 years, and FTP it in 3 years and $1M.

Oh, and 1 PB is about 4,000 disks. At some point you need indices to limit the search, plus parallel data search and analysis. This is where databases can help.
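The wall is just linear scaling. Taking a strict 1 MB/s for illustration (the slide's own figures imply somewhat faster effective rates, but the growth law is the point):

```python
# Scan time at a fixed sequential rate grows linearly with data size:
# seconds for a megabyte, decades for a petabyte.
RATE_BYTES_PER_SEC = 1e6   # illustrative 1 MB/s
SIZES = {"1 MB": 1e6, "1 GB": 1e9, "1 TB": 1e12, "1 PB": 1e15}

scan_days = {name: size / RATE_BYTES_PER_SEC / 86400.0
             for name, size in SIZES.items()}
scan_years = {name: days / 365.25 for name, days in scan_days.items()}
```

Indices and parallelism attack the two factors separately: indices shrink the bytes that must be scanned, parallelism raises the effective rate.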

Accessing Data

If there is too much data to move around, take the analysis to the data:

- Do all data manipulations at the database.
- Build custom procedures and functions in the database (databases and procedures are being unified). Examples: temporal and spatial indexing, pixel processing.
- Automatic parallelism guaranteed.
- Easy to build in custom functionality.
- Easy to reorganize the data: multiple views, each optimal for certain analyses; building hierarchical summaries is trivial.
- Scalable to petabyte datasets: active databases!
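"Custom functions in the database" can be sketched with SQLite's user-defined functions: the angular-distance computation runs inside the engine, next to the rows, instead of shipping every row out to client code. The table, object positions, and function name are all illustrative:

```python
# Sketch of "take the analysis to the data": register a custom function
# inside the database engine so the computation runs next to the rows.
import math
import sqlite3

def ang_dist_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (spherical law of cosines)."""
    r = math.radians
    c = (math.sin(r(dec1)) * math.sin(r(dec2)) +
         math.cos(r(dec1)) * math.cos(r(dec2)) * math.cos(r(ra1 - ra2)))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

conn = sqlite3.connect(":memory:")
conn.create_function("ang_dist", 4, ang_dist_deg)
conn.execute("CREATE TABLE objects (id INTEGER, ra REAL, dec REAL)")
conn.executemany("INSERT INTO objects VALUES (?, ?, ?)",
                 [(1, 180.0, 0.0), (2, 180.5, 0.2), (3, 10.0, 45.0)])

# Find objects within 1 degree of a target position, inside the engine.
near = conn.execute(
    "SELECT id FROM objects WHERE ang_dist(ra, dec, 180.0, 0.0) < 1.0"
).fetchall()
```

Production systems go further (the spatial extensions in SQL Server mentioned later in this deck pair such functions with spatial indices so they are not evaluated on every row).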

Action item: Foster data management, data analysis, and data visualization algorithms and tools.

Let 100 Flowers Bloom

Comp-X has some nice tools: Beowulf, Condor, BOINC, Matlab. These tools grew from the community, and it is HARD to see a common pattern. Linux vs. FreeBSD: why was Linux more successful? Community, personality, timing...? Lesson: let 100 flowers bloom.

Talk Goals (recap)

- Explain eScience (and what I am doing)
- Recommend that the CSTB foster tools for data capture, data curation, data analysis, data + doc publication, peer review, access, and scholarly communication.

All Scientific Data Online

- Many disciplines overlap and use data from other sciences.
- The Internet can unify all literature and data: go from literature to computation to data and back to literature.
- Information at your fingertips, for everyone, everywhere.
- Increase scientific information velocity.
- Huge increase in science productivity.

(Layers: literature; derived and recombined data; raw data.)

Unlocking Peer-Reviewed Literature

- Agencies and foundations are mandating that research be public domain:
  - NIH ($30B/year, 40k PIs, ...) (see http://www.taxpayeraccess.org/)
  - Wellcome Trust
  - Japan, China, Italy, South Africa, ...
  - Public Library of Science
- Other agencies will follow NIH.

How Does the New Library Work?

- Who pays for storage and access (an unfunded mandate)? It's cheap: about a milli-dollar per access.
- But curation is not cheap: author/title/subject/citation/...
- Dublin Core is great, but the NLM has a 6,000-line XSD for documents: http://dtd.nlm.nih.gov/publishing
- Need to capture document structure from the author: sections, figures, equations, citations, ...
- Automate curation: NCBI-PubMedCentral is doing this, preparing for 1M articles/year. Automate it!

PubMed Central International

- "Information at your fingertips": deployed in the US, China, England, Italy, South Africa, and Japan.
- UK PMCI: http://ukpmc.ac.uk/
- Each site can accept documents; the archives are replicated; sites federate through web services.
- Working to integrate Word/Excel/... with PubMedCentral, e.g. WordML, XSD.
- To be clear: NCBI is doing 99.99% of the work.

Overlay Journals

- Articles and data sets live in public archives.
- The journal title page is also in a public archive.
- All covered by a Creative Commons License (http://creativecommons.org/):
  - permits: copy/distribute
  - requires: attribution


Overlay Journals (continued)

A journal management system and a journal collaboration system sit on top of the data archives, adding title pages and comments to the archived articles and data sets.

Action item: Do for other sciences what the NLM has done for bio (GenBank, PubMedCentral, ...).

Better Authoring Tools

Extend authoring tools to:
- capture document metadata (the NLM tag set)
- represent documents in a standard format: WordML (an ECMA standard)
- capture references
- make active documents (words and data)

Easier for authors; easier for archives.

Conference Management Tool

Currently a conference peer-review system (~300 conferences):

- Form committee
- Accept manuscripts
- Declare interest / recuse
- Review
- Decide
- Form program
- Notify
- Revise

Publishing Peer Review

Add publishing steps to the pipeline: form committee, accept manuscripts, declare interest/recuse, review, decide, form program, notify, revise, and finally publish.

Improve the author-reader experience:
- Manage versions
- Capture data
- Interactive documents
- Capture workshop presentations and proceedings
- Capture the classroom (ConferenceXP)
- Moderated discussions of published articles
- Connect to archives

Why Not a Wiki?

Peer review is different:
- it is very structured,
- it is moderated,
- and there is a degree of confidentiality.

A wiki is egalitarian:
- it's a conversation,
- and it's completely transparent.

Don't get me wrong: wikis are great, SharePoints are great. But peer review is different. (And, incidentally, review of proposals, projects, etc. is more like peer review.)

Let's have moderated wikis on published literature; PLoS ONE is doing this.

Action item: Foster new document authoring and publication models and tools.

So... What About Publishing Data?

The answer is 42. But...
- What are the units?
- How precise? How accurate? (42.5 ± .01)
- Show your work: data provenance.
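The three questions above (units, uncertainty, provenance) amount to a data model. A toy sketch of what a published value would carry; the class and field names are illustrative, not a real standard:

```python
# Sketch: a published number needs units, an uncertainty, and its
# provenance, not just "42". All names here are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Measurement:
    value: float
    uncertainty: float          # "how accurate": 42.5 +/- 0.01
    units: str                  # "what are the units?"
    provenance: tuple = ()      # "show your work": processing history

    def derive(self, step, new_value, new_uncertainty):
        """Record a processing step instead of discarding history."""
        return Measurement(new_value, new_uncertainty, self.units,
                           self.provenance + (step,))

raw = Measurement(42.5, 0.01, "kg", ("instrument X dump",))
calibrated = raw.derive("gain correction", 42.6, 0.011)
```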

Thought Experiment

You have collected some data and want to publish science based on it.
- How do you publish the data so that others can read it and reproduce your results in 100 years?
- How do you document the collection process?
- How do you document the data processing (scrubbing and reducing the data)?
- Where do you put it?

Objectifying Knowledge

This requires agreement about:
- Units: cgs
- Measurements: who/what/when/where/how
- Concepts: What's a planet, star, galaxy, ...? What's a gene, protein, pathway, ...?

Need to objectify science: what are the objects? what are the attributes? what are the methods (in the OO sense)? This is mostly physics/bio/eco/econ/..., but CS can do generic things.
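The slide asks for science's objects, attributes, and methods "in the OO sense". A toy illustration with an astronomy concept; the class hierarchy and attributes are my own sketch, not a real ontology:

```python
# Objectifying a concept: agreed attributes (with agreed units) and
# methods that operate on the concept. Illustrative only.

class CelestialObject:
    """A concept with agreed attributes: identifier and position."""
    def __init__(self, name, ra_deg, dec_deg):
        self.name = name          # controlled-vocabulary identifier
        self.ra_deg = ra_deg      # position attributes, agreed units (deg)
        self.dec_deg = dec_deg

class Star(CelestialObject):
    """A specialization: what makes a star a star is the schema debate."""
    def __init__(self, name, ra_deg, dec_deg, magnitude):
        super().__init__(name, ra_deg, dec_deg)
        self.magnitude = magnitude

    def brighter_than(self, other):
        """A method on the concept: smaller magnitude means brighter."""
        return self.magnitude < other.magnitude

vega = Star("Vega", 279.23, 38.78, 0.03)
sirius = Star("Sirius", 101.29, -16.72, -1.46)
```

The hard part is not the code but the agreement: which attributes exist, in which units, and which vocabulary names the objects.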

Warning! Painful discussions ahead: the "O" word (ontology), the "S" word (schema), the "CV" words (controlled vocabulary). Domain experts do not agree.

The Best Example: Entrez-GenBank

http://www.ncbi.nlm.nih.gov/

- Sequence data is deposited with GenBank; the literature references the GenBank ID; BLAST searches GenBank.
- Entrez integrates and searches PubMedCentral, PubChem, GenBank, proteins, SNP, structure, taxonomy, and many more.

(Diagram: publishers and genome centers feed PubMed abstracts, complete genomes, nucleotide and protein sequences, taxon phylogeny, and MMDB 3-D structure into PubMed and Entrez Genomes.)

Publishing Data

Exponential growth:
- Projects last at least 3-5 years.
- Data is sent upwards only at the end of the project.
- Data will never be centralized.

More responsibility falls on projects: they are becoming publishers and curators. Data will reside with projects, and analyses must be close to the data.

Roles:        Authors         Publishers        Curators         Consumers
Traditional:  Scientists      Journals          Libraries        Scientists
Emerging:     Collaborations  Project www site  Bigger archives  Scientists

Data Pyramid

A very extended distribution of data sets: data on all scales!
- Most datasets are small and manually maintained (Excel spreadsheets).
- Total volume is dominated by multi-TB archives.
- But small datasets have real value.
- Most data is born digital: collected via electronic sensors or generated by simulators.

Data Sharing/Publishing

What is the business model (reward / career benefit)? Three tiers (a power law!):

(a) big projects
(b) value-added, refereed products
(c) ad hoc data, on-line sensors, images, outreach info

We have largely done (a). We need a "Journal for Data" to solve (b), and a "VO-Flickr" (a simple interface) for (c). Mashups are emerging in science. We also need an integrated environment for "virtual excursions" for education (C. Wong).

Action item: Foster digital data libraries (not metadata, real data) and their integration with the literature.

Talk Goals (recap)

- Explain eScience (and what I am doing)
- Recommend that the CSTB foster tools for data capture, data curation, data analysis, data + doc publication, peer review, access, and scholarly communication.

Backup Slides

Astronomy

- Help build the World-Wide Telescope: all astronomy data and literature online and cross-indexed, with tools to analyze the data.
- Built SkyServer.SDSS.org.
- Built an analysis system: MyDB, CasJobs (batch jobs).
- OpenSkyQuery: a federation of ~20 observatories.

Results:
- It works and is used every day.
- Spatial extensions in SQL 2005.
- A good example of a Data Grid.
- Good examples of web services.

World Wide Telescope: Virtual Observatory

http://www.us-vo.org/  http://www.ivoa.net/

Premise: most data is (or could be) online. So the Internet is the world's best telescope:
- It has data on every part of the sky,
- in every measured spectral band: optical, X-ray, radio, ...
- as deep as the best instruments (of 2 years ago).
- It is up when you are up, and the "seeing" is always great (no working at night, no clouds, no moons, no ...).
- It's a smart telescope: it links objects and data to the literature on them.

Why Astronomy Data?

- It has no commercial value and no privacy concerns, so results can be freely shared with others. Great for experimenting with algorithms.
- It is real and well documented: high-dimensional data (with confidence intervals), spatial data, temporal data.
- Many different instruments from many different places and many different times: federation is a goal.
- There is a lot of it (petabytes).

(Image montage: IRAS 100 µm, ROSAT ~keV, DSS optical, 2MASS 2 µm, IRAS 25 µm, NVSS 20 cm, WENSS 92 cm, GB 6 cm.)

Time and Spectral Dimensions

(Figure: the multiwavelength Crab Nebula. X-ray, optical, infrared, and radio views of the nearby Crab Nebula, which is now in a state of chaotic expansion after a supernova explosion first sighted in 1054 A.D. by Chinese astronomers. Slide courtesy of Robert Brunner @ Caltech.)

SkyServer.SDSS.org A modern archive Access to Sloan Digital Sky Survey Spectroscopic and Optical surveys Raw Pixel data lives in file servers Catalog data (derived objects) lives in Database Online query to any and all Also used for education 150 hours of online Astronomy Implicitly teaches data analysis Interesting things Spatial data search Client query interface via Java Applet Query from Emacs, Python, …. Cloned by other surveys (a template design) Web services are core of it.

A modern archive

Access to Sloan Digital Sky Survey Spectroscopic and Optical surveys

Raw Pixel data lives in file servers

Catalog data (derived objects) lives in Database

Online query to any and all

Also used for education

150 hours of online Astronomy

Implicitly teaches data analysis

Interesting things

Spatial data search

Client query interface via Java Applet

Query from Emacs, Python, ….

Cloned by other surveys (a template design)

Web services are core of it.
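The spatial data search the archive exposes can be pictured with a toy cone search: return every catalog object within some angular radius of a sky position. This is a sketch over a made-up three-object catalog, not the SkyServer implementation; the rows and function names are invented for illustration.

```python
from math import radians, degrees, sin, cos, acos

# Hypothetical mini-catalog: (object_id, ra_deg, dec_deg)
CATALOG = [
    (1, 180.00, 0.00),
    (2, 180.10, 0.05),
    (3, 195.00, 12.00),
]

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (spherical law of cosines)."""
    r1, d1, r2, d2 = map(radians, (ra1, dec1, ra2, dec2))
    c = sin(d1) * sin(d2) + cos(d1) * cos(d2) * cos(r1 - r2)
    return degrees(acos(min(1.0, max(-1.0, c))))  # clamp for rounding

def cone_search(ra, dec, radius_deg):
    """Return catalog objects within radius_deg of (ra, dec)."""
    return [obj for obj in CATALOG
            if angular_sep_deg(ra, dec, obj[1], obj[2]) <= radius_deg]
```

A real archive answers the same question with a spatial index inside the database rather than a linear scan, but the contract a client sees is exactly this: position plus radius in, matching rows out.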

SkyServer SkyServer.SDSS.org Like the TerraServer, but looking the other way: a picture of ¼ of the universe Sloan Digital Sky Survey Data: Pixels + Data Mining About 400 attributes per “object” Spectrograms for 1% of objects

Like the TerraServer, but looking the other way: a picture of ¼ of the universe

Sloan Digital Sky Survey Data: Pixels + Data Mining

About 400 attributes per “object”

Spectrograms for 1% of objects

Demo of SkyServer Shows standard web server Pixel/image data Point and click Explore one object Explore sets of objects (data mining)

Shows standard web server

Pixel/image data

Point and click

Explore one object

Explore sets of objects (data mining)

SkyQuery ( http://skyquery.net/ ) Distributed Query tool using a set of web services Many astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England) Has grown from 4 to 15 archives, now becoming an international standard WebService Poster Child Allows queries like:

SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) < 3.5
AND AREA(181.3, -0.76, 6.5)
AND o.type = 3 AND (o.I - t.m_j) > 2

Distributed Query tool using a set of web services

Many astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England)

Has grown from 4 to 15 archives, now becoming an international standard

WebService Poster Child

Allows queries like:

SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) < 3.5
AND AREA(181.3, -0.76, 6.5)
AND o.type = 3 AND (o.I - t.m_j) > 2
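The semantics of the XMATCH query can be sketched in plain Python: a brute-force cross-match that keeps SDSS/2MASS pairs closer than a 3.5-arcsecond tolerance and then applies the type and colour cuts. The toy rows and function names below are invented for illustration; a real SkyNode pushes this join down into the database.

```python
from math import radians, degrees, sin, cos, acos

def sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds."""
    r1, d1, r2, d2 = map(radians, (ra1, dec1, ra2, dec2))
    c = sin(d1) * sin(d2) + cos(d1) * cos(d2) * cos(r1 - r2)
    return degrees(acos(min(1.0, max(-1.0, c)))) * 3600.0

# Toy stand-ins for SDSS:PhotoPrimary and TWOMASS:PhotoPrimary rows:
# (objId, ra, dec, type, i_mag) and (objId, ra, dec, j_mag)
sdss = [(10, 181.300, -0.760, 3, 19.5),
        (11, 181.301, -0.759, 6, 18.0)]
twomass = [(20, 181.3001, -0.7601, 16.9),
           (21, 182.0000, -0.5000, 15.0)]

def xmatch(tol_arcsec=3.5):
    """Pairs within tolerance, SDSS type == 3, and red i - J colour."""
    out = []
    for o in sdss:
        for t in twomass:
            if (sep_arcsec(o[1], o[2], t[1], t[2]) < tol_arcsec
                    and o[3] == 3 and (o[4] - t[3]) > 2):
                out.append((o[0], t[0]))
    return out
```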

SkyQuery Structure Each SkyNode publishes Schema Web Service Database Web Service Portal is Plans Query (2 phase) Integrates answers Is itself a web service [Diagram: SkyQuery Portal federating 2MASS, INT, SDSS, and FIRST SkyNodes, with an Image Cutout service]

Each SkyNode publishes

Schema Web Service

Database Web Service

Portal is

Plans Query (2 phase)

Integrates answers

Is itself a web service

Schema (aka metadata) Everyone starts with the same schema <stuff/>. Then they start arguing about semantics. Virtual Observatory: http://www.ivoa.net/ Metadata based on Dublin Core: http://www.ivoa.net/Documents/latest/RM.html Universal Content Descriptors (UCD): http://vizier.u-strasbg.fr/doc/UCD.htx Captures quantitative concepts and their units Reduced from ~100,000 tables in literature to ~1,000 terms VOtable – a schema for answers to questions http://www.us-vo.org/VOTable/ Common Queries: Cone Search and Simple Image Access Protocol, SQL Registry: http://www.ivoa.net/Documents/latest/RMExp.html still a work in progress.

Everyone starts with the same schema <stuff/>. Then they start arguing about semantics.

Virtual Observatory: http://www.ivoa.net/

Metadata based on Dublin Core: http://www.ivoa.net/Documents/latest/RM.html

Universal Content Descriptors (UCD): http://vizier.u-strasbg.fr/doc/UCD.htx Captures quantitative concepts and their units Reduced from ~100,000 tables in literature to ~1,000 terms

VOtable – a schema for answers to questions http://www.us-vo.org/VOTable/

Common Queries: Cone Search and Simple Image Access Protocol, SQL

Registry: http://www.ivoa.net/Documents/latest/RMExp.html still a work in progress.
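A flavor of what a VOTable answer looks like, and how little code a client needs to consume one. The sample document is stripped down (real VOTables carry an XML namespace and much richer metadata), and the parser below is a sketch, not a conforming VOTable reader.

```python
import xml.etree.ElementTree as ET

# A minimal VOTable-style answer: FIELD elements describe the columns
# (with UCDs giving the semantics), TABLEDATA carries the rows.
VOTABLE = """<VOTABLE>
 <RESOURCE><TABLE>
  <FIELD name="ra"  datatype="double" ucd="pos.eq.ra"/>
  <FIELD name="dec" datatype="double" ucd="pos.eq.dec"/>
  <DATA><TABLEDATA>
   <TR><TD>181.3</TD><TD>-0.76</TD></TR>
   <TR><TD>195.0</TD><TD>12.0</TD></TR>
  </TABLEDATA></DATA>
 </TABLE></RESOURCE>
</VOTABLE>"""

def parse_votable(text):
    """Return (column_names, rows) from a namespace-free VOTable string."""
    root = ET.fromstring(text)
    names = [f.get("name") for f in root.iter("FIELD")]
    rows = [[float(td.text) for td in tr] for tr in root.iter("TR")]
    return names, rows
```

The point of the format is exactly this decoupling: any archive that answers in VOTable can be consumed by any client that speaks it, without per-archive glue code.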

SkyServer/SkyQuery Evolution MyDB and Batch Jobs Problem: need multi-step data analysis (not just single query). Solution: Allow personal databases on portal Problem: some queries are monsters Solution: “Batch schedule” on portal. Deposits answer in personal database.

Problem: need multi-step data analysis (not just single query).

Solution: Allow personal databases on portal

Problem: some queries are monsters

Solution: “Batch schedule” on portal. Deposits answer in personal database.
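The MyDB-plus-batch pattern can be sketched in a few lines: queries go into a queue, a scheduler runs them against the shared catalog, and each answer is deposited in the submitting user's personal store rather than streamed back. The schema, names, and in-memory SQLite stand-in below are illustrative, not the SkyServer implementation.

```python
import sqlite3
from collections import deque

# Shared catalog (stand-in for the archive's photo table).
catalog = sqlite3.connect(":memory:")
catalog.execute("CREATE TABLE photo (objId INT, r REAL)")
catalog.executemany("INSERT INTO photo VALUES (?, ?)",
                    [(1, 17.2), (2, 21.9), (3, 16.1)])

batch_queue = deque()   # pending (user, sql) jobs
mydb = {}               # user -> list of result sets (the "personal database")

def submit(user, sql):
    """Queue a long-running query instead of executing it inline."""
    batch_queue.append((user, sql))

def run_batch():
    """Scheduler: run each job, deposit the answer in the user's MyDB."""
    while batch_queue:
        user, sql = batch_queue.popleft()
        mydb.setdefault(user, []).append(catalog.execute(sql).fetchall())
```

The payoff is that a follow-up query can join against the deposited result set instead of re-running the monster query, which is what makes multi-step analysis practical.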

Ecosystem Sensor Net LifeUnderYourFeet.Org Small sensor net monitoring soil Sensors feed to a database Helping build system to collect & organize data. Working on data analysis tools Prototype for other LIMS Laboratory Information Management Systems

Small sensor net monitoring soil

Sensors feed to a database

Helping build system to collect & organize data.

Working on data analysis tools

Prototype for other LIMS (Laboratory Information Management Systems)
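The sensors-feed-a-database idea is simple enough to sketch end to end: each reading is timestamped on ingest, and analysis is then just SQL over the table. The schema and function names below are invented for illustration, not the actual LifeUnderYourFeet design.

```python
import sqlite3
from datetime import datetime, timezone

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE readings (
    sensor_id TEXT, ts TEXT, soil_temp_c REAL, moisture REAL)""")

def ingest(sensor_id, soil_temp_c, moisture):
    """Store one sensor reading with a UTC timestamp."""
    db.execute("INSERT INTO readings VALUES (?, ?, ?, ?)",
               (sensor_id, datetime.now(timezone.utc).isoformat(),
                soil_temp_c, moisture))

def mean_soil_temp(sensor_id):
    """A first analysis tool: mean soil temperature for one sensor."""
    (m,) = db.execute("SELECT AVG(soil_temp_c) FROM readings "
                      "WHERE sensor_id = ?", (sensor_id,)).fetchone()
    return m
```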

RNA Structural Genomics Goal: Predict secondary and tertiary structure from sequence. Deduce tree of life. Technique: Analyze sequence variations sharing a common structure across tree of life Representing structurally aligned sequences is a key challenge Creating a database-driven alignment workbench accessing public and private sequence data

Goal: Predict secondary and tertiary structure from sequence. Deduce tree of life.

Technique: Analyze sequence variations sharing a common structure across tree of life

Representing structurally aligned sequences is a key challenge

Creating a database-driven alignment workbench accessing public and private sequence data
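One way to make "structurally aligned sequences" concrete: store gapped sequences plus a list of column pairs that base-pair in the shared secondary structure, then check how well each pairing is supported across the alignment. The sequences and pairings below are made up for illustration; a real alignment workbench would back this with a database.

```python
# Watson-Crick pairs plus the G-U wobble pair common in RNA.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"),
         ("G", "U"), ("U", "G")}

alignment = {              # name -> gapped, structure-aligned sequence
    "seq1": "GGCA-UGCC",
    "seq2": "GACAUUGUC",
}
paired_cols = [(0, 8), (1, 7), (2, 6)]  # (i, j): columns paired in structure

def pair_support(i, j):
    """Fraction of sequences whose residues at columns i, j can pair."""
    ok = sum((s[i], s[j]) in PAIRS for s in alignment.values())
    return ok / len(alignment)
```

High support at a column pair despite sequence variation (here columns 1 and 7 hold G-C in one sequence and A-U in the other) is exactly the covariation signal the technique exploits.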

VHA Health Informatics VHA: largest standardized electronic medical records system in US. Design, populate and tune a ~20 TB Data Warehouse and Analytics environment Evaluate population health and treatment outcomes, Support epidemiological studies 7 million enrollees 5 million patients Example Milestones: 1 Billionth Vital Sign loaded in April ‘06 30-minutes to population-wide obesity analysis (next slide) Discovered seasonality in blood pressure -- NEJM fall ‘06

VHA: the largest standardized electronic medical records system in the US.

Design, populate and tune a ~20 TB Data Warehouse and Analytics environment

Evaluate population health and treatment outcomes

Support epidemiological studies

7 million enrollees

5 million patients

Example Milestones:

1 Billionth Vital Sign loaded in April ‘06

30-minutes to population-wide obesity analysis (next slide)

Discovered seasonality in blood pressure -- NEJM fall ‘06
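The vitals-based BMI computation behind an analysis like this is straightforward: BMI = weight (kg) / height (m)², binned into the standard WHO classes. The function names and aggregation below are an illustrative sketch, not the VHA warehouse logic, which does the same thing in SQL at population scale.

```python
from collections import Counter

def bmi_class(weight_kg, height_m):
    """Bin one patient's BMI into the standard WHO categories."""
    bmi = weight_kg / height_m ** 2
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"

def population_bmi(vitals):
    """vitals: iterable of (weight_kg, height_m); returns counts per class."""
    return Counter(bmi_class(w, h) for w, h in vitals)
```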

HDR Vitals-Based Body Mass Index Calculation on VHA FY04 Population Source: VHA Corporate Data Warehouse [Chart: 3,249,156 total patients (100%) split across four BMI bands: 23,876 (0.7%), 701,089 (21.6%), 1,177,093 (36.2%), 1,347,098 (41.5%)]
