Published on February 17, 2014
The Role of Data Formats in Long-Term Earth Science Information Preservation
Bruce R. Barkstrom
Retired NASA/NOAA
Asheville, NC 28804
Outline
− The Difficulty of Preservation
− Threats – Risk of Loss
− Digital Artifacts and Representation Networks
− The OAIS Reference Model – Layered Representation
− Demonstrating Equality of Science Content
− Mechanisms for Formatting Data
− Archivally Safe Transformations
− Strategies for Reducing Risk
− Concluding Comments
The Difficulty of Preservation
To preserve 98% of an archive's contents for 200 years
− The average probability of loss per year needs to be below ~0.01% (four 9's in reliability engineering)
200 years covers a lot of events
− 50 administrations
− 200 annual budgets
− 70 cycles of hardware obsolescence (new models of stuff every 3 years)
− 10 generations (one every 20 years)
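The ~0.01% figure can be checked with a short calculation. The sketch below assumes a constant annual loss probability p that acts independently each year, so the surviving fraction after n years is (1 - p)**n; that model and the function name `max_annual_loss` are illustrative assumptions, not something stated on the slide.

```python
def max_annual_loss(survival_fraction: float, years: int) -> float:
    """Largest constant annual loss probability p such that
    (1 - p) ** years still equals the required survival fraction.
    Assumes losses are independent from year to year."""
    return 1.0 - survival_fraction ** (1.0 / years)


# Slide's target: keep 98% of the archive for 200 years.
p = max_annual_loss(0.98, 200)
print(f"annual loss probability must stay below {p:.4%}")
print(f"surviving fraction after 200 years at that rate: {(1 - p) ** 200:.2%}")
```

Under these assumptions the bound comes out at roughly one part in ten thousand per year, which is where the "four 9's" (99.99% annual reliability) comes from.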
Threats – Risk of Loss
− Media Failure
− Hardware Failure
− Software Failure
− Communication Errors
− Failure of Network Services
− Media & Hardware Obsolescence
− Software Obsolescence
− Operator Error
− Natural Disaster
− External Attack
− Internal Attack
− Economic Failure
− Organizational Failure
List from the LOCKSS Threat Model paper (Rosenthal et al., 2005)
Digital Artifacts and Representation Networks
Recent work by the EU CASPAR Project on Representation Networks (RN):
− Network of digital artifacts that the designated user community needs in order to understand archived information
Examples of Digital Artifacts in an RN
− Calibration Data, Reports, Plans, Procedures
− Satellite/Instrument Coordinate Descriptions and Plans
− ...
− Documentation Reading Software
− Data Format Documentation and Software
The OAIS Reference Model – Layered Representation
− The Object Layer, in which the Aggregations identified in the Aggregation Layer are classified into objects that are recognizable and meaningful in the application domain, such as images
− The Aggregation Layer, in which the individual data elements of the Data Element Layer are aggregated into structural groupings that form a tree whose leaves are ADEs
− The Data Element Layer, which consists of a sequence of Atomic Data Element (ADE) types (integers, reals, dates, character strings)
− The Bit Stream Layer, which consists of an array of bits
− The Media Layer, which includes data on disks, tapes, and networks
Scientific Data Only Part of File
A Data File can have four kinds of ADEs:
− Scientific Data
− Structural Info (array sizes; XML tags)
− Context Information
− Tacit Information (ordering of array elements)
Aggregation Layer Tree
− Trees for Data, Struc, Context
− Subtrees for arrays of records (like relational DB tables)
− Records contain elements that point to individual ADEs
An Example: NCDC Precip
Global Historical Climate Network (GHCN) Monthly Average Precipitation
− ~2100 rain gauge stations at peak
− Earliest data ~1835
− Data recorded in ASCII
− Station ID, lat, long in another file
Record layout (Strc = structural information, Dat = scientific data):
Station ID    Year   JAN   FEB   MAR   APR   MAY   JUN   JUL   AUG   SEP   OCT   NOV   DEC
Strc          Strc   Dat   Dat   Dat   Dat   Dat   Dat   Dat   Dat   Dat   Dat   Dat   Dat
211286790001  1891   203   202 -9999   231    96   186   152   646   139   430   169   209
211286790001  1892   150   148    26    81   448   262   328   568 -9999 -9999    66   301
211286790001  1893     0    41   121    46   779   198   701   511   107   192   418   249
211286790001  1894   150   120   310   185   279   672 -9999 -9999 -9999 -9999 -9999 -9999
In the ASCII file the fields are fixed-width with no delimiters, so the station ID and year run together and a flag such as -9999 abuts the preceding value.
Missing Obs Value: -9999
Trace Precip Value: -8888
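To make the record layout concrete, here is a minimal parsing sketch. The column widths (12 characters of station ID, 4 of year, then twelve 5-character monthly fields) are an assumption inferred from the sample rows on the slide, not taken from GHCN documentation, and `parse_record` is a hypothetical helper name.

```python
MISSING = -9999  # missing-observation flag from the slide
TRACE = -8888    # trace-precipitation flag from the slide


def parse_record(line: str) -> dict:
    """Split one fixed-width record by counting characters.
    Assumed layout: 12-char station ID, 4-char year, twelve 5-char fields."""
    station_id = line[0:12]
    year = int(line[12:16])
    raw = [int(line[16 + 5 * i: 21 + 5 * i]) for i in range(12)]
    # Turn the flag values into explicit markers instead of fake numbers.
    monthly = [None if v == MISSING else "trace" if v == TRACE else v
               for v in raw]
    return {"station_id": station_id, "year": year, "monthly": monthly}


# Rebuild the 1891 row from the slide in the assumed fixed-width form.
values_1891 = [203, 202, -9999, 231, 96, 186, 152, 646, 139, 430, 169, 209]
sample = "211286790001" + "1891" + "".join(f"{v:5d}" for v in values_1891)

record = parse_record(sample)
print(record["station_id"], record["year"], record["monthly"])
```

Note how much of the interpretation lives in the reader rather than the file: the field widths, the flag values, and the fact that position 3 means March are all knowledge the parser must bring with it.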
Mechanisms for Formatting Data
Three mechanisms:
− Use Templates to impress structure on data streams
− Use Delimiters (e.g. XML tags)
− Count Bits from beginning of array
Mechanisms identify digital artifacts used in RN
Where Does Understanding Lie?
With bit counting (e.g. FORTRAN)
− Compiler creates instructions for interpretation based on text of read program
With delimiter (e.g. XML)
− XML parser and compiler create instructions for interpretation based on DTD or Schema
Interpretation may reside in different places
− Array dimensions implicit in read program or conventions of language for file handling
− Array dimensions included in file
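The contrast between the two approaches can be sketched in a few lines. This is an illustration only: the 2 x 3 array, the big-endian 32-bit float encoding, and the `<array rows=... cols=...>` element are all hypothetical choices made for the example, not formats named in the talk.

```python
import struct
import xml.etree.ElementTree as ET

values = [1.5, 2.5, 3.5, 4.5, 5.5, 6.5]

# Bit counting: the file holds only raw big-endian 32-bit floats.
raw = struct.pack(">6f", *values)
# The array dimensions live in the read program, not in the file.
ROWS, COLS = 2, 3
flat = struct.unpack(f">{ROWS * COLS}f", raw)
grid_counted = [list(flat[r * COLS:(r + 1) * COLS]) for r in range(ROWS)]

# Delimiters: the (hypothetical) XML element carries its own dimensions.
doc = '<array rows="2" cols="3">1.5 2.5 3.5 4.5 5.5 6.5</array>'
node = ET.fromstring(doc)
rows, cols = int(node.get("rows")), int(node.get("cols"))
nums = [float(token) for token in node.text.split()]
grid_delimited = [nums[r * cols:(r + 1) * cols] for r in range(rows)]

print(grid_counted == grid_delimited)
```

Even in the delimited form, the row-by-row ordering of the numbers is not written down anywhere in the file; it remains tacit unless the schema or documentation states it.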
Beware of Tacit Knowledge
Some fields may reside in other files or be implicit
− Geolocation of pixels in MODIS: have to consult file outside spectral image file
− Months of year in GHCN precipitation: month numbering implicit in array ordering
Conventions may not be noted
− Date conventions (European vs American vs Astronomical Julian Date)
− Language encodings (Unicode vs ASCII)
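The date-convention hazard is easy to demonstrate. In this small sketch the string "03/04/2014" is a made-up value; the point is only that the same bytes parse to two valid dates depending on a convention the file does not record.

```python
from datetime import datetime

stamp = "03/04/2014"  # hypothetical date string with no stated convention

# Same characters, two readings.
american = datetime.strptime(stamp, "%m/%d/%Y").date()  # month first
european = datetime.strptime(stamp, "%d/%m/%Y").date()  # day first

print(american.isoformat(), european.isoformat())
print("days apart:", abs((european - american).days))
```

Neither parse raises an error, so a reader that guesses wrong gets a plausible but incorrect date with no warning.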
Archivally Safe Transformations
Key Question: How Could We Tell If Two Files Contain the Same Data?
− Individual data elements may be transformed in type: float -> double; byte -> int; ASCII -> Unicode
− Order of data elements in aggregation may be permuted or indexed differently: FORTRAN -> C in order
− Data may be separated into different files or aggregated into one
− Tacit information (e.g. Representation info) may be made explicit
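As a rough illustration of the key question, the sketch below stores the same small array twice: once in row-major (C) order as integers, and once in column-major (FORTRAN) order widened to floats. The array values and helper names are invented for the example; the comparison works by undoing the permutation before comparing element by element.

```python
def to_column_major(grid):
    """Flatten a list-of-rows grid in column-major (FORTRAN) order."""
    rows, cols = len(grid), len(grid[0])
    return [grid[r][c] for c in range(cols) for r in range(rows)]


def from_column_major(flat, rows, cols):
    """Rebuild a list-of-rows grid from column-major data."""
    return [[flat[c * rows + r] for c in range(cols)] for r in range(rows)]


grid = [[203, 202, 231],
        [150, 148, 26]]

# File A: row-major (C) order, integer elements.
file_a = [v for row in grid for v in row]
# File B: column-major (FORTRAN) order, elements widened to float.
file_b = [float(v) for v in to_column_major(grid)]

# Undo the ordering change, then compare values element by element.
restored = from_column_major(file_b, rows=2, cols=3)
same_content = [v for row in restored for v in row] == file_a
print(file_b)
print(same_content)
```

The widening here is lossless only because these integers are exactly representable as floats; a type change that rounds or truncates would not pass the same test.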
Strategies for Reducing Risk
Migration
− Expect to create new files from old on a fairly regular basis, although with changes as needed to avoid risk of loss
Transparency
− Make transformations explicit and record mappings from one format to another
Diversity
− Record data in more than one format
− Decentralize management and rely on federated authentication (and audit)
Concluding Comments
Use of slightly modified OAIS RM Layered Model
− Gives solid basis for precise identification of particular scientific data
− Rigorous basis for identifying scientifically identical subsets
For two files to have identical scientific data
− Use one-to-one and onto mapping, including necessary order permutations
− One-to-one and onto mapping guarantees inverse mapping
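The one-to-one and onto condition can be checked mechanically. The following sketch represents a mapping between element positions in two files as a Python dict and builds its inverse only if the mapping is a bijection; the function name `invert_if_bijective` and the 2 x 3 row-major to column-major example are assumptions made for illustration.

```python
def invert_if_bijective(mapping, targets):
    """Return the inverse of `mapping` if it is one-to-one and onto
    `targets`; otherwise raise ValueError."""
    inverse = {}
    for src, dst in mapping.items():
        if dst in inverse:
            raise ValueError(
                f"not one-to-one: {inverse[dst]} and {src} both map to {dst}")
        inverse[dst] = src
    if set(inverse) != set(targets):
        raise ValueError("mapping does not cover the target positions exactly")
    return inverse


# Position of each element in a row-major file -> position in a
# column-major file, for a 2 x 3 array (an order permutation).
ROWS, COLS = 2, 3
mapping = {r * COLS + c: c * ROWS + r
           for r in range(ROWS) for c in range(COLS)}

inverse = invert_if_bijective(mapping, range(ROWS * COLS))
round_trip_ok = all(inverse[mapping[i]] == i for i in mapping)
print(mapping)
print(round_trip_ok)
```

A transformation that merges two source elements into one target, or leaves a target unfilled, fails this check, which signals that the original file could not be reconstructed from the new one.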