Published on January 21, 2008

Author: Stentore


THE DIATHESIS NEWSPAPER DIGITIZATION SUITE

Foundation of Research and Technology
Institute of Computer Science
Centre for Cultural Informatics
Martin Doerr, Georgios Markakis, Maria Theodoridou
Heraklion, Crete, Greece

About DIATHESIS

Diathesis is a newspaper digitization suite whose primary purpose is the digitization, classification and dissemination of archival newspaper material.
- It was originally used for the digitization of the Vikelaia Municipal Library’s newspaper collection (1890-1960) at Heraklion, Crete.
- It has since evolved into an independent digitization suite.
- It has been used in other projects as well (Filekpedeytiki Etairia, Athens, Greece: the “AYGHI” newspaper).

The Problem

Historical newspapers are one of the most significant sources of information for researchers, due to the wealth of information they provide regarding every aspect of everyday political, social and intellectual life. Access to this type of archival material is usually obstructed by the following factors:
- In order to protect the archival material from potential damage, some archives prohibit access to the largest part of their collection.
- Direct contact with the original archival material constitutes a potential health hazard (due to dust and fungi).
- The lack of indexes to newspapers, combined with the vastness of the information contained in them, makes research a very time-consuming task.

Many archives adopted digitization of newspapers as a straightforward method to deal with the above problems. Digitized material is easier to preserve and much easier to distribute via the Web. However, conversion of archival material into a digital image format (e.g. JPEG, TIFF, PDF or DJVU) does not solve the problem of rapid access to this material. Digitization by itself is inadequate if it does not provide the means of accessing the digitized material in a timely and accurate manner (also known as the searchability issue).
Current State of the Art Newspaper Digitization Practices

Currently there are three main approaches for rendering newspaper archival material searchable:
- The Physical Features Based Approach.
- The OCR Based Full Text Indexing Approach.
- The Conceptual Classification (Ontology Based) Approach.

The physical features based classification approach

Newspapers are classified using a basic set of metadata regarding physical features of the original material (number of issue, date of publication, newspaper name, number of pages etc).

Advantages:
- Simple to implement.

Disadvantages:
- The final user is unable to conduct full-text searches on an article or issue level basis.
- The final outcome of the digitization effort resembles a browsing mechanism rather than a search mechanism.
- There is no explicitly defined conceptual structure of the archive.

Institutions:
- Anno: Austrian newspapers online project
- “Exilpresse digital. Deutsche Exilzeitschriften 1933-1945” project
- Denmark: Digitaliserede danske aviser 1759-1865

The OCR based Full Text Indexing Approach

Automatic digitization approaches that make use of OCR analysis of digitized newspapers. Full text indexing techniques are currently considered to be the state of the art in the area of newspaper digitization, mainly for the following reasons:
- Creation of a searchable full-text index via OCR is a much faster process than the manual creation of metadata.
- Separation of searchability and readability.
- It is possible to conduct searches on a page/issue/article level basis.
- The search is conducted via keywords, in a manner that is familiar to the average user of contemporary Web search engines.
- Efficient content dissemination over the Web.

Disadvantages:
- Well known precision/recall issues.
- Newspaper archives are not as chaotic as the Web.
- The search of information in OCR based information retrieval systems is conceptually blind.
- The import process is a computationally expensive procedure.

Institutions adopting this approach:
- British Library online newspaper archive
- The Brooklyn Daily Eagle online
- Northern New York historical newspapers
- Utah Digital Newspapers
- Historical newspapers in Washington
To mention just a few…

The conceptual classification approach

The conceptual classification approach overcomes many of the above weaknesses by enabling the user to perform a knowledge engineering task upon the already digitized material via the use of ontologies.

An Ontology: “the specification of one's conceptualization of a knowledge domain”.

Advantages:
- Ontologies are used to express a specific conceptual view over the digitized material.
- The use of top level ontologies guarantees, to a certain extent, semantic interoperability among different archives.
- The user may classify a document with concepts that are not contained within the document itself.

Disadvantages:
- Given the density of information in a newspaper, production of metadata is a notoriously time-consuming task (knowledge engineering bottleneck).
- It is almost impossible to manually define, in a timely manner, all the semantic relations or entities contained even in a single article.

The DIATHESIS Approach: a hybrid approach

This system attempts to implement a realistic conceptual classification approach by combining the best elements from the three approaches mentioned above:
- It permits searches on a newspaper issue basis (newspaper issue name, number, publication date), in a similar manner to the physical features based approach.
- It permits searches on an article level basis via the use of full text queries, in a similar manner to the OCR based Full Text Indexing Approach.
- It permits searches on an article level basis via the semantic relationships assigned to each segment.
- It permits searches that combine all of the above elements.

The system does not attempt to create a complete semantic structure that includes all the semantic relationships and entities (Actors, Places) described in the text. Instead, it focuses on the creation of a coherent semantic backbone that can be easily enriched with semantic relations. DIATHESIS uses the CIDOC CRM as its underlying ontology.

Aims of DIATHESIS

- To render the digitized newspapers searchable on a document/article level basis.
- To exploit OCR technology in order to enable full text search in a newspaper collection.
- To combine full text search with user-defined metadata based search on a document and article level basis, in order to enhance the overall precision of the system.
- To provide visualization facilities and an ergonomic interface for:
  - The timely completion of metadata according to a set of predefined thesauri hierarchies.
  - The browsing of the digitized newspaper collection given a set of predefined thesauri hierarchies.
- To deal with issues of semantic interoperability of digitized material (conformance to international standards).
- To create a robust semantic backbone that will allow the full implementation of the CIDOC CRM Model.

About CIDOC

What is the CIDOC Conceptual Reference Model?
- An object oriented ontology of about 80 classes and 130 properties for cultural and natural history.
- CRM instances can be encoded in many forms: RDBMS, ooDBMS, XML, RDF(S), OWL.
- Accepted as ISO-21127 in June 2005.

The CRM:
- Is not a metadata standard. It is meant to become our language for semantic interoperability; it is a Conceptual Reference Model for analyzing and designing cultural information systems.
- Is limited to the underlying semantics of database schemata and document structures used in cultural heritage and museum documentation.
- Does not define the terminology used to document these data structures.
- Does not say what cultural institutions should document.
- Aims to explain the logic of what they actually do document.

An Example Hierarchy: E70 Stuff (Thing)

CIDOC Example (1): Modeling an Activity

[Diagram labels: E7 Activity “Crimea Conference”; E65 Creation Event; February 1945; P14 performed; P11 participated in; P94 has created; P86 falls within; P7 took place at; P67 is referred to by; P81 ongoing throughout; P82 at some time within.]

CIDOC Example (2): Describing a composite artifact

CIDOC-CRM DIATHESIS implementation: Issue/Segments Relationships

CIDOC-CRM DIATHESIS implementation: Issue Physical Features

CIDOC-CRM DIATHESIS implementation: Activity References

Thesauri Hierarchies

CIDOC based newspaper annotation

[Diagram: Integration by Factual Relations. Labels: CIDOC CRM Core Ontology; Documents in Digital Libraries; real world nodes (KOS); Ethiopia; Hadar; Johanson's Expedition; Discovery of Lucy; Lucy; Donald Johanson; Benaki Museum.]

The System Architecture: Software Components

[Diagram labels: Server Side; Client Side; Apache Tomcat Application Server; Newspaper Digitization Suite; Diathesis Administrator; Diathesis Annotation Mechanism; DIATHESIS Web Search; Database; SIS-TMS Thesaurus Management System.]

The System Architecture: Workflow View

The user interface

FEATURES:
- Fully Web based.
- Simple to use / easy to learn.
- Intelligent upload / download mechanism.
- Workflow control.
- Data loss prevention mechanism (temporary local storage and data recovery).
- Flexible and ergonomic completion of metadata fields.
- Automatic highlighting of keywords in OCR text (Actors, Places).
- Use of SVG thesauri hierarchies for the timely completion of vocabulary-reserved metadata fields.

[Diagram labels: DIATHESIS; Annotation Mechanism; End User Search Mechanism; Administrator; Usage Stats; Mass Import; System Configuration; Search for Subjects; Search for Issues.]

Demonstration: Annotation Interface

Demonstration: End User Search Mechanism

Future Directions

- Enrich the metadata creation process with Information Extraction techniques.
- Expand the suite with complementary Deep Semantic Annotation capabilities (Semantic Wiki).

PHASE 1: Material Preprocessing Phase.
PHASE 2: Shallow Semantic Annotation (metadata production phase).
PHASE 3: Deep Semantic Annotation (full CIDOC implementation phase).
[Diagram labels: DIATHESIS; Semantic Wiki.]

Conclusions

- OCR is a hot new technology in newspaper digitization practices. However, it is not capable of dealing with a plethora of issues.
- Deep semantic annotation via Semantic Web technologies is a promising future trend. CIDOC CRM provides the theoretical means to achieve this. The problem is how to implement it.
- Creation of the deep semantic relationships that exist within the boundaries of a single newspaper issue is a time-consuming, and therefore expensive, task.
- The DIATHESIS digitization suite encapsulates a digitization strategy towards the creation of a vast semantic network of factual relationships between CIDOC entities, while effectively dealing with the following issues:
  - Digitization and storage of newspaper material.
  - Rendering digitized material searchable on an issue/article level basis via the use of metadata, thesauri hierarchies and full text queries.
  - Creating a semantic backbone that can be used by future implementations.
- The next step: link the DIATHESIS semantic backbone with a Semantic Wiki.

Thank You!
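
Illustrative sketch: combining issue metadata, full text and thesaurus terms

The slides describe searches that combine issue-level metadata, full text queries over OCR text, and thesaurus-based subject terms, but they do not show how DIATHESIS implements them. The Python sketch below is only a minimal illustration of that idea under stated assumptions: the `Segment` record, the `THESAURUS` fragment, the function names and the sample data (including the newspaper name "Example Gazette") are all hypothetical and are not taken from the DIATHESIS suite.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Segment:
    """One article-level segment of a digitized issue (hypothetical layout)."""
    newspaper: str
    issue_number: int
    published: date
    ocr_text: str
    subjects: set = field(default_factory=set)


# Hypothetical thesaurus fragment: broader term -> narrower terms.
THESAURUS = {
    "Politics": {"Elections", "Diplomacy"},
    "Diplomacy": {"Conferences"},
}


def narrower_closure(term, thesaurus=THESAURUS):
    """Return the term together with all of its narrower terms, recursively."""
    found = {term}
    stack = [term]
    while stack:
        for child in thesaurus.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def search(segments, newspaper=None, year=None, keyword=None, subject=None):
    """Keep the segments that satisfy every criterion that was supplied."""
    wanted = narrower_closure(subject) if subject else None
    hits = []
    for seg in segments:
        if newspaper and seg.newspaper != newspaper:
            continue  # issue-level metadata filter
        if year and seg.published.year != year:
            continue  # issue-level metadata filter
        if keyword and keyword.lower() not in seg.ocr_text.lower():
            continue  # naive full text match on the OCR text
        if wanted is not None and not (seg.subjects & wanted):
            continue  # subject filter, expanded through the thesaurus
        hits.append(seg)
    return hits


# Invented sample data, for illustration only.
SEGMENTS = [
    Segment("Example Gazette", 12, date(1945, 2, 13),
            "Report on the Crimea Conference.", {"Conferences"}),
    Segment("Example Gazette", 13, date(1945, 2, 14),
            "Local market prices rise.", {"Economy"}),
    Segment("Example Gazette", 40, date(1946, 3, 1),
            "A conference on schooling opens.", {"Education"}),
]

full_text_only = search(SEGMENTS, keyword="conference")
combined = search(SEGMENTS, keyword="conference", subject="Politics")
```

In this toy data the keyword query alone matches two segments, while adding the subject "Politics" (expanded to its narrower terms) keeps only the one about the Crimea Conference. This is the kind of precision gain the slides attribute to combining full text search with metadata.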

