Published on October 20, 2008

Author: aSGuest1448



Heuristic Approach for Automatic Metadata Capture of E-books/Journals
ARD Prasad
DRTC, Indian Statistical Institute, Bangalore

Agenda:
- Earlier experiment with printed books
- Present experiment with e-books and e-journals

Heuristics for Printed Books:
- Heuristics for the ...
- Title page
- Verso of the title page

Methodology for Printed Books:
- Scan the title page
- OCR the image
- Generate the output in HTML
- Apply heuristics to the HTML pages
- Identify the bibliographic elements

Heuristics for Verso of the Title Page:
- Identify date, edition, etc.
- Check whether prenatal cataloging is available
- Identify the bibliographic elements in the prenatal catalog entry
- Cross-check the identifications made from the title page
- Resolve any conflicts

Generating Bibliographic Records:
- Once the bibliographic elements are identified, generate bibliographic records in:
- ISO 2709
- Dublin Core

Sample Heuristics for Identifying Title:
- Order of the bibliographic elements
- Titles are found in the upper or upper-middle portion of the title page.
- The title appears first on the title page (75.15 per cent). In a few cases the author or series occupies the first position.
- Fonts used in the title field are the largest (94.99 per cent) compared with the fonts in other fields.

Slide 8:
- If the title and sub-title occur on the same line, they are separated by “:” (colon) or “-” (hyphen).
- A title need not contain only alphabetic characters. The title string may include numerals and punctuation marks such as commas and hyphens.
- Titles usually contain terms like “The”, “An”, “Introduction”, “Theory”, “in”, “to”.
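The title heuristics above (largest font, upper portion of the page, first position, colon or hyphen before the sub-title) could be combined roughly as in the sketch below. This is only an illustration, not the presenter's implementation: the `TextBlock` structure, the field names, the score weights, and the 0.5 cut-off for "upper portion" are all assumptions, and the hyphen is matched only when surrounded by spaces so that hyphenated words are not split.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TextBlock:
    """One line of text recovered from the OCR'd title page (hypothetical structure)."""
    text: str
    font_size: float  # assumed to be available from the OCR's HTML output
    y: float          # assumed vertical position: 0.0 = top of page, 1.0 = bottom


def score_block(block: TextBlock, index: int, max_font: float) -> float:
    """Score a block as a title candidate. The weights are illustrative assumptions."""
    score = 0.0
    if block.font_size == max_font:
        score += 3.0  # titles usually use the largest font
    if block.y <= 0.5:
        score += 2.0  # titles sit in the upper or upper-middle portion
    if index == 0:
        score += 1.0  # titles usually appear first on the page
    return score


def split_title(line: str) -> Tuple[str, Optional[str]]:
    """Split 'Title: Sub-title' or 'Title - Sub-title' into its two parts."""
    for sep in (":", " - "):
        if sep in line:
            title, subtitle = line.split(sep, 1)
            return title.strip(), subtitle.strip()
    return line.strip(), None


def identify_title(blocks: List[TextBlock]) -> Optional[Tuple[str, Optional[str]]]:
    """Return (title, subtitle) for the highest-scoring block, or None if no blocks."""
    if not blocks:
        return None
    max_font = max(b.font_size for b in blocks)
    _, best = max(
        enumerate(blocks),
        key=lambda pair: score_block(pair[1], pair[0], max_font),
    )
    return split_title(best.text)


if __name__ == "__main__":
    # Made-up title page, for illustration only.
    page = [
        TextBlock("Example Series in Computing", 12, 0.10),
        TextBlock("Information Retrieval: Theory and Practice", 28, 0.30),
        TextBlock("A. N. Author", 14, 0.60),
    ]
    print(identify_title(page))
```

A weighted score is used here rather than a strict rule chain because the slides report the heuristics as tendencies (75.15 and 94.99 per cent), not certainties; a real system would tune the weights against sample title pages.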
Heuristics for other elements:
- Sub-titles
- Edition
- Volume
- Authors/Contributors
- Publisher
- Place
- Year
- Series

Present Experiment:
- E-Books (from sites like )
- E-Journals (non-OAI-compliant)

Methodology:
- Template-based identification
- Heuristic-based identification

Disadvantages of Template Based Approach:
- New templates have to be created for every new site.
- A site may change its appearance, so more than one template may be needed for each site or journal.

Methodology:
- Study a few sites to develop heuristics
- Use a web crawler to probe the site
- Identify the files containing documents (filter out irrelevant files)
- Apply heuristics to the files containing e-documents
- Generate Dublin Core records

Thank You:
Welcome to the International Conference on Semantic Web & Digital Libraries
21st – 23rd February, 2007
Indian Statistical Institute, Bangalore

