
MIME Magic with Apache Tika

Information about MIME Magic with Apache Tika

Published on April 14, 2008

Author: jukka

Source: slideshare.net

Description

Fast Feather Track presentation at ApacheCon EU 2008 in Amsterdam

MIME Magic with Apache Tika

Jukka Zitting

Tika committer and mentor

Agenda

The Problem

The Solution

The Project

The Client

The Problem

[Diagram labels: PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc.; Lucene index]

It's even worse!

Licensing/Patents

Dependencies

Metadata extraction

Structured content

Encryption/Compression

Package formats

Streaming/Performance

Processing of digital media


The Solution: Technical

Generic API for extracting metadata and structured text content from a document

Input: byte stream + optional metadata

Output: XHTML SAX events + metadata

Automatic content type detection: magic bytes and file name patterns
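The two detection strategies named on this slide can be illustrated with a minimal sketch in plain Java. This is not Tika's implementation or API: the class name `TypeSniffer`, the method `detect`, the small pattern tables, and the choice to let magic bytes take precedence over the file name are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika: illustrates magic-byte and
// file-name-pattern detection with a tiny, hand-picked pattern set.
public class TypeSniffer {

    // Magic header patterns: leading bytes mapped to a media type.
    private static final Map<byte[], String> MAGIC = new LinkedHashMap<>();
    // File name patterns, simplified to "*.ext" suffix matches.
    private static final Map<String, String> SUFFIXES = new LinkedHashMap<>();

    static {
        MAGIC.put("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf");
        MAGIC.put("{\\rtf".getBytes(StandardCharsets.US_ASCII), "application/rtf");
        MAGIC.put(new byte[] {'P', 'K', 3, 4}, "application/zip");
        SUFFIXES.put(".pdf", "application/pdf");
        SUFFIXES.put(".rtf", "application/rtf");
        SUFFIXES.put(".txt", "text/plain");
        SUFFIXES.put(".html", "text/html");
    }

    // head: the first bytes of the stream; fileName: optional name hint (may be null).
    public static String detect(byte[] head, String fileName) {
        // 1. Magic bytes (assumption of this sketch: they win over the name).
        if (head != null) {
            for (Map.Entry<byte[], String> e : MAGIC.entrySet()) {
                byte[] magic = e.getKey();
                if (head.length >= magic.length
                        && Arrays.equals(Arrays.copyOf(head, magic.length), magic)) {
                    return e.getValue();
                }
            }
        }
        // 2. File name pattern, compared case-insensitively.
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> e : SUFFIXES.entrySet()) {
                if (lower.endsWith(e.getKey())) {
                    return e.getValue();
                }
            }
        }
        // 3. Nothing matched: fall back to the generic binary type.
        return "application/octet-stream";
    }
}
```

The file name plays the role of the "optional metadata" input: it is only a hint, so in this sketch a stream that starts with `%PDF-` is reported as PDF even if it is named `report.txt`.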

The Solution: Legal / Social

Apache License

(L)GPL projects can implement the Tika API

Pooling of efforts

Active development and maintenance

Already beyond the functionality of most custom solutions


Project Status

Incubating since March 2007

Sponsoring PMC: Apache Lucene

First release (0.1-incubating) in December 2007

Interaction with PDFBox, POI, etc.

Currently in early adopter phase

Current Features

73 registered media types

167 glob patterns

26 magic header patterns

7 built-in parser classes

51 supported media types

MS Office, OpenOffice, HTML, PDF, XML, RTF, plain text

Project Statistics


Tika Parser API

    package org.apache.tika.parser;

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    public interface Parser {
        // Parses document content and metadata
        void parse(InputStream stream, ContentHandler handler, Metadata metadata)
                throws IOException, SAXException, TikaException;

        // Parses document metadata, @since Tika 0.2
        void parse(InputStream stream, Metadata metadata)
                throws IOException, TikaException;
    }
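The `ContentHandler` argument receives the document as XHTML SAX events. As a rough sketch of what a client-side handler could look like, the class below collects character data and inserts a line break after each paragraph. It uses only the JDK: the class name `TextCollector`, the helper `collect`, and the sample markup are hypothetical, and the JDK's own SAX parser stands in for a Tika parser as the event source.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers the text content of incoming XHTML SAX events.
public class TextCollector extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Character data arrives in chunks; append each chunk as it comes.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Use the structure of the events: end each paragraph with a newline.
        if ("p".equals(qName)) {
            text.append('\n');
        }
    }

    public String getText() {
        return text.toString();
    }

    // Stand-in event source for this sketch: the JDK SAX parser reads an
    // XHTML string and fires the events that a document parser would emit.
    public static String collect(String xhtml) {
        TextCollector handler = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)),
                    handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XHTML", e);
        }
        return handler.getText();
    }
}
```

Because `DefaultHandler` implements `ContentHandler`, an instance of such a class could be passed as the `handler` argument of `Parser.parse`.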

Example: Text extraction

    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.WriteOutContentHandler;
    import org.xml.sax.ContentHandler;

    // Wrapper class name is illustrative; the slide shows only main()
    public class TextExtraction {
        public static void main(String[] args) throws Exception {
            InputStream stream = System.in;
            ContentHandler handler = new WriteOutContentHandler(System.out);
            Metadata metadata = new Metadata();
            // Detects the content type and delegates to the matching parser
            new AutoDetectParser().parse(stream, handler, metadata);
        }
    }

Demo: Tika GUI

Thank You!
