Apache Tika

Published on November 15, 2007

Author: jukka

Source: slideshare.net

Apache Tika: an extensible, configurable content analysis toolkit

Agenda: The Problem, The Solution, The Project, The Design

The Problem

PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc. feeding a Lucene index

It’s Worse Than That

Licensing

Dependencies

Metadata extraction

Structured content

Encryption/Compression

Package formats

Streaming

Processing of digital media

The Solution: Technical

Generic API for extracting metadata and structured text content from a document

Input: byte stream + optional metadata

Output: XHTML SAX events + metadata

Automatic content type detection

Magic bytes

File name patterns
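
The input and output bullets above can be sketched with nothing but the JDK's SAX classes. This is not Tika's API: the class name `SketchTextParser`, the `parse` signature, and the use of a plain `Map` for metadata are assumptions made for illustration. The sketch reads a byte stream, records one metadata entry, and reports the text as XHTML SAX events.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical parser for illustration only; not part of Tika.
class SketchTextParser {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Input: byte stream + metadata map. Output: XHTML SAX events + updated metadata.
    static void parse(InputStream in, ContentHandler out, Map<String, String> metadata)
            throws IOException, SAXException {
        // Assumption: the input is UTF-8 plain text.
        String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        metadata.put("Content-Type", "text/plain");

        AttributesImpl none = new AttributesImpl();
        out.startDocument();
        out.startElement(XHTML, "html", "html", none);
        out.startElement(XHTML, "body", "body", none);
        for (String line : text.split("\\R")) {
            if (line.isEmpty()) continue; // skip blank lines
            out.startElement(XHTML, "p", "p", none);
            out.characters(line.toCharArray(), 0, line.length());
            out.endElement(XHTML, "p", "p");
        }
        out.endElement(XHTML, "body", "body");
        out.endElement(XHTML, "html", "html");
        out.endDocument();
    }

    // Convenience wrapper: collects character events, one line per <p> element.
    static String extractText(byte[] data, Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder();
        DefaultHandler collector = new DefaultHandler() {
            @Override public void characters(char[] ch, int start, int length) {
                sb.append(ch, start, length);
            }
            @Override public void endElement(String uri, String local, String qName) {
                if ("p".equals(local)) sb.append('\n');
            }
        };
        try {
            parse(new ByteArrayInputStream(data), collector, metadata);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        byte[] data = "Apache Tika\nContent analysis toolkit".getBytes(StandardCharsets.UTF_8);
        System.out.print(extractText(data, metadata));
        System.out.println(metadata);
    }
}
```

Emitting SAX events rather than returning a string lets a consumer process the structure as it streams past, which fits the "Streaming" concern listed earlier in the deck.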

The Solution: Legal / Social

Apache License

(L)GPL projects can implement the Tika API

Pooling of efforts

Active development and maintenance

Already beyond the functionality of most custom solutions

Cool future goals: OCR, speech recognition, …

Project Status

First planned in early 2006

Incubating since March 2007

Sponsoring PMC: Apache Lucene

No releases yet

0.1 release being planned

Small development team

6 committers, 3-4 currently active

Current Features

Media type framework

Shared MIME info spec (freedesktop.org)

Default media type registry (incl. glob and magic patterns)

Parser components

PDF (PDFBox)

Plain text (ICU4J)

XML (SAX)

HTML (NekoHTML)

Word, PowerPoint, Excel (POI)

ODF (SAX)

RTF (Swing)

Project Statistics

Codebase History

Projects: Lius, Nutch, Lius Lite, Tika, textmining, Jackrabbit

People: Andy Clark, Jukka Zitting, Rida Benjelloun, Chris Mattmann, Jerome Charron, Sami Siren, Bertrand Delacretaz, Keith Bennett

Content Extraction

Input: PPT

new PowerPointParser().parse(…);

Output: Type: application/vnd.ms-powerpoint, Title: Apache Tika, Author: Jukka Zitting

Media Type Detection

Type definitions: tika-mimetypes.xml, /etc/magic, mime.types

MimeTypes types = …; MimeType type = types.getMimeType(…);

Result: application/vnd.ms-powerpoint
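
To make the interplay of magic bytes and file name patterns concrete, here is a minimal JDK-only sketch. It is not Tika's `MimeTypes` class: `SketchDetector`, its tiny hard-coded tables, and the fallback to `application/octet-stream` are all assumptions for illustration. It shows why both signals matter: legacy Office files share one container signature, so the name is needed to tell them apart.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical detector for illustration only; not Tika's MimeTypes class.
class SketchDetector {
    static final String UNKNOWN = "application/octet-stream";
    static final byte[] PDF_MAGIC = {'%', 'P', 'D', 'F', '-'};
    // OLE2 compound document header, shared by legacy Word, Excel and PowerPoint files.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};

    // Glob pattern -> media type (a tiny stand-in for a real registry).
    static final Map<String, String> GLOBS = new LinkedHashMap<>();
    static {
        GLOBS.put("*.pdf", "application/pdf");
        GLOBS.put("*.ppt", "application/vnd.ms-powerpoint");
        GLOBS.put("*.doc", "application/msword");
        GLOBS.put("*.txt", "text/plain");
    }

    static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    // Returns the media type suggested by the file name, or null if no glob matches.
    static String byName(String fileName) {
        if (fileName == null) return null;
        Path name = Paths.get(fileName.toLowerCase(Locale.ROOT)).getFileName();
        if (name == null) return null;
        for (Map.Entry<String, String> e : GLOBS.entrySet()) {
            PathMatcher m = FileSystems.getDefault().getPathMatcher("glob:" + e.getKey());
            if (m.matches(name)) return e.getValue();
        }
        return null;
    }

    static String detect(byte[] prefix, String fileName) {
        // An unambiguous signature wins over whatever the name claims.
        if (startsWith(prefix, PDF_MAGIC)) return "application/pdf";
        String named = byName(fileName);
        if (startsWith(prefix, OLE2_MAGIC)) {
            // The magic only identifies the container; the name picks the Office format.
            boolean office = "application/vnd.ms-powerpoint".equals(named)
                    || "application/msword".equals(named);
            return office ? named : UNKNOWN;
        }
        return named != null ? named : UNKNOWN;
    }

    public static void main(String[] args) {
        byte[] pdf = {'%', 'P', 'D', 'F', '-', '1', '.', '4'};
        System.out.println(detect(pdf, "notes.txt"));
        System.out.println(detect(OLE2_MAGIC, "deck.ppt"));
    }
}
```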

Combined Detection and Extraction

Input: PPT, TXT, PDF, XML

new AutoDetectParser().parse(…);

Output: Type: application/vnd.ms-powerpoint, Title: Apache Tika, Author: Jukka Zitting
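
The combined step amounts to "detect the type, then hand the bytes to the parser registered for it". The sketch below shows that dispatch using only the JDK. It is not Tika's `AutoDetectParser`: `SketchAutoParser`, its `register` method, and the toy detector and parsers in `demo()` are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical dispatcher for illustration only; not Tika's AutoDetectParser.
class SketchAutoParser {
    private final Function<byte[], String> detector;
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    SketchAutoParser(Function<byte[], String> detector) {
        this.detector = detector;
    }

    // Registers the parser to use for one media type.
    SketchAutoParser register(String mediaType, Function<byte[], String> parser) {
        parsers.put(mediaType, parser);
        return this;
    }

    // Detects the media type, then delegates to the matching parser.
    String parse(byte[] data) {
        String type = detector.apply(data);
        Function<byte[], String> parser = parsers.get(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser for " + type);
        }
        return type + ": " + parser.apply(data);
    }

    // Example wiring with a toy detector and two toy parsers (all assumptions).
    static SketchAutoParser demo() {
        Function<byte[], String> detector = d ->
                new String(d, StandardCharsets.ISO_8859_1).startsWith("%PDF-")
                        ? "application/pdf" : "text/plain";
        return new SketchAutoParser(detector)
                .register("text/plain", d -> new String(d, StandardCharsets.UTF_8).trim())
                .register("application/pdf", d -> "(PDF body, " + d.length + " bytes)");
    }

    public static void main(String[] args) {
        System.out.println(demo().parse(" hello ".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Keeping the detector and the parser table separate means a new format can be supported by registering one more entry, without touching the dispatch logic.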

Thank You!
