Inex07

Information about Inex07

Published on January 4, 2008

Author: alfonsoeromero

Source: slideshare.net

Probabilistic Methods for Structured Document Classification at INEX'07
Luis M. de Campos, Juan M. Fernández-Luna, Juan F. Huete, Alfonso E. Romero
Departamento de Ciencias de la Computación e Inteligencia Artificial, Universidad de Granada
{lci,jmfluna,jhg,aeromero}@decsai.ugr.es
INEX 2007, Dagstuhl (Germany), December 18, 2007

Abstract
We present the results of our participation in the Document Mining track at INEX'07. We submitted several runs to this track (classification only). This is the first year we have participated, and the results are relatively good.

Our aim
Finding good XML-to-flat-document transformations, in order to apply probabilistic flat-text classifiers and to improve the results of flat classifiers.


The models: Multinomial Naive Bayes
Extensively used in text classification (see the paper by McCallum). The posterior probability of a class is computed using Bayes' rule:
$$p(c_i \mid d_j) = \frac{p(c_i)\, p(d_j \mid c_i)}{p(d_j)} \propto p(c_i)\, p(d_j \mid c_i) \qquad (1)$$
The probabilities $p(d_j \mid c_i)$ are assumed to follow a multinomial distribution:
$$p(d_j \mid c_i) \propto \prod_{t_k \in d_j} p(t_k \mid c_i)^{n_{jk}} \qquad (2)$$
The required values are estimated as follows:
$$p(t_k \mid c_i) = \frac{N_{ik} + 1}{N_{i\bullet} + M} \quad \text{and} \quad p(c_i) = \frac{N_{i,doc}}{N_{doc}}$$
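The following is a minimal Python sketch of the multinomial Naive Bayes model described on this slide, not the authors' implementation. It assumes Laplace-smoothed term estimates, with M taken as the vocabulary size and the counts taken as term occurrences per class. The function names `train_multinomial_nb` and `classify` are hypothetical, and the computation is done in log space to avoid underflow.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels):
    """Estimate log p(c_i) and log p(t_k | c_i) from tokenized documents.

    docs   : list of token lists (one list per training document)
    labels : list of class names, aligned with docs
    """
    vocab = {term for doc in docs for term in doc}
    vocab_size = len(vocab)                 # M (assumed: vocabulary size)
    docs_per_class = Counter(labels)        # N_{i,doc}
    term_counts = defaultdict(Counter)      # N_{ik}
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in docs_per_class.items()}
    log_cond = {}
    for c in docs_per_class:
        class_total = sum(term_counts[c].values())   # total term occurrences in class c
        log_cond[c] = {
            t: math.log((term_counts[c][t] + 1) / (class_total + vocab_size))
            for t in vocab
        }
    return log_prior, log_cond


def classify(doc, log_prior, log_cond):
    """Return the class maximizing log p(c_i) + sum_k n_jk * log p(t_k | c_i)."""
    frequencies = Counter(doc)              # n_{jk}
    scores = {}
    for c, prior in log_prior.items():
        score = prior
        for term, n_jk in frequencies.items():
            if term in log_cond[c]:         # terms outside the vocabulary are skipped
                score += n_jk * log_cond[c][term]
        scores[c] = score
    return max(scores, key=scores.get)


# Toy training data, invented purely for illustration.
docs = [["goal", "match", "goal"], ["match", "team"],
        ["vote", "election"], ["election", "party", "vote"]]
labels = ["sport", "sport", "politics", "politics"]
log_prior, log_cond = train_multinomial_nb(docs, labels)
print(classify(["goal", "team"], log_prior, log_cond))
```

Skipping out-of-vocabulary terms at classification time is a design choice of this sketch; the slides do not say how unseen terms are handled.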

The models: OR Gate Bayesian Network Classifier
The classifier tries to model rules of the following kind: IF $(t_i \in d) \lor (t_j \in d) \lor \dots$ THEN classify as $c$.
The canonical model used is a noisy OR gate, which is related to the notion of causality: the appearance of some terms is the cause of the assignment of a certain class.
The general expression of the probability distributions is:
$$p(c_i \mid pa(C_i)) = 1 - \prod_{T_k \in R(pa(C_i))} \bigl(1 - w(T_k, C_i)\bigr)$$
$$p(\bar{c}_i \mid pa(C_i)) = 1 - p(c_i \mid pa(C_i))$$
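As a small worked example of the noisy OR expression, suppose exactly two parent terms $T_1$ and $T_2$ are present in the configuration. The weights 0.3 and 0.5 below are made up for illustration only and do not come from the slides.

```latex
\begin{align*}
p(c_i \mid pa(C_i)) &= 1 - \bigl(1 - w(T_1, C_i)\bigr)\bigl(1 - w(T_2, C_i)\bigr) \\
                    &= 1 - (1 - 0.3)(1 - 0.5) = 1 - 0.35 = 0.65, \\
p(\bar{c}_i \mid pa(C_i)) &= 1 - 0.65 = 0.35 .
\end{align*}
```

Each present term acts as an independent cause that can trigger the class, so adding a present term with a positive weight can only raise the probability of the class.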

The models: OR Gate Bayesian Network Classifier (2)
We instantiate all the terms of a document: $p(t_k \mid d_j) = 1$ if $t_k \in d_j$, and 0 otherwise.
We replicate each term node $T_k$ as many times as its frequency $n_{jk}$ in the document.
The posterior probability is then very easy to compute (it depends only on the terms appearing in the document):
$$p(c_i \mid d_j) = 1 - \prod_{T_k \in Pa(C_i) \cap d_j} \bigl(1 - w(T_k, C_i)\bigr)^{n_{jk}} \qquad (3)$$
Two estimation schemes for the weights:
Maximum likelihood: $w(T_k, C_i) = \frac{N_{ik}}{N_k}$.
Better approximation: $w(T_k, C_i) = \frac{N_{ik}}{N_k} \times \prod_{h \neq k} \frac{(N_i - N_{ih})\, N}{(N - N_h)\, N_i}$.
See the paper for details!
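The posterior in equation (3) with maximum likelihood weights can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' code. It takes N_ik as the number of occurrences of term k in training documents of class i and N_k as its occurrences in all training documents, which the slides do not define explicitly. The function names are hypothetical, and the "better approximation" weights are not covered.

```python
from collections import Counter


def ml_weights(docs, labels, target_class):
    """Maximum likelihood weights w(T_k, C_i) = N_ik / N_k for one class.

    Assumption: N_ik and N_k are term occurrence counts (in the class and
    in the whole training set, respectively).
    """
    occurrences_all = Counter()        # N_k
    occurrences_in_class = Counter()   # N_ik
    for doc, label in zip(docs, labels):
        occurrences_all.update(doc)
        if label == target_class:
            occurrences_in_class.update(doc)
    return {t: occurrences_in_class[t] / n_k for t, n_k in occurrences_all.items()}


def or_gate_posterior(doc, weights):
    """p(c_i | d_j) = 1 - prod_k (1 - w(T_k, C_i)) ** n_jk over terms in the document."""
    product = 1.0
    for term, n_jk in Counter(doc).items():
        w = weights.get(term, 0.0)     # unseen terms contribute nothing
        product *= (1.0 - w) ** n_jk
    return 1.0 - product


# Toy training data, invented purely for illustration.
docs = [["goal", "match"], ["match", "vote"], ["vote", "election"]]
labels = ["sport", "sport", "politics"]
sport_weights = ml_weights(docs, labels, "sport")
print(or_gate_posterior(["vote", "vote"], sport_weights))
```

Because one OR gate is built per class, a document is scored against each class separately and the posteriors need not sum to one across classes.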

Document representation: example XML file
<book>
  <title>El ingenioso hidalgo Don Quijote de la Mancha</title>
  <author>Miguel de Cervantes Saavedra</author>
  <contents>
    <chapter>Uno</chapter>
    <text>En un lugar de La Mancha de cuyo nombre no quiero acordarme...</text>
  </contents>
</book>

Document representation, method 1: "only text"
Only text (removing all tags).
Quijote
El ingenioso hidalgo Don Quijote de la Mancha Miguel de Cervantes Saavedra Uno En un lugar de La Mancha de cuyo nombre no quiero acordarme...
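The "only text" transformation can be sketched with Python's standard `xml.etree.ElementTree` module. The function name `only_text` is hypothetical, and collapsing whitespace into single spaces is an assumption of this sketch; the slides do not describe any tokenization or normalization.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<book>
  <title>El ingenioso hidalgo Don Quijote de la Mancha</title>
  <author>Miguel de Cervantes Saavedra</author>
  <contents>
    <chapter>Uno</chapter>
    <text>En un lugar de La Mancha de cuyo nombre no quiero acordarme...</text>
  </contents>
</book>"""


def only_text(xml_string):
    """Drop every tag and return the remaining text in document order."""
    root = ET.fromstring(xml_string)
    # itertext() yields all text fragments, including indentation whitespace;
    # split() + join() collapses that whitespace into single spaces.
    return " ".join(" ".join(root.itertext()).split())


print(only_text(BOOK_XML))
```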

Document representation, method 5: "text replication"
Text replication (term frequencies are replicated several times, depending on the tag containing the terms).
Replication values: title 1, author 0, chapter 0, text 2.
Quijote
El ingenioso hidalgo Don Quijote de la Mancha En En un un lugar lugar de de La La Mancha Mancha de de cuyo cuyo nombre nombre no no quiero quiero acordarme acordarme...
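A possible sketch of text replication in Python, using the replication values from this slide. The function name `replicate_text`, the `default` parameter for tags missing from the table, and the choice to attribute each term to its innermost enclosing tag are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<book>
  <title>El ingenioso hidalgo Don Quijote de la Mancha</title>
  <author>Miguel de Cervantes Saavedra</author>
  <contents>
    <chapter>Uno</chapter>
    <text>En un lugar de La Mancha de cuyo nombre no quiero acordarme...</text>
  </contents>
</book>"""

# Replication values taken from the slide.
REPLICATION = {"title": 1, "author": 0, "chapter": 0, "text": 2}


def replicate_text(xml_string, replication, default=1):
    """Repeat each term as many times as the value assigned to its enclosing tag.

    A value of 0 removes the tag's text altogether. Tail text (text that
    follows a closing tag) is ignored in this sketch.
    """
    root = ET.fromstring(xml_string)
    terms = []
    for element in root.iter():                 # document order
        if element.text and element.text.strip():
            times = replication.get(element.tag, default)
            for term in element.text.split():
                terms.extend([term] * times)
    return " ".join(terms)


print(replicate_text(BOOK_XML, REPLICATION))
```

Splitting on whitespace keeps punctuation attached to the last term, so the final token differs slightly from the slide; a real pipeline would presumably tokenize and strip punctuation first.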

Results
Five runs were submitted to the track:
(1) Naive Bayes, only text, no term selection. Microaverage: 0.77630. Macroaverage: 0.58536.
(2) Naive Bayes, replication (id=2), no term selection. Microaverage: 0.78107 (+0.6%). Macroaverage: 0.6373 (+8.9%).
(3) OR gate, maximum likelihood, replication (id=8), selection by MI. Microaverage: 0.75097. Macroaverage: 0.61973.
(4) OR gate, maximum likelihood, replication (id=5), selection by MI. Microaverage: 0.75354. Macroaverage: 0.61298.
(5) OR gate, better approximation, only text, ≥ 2. Microaverage: 0.78998. Macroaverage: 0.76054.
See the paper for the text replication values (id=2,5,8).
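To illustrate why the micro- and macroaverage of one run can differ so much, here is a small Python sketch. The slides do not state which per-class measure is averaged; this example assumes per-class recall in a single-label setting, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def micro_macro_recall(true_labels, predicted_labels):
    """Return (microaverage, macroaverage) of per-class recall.

    Micro: pooled over all documents, so large classes dominate.
    Macro: mean of the per-class values, so every class counts equally.
    """
    per_class_total = Counter(true_labels)
    per_class_correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    micro = sum(per_class_correct.values()) / len(true_labels)
    macro = sum(
        per_class_correct[c] / per_class_total[c] for c in per_class_total
    ) / len(per_class_total)
    return micro, macro


# Toy example: a large class "a" handled well, a small class "b" handled poorly.
true_labels = ["a"] * 8 + ["b"] * 2
predicted_labels = ["a"] * 8 + ["a", "b"]
micro, macro = micro_macro_recall(true_labels, predicted_labels)
print(micro, macro)
```

In this toy case the small class drags the macroaverage well below the microaverage, the same pattern as in run (1).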

Conclusions (also reached in our previous experiments)
Naive Bayes performs poorly on sparsely populated categories (low macroaverage).
Tagging and adding does not seem to work well without feature selection.
Text replication improves the macroaverage (good for Naive Bayes!).
The OR gate with maximum likelihood weights needs a per-class feature selection method (mutual information).
The better approximation for the OR gate is our best classifier in previous experiments on flat text.

Thank you very much! Questions or comments?
