trt 10


Published on January 15, 2008

Author: Patrizia


Slide 1: Text Retrieval and Mining #10: Text Classification
Lecture by Young Hwan CHO, Ph.D.

Is this spam?
From: "" <>
Subject: real estate is the only way... gem oalvgkay
Anyone can buy real estate with no money down. Stop paying rent TODAY! There is no need to spend hundreds or even thousands for similar courses. I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook. Change your life NOW! Click below to order.

Categorization/Classification
Given:
- A description of an instance, x ∈ X, where X is the instance language or instance space. (Issue: how to represent text documents.)
- A fixed set of categories: C = {c1, c2, ..., cn}
Determine:
- The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
We want to know how to build categorization functions ("classifiers").
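The representation issue mentioned above is commonly addressed with a bag-of-words model. A minimal sketch of a categorization function c(x) with C = {"spam", "not-spam"}; the tokenizer and the keyword list are illustrative assumptions, not from the lecture:

```python
from collections import Counter

def bag_of_words(text):
    """Represent a document as a multiset of lowercase whitespace-separated tokens."""
    return Counter(text.lower().split())

# Hypothetical hand-coded rule: flag a document containing
# two or more occurrences of these (made-up) trigger words.
SPAM_WORDS = {"buy", "order", "incredible", "now"}

def c(text):
    words = bag_of_words(text)
    hits = sum(count for word, count in words.items() if word in SPAM_WORDS)
    return "spam" if hits >= 2 else "not-spam"

print(c("Anyone can buy real estate with no money down. Order NOW!"))  # prints "spam"
```

Even this toy rule shows why hand-coded systems need careful refinement: the naive tokenizer misses "NOW!" because of the trailing punctuation.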
Text Categorization Examples
Assign labels to each document or web page:
- Labels are most often topics, such as Yahoo! categories, e.g., "finance", "sports", "news > world > asia > business"
- Labels may be genres, e.g., "editorials", "movie-reviews", "news"
- Labels may be opinions, e.g., "like", "hate", "neutral"
- Labels may be domain-specific and binary, e.g., "interesting-to-me" : "not-interesting-to-me", "spam" : "not-spam", "is a toner cartridge ad" : "isn't"

Methods (1)
Manual classification:
- Used by Yahoo!, Looksmart, ODP, Medline
- Very accurate when the job is done by experts
- Consistent when the problem size and the team are small
- Difficult and expensive to scale
Automatic document classification with hand-coded rule-based systems:
- Used by the CS department's spam filter, Reuters, CIA, Verity, ...
- E.g., assign a category if the document contains a given Boolean combination of words
- Commercial systems have complex query languages (everything in IR query languages, plus accumulators)
- Accuracy is often very high if a query has been carefully refined over time by a subject expert
- Building and maintaining these queries is expensive

Methods (2)
Supervised learning of the document-label assignment function:
- Many new systems rely on machine learning (Autonomy, Kana, MSN, Verity, ...): k-Nearest Neighbors (simple, powerful), Naive Bayes (simple, common method), support vector machines (newer, more powerful), plus many other methods
- Requires hand-classified training data, but that data can be built (and refined) by amateurs: a labeled training set can be constructed more cheaply than having experts write rules.

Document Categorization System Architecture (diagram)

Models to Decide on for Document Categorization
- Which words to use for categorization: which words should be selected?
- The importance of each word in each category: given the documents of a category, how important are those words?
- Comparing a new document against each category representation: given the distribution of those words in the new document, what is the probability that it belongs to the current category?

Document Classification (example figure)
Classes: (AI), (Programming), (HCI), ...
Training data, e.g.: "learning intelligence algorithm reinforcement network...", "planning temporal reasoning plan language...", "programming semantics language proof...", "garbage collection memory optimization region...", under topics such as ML, Planning, Semantics, Garb. Coll., Multimedia, GUI.
Testing data, e.g.: "planning language proof intelligence"

The Naive Bayes Classifier
Conditional independence assumption: features are independent of each other given the class.

Text Classification Algorithms: Learning
From the training corpus, extract the Vocabulary. Calculate the required P(cj) and P(xk | cj) terms. For each cj in C:
- docsj <- the subset of documents whose target class is cj
- Textj <- a single document containing all of docsj
- for each word xk in Vocabulary: nk <- the number of occurrences of xk in Textj

Text Classification Algorithms: Classifying
positions <- all word positions in the current document that contain tokens found in Vocabulary.
Return cNB: for all the words appearing in the current document, compute the probability (likelihood) of each class, and choose the class with the highest value.

Naive Bayes Time Complexity
Training time: O(|D| Ld + |C||V|), where Ld is the average length of a document in D. This assumes V and all Di, ni, and nij are pre-computed in O(|D| Ld) time during one pass through all of the data. Generally just O(|D| Ld), since usually |C||V| < |D| Ld.
Test time: O(|C| Lt), where Lt is the average length of a test document.
Very efficient overall: linearly proportional to the time needed just to read in all the data.

Naive Bayes Models
- Multivariate Bernoulli model: each document is represented as a vector of binary attributes recording, for each term, whether it occurs in the document or not; the probability that an input document belongs to a category is computed from term presence/absence rather than term frequency.
- Multinomial model: computes the probability values from term frequencies.

Example: AutoYahoo!
Classify 13,589 Yahoo!
webpages in the "Science" subtree into 95 different topics (hierarchy depth 2).

Sample Learning Curve (Yahoo Science data)

WebKB Experiment
- Train on ~5,000 hand-labeled web pages (Cornell, Washington, U. Texas, Wisconsin)
- Crawl and classify a new site (CMU)
- Results:

NB Model Comparison

Feature Selection: Why?
- Text collections have a large number of features: 10,000 to 1,000,000 unique words, and more
- Make using a particular classifier feasible: some classifiers can't deal with hundreds of thousands of features
- Reduce training time: training time for some methods is quadratic or worse in the number of features (e.g., logistic regression)
- Improve generalization: eliminate noise features, avoid overfitting

Recap: Feature Reduction
Standard ways of reducing the feature space for text:
- Stemming: laugh, laughs, laughing, laughed -> laugh
- Stop word removal: e.g., eliminate all prepositions
- Conversion to lower case
- Tokenization: break on all special characters: fire-fighter -> fire, fighter

Feature Selection
Yang and Pedersen 1997: a comparison of different selection criteria:
- DF: document frequency
- IG: information gain
- MI: mutual information
- CHI: chi square
Common strategy: compute the statistic for each term and keep the n terms with the highest values of that statistic.

Information Gain
Information gain is a technique frequently used in machine learning. It computes the amount of information a term contributes to each category, taking into account not only how often the term occurs in documents but also how often it does not occur. With {c1, c2, ..., cm} as the category set, the information gain of a term t is given by the following formula.

Mutual Information
The mutual information measure is a technique commonly used in statistical language modeling. The more often a term t occurs in category c, the larger the mutual information of t in category c. The mutual information of term t in category c is given by the following formula.

Chi-Square
The chi-square statistic measures the dependency between a term t and a category c, and can be compared against the chi-square distribution with one degree of freedom. The chi-square statistic is computed as follows:
χ²(t,c) = N (AD - BC)² / ((A+B)(A+C)(B+D)(C+D))

If term t and category c are completely independent, the value is 0. Use either the maximum or the average χ² over categories.

Document Frequency
- The number of documents a term occurs in
- Sometimes used for eliminating both very frequent and very infrequent terms
- How is the document frequency measure different from the other three measures?

Yang & Pedersen: Experiments
Two classification methods:
- kNN (k nearest neighbors; more later)
- Linear least squares fit (a regression method)
Collections:
- Reuters-22173: 92 categories, 16,000 unique terms
- Ohsumed (a subset of Medline): 14,000 categories, 72,000 unique terms
ltc term weighting.

Yang & Pedersen: Experiments (continued)
- Choose the feature set size
- Preprocess the collection, discarding non-selected features/words
- Apply term weighting -> a feature vector for each document
- Train the classifier on the training set
- Evaluate the classifier on the test set

Yang & Pedersen: Discussion
- You can eliminate 90% of the features for IG, DF, and CHI without decreasing performance. In fact, performance increases with fewer features for IG, DF, and CHI.
- Mutual information is very sensitive to small counts.
- IG does best with the smallest number of features.
- Document frequency is close to optimal, and is by far the simplest feature selection method.
- MI ignores the non-occurrence of terms.

Results
Why is selecting common terms a good strategy?

IG, DF, and CHI Are Correlated

Text Classification (outline)
- Introduction to text classification
- k nearest neighbors
- Decision boundaries
- Vector space classification using centroids
- Decision trees (briefly)

Recall: Vector Space Representation
- Each document is a vector, one component for each term (= word).
- Normalize to unit length.
- High-dimensional vector space: terms are axes; 10,000+ dimensions, or even 100,000+; documents are vectors in this space.

Classification Using Vector Spaces
- Each training document is a point (vector) labeled by its topic (= class)
- Hypothesis: documents of the same class form a contiguous region of space
- We define surfaces to delineate classes in the space

Test Document = Government? (figure: regions for Government, Science, Arts)
Is the similarity hypothesis true in general?

k Nearest Neighbor Classification
To classify document d into class c:
- Define the k-neighborhood N as the k nearest neighbors of d
- Count the number i of documents in N that belong to c
- Estimate P(c|d) as i/k
- Choose as the class argmax_c P(c|d) [= the majority class]

Example: k = 6 (6NN) (figure: regions for Government, Science, Arts; P(science|d)?)

kNN: Discussion
- No feature selection necessary
- Scales well with a large number of classes: no need to train n classifiers for n classes
- Classes can influence each other: small changes to one class can have a ripple effect
- Scores can be hard to convert to probabilities
- No training necessary (actually, perhaps not true: data editing, etc.)
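The kNN procedure above can be sketched directly over unit-length document vectors, where cosine similarity reduces to a dot product. The tiny training set and its class labels are illustrative assumptions:

```python
import math
from collections import Counter

def unit_vector(tokens):
    """Bag-of-words vector, normalized to unit length."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {term: v / norm for term, v in counts.items()}

def cosine(u, v):
    # For unit-length vectors, cosine similarity is just the dot product.
    return sum(w * v.get(term, 0.0) for term, w in u.items())

def knn_classify(train, doc, k=3):
    """train: list of (tokens, class). Estimate P(c|d) = i/k; return the majority class."""
    vec = unit_vector(doc)
    neighbors = sorted(train, key=lambda tc: cosine(unit_vector(tc[0]), vec),
                       reverse=True)[:k]
    votes = Counter(c for _, c in neighbors)
    return votes.most_common(1)[0][0]

train = [
    (["budget", "tax", "election"], "Government"),
    (["senate", "election", "law"], "Government"),
    (["experiment", "physics", "theory"], "Science"),
    (["genome", "experiment", "cell"], "Science"),
]
print(knn_classify(train, ["election", "tax", "law"], k=3))  # prints "Government"
```

Note the cost profile matches the slides: there is no training step at all, and classifying one document scans every training vector.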
Decision Tree Classification
- A tree with internal nodes labeled by terms
- Branches are labeled by tests on the weight that the term has
- Leaves are labeled by categories
- The classifier categorizes a document by descending the tree, following the tests to a leaf; the label of the leaf node is then assigned to the document
- Most decision trees are binary trees (never disadvantageous; may require extra internal nodes)
- Decision trees make good use of a few high-leverage features

Decision Tree Categorization: Example

Decision Tree Learning
- Learn a sequence of tests on features, typically using top-down, greedy search
- At each stage, choose the unused feature with the highest information gain (feature/class MI)
- Binary (yes/no) or continuous decisions

Decision Tree Learning (continued)
- Fully grown trees tend to have decision rules that are overly specific and are therefore unable to categorize documents well; pruning or early-stopping methods for decision trees are therefore normally a standard part of classification packages
- Using a small number of features is potentially bad in text categorization, but in practice decision trees do well on some text classification tasks
- Decision trees are very easily interpreted by humans, much more easily than probabilistic methods like Naive Bayes
- Decision trees are normally regarded as a symbolic machine learning algorithm, though they can be used probabilistically

Category: "interest" (Dumais et al., Microsoft) Decision Tree

Summary
Process:
- Learning: feature selection, term weighting
- Classification proper: Bayesian probability models, k-nearest neighbors, support vector machines, linear classification, decision trees, neural networks
Feature selection:
- Document frequency, mutual information, information gain, chi-square (χ² statistic)
Term weighting:
- Boolean weighting, word frequency weighting, TF-IDF weighting, entropy weighting, inverse category frequency weighting
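As a worked sketch of the Bayesian probability model summarized above, the following follows the Learning/Classifying pseudocode from the slides: a multinomial Naive Bayes with add-one (Laplace) smoothing. The toy corpus echoes the (AI)/(Programming) classes from the document-classification example; its exact tokens are an illustrative assumption:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, class). Estimate P(cj) and P(xk | cj) with add-one smoothing."""
    vocab = {w for tokens, _ in docs for w in tokens}
    prior, cond = {}, defaultdict(dict)
    for c in {cls for _, cls in docs}:
        class_docs = [tokens for tokens, cls in docs if cls == c]   # docs_j
        prior[c] = len(class_docs) / len(docs)                      # P(cj)
        text_c = Counter(w for tokens in class_docs for w in tokens)  # Text_j: n_k counts
        total = sum(text_c.values())
        for w in vocab:
            cond[c][w] = (text_c[w] + 1) / (total + len(vocab))     # P(xk | cj), smoothed
    return vocab, prior, cond

def classify_nb(model, tokens):
    """Return c_NB = argmax_c [log P(c) + sum over in-vocabulary positions of log P(xk|c)]."""
    vocab, prior, cond = model
    positions = [w for w in tokens if w in vocab]
    scores = {c: math.log(prior[c]) + sum(math.log(cond[c][w]) for w in positions)
              for c in prior}
    return max(scores, key=scores.get)

docs = [
    (["planning", "temporal", "reasoning", "plan"], "AI"),
    (["learning", "intelligence", "algorithm", "network"], "AI"),
    (["programming", "semantics", "language", "proof"], "Programming"),
    (["garbage", "collection", "memory", "optimization"], "Programming"),
]
model = train_nb(docs)
print(classify_nb(model, ["planning", "reasoning", "proof", "intelligence"]))  # prints "AI"
```

Training makes one pass over the data to build the counts and classification touches each class once per test token, matching the O(|D| Ld + |C||V|) training and O(|C| Lt) test complexities given above.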
