On cascading small decision trees


Published on June 18, 2008

Author: jminguillona

Source: slideshare.net

On Cascading Small Decision Trees

Julià Minguillón

Combinatorics and Digital Communications Group (CCD), Autonomous University of Barcelona (UAB), Barcelona, Spain

http://www.tesisenxarxa.net/TESIS_UAB/AVAILABLE/TDX-1209102-150635/jma1de1.pdf

Table of contents

Introduction

Decision trees

Combining classifiers

Experimental results

Theoretical issues

Conclusions

Further research

References

Introduction

Main goal: to build simple and fast classifiers for data mining

Partial goals:

To reduce both training and exploitation costs

To increase classification accuracy

To permit partial classification

Several classification systems could be used: decision trees, neural networks, support vector machines, nearest neighbour classifier, etc.

Decision trees

Introduced by Quinlan in 1983 and developed by Breiman et al. in 1984

Decision trees reproduce the way humans take decisions: a path of questions is followed from the input sample to the output label

Decision trees are based on recursive partitioning of the input space, trying to separate elements from different classes

Supervised training → labelled data is used for training

Why decision trees?

Natural handling of data of mixed types

Handling of missing values

Robustness to outliers in input space

Insensitive to monotone transformations

Computational scalability

Ability to deal with irrelevant inputs

Interpretability

Growing decision trees (binary)

T = (data set)   /* initially the tree is a single leaf */
while stoppingCriterion(T) is false
    select t from T maximising selectionCriterion(t)
    split t into (tL, tR) maximising splittingCriterion(t, tL, tR)
    replace t in T with (tL, tR)
end
prune back T using the BFOS algorithm
choose T' minimising classification error on (data set')
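Below is a minimal Python sketch of this growing loop, for illustration only: it assumes numeric features, uses entropy as the splitting criterion and a maximum-depth stopping rule, expands every impure leaf rather than selecting one via a selection criterion, and omits the BFOS pruning and the validation-based choice of T'. The names are not the author's implementation.

import numpy as np

def entropy(y):
    """Impurity of a node: Shannon entropy of its class distribution."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    """Exhaustive search for the orthogonal split with the largest impurity decrease."""
    best_feature, best_threshold, best_gain = None, None, 0.0
    parent = entropy(y)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= thr
            gain = parent - (left.mean() * entropy(y[left]) +
                             (~left).mean() * entropy(y[~left]))
            if gain > best_gain:
                best_feature, best_threshold, best_gain = j, thr, gain
    return best_feature, best_threshold

def grow(X, y, depth=0, max_depth=3):
    """Recursive partitioning; a node becomes a leaf when it is pure, too deep,
    or no split improves the impurity."""
    j, thr = (None, None) if depth == max_depth or len(np.unique(y)) == 1 else best_split(X, y)
    if j is None:
        labels, counts = np.unique(y, return_counts=True)
        return {"leaf": True, "label": labels[np.argmax(counts)]}
    left = X[:, j] <= thr
    return {"leaf": False, "feature": j, "threshold": thr,
            "left": grow(X[left], y[left], depth + 1, max_depth),
            "right": grow(X[~left], y[~left], depth + 1, max_depth)}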

Growing algorithm parameters

The computed decision tree is determined by:

Stopping criterion

Node selection criterion

Splitting criterion

Labelling rule

If a perfect decision tree is built and then it is pruned back, both the stopping and the node selection criteria become irrelevant

Splitting criterion

Measures the gain of a split for a given criterion

Usually related to the concept of impurity

Classification performance may be very sensitive to such criterion

Entropy and R-norm criteria yield the best results on average, the Bayes error criterion the worst

Different kinds of splits:

Orthogonal hyperplanes: fast, interpretable, poor performance

General hyperplanes: expensive, partially interpretable

Distance based (spherical trees): expensive, allow clustering
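As a small illustration of how the choice of criterion changes the measured gain of a split, the sketch below compares three standard impurity functions (entropy, Gini and the misclassification/Bayes-error rate) on the same candidate split; the R-norm criterion used in the thesis is not reproduced here, and the example data is made up.

import numpy as np

def class_probs(y, n_classes):
    return np.bincount(y, minlength=n_classes) / len(y)

impurities = {
    "entropy": lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0])),
    "gini": lambda p: 1.0 - np.sum(p ** 2),
    "bayes_error": lambda p: 1.0 - p.max(),   # error of the majority label
}

def split_gain(y, y_left, y_right, n_classes, criterion):
    """Impurity decrease i(t) - p_L i(t_L) - p_R i(t_R) for a candidate split."""
    i = impurities[criterion]
    p_left = len(y_left) / len(y)
    return (i(class_probs(y, n_classes))
            - p_left * i(class_probs(y_left, n_classes))
            - (1 - p_left) * i(class_probs(y_right, n_classes)))

# A split that isolates class 0 and leaves classes 1 and 2 mixed on the right.
y = np.array([0, 0, 0, 0, 1, 1, 2, 2])
for name in impurities:
    print(name, round(split_gain(y, y[:4], y[4:], 3, name), 3))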

Labelling rule

Each leaf t is labelled in order to minimise misclassification error:

l(t) = argmin_j { r(t) = Σ_{k=0..K-1} C(j,k) p(k|t) }

Different classification costs C(j,k) are allowed

A priori class probabilities may be included

Margin is defined as 1 − 2 r(t), or alternatively as max_k { p(k|t) } − 2nd max_k { p(k|t) }
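A direct transcription of this labelling rule and of the two margin definitions, assuming the class posteriors p(k|t) and the cost matrix C(j,k) are available as NumPy arrays (the function name is illustrative):

import numpy as np

def label_leaf(p, C):
    """p[k] = p(k|t), C[j, k] = cost of deciding j when the true class is k.
    Returns the label l(t) minimising the expected cost r(t), together with
    r(t) and the two margin definitions from the slides."""
    risks = C @ p                       # r_j(t) = sum_k C(j, k) p(k|t)
    label = int(np.argmin(risks))
    r = float(risks[label])
    margin_cost = 1.0 - 2.0 * r         # margin as 1 - 2 r(t)
    top2 = np.sort(p)[-2:]
    margin_gap = float(top2[1] - top2[0])   # max p(k|t) - 2nd max p(k|t)
    return label, r, margin_cost, margin_gap

# Example: three classes, 0/1 costs (C(j, k) = 1 for j != k, 0 otherwise).
p = np.array([0.6, 0.3, 0.1])
print(label_leaf(p, 1.0 - np.eye(3)))   # -> (0, 0.4, 0.2, 0.3)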

Problems

Repetition, replication and fragmentation

Poor performance for large data dimensionality or large number of classes

Orthogonal splits may lead to poor classification performance due to weak internal decision functions

Overfitting may occur for large decision trees

Training is very expensive for large data sets

Decision trees are unstable classifiers

Progressive decision trees

Goal: to overcome some problems related to the use of classical decision trees

Basic idea: to break the classification problem into a sequence of partial classification problems, from easier to more difficult

Only small decision trees are used:

Avoid overfitting

Reduce both training and exploitation costs

Permit partial classification

Detect possible outliers

Decision trees become decision graphs

Growing progressive decision trees

Build a complete decision tree of depth d

Prune it using the BFOS algorithm

Relabel it using the new labelling rule: a leaf is labelled as mixed if its margin is not large enough (i.e., below a given threshold)

Join all regions labelled as mixed

Start again using only the mixed regions
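A rough sketch of this loop using scikit-learn's decision tree as a stand-in base learner: its minimal cost-complexity pruning (ccp_alpha) plays the role of the BFOS pruning, and the margin threshold (whose symbol is garbled in the slides) is written delta here. Everything is illustrative, not the author's code.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_progressive(X, y, depth=3, delta=0.4, stages=3, ccp_alpha=0.0):
    """Train a cascade of small trees; samples whose margin (max posterior
    minus second-largest posterior) falls below delta are treated as 'mixed'
    and become the training set of the next stage."""
    trees = []
    X_cur, y_cur = X, y
    for _ in range(stages):
        if len(np.unique(y_cur)) < 2:
            break
        tree = DecisionTreeClassifier(max_depth=depth, ccp_alpha=ccp_alpha)
        tree.fit(X_cur, y_cur)
        trees.append(tree)
        proba = np.sort(tree.predict_proba(X_cur), axis=1)
        margin = proba[:, -1] - proba[:, -2]
        mixed = margin < delta                     # regions relabelled as mixed
        if not mixed.any():
            break
        X_cur, y_cur = X_cur[mixed], y_cur[mixed]  # start again on the mixed regions
    return trees

Classification then runs the stages in order and stops at the first stage whose margin clears the threshold; samples still mixed after the last stage can be reported as unclassified (partial classification).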

Example (I): [figure: partition of the input space into regions labelled 0, 1 and M (mixed)]

Example (II): [figure: the mixed regions are partitioned again into regions labelled 0, 1 and M]

Example (III): [figure: resulting labelling, with a remaining mixed region]

Combining classifiers

Basic idea: instead of building a complex classifier, build several simple classifiers and combine them into a more complex one

Several paradigms:

Voting: bagging, boosting, randomising

Stacking

Cascading

Why do they work? Because different classifiers make different kinds of mistakes

Different classifiers are built by using different training sets

Cascading generalization

Developed by Gama et al. in 2000

Basic idea: simple classifiers are combined sequentially, carrying information over from one classifier to the next in the sequence

Three types of cascading ensembles:

Type A: no additional info, mixed class

Type B: additional info, no mixed class

Type C: additional info, mixed class

Type A progressive decision trees

No additional info is carried from one stage to the next, but only samples labelled as mixed are passed down:

Type B progressive decision trees

Additional info (estimated class probabilities and margin) is computed for each sample, and all samples are passed down:

Type C progressive decision trees

Additional info is computed for each sample, and only samples labelled as mixed are passed down:
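The only difference between the three types is what a stage hands to the next one. A schematic sketch of that data flow, assuming the current stage exposes estimated class probabilities and a boolean mask of the samples it labelled as mixed (names are illustrative):

import numpy as np

def next_stage_data(kind, X, y, proba, mixed):
    """Build the training set for the next stage of the cascade.
    kind  -- "A", "B" or "C"
    proba -- class probabilities estimated by the current stage
    mixed -- boolean mask of samples labelled as mixed by the current stage"""
    s = np.sort(proba, axis=1)
    margin = s[:, -1] - s[:, -2]
    extra = np.column_stack([proba, margin])        # the additional information
    if kind == "A":     # no additional info, only mixed samples are passed down
        return X[mixed], y[mixed]
    if kind == "B":     # additional info, all samples are passed down
        return np.hstack([X, extra]), y
    if kind == "C":     # additional info, only mixed samples are passed down
        return np.hstack([X[mixed], extra[mixed]]), y[mixed]
    raise ValueError(kind)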

Experimental results

Four different projects (the first three are real projects):

Document layout recognition

Hyperspectral imaging

Brain tumour classification

UCI collection → evaluation

Basic tools for evaluation:

N-fold cross-validation

bootstrapping

bias-variance decomposition

Document layout recognition (I)

Goal: adaptive compression for an automated document storage system using lossy/lossless JPEG standard

Four classes: background (removed), text (OCR), line drawings (lossless) and images (lossy)

Documents are 8.5” x 11.7” at 150 dpi

Target block size: 8 x 8 pixels (JPEG standard)

Minguillón, J. et al., Progressive classification scheme for document layout recognition, Proc. of the SPIE, Denver, CO, USA, v. 3816:241-250, 1999

Document layout recognition (II)

Classical approach: a single decision tree with a block size of 8 x 8 pixels

Size      Num. Blocks        |T|   R      d_max   Error
8 x 8     211200 / 211200    721   8.56   38      0.078

Document layout recognition (III)

Progressive approach: four block sizes (64 x 64, 32 x 32, 16 x 16 and 8 x 8)

Size      Num. Blocks        |T|   R      d_max   Error
64 x 64   3360 / 3360        6     2.77   4       0.089
32 x 32   7856 / 13440       14    4.17   6       0.047
16 x 16   21052 / 53760      11    3.72   6       0.042
8 x 8     27892 / 215040     18    4.73   8       0.065

Hyperspectral imaging (I)

Image size is 710 x 4558 pixels x 14 bands (available ground truth data is only 400 x 2400)

Ground truth data presents some artifacts due to low resolution: around 10% mislabelled

19 classes including asphalt, water, rocks, soil and several vegetation types

Goal: to build a classification system and to identify the most important bands for each class, but also to detect possible outliers in the training set

Minguillón, J. et al., Adaptive lossy compression and classification of hyperspectral images, Proc. of Remote Sensing VI, Barcelona, Spain, v. 4170:214-225, 2000

Hyperspectral imaging (II)

Classical approach:

Tree   |T|   R      P_T     Error
T1     836   9.83   1.0     0.163

Using the new labelling rule:

Tree   |T|   R      P_T     Error
T2     650   9.60   0.722   0.092

Hyperspectral imaging (III)

Progressive approach:

Tree   |T|   R      P_T     Error
T3     44    4.84   0.706   0.094
T3A    9     3.02   0.523   0.056
T3B    8     2.14   0.383   0.199

Brain tumour classification (I)

Goal: to build a classification system for helping clinicians to identify brain tumour types

Too many classes and too few samples: a hierarchical structure partially reproducing the WHO tree has been created

Different classifiers (LDA, k-NN, decision trees) are combined using a mixture of cascading and voting schemes

Minguillón, J. et al., Classifier combination for in vivo magnetic resonance spectra of brain tumours, Proc. of Multiple Classifier Systems, Cagliari, Italy, LNCS 2364

Brain tumour classification (II)

Each classification stage is: [block diagram: input X feeds k-NN, LDA and a decision tree; a voting step V produces the output Y]

Decision trees use LDA class distances as additional information

“Unknown” means classifiers disagree
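A sketch of one such stage with scikit-learn stand-ins for the three base classifiers, assuming LDA decision scores serve as the “class distances” appended to the decision tree's input and that unanimity is required for a prediction; the combination scheme in the cited paper mixes cascading and voting and is more elaborate than this.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def fit_stage(X, y):
    lda = LinearDiscriminantAnalysis().fit(X, y)
    knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
    # The decision tree also sees the LDA class scores as additional features.
    X_dt = np.hstack([X, lda.decision_function(X).reshape(len(X), -1)])
    dt = DecisionTreeClassifier(max_depth=3).fit(X_dt, y)
    return lda, knn, dt

def predict_stage(models, X, unknown=-1):
    lda, knn, dt = models
    X_dt = np.hstack([X, lda.decision_function(X).reshape(len(X), -1)])
    votes = np.column_stack([lda.predict(X), knn.predict(X), dt.predict(X_dt)])
    agree = (votes[:, 0] == votes[:, 1]) & (votes[:, 1] == votes[:, 2])
    return np.where(agree, votes[:, 0], unknown)    # disagreement -> "unknown"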

Brain tumour classification (III)

[figure: hierarchical classification tree partially following the WHO tree, with per-node accuracies: Normal 100%, Tumour 99.5%, Benign 92.1%, Malignant 94.9%, Grade II 82.6%, Grade III 0%, Grade IV 94.7%, Astro 94.1%, Oligo 100%, Primary 81.8%, Secondary 91.4%; leaf groups: MN+SCH+HB, ASTII+OD, GLB+LYM+PNET+MET]

UCI collection

Goal: exhaustive testing of progressive decision trees

20 data sets were chosen:

No categorical variables

No missing values

Large range of number of samples, data dimension and number of classes

Available at http://kdd.ics.uci.edu

Experiments setup

N-fold cross-validation with N=3

For each training set, 25 bootstrap replicates are generated (sampling with replacement)

Each experiment is repeated 5 times and performance results are averaged

Bias-variance decomposition is computed for each repetition and then averaged
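A compact sketch of this protocol with scikit-learn as a stand-in for the author's decision-tree implementation; the fold, replicate and repetition counts follow the slide, while the estimator and its depth are arbitrary choices for illustration.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def run_experiment(X, y, n_folds=3, n_boot=25, n_repeats=5, seed=0):
    rng = np.random.RandomState(seed)
    errors = []
    for _ in range(n_repeats):
        folds = KFold(n_splits=n_folds, shuffle=True,
                      random_state=rng.randint(1 << 30))
        for train_idx, test_idx in folds.split(X):
            for _ in range(n_boot):
                # bootstrap replicate of the training fold
                Xb, yb = resample(X[train_idx], y[train_idx],
                                  random_state=rng.randint(1 << 30))
                clf = DecisionTreeClassifier(max_depth=3).fit(Xb, yb)
                errors.append(np.mean(clf.predict(X[test_idx]) != y[test_idx]))
    return float(np.mean(errors))   # averaged over folds, replicates and repetitions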

Bias-variance decomposition

Several approaches exist; the decomposition of Domingos (2000) is used here

First classifiers in a cascading ensemble should have moderate bias and low variance: small (but not too small) decision trees

Last classifiers should have small bias and moderate variance: large (but not too large) decision trees

Only classifiers with different bias-variance behaviour should be combined: the number of decision trees in the cascade should be small
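One common way to estimate these quantities for 0-1 loss, following Domingos (2000) and assuming noise-free labels for simplicity: the main prediction at each test point is the most voted label over the bootstrap-trained classifiers, bias is whether that main prediction is wrong, and variance is the average disagreement with it.

import numpy as np

def bias_variance_01(pred_matrix, y_true):
    """pred_matrix[b, i] = label predicted for test sample i by the classifier
    trained on bootstrap replicate b; y_true[i] = true label."""
    n_boot, n_samples = pred_matrix.shape
    bias = np.empty(n_samples)
    variance = np.empty(n_samples)
    for i in range(n_samples):
        labels, counts = np.unique(pred_matrix[:, i], return_counts=True)
        main = labels[np.argmax(counts)]           # main (most voted) prediction
        bias[i] = float(main != y_true[i])         # 0/1 bias at this point
        variance[i] = np.mean(pred_matrix[:, i] != main)
    return bias.mean(), variance.mean()

In the two-class, noise-free case the average 0-1 loss then decomposes as the mean bias plus the variance on unbiased points minus the variance on biased points.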

Empirical evaluation summary (I)

Bias usually predominates over variance on most data sets → decision trees outperform the k-NN classifier

Bias decreases fast when the decision tree has enough leaves

Variance shows unpredictable behaviour that depends on the intrinsic characteristics of each data set

Empirical evaluation summary (II)

Type B progressive decision trees usually outperform classical decision trees, mainly due to bias reduction. Two or three small decision trees are enough

Type A progressive decision trees do not outperform classical decision trees in general, but variance is reduced (classifiers are smaller and thus more stable)

Type C experiments are still running...

Theoretical issues

Decision trees are convex combinations of internal node decision functions:

T_j(x) = Σ_{i=1..|T_j|} p_ij α_ij h_ij(x)

Cascading is a convex combination of t decision trees:

T(x) = Σ_{j=1..t} q_j T_j(x)

Type A: the first decision tree is the most important

Type B: the last decision tree is the most important

Type C: not applicable
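Spelled out in LaTeX, and with the convexity constraints made explicit (the per-node coefficients, whose exact definition is given in the thesis, are written \alpha_{ij} here):

T_j(x) = \sum_{i=1}^{|T_j|} p_{ij}\,\alpha_{ij}\,h_{ij}(x),
\qquad
T(x) = \sum_{j=1}^{t} q_j\,T_j(x),
\qquad q_j \ge 0,\ \ \sum_{j=1}^{t} q_j = 1.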

Error generalization bounds

Convex combinations may be studied under the margin paradigm defined by Schapire et al.

Generalization error depends on the tree structure and on the VC dimension of the internal node decision functions

Unbalanced trees are preferable

Unbalanced classifiers are preferable

Modest goal: to check that current theory on classifier combination does not rule out progressive decision trees
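For reference, the margin bound of Schapire et al. (1998) underlying this analysis has, up to constants, the following form for a convex combination f of base classifiers drawn from a class of VC dimension d and a training sample S of size m; take the exact statement from the cited paper rather than from this sketch. With probability at least 1 - \delta, for all \theta > 0:

P_{\mathcal{D}}[\,y f(x) \le 0\,] \;\le\; P_{S}[\,y f(x) \le \theta\,]
  + O\!\left( \frac{1}{\sqrt{m}} \sqrt{ \frac{d \log^2(m/d)}{\theta^2} + \log\frac{1}{\delta} } \right)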

Conclusions

Progressive decision trees generalise classical decision trees and the cascading paradigm

Cascading is very useful for large data sets with a large number of classes → hierarchical structure

Preliminary experiments with type C progressive decision trees look promising…

Experiments with real data sets show that it is possible to improve classification accuracy and reduce both classification and exploitation costs at the same time

Fine tuning is absolutely necessary!...

Further research

The R-norm splitting criterion may be used to build adaptive decision trees

Better error generalisation bounds are needed

A complete and specific theoretical framework for the cascading paradigm must be developed

Parameters (the margin threshold, d and t) are currently set empirically; a better understanding of how to choose them is needed

New applications (huge data sets):

Web mining

DNA interpretation

Selected references

Breiman, L. et al., Classification and Regression Trees, Wadsworth International Group, 1984

Gama, J. et al., Cascade Generalization, Machine Learning 41(3):315-343, 2000

Domingos, P., A unified bias-variance decomposition and its applications, Proc. of the 17th Int. Conf. on Machine Learning, Stanford, CA, USA, 231-238, 2000

Schapire, R.E. et al., Boosting the margin: a new explanation for the effectiveness of voting methods, Annals of Statistics 26(5):1651-1686, 1998
