- What are the characteristics of anomaly detection?
- What are the detection problems and methods?
- What are the statistical approaches when there is an anomaly found?
- Compare and contrast proximity and clustering based approaches.

INTRODUCTION TO DATA MINING

SECOND EDITION

PANG-NING TAN

Michigan State University

MICHAEL STEINBACH

University of Minnesota

ANUJ KARPATNE

University of Minnesota

VIPIN KUMAR

University of Minnesota

330 Hudson Street, NY NY 10013

Director, Portfolio Management: Engineering, Computer Science & Global

Editions: Julian Partridge

Specialist, Higher Ed Portfolio Management: Matt Goldstein

Portfolio Management Assistant: Meghan Jacoby

Managing Content Producer: Scott Disanno

Content Producer: Carole Snyder

Web Developer: Steve Wright

Rights and Permissions Manager: Ben Ferrini

Manufacturing Buyer, Higher Ed, Lake Side Communications Inc (LSC):

Maura Zaldivar-Garcia

Inventory Manager: Ann Lam

Product Marketing Manager: Yvonne Vannatta

Field Marketing Manager: Demetrius Hall

Marketing Assistant: Jon Bryant

Cover Designer: Joyce Wells, jWellsDesign

Full-Service Project Management: Chandrasekar Subramanian, SPi Global

Copyright ©2019 Pearson Education, Inc. All rights reserved. Manufactured in

the United States of America. This publication is protected by Copyright, and

permission should be obtained from the publisher prior to any prohibited

reproduction, storage in a retrieval system, or transmission in any form or by

any means, electronic, mechanical, photocopying, recording, or likewise. For

information regarding permissions, request forms and the appropriate

contacts within the Pearson Education Global Rights & Permissions

department, please visit www.pearsonhighed.com/permissions/.

Many of the designations by manufacturers and sellers to distinguish their

products are claimed as trademarks. Where those designations appear in this

book, and the publisher was aware of a trademark claim, the designations

have been printed in initial caps or all caps.

Library of Congress Cataloging-in-Publication Data on File

Names: Tan, Pang-Ning, author. | Steinbach, Michael, author. | Karpatne,

Anuj, author. | Kumar, Vipin, 1956- author.

Title: Introduction to Data Mining / Pang-Ning Tan, Michigan State University,

Michael Steinbach, University of Minnesota, Anuj Karpatne, University of

Minnesota, Vipin Kumar, University of Minnesota.

Description: Second edition. | New York, NY : Pearson Education, [2019] |

Includes bibliographical references and index.

Identifiers: LCCN 2017048641 | ISBN 9780133128901 | ISBN 0133128903

Subjects: LCSH: Data mining.

Classification: LCC QA76.9.D343 T35 2019 | DDC 006.3/12–dc23 LC record

available at https://lccn.loc.gov/2017048641

ISBN-10: 0133128903

ISBN-13: 9780133128901

To our families …

Preface to the Second Edition

Since the first edition, roughly 12 years ago, much has changed in the field of

data analysis. The volume and variety of data being collected continue to

increase, as has the rate (velocity) at which it is being collected and used to

make decisions. Indeed, the term, Big Data, has been used to refer to the

massive and diverse data sets now available. In addition, the term data

science has been coined to describe an emerging area that applies tools and

techniques from various fields, such as data mining, machine learning,

statistics, and many others, to extract actionable insights from data, often big

data.

The growth in data has created numerous opportunities for all areas of data

analysis. The most dramatic developments have been in the area of predictive

modeling, across a wide range of application domains. For instance, recent

advances in neural networks, known as deep learning, have shown

impressive results in a number of challenging areas, such as image

classification, speech recognition, as well as text categorization and

understanding. While not as dramatic, other areas, e.g., clustering,

association analysis, and anomaly detection, have also continued to advance.

This new edition is in response to those advances.

Overview

As with the first edition, the second edition of the book provides a

comprehensive introduction to data mining and is designed to be accessible

and useful to students, instructors, researchers, and professionals. Areas

covered include data preprocessing, predictive modeling, association

analysis, cluster analysis, anomaly detection, and avoiding false discoveries.

The goal is to present fundamental concepts and algorithms for each topic,

thus providing the reader with the necessary background for the application of

data mining to real problems. As before, classification, association analysis,

and cluster analysis are each covered in a pair of chapters. The introductory

chapter covers basic concepts, representative algorithms, and evaluation

techniques, while the following chapter discusses more advanced concepts

and algorithms. As before, our objective is to provide the reader with a sound

understanding of the foundations of data mining, while still covering many

important advanced topics. Because of this approach, the book is useful both

as a learning tool and as a reference.

To help readers better understand the concepts that have been presented, we

provide an extensive set of examples, figures, and exercises. The solutions to

the original exercises, which are already circulating on the web, will be made

public. The exercises are mostly unchanged from the last edition, with the

exception of new exercises in the chapter on avoiding false discoveries. New

exercises for the other chapters and their solutions will be available to

instructors via the web. Bibliographic notes are included at the end of each

chapter for readers who are interested in more advanced topics, historically

important papers, and recent trends. These have also been significantly

updated. The book also contains a comprehensive subject and author index.

What is New in the Second Edition?

Some of the most significant improvements in the text have been in the two

chapters on classification. The introductory chapter uses the decision tree

classifier for illustration, but the discussion on many topics—those that apply

across all classification approaches—has been greatly expanded and

clarified, including topics such as overfitting, underfitting, the impact of training

size, model complexity, model selection, and common pitfalls in model

evaluation. Almost every section of the advanced classification chapter has

been significantly updated. The material on Bayesian networks, support vector

machines, and artificial neural networks has been significantly expanded. We

have added a separate section on deep networks to address the current

developments in this area. The discussion of evaluation, which occurs in the

section on imbalanced classes, has also been updated and improved.

The changes in association analysis are more localized. We have completely

reworked the section on the evaluation of association patterns (introductory

chapter), as well as the sections on sequence and graph mining (advanced

chapter). Changes to cluster analysis are also localized. The introductory

chapter adds the K-means initialization technique and an updated

discussion of cluster evaluation. The advanced clustering chapter adds a new

section on spectral graph clustering. Anomaly detection has been greatly

revised and expanded. Existing approaches—statistical, nearest

neighbor/density-based, and clustering based—have been retained and

updated, while new approaches have been added: reconstruction-based, one-

class classification, and information-theoretic. The reconstruction-based

approach is illustrated using autoencoder networks that are part of the deep

learning paradigm. The data chapter has been updated to include discussions

of mutual information and kernel-based techniques.

The last chapter, which discusses how to avoid false discoveries and produce

valid results, is completely new, and is novel among other contemporary

textbooks on data mining. It supplements the discussions in the other

chapters with a discussion of the statistical concepts (statistical significance,

p-values, false discovery rate, permutation testing, etc.) relevant to avoiding

spurious results, and then illustrates these concepts in the context of data

mining techniques. This chapter addresses the increasing concern over the

validity and reproducibility of results obtained from data analysis. The addition

of this last chapter is a recognition of the importance of this topic and an

acknowledgment that a deeper understanding of this area is needed for those

analyzing data.

The data exploration chapter has been deleted, as have the appendices, from

the print edition of the book, but will remain available on the web. A new

appendix provides a brief discussion of scalability in the context of big data.

To the Instructor

As a textbook, this book is suitable for a wide range of students at the

advanced undergraduate or graduate level. Since students come to this

subject with diverse backgrounds that may not include extensive knowledge of

statistics or databases, our book requires minimal prerequisites. No database

knowledge is needed, and we assume only a modest background in statistics

or mathematics, although such a background will make for easier going in

some sections. As before, the book, and more specifically, the chapters

covering major data mining topics, are designed to be as self-contained as

possible. Thus, the order in which topics can be covered is quite flexible. The

core material is covered in chapters 2 (data), 3 (classification), 5 (association

analysis), 7 (clustering), and 9 (anomaly detection). We recommend at least a

cursory coverage of Chapter 10 (Avoiding False Discoveries) to instill in

students some caution when interpreting the results of their data analysis.

Although the introductory data chapter (2) should be covered first, the basic

classification (3), association analysis (5), and clustering chapters (7), can be

covered in any order. Because of the relationship of anomaly detection (9) to

classification (3) and clustering (7), these chapters should precede Chapter 9.

Various topics can be selected from the advanced classification, association

analysis, and clustering chapters (4, 6, and 8, respectively) to fit the schedule

and interests of the instructor and students. We also advise that the lectures

be augmented by projects or practical exercises in data mining. Although they

are time consuming, such hands-on assignments greatly enhance the value of

the course.

Support Materials

Support materials for all readers of this book are available at

http://www-users.cs.umn.edu/~kumar/dmbook.

- PowerPoint lecture slides

- Suggestions for student projects

- Data mining resources, such as algorithms and data sets

- Online tutorials that give step-by-step examples for selected data mining techniques described in the book using actual data sets and data analysis software

Additional support materials, including solutions to exercises, are available

only to instructors adopting this textbook for classroom use. The book’s

resources will be mirrored at www.pearsonhighered.com/cs-resources.

Comments and suggestions, as well as reports of errors, can be sent to the

authors through [email protected]

Acknowledgments

Many people contributed to the first and second editions of the book. We

begin by acknowledging our families to whom this book is dedicated. Without

their patience and support, this project would have been impossible.

We would like to thank the current and former students of our data mining

groups at the University of Minnesota and Michigan State for their

contributions. Eui-Hong (Sam) Han and Mahesh Joshi helped with the initial

data mining classes. Some of the exercises and presentation slides that they

created can be found in the book and its accompanying slides. Students in our

data mining groups who provided comments on drafts of the book or who

contributed in other ways include Shyam Boriah, Haibin Cheng, Varun

Chandola, Eric Eilertson, Levent Ertöz, Jing Gao, Rohit Gupta, Sridhar Iyer,

Jung-Eun Lee, Benjamin Mayer, Aysel Ozgur, Uygar Oztekin, Gaurav Pandey,

Kashif Riaz, Jerry Scripps, Gyorgy Simon, Hui Xiong, Jieping Ye, and

Pusheng Zhang. We would also like to thank the students of our data mining

classes at the University of Minnesota and Michigan State University who

worked with early drafts of the book and provided invaluable feedback. We

specifically note the helpful suggestions of Bernardo Craemer, Arifin Ruslim,

Jamshid Vayghan, and Yu Wei.

Joydeep Ghosh (University of Texas) and Sanjay Ranka (University of Florida)

class tested early versions of the book. We also received many useful

suggestions directly from the following UT students: Pankaj Adhikari, Rajiv

Bhatia, Frederic Bosche, Arindam Chakraborty, Meghana Deodhar, Chris

Everson, David Gardner, Saad Godil, Todd Hay, Clint Jones, Ajay Joshi,

Joonsoo Lee, Yue Luo, Anuj Nanavati, Tyler Olsen, Sunyoung Park, Aashish

Phansalkar, Geoff Prewett, Michael Ryoo, Daryl Shannon, and Mei Yang.

Ronald Kostoff (ONR) read an early version of the clustering chapter and

offered numerous suggestions. George Karypis provided invaluable LaTeX

assistance in creating an author index. Irene Moulitsas also provided

assistance with LaTeX and reviewed some of the appendices. Musetta

Steinbach was very helpful in finding errors in the figures.

We would like to acknowledge our colleagues at the University of Minnesota

and Michigan State who have helped create a positive environment for data

mining research. They include Arindam Banerjee, Dan Boley, Joyce Chai, Anil

Jain, Ravi Janardan, Rong Jin, George Karypis, Claudia Neuhauser, Haesun

Park, William F. Punch, György Simon, Shashi Shekhar, and Jaideep

Srivastava. The collaborators on our many data mining projects, who also

have our gratitude, include Ramesh Agrawal, Maneesh Bhargava, Steve

Cannon, Alok Choudhary, Imme Ebert-Uphoff, Auroop Ganguly, Piet C. de

Groen, Fran Hill, Yongdae Kim, Steve Klooster, Kerry Long, Nihar Mahapatra,

Rama Nemani, Nikunj Oza, Chris Potter, Lisiane Pruinelli, Nagiza Samatova,

Jonathan Shapiro, Kevin Silverstein, Brian Van Ness, Bonnie Westra, Nevin

Young, and Zhi-Li Zhang.

The departments of Computer Science and Engineering at the University of

Minnesota and Michigan State University provided computing resources and a

supportive environment for this project. ARDA, ARL, ARO, DOE, NASA,

NOAA, and NSF provided research support for Pang-Ning Tan, Michael

Steinbach, Anuj Karpatne, and Vipin Kumar. In particular, Kamal Abdali, Mitra

Basu, Dick Brackney, Jagdish Chandra, Joe Coughlan, Michael Coyle,

Stephen Davis, Frederica Darema, Richard Hirsch, Chandrika Kamath,

Tsengdar Lee, Raju Namburu, N. Radhakrishnan, James Sidoran, Sylvia

Spengler, Bhavani Thuraisingham, Walt Tiernin, Maria Zemankova, Aidong

Zhang, and Xiaodong Zhang have been supportive of our research in data

mining and high-performance computing.

It was a pleasure working with the helpful staff at Pearson Education. In

particular, we would like to thank Matt Goldstein, Kathy Smith, Carole Snyder,

and Joyce Wells. We would also like to thank George Nichols, who helped

with the artwork, and Paul Anagnostopoulos, who provided LaTeX support.

We are grateful to the following Pearson reviewers: Leman Akoglu (Carnegie

Mellon University), Chien-Chung Chan (University of Akron), Zhengxin Chen

(University of Nebraska at Omaha), Chris Clifton (Purdue University),

Joydeep Ghosh (University of Texas, Austin), Nazli Goharian (Illinois Institute of

Technology), J. Michael Hardin (University of Alabama), Jingrui He (Arizona

State University), James Hearne (Western Washington University), Hillol

Kargupta (University of Maryland, Baltimore County and Agnik, LLC), Eamonn

Keogh (University of California-Riverside), Bing Liu (University of Illinois at

Chicago), Mariofanna Milanova (University of Arkansas at Little Rock),

Srinivasan Parthasarathy (Ohio State University), Zbigniew W. Ras (University

of North Carolina at Charlotte), Xintao Wu (University of North Carolina at

Charlotte), and Mohammed J. Zaki (Rensselaer Polytechnic Institute).

Over the years since the first edition, we have also received numerous

comments from readers and students who have pointed out typos and various

other issues. We are unable to mention these individuals by name, but their

input is much appreciated and has been taken into account for the second

edition.

Contents

Preface to the Second Edition

1 Introduction

1.1 What Is Data Mining?

1.2 Motivating Challenges

1.3 The Origins of Data Mining

1.4 Data Mining Tasks

1.5 Scope and Organization of the Book

1.6 Bibliographic Notes

1.7 Exercises

2 Data

2.1 Types of Data

2.1.1 Attributes and Measurement

2.1.2 Types of Data Sets

2.2 Data Quality

2.2.1 Measurement and Data Collection Issues

2.2.2 Issues Related to Applications

2.3 Data Preprocessing

2.3.1 Aggregation

2.3.2 Sampling

2.3.3 Dimensionality Reduction

2.3.4 Feature Subset Selection

2.3.5 Feature Creation

2.3.6 Discretization and Binarization

2.3.7 Variable Transformation

2.4 Measures of Similarity and Dissimilarity

2.4.1 Basics

2.4.2 Similarity and Dissimilarity between Simple Attributes

2.4.3 Dissimilarities between Data Objects

2.4.4 Similarities between Data Objects

2.4.5 Examples of Proximity Measures

2.4.6 Mutual Information

2.4.7 Kernel Functions*

2.4.8 Bregman Divergence*

2.4.9 Issues in Proximity Calculation

2.4.10 Selecting the Right Proximity Measure

2.5 Bibliographic Notes

2.6 Exercises

3 Classification: Basic Concepts and Techniques

3.1 Basic Concepts

3.2 General Framework for Classification

3.3 Decision Tree Classifier

3.3.1 A Basic Algorithm to Build a Decision Tree

3.3.2 Methods for Expressing Attribute Test Conditions

3.3.3 Measures for Selecting an Attribute Test Condition

3.3.4 Algorithm for Decision Tree Induction

3.3.5 Example Application: Web Robot Detection

3.3.6 Characteristics of Decision Tree Classifiers

3.4 Model Overfitting

3.4.1 Reasons for Model Overfitting

3.5 Model Selection

3.5.1 Using a Validation Set

3.5.2 Incorporating Model Complexity

3.5.3 Estimating Statistical Bounds

3.5.4 Model Selection for Decision Trees

3.6 Model Evaluation

3.6.1 Holdout Method

3.6.2 Cross-Validation

3.7 Presence of Hyper-parameters

3.7.1 Hyper-parameter Selection

3.7.2 Nested Cross-Validation

3.8 Pitfalls of Model Selection and Evaluation

3.8.1 Overlap between Training and Test Sets

3.8.2 Use of Validation Error as Generalization Error

3.9 Model Comparison

3.9.1 Estimating the Confidence Interval for Accuracy

3.9.2 Comparing the Performance of Two Models

3.10 Bibliographic Notes

3.11 Exercises

4 Classification: Alternative Techniques

4.1 Types of Classifiers

4.2 Rule-Based Classifier

4.2.1 How a Rule-Based Classifier Works

4.2.2 Properties of a Rule Set

4.2.3 Direct Methods for Rule Extraction

4.2.4 Indirect Methods for Rule Extraction

4.2.5 Characteristics of Rule-Based Classifiers

4.3 Nearest Neighbor Classifiers

4.3.1 Algorithm

4.3.2 Characteristics of Nearest Neighbor Classifiers

4.4 Naïve Bayes Classifier

4.4.1 Basics of Probability Theory

4.4.2 Naïve Bayes Assumption

4.5 Bayesian Networks

4.5.1 Graphical Representation

4.5.2 Inference and Learning

4.5.3 Characteristics of Bayesian Networks

4.6 Logistic Regression

4.6.1 Logistic Regression as a Generalized Linear Model

4.6.2 Learning Model Parameters

4.6.3 Characteristics of Logistic Regression

4.7 Artificial Neural Network (ANN)

4.7.1 Perceptron

4.7.2 Multi-layer Neural Network

4.7.3 Characteristics of ANN

4.8 Deep Learning

4.8.1 Using Synergistic Loss Functions

4.8.2 Using Responsive Activation Functions

4.8.3 Regularization

4.8.4 Initialization of Model Parameters

4.8.5 Characteristics of Deep Learning

4.9 Support Vector Machine (SVM)

4.9.1 Margin of a Separating Hyperplane

4.9.2 Linear SVM

4.9.3 Soft-margin SVM

4.9.4 Nonlinear SVM

4.9.5 Characteristics of SVM

4.10 Ensemble Methods

4.10.1 Rationale for Ensemble Method

4.10.2 Methods for Constructing an Ensemble Classifier

4.10.3 Bias-Variance Decomposition

4.10.4 Bagging

4.10.5 Boosting

4.10.6 Random Forests

4.10.7 Empirical Comparison among Ensemble Methods

4.11 Class Imbalance Problem

4.11.1 Building Classifiers with Class Imbalance

4.11.2 Evaluating Performance with Class Imbalance

4.11.3 Finding an Optimal Score Threshold

4.11.4 Aggregate Evaluation of Performance

4.12 Multiclass Problem

4.13 Bibliographic Notes

4.14 Exercises

5 Association Analysis: Basic Concepts and Algorithms

5.1 Preliminaries

5.2 Frequent Itemset Generation

5.2.1 The Apriori Principle

5.2.2 Frequent Itemset Generation in the Apriori Algorithm

5.2.3 Candidate Generation and Pruning

5.2.4 Support Counting

5.2.5 Computational Complexity

5.3 Rule Generation

5.3.1 Confidence-Based Pruning

5.3.2 Rule Generation in Apriori Algorithm

5.3.3 An Example: Congressional Voting Records

5.4 Compact Representation of Frequent Itemsets

5.4.1 Maximal Frequent Itemsets

5.4.2 Closed Itemsets

5.5 Alternative Methods for Generating Frequent Itemsets*

5.6 FP-Growth Algorithm*

5.6.1 FP-Tree Representation

5.6.2 Frequent Itemset Generation in FP-Growth Algorithm

5.7 Evaluation of Association Patterns

5.7.1 Objective Measures of Interestingness

5.7.2 Measures beyond Pairs of Binary Variables

5.7.3 Simpson’s Paradox

5.8 Effect of Skewed Support Distribution

5.9 Bibliographic Notes

5.10 Exercises

6 Association Analysis: Advanced Concepts

6.1 Handling Categorical Attributes

6.2 Handling Continuous Attributes

6.2.1 Discretization-Based Methods

6.2.2 Statistics-Based Methods

6.2.3 Non-discretization Methods

6.3 Handling a Concept Hierarchy

6.4 Sequential Patterns

6.4.1 Preliminaries

6.4.2 Sequential Pattern Discovery

6.4.3 Timing Constraints

6.4.4 Alternative Counting Schemes

6.5 Subgraph Patterns

6.5.1 Preliminaries

6.5.2 Frequent Subgraph Mining

6.5.3 Candidate Generation

6.5.4 Candidate Pruning

6.5.5 Support Counting

6.6 Infrequent Patterns

6.6.1 Negative Patterns

6.6.2 Negatively Correlated Patterns

6.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns

6.6.4 Techniques for Mining Interesting Infrequent Patterns

6.6.5 Techniques Based on Mining Negative Patterns

6.6.6 Techniques Based on Support Expectation

6.7 Bibliographic Notes

6.8 Exercises

7 Cluster Analysis: Basic Concepts and Algorithms

7.1 Overview

7.1.1 What Is Cluster Analysis?

7.1.2 Different Types of Clusterings

7.1.3 Different Types of Clusters

7.2 K-means

7.2.1 The Basic K-means Algorithm

7.2.2 K-means: Additional Issues

7.2.3 Bisecting K-means

7.2.4 K-means and Different Types of Clusters

7.2.5 Strengths and Weaknesses

7.2.6 K-means as an Optimization Problem

7.3 Agglomerative Hierarchical Clustering

7.3.1 Basic Agglomerative Hierarchical Clustering Algorithm

7.3.2 Specific Techniques

7.3.3 The Lance-Williams Formula for Cluster Proximity

7.3.4 Key Issues in Hierarchical Clustering

7.3.5 Outliers

7.3.6 Strengths and Weaknesses

7.4 DBSCAN

7.4.1 Traditional Density: Center-Based Approach

7.4.2 The DBSCAN Algorithm

7.4.3 Strengths and Weaknesses

7.5 Cluster Evaluation

7.5.1 Overview

7.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation

7.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix

7.5.4 Unsupervised Evaluation of Hierarchical Clustering

7.5.5 Determining the Correct Number of Clusters

7.5.6 Clustering Tendency

7.5.7 Supervised Measures of Cluster Validity

7.5.8 Assessing the Significance of Cluster Validity Measures

7.5.9 Choosing a Cluster Validity Measure

7.6 Bibliographic Notes

7.7 Exercises

8 Cluster Analysis: Additional Issues and Algorithms

8.1 Characteristics of Data, Clusters, and Clustering Algorithms

8.1.1 Example: Comparing K-means and DBSCAN

8.1.2 Data Characteristics

8.1.3 Cluster Characteristics

8.1.4 General Characteristics of Clustering Algorithms

8.2 Prototype-Based Clustering

8.2.1 Fuzzy Clustering

8.2.2 Clustering Using Mixture Models

8.2.3 Self-Organizing Maps (SOM)

8.3 Density-Based Clustering

8.3.1 Grid-Based Clustering

8.3.2 Subspace Clustering

8.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based Clustering

8.4 Graph-Based Clustering

8.4.1 Sparsification

8.4.2 Minimum Spanning Tree (MST) Clustering

8.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS

8.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling

8.4.5 Spectral Clustering

8.4.6 Shared Nearest Neighbor Similarity

8.4.7 The Jarvis-Patrick Clustering Algorithm

8.4.8 SNN Density

8.4.9 SNN Density-Based Clustering

8.5 Scalable Clustering Algorithms

8.5.1 Scalability: General Issues and Approaches

8.5.2 BIRCH

8.5.3 CURE

8.6 Which Clustering Algorithm?

8.7 Bibliographic Notes

8.8 Exercises

9 Anomaly Detection

9.1 Characteristics of Anomaly Detection Problems

9.1.1 A Definition of an Anomaly

9.1.2 Nature of Data

9.1.3 How Anomaly Detection is Used

9.2 Characteristics of Anomaly Detection Methods

9.3 Statistical Approaches

9.3.1 Using Parametric Models

9.3.2 Using Non-parametric Models

9.3.3 Modeling Normal and Anomalous Classes

9.3.4 Assessing Statistical Significance

9.3.5 Strengths and Weaknesses

9.4 Proximity-based Approaches

9.4.1 Distance-based Anomaly Score

9.4.2 Density-based Anomaly Score

9.4.3 Relative Density-based Anomaly Score

9.4.4 Strengths and Weaknesses

9.5 Clustering-based Approaches

9.5.1 Finding Anomalous Clusters

9.5.2 Finding Anomalous Instances

9.5.3 Strengths and Weaknesses

9.6 Reconstruction-based Approaches

9.6.1 Strengths and Weaknesses

9.7 One-class Classification

9.7.1 Use of Kernels

9.7.2 The Origin Trick

9.7.3 Strengths and Weaknesses

9.8 Information Theoretic Approaches

9.8.1 Strengths and Weaknesses

9.9 Evaluation of Anomaly Detection

9.10 Bibliographic Notes

9.11 Exercises

10 Avoiding False Discoveries

10.1 Preliminaries: Statistical Testing

10.1.1 Significance Testing

10.1.2 Hypothesis Testing

10.1.3 Multiple Hypothesis Testing

10.1.4 Pitfalls in Statistical Testing

10.2 Modeling Null and Alternative Distributions

10.2.1 Generating Synthetic Data Sets

10.2.2 Randomizing Class Labels

10.2.3 Resampling Instances

10.2.4 Modeling the Distribution of the Test Statistic

10.3 Statistical Testing for Classification

10.3.1 Evaluating Classification Performance

10.3.2 Binary Classification as Multiple Hypothesis Testing

10.3.3 Multiple Hypothesis Testing in Model Selection

10.4 Statistical Testing for Association Analysis

10.4.1 Using Statistical Models

10.4.2 Using Randomization Methods

10.5 Statistical Testing for Cluster Analysis

10.5.1 Generating a Null Distribution for Internal Indices

10.5.2 Generating a Null Distribution for External Indices

10.5.3 Enrichment

10.6 Statistical Testing for Anomaly Detection

10.7 Bibliographic Notes

10.8 Exercises

Author Index

Subject Index

Copyright Permissions

1 Introduction

Rapid advances in data collection and storage

technology, coupled with the ease with which data can

be generated and disseminated, have triggered the

explosive growth of data, leading to the current age of

big data. Deriving actionable insights from these large

data sets is increasingly important in decision making

across almost all areas of society, including business

and industry; science and engineering; medicine and

biotechnology; and government and individuals.

However, the amount of data (volume), its complexity

(variety), and the rate at which it is being collected and

processed (velocity) have simply become too great for

humans to analyze unaided. Thus, there is a great

need for automated tools for extracting useful

information from the big data despite the challenges

posed by its enormity and diversity.

Data mining blends traditional data analysis methods

with sophisticated algorithms for processing this

abundance of data. In this introductory chapter, we

present an overview of data mining and outline the key

topics to be covered in this book. We start with a

description of some applications that require more

advanced techniques for data analysis.

Business and Industry. Point-of-sale data collection (bar code scanners,

radio frequency identification (RFID), and smart card technology) have

allowed retailers to collect up-to-the-minute data about customer purchases at

the checkout counters of their stores. Retailers can utilize this information,

along with other business-critical data, such as web server logs from e-

commerce websites and customer service records from call centers, to help

them better understand the needs of their customers and make more informed

business decisions.

Data mining techniques can be used to support a wide range of business

intelligence applications, such as customer profiling, targeted marketing,

workflow management, store layout, fraud detection, and automated buying

and selling. An example of the last application is high-speed stock trading,

where decisions on buying and selling have to be made in less than a second

using data about financial transactions. Data mining can also help retailers

answer important business questions, such as “Who are the most profitable

customers?” “What products can be cross-sold or up-sold?” and “What is the

revenue outlook of the company for next year?” These questions have

inspired the development of such data mining techniques as association

analysis (Chapters 5 and 6).
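
To make the cross-selling question concrete, the following is a minimal sketch of the two quantities usually reported for an association rule, support and confidence. The transactions are invented for illustration, and the function names `support` and `confidence` are assumptions of this sketch rather than anything defined in the text above.

```python
# Toy market-basket data: hypothetical, made up purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent); 0.0 if the antecedent never occurs."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / base


# Candidate cross-sell rule: customers who buy diapers also buy beer.
rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule with high support and high confidence on real checkout data would be a candidate for a cross-selling promotion, although a full analysis would also need to check that the pattern is not spurious.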

As the Internet continues to revolutionize the way we interact and make

decisions in our everyday lives, we are generating massive amounts of data

about our online experiences, e.g., web browsing, messaging, and posting on

social networking websites. This has opened several opportunities for

business applications that use web data. For example, in the e-commerce

sector, data about our online viewing or shopping preferences can be used to

provide personalized recommendations of products. Data mining also plays a

prominent role in supporting several other Internet-based services, such as

filtering spam messages, answering search queries, and suggesting social

updates and connections. The large corpus of text, images, and videos

available on the Internet has enabled a number of advancements in data

mining methods, including deep learning, which is discussed in Chapter 4.

These developments have led to great advances in a number of applications,

such as object recognition, natural language translation, and autonomous

driving.

Another domain that has undergone a rapid big data transformation is the use

of mobile sensors and devices, such as smart phones and wearable

computing devices. With better sensor technologies, it has become possible

to collect a variety of information about our physical world using low-cost

sensors embedded on everyday objects that are connected to each other,

termed the Internet of Things (IoT). This deep integration of physical sensors

in digital systems is beginning to generate large amounts of diverse and

distributed data about our environment, which can be used for designing

convenient, safe, and energy-efficient home systems, as well as for urban

planning of smart cities.

Medicine, Science, and Engineering

Researchers in medicine, science, and

engineering are rapidly accumulating data that is key to significant new

discoveries. For example, as an important step toward improving our

understanding of the Earth’s climate system, NASA has deployed a series of

Earth-orbiting satellites that continuously generate global observations of the

land surface, oceans, and atmosphere. However, because of the size and

spatio-temporal nature of the data, traditional methods are often not suitable

for analyzing these data sets. Techniques developed in data mining can aid

Earth scientists in answering questions such as the following: “What is the

relationship between global warming and the frequency and intensity of

ecosystem disturbances such as droughts and hurricanes?” “How are land surface

precipitation and temperature affected by ocean surface temperature?” and

“How well can we predict the beginning and end of the growing season for a

region?”

As another example, researchers in molecular biology hope to use the large

amounts of genomic data to better understand the structure and function of

genes. In the past, traditional methods in molecular biology allowed scientists

to study only a few genes at a time in a given experiment. Recent

breakthroughs in microarray technology have enabled scientists to compare

the behavior of thousands of genes under various situations. Such

comparisons can help determine the function of each gene, and perhaps

isolate the genes responsible for certain diseases. However, the noisy, high-

dimensional nature of data requires new data analysis methods. In addition to

analyzing gene expression data, data mining can also be used to address

other important biological challenges such as protein structure prediction,

multiple sequence alignment, the modeling of biochemical pathways, and

phylogenetics.

Another example is the use of data mining techniques to analyze electronic

health record (EHR) data, which has become increasingly available. Not very

long ago, studies of patients required manually examining the physical

records of individual patients and extracting very specific pieces of information

pertinent to the particular question being investigated. EHRs allow for a faster

and broader exploration of such data. However, there are significant

challenges since the observations on any one patient typically occur during

their visits to a doctor or hospital and only a small number of details about the

health of the patient are measured during any particular visit.

Currently, EHR analysis focuses on simple types of data, e.g., a patient’s

blood pressure or the diagnosis code of a disease. However, large amounts of

more complex types of medical data are also being collected, such as

electrocardiograms (ECGs) and neuroimages from magnetic resonance

imaging (MRI) or functional magnetic resonance imaging (fMRI). Although

challenging to analyze, this data also provides vital information about patients.

Integrating and analyzing such data with traditional EHR and genomic data is

one of the capabilities needed to enable precision medicine, which aims to

provide more personalized patient care.

1.1 What Is Data Mining?

Data mining is the process of automatically discovering useful information in

large data repositories. Data mining techniques are deployed to scour large

data sets in order to find novel and useful patterns that might otherwise

remain unknown. They also provide the capability to predict the outcome of a

future observation, such as the amount a customer will spend at an online or a

brick-and-mortar store.

Not all information discovery tasks are considered to be data mining.

Examples include queries, e.g., looking up individual records in a database or

finding web pages that contain a particular set of keywords. This is because

such tasks can be accomplished through simple interactions with a database

management system or an information retrieval system. These systems rely

on traditional computer science techniques, which include sophisticated

indexing structures and query processing algorithms, for efficiently organizing

and retrieving information from large data repositories. Nonetheless, data

mining techniques have been used to enhance the performance of such

systems by improving the quality of the search results based on their

relevance to the input queries.

Data Mining and Knowledge Discovery in Databases

Data mining is an integral part of knowledge discovery in databases (KDD),

which is the overall process of converting raw data into useful information, as

shown in Figure 1.1. This process consists of a series of steps, from data

preprocessing to postprocessing of data mining results.

Figure 1.1.

The process of knowledge discovery in databases (KDD).

The input data can be stored in a variety of formats (flat files, spreadsheets, or

relational tables) and may reside in a centralized data repository or be

distributed across multiple sites. The purpose of preprocessing is to

transform the raw input data into an appropriate format for subsequent

analysis. The steps involved in data preprocessing include fusing data from

multiple sources, cleaning data to remove noise and duplicate observations,

and selecting records and features that are relevant to the data mining task at

hand. Because of the many ways data can be collected and stored, data

preprocessing is perhaps the most laborious and time-consuming step in the

overall knowledge discovery process.
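
The three preprocessing steps named above (fusing sources, cleaning, and selecting features) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the two record lists, the field names, and the rule that a negative amount counts as noise are all hypothetical and do not come from the book.

```python
# Hypothetical records from two sources; all names and values are illustrative.
store_sales = [
    {"customer_id": 1, "amount": 25.0, "channel": "store"},
    {"customer_id": 2, "amount": 40.0, "channel": "store"},
    {"customer_id": 2, "amount": 40.0, "channel": "store"},  # duplicate observation
    {"customer_id": 3, "amount": -5.0, "channel": "store"},  # treated as noise (assumption)
]
web_sales = [
    {"customer_id": 1, "amount": 12.5, "channel": "web"},
]


def preprocess(sources, features):
    # Step 1: fuse data from multiple sources into one collection.
    fused = [record for source in sources for record in source]

    # Step 2: clean the data by dropping invalid amounts and exact duplicates.
    seen = set()
    cleaned = []
    for record in fused:
        key = tuple(sorted(record.items()))
        if record["amount"] < 0 or key in seen:
            continue
        seen.add(key)
        cleaned.append(record)

    # Step 3: keep only the features relevant to the task at hand.
    return [{name: record[name] for name in features} for record in cleaned]


prepared = preprocess([store_sales, web_sales], features=("customer_id", "amount"))
print(prepared)
```

In practice each step is far more involved (for example, matching customers across sources that use different identifiers), which is one reason preprocessing tends to dominate the effort.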

“Closing the loop” is a phrase often used to refer to the process of integrating

data mining results into decision support systems. For example, in business

applications, the insights offered by data mining results can be integrated with

campaign management tools so that effective marketing promotions can be

conducted and tested. Such integration requires a postprocessing step to

ensure that only valid and useful results are incorporated into the decision

support system. An example of postprocessing is visualization, which allows

analysts to explore the data and the data mining results from a variety of

viewpoints. Hypothesis testing methods can also be applied during

postprocessing to eliminate spurious data mining results. (See Chapter 10.)

1.2 Motivating Challenges

As mentioned earlier, traditional data analysis techniques have often

encountered practical difficulties in meeting the challenges posed by big data

applications. The following are some of the specific challenges that motivated

the development of data mining.

Scalability

Because of advances in data generation and collection, data sets with sizes of

terabytes, petabytes, or even exabytes are becoming common. If data mining

algorithms are to handle these massive data sets, they must be scalable.

Many data mining algorithms employ special search strategies to handle

exponential search problems. Scalability may also require the implementation

of novel data structures to access individual records in an efficient manner.

For instance, out-of-core algorithms may be necessary when processing data

sets that cannot fit into main memory. Scalability can also be improved by

using sampling or developing parallel and distributed algorithms. A general

overview of techniques for scaling up data mining algorithms is given in

Appendix F.
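
As one hedged illustration of how sampling can help with data that does not fit into main memory, the sketch below uses reservoir sampling, a standard technique (not necessarily the one covered in Appendix F) that draws a fixed-size uniform sample in a single pass while holding only k items at a time. The function name, the seed parameter, and the stand-in data stream are assumptions made for this example.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)       # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item     # keep the new item with probability k / (i + 1)
    return reservoir


# The generator stands in for records read one at a time from disk.
records = (n * n for n in range(100_000))
sample = reservoir_sample(records, k=5)
print(sample)
```

Because the stream is consumed lazily, memory use depends on k rather than on the size of the data set.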

High Dimensionality

It is now common to encounter data sets with hundreds or thousands of

attributes instead of the handful common a few decades ago. In

bioinformatics, progress in microarray technology has produced gene

expression data involving thousands of features. Data sets with temporal or

spatial components also tend to have high dimensionality. For example,

consider a data set that contains measurements of temperature at various

locations. If the temperature measurements are taken repeatedly for an

extended period, the number of dimensions (features) increases in proportion

to the number of measurements taken. Traditional data analysis techniques

that were developed for low-dimensional data often do not work well for such

high-dimensional data due to issues such as the curse of dimensionality (to be

discussed in Chapter 2). Also, for some data analysis algorithms, the

computational complexity increases rapidly as the dimensionality (the number

of features) increases.
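
One commonly cited symptom of high dimensionality is that distances between points become less discriminating as the number of features grows. The sketch below is an informal experiment rather than the book's treatment of the topic: it draws uniformly random points (an assumption) and reports the relative contrast, defined here as (max - min) / min over the distances from a random query point. The function name and parameters are hypothetical.

```python
import math
import random


def relative_contrast(num_points, dim, seed=0):
    """(max - min) / min distance from a random query to random points in [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(num_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))  # Euclidean distance (Python 3.8+)
    return (max(distances) - min(distances)) / min(distances)


for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(200, dim), 3))
```

On such random data the printed contrast typically shrinks as the dimensionality grows, which suggests why methods that rely on nearest and farthest neighbors can degrade on high-dimensional data.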

Heterogeneous and Complex Data

Traditional data analysis methods often deal with data sets containing

attributes of the same type, either continuous or categorical. As the role of

data mining in business, science, medicine, and other fields has grown, so

has the need for techniques that can handle heterogeneous attributes. Recent

years have also seen the emergence of more complex data objects.

Examples of such non-traditional types of data include web and social media

data containing text, hyperlinks, images, audio, and videos; DNA data with

sequential and three-dimensional structure; and climate data that consists of

measurements (temperature, pressure, etc.) at various times and locations on

the Earth’s surface. Techniques developed for mining such complex objects

should take into consideration relationships in the data, such as temporal and

spatial autocorrelation, graph connectivity, and parent-child relationships

between the elements in semi-structured text and XML documents.

Data Ownership and Distribution

Sometimes, the data needed for an analysis is not stored in one location or

owned by one organization. Instead, the data is geographically distributed

among resources belonging to multiple entities. This requires the development

of distributed data mining techniques. The key challenges faced by distributed

data mining algorithms include the following: (1) how to reduce the amount of

communication needed to perform the distributed computation, (2) how to

effectively consolidate the data mining results obtained from multiple sources,

and (3) how to address data security and privacy issues.
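
Challenges (1) and (2) can be illustrated with a deliberately simple case: computing a global mean when the data is held at several sites. In the sketch below, each site sends only a two-number summary instead of its raw records, and a coordinator consolidates the summaries. The site names, values, and function names are hypothetical, and the example does not address challenge (3), since even summaries can leak information.

```python
def local_summary(values):
    """Computed at each site: only two numbers leave the site, not the raw data."""
    return {"count": len(values), "total": sum(values)}


def consolidate(summaries):
    """Computed at the coordinator: merge per-site summaries into a global mean."""
    count = sum(s["count"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    return total / count


# Hypothetical measurements held by three separate organizations.
site_data = {
    "site_a": [10.0, 12.0, 14.0],
    "site_b": [20.0, 22.0],
    "site_c": [30.0],
}
summaries = [local_summary(values) for values in site_data.values()]
global_mean = consolidate(summaries)
print(global_mean)  # 18.0
```

The result equals the mean of the pooled data because counts and totals are additive. Many data mining computations do not decompose this neatly, which is what makes distributed algorithms hard to design.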

Non-traditional Analysis

The traditional statistical approach is based on a hypothesize-and-test

paradigm. In other words, a hypothesis is proposed, an experiment is

designed to gather the data, and then the data is analyzed with respect to the

hypothesis. Unfortunately, this process is extremely labor-intensive. Current

data analysis tasks often require the generation and evaluation of thousands

of hypotheses, and consequently, the development of some data mining

techniques has been motivated by the desire to automate the process of

hypothesis generation and evaluation. Furthermore, the data sets analyzed in

data mining are typically not the result of a carefully designed experiment and

often represent opportunistic samples of the data, rather than random

samples.

1.3 The Origins of Data Mining

While data mining has traditionally been viewed as an intermediate process

within the KDD framework, as shown in Figure 1.1, it has emerged over the

years as an academic field within computer science, focusing on all aspects of

KDD, including data preprocessing, mining, and postprocessing. Its origin can

be traced back to the late 1980s, following a series of workshops organized

on the topic of knowledge discovery in databases. The workshops brought

together researchers from different disciplines to discuss the challenges and

opportunities in applying computational techniques to extract actionable

knowledge from large databases. The workshops quickly grew into hugely

popular conferences that were attended by researchers and practitioners from

both academia and industry. The success of these conferences, along

with the interest shown by businesses and industry in recruiting new hires with

a data mining background, has fueled the tremendous growth of this field.

The field was initially built upon the methodology and algorithms that

researchers had previously used. In particular, data mining researchers draw

upon ideas, such as (1) sampling, estimation, and hypothesis testing from

statistics and (2) search algorithms, modeling techniques, and learning

theories from artificial intelligence, pattern recognition, and machine learning.

Data mining has also been quick to adopt ideas from other areas, including

optimization, evolutionary computing, information theory, signal processing,

visualization, and information retrieval, and to extend them to solve the

challenges of mining big data.

A number of other areas also play key supporting roles. In particular, database

systems are needed to provide support for efficient storage, indexing, and

query processing. Techniques from high performance (parallel) computing are

often important in addressing the massive size of some data sets. Distributed

techniques can also help address the issue of size and are essential when the

data cannot be gathered in one location. Figure 1.2 shows the relationship

of data mining to other areas.

Figure 1.2.

Data mining as a confluence of many disciplines.

Data Science and Data-Driven Discovery

Data science is an interdisciplinary field that studies and applies tools and

techniques for deriving useful insights from data. Although data science is

regarded as an emerging field with a distinct identity of its own, the tools and

techniques often come from many different areas of data analysis, such as

data mining, statistics, AI, machine learning, pattern recognition, database

technology, and distributed and parallel computing. (See Figure 1.2.)

The emergence of data science as a new field is a recognition that, often,

none of the existing areas of data analysis provides a complete set of tools for

the data analysis tasks that are often encountered in emerging applications.

Instead, a broad range of computational, mathematical, and statistical skills is

often required. To illustrate the challenges that arise in analyzing such data,

consider the following example. Social media and the Web present new

opportunities for social scientists to observe and quantitatively measure

human behavior on a large scale. To conduct such a study, social scientists

work with analysts who possess skills in areas such as web mining, natural

language processing (NLP), network analysis, data mining, and statistics.

Compared to more traditional research in social science, which is often based

on surveys, this analysis requires a broader range of skills and tools, and

involves far larger amounts of data. Thus, data science is, by necessity, a

highly interdisciplinary field that builds on the continuing work of many fields.

The data-driven approach of data science emphasizes the direct discovery of

patterns and relationships from data, especially in large quantities of data,

often without the need for extensive domain knowledge. A notable example of

the success of this approach is represented by advances in neural networks,

i.e., deep learning, which have been particularly successful in areas which

have long proved challenging, e.g., recognizing objects in photos or videos

and words in speech, as well as in other application areas. However, note that

this is just one example of the success of data-driven approaches, and

dramatic improvements have also occurred in many other areas of data

analysis. Many of these developments are topics described later in this book.

Some cautions on potential limitations of a purely data-driven approach are

given in the Bibliographic Notes.

1.4 Data Mining Tasks

Data mining tasks are generally divided into two major categories:

Predictive tasks The objective of these tasks is to predict the value of a

particular attribute based on the values of other attributes. The attribute to be

predicted is commonly known as the target or dependent variable, while the

attributes used for making the prediction are known as the explanatory or

independent variables.

Descriptive tasks Here, the objective is to derive patterns (correlations,

trends, clusters, trajectories, and anomalies) that summarize the underlying

relationships in data. Descriptive data mining tasks are often exploratory in

nature and frequently require postprocessing techniques to validate and

explain the results.

Figure 1.3 illustrates four of the core data mining tasks that are described

in the remainder of this book.

Figure 1.3.

Four of the core data mining tasks.

Predictive modeling refers to the task of building a model for the target

variable as a function of the explanatory variables. There are two types of

predictive modeling tasks: classification, which is used for discrete target

variables, and regression, which is used for continuous target variables. For

example, predicting whether a web user will make a purchase at an online

bookstore is a classification task because the target variable is binary-valued.

On the other hand, forecasting the future price of a stock is a regression task

because price is a continuous-valued attribute. The goal of both tasks is to

learn a model that minimizes the error between the predicted and true values

of the target variable. Predictive modeling can be used to identify customers

who will respond to a marketing campaign, predict disturbances in the Earth’s

ecosystem, or judge whether a patient has a particular disease based on the

results of medical tests.
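
The notion of "error between the predicted and true values" depends on the task. As a sketch, the code below uses two common choices, which are assumptions here rather than definitions taken from this section: the misclassification rate for a discrete target and the mean squared error for a continuous target. All labels and prices are made-up illustrative values.

```python
def error_rate(true_labels, predicted_labels):
    """Fraction of misclassified records (classification, discrete target)."""
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)


def mean_squared_error(true_values, predicted_values):
    """Average squared difference (regression, continuous target)."""
    squared = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return sum(squared) / len(squared)


# Hypothetical predictions: will a web user make a purchase (classification)?
bought = ["yes", "no", "no", "yes"]
predicted_bought = ["yes", "no", "yes", "yes"]
print(error_rate(bought, predicted_bought))  # 0.25

# Hypothetical predictions: future price of a stock (regression).
price = [10.0, 12.0, 11.0]
predicted_price = [11.0, 12.0, 9.0]
print(mean_squared_error(price, predicted_price))  # (1 + 0 + 4) / 3
```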

Example 1.1 (Predicting the Type of a Flower).

Consider the task of predicting a species of flower based on the

characteristics of the flower. In particular, consider classifying an Iris flower

as one of the following three Iris species: Setosa, Versicolour, or Virginica.

To perform this task, we need a data set containing the characteristics of

various flowers of these three species. A data set with this type of

information is the well-known Iris data set from the UCI Machine Learning

Repository at http://www.ics.uci.edu/~mlearn. In addition to the species

of a flower, this data set contains four other attributes: sepal width, sepal

length, petal length, and petal width. Figure 1.4 shows a plot of petal

width versus petal length for the 150 flowers in the Iris data set. Petal width

is broken into the categories low, medium, and high, which correspond to

the intervals [0, 0.75), [0.75, 1.75), and [1.75, ∞), respectively. Also, petal

length is broken into categories low, medium, and high, which correspond

to the intervals [0, 2.5), [2.5, 5), and [5, ∞), respectively. Based on these

categories of petal width and length, the following rules can be derived:

Petal width low and petal length low implies Setosa.

Petal width medium and petal length medium implies Versicolour.

Petal width high and petal length high implies Virginica.

While these rules do not classify all the flowers, they do a good (but not

perfect) job of classifying most of the flowers. Note that flowers from the

Setosa species are well separated from the Versicolour and Virginica

species with respect to petal width and length, but the latter two species

overlap somewhat with respect to these attributes.

Figure 1.4.

Petal width versus petal length for 150 Iris flowers.
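
The three rules of Example 1.1 translate almost directly into code. The sketch below bins petal width and petal length using the intervals given in the example and returns None for combinations that the rules do not cover. The function names and the sample measurements are illustrative assumptions; the measurements are not rows quoted from the Iris data set.

```python
def bin_value(value, low_upper, medium_upper):
    """Map a measurement to 'low', 'medium', or 'high' using half-open intervals."""
    if value < low_upper:
        return "low"
    if value < medium_upper:
        return "medium"
    return "high"


def classify_iris(petal_width, petal_length):
    width = bin_value(petal_width, 0.75, 1.75)   # [0, 0.75), [0.75, 1.75), [1.75, inf)
    length = bin_value(petal_length, 2.5, 5.0)   # [0, 2.5), [2.5, 5), [5, inf)
    rules = {
        ("low", "low"): "Setosa",
        ("medium", "medium"): "Versicolour",
        ("high", "high"): "Virginica",
    }
    # Combinations not covered by the three rules are left unclassified.
    return rules.get((width, length))


print(classify_iris(0.2, 1.4))  # Setosa
print(classify_iris(1.3, 4.5))  # Versicolour
print(classify_iris(2.1, 5.8))  # Virginica
print(classify_iris(1.8, 4.8))  # None: width is high but length is medium
```

The last call shows why the rules "do not classify all the flowers": a flower whose width and length fall into different categories matches none of them.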

Association analysis is used to discover patterns that describe strongly

associated features in the data. The discovered patterns are typically

represented in the form of implication rules or feature subsets. Because of the

exponential size of its search space, the goal of association analysis is to

extract the most interesting patterns in an efficient manner. Useful applications

of association analysis include finding groups of genes that have related

functionality, identifying web pages that are accessed together, or

understanding the relationships between different elements of Earth’s climate

system.
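
To make "strongly associated" concrete, the sketch below computes two standard measures for a rule, support and confidence, over the ten transactions of Table 1.1 (Example 1.2). These measures are not defined in this chapter, so the definitions in the docstrings should be read as assumptions for this example, and the function names are hypothetical.

```python
# The ten transactions of Table 1.1.
transactions = [
    {"Bread", "Butter", "Diapers", "Milk"},
    {"Coffee", "Sugar", "Cookies", "Salmon"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    {"Bread", "Butter", "Salmon", "Chicken"},
    {"Eggs", "Bread", "Butter"},
    {"Salmon", "Diapers", "Milk"},
    {"Bread", "Tea", "Sugar", "Eggs"},
    {"Coffee", "Sugar", "Chicken", "Eggs"},
    {"Bread", "Diapers", "Milk", "Salt"},
    {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)


print(support({"Diapers", "Milk"}))       # 0.5
print(confidence({"Diapers"}, {"Milk"}))  # 1.0
print(confidence({"Bread"}, {"Butter"}))
```

In this table every transaction that contains diapers also contains milk, which is why the rule {Diapers}→{Milk} stands out. Checking every possible itemset this way would be infeasible on real data because the search space grows exponentially with the number of items.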

Example 1.2 (Market Basket Analysis).

The transactions shown in Table 1.1 illustrate point-of-sale data

collected at the checkout counters of a grocery store. Association analysis

can be applied to find items that are frequently bought together by

customers. For example, we may discover the rule {Diapers}→{Milk},

which suggests that customers who buy diapers also tend to buy milk. This

type of rule can be used to identify potential cross-selling opportunities

among related items.

Table 1.1. Market basket data.

Transaction ID Items

1 {Bread, Butter, Diapers, Milk}

2 {Coffee, Sugar, Cookies, Salmon}

3 {Bread, Butter, Coffee, Diapers, Milk, Eggs}

4 {Bread, Butter, Salmon, Chicken}

5 {Eggs, Bread, Butter}

6 {Salmon, Diapers, Milk}

7 {Bread, Tea, Sugar, Eggs}

8 {Coffee, Sugar, Chicken, Eggs}

9 {Bread, Diapers, Milk, Salt}

10 {Tea, Eggs, Cookies, Diapers, Milk}

Cluster analysis seeks to find groups of closely related observations so that

observations that belong to the same cluster are more similar to each other

than observations that belong to other clusters. Clustering has been used to

group sets of related customers, find areas of the ocean that have a

significant impact on the Earth’s climate, and compress data.

Example 1.3 (Document Clustering).

The collection of news articles shown in Table 1.2 can be grouped

based on their respective topics. Each article is represented as a set of

word-frequency pairs (w : c), where w is a word and c is the number of

times the word appears in the article. There are two natural clusters in the

data set. The first cluster consists of the first four articles, which

correspond to news about the economy, while the second cluster contains

the last four articles, which correspond to news about health care. A good

clustering algorithm should be able to identify these two clusters based on

the similarity between words that appear in the articles.

Table 1.2. Collection of news articles.

Article Word-frequency pairs

1 dollar: 1, industry: 4, country: 2, loan: 3, deal: 2, government: 2

2 machinery: 2, labor: 3, market: 4, industry: 2, work: 3, country: 1

3 job: 5, inflation: 3, rise: 2, jobless: 2, market: 3, country: 2, index: 3

4 domestic: 3, forecast: 2, gain: 1, market: 2, sale: 3, price: 2

5 patient: 4, symptom: 2, drug: 3, health: 2, clinic: 2, doctor: 2

6 pharmaceutical: 2, company: 3, drug: 2, vaccine: 1, flu: 3

7 death: 2, cancer: 4, drug: 3, public: 4, health: 3, director: 2

8 medical: 2, cost: 3, increase: 2, patient: 2, health: 3, care: 1
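
The word-frequency pairs of Table 1.2 can be compared directly. The sketch below measures cosine similarity between articles and then groups articles that are linked, directly or through other articles, by a nonzero similarity. This grouping (connected components of a similarity graph) is a simplistic stand-in chosen for illustration, not one of the clustering algorithms presented later in the book, and the function names are hypothetical.

```python
import math

# Word-frequency pairs from Table 1.2.
articles = {
    1: {"dollar": 1, "industry": 4, "country": 2, "loan": 3, "deal": 2, "government": 2},
    2: {"machinery": 2, "labor": 3, "market": 4, "industry": 2, "work": 3, "country": 1},
    3: {"job": 5, "inflation": 3, "rise": 2, "jobless": 2, "market": 3, "country": 2, "index": 3},
    4: {"domestic": 3, "forecast": 2, "gain": 1, "market": 2, "sale": 3, "price": 2},
    5: {"patient": 4, "symptom": 2, "drug": 3, "health": 2, "clinic": 2, "doctor": 2},
    6: {"pharmaceutical": 2, "company": 3, "drug": 2, "vaccine": 1, "flu": 3},
    7: {"death": 2, "cancer": 4, "drug": 3, "public": 4, "health": 3, "director": 2},
    8: {"medical": 2, "cost": 3, "increase": 2, "patient": 2, "health": 3, "care": 1},
}


def cosine(a, b):
    """Cosine similarity between two sparse word-frequency vectors."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


def group_by_shared_words(docs):
    """Group documents linked, directly or indirectly, by nonzero similarity."""
    unvisited = set(docs)
    groups = []
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        group, frontier = {start}, [start]
        while frontier:
            current = frontier.pop()
            linked = {d for d in unvisited if cosine(docs[current], docs[d]) > 0}
            unvisited -= linked
            group |= linked
            frontier.extend(linked)
        groups.append(sorted(group))
    return groups


print(group_by_shared_words(articles))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The approach works here only because the two topics share no vocabulary at all. Articles 1 and 4 also share no words with each other and are joined only through articles 2 and 3, so real document collections call for more robust methods.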

Anomaly detection is the task of identifying observations whose

characteristics are significantly different from the rest of the data. Such

observations are known as anomalies or outliers. The goal of an anomaly

detection algorithm is to discover the real anomalies and avoid falsely labeling

normal objects as anomalous. In other words, a good anomaly detector must

have a high detection rate and a low false alarm rate. Applications of anomaly

detection include the detection of fraud, network intrusions, unusual patterns

of disease, and ecosystem disturbances, such as droughts, floods, fires,

hurricanes, etc.

Example 1.4 (Credit Card Fraud Detection).

A credit card company records the transactions made by every credit card

holder, along with personal information such as credit limit, age, annual

income, and address. Since the number of fraudulent cases is relatively

small compared to the number of legitimate transactions, anomaly

detection techniques can be applied to build a profile of legitimate

transactions for the users. When a new transaction arrives, it is compared

against the profile of the user. If the characteristics of the transaction are

very different from the previously created profile, then the transaction is

flagged as potentially fraudulent.
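
A minimal sketch of the profile idea in Example 1.4 follows. It assumes, purely for illustration, that a user's profile is the mean and standard deviation of past transaction amounts and that a new transaction is "very different" when it lies more than three standard deviations from the mean. The purchase history, the threshold, and the function names are hypothetical, and real systems use many more attributes than the amount.

```python
import statistics


def build_profile(amounts):
    """Summarize a user's past legitimate transaction amounts."""
    return {"mean": statistics.mean(amounts), "stdev": statistics.stdev(amounts)}


def is_suspicious(amount, profile, threshold=3.0):
    """Flag a transaction whose amount is far from the user's usual spending."""
    z_score = abs(amount - profile["mean"]) / profile["stdev"]
    return z_score > threshold


# Hypothetical purchase history for one card holder.
history = [42.0, 38.5, 51.0, 47.25, 40.0, 44.75, 39.5, 49.0]
profile = build_profile(history)

print(is_suspicious(45.0, profile))   # False
print(is_suspicious(900.0, profile))  # True
```

Lowering the threshold raises the detection rate but also the false alarm rate, which is the trade-off described in the anomaly detection paragraph above.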

1.5 Scope and Organization of the Book

This book introduces the major principles and techniques used in data mining

from an algorithmic perspective. A study of these principles and techniques is

essential for developing a better understanding of how data mining technology

can be applied to various kinds of data. This book also serves as a starting

point for readers who are interested in doing research in this field.

We begin the technical discussion of this book with a chapter on data

(Chapter 2), which discusses the basic types of data, data quality,

preprocessing techniques, and measures of similarity and dissimilarity.

Although this material can be covered quickly, it provides an essential

foundation for data analysis. Chapters 3 and 4 cover classification.

Chapter 3 provides a foundation by discussing decision tree classifiers and

several issues that are important to all classification: overfitting, underfitting,

model selection, and performance evaluation. Using this foundation, Chapter

4 describes a number of other important classification techniques: rule-

based systems, nearest neighbor classifiers, Bayesian classifiers, artificial

neural networks (including deep learning), support vector machines, and

ensemble classifiers, which are collections of classifiers. The multiclass and

imbalanced class problems are also discussed. These topics can be covered

independently.

Association analysis is explored in Chapters 5 and 6. Chapter 5

describes the basics of association analysis: frequent itemsets, association

rules, and some of the algorithms used to generate them. Specific types of

frequent itemsets—maximal, closed, and hyperclique—that are important for

data mining are also discussed, and the chapter concludes with a discussion

of evaluation measures for association analysis. Chapter 6 considers a

variety of more advanced topics, including how association analysis can be

applied to categorical and continuous data or to data that has a concept

hierarchy. (A concept hierarchy is a hierarchical categorization of objects, e.g.,

store items→clothing→shoes→sneakers.) This chapter also

describes how association analysis can be extended to find sequential

patterns (patterns involving order), patterns in graphs, and negative

relationships (if one item is present, then the other is not).

Cluster analysis is discussed in Chapters 7 and 8. Chapter 7 first

describes the different types of clusters, and then presents three specific

clustering techniques: K-means, agglomerative hierarchical clustering, and

DBSCAN. This is followed by a discussion of techniques for validating the

results of a clustering algorithm. Additional clustering concepts and

techniques are explored in Chapter 8, including fuzzy and probabilistic

clustering, Self-Organizing Maps (SOM), graph-based clustering, spectral

clustering, and density-based clustering. There is also a discussion of

scalability issues and factors to consider when selecting a clustering

algorithm.

Chapter 9 is on anomaly detection. After some basic definitions, several

different types of anomaly detection are considered: statistical, distance-

based, density-based, clustering-based, reconstruction-based, one-class

classification, and information theoretic. The last chapter, Chapter 10,

supplements the discussions in the other chapters with a discussion of the

statistical concepts important for avoiding spurious results, and then

discusses those concepts in the context of data mining techniques studied in

the previous chapters. These techniques include statistical hypothesis testing,

p-values, the false discovery rate, and permutation testing. Appendices A

through F give a brief review of important topics that are used in portions of

the book: linear algebra, dimensionality reduction, statistics, regression,

optimization, and scaling up data mining techniques for big data.

The subject of data mining, while relatively young compared to statistics or

machine learning, is already too large to cover in a single book. Selected

references to topics that are only briefly covered, such as data quality, are

provided in the Bibliographic Notes section of the appropriate chapter.

References to topics not covered in this book, such as mining streaming data

and privacy-preserving data mining, are provided in the Bibliographic Notes of

this chapter.

1.6 Bibliographic Notes

The topic of data mining has inspired many textbooks. Introductory textbooks

include those by Dunham [16], Han et al. [29], Hand et al. [31], Roiger and

Geatz [50], Zaki and Meira [61], and Aggarwal [2]. Data mining books with a

stronger emphasis on business applications include the works by Berry and

Linoff [5], Pyle [47], and Parr Rud [45]. Books with an emphasis on statistical

learning include those by Cherkassky and Mulier [11], and Hastie et al. [32].

Similar books with an emphasis on machine learning or pattern recognition

are those by Duda et al. [15], Kantardzic [34], Mitchell [43], Webb [57], and

Witten and Frank [58]. There are also some more specialized books:

Chakrabarti [9] (web mining), Fayyad et al. [20] (collection of early articles on

data mining), Fayyad et al. [18] (visualization), Grossman et al. [25] (science

and engineering), Kargupta and Chan [35] (distributed data mining), Wang et

al. [56] (bioinformatics), and Zaki and Ho [60] (parallel data mining).

There are several conferences related to data mining. Some of the main

conferences dedicated to this field include the ACM SIGKDD International

Conference on Knowledge Discovery and Data Mining (KDD), the IEEE

International Conference on Data Mining (ICDM), the SIAM International

Conference on Data Mining (SDM), the European Conference on Principles

and Practice of Knowledge Discovery in Databases (PKDD), and the Pacific-

Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Data

mining papers can also be found in other major conferences such as the

Conference and Workshop on Neural Information Processing Systems

(NIPS), the International Conference on Machine Learning (ICML), the ACM

SIGMOD/PODS conference, the International Conference on Very Large Data

Bases (VLDB), the Conference on Information and Knowledge Management

(CIKM), the International Conference on Data Engineering (ICDE), the

National Conference on Artificial Intelligence (AAAI), the IEEE International

Conference on Big Data, and the IEEE International Conference on Data

Science and Advanced Analytics (DSAA).

Journal publications on data mining include IEEE Transactions on Knowledge

and Data Engineering, Data Mining and Knowledge Discovery, Knowledge

and Information Systems, ACM Transactions on Knowledge Discovery from

Data, Statistical Analysis and Data Mining, and Information Systems. Various

open-source data mining software packages are available, including Weka [27]

and Scikit-learn [46]. More recently, data mining software such as Apache

Mahout and Apache Spark has been developed for large-scale problems on

distributed computing platforms.

There have been a number of general articles on data mining that define the

field or its relationship to other fields, particularly statistics. Fayyad et al. [19]

describe data mining and how it fits into the total knowledge discovery

process. Chen et al. [10] give a database perspective on data mining.

Ramakrishnan and Grama [48] provide a general discussion of data mining

and present several viewpoints. Hand [30] describes how data mining differs

from statistics, as does Friedman [21]. Lambert [40] explores the use of

statistics for large data sets and provides some comments on the respective

roles of data mining and statistics. Glymour et al. [23] consider the lessons

that statistics may have for data mining. Smyth et al. [53] describe how the

evolution of data mining is being driven by new types of data and applications,

such as those involving streams, graphs, and text. Han et al. [28] consider

emerging applications in data mining and Smyth [52] describes some

research challenges in data mining. Wu et al. [59] discuss how developments

in data mining research can be turned into practical tools. Data mining

standards are the subject of a paper by Grossman et al. [24]. Bradley [7]

discusses how data mining algorithms can be scaled to large data sets.

The emergence of new data mining applications has produced new

challenges that need to be addressed. For instance, concerns about privacy

breaches as a result of data mining have escalated in recent years,

particularly in application domains such as web commerce and health care.

As a result, there is growing interest in developing data mining algorithms that

maintain user privacy. Developing techniques for mining encrypted or

randomized data is known as privacy-preserving data mining. Some

general references in this area include papers by Agrawal and Srikant [3],

Clifton et al. [12] and Kargupta et al. [36]. Verykios et al. [55] provide a survey.

Another area of concern is the bias in predictive models that may be used for

some applications, e.g., screening job applicants or deciding prison parole

[39]. Assessing whether such applications are producing biased results is

made more difficult by the fact that the predictive models used for such

applications are often black box models, i.e., models that are not interpretable

in any straightforward way.

Data science, its constituent fields, and more generally, the new paradigm of

knowledge discovery they represent [33], have great potential, some of which

has been realized. However, it is important to emphasize that data science

works mostly with observational data, i.e., data that was collected by various

organizations as part of their normal operation. The consequence of this is

that sampling biases are common and the determination of causal factors

becomes more problematic. For this and a number of other reasons, it is often

hard to interpret the predictive models built from this data [42, 49]. Thus,

theory, experimentation and computational simulations will continue to be the

methods of choice in many areas, especially those related to science.

More importantly, a purely data-driven approach often ignores the existing

knowledge in a particular field. Such models may perform poorly, for example,

predicting impossible outcomes or failing to generalize to new situations.

However, if the model does work well, e.g., has high predictive accuracy, then

this approach may be sufficient for practical purposes in some fields. But in

many areas, such as medicine and science, gaining insight into the underlying

domain is often the goal. Some recent work attempts to address these issues

in order to create theory-guided data science, which takes pre-existing domain

knowledge into account [17, 37].

Recent years have witnessed a growing number of applications that rapidly

generate continuous streams of data. Examples of stream data include

network traffic, multimedia streams, and stock prices. Several issues must be

considered when mining data streams, such as the limited amount of memory

available, the need for online analysis, and the change of the data over time.

Mining stream data has become an important area of data mining.

Some selected publications are Domingos and Hulten [14] (classification),

Giannella et al. [22] (association analysis), Guha et al. [26] (clustering), Kifer

et al. [38] (change detection), Papadimitriou et al. [44] (time series), and Law

et al. [41] (dimensionality reduction).

Another area of interest is recommender and collaborative filtering systems [1,

6, 8, 13, 54], which suggest movies, television shows, books, products, etc.

that a person might like. In many cases, this problem, or at least a component

of it, is treated as a prediction problem and thus, data mining techniques can

be applied [4, 51].

Bibliography

[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of

recommender systems: A survey of the state-of-the-art and possible

extensions. IEEE transactions on knowledge and data engineering,

17(6):734–749, 2005.

[2] C. Aggarwal. Data Mining: The Textbook. Springer, 2015.

[3] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proc. of

2000 ACM SIGMOD Intl. Conf. on Management of Data, pages 439–450,

Dallas, Texas, 2000. ACM Press.

[4] X. Amatriain and J. M. Pujol. Data mining methods for recommender

systems. In Recommender Systems Handbook, pages 227–262. Springer,

2015.

[5] M. J. A. Berry and G. Linoff. Data Mining Techniques: For Marketing,

Sales, and Customer Relationship Management. Wiley Computer

Publishing, 2nd edition, 2004.

[6] J. Bobadilla, F. Ortega, A. Hernando, and A. Gutiérrez. Recommender

systems survey. Knowledge-based systems, 46:109–132, 2013.

[7] P. S. Bradley, J. Gehrke, R. Ramakrishnan, and R. Srikant. Scaling mining

algorithms to large databases. Communications of the ACM, 45(8):38–43,

2002.

[8] R. Burke. Hybrid recommender systems: Survey and experiments. User

modeling and user-adapted interaction, 12(4):331–370, 2002.

[9] S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext

Data. Morgan Kaufmann, San Francisco, CA, 2003.

[10] M.-S. Chen, J. Han, and P. S. Yu. Data Mining: An Overview from a

Database Perspective. IEEE Transactions on Knowledge and Data

Engineering, 8(6):866–883, 1996.

[11] V. Cherkassky and F. Mulier. Learning from Data: Concepts, Theory, and

Methods. Wiley-IEEE Press, 2nd edition, 1998.

[12] C. Clifton, M. Kantarcioglu, and J. Vaidya. Defining privacy for data

mining. In National Science Foundation Workshop on Next Generation

Data Mining, pages 126–133, Baltimore, MD, November 2002.

[13] C. Desrosiers and G. Karypis. A comprehensive survey of neighborhood-

based recommendation methods. Recommender systems handbook,

pages 107–144, 2011.

[14] P. Domingos and G. Hulten. Mining high-speed data streams. In Proc. of

the 6th Intl. Conf. on Knowledge Discovery and Data Mining, pages 71–80,

Boston, Massachusetts, 2000. ACM Press.

[15] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley

& Sons, Inc., New York, 2nd edition, 2001.

[16] M. H. Dunham. Data Mining: Introductory and Advanced Topics. Prentice

Hall, 2006.

[17] J. H. Faghmous, A. Banerjee, S. Shekhar, M. Steinbach, V. Kumar, A. R.

Ganguly, and N. Samatova. Theory-guided data science for climate

change. Computer, 47(11):74–78, 2014.

[18] U. M. Fayyad, G. G. Grinstein, and A. Wierse, editors. Information

Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann

Publishers, San Francisco, CA, September 2001.

[19] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From Data Mining to

Knowledge Discovery: An Overview. In Advances in Knowledge Discovery

and Data Mining, pages 1–34. AAAI Press, 1996.

[20] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy,

editors. Advances in Knowledge Discovery and Data Mining. AAAI/MIT

Press, 1996.

[21] J. H. Friedman. Data Mining and Statistics: What’s the Connection?

Unpublished. www-stat.stanford.edu/~jhf/ftp/dm-stat.ps, 1997.

[22] C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu. Mining Frequent

Patterns in Data Streams at Multiple Time Granularities. In H. Kargupta, A.

Joshi, K. Sivakumar, and Y. Yesha, editors, Next Generation Data Mining,

pages 191–212. AAAI/MIT, 2003.

[23] C. Glymour, D. Madigan, D. Pregibon, and P. Smyth. Statistical Themes

and Lessons for Data Mining. Data Mining and Knowledge Discovery,

1(1):11–28, 1997.

[24] R. L. Grossman, M. F. Hornick, and G. Meyer. Data mining standards

initiatives. Communications of the ACM, 45(8):59–61, 2002.

[25] R. L. Grossman, C. Kamath, P. Kegelmeyer, V. Kumar, and R. Namburu,

editors. Data Mining for Scientific and Engineering Applications. Kluwer

Academic Publishers, 2001.

[26] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O’Callaghan.

Clustering Data Streams: Theory and Practice. IEEE Transactions on

Knowledge and Data Engineering, 15(3):515–528, May/June 2003.

[27] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H.

Witten. The WEKA Data Mining Software: An Update. SIGKDD

Explorations, 11(1), 2009.

[28] J. Han, R. B. Altman, V. Kumar, H. Mannila, and D. Pregibon. Emerging

scientific applications in data mining. Communications of the ACM,

45(8):54–58, 2002.

[29] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques.

Morgan Kaufmann Publishers, San Francisco, 3rd edition, 2011.

[30] D. J. Hand. Data Mining: Statistics and More? The American Statistician,

52(2):112–118, 1998.

[31] D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT

Press, 2001.

[32] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical

Learning: Data Mining, Inference, Prediction. Springer, 2nd edition, 2009.

[33] T. Hey, S. Tansley, K. M. Tolle, et al. The fourth paradigm: data-intensive

scientific discovery, volume 1. Microsoft Research, Redmond, WA, 2009.

[34] M. Kantardzic. Data Mining: Concepts, Models, Methods, and Algorithms.

Wiley-IEEE Press, Piscataway, NJ, 2003.

[35] H. Kargupta and P. K. Chan, editors. Advances in Distributed and Parallel

Knowledge Discovery. AAAI Press, September 2002.

[36] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. On the Privacy

Preserving Properties of Random Data Perturbation Techniques. In Proc.

of the 2003 IEEE Intl. Conf. on Data Mining, pages 99–106, Melbourne,

Florida, December 2003. IEEE Computer Society.

[37] A. Karpatne, G. Atluri, J. Faghmous, M. Steinbach, A. Banerjee, A.

Ganguly, S. Shekhar, N. Samatova, and V. Kumar. Theory-guided Data

Science: A New Paradigm for Scientific Discovery from Data. IEEE

Transactions on Knowledge and Data Engineering, 2017.

[38] D. Kifer, S. Ben-David, and J. Gehrke. Detecting Change in Data

Streams. In Proc. of the 30th VLDB Conf., pages 180–191, Toronto,

Canada, 2004. Morgan Kaufmann.

[39] J. Kleinberg, J. Ludwig, and S. Mullainathan. A Guide to Solving Social

Problems with Machine Learning. Harvard Business Review, December

2016.

[40] D. Lambert. What Use is Statistics for Massive Data? In ACM SIGMOD

Workshop on Research Issues in Data Mining and Knowledge Discovery,

pages 54–62, 2000.

[41] M. H. C. Law, N. Zhang, and A. K. Jain. Nonlinear Manifold Learning for

Data Streams. In Proc. of the SIAM Intl. Conf. on Data Mining, Lake Buena

Vista, Florida, April 2004. SIAM.

[42] Z. C. Lipton. The mythos of model interpretability. arXiv preprint

arXiv:1606.03490, 2016.

[43] T. Mitchell. Machine Learning. McGraw-Hill, Boston, MA, 1997.

[44] S. Papadimitriou, A. Brockwell, and C. Faloutsos. Adaptive, unsupervised

stream mining. VLDB Journal, 13(3):222–239, 2004.

[45] O. Parr Rud. Data Mining Cookbook: Modeling Data for Marketing, Risk

and Customer Relationship Management. John Wiley & Sons, New York,

NY, 2001.

[46] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,

M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A.

Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-

learn: Machine Learning in Python. Journal of Machine Learning

Research, 12:2825–2830, 2011.

[47] D. Pyle. Business Modeling and Data Mining. Morgan Kaufmann, San

Francisco, CA, 2003.

[48] N. Ramakrishnan and A. Grama. Data Mining: From Serendipity to

Science—Guest Editors’ Introduction. IEEE Computer, 32(8):34–37, 1999.

[49] M. T. Ribeiro, S. Singh, and C. Guestrin. Why should I trust you?:

Explaining the predictions of any classifier. In Proceedings of the 22nd

ACM SIGKDD International Conference on Knowledge Discovery and

Data Mining, pages 1135–1144. ACM, 2016.

[50] R. Roiger and M. Geatz. Data Mining: A Tutorial Based Primer. Addison-

Wesley, 2002.

[51] J. Schafer. The Application of Data-Mining to Recommender Systems.

Encyclopedia of data warehousing and mining, 1:44–48, 2009.

[52] P. Smyth. Breaking out of the Black-Box: Research Challenges in Data

Mining. In Proc. of the 2001 ACM SIGMOD Workshop on Research Issues

in Data Mining and Knowledge Discovery, 2001.

[53] P. Smyth, D. Pregibon, and C. Faloutsos. Data-driven evolution of data

mining algorithms. Communications of the ACM, 45(8):33–37, 2002.

[54] X. Su and T. M. Khoshgoftaar. A survey of collaborative filtering

techniques. Advances in artificial intelligence, 2009:4, 2009.

[55] V. S. Verykios, E. Bertino, I. N. Fovino, L. P. Provenza, Y. Saygin, and Y.

Theodoridis. State-of-the-art in privacy preserving data mining. SIGMOD

Record, 33(1):50–57, 2004.

[56] J. T. L. Wang, M. J. Zaki, H. Toivonen, and D. E. Shasha, editors. Data

Mining in Bioinformatics. Springer, September 2004.

[57] A. R. Webb. Statistical Pattern Recognition. John Wiley & Sons, 2nd

edition, 2002.

[58] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools

and Techniques. Morgan Kaufmann, 3rd edition, 2011.

[59] X. Wu, P. S. Yu, and G. Piatetsky-Shapiro. Data Mining: How Research

Meets Practical Development? Knowledge and Information Systems,

5(2):248–261, 2003.

[60] M. J. Zaki and C.-T. Ho, editors. Large-Scale Parallel Data Mining.

Springer, September 2002.

[61] M. J. Zaki and W. Meira Jr. Data Mining and Analysis: Fundamental

Concepts and Algorithms. Cambridge University Press, New York, 2014.

1.7 Exercises

1. Discuss whether or not each of the following activities is a data mining task.

a. Dividing the customers of a company according to their gender.

b. Dividing the customers of a company according to their profitability.

c. Computing the total sales of a company.

d. Sorting a student database based on student identification numbers.

e. Predicting the outcomes of tossing a (fair) pair of dice.

f. Predicting the future stock price of a company using historical records.

g. Monitoring the heart rate of a patient for abnormalities.

h. Monitoring seismic waves for earthquake activities.

i. Extracting the frequencies of a sound wave.

2. Suppose that you are employed as a data mining consultant for an Internet

search engine company. Describe how data mining can help the company by

giving specific examples of how techniques, such as clustering, classification,

association rule mining, and anomaly detection can be applied.

3. For each of the following data sets, explain whether or not data privacy is

an important issue.

a. Census data collected from 1900–1950.

b. IP addresses and visit times of web users who visit your website.

c. Images from Earth-orbiting satellites.

d. Names and addresses of people from the telephone book.

e. Names and email addresses collected from the Web.

2 Data

This chapter discusses several data-related issues that

are important for successful data mining:

The Type of Data

Data sets differ in a number of ways. For example, the

attributes used to describe data objects can be of different types—quantitative

or qualitative—and data sets often have special characteristics; e.g., some

data sets contain time series or objects with explicit relationships to one

another. Not surprisingly, the type of data determines which tools and

techniques can be used to analyze the data. Indeed, new research in data

mining is often driven by the need to accommodate new application areas and

their new types of data.

The Quality of the Data

Data is often far from perfect. While most data

mining techniques can tolerate some level of imperfection in the data, a focus

on understanding and improving data quality typically improves the quality of

the resulting analysis. Data quality issues that often need to be addressed

include the presence of noise and outliers; missing, inconsistent, or duplicate

data; and data that is biased or, in some other way, unrepresentative of the

phenomenon or population that the data is supposed to describe.

Preprocessing Steps to Make the Data More Suitable for Data Mining

Often, the raw data must be processed in order to make it suitable for

analysis. While one objective may be to improve data quality, other goals

focus on modifying the data so that it better fits a specified data mining

technique or tool. For example, a continuous attribute, e.g., length, sometimes

needs to be transformed into an attribute with discrete categories, e.g., short,

medium, or long, in order to apply a particular technique. As another example,

the number of attributes in a data set is often reduced because many

techniques are more effective when the data has a relatively small number of

attributes.
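The discretization step mentioned above can be sketched in a few lines. This is a minimal illustration, not a recommended procedure: the cut points 10.0 and 20.0 and the names `CUT_POINTS`, `LABELS`, and `discretize` are assumptions introduced here, and in practice the cut points would be chosen from the data or from domain knowledge.

```python
from bisect import bisect_right

# Hypothetical cut points (arbitrary length units), chosen only for illustration.
CUT_POINTS = [10.0, 20.0]
LABELS = ["short", "medium", "long"]


def discretize(length: float) -> str:
    """Map a continuous length to one of three ordered categories."""
    # bisect_right counts how many cut points are <= length, which is
    # exactly the index of the category that the value falls into.
    return LABELS[bisect_right(CUT_POINTS, length)]


lengths = [3.2, 10.0, 14.7, 25.1]
categories = [discretize(x) for x in lengths]
print(categories)
```

With these assumed cut points, a value equal to a cut point is placed in the higher category; using `bisect_left` instead would place it in the lower one.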

Analyzing Data in Terms of Its Relationships

One approach to data

analysis is to find relationships among the data objects and then perform the

remaining analysis using these relationships rather than the data objects

themselves. For instance, we can compute the similarity or distance between

pairs of objects and then perform the analysis—clustering, classification, or

anomaly detection—based on these similarities or distances. There are many

such similarity or distance measures, and the proper choice depends on the

type of data and the particular application.
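As a small sketch of this idea, the code below computes all pairwise distances once and then answers a nearest-neighbor query from the distance matrix alone, without consulting the original objects again. Euclidean distance is used only as one possible choice, and the function names and the three sample points are illustrative assumptions.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_matrix(objects):
    """Return the full matrix of pairwise distances."""
    n = len(objects)
    return [[euclidean(objects[i], objects[j]) for j in range(n)] for i in range(n)]


def nearest_neighbor(D, i):
    """Index of the object closest to object i, using only the matrix D."""
    return min((j for j in range(len(D)) if j != i), key=lambda j: D[i][j])


objects = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]  # made-up data objects
D = distance_matrix(objects)
print(D[0][1], D[0][2], nearest_neighbor(D, 0))
```

Swapping `euclidean` for another measure changes the analysis without touching `nearest_neighbor`, which is the practical appeal of working with relationships rather than raw objects.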

Example 2.1 (An Illustration of Data-Related

Issues).

To further illustrate the importance of these issues, consider the following

hypothetical situation. You receive an email from a medical researcher

concerning a project that you are eager to work on.

Hi,

I’ve attached the data file that I mentioned in my previous email. Each line contains the

information for a single patient and consists of five fields. We want to predict the last field using

the other fields. I don’t have time to provide any more information about the data since I’m going

out of town for a couple of days, but hopefully that won’t slow you down too much. And if you

don’t mind, could we meet when I get back to discuss your preliminary results? I might invite a

few other members of my team.

Thanks and see you in a couple of days.

Despite some misgivings, you proceed to analyze the data. The first few rows

of the file are as follows:

012 232 33.5 0 10.7

020 121 16.9 2 210.1

027 165 24.0 0 427.6

⋮

A brief look at the data reveals nothing strange. You put your doubts aside

and start the analysis. There are only 1000 lines, a smaller data file than you

had hoped for, but two days later, you feel that you have made some

progress. You arrive for the meeting, and while waiting for others to arrive, you

strike up a conversation with a statistician who is working on the project.

When she learns that you have also been analyzing the data from the project,

she asks if you would mind giving her a brief overview of your results.

Statistician: So, you got the data for all the patients?

Data Miner: Yes. I haven’t had much time for analysis, but I do have a

few interesting results.

Statistician: Amazing. There were so many data issues with this set of

patients that I couldn’t do much.

Data Miner: Oh? I didn’t hear about any possible problems.

Statistician: Well, first there is field 5, the variable we want to predict.

It’s common knowledge among people who analyze this type of data

that results are better if you work with the log of the values, but I didn’t

discover this until later. Was it mentioned to you?

Data Miner: No.

Statistician: But surely you heard about what happened to field 4? It’s

supposed to be measured on a scale from 1 to 10, with 0 indicating a

missing value, but because of a data entry error, all 10’s were changed

into 0’s. Unfortunately, since some of the patients have missing values

for this field, it’s impossible to say whether a 0 in this field is a real 0 or

a 10. Quite a few of the records have that problem.

Data Miner: Interesting. Were there any other problems?

Statistician: Yes, fields 2 and 3 are basically the same, but I assume

that you probably noticed that.

Data Miner: Yes, but these fields were only weak predictors of field 5.

Statistician: Anyway, given all those problems, I’m surprised you were

able to accomplish anything.

Data Miner: True, but my results are really quite good. Field 1 is a very

strong predictor of field 5. I’m surprised that this wasn’t noticed before.

Statistician: What? Field 1 is just an identification number.

Data Miner: Nonetheless, my results speak for themselves.

Statistician: Oh, no! I just remembered. We assigned ID numbers after

we sorted the records based on field 5. There is a strong connection,

but it’s meaningless. Sorry.

Although this scenario represents an extreme situation, it emphasizes the

importance of “knowing your data.” To that end, this chapter will address each

of the four issues mentioned above, outlining some of the basic challenges

and standard approaches.
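The ID-number pitfall in the dialogue is easy to reproduce. The sketch below uses synthetic values standing in for field 5 (the real data is not available, so the numbers, the seed, and the helper names are all assumptions) and compares IDs assigned after sorting with IDs assigned in arrival order, using a tie-free Spearman rank correlation written out by hand.

```python
import random


def ranks(values):
    """Rank of each value (0 = smallest); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order):
        result[i] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation, valid only for tie-free data."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


rng = random.Random(0)
field5 = [rng.uniform(0, 500) for _ in range(1000)]  # synthetic target values

# Case 1: records are sorted by field 5 first, then numbered (as in the story).
sorted_field5 = sorted(field5)
ids_after_sort = list(range(1, len(sorted_field5) + 1))
rho_sorted = spearman(ids_after_sort, sorted_field5)

# Case 2: records are numbered in arrival order, unrelated to field 5.
ids_arrival = list(range(1, len(field5) + 1))
rho_arrival = spearman(ids_arrival, field5)

print(rho_sorted, rho_arrival)
```

In the first case the rank correlation is exactly 1.0 because the ID merely encodes the sort position of the target; in the second it stays close to zero. A "perfect predictor" of this kind is a prompt to ask how the attribute was generated.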

2.1 Types of Data

A data set can often be viewed as a collection of data objects. Other names

for a data object are record, point, vector, pattern, event, case, sample,

instance, observation, or entity. In turn, data objects are described by a

number of attributes that capture the characteristics of an object, such as the

mass of a physical object or the time at which an event occurred. Other

names for an attribute are variable, characteristic, field, feature, or dimension.

Example 2.2 (Student Information).

Often, a data set is a file, in which the objects are records (or rows) in the

file and each field (or column) corresponds to an attribute. For example,

Table 2.1 shows a data set that consists of student information. Each

row corresponds to a student and each column is an attribute that

describes some aspect of a student, such as grade point average (GPA) or

identification number (ID).

Table 2.1. A sample data set containing student information.

Student ID Year Grade Point Average (GPA) …

⋮

1034262 Senior 3.24 …

1052663 Freshman 3.51 …

1082246 Sophomore 3.62 …

Although record-based data sets are common, either in flat files or relational

database systems, there are other important types of data sets and systems

for storing data. In Section 2.1.2, we will discuss some of the types of data

sets that are commonly encountered in data mining. However, we first

consider attributes.

2.1.1 Attributes and Measurement

In this section, we consider the types of attributes used to describe data

objects. We first define an attribute, then consider what we mean by the type

of an attribute, and finally describe the types of attributes that are commonly

encountered.

What Is an Attribute?

We start with a more detailed definition of an attribute.

Definition 2.1.

An attribute is a property or characteristic of an object that can

vary, either from one object to another or from one time to

another.

For example, eye color varies from person to person, while the temperature of

an object varies over time. Note that eye color is a symbolic attribute with a

small number of possible values {brown, black, blue, green, hazel, etc.}, while

temperature is a numerical attribute with a potentially unlimited number of

values.

At the most basic level, attributes are not about numbers or symbols.

However, to discuss and more precisely analyze the characteristics of objects,

we assign numbers or symbols to them. To do this in a well-defined way, we

need a measurement scale.

Definition 2.2.

A measurement scale is a rule (function) that associates a

numerical or symbolic value with an attribute of an object.

Formally, the process of measurement is the application of a measurement

scale to associate a value with a particular attribute of a specific object. While

this may seem a bit abstract, we engage in the process of measurement all

the time. For instance, we step on a bathroom scale to determine our weight,

we classify someone as male or female, or we count the number of chairs in a

room to see if there will be enough to seat all the people coming to a meeting.

In all these cases, the “physical value” of an attribute of an object is mapped

to a numerical or symbolic value.

With this background, we can discuss the type of an attribute, a concept that

is important in determining if a particular data analysis technique is consistent

with a specific type of attribute.

The Type of an Attribute

It is common to refer to the type of an attribute as the type of a measurement

scale. It should be apparent from the previous discussion that an attribute can

be described using different measurement scales and that the properties of an

attribute need not be the same as the properties of the values used to

measure it. In other words, the values used to represent an attribute can have

properties that are not properties of the attribute itself, and vice versa. This is

illustrated with two examples.

Example 2.3 (Employee Age and ID Number).

Two attributes that might be associated with an employee are ID and age

(in years). Both of these attributes can be represented as integers.

However, while it is reasonable to talk about the average age of an

employee, it makes no sense to talk about the average employee ID.

Indeed, the only aspect of employees that we want to capture with the ID

attribute is that they are distinct. Consequently, the only valid operation for

employee IDs is to test whether they are equal. There is no hint of this

limitation, however, when integers are used to represent the employee ID

attribute. For the age attribute, the properties of the integers used to

represent age are very much the properties of the attribute. Even so, the

correspondence is not complete because, for example, ages have a

maximum, while integers do not.

Example 2.4 (Length of Line Segments).

Consider Figure 2.1, which shows some objects (line segments) and

how the length attribute of these objects can be mapped to numbers in two

different ways. Each successive line segment, going from the top to the

bottom, is formed by appending the topmost line segment to itself. Thus,

the second line segment from the top is formed by appending the topmost

line segment to itself twice, the third line segment from the top is formed by

appending the topmost line segment to itself three times, and so forth. In a

very real (physical) sense, all the line segments are multiples of the first.

This fact is captured by the measurements on the right side of the figure,

but not by those on the left side. More specifically, the measurement scale

on the left side captures only the ordering of the length attribute, while the

scale on the right side captures both the ordering and additivity properties.

Thus, an attribute can be measured in a way that does not capture all the

properties of the attribute.

Figure 2.1.

The measurement of the length of line segments on two different scales of

measurement.

Knowing the type of an attribute is important because it tells us which

properties of the measured values are consistent with the underlying

properties of the attribute, and therefore, it allows us to avoid foolish actions,

such as computing the average employee ID.
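A short sketch makes the point about employee IDs concrete. The ages, IDs, and the relabeling below are made-up values for illustration only. Reassigning the IDs with a one-to-one mapping leaves every equality test unchanged, yet the "average ID" changes arbitrarily, which is why it carries no information.

```python
from statistics import mean

ages = [25, 32, 47]        # made-up employee ages, in years
ids = [1001, 1002, 1003]   # made-up employee IDs

# Hypothetical one-to-one relabeling of the IDs.
relabel = {1001: 7, 1002: 90210, 1003: 42}
new_ids = [relabel[i] for i in ids]

# The only property IDs are meant to capture, distinctness, is preserved.
same_before = [[a == b for b in ids] for a in ids]
same_after = [[a == b for b in new_ids] for a in new_ids]
distinctness_preserved = same_before == same_after

# The mean of the IDs depends entirely on the arbitrary labels chosen.
print(distinctness_preserved, mean(ids), mean(new_ids))

# The mean age, in contrast, describes the employees themselves.
print(mean(ages))
```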

The Different Types of Attributes

A useful (and simple) way to specify the type of an attribute is to identify the

properties of numbers that correspond to underlying properties of the attribute.

For example, an attribute such as length has many of the properties of

numbers. It makes sense to compare and order objects by length, as well as

to talk about the differences and ratios of length. The following properties

(operations) of numbers are typically used to describe attributes.

1. Distinctness: = and ≠

2. Order: <, ≤, >, and ≥

3. Addition: + and −

4. Multiplication: × and /

Given these properties, we can define four types of attributes: nominal,

ordinal, interval, and ratio. Table 2.2 gives the definitions of these types,

along with information about the statistical operations that are valid for each

type. Each attribute type possesses all of the properties and operations of the

attribute types above it. Consequently, any property or operation that is valid

for nominal, ordinal, and interval attributes is also valid for ratio attributes. In

other words, the definition of the attribute types is cumulative. However, this

does not mean that the statistical operations appropriate for one attribute type

are appropriate for the attribute types above it.

Table 2.2. Different attribute types.

Attribute Type: Nominal (Categorical, Qualitative)
Description: The values of a nominal attribute are just different names; i.e., nominal values provide only enough information to distinguish one object from another. (=, ≠)
Examples: zip codes, employee ID numbers, eye color, gender
Operations: mode, entropy, contingency correlation, χ² test

Attribute Type: Ordinal (Categorical, Qualitative)
Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
Examples: hardness of minerals, {good, better, best}, grades, street numbers
Operations: median, percentiles, rank correlation, run tests, sign tests

Attribute Type: Interval (Numeric, Quantitative)
Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −)
Examples: calendar dates, temperature in Celsius or Fahrenheit
Operations: mean, standard deviation, Pearson’s correlation, t and F tests

Attribute Type: Ratio (Numeric, Quantitative)
Description: For ratio variables, both differences and ratios are meaningful. (×, /)
Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
Operations: geometric mean, harmonic mean, percent variation

Nominal and ordinal attributes are collectively referred to as categorical or

qualitative attributes. As the name suggests, qualitative attributes, such as

employee ID, lack most of the properties of numbers. Even if they are

represented by numbers, i.e., integers, they should be treated more like

symbols. The remaining two types of attributes, interval and ratio, are

collectively referred to as quantitative or numeric attributes. Quantitative

attributes are represented by numbers and have most of the properties of

numbers. Note that quantitative attributes can be integer-valued or

continuous.
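The cumulative definition of the four attribute types can be expressed as a small lookup. This is only a sketch of the classification described above: the class name `AttributeType`, the helper functions, and the ASCII spellings of the operators (`!=`, `<=`, `>=`, `-`, `*`) are choices made here for illustration.

```python
from enum import IntEnum


class AttributeType(IntEnum):
    """The four attribute types, ordered from fewest to most properties."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Operations that become meaningful at each level.
NEW_OPERATIONS = {
    AttributeType.NOMINAL: {"=", "!="},
    AttributeType.ORDINAL: {"<", "<=", ">", ">="},
    AttributeType.INTERVAL: {"+", "-"},
    AttributeType.RATIO: {"*", "/"},
}


def valid_operations(attr_type):
    """All operations valid for attr_type: its own plus those of lower levels."""
    ops = set()
    for level in AttributeType:
        if level <= attr_type:
            ops |= NEW_OPERATIONS[level]
    return ops


def is_categorical(attr_type):
    """Nominal and ordinal attributes are categorical (qualitative)."""
    return attr_type <= AttributeType.ORDINAL


print(sorted(valid_operations(AttributeType.INTERVAL)))
print(is_categorical(AttributeType.ORDINAL), is_categorical(AttributeType.RATIO))
```

An analysis routine could consult `valid_operations` before, say, averaging a column, and refuse when the attribute is nominal or ordinal.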

The types of attributes can also be described in terms of transformations that

do not change the meaning of an attribute. Indeed, S. Smith Stevens, the

psychologist who originally defined the types of attributes shown in Table

2.2, defined them in terms of these permissible transformations. For

example, the meaning of a length attribute is unchanged if it is measured in

meters instead of feet.

The statistical operations that make sense for a particular type of attribute are

those that will yield the same results when the attribute is transformed by

using a transformation that preserves the attribute’s meaning. To illustrate, the

average length of a set of objects is different when measured in meters rather

than in feet, but both averages represent the same length. Table 2.3 shows

the meaning-preserving transformations for the four attribute types of Table

2.2.
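The following sketch checks this invariance idea numerically on made-up data. For a ratio attribute (length), the mean computed in meters equals the converted mean computed in feet. For an ordinal attribute, two equally valid codings of good, better, best agree on which group has the higher median but disagree on which has the higher mean, so the mean is not a meaningful statistic there. The groups and variable names are illustrative assumptions; the two codings are the ones mentioned for ordinal attributes in this section.

```python
import math
from statistics import mean, median

METERS_PER_FOOT = 0.3048  # definition of the international foot

# Ratio attribute: rescaling the values rescales the mean in the same way.
lengths_ft = [3.0, 5.0, 10.0]
lengths_m = [x * METERS_PER_FOOT for x in lengths_ft]
mean_matches = math.isclose(mean(lengths_m), mean(lengths_ft) * METERS_PER_FOOT)

# Ordinal attribute: two order-preserving codings of the same categories.
coding_a = {"good": 1, "better": 2, "best": 3}
coding_b = {"good": 0.5, "better": 1, "best": 10}

group_x = ["better", "better", "better"]  # made-up ratings
group_y = ["good", "good", "best"]


def x_exceeds_y(stat, coding):
    """Does group_x score higher than group_y under this statistic and coding?"""
    return stat([coding[v] for v in group_x]) > stat([coding[v] for v in group_y])


median_verdicts = (x_exceeds_y(median, coding_a), x_exceeds_y(median, coding_b))
mean_verdicts = (x_exceeds_y(mean, coding_a), x_exceeds_y(mean, coding_b))

print(mean_matches, median_verdicts, mean_verdicts)
```

The median comparison gives the same answer under both codings, while the mean comparison flips, which is why the median is listed as an ordinal operation and the mean only from the interval level upward.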

Table 2.3. Transformations that define attribute levels.

Attribute Type: Nominal (Categorical, Qualitative)
Transformation: Any one-to-one mapping, e.g., a permutation of values.
Comment: If all employee ID numbers are reassigned, it will not make any difference.

Attribute Type: Ordinal (Categorical, Qualitative)
Transformation: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function.
Comment: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Attribute Type: Interval (Numeric, Quantitative)
Transformation: new_value = a × old_value + b, a and b constants.
Comment: The Fahrenheit and Celsius temperature scales differ in the location of their zero value and the size of a degree (unit).

Attribute Type: Ratio (Numeric, Quantitative)
Transformation: new_value = a × old_value
Comment: Length can be measured in meters or feet.

Example 2.5 (Temperature Scales).

Temperature provides a good illustration of some of the concepts that have

been described. First, temperature can be either an interval or a ratio

attribute, depending on its measurement scale. When measured on the

Kelvin scale, a temperature of 2° is, in a physically meaningful way, twice

that of a temperature of 1°. This is not true when temperature is measured

on either the Celsius or Fahrenheit scales, because, physically, a

temperature of 1° Fahrenheit (Celsius) is not much different from a

temperature of 2° Fahrenheit (Celsius). The problem is that the zero points

of the Fahrenheit and Celsius scales are, in a physical sense, arbitrary,

and therefore, the ratio of two Celsius or Fahrenheit temperatures is not

physically meaningful.
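
The point of Example 2.5 can be checked numerically. The following is a minimal sketch, not part of the original text: the function names and the two sample temperatures are illustrative, and the Rankine scale (1.8 times the Kelvin value) is brought in only as an example of a pure rescaling of the form new_value = a × old_value.

```python
# Sketch: ratios survive a ratio-scale change (new = a * old) but not an
# interval-scale change (new = a * old + b). Names and values are illustrative.

def kelvin_to_rankine(k):
    """Pure rescaling: the zero point stays at absolute zero."""
    return 1.8 * k

def kelvin_to_celsius(k):
    """Shifts the zero point by 273.15, so ratios are not preserved."""
    return k - 273.15

t1_k, t2_k = 150.0, 300.0  # two temperatures in Kelvin (made-up values)

ratio_kelvin = t2_k / t1_k
ratio_rankine = kelvin_to_rankine(t2_k) / kelvin_to_rankine(t1_k)
ratio_celsius = kelvin_to_celsius(t2_k) / kelvin_to_celsius(t1_k)

# Differences remain meaningful on both kinds of scale.
diff_kelvin = t2_k - t1_k
diff_celsius = kelvin_to_celsius(t2_k) - kelvin_to_celsius(t1_k)

print(ratio_kelvin, ratio_rankine, ratio_celsius)
print(diff_kelvin, diff_celsius)
```

The Kelvin and Rankine ratios agree, while the Celsius ratio is a different number altogether (here it is even negative). The difference between the two temperatures is the same in Kelvin and in Celsius, because the two scales share the size of a degree.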

Describing Attributes by the Number of Values

An independent way of distinguishing between attributes is by the number of

values they can take.

Discrete A discrete attribute has a finite or countably infinite set of values.

Such attributes can be categorical, such as zip codes or ID numbers, or

numeric, such as counts. Discrete attributes are often represented using

integer variables. Binary attributes are a special case of discrete attributes

and assume only two values, e.g., true/false, yes/no, male/female, or 0/1.

Binary attributes are often represented as Boolean variables, or as integer

variables that only take the values 0 or 1.

Continuous A continuous attribute is one whose values are real numbers.

Examples include attributes such as temperature, height, or weight.

Continuous attributes are typically represented as floating-point variables.

Practically, real values can be measured and represented only with limited

precision.

In theory, any of the measurement scale types—nominal, ordinal, interval, and

ratio—could be combined with any of the types based on the number of

attribute values—binary, discrete, and continuous. However, some

combinations occur only infrequently or do not make much sense. For

instance, it is difficult to think of a realistic data set that contains a continuous

binary attribute. Typically, nominal and ordinal attributes are binary or discrete,

while interval and ratio attributes are continuous. However, count attributes,

which are discrete, are also ratio attributes.

Asymmetric Attributes

For asymmetric attributes, only presence—a non-zero attribute value—is

regarded as important. Consider a data set in which each object is a student

and each attribute records whether a student took a particular course at a

university. For a specific student, an attribute has a value of 1 if the student

took the course associated with that attribute and a value of 0 otherwise.

Because students take only a small fraction of all available courses, most of

the values in such a data set would be 0. Therefore, it is more meaningful and

more efficient to focus on the non-zero values. To illustrate, if students are

compared on the basis of the courses they don’t take, then most students

would seem very similar, at least if the number of courses is large. Binary

attributes where only non-zero values are important are called asymmetric

binary attributes. This type of attribute is particularly important for

association analysis, which is discussed in Chapter 5. It is also possible to

have discrete or continuous asymmetric features. For instance, if the number

of credits associated with each course is recorded, then the resulting data set

will consist of asymmetric discrete or continuous attributes.
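
The course-enrollment argument can be illustrated with a small sketch. The enrollment vectors below are invented, and the two function names are hypothetical. The second measure, which ignores attributes where both values are 0, is commonly known as the Jaccard coefficient.

```python
# Toy course-enrollment vectors (1 = took the course). The data is made up.
n_courses = 1000
alice = [0] * n_courses
bob = [0] * n_courses
for i in (3, 17, 42, 256):   # courses Alice took
    alice[i] = 1
for i in (5, 99, 300, 777):  # courses Bob took (no overlap with Alice)
    bob[i] = 1

def matching_fraction(x, y):
    """Fraction of attributes on which x and y agree, counting 0-0 matches."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def shared_presence_fraction(x, y):
    """Agreement over attributes where at least one of the two values is non-zero."""
    present = [(a, b) for a, b in zip(x, y) if a or b]
    if not present:
        return 0.0
    return sum(1 for a, b in present if a and b) / len(present)

print(matching_fraction(alice, bob))         # 0.992
print(shared_presence_fraction(alice, bob))  # 0.0
```

Two students with no course in common look 99.2% similar when the courses neither of them took are counted, and 0% similar when only non-zero values are considered.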

General Comments on Levels of Measurement

As described in the rest of this chapter, there are many diverse types of data.

The previous discussion of measurement scales, while useful, is not complete

and has some limitations. We provide the following comments and guidance.

Distinctness, order, and meaningful intervals and ratios are only four

properties of data—many others are possible. For instance, some data

is inherently cyclical, e.g., position on the surface of the Earth or time. As

another example, consider set valued attributes, where each attribute

value is a set of elements, e.g., the set of movies seen in the last year.

Define one set of elements (movies) to be greater (larger) than a second

set if the second set is a subset of the first. However, such a relationship

defines only a partial order that does not match any of the attribute types

just defined.

The numbers or symbols used to capture attribute values may not

capture all the properties of the attributes or may suggest properties

that are not there. An illustration of this for integers was presented in

Example 2.3, i.e., averages of IDs and out-of-range ages.

Data is often transformed for the purpose of analysis—see Section

2.3.7. This often changes the distribution of the observed variable to a

distribution that is easier to analyze, e.g., a Gaussian (normal) distribution.

Often, such transformations only preserve the order of the original values,

and other properties are lost. Nonetheless, if the desired outcome is a

statistical test of differences or a predictive model, such a transformation is

justified.

The final evaluation of any data analysis, including operations on

attributes, is whether the results make sense from a domain point of

view.
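
The set-valued attribute mentioned in the first comment above gives a compact illustration of a partial order. The sketch below uses placeholder movie titles and hypothetical variable names. It relies on Python's built-in set comparison, where `>` tests whether one set is a proper superset of another.

```python
# Set-valued attribute: movies seen in the last year (titles are placeholders).
ann = {"movie_a", "movie_b", "movie_c"}
ben = {"movie_a", "movie_b"}
cara = {"movie_d"}

# ann's set is "greater" than ben's because ben's set is a subset of it.
print(ann > ben)               # True

# ann and cara cannot be ordered either way: the order is only partial.
print(ann > cara, cara > ann)  # False False
```

Because some pairs of values are not comparable at all, such an attribute is not ordinal in the sense used earlier, even though an ordering relation exists.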

In summary, it can be challenging to determine which operations can be

performed on a particular attribute or a collection of attributes without

compromising the integrity of the analysis. Fortunately, established practice

often serves as a reliable guide. Occasionally, however, standard practices

are erroneous or have limitations.

2.1.2 Types of Data Sets

There are many types of data sets, and as the field of data mining develops

and matures, a greater variety of data sets become available for analysis. In

this section, we describe some of the most common types. For convenience,

we have grouped the types of data sets into three groups: record data, graph-

based data, and ordered data. These categories do not cover all possibilities

and other groupings are certainly possible.

General Characteristics of Data Sets

Before providing details of specific kinds of data sets, we discuss three

characteristics that apply to many data sets and have a significant impact on

the data mining techniques that are used: dimensionality, distribution, and

resolution.

Dimensionality

The dimensionality of a data set is the number of attributes that the objects in

the data set possess. Analyzing data with a small number of dimensions tends

to be qualitatively different from analyzing moderate or high-dimensional data.

Indeed, the difficulties associated with the analysis of high-dimensional data

are sometimes referred to as the curse of dimensionality. Because of this,

an important motivation in preprocessing the data is dimensionality

reduction. These issues are discussed in more depth later in this chapter and

in Appendix B.

Distribution

The distribution of a data set is the frequency of occurrence of various values

or sets of values for the attributes comprising data objects. Equivalently, the

distribution of a data set can be considered as a description of the

concentration of objects in various regions of the data space. Statisticians

have enumerated many types of distributions, e.g., Gaussian (normal), and

described their properties. (See Appendix C.) Although statistical approaches

for describing distributions can yield powerful analysis techniques, many data

sets have distributions that are not well captured by standard statistical

distributions.

As a result, many data mining algorithms do not assume a particular statistical

distribution for the data they analyze. However, some general aspects of

distributions often have a strong impact. For example, suppose a categorical

attribute is used as a class variable, where one of the categories occurs 95%

of the time, while the other categories together occur only 5% of the time. This

skewness in the distribution can make classification difficult as discussed in

Section 4.11. (Skewness has other impacts on data analysis that are not

discussed here.)

A special case of skewed data is sparsity. For sparse binary, count or

continuous data, most attributes of an object have values of 0. In many cases,

fewer than 1% of the values are non-zero. In practical terms, sparsity is an

advantage because usually only the non-zero values need to be stored and

manipulated. This results in significant savings with respect to computation

time and storage. Indeed, some data mining algorithms, such as the

association rule mining algorithms described in Chapter 5, work well only

for sparse data. Finally, note that often the attributes in sparse data sets are

asymmetric attributes.

Resolution

It is frequently possible to obtain data at different levels of resolution, and

often the properties of the data are different at different resolutions. For

instance, the surface of the Earth seems very uneven at a resolution of a few

meters, but is relatively smooth at a resolution of tens of kilometers. The

patterns in the data also depend on the level of resolution. If the resolution is

too fine, a pattern may not be visible or may be buried in noise; if the

resolution is too coarse, the pattern can disappear. For example, variations in

atmospheric pressure on a scale of hours reflect the movement of storms and

other weather systems. On a scale of months, such phenomena are not

detectable.

Record Data

Much data mining work assumes that the data set is a collection of records

(data objects), each of which consists of a fixed set of data fields (attributes).

See Figure 2.2(a). For the most basic form of record data, there is no

explicit relationship among records or data fields, and every record (object)

has the same set of attributes. Record data is usually stored either in flat files

or in relational databases. Relational databases are certainly more than a

collection of records, but data mining often does not use any of the additional

information available in a relational database. Rather, the database serves as

a convenient place to find records. Different types of record data are

described below and are illustrated in Figure 2.2.

Figure 2.2.

Different variations of record data.

Transaction or Market Basket Data

Transaction data is a special type of record data, where each record

(transaction) involves a set of items. Consider a grocery store. The set of

products purchased by a customer during one shopping trip constitutes a

transaction, while the individual products that were purchased are the items.

This type of data is called market basket data because the items in each

record are the products in a person’s “market basket.” Transaction data is a

collection of sets of items, but it can be viewed as a set of records whose

fields are asymmetric attributes. Most often, the attributes are binary,

indicating whether an item was purchased, but more generally, the attributes

can be discrete or continuous, such as the number of items purchased or the

amount spent on those items. Figure 2.2(b) shows a sample transaction

data set. Each row represents the purchases of a particular customer at a

particular time.
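
As a minimal sketch of the two views described above, the code below takes transactions stored as sets of items and turns them into records with one binary asymmetric attribute per item. The baskets and variable names are invented for illustration and are not the contents of Figure 2.2(b).

```python
# Made-up market baskets; each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
]

# Fix a column order so that every record has the same fields.
items = sorted(set().union(*transactions))

# One row per transaction, one 0/1 field per item (1 = item was purchased).
records = [[1 if item in t else 0 for item in items] for t in transactions]

print(items)
for row in records:
    print(row)
```

Replacing the 0/1 entries with quantities or amounts spent would give the discrete or continuous variant mentioned in the text.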

The Data Matrix

If all the data objects in a collection of data have the same fixed set of numeric

attributes, then the data objects can be thought of as points (vectors) in a

multidimensional space, where each dimension represents a distinct attribute

describing the object. A set of such data objects can be interpreted as an m by

n matrix, where there are m rows, one for each object, and n columns, one for

each attribute. (A representation that has data objects as columns and

attributes as rows is also fine.) This matrix is called a data matrix or a pattern

matrix. A data matrix is a variation of record data, but because it consists of

numeric attributes, standard matrix operations can be applied to transform and

manipulate the data. Therefore, the data matrix is the standard data format for

most statistical data. Figure 2.2(c) shows a sample data matrix.

The Sparse Data Matrix

A sparse data matrix is a special case of a data matrix where the attributes

are of the same type and are asymmetric; i.e., only non-zero values are

important. Transaction data is an example of a sparse data matrix that has

only 0–1 entries. Another common example is document data. In particular, if

the order of the terms (words) in a document is ignored—the “bag of words”

approach—then a document can be represented as a term vector, where each

term is a component (attribute) of the vector and the value of each component

is the number of times the corresponding term occurs in the document. This

representation of a collection of documents is often called a document-term

matrix. Figure 2.2(d) shows a sample document-term matrix. The

documents are the rows of this matrix, while the terms are the columns. In

practice, only the non-zero entries of sparse data matrices are stored.
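
The bag-of-words representation can be sketched in a few lines. The documents below are invented, and whitespace splitting stands in for real tokenization, which is an assumption made for brevity. Each document is held as a `collections.Counter`, so only its non-zero term counts are stored.

```python
from collections import Counter

# Toy documents (invented for illustration).
docs = [
    "the team won the game",
    "the coach praised the team",
    "play ball",
]

# Bag of words: ignore word order, keep only term counts.
# Each Counter holds just the non-zero entries of one matrix row.
sparse_rows = [Counter(doc.split()) for doc in docs]

# The vocabulary defines the columns of the conceptual document-term matrix.
vocabulary = sorted(set().union(*sparse_rows))

def dense_row(row):
    """Expand one sparse row into a full term vector, filling in the zeros."""
    return [row.get(term, 0) for term in vocabulary]

print(vocabulary)
print(dense_row(sparse_rows[0]))
```

The dense vector is built only on demand. For a realistic vocabulary of many thousands of terms, keeping just the per-document counters is what makes storage and computation practical.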

Graph-Based Data

A graph can sometimes be a convenient and powerful representation for data.

We consider two specific cases: (1) the graph captures relationships among

data objects and (2) the data objects themselves are represented as graphs.

Data with Relationships among Objects

The relationships among objects frequently convey important information. In

such cases, the data is often represented as a graph. In particular, the data

objects are mapped to nodes of the graph, while the relationships among

objects are captured by the links between objects and link properties, such as

direction and weight. Consider web pages on the World Wide Web, which

contain both text and links to other pages. In order to process search queries,

web search engines collect and process web pages to extract their contents. It

is well-known, however, that the links to and from each page provide a great

deal of information about the relevance of a web page to a query, and thus,

must also be taken into consideration. Figure 2.3(a) shows a set of linked

web pages. Another important example of such graph data is social

networks, where the data objects are people and the relationships among them

are their interactions via social media.
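
A directed graph of linked pages can be held as an adjacency list. The sketch below uses hypothetical page names and counts incoming links per page. That count is only a crude stand-in for the link information the text refers to, not a description of how any search engine ranks pages.

```python
# Hypothetical web pages and their outgoing links (directed edges).
links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
    "blog": ["home"],
}

# Nodes are pages; count the incoming links of each node.
in_degree = {page: 0 for page in links}
for source, targets in links.items():
    for target in targets:
        in_degree[target] += 1

print(in_degree)
# {'home': 3, 'about': 2, 'products': 1, 'blog': 0}
```

Link properties such as weight could be added by storing (target, weight) pairs instead of bare target names.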

Data with Objects That Are Graphs

If objects have structure, that is, the objects contain subobjects that have

relationships, then such objects are frequently represented as graphs. For

example, the structure of chemical compounds can be represented by a

graph, where the nodes are atoms and the links between nodes are chemical

bonds. Figure 2.3(b) shows a ball-and-stick diagram of the chemical

compound benzene, which contains atoms of carbon (black) and hydrogen

(gray). A graph representation makes it possible to determine which

substructures occur frequently in a set of compounds and to ascertain

whether the presence of any of these substructures is associated with the

presence or absence of certain chemical properties, such as melting point or

heat of formation. Frequent graph mining, which is a branch of data mining

that analyzes such data, is considered in Section 6.5.

Figure 2.3.

Different variations of graph data.

Ordered Data

For some types of data, the attributes have relationships that involve order in

time or space. Different types of ordered data are described next and are

shown in Figure 2.4.

Sequential Transaction Data

Sequential transaction data can be thought of as an extension of transaction

data, where each transaction has a time associated with it. Consider a retail

transaction data set that also stores the time at which the transaction took

place. This time information makes it possible to find patterns such as “candy

sales peak before Halloween.” A time can also be associated with each

attribute. For example, each record could be the purchase history of a

customer, with a listing of items purchased at different times. Using this

information, it is possible to find patterns such as “people who buy DVD

players tend to buy DVDs in the period immediately following the purchase.”

Figure 2.4(a) shows an example of sequential transaction data. There are

five different times—t1, t2, t3, t4, and t5; three different customers—C1, C2,

and C3; and five different items—A, B, C, D, and E. In the top table, each row

corresponds to the items purchased at a particular time by each customer. For

instance, at time t3, customer C2 purchased items A and D. In the bottom

table, the same information is displayed, but each row corresponds to a

particular customer. Each row contains information about each transaction

involving the customer, where a transaction is considered to be a set of items

and the time at which those items were purchased. For example, customer C3

bought items A and C at time t2.
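
The two tabular views described above hold the same information, and one can be derived from the other. In the sketch below, only the rows for customer C2 at time t3 and customer C3 at time t2 are taken from the text. The remaining rows are invented, since the contents of Figure 2.4(a) are not reproduced here.

```python
from collections import defaultdict

# (time, customer, items) rows. Only t3/C2 and t2/C3 come from the text;
# the other rows are invented for illustration.
by_time = [
    ("t1", "C1", {"A", "B"}),
    ("t2", "C3", {"A", "C"}),
    ("t2", "C1", {"C", "D"}),
    ("t3", "C2", {"A", "D"}),
    ("t4", "C2", {"E"}),
    ("t5", "C1", {"A", "E"}),
]

# Regroup so that each customer maps to a time-ordered list of transactions.
by_customer = defaultdict(list)
for time, customer, items in sorted(by_time, key=lambda row: row[0]):
    by_customer[customer].append((time, sorted(items)))

for customer in sorted(by_customer):
    print(customer, by_customer[customer])
```

The per-customer view is the natural input when looking for patterns in what a customer buys after an earlier purchase.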

Time Series Data

Time series data is a special type of ordered data where each record is a time

series, i.e., a series of measurements taken over time. For example, a

financial data set might contain objects that are time series of the daily prices

of various stocks. As another example, consider Figure 2.4(c), which

shows a time series of the average monthly temperature for Minneapolis

during the years 1982 to 1994. When working with temporal data, such as

time series, it is important to consider temporal autocorrelation; i.e., if two

measurements are close in time, then the values of those measurements are

often very similar.
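
Temporal autocorrelation can be quantified with the lag-1 sample autocorrelation. The sketch below applies it to a synthetic seasonal series, which is an assumption standing in for the Minneapolis temperatures of Figure 2.4(c), and to a shuffled copy of the same values. The function name is hypothetical.

```python
import math
import random

def lag1_autocorrelation(series):
    """Lag-1 sample autocorrelation: how strongly each value resembles the next."""
    n = len(series)
    mean = sum(series) / n
    variance_sum = sum((x - mean) ** 2 for x in series)
    covariance_sum = sum(
        (series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1)
    )
    return covariance_sum / variance_sum

# Synthetic "monthly temperature": a smooth yearly cycle over three years.
monthly = [10 + 15 * math.sin(2 * math.pi * m / 12) for m in range(36)]

# Same values in a random order: the time structure is destroyed.
shuffled = monthly[:]
random.Random(0).shuffle(shuffled)

r_ordered = lag1_autocorrelation(monthly)
r_shuffled = lag1_autocorrelation(shuffled)
print(round(r_ordered, 3))
print(round(r_shuffled, 3))
```

The ordered series yields a value close to 1, while the shuffled copy yields a much smaller one, although both contain exactly the same measurements.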

Figure 2.4.

Different variations of ordered data.

Sequence Data

Sequence data consists of a data set that is a sequence of individual entities,

such as a sequence of words or letters. It is quite similar to sequential data,

except that there are no time stamps; instead, there are positions in an

ordered sequence. For example, the genetic information of plants and animals

can be represented in the form of sequences of nucleotides that are known as

genes. Many of the problems associated with genetic sequence data involve

predicting similarities in the structure and function of genes from similarities in

nucleotide sequences. Figure 2.4(b) shows a section of the human genetic

code expressed using the four nucleotides from which all DNA is constructed:

A, T, G, and C.

Spatial and Spatio-Temporal Data

Some objects have spatial attributes, such as positions or areas, in addition to

other types of attributes. An example of spatial data is weather data

(precipitation, temperature, pressure) that is collected for a variety of

geographical locations. Often such measurements are collected over time,

and thus, the data consists of time series at various locations. In that case, we

refer to the data as spatio-temporal data. Although analysis can be conducted

separately for each specific time or location, a more complete analysis of

spatio-temporal data requires consideration of both the spatial and temporal

aspects of the data.

An important aspect of spatial data is spatial autocorrelation; i.e., objects

that are physically close tend to be similar in other ways as well. Thus, two

points on the Earth that are close to each other usually have similar values for

temperature and rainfall. Note that spatial autocorrelation is analogous to

temporal autocorrelation.

Important examples of spatial and spatio-temporal data are the science and

engineering data sets that are the result of measurements or model output

taken at regularly or irregularly distributed points on a two- or three-

dimensional grid or mesh. For instance, Earth science data sets record the

temperature or pressure measured at points (grid cells) on latitude–longitude

spherical grids of various resolutions, e.g., 1° by 1°. See Figure 2.4(d). As

another example, in the simulation of the flow of a gas, the speed and

direction of flow at various instants in time can be recorded for each grid point

in the simulation. A different type of spatio-temporal data arises from tracking

the trajectories of objects, e.g., vehicles, in time and space.

Handling Non-Record Data

Most data mining algorithms are designed for record data or its variations,

such as transaction data and data matrices. Record-oriented techniques can

be applied to non-record data by extracting features from data objects and

using these features to create a record corresponding to each object.

Consider the chemical structure data that was described earlier. Given a set of

common substructures, each compound can be represented as a record with

binary attributes that indicate whether a compound contains a specific

substructure. Such a representation is actually a transaction data set, where

the transactions are the compounds and the items are the substructures.

In some cases, it is easy to represent the data in a record format, but this type

of representation does not capture all the information in the data. Consider

spatio-temporal data consisting of a time series from each point on a spatial

grid. This data is often stored in a data matrix, where each row represents a

location and each column represents a particular point in time. However, such

a representation does not explicitly capture the time relationships that are

present among attributes and the spatial relationships that exist among

objects. This does not mean that such a representation is inappropriate, but

rather that these relationships must be taken into consideration during the

analysis. For example, it would not be a good idea to use a data mining

technique that ignores the temporal autocorrelation of the attributes or the

spatial autocorrelation of the data objects, i.e., the locations on the spatial

grid.

2.2 Data Quality

Data mining algorithms are often applied to data that was collected for another

purpose, or for future, but unspecified applications. For that reason, data

mining cannot usually take advantage of the significant benefits of

“addressing quality issues at the source.” In contrast, much of statistics deals with

the design of experiments or surveys that achieve a prespecified level of data

quality. Because preventing data quality problems is typically not an option,

data mining focuses on (1) the detection and correction of data quality

problems and (2) the use of algorithms that can tolerate poor data quality. The

first step, detection and correction, is often called data cleaning.

The following sections discuss specific aspects of data quality. The focus is on

measurement and data collection issues, although some application-related

issues are also discussed.

2.2.1 Measurement and Data Collection Issues

It is unrealistic to expect that data will be perfect. There may be problems due

to human error, limitations of measuring devices, or flaws in the data collection

process. Values or even entire data objects can be missing. In other cases,

there can be spurious or duplicate objects; i.e., multiple data objects that all

correspond to a single “real” object. For example, there might be two different

records for a person who has recently lived at two different addresses. Even if

all the data is present and “looks fine,” there may be inconsistencies—a

person has a height of 2 meters, but weighs only 2 kilograms.

In the next few sections, we focus on aspects of data quality that are related

to data measurement and collection. We begin with a definition of

measurement and data collection errors and then consider a variety of

problems that involve measurement error: noise, artifacts, bias, precision, and

accuracy. We conclude by discussing data quality issues that involve both

measurement and data collection problems: outliers, missing and inconsistent

values, and duplicate data.

Measurement and Data Collection Errors

The term measurement error refers to any problem resulting from the

measurement process. A common problem is that the value recorded differs

from the true value to some extent. For continuous attributes, the numerical

difference of the measured and true value is called the error. The term data

collection error refers to errors such as omitting data objects or attribute

values, or inappropriately including a data object. For example, a study of

animals of a certain species might include animals of a related species that

are similar in appearance to the species of interest. Both measurement errors

and data collection errors can be either systematic or random.

We will only consider general types of errors. Within particular domains,

certain types of data errors are commonplace, and well-developed techniques

often exist for detecting and/or correcting these errors. For example, keyboard

errors are common when data is entered manually, and as a result, many data

entry programs have techniques for detecting and, with human intervention,

correcting such errors.

Noise and Artifacts

Noise is the random component of a measurement error. It typically involves

the distortion of a value or the addition of spurious objects. Figure 2.5

shows a time series before and after it has been disrupted by random noise. If

a bit more noise were added to the time series, its shape would be lost.

Figure 2.6 shows a set of data points before and after some noise points

(indicated by ‘+’s) have been added. Notice that some of the noise points are

intermixed with the non-noise points.

Figure 2.5.

Noise in a time series context.

Figure 2.6.

Noise in a spatial context.

The term noise is often used in connection with data that has a spatial or

temporal component. In such cases, techniques from signal or image

processing can frequently be used to reduce noise and thus, help to discover

patterns (signals) that might be “lost in the noise.” Nonetheless, the

elimination of noise is frequently difficult, and much work in data mining

focuses on devising robust algorithms that produce acceptable results even

when noise is present.

Data errors can be the result of a more deterministic phenomenon, such as a

streak in the same place on a set of photographs. Such deterministic

distortions of the data are often referred to as artifacts.

Precision, Bias, and Accuracy

In statistics and experimental science, the quality of the measurement process

and the resulting data are measured by precision and bias. We provide the

standard definitions, followed by a brief discussion. For the following

definitions, we assume that we make repeated measurements of the same

underlying quantity.

Definition 2.3 (Precision).

The closeness of repeated measurements (of the same

quantity) to one another.

Definition 2.4 (Bias).

A systematic variation of measurements from the quantity being

measured.

Precision is often measured by the standard deviation of a set of values, while

bias is measured by taking the difference between the mean of the set of

values and the known value of the quantity being measured. Bias can be

determined only for objects whose measured quantity is known by means

external to the current situation. Suppose that we have a standard laboratory

weight with a mass of 1g and want to assess the precision and bias of our

new laboratory scale. We weigh the mass five times, and obtain the following

five values: {1.015, 0.990, 1.013, 1.001, 0.986}. The mean of these values is

1.001, and hence, the bias is 0.001. The precision, as measured by the

standard deviation, is 0.013.
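
The figures in this example can be reproduced with Python's standard `statistics` module. This is a minimal sketch and the variable names are illustrative. Note that the quoted precision of 0.013 corresponds to the sample standard deviation (divisor n − 1); the population formula (divisor n) would round to 0.012 instead.

```python
import statistics

true_mass = 1.0                                 # grams, the reference weight
readings = [1.015, 0.990, 1.013, 1.001, 0.986]  # the five weighings from the text

mean_reading = statistics.mean(readings)
bias = mean_reading - true_mass          # systematic offset from the known value
precision = statistics.stdev(readings)   # sample standard deviation (n - 1)

print(round(mean_reading, 3), round(bias, 3), round(precision, 3))
# 1.001 0.001 0.013
```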

It is common to use the more general term, accuracy, to refer to the degree

of measurement error in data.

Definition 2.5 (Accuracy).

The closeness of measurements to the true value of the quantity

being measured.

Accuracy depends on precision and bias, but there is no specific formula for

accuracy in terms of these two quantities.

One important aspect of accuracy is the use of significant digits. The goal is

to use only as many digits to represent the result of a measurement or

calculation as are justified by the precision of the data. For example, if the

length of an object is measured with a meter stick whose smallest markings

are millimeters, then we should record the length only to the nearest

millimeter. The precision of such a measurement would be ±0.5 mm. We do

not review the details of working with significant digits because most readers

will have encountered them in previous courses and they are covered in

considerable depth in science, engineering, and statistics textbooks.

Issues such as significant digits, precision, bias, and accuracy are sometimes

overlooked, but they are important for data mining as well as statistics and

science. Many times, data sets do not come with information about the

precision of the data, and furthermore, the programs used for analysis return

results without any such information. Nonetheless, without some

understanding of the accuracy of the data and the results, an analyst runs the

risk of committing serious data analysis blunders.

Outliers

Outliers are either (1) data objects that, in some sense, have characteristics

that are different from most of the other data objects in the data set, or (2)

values of an attribute that are unusual with respect to the typical values for

that attribute. Alternatively, they can be referred to as anomalous objects or

values. There is considerable leeway in the definition of an outlier, and many

different definitions have been proposed by the statistics and data mining

communities. Furthermore, it is important to distinguish between the notions of

noise and outliers. Unlike noise, outliers can be legitimate data objects or

values that we are interested in detecting. For instance, in fraud and network

intrusion detection, the goal is to find unusual objects or events from among a

large number of normal ones. Chapter 9 discusses anomaly detection in

more detail.

Missing Values

It is not unusual for an object to be missing one or more attribute values. In

some cases, the information was not collected; e.g., some people decline to

give their age or weight. In other cases, some attributes are not applicable to

all objects; e.g., often, forms have conditional parts that are filled out only

when a person answers a previous question in a certain way, but for simplicity,

all fields are stored. Regardless, missing values should be taken into account

during the data analysis.

There are several strategies (and variations on these strategies) for dealing

with missing data, each of which is appropriate in certain circumstances.

These strategies are listed next, along with an indication of their advantages

and disadvantages.

Eliminate Data Objects or Attributes

A simple and effective strategy is to eliminate objects with missing values.

However, even a partially specified data object contains some information,

and if many objects have missing values, then a reliable analysis can be

difficult or impossible. Nonetheless, if a data set has only a few objects that

have missing values, then it may be expedient to omit them. A related strategy

is to eliminate attributes that have missing values. This should be done with

caution, however, because the eliminated attributes may be the ones that are

critical to the analysis.
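
The trade-off between the two elimination strategies can be seen on a tiny example. The records below are invented, and `None` is used to mark a missing value, which is a convention of this sketch rather than anything prescribed by the text.

```python
# Invented records; None marks a missing value.
rows = [
    {"age": 34, "weight": 70.5, "city": "Austin"},
    {"age": None, "weight": 82.0, "city": "Boston"},
    {"age": 51, "weight": None, "city": "Denver"},
    {"age": 29, "weight": 64.2, "city": "Eugene"},
]

# Strategy 1: eliminate objects (rows) that have any missing value.
complete_rows = [r for r in rows if all(v is not None for v in r.values())]

# Strategy 2: eliminate attributes (columns) that have any missing value.
complete_attrs = [k for k in rows[0] if all(r[k] is not None for r in rows)]
reduced_rows = [{k: r[k] for k in complete_attrs} for r in rows]

print(len(complete_rows))  # 2
print(complete_attrs)      # ['city']
```

Dropping rows discards half of the objects here, while dropping columns keeps every object but loses both numeric attributes, which may be exactly the ones the analysis needs.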

Estimate Missing Values

Sometimes missing data can be reliably estimated. For example, consider a

time series that changes in a reasonably smooth fashion, but has a few,

widely scattered missing values. In such cases, the missing values can be

estimated (interpolated) by using the remaining values. As another example,

consider a data set that has many similar data points. In this situation, the

attribute values of the points closest to the point with the missing value are

often used to estimate the missing value. If the attribute is continuous, then

the average attribute value of the nearest neighbors is used; if the attribute is

categorical, then the most commonly occurring attribute value can be taken.

For a concrete illustration, consider precipitation measurements that are

recorded by ground stations. For areas not containing a ground station, the

precipitation can be estimated using values observed at nearby ground

stations.
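
The two estimation strategies described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy time series, and the station coordinates are hypothetical, and Euclidean distance with a plain average is one of several reasonable choices.

```python
from statistics import mean


def interpolate_gaps(series):
    """Fill None entries by linear interpolation between the nearest known values.

    Gaps at the very start or end are left as None, because there is
    nothing to interpolate between.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


def knn_estimate(target, neighbors, k=2):
    """Estimate a missing continuous value as the mean over the k nearest points.

    `target` holds the known coordinates of the incomplete object;
    `neighbors` is a list of (coordinates, value) pairs for complete objects.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    nearest = sorted(neighbors, key=lambda nb: dist(target, nb[0]))[:k]
    return mean(value for _, value in nearest)


# Hypothetical smooth time series with scattered missing values.
temps = [10.0, None, 14.0, None, None, 20.0]
print(interpolate_gaps(temps))

# Hypothetical ground stations: ((x, y), precipitation).
stations = [((0.0, 0.0), 12.0), ((1.0, 0.0), 16.0), ((9.0, 9.0), 40.0)]
print(knn_estimate((0.5, 0.1), stations, k=2))
```

For a categorical attribute, the mean would be replaced by the most common value among the neighbors.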

Ignore the Missing Value during Analysis

Many data mining approaches can be modified to ignore missing values. For

example, suppose that objects are being clustered and the similarity between

pairs of data objects needs to be calculated. If one or both objects of a pair

have missing values for some attributes, then the similarity can be calculated

by using only the attributes that do not have missing values. It is true that the

similarity will only be approximate, but unless the total number of attributes is

small or the number of missing values is high, this degree of inaccuracy may

not matter much. Likewise, many classification schemes can be modified to

work with missing values.
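
As a minimal sketch of this idea, the following function computes a similarity using only the attributes that are present in both objects. The choice of cosine similarity and the use of None to mark a missing value are assumptions made for illustration; the text does not prescribe a particular measure.

```python
import math


def cosine_ignoring_missing(a, b):
    """Cosine similarity over attributes that are present (not None) in both objects.

    Returns None when the objects share no usable attributes or when one
    of the reduced vectors has zero length.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return None
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    if norm_a == 0 or norm_b == 0:
        return None
    return dot / (norm_a * norm_b)


# Only the first and third attributes are present in both objects.
print(cosine_ignoring_missing([1, None, 2, 3], [2, 5, 4, None]))
```

The fewer attributes the two objects share, the less reliable the result, which is the caveat raised above.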

Inconsistent Values

Data can contain inconsistent values. Consider an address field, where both a

zip code and city are listed, but the specified zip code area is not contained in

that city. It is possible that the individual entering this information transposed

two digits, or perhaps a digit was misread when the information was scanned

from a handwritten form. Regardless of the cause of the inconsistent values, it

is important to detect and, if possible, correct such problems.

Some types of inconsistencies are easy to detect. For instance, a person’s

height should not be negative. In other cases, it can be necessary to consult

an external source of information. For example, when an insurance company

processes claims for reimbursement, it checks the names and addresses on

the reimbursement forms against a database of its customers.

Once an inconsistency has been detected, it is sometimes possible to correct

the data. A product code may have “check” digits, or it may be possible to

double-check a product code against a list of known product codes, and then

correct the code if it is incorrect, but close to a known code. The correction of

an inconsistency requires additional or redundant information.

Example 2.6 (Inconsistent Sea Surface

Temperature).

This example illustrates an inconsistency in actual time series data that

measures the sea surface temperature (SST) at various points on the

ocean. SST data was originally collected using ocean-based

measurements from ships or buoys, but more recently, satellites have

been used to gather the data. To create a long-term data set, both sources

of data must be used. However, because the data comes from different

sources, the two parts of the data are subtly different. This discrepancy is

visually displayed in Figure 2.7, which shows the correlation of SST

values between pairs of years. If a pair of years has a positive correlation,

then the location corresponding to the pair of years is colored white;

otherwise it is colored black. (Seasonal variations were removed from the

data since, otherwise, all the years would be highly correlated.) There is a

distinct change in behavior where the data has been put together in 1983.

Years within each of the two groups, 1958–1982 and 1983–1999, tend to

have a positive correlation with one another, but a negative correlation with

years in the other group. This does not mean that this data should not be

used, only that the analyst should consider the potential impact of such

discrepancies on the data mining analysis.

Figure 2.7.

Correlation of SST data between pairs of years. White areas indicate

positive correlation. Black areas indicate negative correlation.

Duplicate Data

A data set can include data objects that are duplicates, or almost duplicates,

of one another. Many people receive duplicate mailings because they appear

in a database multiple times under slightly different names. To detect and

eliminate such duplicates, two main issues must be addressed. First, if there

are two objects that actually represent a single object, then one or more

values of corresponding attributes are usually different, and these inconsistent

values must be resolved. Second, care needs to be taken to avoid

accidentally combining data objects that are similar, but not duplicates, such

as two distinct people with identical names. The term deduplication is often

used to refer to the process of dealing with these issues.

In some cases, two or more objects are identical with respect to the attributes

measured by the database, but they still represent different objects. Here, the

duplicates are legitimate, but can still cause problems for some algorithms if

the possibility of identical objects is not specifically accounted for in their

design. An example of this is given in Exercise 13 on page 108.

2.2.2 Issues Related to Applications

Data quality issues can also be considered from an application viewpoint as

expressed by the statement “data is of high quality if it is suitable for its

intended use.” This approach to data quality has proven quite useful,

particularly in business and industry. A similar viewpoint is also present in

statistics and the experimental sciences, with their emphasis on the careful

design of experiments to collect the data relevant to a specific hypothesis. As

with quality issues at the measurement and data collection level, many issues

are specific to particular applications and fields. Again, we consider only a few

of the general issues.

Timeliness

Some data starts to age as soon as it has been collected. In particular, if the

data provides a snapshot of some ongoing phenomenon or process, such as

the purchasing behavior of customers or web browsing patterns, then this

snapshot represents reality for only a limited time. If the data is out of date,

then so are the models and patterns that are based on it.

Relevance

The available data must contain the information necessary for the application.

Consider the task of building a model that predicts the accident rate for

drivers. If information about the age and gender of the driver is omitted, then it

is likely that the model will have limited accuracy unless this information is

indirectly available through other attributes.

Making sure that the objects in a data set are relevant is also challenging. A

common problem is sampling bias, which occurs when a sample does not

contain different types of objects in proportion to their actual occurrence in the

population. For example, survey data describes only those who respond to the

survey. (Other aspects of sampling are discussed further in Section 2.3.2.)

Because the results of a data analysis can reflect only the data that is present,

sampling bias will typically lead to erroneous results when applied to the

broader population.

Knowledge about the Data

Ideally, data sets are accompanied by documentation that describes different

aspects of the data; the quality of this documentation can either aid or hinder

the subsequent analysis. For example, if the documentation identifies several

attributes as being strongly related, these attributes are likely to provide highly

redundant information, and we usually decide to keep just one. (Consider

sales tax and purchase price.) If the documentation is poor, however, and fails

to tell us, for example, that the missing values for a particular field are

indicated with a -9999, then our analysis of the data may be faulty. Other

important characteristics are the precision of the data, the type of features

(nominal, ordinal, interval, ratio), the scale of measurement (e.g., meters or

feet for length), and the origin of the data.

2.3 Data Preprocessing

In this section, we consider which preprocessing steps should be applied to

make the data more suitable for data mining. Data preprocessing is a broad

area and consists of a number of different strategies and techniques that are

interrelated in complex ways. We will present some of the most important

ideas and approaches, and try to point out the interrelationships among them.

Specifically, we will discuss the following topics:

Aggregation

Sampling

Dimensionality reduction

Feature subset selection

Feature creation

Discretization and binarization

Variable transformation

Roughly speaking, these topics fall into two categories: selecting data objects

and attributes for the analysis, or creating or changing the attributes. In both

cases, the goal is to improve the data mining analysis with respect to time,

cost, and quality. Details are provided in the following sections.

A quick note about terminology: In the following, we sometimes use synonyms

for attribute, such as feature or variable, in order to follow common usage.

2.3.1 Aggregation

Sometimes “less is more,” and this is the case with aggregation, the

combining of two or more objects into a single object. Consider a data set

consisting of transactions (data objects) recording the daily sales of products

in various store locations (Minneapolis, Chicago, Paris, …) for different days

over the course of a year. See Table 2.4. One way to aggregate

transactions for this data set is to replace all the transactions of a single store

with a single storewide transaction. This reduces the hundreds or thousands

of transactions that occur daily at a specific store to a single daily transaction,

and the number of data objects per day is reduced to the number of stores.

Table 2.4. Data set containing information about customer purchases.

Transaction ID   Item      Store Location   Date       Price    …
⋮                ⋮         ⋮                ⋮          ⋮
101123           Watch     Chicago          09/06/04   $25.99   …
101123           Battery   Chicago          09/06/04   $5.99    …
101124           Shoes     Minneapolis      09/06/04   $75.00   …

An obvious issue is how an aggregate transaction is created; i.e., how the

values of each attribute are combined across all the records corresponding to

a particular location to create the aggregate transaction that represents the

sales of a single store or date. Quantitative attributes, such as price, are

typically aggregated by taking a sum or an average. A qualitative attribute,

such as item, can either be omitted or summarized in terms of a higher level

category, e.g., televisions versus electronics.
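
The aggregation just described can be sketched as follows, using the three rows shown in Table 2.4. The tuple layout and the function name are assumptions for illustration. Price is summed, and the qualitative item attribute is simply omitted, which is one of the two options mentioned above.

```python
from collections import defaultdict

# Rows mirror Table 2.4: (transaction_id, item, store, date, price).
transactions = [
    (101123, "Watch", "Chicago", "09/06/04", 25.99),
    (101123, "Battery", "Chicago", "09/06/04", 5.99),
    (101124, "Shoes", "Minneapolis", "09/06/04", 75.00),
]


def aggregate_by_store_and_date(rows):
    """Collapse item-level rows into one total per (store, date).

    The item attribute is dropped; prices are summed and rounded to cents.
    """
    totals = defaultdict(float)
    for _, _item, store, date, price in rows:
        totals[(store, date)] += price
    return {key: round(total, 2) for key, total in totals.items()}


daily = aggregate_by_store_and_date(transactions)
for (store, date), total in daily.items():
    print(store, date, total)
```

Replacing the sum with an average, or grouping by month instead of by day, changes only the key and the reduction step.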

The data in Table 2.4 can also be viewed as a multidimensional array,

where each attribute is a dimension. From this viewpoint, aggregation is the

process of eliminating attributes, such as the type of item, or reducing the

number of values for a particular attribute; e.g., reducing the possible values

for date from 365 days to 12 months. This type of aggregation is commonly

used in Online Analytical Processing (OLAP). References to OLAP are given

in the bibliographic Notes.

There are several motivations for aggregation. First, the smaller data sets

resulting from data reduction require less memory and processing time, and

hence, aggregation often enables the use of more expensive data mining

algorithms. Second, aggregation can act as a change of scope or scale by

providing a high-level view of the data instead of a low-level view. In the

previous example, aggregating over store locations and months gives us a

monthly, per store view of the data instead of a daily, per item view. Finally, the

behavior of groups of objects or attributes is often more stable than that of

individual objects or attributes. This statement reflects the statistical fact that

aggregate quantities, such as averages or totals, have less variability than the

individual values being aggregated. For totals, the actual amount of variation

is larger than that of individual objects (on average), but the percentage of the

variation is smaller, while for means, the actual amount of variation is less

than that of individual objects (on average). A disadvantage of aggregation is

the potential loss of interesting details. In the store example, aggregating over

months loses information about which day of the week has the highest sales.
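
The statement about totals and means can be made precise under a simplifying assumption that the text does not state: suppose the n values being aggregated, X_1, ..., X_n, are independent, each with mean μ > 0 and standard deviation σ. Writing T for the total and X̄ for the mean, a short derivation gives:

```latex
\begin{align*}
T &= \sum_{i=1}^{n} X_i, &
\operatorname{sd}(T) &= \sigma\sqrt{n}, &
\frac{\operatorname{sd}(T)}{\operatorname{E}[T]} &= \frac{\sigma}{\mu\sqrt{n}}, \\
\bar{X} &= \frac{T}{n}, &
\operatorname{sd}(\bar{X}) &= \frac{\sigma}{\sqrt{n}}, &
\frac{\operatorname{sd}(\bar{X})}{\operatorname{E}[\bar{X}]} &= \frac{\sigma}{\mu\sqrt{n}}.
\end{align*}
```

Under this assumption the absolute variation of the total grows like √n while its relative variation shrinks like 1/√n, and the absolute variation of the mean shrinks like 1/√n. When the values are correlated, the reduction is typically smaller.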

Example 2.7 (Australian Precipitation).

This example is based on precipitation in Australia from the period 1982–

1993. Figure 2.8(a) shows a histogram for the standard deviation of

average monthly precipitation for 3,030 0.5° by 0.5° grid cells in Australia,

while Figure 2.8(b) shows a histogram for the standard deviation of the

average yearly precipitation for the same locations. The average yearly

precipitation has less variability than the average monthly precipitation. All

precipitation measurements (and their standard deviations) are in

centimeters.

Figure 2.8.

Histograms of standard deviation for monthly and yearly precipitation in

Australia for the period 1982–1993.

2.3.2 Sampling

Sampling is a commonly used approach for selecting a subset of the data

objects to be analyzed. In statistics, it has long been used for both the

preliminary investigation of the data and the final data analysis. Sampling can

also be very useful in data mining. However, the motivations for sampling in

statistics and data mining are often different. Statisticians use sampling

because obtaining the entire set of data of interest is too expensive or time

consuming, while data miners usually sample because it is too

computationally expensive in terms of the memory or time required to process

all the data. In some cases, using a sampling algorithm can reduce the data

size to the point where a better, but more computationally expensive algorithm

can be used.

The key principle for effective sampling is the following: Using a sample will

work almost as well as using the entire data set if the sample is

representative. In turn, a sample is representative if it has approximately the

same property (of interest) as the original set of data. If the mean (average) of

the data objects is the property of interest, then a sample is representative if it

has a mean that is close to that of the original data. Because sampling is a

statistical process, the representativeness of any particular sample will vary,

and the best that we can do is choose a sampling scheme that guarantees a

high probability of getting a representative sample. As discussed next, this

involves choosing the appropriate sample size and sampling technique.

Sampling Approaches

There are many sampling techniques, but only a few of the most basic ones

and their variations will be covered here. The simplest type of sampling is

simple random sampling. For this type of sampling, there is an equal

probability of selecting any particular object. There are two variations on

random sampling (and other sampling techniques as well): (1) sampling

without replacement: as each object is selected, it is removed from the set

of all objects that together constitute the population, and (2) sampling with

replacement: objects are not removed from the population as they are

selected for the sample. In sampling with replacement, the same object can

be picked more than once. The samples produced by the two methods are not

much different when samples are relatively small compared to the data set

size, but sampling with replacement is simpler to analyze because the

probability of selecting any object remains constant during the sampling

process.
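
The two variations map directly onto two functions in Python's standard `random` module. The population of 1000 integers and the fixed seed are assumptions made so that the sketch is repeatable.

```python
import random

population = list(range(1000))
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Without replacement: every selected object is distinct.
without = rng.sample(population, k=50)

# With replacement: the same object may be drawn more than once.
with_repl = rng.choices(population, k=50)

# Number of distinct objects in each sample.
print(len(set(without)), len(set(with_repl)))
```

Because the sample (50) is small relative to the population (1000), repeats in the second sample are expected to be rare, which is the sense in which the two methods produce similar samples.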

When the population consists of different types of objects, with widely different

numbers of objects, simple random sampling can fail to adequately represent

those types of objects that are less frequent. This can cause problems when

the analysis requires proper representation of all object types. For example,

when building classification models for rare classes, it is critical that the rare

classes be adequately represented in the sample. Hence, a sampling scheme

that can accommodate differing frequencies for the object types of interest is

needed. Stratified sampling, which starts with prespecified groups of

objects, is such an approach. In the simplest version, equal numbers of

objects are drawn from each group even though the groups are of different

sizes. In another variation, the number of objects drawn from each group is

proportional to the size of that group.
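
Both versions of stratified sampling can be sketched with one helper. The function name, its parameters, and the 990/10 split between a common and a rare class are hypothetical; sampling within each group is done without replacement.

```python
import random
from collections import defaultdict


def stratified_sample(objects, group_of, n_per_group=None, fraction=None, seed=0):
    """Draw a stratified sample without replacement.

    Pass n_per_group for equal-size draws from every group, or fraction
    for draws proportional to group size. `group_of` maps an object to
    its group label.
    """
    if (n_per_group is None) == (fraction is None):
        raise ValueError("specify exactly one of n_per_group or fraction")
    rng = random.Random(seed)
    groups = defaultdict(list)
    for obj in objects:
        groups[group_of(obj)].append(obj)
    sample = []
    for members in groups.values():
        if n_per_group is not None:
            k = min(n_per_group, len(members))
        else:
            # Keep at least one object so that no group disappears.
            k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data: 990 "common" objects and 10 "rare" ones.
data = [("common", i) for i in range(990)] + [("rare", i) for i in range(10)]
equal = stratified_sample(data, group_of=lambda o: o[0], n_per_group=5)
proportional = stratified_sample(data, group_of=lambda o: o[0], fraction=0.1)
print(len(equal), len(proportional))
```

The equal-size version over-represents the rare class relative to the population, which is often the desired effect when building a classifier for rare classes.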

Example 2.8 (Sampling and Loss of Information).

Once a sampling technique has been selected, it is still necessary to

choose the sample size. Larger sample sizes increase the probability that

a sample will be representative, but they also eliminate much of the

advantage of sampling. Conversely, with smaller sample sizes, patterns

can be missed or erroneous patterns can be detected. Figure 2.9(a)

shows a data set that contains 8000 two-dimensional points, while Figures

2.9(b) and 2.9(c) show samples from this data set of size 2000 and

500, respectively. Although most of the structure of this data set is present

in the sample of 2000 points, much of the structure is missing in the

sample of 500 points.

Figure 2.9.

Example of the loss of structure with sampling.

Example 2.9 (Determining the Proper Sample

Size).

To illustrate that determining the proper sample size requires a methodical

approach, consider the following task.

Given a set of data consisting of a small number of almost equal-sized groups, find at least one

representative point for each of the groups. Assume that the objects in each group are highly

similar to each other, but not very similar to objects in different groups. Figure 2.10(a) shows

an idealized set of clusters (groups) from which these points might be drawn.

Figure 2.10.

Finding representative points from 10 groups.

This problem can be efficiently solved using sampling. One approach is to

take a small sample of data points, compute the pairwise similarities

between points, and then form groups of points that are highly similar. The

desired set of representative points is then obtained by taking one point

from each of these groups. To follow this approach, however, we need to

determine a sample size that would guarantee, with a high probability, the

desired outcome; that is, that at least one point will be obtained from each

cluster. Figure 2.10(b) shows the probability of getting one object from

each of the 10 groups as the sample size runs from 10 to 60. Interestingly,

with a sample size of 20, there is little chance (20%) of getting a sample

that includes all 10 clusters. Even with a sample size of 30, there is still a

moderate chance (almost 40%) of getting a sample that doesn’t contain

objects from all 10 clusters. This issue is further explored in the context of

clustering by Exercise 4 on page 603.
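
The probabilities quoted in this example can be checked with a short calculation. The sketch below assumes exactly equal-sized groups and draws that are independent and uniform over groups (sampling with replacement, or a population much larger than the sample); the function name is hypothetical. It applies inclusion-exclusion over the groups that might be missed.

```python
from math import comb


def prob_all_groups_covered(sample_size, n_groups=10):
    """Probability that a sample contains at least one object from every group.

    Term k of the sum accounts for the ways of choosing k groups to miss,
    with alternating signs to correct for double counting.
    """
    return sum(
        (-1) ** k * comb(n_groups, k) * (1 - k / n_groups) ** sample_size
        for k in range(n_groups + 1)
    )


for size in (10, 20, 30, 40, 50, 60):
    print(size, round(prob_all_groups_covered(size), 3))
```

Under these assumptions the probability is about 0.21 for a sample of 20 and about 0.63 for a sample of 30, consistent with the figures of 20% and almost 40% (for failure) given above.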

Progressive Sampling

The proper sample size can be difficult to determine, so adaptive or

progressive sampling schemes are sometimes used. These approaches

start with a small sample, and then increase the sample size until a sample of

sufficient size has been obtained. While this technique eliminates the need to

determine the correct sample size initially, it requires that there be a way to

evaluate the sample to judge if it is large enough.

Suppose, for instance, that progressive sampling is used to learn a predictive

model. Although the accuracy of predictive models increases as the sample

size increases, at some point the increase in accuracy levels off. We want to

stop increasing the sample size at this leveling-off point. By keeping track of

the change in accuracy of the model as we take progressively larger samples,

and by taking other samples close to the size of the current one, we can get

an estimate of how close we are to this leveling-off point, and thus, stop

sampling.
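
A minimal sketch of this stopping rule follows. The function `evaluate` is a hypothetical, caller-supplied routine that draws a sample of the given size, fits the model, and returns its accuracy; the doubling schedule and the tolerance are likewise assumptions. The sketch omits the refinement of taking additional samples near the current size.

```python
def progressive_sample_size(evaluate, start=100, growth=2, max_size=100_000, tol=0.005):
    """Grow the sample until the gain in accuracy falls below `tol`.

    Returns the final sample size and the accuracy observed at that size.
    """
    size = start
    previous = evaluate(size)
    while size * growth <= max_size:
        size *= growth
        current = evaluate(size)
        if current - previous < tol:  # accuracy has leveled off
            return size, current
        previous = current
    return size, previous


# Toy stand-in for a learning curve that levels off near 0.9.
def toy_accuracy(n):
    return 0.9 - 0.5 / (n ** 0.5)


print(progressive_sample_size(toy_accuracy))
```

In practice the accuracy estimate is noisy, so a single comparison against a tolerance can stop too early; averaging over several samples of similar size is one way to reduce that risk.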

2.3.3 Dimensionality Reduction

Data sets can have a large number of features. Consider a set of documents,

where each document is represented by a vector whose components are the

frequencies with which each word occurs in the document. In such cases,

there are typically thousands or tens of thousands of attributes (components),

one for each word in the vocabulary. As another example, consider a set of

time series consisting of the daily closing price of various stocks over a period

of 30 years. In this case, the attributes, which are the prices on specific days,

again number in the thousands.

There are a variety of benefits to dimensionality reduction. A key benefit is that

many data mining algorithms work better if the dimensionality—the number of

attributes in the data—is lower. This is partly because dimensionality reduction

can eliminate irrelevant features and reduce noise and partly because of the

curse of dimensionality, which is explained below. Another benefit is that a

reduction of dimensionality can lead to a more understandable model because

the model usually involves fewer attributes. Also, dimensionality reduction

may allow the data to be more easily visualized. Even if dimensionality

reduction doesn’t reduce the data to two or three dimensions, data is often

visualized by looking at pairs or triplets of attributes, and the number of such

combinations is greatly reduced. Finally, the amount of time and memory

required by the data mining algorithm is reduced with a reduction in

dimensionality.

The term dimensionality reduction is often reserved for those techniques that

reduce the dimensionality of a data set by creating new attributes that are a

combination of the old attributes. The reduction of dimensionality by selecting

attributes that are a subset of the old is known as feature subset selection or

feature selection. It will be discussed in Section 2.3.4 .

In the remainder of this section, we briefly introduce two important topics: the

curse of dimensionality and dimensionality reduction techniques based on

linear algebra approaches such as principal components analysis (PCA).

More details on dimensionality reduction can be found in Appendix B.

The Curse of Dimensionality

The curse of dimensionality refers to the phenomenon that many types of data

analysis become significantly harder as the dimensionality of the data

increases. Specifically, as dimensionality increases, the data becomes

increasingly sparse in the space that it occupies. Thus, the data objects we

observe are quite possibly not a representative sample of all possible objects.

For classification, this can mean that there are not enough data objects to

allow the creation of a model that reliably assigns a class to all possible

objects. For clustering, the differences in density and in the distances between

points, which are critical for clustering, become less meaningful. (This is

discussed further in Sections 8.1.2, 8.4.6, and 8.4.8.) As a result, many

clustering and classification algorithms (and other data analysis algorithms)

have trouble with high-dimensional data, leading to reduced classification

accuracy and poor quality clusters.
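
The loss of meaning in distances can be illustrated with a small experiment. The sketch below measures how much farther the farthest point is than the nearest one, relative to the nearest, as the dimensionality grows. Drawing points uniformly from the unit hypercube, the number of points, and the function name are all assumptions made for illustration.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 2))
```

As the dimensionality increases, the ratio shrinks toward zero: all points become nearly equidistant from the query, so the notion of a nearest neighbor carries less information.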

Linear Algebra Techniques for Dimensionality

Reduction

Some of the most common approaches for dimensionality reduction,

particularly for continuous data, use techniques from linear algebra to project

the data from a high-dimensional space into a lower-dimensional space.

Principal Components Analysis (PCA) is a linear algebra technique for

continuous attributes that finds new attributes (principal components) that (1)

are linear combinations of the original attributes, (2) are orthogonal

(perpendicular) to each other, and (3) capture the maximum amount of

variation in the data. For example, the first two principal components capture

as much of the variation in the data as is possible with two orthogonal

attributes that are linear combinations of the original attributes. Singular

Value Decomposition (SVD) is a linear algebra technique that is related to

PCA and is also commonly used for dimensionality reduction. For additional

details, see Appendices A and B.
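
To make the idea of a principal component concrete, the following sketch finds the first principal component of a small data set using power iteration on the sample covariance matrix. This is an illustrative stand-in, not the method of the appendices: a library eigen-solver or SVD routine would normally be used, and the function name and data are hypothetical. The sketch assumes the data is not constant and that the starting vector is not orthogonal to the leading direction.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, variance) of the first principal component."""
    n, d = len(rows), len(rows[0])
    # Center each attribute on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the attributes.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly apply the covariance matrix and renormalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, variance


# Hypothetical two-attribute data lying close to the line y = 2x.
points = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.2), (3.0, 5.8), (4.0, 8.1)]
direction, variance = first_principal_component(points)
print(direction, variance)
```

Projecting each centered point onto `direction` yields a single new attribute that retains most of the variation in the original two.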

2.3.4 Feature Subset Selection

Another way to reduce the dimensionality is to use only a subset of the

features. While it might seem that such an approach would lose information,

this is not the case if redundant and irrelevant features are present.

Redundant features duplicate much or all of the information contained in one

or more other attributes. For example, the purchase price of a product and the

amount of sales tax paid contain much of the same information. Irrelevant

features contain almost no useful information for the data mining task at

hand. For instance, students’ ID numbers are irrelevant to the task of

predicting students’ grade point averages. Redundant and irrelevant features

can reduce classification accuracy and the quality of the clusters that are

found.

While some irrelevant and redundant attributes can be eliminated immediately

by using common sense or domain knowledge, selecting the best subset of

features frequently requires a systematic approach. The ideal approach to

feature selection is to try all possible subsets of features as input to the data

mining algorithm of interest, and then take the subset that produces the best

results. This method has the advantage of reflecting the objective and bias of

the data mining algorithm that will eventually be used. Unfortunately, since the

number of subsets involving n attributes is 2^n, such an approach is impractical

in most situations and alternative strategies are needed. There are three

standard approaches to feature selection: embedded, filter, and wrapper.

Embedded approaches

Feature selection occurs naturally as part of the data mining algorithm.

Specifically, during the operation of the data mining algorithm, the algorithm

itself decides which attributes to use and which to ignore. Algorithms for

building decision tree classifiers, which are discussed in Chapter 3, often

operate in this manner.

Filter approaches

Features are selected before the data mining algorithm is run, using some

approach that is independent of the data mining task. For example, we might

select sets of attributes whose pairwise correlation is as low as possible so

that the attributes are non-redundant.

Wrapper approaches

These methods use the target data mining algorithm as a black box to find the

best subset of attributes, in a way similar to that of the ideal algorithm

described above, but typically without enumerating all possible subsets.

Because the embedded approaches are algorithm-specific, only the filter and

wrapper approaches will be discussed further here.
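
The correlation-based filter mentioned above can be sketched as a greedy pass over the features. The threshold of 0.95, the function names, and the toy columns are assumptions; the sketch also assumes that no column is constant, since the correlation is undefined in that case.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_redundant(columns, threshold=0.95):
    """Keep a feature only if its absolute correlation with every feature
    already kept is below `threshold`.

    `columns` maps a feature name to its list of values.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical data: sales tax is a fixed 7% of price, so it is redundant.
columns = {
    "price": [10.0, 25.0, 40.0, 75.0, 120.0],
    "sales_tax": [0.70, 1.75, 2.80, 5.25, 8.40],
    "items_in_basket": [3, 1, 4, 1, 5],
}
print(drop_redundant(columns))
```

Note that this filter never consults the data mining algorithm itself, which is what distinguishes it from a wrapper. The result depends on the order in which features are visited.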

An Architecture for Feature Subset Selection

It is possible to encompass both the filter and wrapper approaches within a

common architecture. The feature selection process is viewed as consisting of

four parts: a measure for evaluating a subset, a search strategy that controls

the generation of a new subset of features, a stopping criterion, and a

validation procedure. Filter methods and wrapper methods differ only in the

way in which they evaluate a subset of features. For a wrapper method,

subset evaluation uses the target data mining algorithm, while for a filter

approach, the evaluation technique is distinct from the target data mining

algorithm. The following discussion provides some details of this approach,

which is summarized in Figure 2.11.

Figure 2.11.

Flowchart of a feature subset selection process.

Conceptually, feature subset selection is a search over all possible subsets of

features. Many different types of search strategies can be used, but the

search strategy should be computationally inexpensive and should find

optimal or near optimal sets of features. It is usually not possible to satisfy

both requirements, and thus, trade-offs are necessary.

An integral part of the search is an evaluation step to judge how the current

subset of features compares to others that have been considered. This

requires an evaluation measure that attempts to determine the goodness of a

subset of attributes with respect to a particular data mining task, such as

classification or clustering. For the filter approach, such measures attempt to

predict how well the actual data mining algorithm will perform on a given set of

attributes. For the wrapper approach, where evaluation consists of actually

running the target data mining algorithm, the subset evaluation function is

simply the criterion normally used to measure the result of the data mining.

Because the number of subsets can be enormous and it is impractical to

examine them all, some sort of stopping criterion is necessary. This criterion is

usually based on one or more conditions involving the following: the number

of iterations, whether the value of the subset evaluation measure is optimal or

exceeds a certain threshold, whether a subset of a certain size has been

obtained, and whether any improvement can be achieved by the options

available to the search strategy.

Finally, once a subset of features has been selected, the results of the target

data mining algorithm on the selected subset should be validated. A

straightforward validation approach is to run the algorithm with the full set of

features and compare the full results to results obtained using the subset of

features. Hopefully, the subset of features will produce results that are better

than or almost as good as those produced when using all features. Another

validation approach is to use a number of different feature selection

algorithms to obtain subsets of features and then compare the results of

running the data mining algorithm on each subset.
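
The search, evaluation, and stopping parts of this architecture can be sketched with greedy forward selection, one common search strategy that is not singled out in the text. Here `evaluate` is a hypothetical, caller-supplied scoring function (higher is better): for a wrapper method it would train and score the target algorithm on the subset, while for a filter method it would be an independent measure. The validation step is left out.

```python
def forward_selection(features, evaluate, max_features=None, min_gain=1e-9):
    """Greedy forward search over feature subsets.

    Starts from the empty subset and, at each step, adds the feature that
    most improves the score. Stops when no feature improves the score by
    at least `min_gain` or when the size limit is reached.
    """
    selected = []
    best_score = evaluate(selected)
    limit = len(features) if max_features is None else min(max_features, len(features))
    while len(selected) < limit:
        candidates = [f for f in features if f not in selected]
        scored = [(evaluate(selected + [f]), f) for f in candidates]
        score, feature = max(scored)
        if score - best_score < min_gain:  # stopping criterion: no improvement
            break
        selected.append(feature)
        best_score = score
    return selected, best_score


# Toy scoring function: two useful features, and a small penalty per feature.
useful = {"age": 0.3, "mileage": 0.2}


def toy_score(subset):
    return sum(useful.get(f, 0.0) for f in subset) - 0.01 * len(subset)


print(forward_selection(["age", "driver_id", "mileage"], toy_score))
```

This search evaluates on the order of n^2 subsets rather than 2^n, at the cost of possibly missing the best subset, which is the trade-off noted above.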

Feature Weighting

Feature weighting is an alternative to keeping or eliminating features. More

important features are assigned a higher weight, while less important features

are given a lower weight. These weights are sometimes assigned based on

domain knowledge about the relative importance of features. Alternatively,

they can sometimes be determined automatically. For example, some

classification schemes, such as support vector machines (Chapter 4),

produce classification models in which each feature is given a weight.

Features with larger weights play a more important role in the model. The

normalization of objects that takes place when computing the cosine similarity

(Section 2.4.5) can also be regarded as a type of feature weighting.

2.3.5 Feature Creation

It is frequently possible to create, from the original attributes, a new set of

attributes that captures the important information in a data set much more

effectively. Furthermore, the number of new attributes can be smaller than the

number of original attributes, allowing us to reap all the previously described

benefits of dimensionality reduction. Two related methodologies for creating

new attributes are described next: feature extraction and mapping the data to

a new space.

Feature Extraction

The creation of a new set of features from the original raw data is known as

feature extraction. Consider a set of photographs, where each photograph is

to be classified according to whether it contains a human face. The raw data

is a set of pixels, and as such, is not suitable for many types of classification

algorithms. However, if the data is processed to provide higher-level features,

such as the presence or absence of certain types of edges and areas that are

highly correlated with the presence of human faces, then a much broader set

of classification techniques can be applied to this problem.

Unfortunately, in the sense in which it is most commonly used, feature

extraction is highly domain-specific. For a particular field, such as image

processing, various features and the techniques to extract them have been

developed over a period of time, and often these techniques have limited

applicability to other fields. Consequently, whenever data mining is applied to

a relatively new area, a key task is the development of new features and

feature extraction methods.

Although feature extraction is often complicated, Example 2.10 illustrates

that it can be relatively straightforward.

Example 2.10 (Density).

Consider a data set consisting of information about historical artifacts,

which, along with other information, contains the volume and mass of each

artifact. For simplicity, assume that these artifacts are made of a small

number of materials (wood, clay, bronze, gold) and that we want to classify

the artifacts with respect to the material of which they are made. In this

case, a density feature constructed from the mass and volume features,

i.e., density = mass/volume, would most directly yield an accurate

classification. Although there have been some attempts to automatically

perform such simple feature extraction by exploring basic mathematical

combinations of existing attributes, the most common approach is to

construct features using domain expertise.
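As a minimal sketch of Example 2.10, the density feature can be derived directly from the mass and volume attributes. The artifact names, the numbers, and the helper `add_density` below are hypothetical and serve only as an illustration.

```python
# Hypothetical artifact records: (name, mass in grams, volume in cubic centimeters).
artifacts = [
    ("figurine", 1930.0, 100.0),
    ("bowl", 380.0, 200.0),
    ("mask", 90.0, 150.0),
]


def add_density(records):
    """Return (name, mass, volume, density) tuples, where density = mass / volume."""
    return [(name, mass, volume, mass / volume) for name, mass, volume in records]


for name, mass, volume, density in add_density(artifacts):
    print(f"{name}: density = {density:.2f} g/cm^3")
```

A classifier given the single density attribute can then separate materials whose mass and volume ranges overlap.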

Mapping the Data to a New Space

A totally different view of the data can reveal important and interesting

features. Consider, for example, time series data, which often contains

periodic patterns. If there is only a single periodic pattern and not much noise,

then the pattern is easily detected. If, on the other hand, there are a number of

periodic patterns and a significant amount of noise, then these patterns are

hard to detect. Such patterns can, nonetheless, often be detected by applying

a Fourier transform to the time series in order to change to a representation

in which frequency information is explicit. In Example 2.11, it will not be

necessary to know the details of the Fourier transform. It is enough to know

that, for each time series, the Fourier transform produces a new data object

whose attributes are related to frequencies.

Example 2.11 (Fourier Analysis).

The time series presented in Figure 2.12(b) is the sum of three other

time series, two of which are shown in Figure 2.12(a) and have

frequencies of 7 and 17 cycles per second, respectively. The third time

series is random noise. Figure 2.12(c) shows the power spectrum that

can be computed after applying a Fourier transform to the original time

series. (Informally, the power spectrum is proportional to the square of

each frequency attribute.) In spite of the noise, there are two peaks that

correspond to the frequencies of the two original, non-noisy time series. Again,

the main point is that better features can reveal important aspects of the

data.

Figure 2.12.

Application of the Fourier transform to identify the underlying frequencies

in time series data.

Many other sorts of transformations are also possible. Besides the Fourier

transform, the wavelet transform has also proven very useful for time series

and other types of data.
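The idea behind Example 2.11 can be sketched with a naive discrete Fourier transform. This is an illustration under stated assumptions, not the procedure used to produce Figure 2.12: the sampling rate (128 samples over one second), the noise level, and the function name `power_spectrum` are all our own choices.

```python
import cmath
import math
import random


def power_spectrum(signal):
    """Naive O(n^2) discrete Fourier transform; returns |X_f|^2 for f = 0 .. n/2 - 1."""
    n = len(signal)
    spectrum = []
    for f in range(n // 2):
        coeff = sum(signal[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


# Assumed setup: one second sampled 128 times, so index f means f cycles per second.
random.seed(0)
n = 128
series = [
    math.sin(2 * math.pi * 7 * t / n)
    + math.sin(2 * math.pi * 17 * t / n)
    + random.gauss(0, 0.5)  # additive random noise
    for t in range(n)
]

spectrum = power_spectrum(series)
# Indices of the two largest power values, in increasing order.
top_two = sorted(sorted(range(len(spectrum)), key=spectrum.__getitem__, reverse=True)[:2])
print(top_two)
```

With this setup the two largest spectrum values fall at the frequencies of the two sinusoids, even though neither is obvious in the noisy series itself.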

2.3.6 Discretization and Binarization

Some data mining algorithms, especially certain classification algorithms,

require that the data be in the form of categorical attributes. Algorithms that

find association patterns require that the data be in the form of binary

attributes. Thus, it is often necessary to transform a continuous attribute into a

categorical attribute (discretization), and both continuous and discrete

attributes may need to be transformed into one or more binary attributes

(binarization). Additionally, if a categorical attribute has a large number of

values (categories), or some values occur infrequently, then it can be

beneficial for certain data mining tasks to reduce the number of categories by

combining some of the values.

As with feature selection, the best discretization or binarization approach is

the one that “produces the best result for the data mining algorithm that will be

used to analyze the data.” It is typically not practical to apply such a criterion

directly. Consequently, discretization or binarization is performed in a way that

satisfies a criterion that is thought to have a relationship to good performance

for the data mining task being considered. In general, the best discretization

depends on the algorithm being used, as well as the other attributes being

considered. Typically, however, the discretization of each attribute is

considered in isolation.

Binarization

A simple technique to binarize a categorical attribute is the following: If there
are m categorical values, then uniquely assign each original value to an
integer in the interval $[0, m-1]$. If the attribute is ordinal, then order must be
maintained by the assignment. (Note that even if the attribute is originally
represented using integers, this process is necessary if the integers are not in
the interval $[0, m-1]$.) Next, convert each of these m integers to a binary
number. Since $n = \lceil \log_2(m) \rceil$ binary digits are required to represent these
integers, represent these binary numbers using n binary attributes. To
illustrate, a categorical variable with 5 values {awful, poor, OK, good, great}
would require three binary variables $x_1$, $x_2$, and $x_3$. The conversion is shown
in Table 2.5.

Table 2.5. Conversion of a categorical attribute to three binary attributes.

Categorical Value   Integer Value   x1   x2   x3
awful               0               0    0    0
poor                1               0    0    1
OK                  2               0    1    0
good                3               0    1    1
great               4               1    0    0

Such a transformation can cause complications, such as creating unintended
relationships among the transformed attributes. For example, in Table 2.5,
attributes $x_2$ and $x_3$ are correlated because information about the good value
is encoded using both attributes. Furthermore, association analysis requires
asymmetric binary attributes, where only the presence of the attribute
(value = 1) is important. For association problems, it is therefore necessary to
introduce one asymmetric binary attribute for each categorical value, as
shown in Table 2.6. If the number of resulting attributes is too large, then
the techniques described in the following sections can be used to reduce the
number of categorical values before binarization.

Table 2.6. Conversion of a categorical attribute to five asymmetric binary
attributes.

Categorical Value   Integer Value   x1   x2   x3   x4   x5
awful               0               1    0    0    0    0
poor                1               0    1    0    0    0
OK                  2               0    0    1    0    0
good                3               0    0    0    1    0
great               4               0    0    0    0    1

Likewise, for association problems, it can be necessary to replace a single

binary attribute with two asymmetric binary attributes. Consider a binary

attribute that records a person’s gender, male or female. For traditional

association rule algorithms, this information needs to be transformed into two

asymmetric binary attributes, one that is a 1 only when the person is male and

one that is a 1 only when the person is female. (For asymmetric binary

attributes, the information representation is somewhat inefficient in that two

bits of storage are required to represent each bit of information.)
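The two binarization schemes of Tables 2.5 and 2.6 can be sketched as follows. The function names `binary_encode` and `one_hot_encode` are our own, and the sketch assumes the category list is non-empty and already in the order that should be preserved.

```python
import math


def binary_encode(values):
    """Map each category to n = ceil(log2(m)) bits encoding its integer code in [0, m-1].

    `values` must be listed in the order to preserve (relevant for ordinal attributes).
    """
    m = len(values)
    n = max(1, math.ceil(math.log2(m)))  # at least one bit, even when m == 1
    return {v: [int(bit) for bit in format(i, f"0{n}b")] for i, v in enumerate(values)}


def one_hot_encode(values):
    """Introduce one asymmetric binary attribute per categorical value."""
    return {v: [1 if i == j else 0 for j in range(len(values))] for i, v in enumerate(values)}


quality = ["awful", "poor", "OK", "good", "great"]
print(binary_encode(quality)["good"])   # [0, 1, 1]
print(one_hot_encode(quality)["good"])  # [0, 0, 0, 1, 0]
```

The first encoding is compact but ties several values to the same attribute; the second uses more attributes but keeps each value independent, which is what association analysis needs.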

Discretization of Continuous Attributes

Discretization is typically applied to attributes that are used in classification or
association analysis. Transformation of a continuous attribute to a categorical
attribute involves two subtasks: deciding how many categories, n, to have and
determining how to map the values of the continuous attribute to these
categories. In the first step, after the values of the continuous attribute are
sorted, they are then divided into n intervals by specifying $n-1$ split points. In
the second, rather trivial step, all the values in one interval are mapped to the
same categorical value. Therefore, the problem of discretization is one of
deciding how many split points to choose and where to place them. The result
can be represented either as a set of intervals
$\{(x_0, x_1], (x_1, x_2], \ldots, (x_{n-1}, x_n)\}$,
where $x_0$ and $x_n$ can be $-\infty$ and $+\infty$, respectively, or equivalently, as a
series of inequalities $x_0 < x \le x_1, \ldots, x_{n-1} < x < x_n$.

Unsupervised Discretization

A basic distinction between discretization methods for classification is whether
class information is used (supervised) or not (unsupervised). If class
information is not used, then relatively simple approaches are common. For
instance, the equal width approach divides the range of the attribute into a
user-specified number of intervals each having the same width. Such an
approach can be badly affected by outliers, and for that reason, an equal
frequency (equal depth) approach, which tries to put the same number of
objects into each interval, is often preferred. As another example of
unsupervised discretization, a clustering method, such as K-means (see
Chapter 7), can also be used. Finally, visually inspecting the data can
sometimes be an effective approach.

Example 2.12 (Discretization Techniques).

This example demonstrates how these approaches work on an actual data
set. Figure 2.13(a) shows data points belonging to four different groups,
along with two outliers, the large dots on either end. The techniques of the
previous paragraph were applied to discretize the x values of these data
points into four categorical values. (Points in the data set have a random y
component to make it easy to see how many points are in each group.)
Visually inspecting the data works quite well, but is not automatic, and
thus, we focus on the other three approaches. The split points produced by
the techniques equal width, equal frequency, and K-means are shown in
Figures 2.13(b), 2.13(c), and 2.13(d), respectively. The split points
are represented as dashed lines.

Figure 2.13.

Different discretization techniques.

In this particular example, if we measure the performance of a

discretization technique by the extent to which different objects that clump

together have the same categorical value, then K-means performs best,

followed by equal frequency, and finally, equal width. More generally, the

best discretization will depend on the application and often involves

domain-specific discretization. For example, the discretization of people

into low income, middle income, and high income is based on economic

factors.
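The equal width and equal frequency approaches discussed above can be sketched as follows. The data values are made up (they are not the points of Figure 2.13), the function names are our own, and placing each equal frequency split midway between neighboring sorted values is one possible convention among several.

```python
def equal_width_splits(values, n):
    """Return n - 1 split points dividing [min, max] into n intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    return [lo + i * width for i in range(1, n)]


def equal_frequency_splits(values, n):
    """Return n - 1 split points so each interval holds roughly the same number of values."""
    ordered = sorted(values)
    m = len(ordered)
    splits = []
    for i in range(1, n):
        idx = i * m // n
        # Convention chosen here: split midway between the neighboring sorted values.
        splits.append((ordered[idx - 1] + ordered[idx]) / 2)
    return splits


def discretize(x, splits):
    """Map x to the index of its interval; interval i is (splits[i-1], splits[i]]."""
    return sum(1 for s in splits if x > s)


# Made-up data: a tight group of small values plus one outlier at 100.
data = [1, 2, 2, 3, 3, 3, 4, 5, 100]
print(equal_width_splits(data, 3))      # [34.0, 67.0]
print(equal_frequency_splits(data, 3))  # [2.5, 3.5]
```

With the equal width split points, every value except the outlier lands in the first interval, which illustrates why the approach is sensitive to outliers; the equal frequency split points place three values in each interval.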

Supervised Discretization

If classification is our application and class labels are known for some data

objects, then discretization approaches that use class labels often produce

better classification. This should not be surprising, since an interval

constructed with no knowledge of class labels often contains a mixture of

class labels. A conceptually simple approach is to place the splits in a way

that maximizes the purity of the intervals, i.e., the extent to which an interval

contains a single class label. In practice, however, such an approach requires

potentially arbitrary decisions about the purity of an interval and the minimum

size of an interval.

To overcome such concerns, some statistically based approaches start with

each attribute value in a separate interval and create larger intervals by

merging adjacent intervals that are similar according to a statistical test. An

alternative to this bottom-up approach is a top-down approach that starts by

bisecting the initial values so that the resulting two intervals give minimum

entropy. This technique only needs to consider each value as a possible split

point, because it is assumed that intervals contain ordered sets of values. The

splitting process is then repeated with another interval, typically choosing the

interval with the worst (highest) entropy, until a user-specified number of

intervals is reached, or a stopping criterion is satisfied.

Entropy-based approaches are among the most promising approaches to
discretization, whether bottom-up or top-down. First, it is necessary to define
entropy. Let k be the number of different class labels, $m_i$ be the number of
values in the $i$th interval of a partition, and $m_{ij}$ be the number of values of class
j in interval i. Then the entropy $e_i$ of the $i$th interval is given by the equation

$$e_i = -\sum_{j=1}^{k} p_{ij} \log_2 p_{ij},$$

where $p_{ij} = m_{ij}/m_i$ is the probability (fraction of values) of class j in the $i$th
interval. The total entropy, e, of the partition is the weighted average of the
individual interval entropies, i.e.,

$$e = \sum_{i=1}^{n} w_i e_i,$$

where m is the number of values, $w_i = m_i/m$ is the fraction of values in the $i$th
interval, and n is the number of intervals. Intuitively, the entropy of an interval
is a measure of the purity of an interval. If an interval contains only values of
one class (is perfectly pure), then the entropy is 0 and it contributes nothing to
the overall entropy. If the classes of values in an interval occur equally often
(the interval is as impure as possible), then the entropy is a maximum.

Example 2.13 (Discretization of Two Attributes).

The top-down method based on entropy was used to independently
discretize both the x and y attributes of the two-dimensional data shown in
Figure 2.14. In the first discretization, shown in Figure 2.14(a), the x
and y attributes were both split into three intervals. (The dashed lines
indicate the split points.) In the second discretization, shown in Figure
2.14(b), the x and y attributes were both split into five intervals.

Figure 2.14.

Discretizing x and y attributes for four groups (classes) of points.

This simple example illustrates two aspects of discretization. First, in two

dimensions, the classes of points are well separated, but in one dimension,

this is not so. In general, discretizing each attribute separately often

produces suboptimal results. Second, five intervals work better than three,

but six intervals do not improve the discretization much, at least in terms of

entropy. (Entropy values and results for six intervals are not shown.)

Consequently, it is desirable to have a stopping criterion that automatically

finds the right number of partitions.
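One bisecting step of the top-down, entropy-based approach can be sketched as follows, using the definitions of $e_i$ and $e$ given earlier. The data are made up, the function names are our own, and placing the candidate split midway between adjacent distinct values is an assumption of this sketch rather than a requirement of the method.

```python
import math
from collections import Counter


def entropy(labels):
    """e_i = -sum_j p_ij * log2(p_ij) for the class labels that fall in one interval."""
    m_i = len(labels)
    return -sum((c / m_i) * math.log2(c / m_i) for c in Counter(labels).values())


def partition_entropy(intervals):
    """e = sum_i w_i * e_i with w_i = m_i / m."""
    m = sum(len(labels) for labels in intervals)
    return sum(len(labels) / m * entropy(labels) for labels in intervals if labels)


def best_split(points):
    """Return (split_point, total_entropy) for the best bisection of (value, label) pairs."""
    points = sorted(points)
    best = None
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue  # no split point exists between equal values
        left = [label for _, label in points[:i]]
        right = [label for _, label in points[i:]]
        e = partition_entropy([left, right])
        if best is None or e < best[1]:
            best = ((points[i - 1][0] + points[i][0]) / 2, e)
    return best


# Made-up one-dimensional data with two classes.
data = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (6.0, "b"), (6.5, "b"), (7.0, "a")]
split, total = best_split(data)
print(split, round(total, 3))
```

Repeating `best_split` on the interval with the highest entropy, until a chosen number of intervals or a stopping criterion is reached, gives the full top-down procedure.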

Categorical Attributes with Too Many Values

Categorical attributes can sometimes have too many values. If the categorical

attribute is an ordinal attribute, then techniques similar to those for continuous

attributes can be used to reduce the number of categories. If the categorical

attribute is nominal, however, then other approaches are needed. Consider a

university that has a large number of departments. Consequently, a

department name attribute might have dozens of different values. In this

situation, we could use our knowledge of the relationships among different

departments to combine departments into larger groups, such as engineering,

social sciences, or biological sciences. If domain knowledge does not serve

as a useful guide or such an approach results in poor classification

performance, then it is necessary to use a more empirical approach, such as

grouping values together only if such a grouping results in improved

classification accuracy or achieves some other data mining objective.

2.3.7 Variable Transformation

A variable transformation refers to a transformation that is applied to all the

values of a variable. (We use the term variable instead of attribute to adhere

to common usage, although we will also refer to attribute transformation on

occasion.) In other words, for each object, the transformation is applied to the

value of the variable for that object. For example, if only the magnitude of a

variable is important, then the values of the variable can be transformed by

taking the absolute value. In the following section, we discuss two important

types of variable transformations: simple functional transformations and

normalization.

Simple Functions

For this type of variable transformation, a simple mathematical function is
applied to each value individually. If x is a variable, then examples of such
transformations include $x^k$, $\log x$, $e^x$, $\sqrt{x}$, $1/x$, $\sin x$, or $|x|$. In statistics, variable
transformations, especially sqrt, log, and 1/x, are often used to transform data
that does not have a Gaussian (normal) distribution into data that does. While
this can be important, other reasons often take precedence in data mining.
Suppose the variable of interest is the number of data bytes in a session, and
the number of bytes ranges from 1 to 1 billion. This is a huge range, and it can
be advantageous to compress it by using a $\log_{10}$ transformation. In this case,
sessions that transferred $10^8$ and $10^9$ bytes would be more similar to each
other than sessions that transferred 10 and 1000 bytes
($9 - 8 = 1$ versus $3 - 1 = 2$).
For some applications, such as network intrusion detection, this may be what
is desired, since the first two sessions most likely represent transfers of large
files, while the latter two sessions could be two quite distinct types of
sessions.

Variable transformations should be applied with caution because they change
the nature of the data. While this is what is desired, there can be problems if
the nature of the transformation is not fully appreciated. For instance, the
transformation 1/x reduces the magnitude of values that are 1 or larger, but
increases the magnitude of values between 0 and 1. To illustrate, the values
{1, 2, 3} go to $\{1, \frac{1}{2}, \frac{1}{3}\}$, but the values $\{1, \frac{1}{2}, \frac{1}{3}\}$ go to {1, 2, 3}. Thus, for
any set of positive values, the transformation 1/x reverses the order. To help clarify the
effect of a transformation, it is important to ask questions such as the
following: What is the desired property of the transformed attribute? Does the
order need to be maintained? Does the transformation apply to all values,
especially negative values and 0? What is the effect of the transformation on
the values between 0 and 1? Exercise 17 on page 109 explores other
aspects of variable transformation.

Normalization or Standardization

The goal of standardization or normalization is to make an entire set of values
have a particular property. A traditional example is that of “standardizing a
variable” in statistics. If $\bar{x}$ is the mean (average) of the attribute values and $s_x$
is their standard deviation, then the transformation $x' = (x - \bar{x})/s_x$ creates a new

variable that has a mean of 0 and a standard deviation of 1. If different

variables are to be used together, e.g., for clustering, then such a

transformation is often necessary to avoid having a variable with large values

dominate the results of the analysis. To illustrate, consider comparing people

based on two variables: age and income. For any two people, the difference in

income will likely be much higher in absolute terms (hundreds or thousands of

dollars) than the difference in age (less than 150). If the differences in the

range of values of age and income are not taken into account, then the

comparison between people will be dominated by differences in income. In

particular, if the similarity or dissimilarity of two people is calculated using the

similarity or dissimilarity measures defined later in this chapter, then in many

cases, such as that of Euclidean distance, the income values will dominate

the calculation.

The mean and standard deviation are strongly affected by outliers, so the
above transformation is often modified. First, the mean is replaced by the
median, i.e., the middle value. Second, the standard deviation is replaced by
the absolute standard deviation. Specifically, if x is a variable, then the
absolute standard deviation of x is given by $\sigma_A = \sum_{i=1}^{m} |x_i - \mu|$, where $x_i$ is the
$i$th value of the variable, m is the number of objects, and $\mu$ is either the mean
or median. Other approaches for computing estimates of the location (center)
and spread of a set of values in the presence of outliers are described in
statistics books. These more robust measures can also be used to define a
standardization transformation.
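The variable transformations of this section can be sketched as follows. The function names are our own, the sample standard deviation is an assumption (the text does not say which estimator is meant), and the robust variant uses the absolute standard deviation exactly as defined above, with the median as the center.

```python
import math
import statistics


def standardize(values):
    """x' = (x - mean) / s_x, giving a mean of 0 and a standard deviation of 1."""
    mean = statistics.mean(values)
    s_x = statistics.stdev(values)  # sample standard deviation (an assumption)
    return [(x - mean) / s_x for x in values]


def robust_standardize(values):
    """Same idea with the median and the absolute standard deviation sum(|x_i - mu|)."""
    mu = statistics.median(values)
    sigma_a = sum(abs(x - mu) for x in values)
    return [(x - mu) / sigma_a for x in values]


# A log10 transformation compresses a huge range of byte counts.
gap_large = math.log10(10**9) - math.log10(10**8)  # sessions of 10^8 and 10^9 bytes
gap_small = math.log10(1000) - math.log10(10)      # sessions of 10 and 1000 bytes
print(round(gap_large, 6), round(gap_small, 6))

print(standardize([10, 20, 30]))
print(robust_standardize([10, 20, 30]))
```

After the log transformation the two large sessions are closer to each other (a gap of 1) than the two small sessions (a gap of 2), which is the reverse of their ordering on the raw scale.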

2.4 Measures of Similarity and Dissimilarity

Similarity and dissimilarity are important because they are used by a number

of data mining techniques, such as clustering, nearest neighbor classification,

and anomaly detection. In many cases, the initial data set is not needed once

these similarities or dissimilarities have been computed. Such approaches can

be viewed as transforming the data to a similarity (dissimilarity) space and

then performing the analysis. Indeed, kernel methods are a powerful

realization of this idea. These methods are introduced in Section 2.4.7 and

are discussed more fully in the context of classification in Section 4.9.4.

We begin with a discussion of the basics: high-level definitions of similarity

and dissimilarity, and a discussion of how they are related. For convenience,

the term proximity is used to refer to either similarity or dissimilarity. Since

the proximity between two objects is a function of the proximity between the

corresponding attributes of the two objects, we first describe how to measure

the proximity between objects having only one attribute.

We then consider proximity measures for objects with multiple attributes. This

includes measures such as the Jaccard and cosine similarity measures, which

are useful for sparse data, such as documents, as well as correlation and

Euclidean distance, which are useful for non-sparse (dense) data, such as

time series or multi-dimensional points. We also consider mutual information,

which can be applied to many types of data and is good for detecting

nonlinear relationships. In this discussion, we restrict ourselves to objects with

relatively homogeneous attribute types, typically binary or continuous.

Next, we consider several important issues concerning proximity measures.

This includes how to compute proximity between objects when they have

heterogeneous types of attributes, and approaches to account for differences

of scale and correlation among variables when computing distance between

numerical objects. The section concludes with a brief discussion of how to

select the right proximity measure.

Although this section focuses on the computation of proximity between data

objects, proximity can also be computed between attributes. For example, for

the document-term matrix of Figure 2.2(d), the cosine measure can be

used to compute similarity between a pair of documents or a pair of terms

(words). Knowing that two variables are strongly related can, for example, be

helpful for eliminating redundancy. In particular, the correlation and mutual

information measures discussed later are often used for that purpose.

2.4.1 Basics

Definitions

Informally, the similarity between two objects is a numerical measure of the

degree to which the two objects are alike. Consequently, similarities are

higher for pairs of objects that are more alike. Similarities are usually non-

negative and are often between 0 (no similarity) and 1 (complete similarity).

The dissimilarity between two objects is a numerical measure of the degree

to which the two objects are different. Dissimilarities are lower for more similar

pairs of objects. Frequently, the term distance is used as a synonym for

dissimilarity, although, as we shall see, distance often refers to a special class

of dissimilarities. Dissimilarities sometimes fall in the interval [0, 1], but it is

also common for them to range from 0 to ∞.

Transformations

Transformations are often applied to convert a similarity to a dissimilarity, or

vice versa, or to transform a proximity measure to fall within a particular

range, such as [0,1]. For instance, we may have similarities that range from 1

to 10, but the particular algorithm or software package that we want to use

may be designed to work only with dissimilarities, or it may work only with

similarities in the interval [0,1]. We discuss these issues here because we will

employ such transformations later in our discussion of proximity. In addition,

these issues are relatively independent of the details of specific proximity

measures.

Frequently, proximity measures, especially similarities, are defined or
transformed to have values in the interval [0,1]. Informally, the motivation for
this is to use a scale in which a proximity value indicates the fraction of
similarity (or dissimilarity) between two objects. Such a transformation is often
relatively straightforward. For example, if the similarities between objects
range from 1 (not at all similar) to 10 (completely similar), we can make them
fall within the range [0, 1] by using the transformation $s' = (s - 1)/9$, where s and
s′ are the original and new similarity values, respectively. In the more general
case, the transformation of similarities to the interval [0, 1] is given by the
expression $s' = (s - \mathit{min\_s})/(\mathit{max\_s} - \mathit{min\_s})$, where max_s and min_s are the
maximum and minimum similarity values, respectively. Likewise, dissimilarity
measures with a finite range can be mapped to the interval [0,1] by using the
formula $d' = (d - \mathit{min\_d})/(\mathit{max\_d} - \mathit{min\_d})$. This is an example of a linear
transformation, which preserves the relative distances between points. In
other words, if points $x_1$ and $x_2$ are twice as far apart as points $x_3$ and $x_4$,
the same will be true after a linear transformation.

However, there can be complications in mapping proximity measures to the
interval [0, 1] using a linear transformation. If, for example, the proximity
measure originally takes values in the interval $[0, \infty]$, then max_d is not defined
and a nonlinear transformation is needed. Values will not have the same
relationship to one another on the new scale. Consider the transformation
$d' = d/(1 + d)$ for a dissimilarity measure that ranges from 0 to $\infty$. The
dissimilarities 0, 0.5, 2, 10, 100, and 1000 will be transformed into the new
dissimilarities 0, 0.33, 0.67, 0.91, 0.99, and 0.999, respectively. Larger values
on the original dissimilarity scale are compressed into the range of values
near 1, but whether this is desirable depends on the application.

Note that mapping proximity measures to the interval [0, 1] can also change
the meaning of the proximity measure. For example, correlation, which is
discussed later, is a measure of similarity that takes values in the interval
$[-1, 1]$. Mapping these values to the interval [0,1] by taking the absolute value
loses information about the sign, which can be important in some applications.
See Exercise 22 on page 111.

Transforming similarities to dissimilarities and vice versa is also relatively
straightforward, although we again face the issues of preserving meaning and
changing a linear scale into a nonlinear scale. If the similarity (or dissimilarity)
falls in the interval [0,1], then the dissimilarity (or similarity) can be defined as
$d = 1 - s$ ($s = 1 - d$). Another simple approach is to define similarity as the negative
of the dissimilarity (or vice versa). To illustrate, the dissimilarities 0, 1, 10, and
100 can be transformed into the similarities 0, −1, −10, and −100,
respectively.

The similarities resulting from the negation transformation are not restricted to
the range [0, 1], but if that is desired, then transformations such as
$s = \frac{1}{d+1}$, $s = e^{-d}$, or $s = 1 - \frac{d - \mathit{min\_d}}{\mathit{max\_d} - \mathit{min\_d}}$ can be used. For the
transformation $s = \frac{1}{d+1}$, the dissimilarities 0, 1, 10, 100 are transformed into 1,
0.5, 0.09, 0.01, respectively. For $s = e^{-d}$, they become 1.00, 0.37, 0.00, 0.00,
respectively, while for $s = 1 - \frac{d - \mathit{min\_d}}{\mathit{max\_d} - \mathit{min\_d}}$ they become 1.00, 0.99,
0.90, 0.00, respectively. In this discussion, we have focused on converting
dissimilarities to similarities. Conversion in the opposite direction is considered
in Exercise 23 on page 111.

In general, any monotonic decreasing function can be used to convert

dissimilarities to similarities, or vice versa. Of course, other factors also must

be considered when transforming similarities to dissimilarities, or vice versa,

or when transforming the values of a proximity measure to a new scale. We

have mentioned issues related to preserving meaning, distortion of scale, and

requirements of data analysis tools, but this list is certainly not exhaustive.
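The transformations discussed in this section can be sketched as follows, using the dissimilarities 0, 1, 10, and 100 from the text. The function and variable names are our own.

```python
import math


def rescale(values):
    """Linear map of proximities to [0, 1]: (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def squash(d):
    """Nonlinear map of a dissimilarity in [0, inf) to [0, 1): d / (1 + d)."""
    return d / (1 + d)


dissimilarities = [0, 1, 10, 100]

# Three monotonic decreasing maps from dissimilarities to similarities in [0, 1].
s_reciprocal = [1 / (d + 1) for d in dissimilarities]
s_exponential = [math.exp(-d) for d in dissimilarities]
s_linear = [1 - r for r in rescale(dissimilarities)]

for name, sims in [
    ("1/(d+1)", s_reciprocal),
    ("exp(-d)", s_exponential),
    ("1 - rescaled d", s_linear),
]:
    print(name, [round(s, 2) for s in sims])
```

Only the last map is linear, so only it preserves the relative spacing of the original values; the other two compress large dissimilarities toward a similarity of 0.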

2.4.2 Similarity and Dissimilarity between Simple Attributes

The proximity of objects with a number of attributes is typically defined by

combining the proximities of individual attributes, and thus, we first discuss

proximity between objects having a single attribute. Consider objects

described by one nominal attribute. What would it mean for two such objects

to be similar? Because nominal attributes convey only information about the

distinctness of objects, all we can say is that two objects either have the same

value or they do not. Hence, in this case similarity is traditionally defined as 1

if attribute values match, and as 0 otherwise. A dissimilarity would be defined

in the opposite way: 0 if the attribute values match, and 1 if they do not.

For objects with a single ordinal attribute, the situation is more complicated
because information about order should be taken into account. Consider an
attribute that measures the quality of a product, e.g., a candy bar, on the scale
{poor, fair, OK, good, wonderful}. It would seem reasonable that a product,
P1, which is rated wonderful, would be closer to a product P2, which is rated
good, than it would be to a product P3, which is rated OK. To make this
observation quantitative, the values of the ordinal attribute are often mapped
to successive integers, beginning at 0 or 1, e.g.,
{poor=0, fair=1, OK=2, good=3, wonderful=4}. Then, $d(P1, P2) = 4 - 3 = 1$ or, if
we want the dissimilarity to fall between 0 and 1, $d(P1, P2) = \frac{4 - 3}{4} = 0.25$. A
similarity for ordinal attributes can then be defined as $s = 1 - d$.

This definition of similarity (dissimilarity) for an ordinal attribute should make
the reader a bit uneasy since this assumes equal intervals between
successive values of the attribute, and this is not necessarily so. Otherwise,
we would have an interval or ratio attribute. Is the difference between the
values fair and good really the same as that between the values OK and
wonderful? Probably not, but in practice, our options are limited, and in the
absence of more information, this is the standard approach for defining
proximity between ordinal attributes.

For interval or ratio attributes, the natural measure of dissimilarity between
two objects is the absolute difference of their values. For example, we might
compare our current weight and our weight a year ago by saying “I am ten
pounds heavier.” In cases such as these, the dissimilarities typically range
from 0 to $\infty$, rather than from 0 to 1. The similarity of interval or ratio attributes
is typically expressed by transforming a dissimilarity into a similarity, as
previously described.

Table 2.7 summarizes this discussion. In this table, x and y are two objects
that have one attribute of the indicated type. Also, d(x, y) and s(x, y) are the
dissimilarity and similarity between x and y, respectively. Other approaches
are possible; these are the most common ones.
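The single-attribute dissimilarities discussed above, which Table 2.7 summarizes, can be sketched as follows. The function names are our own, and the ordinal version assumes the scale is supplied as a list ordered from lowest to highest value.

```python
def nominal_dissimilarity(x, y):
    """0 if the two values match, 1 otherwise."""
    return 0 if x == y else 1


def ordinal_dissimilarity(x, y, scale):
    """|rank(x) - rank(y)| / (n - 1), with values mapped to the integers 0 .. n-1."""
    return abs(scale.index(x) - scale.index(y)) / (len(scale) - 1)


def interval_dissimilarity(x, y):
    """Absolute difference of the two values."""
    return abs(x - y)


quality = ["poor", "fair", "OK", "good", "wonderful"]

d_ordinal = ordinal_dissimilarity("wonderful", "good", quality)
print(d_ordinal, 1 - d_ordinal)             # dissimilarity and similarity s = 1 - d
print(nominal_dissimilarity("red", "blue"))
print(interval_dissimilarity(180, 170))
```

For nominal and ordinal attributes the similarity is simply 1 − d; for interval or ratio attributes any of the dissimilarity-to-similarity transformations described earlier can be applied.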

Table 2.7. Similarity and dissimilarity for simple attributes

Attribute Type      Dissimilarity                  Similarity
Nominal             d = 0 if x = y,                s = 1 if x = y,
                    d = 1 if x ≠ y                 s = 0 if x ≠ y
Ordinal             d = |x − y| / (n − 1)          s = 1 − d
                    (values mapped to integers
                    0 to n − 1, where n is the
                    number of values)
Interval or Ratio   d = |x − y|                    s = −d, s = 1/(1 + d), s = e^(−d),
                                                   s = 1 − (d − min_d)/(max_d − min_d)

The following two sections consider more complicated measures of proximity
between objects that involve multiple attributes: (1) dissimilarities between
data objects and (2) similarities between data objects. This division allows us
to more naturally display the underlying motivations for employing various
proximity measures. We emphasize, however, that similarities can be
transformed into dissimilarities and vice versa using the approaches described
earlier.

2.4.3 Dissimilarities between Data Objects

In this section, we discuss various kinds of dissimilarities. We begin with a
discussion of distances, which are dissimilarities with certain properties, and
then provide examples of more general kinds of dissimilarities.

Distances

We first present some examples, and then offer a more formal description of
distances in terms of the properties common to all distances. The Euclidean
distance, d, between two points, x and y, in one-, two-, three-, or
higher-dimensional space, is given by the following familiar formula:

$$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}, \tag{2.1}$$

where n is the number of dimensions and $x_k$ and $y_k$ are, respectively, the
kth attributes (components) of x and y. We illustrate this formula with Figure
2.15 and Tables 2.8 and 2.9, which show a set of points, the x and y
coordinates of these points, and the distance matrix containing the pairwise
distances of these points.

Figure 2.15. Four two-dimensional points.

The Euclidean distance measure given in Equation 2.1 is generalized by
the Minkowski distance metric shown in Equation 2.2,

$$d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^{r} \right)^{1/r}, \tag{2.2}$$

where r is a parameter. The following are the three most common examples of

Minkowski distances.

r = 1. City block (Manhattan, taxicab, $L_1$ norm) distance. A common
example is the Hamming distance, which is the number of bits that are
different between two objects that have only binary attributes, i.e., between
two binary vectors.

r = 2. Euclidean distance ($L_2$ norm).

r = ∞. Supremum ($L_{max}$ or $L_\infty$ norm) distance. This is the maximum
difference between any attribute of the objects. More formally, the $L_\infty$
distance is defined by Equation 2.3:

$$d(x, y) = \lim_{r \to \infty} \left( \sum_{k=1}^{n} |x_k - y_k|^{r} \right)^{1/r}. \tag{2.3}$$

The r parameter should not be confused with the number of dimensions
(attributes) n. The Euclidean, Manhattan, and supremum distances are defined
for all values of n: 1, 2, 3, …, and specify different ways of combining the
differences in each dimension (attribute) into an overall distance.

Tables 2.10 and 2.11, respectively, give the proximity matrices for the $L_1$
and $L_\infty$ distances using data from Table 2.8. Notice that all these distance
matrices are symmetric; i.e., the ijth entry is the same as the jith entry. In
Table 2.9, for instance, the fourth row of the first column and the fourth
column of the first row both contain the value 5.1.

Table 2.8. x and y coordinates of four points.

point   x coordinate   y coordinate
p1      0              2
p2      2              0
p3      3              1
p4      5              1

Table 2.9. Euclidean distance matrix for Table 2.8.

      p1    p2    p3    p4
p1    0.0   2.8   3.2   5.1
p2    2.8   0.0   1.4   3.2
p3    3.2   1.4   0.0   2.0
p4    5.1   3.2   2.0   0.0

Table 2.10. $L_1$ distance matrix for Table 2.8.

L1    p1    p2    p3    p4
p1    0.0   4.0   4.0   6.0
p2    4.0   0.0   2.0   4.0
p3    4.0   2.0   0.0   2.0
p4    6.0   4.0   2.0   0.0

Table 2.11. $L_\infty$ distance matrix for Table 2.8.

L∞    p1    p2    p3    p4
p1    0.0   2.0   3.0   5.0
p2    2.0   0.0   1.0   3.0
p3    3.0   1.0   0.0   2.0
p4    5.0   3.0   2.0   0.0
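To make the Minkowski family concrete, the following minimal Python sketch recomputes the distance matrices of Tables 2.9 through 2.11 from the four points of Table 2.8. The function names `minkowski` and `distance_matrix` are our own, chosen for illustration, and passing `math.inf` for r to request the supremum distance is an assumed convention rather than anything fixed by the text.

```python
import math

def minkowski(x, y, r):
    """Minkowski distance between two points; r = math.inf gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        # Limit of Equation 2.3: the largest per-attribute difference.
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)

def distance_matrix(points, r):
    """Pairwise distance matrix for a list of points."""
    return [[minkowski(p, q, r) for q in points] for p in points]

points = [(0, 2), (2, 0), (3, 1), (5, 1)]  # p1..p4 from Table 2.8

l1 = distance_matrix(points, 1)            # city block
l2 = distance_matrix(points, 2)            # Euclidean
linf = distance_matrix(points, math.inf)   # supremum

for row in l2:
    print([round(v, 1) for v in row])
```

Rounded to one decimal place, the r = 2 matrix agrees with Table 2.9, and the r = 1 and r = ∞ matrices agree with Tables 2.10 and 2.11. Only the parameter r changes between the three calls, which is the sense in which the Minkowski distance generalizes the others.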

Distances, such as the Euclidean distance, have some well-known properties.
If d(x, y) is the distance between two points, x and y, then the following
properties hold.

1. Positivity
   a. d(x, y) ≥ 0 for all x and y,
   b. d(x, y) = 0 only if x = y.
2. Symmetry: d(x, y) = d(y, x) for all x and y.
3. Triangle Inequality: d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.

Measures that satisfy all three properties are known as metrics. Some people
use the term distance only for dissimilarity measures that satisfy these
properties, but that practice is often violated. The three properties described
here are useful, as well as mathematically pleasing. Also, if the triangle
inequality holds, then this property can be used to increase the efficiency of
techniques (including clustering) that depend on distances possessing this
property. (See Exercise 25.) Nonetheless, many dissimilarities do not
satisfy one or more of the metric properties. Example 2.14 illustrates such
a measure.

Example 2.14 (Non-metric Dissimilarities: Set Differences).

This example is based on the notion of the difference of two sets, as
defined in set theory. Given two sets A and B, A − B is the set of elements of
A that are not in B. For example, if A = {1, 2, 3, 4} and B = {2, 3, 4}, then
A − B = {1} and B − A = ∅, the empty set. We can define the distance d between
two sets A and B as d(A, B) = size(A − B), where size is a function returning
the number of elements in a set. This distance measure, which is an integer
value greater than or equal to 0, does not satisfy the second part of the
positivity property, the symmetry property, or the triangle inequality.
However, these properties can be made to hold if the dissimilarity measure is
modified as follows: d(A, B) = size(A − B) + size(B − A). See Exercise 21 on
page 110.

2.4.4 Similarities between Data Objects

For similarities, the triangle inequality (or the analogous property) typically
does not hold, but symmetry and positivity typically do. To be explicit, if s(x, y)
is the similarity between points x and y, then the typical properties of
similarities are the following:

1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

There is no general analog of the triangle inequality for similarity measures. It
is sometimes possible, however, to show that a similarity measure can easily
be converted to a metric distance. The cosine and Jaccard similarity
measures, which are discussed shortly, are two examples. Also, for specific
similarity measures, it is possible to derive mathematical bounds on the
similarity between two objects that are similar in spirit to the triangle inequality.

Example 2.15 (A Non-symmetric Similarity Measure).

Consider an experiment in which people are asked to classify a small set
of characters as they flash on a screen. The confusion matrix for this
experiment records how often each character is classified as itself, and
how often each is classified as another character. Using the confusion
matrix, we can define a similarity measure between a character x and a
character y as the number of times that x is misclassified as y, but note
that this measure is not symmetric. For example, suppose that “0”
appeared 200 times and was classified as a “0” 160 times, but as an “o” 40
times. Likewise, suppose that “o” appeared 200 times and was classified
as an “o” 170 times, but as “0” only 30 times. Then, s(0, o) = 40, but
s(o, 0) = 30. In such situations, the similarity measure can be made
symmetric by setting s′(x, y) = s′(y, x) = (s(x, y) + s(y, x))/2, where s′
indicates the new similarity measure.

2.4.5 Examples of Proximity Measures

This section provides specific examples of some similarity and dissimilarity
measures.

Similarity Measures for Binary Data

Similarity measures between objects that contain only binary attributes are
called similarity coefficients, and typically have values between 0 and 1. A
value of 1 indicates that the two objects are completely similar, while a value
of 0 indicates that the objects are not at all similar. There are many rationales
for why one coefficient is better than another in specific instances.

Let x and y be two objects that consist of n binary attributes. The comparison
of two such objects, i.e., two binary vectors, leads to the following four
quantities (frequencies):

$f_{00}$ = the number of attributes where x is 0 and y is 0
$f_{01}$ = the number of attributes where x is 0 and y is 1
$f_{10}$ = the number of attributes where x is 1 and y is 0
$f_{11}$ = the number of attributes where x is 1 and y is 1

Simple Matching Coefficient

One commonly used similarity coefficient is the simple matching coefficient
(SMC), which is defined as

$$\text{SMC} = \frac{\text{number of matching attribute values}}{\text{number of attributes}} = \frac{f_{11} + f_{00}}{f_{01} + f_{10} + f_{11} + f_{00}}. \tag{2.4}$$

This measure counts both presences and absences equally. Consequently,
the SMC could be used to find students who had answered questions similarly
on a test that consisted only of true/false questions.

Jaccard Coefficient

Suppose that x and y are data objects that represent two rows (two
transactions) of a transaction matrix (see Section 2.1.2). If each
asymmetric binary attribute corresponds to an item in a store, then a 1
indicates that the item was purchased, while a 0 indicates that the product
was not purchased. Because the number of products not purchased by any
customer far outnumbers the number of products that were purchased, a
similarity measure such as SMC would say that all transactions are very
similar. As a result, the Jaccard coefficient is frequently used to handle objects
consisting of asymmetric binary attributes. The Jaccard coefficient, which is
often symbolized by J, is given by the following equation:

$$J = \frac{\text{number of matching presences}}{\text{number of attributes not involved in 00 matches}} = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}. \tag{2.5}$$
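As a minimal sketch of Equations 2.4 and 2.5, the following Python code counts the four frequencies and derives both coefficients from them. The helper names (`binary_counts`, `smc`, `jaccard`) are ours, the two vectors are illustrative sparse binary vectors, and returning 0.0 for the Jaccard coefficient when both vectors are entirely zero is an assumed convention, since the ratio is undefined in that case.

```python
def binary_counts(x, y):
    """Count the four agreement patterns between two binary (0/1) vectors."""
    counts = {"f00": 0, "f01": 0, "f10": 0, "f11": 0}
    for a, b in zip(x, y):
        counts[f"f{a}{b}"] += 1
    return counts

def smc(x, y):
    """Simple matching coefficient: matches (1-1 and 0-0) over all attributes."""
    c = binary_counts(x, y)
    return (c["f11"] + c["f00"]) / len(x)

def jaccard(x, y):
    """Jaccard coefficient: 1-1 matches over attributes not involved in 0-0 matches."""
    c = binary_counts(x, y)
    denom = c["f01"] + c["f10"] + c["f11"]
    # Assumption: define J as 0.0 when both vectors are all zeros.
    return c["f11"] / denom if denom else 0.0

x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

print(binary_counts(x, y))
print(smc(x, y), jaccard(x, y))
```

For these two vectors the shared zeros dominate the SMC, which comes out at 0.7, while the Jaccard coefficient is 0 because the vectors have no 1 in common. This is the behavior that motivates using the Jaccard coefficient for asymmetric binary attributes.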

Example 2.16 (The SMC and Jaccard Similarity Coefficients).

To illustrate the difference between these two similarity measures, we
calculate SMC and J for the following two binary vectors.

x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

$f_{01}$ = 2, the number of attributes where x was 0 and y was 1
$f_{10}$ = 1, the number of attributes where x was 1 and y was 0
$f_{00}$ = 7, the number of attributes where x was 0 and y was 0
$f_{11}$ = 0, the number of attributes where x was 1 and y was 1

$$\text{SMC} = \frac{f_{11} + f_{00}}{f_{01} + f_{10} + f_{11} + f_{00}} = \frac{0 + 7}{2 + 1 + 0 + 7} = 0.7$$

$$J = \frac{f_{11}}{f_{01} + f_{10} + f_{11}} = \frac{0}{2 + 1 + 0} = 0$$

Cosine Similarity

Documents are often represented as vectors, where each component
(attribute) represents the frequency with which a particular term (word) occurs
in the document. Even though documents have thousands or tens of
thousands of attributes (terms), each document is sparse since it has
relatively few nonzero attributes. Thus, as with transaction data, similarity
should not depend on the number of shared 0 values because any two
documents are likely to “not contain” many of the same words, and therefore,
if 0–0 matches are counted, most documents will be highly similar to most
other documents. Therefore, a similarity measure for documents needs to
ignore 0–0 matches like the Jaccard measure, but also must be able to
handle non-binary vectors. The cosine similarity, defined next, is one of the
most common measures of document similarity. If x and y are two document
vectors, then

$$\cos(x, y) = \frac{\langle x, y \rangle}{\|x\| \, \|y\|} = \frac{x'y}{\|x\| \, \|y\|}, \tag{2.6}$$

where ′ indicates vector or matrix transpose and ⟨x, y⟩ indicates the inner
product of the two vectors,

$$\langle x, y \rangle = \sum_{k=1}^{n} x_k y_k = x'y, \tag{2.7}$$

and ‖x‖ is the length of vector x,
$\|x\| = \sqrt{\sum_{k=1}^{n} x_k^2} = \sqrt{\langle x, x \rangle} = \sqrt{x'x}$.

The inner product of two vectors works well for asymmetric attributes since it
depends only on components that are non-zero in both vectors. Hence, the
similarity between two documents depends only upon the words that appear
in both of them.

Example 2.17 (Cosine Similarity between Two Document Vectors).

This example calculates the cosine similarity for the following two data
objects, which might represent document vectors:

x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

⟨x, y⟩ = 3×1 + 2×0 + 0×0 + 5×0 + 0×0 + 0×0 + 0×0 + 2×1 + 0×0 + 0×2 = 5
‖x‖ = √(3×3 + 2×2 + 0×0 + 5×5 + 0×0 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0) = √42 = 6.48
‖y‖ = √(1×1 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 1×1 + 0×0 + 2×2) = √6 = 2.45
cos(x, y) = 5/(6.48 × 2.45) = 0.31

As indicated by Figure 2.16, cosine similarity really is a measure of the
(cosine of the) angle between x and y. Thus, if the cosine similarity is 1, the
angle between x and y is 0°, and x and y are the same except for length. If
the cosine similarity is 0, then the angle between x and y is 90°, and they do
not share any terms (words).

Figure 2.16. Geometric illustration of the cosine measure.

Equation 2.6 also can be written as Equation 2.8.

$$\cos(x, y) = \left\langle \frac{x}{\|x\|}, \frac{y}{\|y\|} \right\rangle = \langle x', y' \rangle, \tag{2.8}$$

where x′ = x/‖x‖ and y′ = y/‖y‖. Dividing x and y by their lengths normalizes them
to have a length of 1. This means that cosine similarity does not take the
length of the two data objects into account when computing similarity.
(Euclidean distance might be a better choice when length is important.) For
vectors with a length of 1, the cosine measure can be calculated by taking a
simple inner product. Consequently, when many cosine similarities between
objects are being computed, normalizing the objects to have unit length can
reduce the time required.

Extended Jaccard Coefficient (Tanimoto Coefficient)

The extended Jaccard coefficient can be used for document data and
reduces to the Jaccard coefficient in the case of binary attributes. This
coefficient, which we shall represent as EJ, is defined by the following
equation:

$$EJ(x, y) = \frac{\langle x, y \rangle}{\|x\|^2 + \|y\|^2 - \langle x, y \rangle} = \frac{x'y}{\|x\|^2 + \|y\|^2 - x'y}. \tag{2.9}$$
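Equations 2.6 and 2.9 translate almost directly into code. The sketch below is one possible rendering, not a reference implementation: the function names are ours, both functions assume non-zero input vectors (no guard against division by zero is included), and the first pair of vectors is the document pair of Example 2.17.

```python
import math

def dot(x, y):
    """Inner product <x, y> of two equal-length vectors (Equation 2.7)."""
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Cosine similarity (Equation 2.6); assumes neither vector is all zeros."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def extended_jaccard(x, y):
    """Extended Jaccard (Tanimoto) coefficient (Equation 2.9)."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)

x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
print(round(cosine(x, y), 2))             # 0.31
print(round(extended_jaccard(x, y), 3))   # 5 / (42 + 6 - 5) = 0.116

# For binary vectors the extended Jaccard coefficient reduces to the Jaccard
# coefficient: here f11 = 2, f10 = 1, f01 = 1, so both equal 2/4.
bx, by = (1, 1, 0, 1), (1, 0, 1, 1)
print(extended_jaccard(bx, by))           # 0.5
```

The last call illustrates the reduction mentioned above: for 0/1 vectors, ⟨x, y⟩ counts the 1-1 matches and ‖x‖² counts the 1s in x, so the denominator becomes the number of attributes not involved in 0-0 matches.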

Correlation

Correlation is frequently used to measure the linear relationship between two
sets of values that are observed together. Thus, correlation can measure the
relationship between two variables (height and weight) or between two objects
(a pair of temperature time series). Correlation is used much more frequently
to measure the similarity between attributes than between objects, since the
values in a data object come from different attributes, which can have very
different attribute types and scales. There are many types of correlation, and
indeed correlation is sometimes used in a general sense to mean the
relationship between two sets of values that are observed together. In this
discussion, we will focus on a measure appropriate for numerical values.

Specifically, Pearson’s correlation between two sets of numerical values,
i.e., two vectors, x and y, is defined by the following equation:

$$\text{corr}(x, y) = \frac{\text{covariance}(x, y)}{\text{standard\_deviation}(x) \times \text{standard\_deviation}(y)} = \frac{s_{xy}}{s_x s_y}, \tag{2.10}$$

where we use the following standard statistical notation and definitions:

$$\text{covariance}(x, y) = s_{xy} = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y}) \tag{2.11}$$

$$\text{standard\_deviation}(x) = s_x = \sqrt{\frac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})^2}$$

$$\text{standard\_deviation}(y) = s_y = \sqrt{\frac{1}{n-1} \sum_{k=1}^{n} (y_k - \bar{y})^2}$$

$$\bar{x} = \frac{1}{n} \sum_{k=1}^{n} x_k \text{ is the mean of } x$$

$$\bar{y} = \frac{1}{n} \sum_{k=1}^{n} y_k \text{ is the mean of } y$$
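The definitions above can be followed step by step in code. This is a small sketch under stated assumptions: the function name `pearson` is ours, the input vectors are hypothetical values chosen so that the results are easy to check by hand, and no guard is included for a vector with zero standard deviation.

```python
import math

def pearson(x, y):
    """Pearson's correlation computed from the definitions in Equations 2.10 and 2.11."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample covariance and standard deviations, each with the 1/(n-1) factor.
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)

# Hypothetical data: an exact increasing linear relationship,
# an exact decreasing one, and a weaker relationship.
print(pearson((1, 2, 3, 4, 5), (2, 4, 6, 8, 10)))
print(pearson((1, 2, 3, 4, 5), (10, 8, 6, 4, 2)))
print(pearson((1, 2, 3), (1, 3, 2)))
```

Up to floating-point rounding, the three calls give 1, −1, and 0.5. In the last case both standard deviations equal 1 and the covariance is 0.5, which makes the hand calculation straightforward.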

Example 2.18 (Perfect Correlation).

Correlation is always in the range −1 to 1. A correlation of 1 (−1) means
that x and y have a perfect positive (negative) linear relationship; that is,
$x_k = a y_k + b$, where a and b are constants. The following two vectors x and y
illustrate cases where the correlation is −1 and +1, respectively. In the first
case, the means of x and y were chosen to be 0, for simplicity.

x = (−3, 6, 0, 3, −6)
y = (1, −2, 0, −1, 2)
corr(x, y) = −1, $x_k = -3 y_k$

x = (3, 6, 0, 3, 6)
y = (1, 2, 0, 1, 2)
corr(x, y) = 1, $x_k = 3 y_k$

Example 2.19 (Nonlinear Relationships).

If the correlation is 0, then there is no linear relationship between the two
sets of values. However, nonlinear relationships can still exist. In the
following example, $y_k = x_k^2$, but their correlation is 0.

x = (−3, −2, −1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)

Example 2.20 (Visualizing Correlation).

It is also easy to judge the correlation between two vectors x and y by
plotting pairs of corresponding values of x and y in a scatter plot. Figure
2.17 shows a number of these scatter plots when x and y consist of a
set of 30 pairs of values that are randomly generated (with a normal
distribution) so that the correlation of x and y ranges from −1 to 1. Each
circle in a plot represents one of the 30 pairs of x and y values; its x
coordinate is the value of that pair for x, while its y coordinate is the value
of the same pair for y.

Figure 2.17. Scatter plots illustrating correlations from −1 to 1.

If we transform x and y by subtracting off their means and then normalizing
them so that their lengths are 1, then their correlation can be calculated by
taking the dot product. Let us refer to these transformed vectors of x and y as
x′ and y′, respectively. (Notice that this transformation is not the same as the
standardization used in other contexts, where we subtract the means and
divide by the standard deviations, as discussed in Section 2.3.7.) This
transformation highlights an interesting relationship between the correlation
measure and the cosine measure. Specifically, the correlation between x and
y is identical to the cosine between x′ and y′. However, the cosine between x
and y is not the same as the cosine between x′ and y′, even though they both
have the same correlation measure. In general, the correlation between two
vectors is equal to the cosine measure only in the special case when the
means of the two vectors are 0.
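The relationship between correlation and cosine can be checked numerically. In the sketch below, the helper `center_and_normalize` (our name) performs the transformation just described, and the two vectors are hypothetical values with non-zero means, chosen so that the difference between the two measures is visible.

```python
import math

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def center_and_normalize(v):
    """Subtract the mean, then scale the result to unit length."""
    mean = sum(v) / len(v)
    centered = [t - mean for t in v]
    length = math.sqrt(dot(centered, centered))
    return [t / length for t in centered]

x = (1, 2, 3, 4)   # hypothetical vectors, both with mean 2.5
y = (2, 1, 4, 3)

x_prime = center_and_normalize(x)
y_prime = center_and_normalize(y)

print(dot(x_prime, y_prime))   # correlation of x and y, approximately 0.6
print(cosine(x, y))            # cosine of the original vectors, 28/30
```

The dot product of the transformed vectors is the correlation (about 0.6), while the cosine of the untransformed vectors is about 0.93. The two measures differ here precisely because the means of x and y are not 0.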

Differences Among Measures For Continuous

Attributes

In this section, we illustrate the difference among the three proximity

measures for continuous attributes that we have just defined: cosine,

correlation, and Minkowski distance. Specifically, we consider two types of

data transformations that are commonly used, namely, scaling (multiplication)

by a constant factor and translation (addition) by a constant value. A proximity

measure is considered to be invariant to a data transformation if its value

remains unchanged even after performing the transformation. Table 2.12

compares the behavior of cosine, correlation, and Minkowski distance

measures regarding their invariance to scaling and translation operations. It

can be seen that while correlation is invariant to both scaling and translation,

cosine is only invariant to scaling but not to translation. Minkowski distance

measures, on the other hand, are sensitive to both scaling and translation and

are thus invariant to neither.

Table 2.12. Properties of cosine, correlation, and Minkowski distance measures.

Property                                Cosine   Correlation   Minkowski Distance
Invariant to scaling (multiplication)   Yes      Yes           No
Invariant to translation (addition)     No       Yes           No

Let us consider an example to demonstrate the significance of these

differences among different proximity measures.

Example 2.21 (Comparing proximity measures).

Consider the following two vectors x and y with seven numeric attributes.

x = (1, 2, 4, 3, 0, 0, 0)
y = (1, 2, 3, 4, 0, 0, 0)

It can be seen that both x and y have 4 non-zero values, and the values in
the two vectors are mostly the same, except for the third and the fourth
components. The cosine, correlation, and Euclidean distance between the
two vectors can be computed as follows.

cos(x, y) = 29/(√30 × √30) = 0.9667
correlation(x, y) = 2.3571/(1.5811 × 1.5811) = 0.9429
Euclidean distance(x, y) = ‖x − y‖ = 1.4142

Not surprisingly, x and y have a cosine and correlation measure close to 1,
while the Euclidean distance between them is small, indicating that they
are quite similar. Now let us consider the vector $y_s$, which is a scaled
version of y (multiplied by a constant factor of 2), and the vector $y_t$, which
is constructed by translating y by 5 units as follows.

$y_s$ = 2 × y = (2, 4, 6, 8, 0, 0, 0)
$y_t$ = y + 5 = (6, 7, 8, 9, 5, 5, 5)

We are interested in finding whether $y_s$ and $y_t$ show the same proximity
with x as shown by the original vector y. Table 2.13 shows the different
measures of proximity computed for the pairs (x, y), (x, $y_s$), and (x, $y_t$). It
can be seen that the value of correlation between x and y remains
unchanged even after replacing y with $y_s$ or $y_t$. However, the value of
cosine remains equal to 0.9667 when computed for (x, y) and (x, $y_s$), but
significantly reduces to 0.7940 when computed for (x, $y_t$). This highlights
the fact that cosine is invariant to the scaling operation but not to the
translation operation, in contrast with the correlation measure. The
Euclidean distance, on the other hand, shows different values for all three
pairs of vectors, as it is sensitive to both scaling and translation.
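The comparison in this example can be reproduced with a few lines of Python. The sketch below is ours, and it makes one assumption that should be stated plainly. Computed from the seven components exactly as printed, the correlation comes out near 0.9364, and the cosine and Euclidean distance for the translated vector come out near 0.8259 and 13.3041. The values quoted in the example (0.9429, 0.7940, and 14.2127) are obtained when each vector carries one additional zero attribute, eight in total. The function therefore takes the number of attributes as a parameter so that both cases can be inspected; the invariance pattern is the same in either case.

```python
import math

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def correlation(a, b):
    # Pearson correlation as the cosine of the mean-centered vectors;
    # the 1/(n-1) factors of Equations 2.10 and 2.11 cancel.
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine([s - mean_a for s in a], [t - mean_b for t in b])

def euclidean(a, b):
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(a, b)))

def proximity_table(n_attrs):
    """Proximity of x to y, to y scaled by 2, and to y translated by 5."""
    zeros = (0,) * (n_attrs - 4)          # trailing zero attributes
    x = (1, 2, 4, 3) + zeros
    y = (1, 2, 3, 4) + zeros
    y_s = tuple(2 * v for v in y)         # scaled copy of y
    y_t = tuple(v + 5 for v in y)         # translated copy of y
    measures = (("cosine", cosine), ("correlation", correlation),
                ("euclidean", euclidean))
    return {name: [round(f(x, other), 4) for other in (y, y_s, y_t)]
            for name, f in measures}

for n_attrs in (7, 8):   # 7 = as printed; 8 = assumption that matches the quoted values
    print(n_attrs, proximity_table(n_attrs))
```

In both runs the correlation row is constant across the three pairs, the cosine row changes only for the translated vector, and the Euclidean row changes for both transformations.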

Table 2.13. Similarity between (x, y), (x, $y_s$), and (x, $y_t$).

Measure              (x, y)    (x, ys)   (x, yt)
Cosine               0.9667    0.9667    0.7940
Correlation          0.9429    0.9429    0.9429
Euclidean Distance   1.4142    5.8310    14.2127

We can observe from this example that different proximity measures
behave differently when scaling or translation operations are applied on
the data. The choice of the right proximity measure thus depends on the
desired notion of similarity between data objects that is meaningful for a
given application. For example, if x and y represented the frequencies of
different words in a document-term matrix, it would be meaningful to use a
proximity measure that remains unchanged when y is replaced by $y_s$,
because $y_s$ is just a scaled version of y with the same distribution of words
occurring in the document. However, $y_t$ is different from y, since it contains
a large number of words with non-zero frequencies that do not occur in y.
Because cosine is invariant to scaling but not to translation, it will be an
ideal choice of proximity measure for this application.

Consider a different scenario in which x represents a location’s
temperature measured on the Celsius scale for seven days. Let y, $y_s$, and
$y_t$ be the temperatures measured on those days at a different location, but
using three different measurement scales. Note that different units of
temperature have different offsets (e.g., Celsius and Kelvin) and different
scaling factors (e.g., Celsius and Fahrenheit). It is thus desirable to use a
proximity measure that captures the proximity between temperature values
without being affected by the measurement scale. Correlation would then
be the ideal choice of proximity measure for this application, as it is
invariant to both scaling and translation.

As another example, consider a scenario where x represents the amount
of precipitation (in cm) measured at seven locations. Let y, $y_s$, and $y_t$ be
estimates of the precipitation at these locations, which are predicted using
three different models. Ideally, we would like to choose a model that
accurately reconstructs the measurements in x without making any error. It
is evident that y provides a good approximation of the values in x, whereas
$y_s$ and $y_t$ provide poor estimates of precipitation, even though they do
capture the trend in precipitation across locations. Hence, we need to
choose a proximity measure that penalizes any difference in the model
estimates from the actual observations, and is sensitive to both the scaling
and translation operations. The Euclidean distance satisfies this property
and thus would be the right choice of proximity measure for this
application. Indeed, the Euclidean distance is commonly used in
computing the accuracy of models, which will be discussed later in
Chapter 3.

2.4.6 Mutual Information

Like correlation, mutual information is a measure of similarity between two
sets of paired values. It is sometimes used as an alternative to
correlation, particularly when a nonlinear relationship is suspected between
the pairs of values. This measure comes from information theory, which is the
study of how to formally define and quantify information. Indeed, mutual
information is a measure of how much information one set of values provides
about another, given that the values come in pairs, e.g., height and weight. If
the two sets of values are independent, i.e., the value of one tells us nothing
about the other, then their mutual information is 0. On the other hand, if the
two sets of values are completely dependent, i.e., knowing the value of one
tells us the value of the other and vice-versa, then they have maximum mutual
information. Mutual information does not have a fixed maximum value, but we
will define a normalized version of it that ranges between 0 and 1.

To define mutual information, we consider two sets of values, X and Y, which
occur in pairs (X, Y). We need to measure the average information in a single
set of values, i.e., either in X or in Y, and in the pairs of their values. This is
commonly measured by entropy. More specifically, assume X and Y are
discrete, that is, X can take m distinct values, $u_1, u_2, \ldots, u_m$, and Y can
take n distinct values, $v_1, v_2, \ldots, v_n$. Then their individual and joint
entropy can be defined in terms of the probabilities of each value and pair of
values as follows:

$$H(X) = -\sum_{j=1}^{m} P(X = u_j) \log_2 P(X = u_j) \tag{2.12}$$

$$H(Y) = -\sum_{k=1}^{n} P(Y = v_k) \log_2 P(Y = v_k) \tag{2.13}$$

$$H(X, Y) = -\sum_{j=1}^{m} \sum_{k=1}^{n} P(X = u_j, Y = v_k) \log_2 P(X = u_j, Y = v_k) \tag{2.14}$$

where if the probability of a value or combination of values is 0, then
$0 \log_2(0)$ is conventionally taken to be 0.

The mutual information of X and Y can now be defined straightforwardly:

$$I(X, Y) = H(X) + H(Y) - H(X, Y) \tag{2.15}$$

Note that H(X, Y) is symmetric, i.e., H(X, Y) = H(Y, X), and thus mutual
information is also symmetric, i.e., I(X, Y) = I(Y, X).

Practically, X and Y are either the values in two attributes or two rows of the
same data set. In Example 2.22, we will represent those values as two
vectors x and y and calculate the probability of each value or pair of values
from the frequency with which values or pairs of values occur in x, y, and
$(x_i, y_i)$, where $x_i$ is the ith component of x and $y_i$ is the ith component
of y. Let us illustrate using a previous example.

Example 2.22 (Evaluating Nonlinear Relationships with Mutual Information).

Recall Example 2.19 where $y_k = x_k^2$, but their correlation was 0.

x = (−3, −2, −1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)

From Figure 2.18, I(x, y) = H(x) + H(y) − H(x, y) = 1.9502. Although a variety
of approaches to normalize mutual information are possible (see
Bibliographic Notes), for this example, we will apply one that divides the
mutual information by $\log_2(\min(m, n))$ and produces a result between 0
and 1. This yields a value of $1.9502/\log_2(4) = 0.9751$. Thus, we can see
that x and y are strongly related. They are not perfectly related because
given a value of y there is, except for y = 0, some ambiguity about the value
of x. Notice that for y = −x, the normalized mutual information would be 1.

Figure 2.18. Computation of mutual information.

Table 2.14. Entropy for x

xj     P(x=xj)   −P(x=xj) log2 P(x=xj)
−3     1/7       0.4011
−2     1/7       0.4011
−1     1/7       0.4011
0      1/7       0.4011
1      1/7       0.4011
2      1/7       0.4011
3      1/7       0.4011
H(x)             2.8074

Table 2.15. Entropy for y

yk     P(y=yk)   −P(y=yk) log2 P(y=yk)
9      2/7       0.5164
4      2/7       0.5164
1      2/7       0.5164
0      1/7       0.4011
H(y)             1.9502

Table 2.16. Joint entropy for x and y

xj     yk    P(x=xj, y=yk)   −P(x=xj, y=yk) log2 P(x=xj, y=yk)
−3     9     1/7             0.4011
−2     4     1/7             0.4011
−1     1     1/7             0.4011
0      0     1/7             0.4011
1      1     1/7             0.4011
2      4     1/7             0.4011
3      9     1/7             0.4011
H(x, y)                      2.8074
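The entropy tables of Example 2.22 can be reproduced with a short Python sketch. Probabilities are estimated from value frequencies, as described above; the function names are ours, and the normalization by $\log_2(\min(m, n))$ is the one chosen in the example, not the only possible choice. The sketch assumes each vector takes at least two distinct values, since otherwise the normalizing term is 0.

```python
import math
from collections import Counter

def entropy(values):
    """Entropy (base 2) of a sequence, with probabilities taken from frequencies."""
    n = len(values)
    # Only observed values are summed, which matches the 0 * log2(0) = 0 convention.
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    """I(x, y) = H(x) + H(y) - H(x, y), following Equation 2.15."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def normalized_mutual_information(x, y):
    """Mutual information divided by log2(min(m, n)), m and n being the distinct-value counts."""
    m, n = len(set(x)), len(set(y))
    return mutual_information(x, y) / math.log2(min(m, n))

x = (-3, -2, -1, 0, 1, 2, 3)
y = tuple(v * v for v in x)

print(round(entropy(x), 4), round(entropy(y), 4))      # 2.8074 1.9502
print(round(mutual_information(x, y), 4))              # 1.9502
print(round(normalized_mutual_information(x, y), 4))   # 0.9751
```

Replacing y with the negation of x gives a normalized value of 1, as noted at the end of the example, because every value of one vector then determines the value of the other.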

2.4.7 Kernel Functions*

It is easy to understand how similarity and distance might be useful in an

application such as clustering, which tries to group similar objects together.

What is much less obvious is that many other data analysis tasks, including

predictive modeling and dimensionality reduction, can be expressed in terms

of pairwise “proximities” of data objects. More specifically, many data analysis

problems can be mathematically formulated to take as input, a kernel matrix,

K, which can be considered a type of proximity matrix. Thus, an initial

preprocessing step is used to convert the input data into a kernel matrix,

which is the input to the data analysis algorithm.

More formally, if a data set has m data objects, then K is an m by m matrix. If
$x_i$ and $x_j$ are the ith and jth data objects, respectively, then $k_{ij}$, the
ijth entry of K, is computed by a kernel function:

$$k_{ij} = \kappa(x_i, x_j) \tag{2.16}$$
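Equation 2.16 amounts to evaluating the kernel function on every pair of objects. The following sketch shows that construction; `kernel_matrix` and `linear_kernel` are names of our own choosing, and the plain inner product is used only as a simple stand-in kernel on hypothetical two-dimensional points.

```python
def kernel_matrix(objects, kappa):
    """Return the m-by-m matrix K with K[i][j] = kappa(objects[i], objects[j])."""
    return [[kappa(a, b) for b in objects] for a in objects]

def linear_kernel(x, y):
    """Inner product of two numeric vectors, used here as a stand-in kernel."""
    return sum(a * b for a, b in zip(x, y))

data = [(0, 2), (2, 0), (3, 1), (5, 1)]   # hypothetical data objects
K = kernel_matrix(data, linear_kernel)

for row in K:
    print(row)
```

Note that `kernel_matrix` never inspects the objects themselves; it only passes them to `kappa`. Any later analysis step that takes K as input is therefore independent of how the original data were represented.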

As we will see in the material that follows, the use of a kernel matrix allows

both wider applicability of an algorithm to various kinds of data and an ability

to model nonlinear relationships with algorithms that are designed only for

detecting linear relationships.

Kernels make an algorithm data independent

If an algorithm uses a kernel matrix, then it can be used with any type of data

for which a kernel function can be designed. This is illustrated by Algorithm

2.1. Although only some data analysis algorithms can be modified to use a

kernel matrix as input, this approach is extremely powerful because it allows

such an algorithm to be used with almost any type of data for which an

appropriate kernel function can be defined. Thus, a classification algorithm

can be used, for example, with record data, string data, or graph data. If an

algorithm can be reformulated to use a kernel matrix, then its applicability to

different types of data increases dramatically. As we will see in later chapters,

many clustering, classification, and anomaly detection algorithms work only

with similarities or distances, and thus, can be easily modified to work with

kernels.

Algorithm 2.1 Basic kernel algorithm.

1. Read in the m data objects in the data set.
2. Compute the kernel matrix, K, by applying the kernel function, κ,
   to each pair of data objects.
3. Run the data analysis algorithm with K as input.
4. Return the analysis result, e.g., predicted class or cluster labels.

Mapping data into a higher dimensional data space can allow modeling of nonlinear relationships

There is yet another, equally important, aspect of kernel based data
algorithms: their ability to model nonlinear relationships with algorithms that
model only linear relationships. Typically, this works by first transforming
(mapping) the data from a lower dimensional data space to a higher
dimensional space.

Example 2.23 (Mapping Data to a Higher Dimensional Space).

Consider the relationship between two variables x and y given by the
following equation, which defines an ellipse in two dimensions (Figure
2.19(a)):

$$4x^2 + 9xy + 7y^2 = 10 \tag{2.17}$$

Figure 2.19. Mapping data to a higher dimensional space: two to three dimensions.

We can map our two dimensional data to three dimensions by creating
three new variables, u, v, and w, which are defined as follows:

$$w = x^2, \quad u = xy, \quad v = y^2$$

As a result, we can now express Equation 2.17 as a linear one:

$$4w + 9u + 7v = 10 \tag{2.18}$$

This equation describes a plane in three dimensions. Points on the ellipse will
lie on that plane, while points inside and outside the ellipse will lie on
opposite sides of the plane. See Figure 2.19(b). The viewpoint of this
3D plot is along the surface of the separating plane so that the plane
appears as a line.

The Kernel Trick

The approach illustrated above shows the value in mapping data to higher
dimensional space, an operation that is integral to kernel-based methods.
Conceptually, we first define a function φ that maps data points x and y to
data points φ(x) and φ(y) in a higher dimensional space such that the inner
product ⟨φ(x), φ(y)⟩ gives the desired measure of proximity of x and y. It may
seem that we have potentially sacrificed a great deal by using such an approach,
because we can greatly expand the size of our data, increase the
computational complexity of our analysis, and encounter problems with the
curse of dimensionality by computing similarity in a high-dimensional space.
However, this is not the case since these problems can be avoided by defining
a kernel function κ that can compute the same similarity value, but with the
data points in the original space, i.e., κ(x, y) = ⟨φ(x), φ(y)⟩. This is known as
the kernel trick. Despite the name, the kernel trick has a very solid
mathematical foundation and is a remarkably powerful approach for data
analysis.
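Both ideas, the explicit mapping of Example 2.23 and the kernel trick, can be checked numerically. The sketch below rests on two assumptions of ours. First, substituting w = x², u = xy, and v = y² into Equation 2.17 gives the plane 4w + 9u + 7v = 10, which is the form `plane_side` uses. Second, the feature map `phi` is a scaled variant of that mapping, chosen for illustration: the factor √2 on the cross term is what makes the inner product in three dimensions equal to the squared inner product in two dimensions. All function names and sample points are hypothetical.

```python
import math

def to_3d(x, y):
    """Map a 2-D point to (u, v, w) = (x*y, y**2, x**2)."""
    return x * y, y * y, x * x

def plane_side(x, y):
    """Value of 4w + 9u + 7v - 10: zero on the ellipse, negative inside, positive outside."""
    u, v, w = to_3d(x, y)
    return 4 * w + 9 * u + 7 * v - 10

def phi(p):
    # Assumed feature map (not taken from the text): (x**2, sqrt(2)*x*y, y**2).
    x, y = p
    return (x * x, math.sqrt(2) * x * y, y * y)

def inner(a, b):
    return sum(s * t for s, t in zip(a, b))

def kappa(p, q):
    """Kernel evaluated in the original 2-D space: the squared inner product."""
    return inner(p, q) ** 2

# A point on the ellipse (y = 0 gives 4x^2 = 10), one inside, one outside.
print(plane_side(math.sqrt(2.5), 0.0), plane_side(0.0, 0.0), plane_side(2.0, 2.0))

# The kernel computed in 2-D agrees with the inner product computed after mapping to 3-D.
p, q = (1.0, 2.0), (3.0, 4.0)
print(kappa(p, q), inner(phi(p), phi(q)))
```

The first line of output is approximately 0 for the point on the ellipse, negative for the origin, and positive for (2, 2), so a linear test in three dimensions separates the inside from the outside of the ellipse. The second line prints two values that agree up to floating-point rounding, even though `kappa` never constructs the three-dimensional vectors.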

Not every function of a pair of data objects satisfies the properties needed for
a kernel function, but it has been possible to design many useful kernels for a
wide variety of data types. For example, three common kernel functions are
the polynomial, Gaussian (radial basis function (RBF)), and sigmoid kernels. If
x and y are two data objects, specifically, two data vectors, then these three
kernel functions can be expressed as follows, respectively:

$$\kappa(x, y) = (x'y + c)^d \tag{2.19}$$

$$\kappa(x, y) = \exp\left(-\|x - y\|^2 / 2\sigma^2\right) \tag{2.20}$$

$$\kappa(x, y) = \tanh(\alpha x'y + c) \tag{2.21}$$

where α and c ≥ 0 are constants, d is an integer parameter that gives the
polynomial degree, ‖x − y‖ is the length of the vector x − y, and σ > 0 is a
parameter that governs the “spread” of a Gaussian.

Example 2.24 (The Polynomial Kernel).

Note that the kernel functions presented in the previous section are
computing the same similarity value as would be computed if we actually
mapped the data to a higher dimensional space and then computed an
inner product there. For example, for the polynomial kernel of degree 2, let
φ be the function that maps a two-dimensional data vector $x = (x_1, x_2)$ to the
higher dimensional space. Specifically, let

$$\varphi(x) = \left(x_1^2, \; x_2^2, \; \sqrt{2}\,x_1 x_2, \; \sqrt{2c}\,x_1, \; \sqrt{2c}\,x_2, \; c\right). \tag{2.22}$$

For the higher dimensional space, let the proximity be defined as the inner
product of φ(x) and φ(y), i.e., ⟨φ(x), φ(y)⟩. Then, as previously
mentioned, it can be shown that

$$\kappa(x, y) = \langle \varphi(x), \varphi(y) \rangle \tag{2.23}$$

where κ is defined by Equation 2.19 above. Specifically, if $x = (x_1, x_2)$
and $y = (y_1, y_2)$, then

$$\kappa(x, y) = \langle \varphi(x), \varphi(y) \rangle = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 + 2c\,x_1 y_1 + 2c\,x_2 y_2 + c^2. \tag{2.24}$$

More generally, the kernel trick depends on defining κ and φ so that
Equation 2.23 holds. This has been done for a wide variety of kernels.

This discussion of kernel-based approaches was intended only to provide a
brief introduction to this topic and has omitted many details. A fuller discussion
of the kernel-based approach is provided in Section 4.9.4, which discusses
these issues in the context of nonlinear support vector machines for
classification. More general references for the kernel based analysis can be
found in the Bibliographic Notes of this chapter.

2.4.8 Bregman Divergence*

This section provides a brief description of Bregman divergences, which are a
family of proximity functions that share some common properties. As a result,
it is possible to construct general data mining algorithms, such as clustering
algorithms, that work with any Bregman divergence. A concrete example is
the K-means clustering algorithm (Section 7.2). Note that this section requires
knowledge of vector calculus.

Bregman divergences are loss or distortion functions. To understand the idea

of a loss function, consider the following. Let x and y be two points, where y is

regarded as the original point and x is some distortion or approximation of it.

For example, x may be a point that was generated by adding random noise to

y. The goal is to measure the resulting distortion or loss that results if y is

approximated by x. Of course, the more similar x and y are, the smaller the

loss or distortion. Thus, Bregman divergences can be used as dissimilarity

functions.

More formally, we have the following definition.

Definition 2.6 (Bregman divergence)

Given a strictly convex function $\phi$ (with a few modest restrictions that are generally satisfied), the Bregman divergence (loss function) $D(x, y)$ generated by that function is given by the following equation:

$$D(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y), (x - y) \rangle \tag{2.25}$$

where $\nabla\phi(y)$ is the gradient of $\phi$ evaluated at $y$, $x - y$ is the vector difference between x and y, and $\langle \nabla\phi(y), (x - y) \rangle$ is the inner product between $\nabla\phi(y)$ and $(x - y)$. For points in Euclidean space, the inner product is just the dot product.

$D(x, y)$ can be written as $D(x, y) = \phi(x) - L(x)$, where $L(x) = \phi(y) + \langle \nabla\phi(y), (x - y) \rangle$ represents the equation of a plane that is tangent to the function $\phi$ at y. Using calculus terminology, L(x) is the linearization of $\phi$ around the point y, and the Bregman divergence is just the difference between a function and a linear approximation to that function. Different Bregman divergences are obtained by using different choices for $\phi$.

Example 2.25.

We provide a concrete example using squared Euclidean distance, but restrict ourselves to one dimension to simplify the mathematics. Let x and y be real numbers and $\phi(t)$ be the real-valued function $\phi(t) = t^2$. In that case, the gradient reduces to the derivative, and the dot product reduces to multiplication. Specifically, Equation 2.25 becomes Equation 2.26.

$$D(x, y) = x^2 - y^2 - 2y(x - y) = (x - y)^2 \tag{2.26}$$

The graph for this example, with $y = 1$, is shown in Figure 2.20. The Bregman divergence is shown for two values of x: $x = 2$ and $x = 3$.

Figure 2.20.

Illustration of Bregman divergence.
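The definition in Equation 2.25 translates almost directly into code. The sketch below is illustrative only: it assumes vectors are plain Python lists and that the caller supplies both the convex function and its gradient (the names `bregman_divergence`, `sq_norm`, and `grad_sq_norm` are hypothetical). It reproduces the one-dimensional case of Example 2.25 with y = 1 and x = 2 or x = 3.

```python
def bregman_divergence(phi, grad_phi, x, y):
    """D(x, y) = phi(x) - phi(y) - <grad phi(y), x - y> for list-valued x, y."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner

def sq_norm(v):
    """phi(v) = sum of squared components; in one dimension this is t^2."""
    return sum(t * t for t in v)

def grad_sq_norm(v):
    """Gradient of sq_norm: 2v."""
    return [2.0 * t for t in v]

# One-dimensional case of Example 2.25, with y = 1.
d_2 = bregman_divergence(sq_norm, grad_sq_norm, [2.0], [1.0])
d_3 = bregman_divergence(sq_norm, grad_sq_norm, [3.0], [1.0])
print(d_2, d_3)  # squared distances (2 - 1)^2 and (3 - 1)^2

# The same choice of phi in two dimensions gives squared Euclidean distance.
d_2d = bregman_divergence(sq_norm, grad_sq_norm, [1.0, 2.0], [3.0, 5.0])
```

Passing a different strictly convex function and its gradient yields a different Bregman divergence without changing `bregman_divergence` itself.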

2.4.9 Issues in Proximity Calculation

This section discusses several important issues related to proximity

measures: (1) how to handle the case in which attributes have different scales

and/or are correlated, (2) how to calculate proximity between objects that are

composed of different types of attributes, e.g., quantitative and qualitative, and

(3) how to handle proximity calculations when attributes have different

weights; i.e., when not all attributes contribute equally to the proximity of

objects.

Standardization and Correlation for Distance Measures

An important issue with distance measures is how to handle the situation

when attributes do not have the same range of values. (This situation is often

described by saying that “the variables have different scales.”) In a previous

example, Euclidean distance was used to measure the distance between

people based on two attributes: age and income. Unless these two attributes

are standardized, the distance between two people will be dominated by

income.
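A small numerical sketch makes the effect of scale concrete. The three (age, income) records below are made up for illustration, and z-score standardization with the population standard deviation is just one possible choice of standardization.

```python
import math
import statistics

# Hypothetical (age in years, income in dollars) records.
people = [(25, 30000.0), (60, 32000.0), (27, 90000.0)]

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def standardize(rows):
    """Z-score each attribute (column): subtract the mean, divide by the std. dev."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

# Persons 0 and 1 differ mostly in age; persons 0 and 2 differ mostly in income.
raw_01 = euclidean(people[0], people[1])
raw_02 = euclidean(people[0], people[2])

z = standardize(people)
z_01 = euclidean(z[0], z[1])
z_02 = euclidean(z[0], z[2])

print(raw_01, raw_02)  # the income difference dominates the raw distances
print(z_01, z_02)      # after standardization the two distances are comparable
```

Before standardization, a 35-year age gap is almost invisible next to a modest income gap; afterward both attributes contribute on a similar footing.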

A related issue is how to compute distance when there is correlation between

some of the attributes, perhaps in addition to differences in the ranges of

values. A generalization of Euclidean distance, the Mahalanobis distance, is

useful when attributes are correlated, have different ranges of values (different

variances), and the distribution of the data is approximately Gaussian

(normal). Correlated variables have a large impact on standard distance

measures since a change in any of the correlated variables is reflected in a

change in all the correlated variables. Specifically, the Mahalanobis distance between two objects (vectors) x and y is defined as

$$\text{Mahalanobis}(x, y) = (x - y)' \Sigma^{-1} (x - y), \tag{2.27}$$

where $\Sigma^{-1}$ is the inverse of the covariance matrix of the data. Note that the covariance matrix $\Sigma$ is the matrix whose $ij$th entry is the covariance of the $i$th and $j$th attributes as defined by Equation 2.11.

Example 2.26.

In Figure 2.21, there are 1000 points, whose x and y attributes have a correlation of 0.6. The distance between the two large points at the opposite ends of the long axis of the ellipse is 14.7 in terms of Euclidean distance, but only 6 with respect to Mahalanobis distance. This is because the Mahalanobis distance gives less emphasis to the direction of largest variance. In practice, computing the Mahalanobis distance is expensive, but can be worthwhile for data whose attributes are correlated. If the attributes are relatively uncorrelated, but have different ranges, then standardizing the variables is sufficient.

Figure 2.21.

Set of two-dimensional points. The Mahalanobis distance between the two

points represented by large dots is 6; their Euclidean distance is 14.7.
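The behavior described in Example 2.26 can be reproduced on a tiny data set. The sketch below is restricted to two attributes so that the covariance matrix can be inverted by hand; the seven points are made up, the sample covariance (dividing by n - 1) is an assumption, and the function follows Equation 2.27 as written, without a square root.

```python
def covariance_matrix_2d(points):
    """Sample covariance matrix of two-dimensional points (divides by n - 1)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mahalanobis(x, y, cov_inv):
    """(x - y)' * cov_inv * (x - y), following Equation 2.27 (no square root)."""
    dx = [x[0] - y[0], x[1] - y[1]]
    tmp = [cov_inv[0][0] * dx[0] + cov_inv[0][1] * dx[1],
           cov_inv[1][0] * dx[0] + cov_inv[1][1] * dx[1]]
    return dx[0] * tmp[0] + dx[1] * tmp[1]

# Made-up points stretched along the diagonal, so the two attributes are correlated.
points = [(-2, -2), (-1, -1), (0, 0), (1, 1), (2, 2), (-1, 1), (1, -1)]
cov_inv = inverse_2x2(covariance_matrix_2d(points))

# Two pairs with the same Euclidean separation.
along = mahalanobis((-2, -2), (2, 2), cov_inv)    # along the direction of largest variance
across = mahalanobis((-2, 2), (2, -2), cov_inv)   # across it
print(along, across)
```

Both pairs are equally far apart in Euclidean terms, yet the pair lying along the direction of largest variance receives the smaller Mahalanobis value. With an identity covariance matrix the function reduces to squared Euclidean distance.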

Combining Similarities for Heterogeneous Attributes

The previous definitions of similarity were based on approaches that assumed

all the attributes were of the same type. A general approach is needed when

the attributes are of different types. One straightforward approach is to

compute the similarity for each attribute separately using Table 2.7,

and then combine these similarities using a method that results in a similarity

between 0 and 1. One possible approach is to define the overall similarity as

the average of all the individual attribute similarities. Unfortunately, this

approach does not work well if some of the attributes are asymmetric

attributes. For example, if all the attributes are asymmetric binary attributes,

then the similarity measure suggested previously reduces to the simple

matching coefficient, a measure that is not appropriate for asymmetric binary

attributes. The easiest way to fix this problem is to omit asymmetric attributes

from the similarity calculation when their values are 0 for both of the objects

whose similarity is being computed. A similar approach also works well for

handling missing values.

In summary, Algorithm 2.2 is effective for computing an overall similarity

between two objects, x and y, with different types of attributes. This procedure

can be easily modified to work with dissimilarities.
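Algorithm 2.2 can be sketched in a few lines. The version below is an illustration under stated assumptions rather than a reference implementation: missing values are represented by `None`, the per-attribute similarity functions are hypothetical stand-ins for the entries of Table 2.7, the optional `weights` argument corresponds to attribute weights (all equal to 1 by default), and returning 0 when every attribute is skipped is an arbitrary choice.

```python
def overall_similarity(x, y, sim_funcs, asymmetric, weights=None):
    """Combine per-attribute similarities in [0, 1] into one overall similarity."""
    n = len(x)
    if weights is None:
        weights = [1.0] * n
    num = 0.0
    den = 0.0
    for k in range(n):
        if x[k] is None or y[k] is None:
            delta = 0   # a missing value: leave the attribute out
        elif asymmetric[k] and x[k] == 0 and y[k] == 0:
            delta = 0   # 0-0 match on an asymmetric attribute: leave it out
        else:
            delta = 1
        if delta:
            num += weights[k] * sim_funcs[k](x[k], y[k])
            den += weights[k]
    return num / den if den > 0 else 0.0   # assumed convention when nothing is comparable

def nominal(a, b):
    """Similarity for nominal or binary values: 1 if equal, 0 otherwise."""
    return 1.0 if a == b else 0.0

def scaled_0_10(a, b):
    """Hypothetical similarity for a numeric attribute known to lie in [0, 10]."""
    return 1.0 - abs(a - b) / 10.0

# Attributes: color (nominal), two asymmetric binary flags, one numeric score.
sim_funcs = [nominal, nominal, nominal, scaled_0_10]
asymmetric = [False, True, True, False]
x = ("red", 1, 0, 4.0)
y = ("red", 0, 0, 6.0)

s = overall_similarity(x, y, sim_funcs, asymmetric)
s_weighted = overall_similarity(x, y, sim_funcs, asymmetric, weights=[2.0, 1.0, 1.0, 1.0])
print(s, s_weighted)
```

In this example the third attribute is a 0-0 match on an asymmetric attribute, so it is excluded from both the numerator and the denominator.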

Algorithm 2.2 Similarities of heterogeneous objects.

1: For the $k$th attribute, compute a similarity, $s_k(x, y)$, in the range [0, 1].

2: Define an indicator variable, $\delta_k$, for the $k$th attribute as follows: $\delta_k = 0$ if the $k$th attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the $k$th attribute; $\delta_k = 1$ otherwise.

3: Compute the overall similarity between the two objects using the following formula:

$$\text{similarity}(x, y) = \frac{\sum_{k=1}^{n} \delta_k s_k(x, y)}{\sum_{k=1}^{n} \delta_k} \tag{2.28}$$

Using Weights

In much of the previous discussion, all attributes were treated equally when computing proximity. This is not desirable when some attributes are more important to the definition of proximity than others. To address these situations, the formulas for proximity can be modified by weighting the contribution of each attribute.

With attribute weights, $w_k$, Equation 2.28 becomes

$$\text{similarity}(x, y) = \frac{\sum_{k=1}^{n} w_k \delta_k s_k(x, y)}{\sum_{k=1}^{n} w_k \delta_k}. \tag{2.29}$$

The definition of the Minkowski distance can also be modified as follows:

$$d(x, y) = \left( \sum_{k=1}^{n} w_k |x_k - y_k|^r \right)^{1/r}. \tag{2.30}$$

2.4.10 Selecting the Right Proximity Measure

A few general observations may be helpful. First, the type of proximity measure should fit the type of data. For many types of dense, continuous data, metric distance measures such as Euclidean distance are often used. Proximity between continuous attributes is most often expressed in terms of differences, and distance measures provide a well-defined way of combining these differences into an overall proximity measure. Although attributes can have different scales and be of differing importance, these issues can often be dealt with as described earlier, for example, by normalizing and weighting the attributes.
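As a brief illustration of how weighting enters a distance computation, the following sketch implements the weighted Minkowski distance of Equation 2.30. The function name and the sample vectors and weights are illustrative.

```python
def weighted_minkowski(x, y, w, r=2):
    """(sum_k w_k * |x_k - y_k|^r)^(1/r); r = 2 is a weighted Euclidean distance."""
    total = sum(wk * abs(xk - yk) ** r for wk, xk, yk in zip(w, x, y))
    return total ** (1.0 / r)

x, y = (0.0, 0.0), (3.0, 4.0)

d_equal = weighted_minkowski(x, y, w=(1.0, 1.0), r=2)   # ordinary Euclidean distance
d_first = weighted_minkowski(x, y, w=(4.0, 0.0), r=2)   # only the first attribute counts
d_city = weighted_minkowski(x, y, w=(1.0, 1.0), r=1)    # city block distance
print(d_equal, d_first, d_city)
```

Setting a weight to zero removes the corresponding attribute from the distance altogether, while equal weights of 1 recover the unweighted Minkowski distance.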

For sparse data, which often consists of asymmetric attributes, we typically

employ similarity measures that ignore 0–0 matches. Conceptually, this

reflects the fact that, for a pair of complex objects, similarity depends on the

number of characteristics they both share, rather than the number of

characteristics they both lack. The cosine, Jaccard, and extended Jaccard

measures are appropriate for such data.
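The effect of 0–0 matches is easy to see on sparse binary vectors. In the sketch below (the vectors are made up), the simple matching coefficient counts shared absences as agreement, whereas the Jaccard and cosine measures ignore them.

```python
import math

def smc(x, y):
    """Simple matching coefficient: fraction of positions where the values agree."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)

def jaccard(x, y):
    """Number of 1-1 matches divided by the number of positions that are not 0-0."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    not_00 = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return f11 / not_00 if not_00 else 0.0

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

x = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
print(smc(x, y), jaccard(x, y), cosine(x, y))

# Padding both vectors with 90 more shared zeros changes only the SMC.
x_long = x + [0] * 90
y_long = y + [0] * 90
print(smc(x_long, y_long), jaccard(x_long, y_long), cosine(x_long, y_long))
```

The two objects share only one of the three characteristics that either possesses, yet the simple matching coefficient rates them as highly similar, and more so as further absent characteristics are added.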

There are other characteristics of data vectors that often need to be

considered. Invariance to scaling (multiplication) and to translation (addition)

were previously discussed with respect to Euclidean distance and the cosine

and correlation measures. The practical implications of such considerations

are that, for example, cosine is more suitable for sparse document data where

only scaling is important, while correlation works better for time series, where

both scaling and translation are important. Euclidean distance or other types

of Minkowski distance are most appropriate when two data vectors are to

match as closely as possible across all components (features).
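These invariance properties can be verified directly. The following sketch uses two made-up vectors and checks how each measure responds when one of them is multiplied by a constant or shifted by a constant.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def correlation(x, y):
    """Pearson correlation: the cosine of the mean-centered vectors."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])

x = [1.0, 2.0, 4.0, 3.0]
y = [2.0, 1.0, 3.0, 5.0]
y_scaled = [2.0 * v for v in y]    # scaling (multiplication)
y_shifted = [v + 5.0 for v in y]   # translation (addition)

for name, f in [("euclidean", euclidean), ("cosine", cosine), ("correlation", correlation)]:
    print(name, f(x, y), f(x, y_scaled), f(x, y_shifted))
```

Cosine is unchanged by scaling but not by translation, correlation is unchanged by both, and Euclidean distance changes under either transformation.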

In some cases, transformation or normalization of the data is needed to obtain

a proper similarity measure. For instance, time series can have trends or

periodic patterns that significantly impact similarity. Also, a proper computation

of similarity often requires that time lags be taken into account. Finally, two

time series may be similar only over specific periods of time. For example,

there is a strong relationship between temperature and the use of natural gas,

but only during the heating season.

Practical considerations can also be important. Sometimes, one or more

proximity measures are already in use in a particular field, and thus, others

will have answered the question of which proximity measures should be used.

Other times, the software package or clustering algorithm being used can

drastically limit the choices. If efficiency is a concern, then we may want to

choose a proximity measure that has a property, such as the triangle

inequality, that can be used to reduce the number of proximity calculations.

(See Exercise 25.)
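One way the triangle inequality can save work is sketched below. This is an illustrative pivot-based scheme, not necessarily the approach the exercise has in mind: distances from a fixed reference point (the hypothetical `pivot`) to every data point are computed once, and at query time the bound d(q, p) >= |d(q, pivot) - d(pivot, p)| is used to skip points that cannot be the nearest neighbor.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nearest_with_pivot(query, points, pivot):
    """Return (index of the nearest point, number of query-time distance evaluations)."""
    # Computed once per data set, independent of the query.
    pivot_dists = [euclidean(pivot, p) for p in points]

    d_qp = euclidean(query, pivot)
    evaluations = 1
    best_idx, best_dist = None, float("inf")
    for i, p in enumerate(points):
        # Triangle inequality: d(query, p) >= |d(query, pivot) - d(pivot, p)|.
        lower_bound = abs(d_qp - pivot_dists[i])
        if lower_bound >= best_dist:
            continue   # p cannot beat the current best, so its distance is never computed
        d = euclidean(query, p)
        evaluations += 1
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, evaluations

points = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (30.0, 0.0), (40.0, 0.0)]
idx, evals = nearest_with_pivot((1.0, 0.0), points, pivot=(0.0, 0.0))
print(idx, evals)
```

A brute-force search would evaluate the distance to all five points; here most of them are ruled out by the bound alone. The pruning step is only valid for proximity measures that satisfy the triangle inequality.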

However, if common practice or practical restrictions do not dictate a choice,

then the proper choice of a proximity measure can be a time-consuming task

that requires careful consideration of both domain knowledge and the purpose

for which the measure is being used. A number of different similarity

measures may need to be evaluated to see which ones produce results that

make the most sense.

2.5 Bibliographic Notes

It is essential to understand the nature of the data that is being analyzed, and

at a fundamental level, this is the subject of measurement theory. In particular,

one of the initial motivations for defining types of attributes was to be precise

about which statistical operations were valid for what sorts of data. We have

presented the view of measurement theory that was initially described in a

classic paper by S. S. Stevens [112]. (Tables 2.2 and 2.3 are derived

from those presented by Stevens [113].) While this is the most common view

and is reasonably easy to understand and apply, there is, of course, much

more to measurement theory. An authoritative discussion can be found in a

three-volume series on the foundations of measurement theory [88, 94, 114].

Also of interest is a wide-ranging article by Hand [77], which discusses

measurement theory and statistics, and is accompanied by comments from

other researchers in the field. Numerous critiques and extensions of the

approach of Stevens have been made [66, 97, 117]. Finally, many books and

articles describe measurement issues for particular areas of science and

engineering.

Data quality is a broad subject that spans every discipline that uses data.

Discussions of precision, bias, accuracy, and significant figures can be found

in many introductory science, engineering, and statistics textbooks. The view

of data quality as “fitness for use” is explained in more detail in the book by

Redman [103]. Those interested in data quality may also be interested in

MIT’s Information Quality (MITIQ) Program [95, 118]. However, the knowledge

needed to deal with specific data quality issues in a particular domain is often

best obtained by investigating the data quality practices of researchers in that

field.

Aggregation is a less well-defined subject than many other preprocessing

tasks. However, aggregation is one of the main techniques used by the

database area of Online Analytical Processing (OLAP) [68, 76, 102]. There

has also been relevant work in the area of symbolic data analysis (Bock and

Diday [64]). One of the goals in this area is to summarize traditional record

data in terms of symbolic data objects whose attributes are more complex

than traditional attributes. Specifically, these attributes can have values that

are sets of values (categories), intervals, or sets of values with weights

(histograms). Another goal of symbolic data analysis is to be able to perform

clustering, classification, and other kinds of data analysis on data that consists

of symbolic data objects.

Sampling is a subject that has been well studied in statistics and related fields.

Many introductory statistics books, such as the one by Lindgren [90], have

some discussion about sampling, and entire books are devoted to the subject,

such as the classic text by Cochran [67]. A survey of sampling for data mining

is provided by Gu and Liu [74], while a survey of sampling for databases is

provided by Olken and Rotem [98]. There are a number of other data mining

and database-related sampling references that may be of interest, including

papers by Palmer and Faloutsos [100], Provost et al. [101], Toivonen [115],

and Zaki et al. [119].

In statistics, the traditional techniques that have been used for dimensionality

reduction are multidimensional scaling (MDS) (Borg and Groenen [65],

Kruskal and Uslaner [89]) and principal component analysis (PCA) (Jolliffe

[80]), which is similar to singular value decomposition (SVD) (Demmel [70]).

Dimensionality reduction is discussed in more detail in Appendix B.

Discretization is a topic that has been extensively investigated in data mining.

Some classification algorithms work only with categorical data, and

association analysis requires binary data, and thus, there is a significant

motivation to investigate how to best binarize or discretize continuous

attributes. For association analysis, we refer the reader to work by Srikant and

Agrawal [111], while some useful references for discretization in the area of

classification include work by Dougherty et al. [71], Elomaa and Rousu [72],

Fayyad and Irani [73], and Hussain et al. [78].

Feature selection is another topic well investigated in data mining. A broad

coverage of this topic is provided in a survey by Molina et al. [96] and two

books by Liu and Motoda [91, 92]. Other useful papers include those by Blum

and Langley [63], Kohavi and John [87], and Liu et al. [93].

It is difficult to provide references for the subject of feature transformations

because practices vary from one discipline to another. Many statistics books

have a discussion of transformations, but typically the discussion is restricted

to a particular purpose, such as ensuring the normality of a variable or making

sure that variables have equal variance. We offer two references: Osborne

[99] and Tukey [116].

While we have covered some of the most commonly used distance and

similarity measures, there are hundreds of such measures and more are

being created all the time. As with so many other topics in this chapter, many

of these measures are specific to particular fields, e.g., in the area of time

series see papers by Kalpakis et al. [81] and Keogh and Pazzani [83].

Clustering books provide the best general discussions. In particular, see the

books by Anderberg [62], Jain and Dubes [79], Kaufman and Rousseeuw [82],

and Sneath and Sokal [109].

Information-based measures of similarity have become more popular lately

despite the computational difficulties and expense of calculating them. A good

introduction to information theory is provided by Cover and Thomas [69].

Computing the mutual information for continuous variables can be

straightforward if they follow a well-known distribution, such as Gaussian.

However, this is often not the case, and many techniques have been

developed. As one example, the article by Khan et al. [85] compares various

methods in the context of comparing short time series. See also the

information and mutual information packages for R and Matlab. Mutual

information has been the subject of considerable recent attention due to papers

by Reshef et al. [104, 105] that introduced an alternative measure, albeit one

based on mutual information, which was claimed to have superior properties.

Although this approach had some early support, e.g., [110], others have

pointed out various limitations [75, 86, 108].

Two popular books on the topic of kernel methods are [106] and [107]. The

latter also has a website with links to kernel-related materials [84]. In addition,

many current data mining, machine learning, and statistical learning textbooks

have some material about kernel methods. Further references for kernel

methods in the context of support vector machine classifiers are provided in

the Bibliographic Notes of Section 4.9.4.

Bibliography

[62] M. R. Anderberg. Cluster Analysis for Applications. Academic Press, New

York, December 1973.

[63] A. Blum and P. Langley. Selection of Relevant Features and Examples in

Machine Learning. Artificial Intelligence, 97(1–2):245–271, 1997.

[64] H. H. Bock and E. Diday. Analysis of Symbolic Data: Exploratory Methods

for Extracting Statistical Information from Complex Data (Studies in

Classification, Data Analysis, and Knowledge Organization). Springer-

Verlag Telos, January 2000.

[65] I. Borg and P. Groenen. Modern Multidimensional Scaling—Theory and

Applications. Springer-Verlag, February 1997.

[66] N. R. Chrisman. Rethinking levels of measurement for cartography.

Cartography and Geographic Information Systems, 25(4):231–242, 1998.

[67] W. G. Cochran. Sampling Techniques. John Wiley & Sons, 3rd edition,

July 1977.

[68] E. F. Codd, S. B. Codd, and C. T. Smalley. Providing OLAP (On-line

Analytical Processing) to User-Analysts: An IT Mandate. White Paper, E.F.

Codd and Associates, 1993.

[69] T. M. Cover and J. A. Thomas. Elements of information theory. John

Wiley & Sons, 2012.

[70] J. W. Demmel. Applied Numerical Linear Algebra. Society for Industrial &

Applied Mathematics, September 1997.

[71] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and Unsupervised

Discretization of Continuous Features. In Proc. of the 12th Intl. Conf. on

Machine Learning, pages 194–202, 1995.

[72] T. Elomaa and J. Rousu. General and Efficient Multisplitting of Numerical

Attributes. Machine Learning, 36(3):201–244, 1999.

[73] U. M. Fayyad and K. B. Irani. Multi-interval discretization of

continuous-valued attributes for classification learning. In Proc. 13th Int.

Joint Conf. on Artificial Intelligence, pages 1022–1027. Morgan Kaufman,

1993.

[74] F. H. Gaohua Gu and H. Liu. Sampling and Its Application in Data Mining:

A Survey. Technical Report TRA6/00, National University of Singapore,

Singapore, 2000.

[75] M. Gorfine, R. Heller, and Y. Heller. Comment on Detecting novel

associations in large data sets. Unpublished (available at http://emotion.

technion. ac. il/ gorfinm/files/science6. pdf on 11 Nov. 2012), 2012.

[76] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M.

Venkatrao, F. Pellow, and H. Pirahesh. Data Cube: A Relational

Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals.

Journal Data Mining and Knowledge Discovery, 1(1): 29–53, 1997.

[77] D. J. Hand. Statistics and the Theory of Measurement. Journal of the

Royal Statistical Society: Series A (Statistics in Society), 159(3):445–492,

1996.

[78] F. Hussain, H. Liu, C. L. Tan, and M. Dash. TRC6/99: Discretization: an

enabling technique. Technical report, National University of Singapore,

Singapore, 1999.

[79] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall

Advanced Reference Series. Prentice Hall, March 1988.

[80] I. T. Jolliffe. Principal Component Analysis. Springer Verlag, 2nd edition,

October 2002.

[81] K. Kalpakis, D. Gada, and V. Puttagunta. Distance Measures for Effective

Clustering of ARIMA Time-Series. In Proc. of the 2001 IEEE Intl. Conf. on

Data Mining, pages 273–280. IEEE Computer Society, 2001.

[82] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An

Introduction to Cluster Analysis. Wiley Series in Probability and Statistics.

John Wiley and Sons, New York, November 1990.

[83] E. J. Keogh and M. J. Pazzani. Scaling up dynamic time warping for

datamining applications. In KDD, pages 285–289, 2000.

[84] Kernel Methods for Pattern Analysis Website. http://www.kernel-methods.net/, 2014.

[85] S. Khan, S. Bandyopadhyay, A. R. Ganguly, S. Saigal, D. J. Erickson III,

V. Protopopescu, and G. Ostrouchov. Relative performance of mutual

information estimation methods for quantifying the dependence among

short and noisy data. Physical Review E, 76(2):026209, 2007.

[86] J. B. Kinney and G. S. Atwal. Equitability, mutual information, and the

maximal information coefficient. Proceedings of the National Academy of

Sciences, 2014.

[87] R. Kohavi and G. H. John. Wrappers for Feature Subset Selection.

Artificial Intelligence, 97(1–2):273–324, 1997.

[88] D. Krantz, R. D. Luce, P. Suppes, and A. Tversky. Foundations of

Measurements: Volume 1: Additive and polynomial representations.

Academic Press, New York, 1971.

[89] J. B. Kruskal and E. M. Uslaner. Multidimensional Scaling. Sage

Publications, August 1978.

[90] B. W. Lindgren. Statistical Theory. CRC Press, January 1993.

[91] H. Liu and H. Motoda, editors. Feature Extraction, Construction and

Selection: A Data Mining Perspective. Kluwer International Series in

Engineering and Computer Science, 453. Kluwer Academic Publishers,

July 1998.

[92] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and

Data Mining. Kluwer International Series in Engineering and Computer

Science, 454. Kluwer Academic Publishers, July 1998.

[93] H. Liu, H. Motoda, and L. Yu. Feature Extraction, Selection, and

Construction. In N. Ye, editor, The Handbook of Data Mining, pages 22–

41. Lawrence Erlbaum Associates, Inc., Mahwah, NJ, 2003.

[94] R. D. Luce, D. Krantz, P. Suppes, and A. Tversky. Foundations of

Measurements: Volume 3: Representation, Axiomatization, and

Invariance. Academic Press, New York, 1990.

[95] MIT Information Quality (MITIQ) Program. http://mitiq.mit.edu/, 2014.

[96] L. C. Molina, L. Belanche, and A. Nebot. Feature Selection Algorithms: A

Survey and Experimental Evaluation. In Proc. of the 2002 IEEE Intl. Conf.

on Data Mining, 2002.

[97] F. Mosteller and J. W. Tukey. Data analysis and regression: a second

course in statistics. Addison-Wesley, 1977.

[98] F. Olken and D. Rotem. Random Sampling from Databases—A Survey.

Statistics & Computing, 5(1):25–42, March 1995.

[99] J. Osborne. Notes on the Use of Data Transformations. Practical

Assessment, Research & Evaluation, 28(6), 2002.

[100] C. R. Palmer and C. Faloutsos. Density biased sampling: An improved

method for data mining and clustering. ACM SIGMOD Record, 29(2):82–

92, 2000.

[101] F. J. Provost, D. Jensen, and T. Oates. Efficient Progressive Sampling.

In Proc. of the 5th Intl. Conf. on Knowledge Discovery and Data Mining,

pages 23–32, 1999.

[102] R. Ramakrishnan and J. Gehrke. Database Management Systems.

McGraw-Hill, 3rd edition, August 2002.

[103] T. C. Redman. Data Quality: The Field Guide. Digital Press, January

2001.

[104] D. Reshef, Y. Reshef, M. Mitzenmacher, and P. Sabeti. Equitability

analysis of the maximal information coefficient, with comparisons. arXiv

preprint arXiv:1301.6314, 2013.

[105] D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G.

McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C.

Sabeti. Detecting novel associations in large data sets. Science,

334(6062):1518–1524, 2011.

[106] B. Schölkopf and A. J. Smola. Learning with kernels: support vector

machines, regularization, optimization, and beyond. MIT Press, 2002.

[107] J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis.

Cambridge University Press, 2004.

[108] N. Simon and R. Tibshirani. Comment on “Detecting Novel Associations

In Large Data Sets” by Reshef et al., Science Dec 16, 2011. arXiv preprint

arXiv:1401.7645, 2014.

[109] P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy. Freeman, San

Francisco, 1971.

[110] T. Speed. A correlation for the 21st century. Science, 334(6062):1502–

1503, 2011.

[111] R. Srikant and R. Agrawal. Mining Quantitative Association Rules in

Large Relational Tables. In Proc. of 1996 ACM-SIGMOD Intl. Conf. on

Management of Data, pages 1–12, Montreal, Quebec, Canada, August

1996.

[112] S. S. Stevens. On the Theory of Scales of Measurement. Science,

103(2684):677–680, June 1946.

[113] S. S. Stevens. Measurement. In G. M. Maranell, editor, Scaling: A

Sourcebook for Behavioral Scientists, pages 22–41. Aldine Publishing Co.,

Chicago, 1974.

[114] P. Suppes, D. Krantz, R. D. Luce, and A. Tversky. Foundations of

Measurements: Volume 2: Geometrical, Threshold, and Probabilistic

Representations. Academic Press, New York, 1989.

[115] H. Toivonen. Sampling Large Databases for Association Rules. In

VLDB96, pages 134–145. Morgan Kaufman, September 1996.

[116] J. W. Tukey. On the Comparative Anatomy of Transformations. Annals of

Mathematical Statistics, 28(3):602–632, September 1957.

[117] P. F. Velleman and L. Wilkinson. Nominal, ordinal, interval, and ratio

typologies are misleading. The American Statistician, 47(1):65–72, 1993.

[118] R. Y. Wang, M. Ziad, Y. W. Lee, and Y. R. Wang. Data Quality. The

Kluwer International Series on Advances in Database Systems, Volume

23. Kluwer Academic Publishers, January 2001.

[119] M. J. Zaki, S. Parthasarathy, W. Li, and M. Ogihara. Evaluation of

Sampling for Data Mining of Association Rules. Technical Report TR617,

Rensselaer Polytechnic Institute, 1996.

2.6 Exercises

1. In the initial example of Chapter 2, the statistician says, “Yes, fields 2

and 3 are basically the same.” Can you tell from the three lines of sample data

that are shown why she says that?

2. Classify the following attributes as binary, discrete, or continuous. Also

classify them as qualitative (nominal or ordinal) or quantitative (interval or

ratio). Some cases may have more than one interpretation, so briefly indicate

your reasoning if you think there may be some ambiguity.

Example: Age in years. Answer: Discrete, quantitative, ratio

a. Time in terms of AM or PM.

b. Brightness as measured by a light meter.

c. Brightness as measured by people’s judgments.

d. Angles as measured in degrees between 0 and 360.

e. Bronze, Silver, and Gold medals as awarded at the Olympics.

f. Height above sea level.

g. Number of patients in a hospital.

h. ISBN numbers for books. (Look up the format on the Web.)

i. Ability to pass light in terms of the following values: opaque, translucent,

transparent.

j. Military rank.

k. Distance from the center of campus.

l. Density of a substance in grams per cubic centimeter.

m. Coat check number. (When you attend an event, you can often give your

coat to someone who, in turn, gives you a number that you can use to

claim your coat when you leave.)

3. You are approached by the marketing director of a local company, who

believes that he has devised a foolproof way to measure customer

satisfaction. He explains his scheme as follows: “It’s so simple that I can’t

believe that no one has thought of it before. I just keep track of the number of

customer complaints for each product. I read in a data mining book that

counts are ratio attributes, and so, my measure of product satisfaction must

be a ratio attribute. But when I rated the products based on my new customer

satisfaction measure and showed them to my boss, he told me that I had

overlooked the obvious, and that my measure was worthless. I think that he

was just mad because our bestselling product had the worst satisfaction since

it had the most complaints. Could you help me set him straight?”

a. Who is right, the marketing director or his boss? If you answered, his

boss, what would you do to fix the measure of satisfaction?

b. What can you say about the attribute type of the original product

satisfaction attribute?

4. A few months later, you are again approached by the same marketing

director as in Exercise 3. This time, he has devised a better approach to

measure the extent to which a customer prefers one product over other similar

products. He explains, “When we develop new products, we typically create

several variations and evaluate which one customers prefer. Our standard

procedure is to give our test subjects all of the product variations at one time

and then ask them to rank the product variations in order of preference.

However, our test subjects are very indecisive, especially when there are

more than two products. As a result, testing takes forever. I suggested that we

perform the comparisons in pairs and then use these comparisons to get the

rankings. Thus, if we have three product variations, we have the customers

compare variations 1 and 2, then 2 and 3, and finally 3 and 1. Our testing time

with my new procedure is a third of what it was for the old procedure, but the

employees conducting the tests complain that they cannot come up with a

consistent ranking from the results. And my boss wants the latest product

evaluations, yesterday. I should also mention that he was the person who

came up with the old product evaluation approach. Can you help me?”

a. Is the marketing director in trouble? Will his approach work for generating

an ordinal ranking of the product variations in terms of customer

preference? Explain.

b. Is there a way to fix the marketing director’s approach? More generally,

what can you say about trying to create an ordinal measurement scale

based on pairwise comparisons?

c. For the original product evaluation scheme, the overall rankings of each

product variation are found by computing its average over all test

subjects. Comment on whether you think that this is a reasonable

approach. What other approaches might you take?

5. Can you think of a situation in which identification numbers would be useful

for prediction?

6. An educational psychologist wants to use association analysis to analyze

test results. The test consists of 100 questions with four possible answers

each.

a. How would you convert this data into a form suitable for association

analysis?

b. In particular, what type of attributes would you have and how many of

them are there?

7. Which of the following quantities is likely to show more temporal

autocorrelation: daily rainfall or daily temperature? Why?

8. Discuss why a document-term matrix is an example of a data set that has

asymmetric discrete or asymmetric continuous features.

9. Many sciences rely on observation instead of (or in addition to) designed

experiments. Compare the data quality issues involved in observational

science with those of experimental science and data mining.

10. Discuss the difference between the precision of a measurement and the

terms single and double precision, as they are used in computer science,

typically to represent floating-point numbers that require 32 and 64 bits,

respectively.

11. Give at least two advantages to working with data stored in text files

instead of in a binary format.

12. Distinguish between noise and outliers. Be sure to consider the following

questions.

a. Is noise ever interesting or desirable? Outliers?

b. Can noise objects be outliers?

c. Are noise objects always outliers?

d. Are outliers always noise objects?

e. Can noise make a typical value into an unusual one, or vice versa?

13. Consider the problem of finding the K-nearest neighbors of a data object. A programmer designs Algorithm 2.3 for this task.

Algorithm 2.3 Algorithm for finding k-nearest neighbors.

1: for i=1 to number of data objects do

2: Find the distances of the ith object to all other objects.

3: Sort these distances in decreasing order.

(Keep track of which object is associated with each distance.)

4: return the objects associated with the first k distances of the sorted list

5: end for

a. Describe the potential problems with this algorithm if there are duplicate objects in the data set. Assume the distance function will return a distance of 0 only for objects that are the same.

b. How would you fix this problem?

14. The following attributes are measured for members of a herd of Asian elephants: weight, height, tusk length, trunk length, and ear area. Based on these measurements, what sort of proximity measure from Section 2.4 would you use to compare or group these elephants? Justify your answer and explain any special circumstances.

15. You are given a set of m objects that is divided into k groups, where the ith group is of size $m_i$. If the goal is to obtain a sample of size $n<m$, what is the difference between the following two sampling schemes? (Assume sampling with replacement.)

a. We randomly select $n \times m_i/m$ elements from each group.

b. We randomly select n elements from the data set, without regard for the group to which an object belongs.

16. Consider a document-term matrix, where $tf_{ij}$ is the frequency of the ith word (term) in the jth document and m is the number of documents. Consider the variable transformation that is defined by

$$tf'_{ij} = tf_{ij} \times \log\frac{m}{df_i}, \quad (2.31)$$

where $df_i$ is the number of documents in which the ith term appears, which is known as the document frequency of the term. This transformation is known as the inverse document frequency transformation.

a. What is the effect of this transformation if a term occurs in one document? In every document?

b. What might be the purpose of this transformation?

17. Assume that we apply a square root transformation to a ratio attribute x to obtain the new attribute x*. As part of your analysis, you identify an interval (a, b) in which x* has a linear relationship to another attribute y.

a. What is the corresponding interval (A, B) in terms of x?

b. Give an equation that relates y to x.

18. This exercise compares and contrasts some similarity and distance measures.

a. For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Compute the Hamming distance and the Jaccard similarity between the following two binary vectors.

x = 0101010001

y = 0100011000

b. Which approach, Jaccard or Hamming distance, is more similar to the Simple Matching Coefficient, and which approach is more similar to the cosine measure? Explain. (Note: The Hamming measure is a distance, while the other three measures are similarities, but don’t let this confuse you.)

c. Suppose that you are comparing how similar two organisms of different species are in terms of the number of genes they share. Describe which measure, Hamming or Jaccard, you think would be more appropriate for comparing the genetic makeup of two organisms. Explain. (Assume that each animal is represented as a binary vector, where each attribute is 1 if a particular gene is present in the organism and 0 otherwise.)

d. If you wanted to compare the genetic makeup of two organisms of the same species, e.g., two human beings, would you use the Hamming distance, the Jaccard coefficient, or a different measure of similarity or distance? Explain. (Note that two human beings share >99.9% of the same genes.)

19. For the following vectors, x and y, calculate the indicated similarity or distance measures.

a. x=(1, 1, 1, 1), y=(2, 2, 2, 2) cosine, correlation, Euclidean

b. x=(0, 1, 0, 1), y=(1, 0, 1, 0) cosine, correlation, Euclidean, Jaccard

c. x=(0, −1, 0, 1), y=(1, 0, −1, 0) cosine, correlation, Euclidean

d. x=(1, 1, 0, 1, 0, 1), y=(1, 1, 1, 0, 0, 1) cosine, correlation, Jaccard

e. x=(2, −1, 0, 2, 0, −3), y=(−1, 1, −1, 0, 0, −1) cosine, correlation

20. Here, we further explore the cosine and correlation measures.

a. What is the range of values possible for the cosine measure?

b. If two objects have a cosine measure of 1, are they identical? Explain.

c. What is the relationship of the cosine measure to correlation, if any? (Hint: Look at statistical measures such as mean and standard deviation in cases where cosine and correlation are the same and different.)

d. Figure 2.22(a) shows the relationship of the cosine measure to Euclidean distance for 100,000 randomly generated points that have been normalized to have an L2 length of 1. What general observation can you make about the relationship between Euclidean distance and cosine similarity when vectors have an L2 norm of 1?

Figure 2.22.

Graphs for Exercise 20.

e. Figure 2.22(b) shows the relationship of correlation to Euclidean distance for 100,000 randomly generated points that have been standardized to have a mean of 0 and a standard deviation of 1. What general observation can you make about the relationship between Euclidean distance and correlation when the vectors have been standardized to have a mean of 0 and a standard deviation of 1?

f. Derive the mathematical relationship between cosine similarity and Euclidean distance when each data object has an L2 length of 1.

g. Derive the mathematical relationship between correlation and Euclidean distance when each data point has been standardized by subtracting its mean and dividing by its standard deviation.

21. Show that the set difference metric given by

$$d(A, B) = \text{size}(A - B) + \text{size}(B - A) \quad (2.32)$$

satisfies the metric axioms given on page 77. A and B are sets and $A - B$ is the set difference.

22. Discuss how you might map correlation values from the interval [−1, 1] to the interval [0, 1]. Note that the type of transformation that you use might depend on the application that you have in mind. Thus, consider two applications: clustering time series and predicting the behavior of one time series given another.

23. Given a similarity measure with values in the interval [0, 1], describe two ways to transform this similarity value into a dissimilarity value in the interval [0, ∞].

24. Proximity is typically defined between a pair of objects.

a. Define two ways in which you might define the proximity among a group of objects.

b. How might you define the distance between two sets of points in Euclidean space?

c. How might you define the proximity between two sets of data objects? (Make no assumption about the data objects, except that a proximity measure is defined between any pair of objects.)

25. You are given a set of points S in Euclidean space, as well as the distance of each point in S to a point x. (It does not matter if $x \in S$.)

a. If the goal is to find all points within a specified distance ε of point y, $y \neq x$, explain how you could use the triangle inequality and the already calculated distances to x to potentially reduce the number of distance calculations necessary. Hint: The triangle inequality, $d(x, z) \le d(x, y) + d(y, z)$, can be rewritten as $d(x, y) \ge d(x, z) - d(y, z)$.

b. In general, how would the distance between x and y affect the number of distance calculations?

c. Suppose that you can find a small subset of points S′ from the original data set, such that every point in the data set is within a specified distance ε of at least one of the points in S′, and that you also have the pairwise distance matrix for S′. Describe a technique that uses this information to compute, with a minimum of distance calculations, the set of all points within a distance of β of a specified point from the data set.

26. Show that 1 minus the Jaccard similarity is a distance measure between two data objects, x and y, that satisfies the metric axioms given on page 77. Specifically, $d(x, y) = 1 - J(x, y)$.

27. Show that the distance measure defined as the angle between two data vectors, x and y, satisfies the metric axioms given on page 77. Specifically, $d(x, y) = \arccos(\cos(x, y))$.

28. Explain why computing the proximity between two attributes is often simpler than computing the similarity between two objects.

3 Classification: Basic Concepts and

Techniques

Humans have an innate ability to classify things into

categories, e.g., mundane tasks such as filtering spam

email messages or more specialized tasks such as

recognizing celestial objects in telescope images (see

Figure 3.1 ). While manual classification often suffices

for small and simple data sets with only a few

attributes, larger and more complex data sets require

an automated solution.

Figure 3.1.

Classification of galaxies from telescope images taken

from the NASA website.

This chapter introduces the basic concepts of

classification and describes some of its key issues such

as model overfitting, model selection, and model

evaluation. While these topics are illustrated using a

classification technique known as decision tree

induction, most of the discussion in this chapter is also

applicable to other classification techniques, many of

which are covered in Chapter 4 .

3.1 Basic Concepts

Figure 3.2 illustrates the general idea behind classification. The data for a classification task consists of a collection of instances (records). Each such instance is characterized by the tuple (x, y), where x is the set of attribute values that describe the instance and y is the class label of the instance. The attribute set x can contain attributes of any type, while the class label y must be categorical.

Figure 3.2.

A schematic illustration of a classification task.

A classification model is an abstract representation of the relationship between the attribute set and the class label. As will be seen in the next two chapters, the model can be represented in many ways, e.g., as a tree, a probability table, or simply, a vector of real-valued parameters. More formally, we can express it mathematically as a target function f that takes as input the attribute set x and produces an output corresponding to the predicted class label. The model is said to classify an instance (x, y) correctly if $f(x) = y$.

Table 3.1 shows examples of attribute sets and class labels for various classification tasks. Spam filtering and tumor identification are examples of binary classification problems, in which each data instance can be categorized into one of two classes. If the number of classes is larger than 2, as in the galaxy classification example, then it is called a multiclass classification problem.

Table 3.1. Examples of classification tasks.

Task | Attribute set | Class label

Spam filtering | Features extracted from email message header and content | spam or non-spam

Tumor identification | Features extracted from magnetic resonance imaging (MRI) scans | malignant or benign

Galaxy classification | Features extracted from telescope images | elliptical, spiral, or irregular-shaped

We illustrate the basic concepts of classification in this chapter with the

following two examples.

Example 3.1. Vertebrate Classification

Table 3.2 shows a sample data set for classifying vertebrates into

mammals, reptiles, birds, fishes, and amphibians. The attribute set

includes characteristics of the vertebrate such as its body temperature,

skin cover, and ability to fly. The data set can also be used for a binary

classification task such as mammal classification, by grouping the reptiles,

birds, fishes, and amphibians into a single category called non-mammals.

Table 3.2. A sample data for the vertebrate classification problem.

Vertebrate Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class Label

human | warm-blooded | hair | yes | no | no | yes | no | mammal

python | cold-blooded | scales | no | no | no | no | yes | reptile

salmon | cold-blooded | scales | no | yes | no | no | no | fish

whale | warm-blooded | hair | yes | yes | no | no | no | mammal

frog | cold-blooded | none | no | semi | no | yes | yes | amphibian

komodo dragon | cold-blooded | scales | no | no | no | yes | no | reptile

bat | warm-blooded | hair | yes | no | yes | yes | yes | mammal

pigeon | warm-blooded | feathers | no | no | yes | yes | no | bird

cat | warm-blooded | fur | yes | no | no | yes | no | mammal

leopard shark | cold-blooded | scales | yes | yes | no | no | no | fish

turtle | cold-blooded | scales | no | semi | no | yes | no | reptile

penguin | warm-blooded | feathers | no | semi | no | yes | no | bird

porcupine | warm-blooded | quills | yes | no | no | yes | yes | mammal

eel | cold-blooded | scales | no | yes | no | no | no | fish

salamander | cold-blooded | none | no | semi | no | yes | yes | amphibian

Example 3.2. Loan Borrower Classification

Consider the problem of predicting whether a loan borrower will repay the loan or default on the loan payments. The data set used to build the classification model is shown in Table 3.3. The attribute set includes personal information of the borrower such as marital status and annual income, while the class label indicates whether the borrower had defaulted on the loan payments.

Table 3.3. A sample data for the loan borrower classification problem.

ID Home Owner Marital Status Annual Income Defaulted?

1 Yes Single 125000 No

2 No Married 100000 No

3 No Single 70000 No

4 Yes Married 120000 No

5 No Divorced 95000 Yes

6 No Single 60000 No

7 Yes Divorced 220000 No

8 No Single 85000 Yes

9 No Married 75000 No

10 No Single 90000 Yes

A classification model serves two important roles in data mining. First, it is

used as a predictive model to classify previously unlabeled instances. A

good classification model must provide accurate predictions with a fast

response time. Second, it serves as a descriptive model to identify the

characteristics that distinguish instances from different classes. This is

particularly useful for critical applications, such as medical diagnosis, where it

is insufficient to have a model that makes a prediction without justifying how it

reaches such a decision.

For example, a classification model induced from the vertebrate data set shown in Table 3.2 can be used to predict the class label of the following vertebrate:

Vertebrate Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class Label

gila monster | cold-blooded | scales | no | no | no | yes | yes | ?

In addition, it can be used as a descriptive model to help determine characteristics that define a vertebrate as a mammal, a reptile, a bird, a fish, or an amphibian. For example, the model may identify mammals as warm-blooded vertebrates that give birth to their young.

There are several points worth noting regarding the previous example. First, although all the attributes shown in Table 3.2 are qualitative, there are no restrictions on the type of attributes that can be used as predictor variables. The class label, on the other hand, must be of nominal type. This distinguishes classification from other predictive modeling tasks such as regression, where the predicted value is often quantitative. More information about regression can be found in Appendix D.

Another point worth noting is that not all attributes may be relevant to the classification task. For example, the average length or weight of a vertebrate may not be useful for classifying mammals, as these attributes can take the same values for both mammals and non-mammals. Such an attribute is typically discarded during preprocessing. The remaining attributes might not be able to distinguish the classes by themselves, and thus, must be used in concert with other attributes. For instance, the Body Temperature attribute is insufficient to distinguish mammals from other vertebrates. When it is used together with Gives Birth, the classification of mammals improves significantly. However, when additional attributes, such as Skin Cover, are included, the model becomes overly specific and no longer covers all mammals. Finding the optimal combination of attributes that best discriminates instances from different classes is the key challenge in building classification models.

3.2 General Framework for

Classification

Classification is the task of assigning labels to unlabeled data instances and a

classifier is used to perform such a task. A classifier is typically described in

terms of a model as illustrated in the previous section. The model is created

using a given set of instances, known as the training set, which contains

attribute values as well as class labels for each instance. The systematic

approach for learning a classification model given a training set is known as a

learning algorithm. The process of using a learning algorithm to build a

classification model from the training data is known as induction. This

process is also often described as “learning a model” or “building a model.”

The process of applying a classification model to unseen test instances to

predict their class labels is known as deduction. Thus, the process of

classification involves two steps: applying a learning algorithm to training data

to learn a model, and then applying the model to assign labels to unlabeled

instances. Figure 3.3 illustrates the general framework for classification.

Figure 3.3.

General framework for building a classification model.
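The two steps of the framework can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `learn_majority_model` and `apply_model` are hypothetical, and the "learning algorithm" is a deliberately trivial one that always predicts the most frequent training label. The three training instances are rows 1, 2, and 5 of Table 3.3, restricted to two attributes.

```python
from collections import Counter


def learn_majority_model(training_set):
    """Induction: build a (trivial) model from labeled training instances."""
    labels = [label for _, label in training_set]
    majority_label = Counter(labels).most_common(1)[0][0]
    # The "model" is a function from an attribute set to a class label.
    # This placeholder ignores the attributes entirely.
    return lambda attributes: majority_label


def apply_model(model, instances):
    """Deduction: apply the learned model to unlabeled instances."""
    return [model(attributes) for attributes in instances]


# Rows 1, 2, and 5 of Table 3.3 (class label: Defaulted?).
training_set = [
    ({"Home Owner": "Yes", "Marital Status": "Single"}, "No"),
    ({"Home Owner": "No", "Marital Status": "Married"}, "No"),
    ({"Home Owner": "No", "Marital Status": "Divorced"}, "Yes"),
]

model = learn_majority_model(training_set)
predictions = apply_model(model, [{"Home Owner": "No", "Marital Status": "Single"}])
```

Any real learning algorithm would replace the body of `learn_majority_model`; the separation between building the model and applying it stays the same.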

A classification technique refers to a general approach to classification, e.g.,

the decision tree technique that we will study in this chapter. This classification

technique, like most others, consists of a family of related models and a

number of algorithms for learning these models. In Chapter 4 , we will study

additional classification techniques, including neural networks and support

vector machines.

A couple of notes on terminology. First, the terms “classifier” and “model” are

often taken to be synonymous. If a classification technique builds a single,

global model, then this is fine. However, while every model defines a classifier,

not every classifier is defined by a single model. Some classifiers, such as k-

nearest neighbor classifiers, do not build an explicit model (Section 4.3 ),

while other classifiers, such as ensemble classifiers, combine the output of a

collection of models (Section 4.10 ). Second, the term “classifier” is often

used in a more general sense to refer to a classification technique. Thus, for

example, “decision tree classifier” can refer to the decision tree classification

technique or a specific classifier built using that technique. Fortunately, the

meaning of “classifier” is usually clear from the context.

In the general framework shown in Figure 3.3 , the induction and deduction

steps should be performed separately. In fact, as will be discussed later in

Section 3.6 , the training and test sets should be independent of each other

to ensure that the induced model can accurately predict the class labels of

instances it has never encountered before. Models that deliver such predictive

insights are said to have good generalization performance. The

performance of a model (classifier) can be evaluated by comparing the

predicted labels against the true labels of instances. This information can be

summarized in a table called a confusion matrix. Table 3.4 depicts the

confusion matrix for a binary classification problem. Each entry $f_{ij}$ denotes the number of instances from class i predicted to be of class j. For example, $f_{01}$ is the number of instances from class 0 incorrectly predicted as class 1. The number of correct predictions made by the model is $(f_{11}+f_{00})$ and the number of incorrect predictions is $(f_{10}+f_{01})$.

Table 3.4. Confusion matrix for a binary classification problem.

 | Predicted Class=1 | Predicted Class=0

Actual Class=1 | $f_{11}$ | $f_{10}$

Actual Class=0 | $f_{01}$ | $f_{00}$

Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information into a single number makes it more convenient to compare the relative performance of different models. This can be done using an evaluation metric such as accuracy, which is computed in the following way:

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}. \quad (3.1)$$

For binary classification problems, the accuracy of a model is given by

$$\text{Accuracy} = \frac{f_{11}+f_{00}}{f_{11}+f_{10}+f_{01}+f_{00}}. \quad (3.2)$$

Error rate is another related metric, which is defined as follows for binary classification problems:

$$\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10}+f_{01}}{f_{11}+f_{10}+f_{01}+f_{00}}. \quad (3.3)$$

The learning algorithms of most classification techniques are designed to learn models that attain the highest accuracy, or equivalently, the lowest error rate when applied to the test set. We will revisit the topic of model evaluation in Section 3.6.
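As a small worked illustration of accuracy and error rate for the binary case, the following Python sketch computes both from the four confusion matrix entries. The function names and the counts are hypothetical and chosen only for the example; they do not come from any data set in this chapter.

```python
def accuracy(f11, f10, f01, f00):
    """Fraction of predictions that are correct (diagonal entries)."""
    total = f11 + f10 + f01 + f00
    return (f11 + f00) / total


def error_rate(f11, f10, f01, f00):
    """Fraction of predictions that are wrong (off-diagonal entries)."""
    total = f11 + f10 + f01 + f00
    return (f10 + f01) / total


# Hypothetical counts: f_ij = instances of class i predicted as class j.
counts = {"f11": 40, "f10": 10, "f01": 5, "f00": 45}

acc = accuracy(**counts)    # (40 + 45) / 100
err = error_rate(**counts)  # (10 + 5) / 100
```

Because every prediction is either correct or wrong, the two values always sum to 1, which is why maximizing accuracy and minimizing error rate are equivalent goals.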

3.3 Decision Tree Classifier

This section introduces a simple classification technique known as the

decision tree classifier. To illustrate how a decision tree works, consider the

classification problem of distinguishing mammals from non-mammals using

the vertebrate data set shown in Table 3.2 . Suppose a new species is

discovered by scientists. How can we tell whether it is a mammal or a non-

mammal? One approach is to pose a series of questions about the

characteristics of the species. The first question we may ask is whether the

species is cold- or warm-blooded. If it is cold-blooded, then it is definitely not a

mammal. Otherwise, it is either a bird or a mammal. In the latter case, we

need to ask a follow-up question: Do the females of the species give birth to

their young? Those that do give birth are definitely mammals, while those that

do not are likely to be non-mammals (with the exception of egg-laying

mammals such as the platypus and spiny anteater).

The previous example illustrates how we can solve a classification problem by

asking a series of carefully crafted questions about the attributes of the test

instance. Each time we receive an answer, we could ask a follow-up question

until we can conclusively decide on its class label. The series of questions and

their possible answers can be organized into a hierarchical structure called a

decision tree. Figure 3.4 shows an example of the decision tree for the

mammal classification problem. The tree has three types of nodes:

A root node, with no incoming links and zero or more outgoing links.

Internal nodes, each of which has exactly one incoming link and two or

more outgoing links.

Leaf or terminal nodes, each of which has exactly one incoming link and

no outgoing links.

Every leaf node in the decision tree is associated with a class label. The non-

terminal nodes, which include the root and internal nodes, contain attribute

test conditions that are typically defined using a single attribute. Each

possible outcome of the attribute test condition is associated with exactly one

child of this node. For example, the root node of the tree shown in Figure 3.4 uses the attribute Body Temperature to define an attribute test condition that has two outcomes, warm and cold, resulting in two child nodes.

Figure 3.4.

A decision tree for the mammal classification problem.

Given a decision tree, classifying a test instance is straightforward. Starting

from the root node, we apply its attribute test condition and follow the

appropriate branch based on the outcome of the test. This will lead us either

to another internal node, for which a new attribute test condition is applied, or

to a leaf node. Once a leaf node is reached, we assign the class label

associated with the node to the test instance. As an illustration, Figure 3.5

traces the path used to predict the class label of a flamingo. The path terminates at a leaf node labeled as Non-mammals.

Figure 3.5.

Classifying an unlabeled vertebrate. The dashed lines represent the outcomes

of applying various attribute test conditions on the unlabeled vertebrate. The

vertebrate is eventually assigned to the Non-mammals class.
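The traversal just described can be sketched as follows. This is a minimal illustration, not the book's implementation: the `Leaf` and `Node` classes, the `classify` function, and the label strings are hypothetical names, and the tree encodes only the two questions discussed in the text (body temperature, then whether the species gives birth).

```python
class Leaf:
    """Terminal node: carries a class label and has no outgoing links."""

    def __init__(self, label):
        self.label = label


class Node:
    """Root or internal node: tests one attribute, one child per outcome."""

    def __init__(self, attribute, children):
        self.attribute = attribute
        self.children = children  # maps each test outcome to a child node


def classify(node, instance):
    """Follow the branch matching each test outcome until a leaf is reached."""
    while isinstance(node, Node):
        node = node.children[instance[node.attribute]]
    return node.label


# Tree for the mammal classification problem as described in the text.
mammal_tree = Node("Body Temperature", {
    "cold": Leaf("Non-mammals"),
    "warm": Node("Gives Birth", {
        "yes": Leaf("Mammals"),
        "no": Leaf("Non-mammals"),
    }),
})

flamingo = {"Body Temperature": "warm", "Gives Birth": "no"}
flamingo_label = classify(mammal_tree, flamingo)
```

Each test instance visits at most one node per level, so the cost of classifying it is bounded by the depth of the tree.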

3.3.1 A Basic Algorithm to Build a

Decision Tree

Many possible decision trees can be constructed from a particular data

set. While some trees are better than others, finding an optimal one is

computationally expensive due to the exponential size of the search space.

Efficient algorithms have been developed to induce a reasonably accurate,

albeit suboptimal, decision tree in a reasonable amount of time. These

algorithms usually employ a greedy strategy to grow the decision tree in a top-

down fashion by making a series of locally optimal decisions about which

attribute to use when partitioning the training data. One of the earliest methods

is Hunt’s algorithm, which is the basis for many current implementations of

decision tree classifiers, including ID3, C4.5, and CART. This subsection

presents Hunt’s algorithm and describes some of the design issues that must

be considered when building a decision tree.

Hunt’s Algorithm

In Hunt’s algorithm, a decision tree is grown in a recursive fashion. The tree

initially contains a single root node that is associated with all the training

instances. If a node is associated with instances from more than one class, it

is expanded using an attribute test condition that is determined using a

splitting criterion. A child leaf node is created for each outcome of the

attribute test condition and the instances associated with the parent node are

distributed to the children based on the test outcomes. This node expansion

step can then be recursively applied to each child node, as long as it has

labels of more than one class. If all the instances associated with a leaf node

have identical class labels, then the node is not expanded any further. Each

leaf node is assigned a class label that occurs most frequently in the training

instances associated with the node.

To illustrate how the algorithm works, consider the training set shown in Table

3.3 for the loan borrower classification problem. Suppose we apply Hunt’s

algorithm to fit the training data. The tree initially contains only a single leaf

node as shown in Figure 3.6(a) . This node is labeled as Defaulted = No,

since the majority of the borrowers did not default on their loan payments. The

training error of this tree is 30% as three out of the ten training instances have the class label Defaulted = Yes. The leaf node can therefore be further expanded because it contains training instances from more than one class.

Figure 3.6.

Hunt’s algorithm for building decision trees.

Let Home Owner be the attribute chosen to split the training instances. The justification for choosing this attribute as the attribute test condition will be discussed later. The resulting binary split on the Home Owner attribute is shown in Figure 3.6(b). All the training instances for which Home Owner = Yes are propagated to the left child of the root node and the rest are propagated to the right child. Hunt’s algorithm is then recursively applied to each child. The left child becomes a leaf node labeled Defaulted = No, since all instances associated with this node have identical class label Defaulted = No. The right child has instances from each class label. Hence, we split it further. The resulting subtrees after recursively expanding the right child are shown in Figures 3.6(c) and (d).

Hunt’s algorithm, as described above, makes some simplifying assumptions

that are often not true in practice. In the following, we describe these

assumptions and briefly discuss some of the possible ways for handling them.

1. Some of the child nodes created in Hunt’s algorithm can be empty if

none of the training instances have the particular attribute values. One

way to handle this is by declaring each of them as a leaf node with a

class label that occurs most frequently among the training instances

associated with their parent nodes.

2. If all training instances associated with a node have identical attribute

values but different class labels, it is not possible to expand this node

any further. One way to handle this case is to declare it a leaf node and

assign it the class label that occurs most frequently in the training

instances associated with this node.
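The recursive procedure and the two special cases above can be sketched in Python. This is a simplified illustration under several assumptions: the names `hunt` and `majority_label` are hypothetical, a leaf is represented by its label and an internal node by an `(attribute, children)` tuple, only nominal attributes are handled, and the splitting criterion is a placeholder that takes the first attribute that still varies (the actual criteria are discussed later in the chapter). The data are the Home Owner and Marital Status columns of Table 3.3.

```python
from collections import Counter


def majority_label(instances):
    """Most frequent class label among (attributes, label) pairs."""
    return Counter(label for _, label in instances).most_common(1)[0][0]


def hunt(instances, attributes, domains):
    labels = {label for _, label in instances}
    if len(labels) == 1:
        return labels.pop()  # pure node: stop expanding
    # Attributes that still take more than one value at this node.
    usable = [a for a in attributes if len({x[a] for x, _ in instances}) > 1]
    if not usable:
        # Case 2: identical attribute values but different class labels.
        return majority_label(instances)
    attribute = usable[0]  # placeholder for a real splitting criterion
    remaining = [a for a in attributes if a != attribute]
    children = {}
    for value in domains[attribute]:
        subset = [(x, y) for x, y in instances if x[attribute] == value]
        if subset:
            children[value] = hunt(subset, remaining, domains)
        else:
            # Case 1: empty child takes the parent's majority label.
            children[value] = majority_label(instances)
    return (attribute, children)


rows = [  # (Home Owner, Marital Status, Defaulted?) from Table 3.3
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Single", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]
data = [({"Home Owner": h, "Marital Status": m}, d) for h, m, d in rows]
domains = {"Home Owner": ["Yes", "No"],
           "Marital Status": ["Single", "Married", "Divorced"]}

tree = hunt(data, ["Home Owner", "Marital Status"], domains)
```

With only these two attributes, the non-home-owner, single borrowers end up with identical attribute values but mixed labels, so that node falls under the second case. In the full data set, Annual Income would be needed to separate them, which requires the continuous-attribute test conditions described later.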

Design Issues of Decision Tree Induction

Hunt’s algorithm is a generic procedure for growing decision trees in a greedy

fashion. To implement the algorithm, there are two key design issues that

must be addressed.

1. What is the splitting criterion? At each recursive step, an attribute

must be selected to partition the training instances associated with a

node into smaller subsets associated with its child nodes. The splitting

criterion determines which attribute is chosen as the test condition and

how the training instances should be distributed to the child nodes. This

will be discussed in Sections 3.3.2 and 3.3.3 .

2. What is the stopping criterion? The basic algorithm stops expanding

a node only when all the training instances associated with the node

have the same class labels or have identical attribute values. Although

these conditions are sufficient, there are reasons to stop expanding a

node much earlier even if the leaf node contains training instances from

more than one class. This process is called early termination and the

condition used to determine when a node should be stopped from

expanding is called a stopping criterion. The advantages of early

termination are discussed in Section 3.4 .

3.3.2 Methods for Expressing Attribute

Test Conditions

Decision tree induction algorithms must provide a method for expressing an

attribute test condition and its corresponding outcomes for different attribute

types.

Binary Attributes

The test condition for a binary attribute generates two potential outcomes, as

shown in Figure 3.7 .

Figure 3.7.

Attribute test condition for a binary attribute.

Nominal Attributes

Since a nominal attribute can have many values, its attribute test condition

can be expressed in two ways, as a multiway split or a binary split as shown in

Figure 3.8 . For a multiway split (Figure 3.8(a) ), the number of outcomes

depends on the number of distinct values for the corresponding attribute. For

example, if an attribute such as marital status has three distinct values—

single, married, or divorced—its test condition will produce a three-way split. It

is also possible to create a binary split by partitioning all values taken by the

nominal attribute into two groups. For example, some decision tree

algorithms, such as CART, produce only binary splits by considering all \(2^{k-1}-1\)

ways of creating a binary partition of k attribute values. Figure 3.8(b)

illustrates three different ways of grouping the attribute values for marital

status into two subsets.

Figure 3.8.

Attribute test conditions for nominal attributes.
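The number of candidate binary splits for a nominal attribute can be checked with a short enumeration. The following is a minimal Python sketch, not the procedure used by CART itself; the helper name `binary_partitions` is an assumption introduced for illustration, and the attribute values are assumed to be distinct. It pins the first value to one side so that mirror-image groupings are not counted twice.

```python
from itertools import combinations


def binary_partitions(values):
    """Yield every way to split distinct `values` into two non-empty groups.

    The first value always stays in the left group, so each unordered
    partition is produced exactly once.
    """
    values = list(values)
    first, rest = values[0], values[1:]
    for size in range(len(rest) + 1):
        for picked in combinations(rest, size):
            left = {first, *picked}
            right = set(values) - left
            if right:  # skip the trivial split whose right group is empty
                yield left, right


splits = list(binary_partitions(["Single", "Married", "Divorced"]))
```

For the three marital status values this produces 3 groupings, and for four values it produces 7, in line with the count of \(2^{k-1}-1\) binary partitions of k values.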

Ordinal Attributes

Ordinal attributes can also produce binary or multi-way splits. Ordinal attribute

values can be grouped as long as the grouping does not violate the order

property of the attribute values. Figure 3.9 illustrates various ways of

splitting training records based on the Shirt Size attribute. The groupings

shown in Figures 3.9(a) and (b) preserve the order among the attribute

values, whereas the grouping shown in Figure 3.9(c) violates this property

because it combines the attribute values Small and Large into the same

partition while Medium and Extra Large are combined into another partition.

Figure 3.9.

Different ways of grouping ordinal attribute values.
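The order constraint on ordinal groupings can be expressed as a contiguity check. The sketch below is illustrative only; the function names `ordered_binary_splits` and `preserves_order` are assumptions, and every group is assumed to be non-empty. A grouping preserves the order property when each group is a contiguous run of the ordered values.

```python
def ordered_binary_splits(ordered_values):
    """Return the k-1 binary groupings that keep the values in order."""
    values = list(ordered_values)
    return [(values[:cut], values[cut:]) for cut in range(1, len(values))]


def preserves_order(groups, ordered_values):
    """True if every (non-empty) group is a contiguous run of the ordered values."""
    rank = {value: i for i, value in enumerate(ordered_values)}
    for group in groups:
        ranks = sorted(rank[value] for value in group)
        if ranks != list(range(ranks[0], ranks[0] + len(ranks))):
            return False
    return True


sizes = ["Small", "Medium", "Large", "Extra Large"]
valid = preserves_order([{"Small", "Medium"}, {"Large", "Extra Large"}], sizes)
invalid = preserves_order([{"Small", "Large"}, {"Medium", "Extra Large"}], sizes)
```

Here `valid` is true, while `invalid` is false because Small and Large are grouped together without Medium, which is the kind of grouping the text rules out.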

Continuous Attributes

For continuous attributes, the attribute test condition can be expressed as a
comparison test (e.g., \(A < v\)) producing a binary split, or as a range query of the
form \(v_i \le A < v_{i+1}\), for \(i = 1, \ldots, k\), producing a multiway split. The
difference between these approaches is shown in Figure 3.10. For the binary split,
any possible value v between the minimum and maximum attribute values in
the training data can be used for constructing the comparison test \(A < v\).
However, it is sufficient to only consider distinct attribute values in the training
set as candidate split positions. For the multiway split, any possible collection
of attribute value ranges can be used, as long as they are mutually exclusive
and cover the entire range of attribute values between the minimum and
maximum values observed in the training set. One approach for constructing
multiway splits is to apply the discretization strategies described in Section
2.3.6 on page 63. After discretization, a new ordinal value is assigned to
each discretized interval, and the attribute test condition is then defined using
this newly constructed ordinal attribute.

Figure 3.10.

Test condition for continuous attributes.
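A range-query test condition amounts to locating the interval that contains an attribute value. The following sketch uses Python's standard `bisect` module; the function name `multiway_outcome` and the boundary values are hypothetical, and outcomes are numbered from 0 rather than from 1 as in the text.

```python
from bisect import bisect_right


def multiway_outcome(value, boundaries):
    """Return i such that boundaries[i] <= value < boundaries[i + 1].

    `boundaries` must be sorted; together the intervals are mutually
    exclusive and cover [boundaries[0], boundaries[-1]).
    """
    if not boundaries[0] <= value < boundaries[-1]:
        raise ValueError("value lies outside the covered range")
    return bisect_right(boundaries, value) - 1


# Hypothetical interval boundaries, for illustration only.
boundaries = [10, 25, 50, 80]
outcomes = [multiway_outcome(v, boundaries) for v in (10, 24.9, 25, 79)]
```

Because each interval is closed on the left and open on the right, a value equal to a boundary falls into the interval that starts at that boundary.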

3.3.3 Measures for Selecting an

Attribute Test Condition

There are many measures that can be used to determine the goodness of an

attribute test condition. These measures try to give preference to attribute test

conditions that partition the training instances into purer subsets in the child

nodes, which mostly have the same class labels. Having purer nodes is useful

since a node that has all of its training instances from the same class does not

need to be expanded further. In contrast, an impure node containing training

instances from multiple classes is likely to require several levels of node

expansions, thereby increasing the depth of the tree considerably. Larger

trees are less desirable as they are more susceptible to model overfitting, a

condition that may degrade the classification performance on unseen

instances, as will be discussed in Section 3.4 . They are also difficult to

interpret and incur more training and test time as compared to smaller trees.

In the following, we present different ways of measuring the impurity of a node

and the collective impurity of its child nodes, both of which will be used to

identify the best attribute test condition for a node.

Impurity Measure for a Single Node

The impurity of a node measures how dissimilar the class labels are for the
data instances belonging to a common node. Following are examples of
measures that can be used to evaluate the impurity of a node t:

\[ \text{Entropy} = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t), \tag{3.4} \]

\[ \text{Gini index} = 1 - \sum_{i=0}^{c-1} p_i(t)^2, \tag{3.5} \]

\[ \text{Classification error} = 1 - \max_i [p_i(t)], \tag{3.6} \]

where \(p_i(t)\) is the relative frequency of training instances that belong to class i
at node t, c is the total number of classes, and \(0 \log_2 0 = 0\) in entropy
calculations. All three measures give a zero impurity value if a node contains
instances from a single class and maximum impurity if the node has an equal
proportion of instances from multiple classes.

Figure 3.11 compares the relative magnitude of the impurity measures
when applied to binary classification problems. Since there are only two
classes, \(p_0(t) + p_1(t) = 1\). The horizontal axis p refers to the fraction of instances
that belong to one of the two classes. Observe that all three measures attain
their maximum value when the class distribution is uniform (i.e.,
\(p_0(t) = p_1(t) = 0.5\)) and minimum value when all the instances belong to a single
class (i.e., either \(p_0(t)\) or \(p_1(t)\) equals 1). The following examples illustrate
how the values of the impurity measures vary as we alter the class
distribution.

Figure 3.11.

Comparison among the impurity measures for binary classification problems.
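Equations (3.4) to (3.6) translate directly into code. The sketch below is a minimal Python rendering that takes a list of per-class instance counts for a node; the function names are chosen here for illustration, and the node is assumed to contain at least one instance.

```python
import math


def entropy(counts):
    """Equation (3.4), using the convention 0 * log2(0) = 0."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)


def gini(counts):
    """Equation (3.5)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def classification_error(counts):
    """Equation (3.6)."""
    n = sum(counts)
    return 1.0 - max(counts) / n


pure, skewed, uniform = [0, 6], [1, 5], [3, 3]
```

All three functions return 0 for the pure node and reach their maximum for the uniform node, which is the behavior described for Figure 3.11.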

Node \(N_1\): Class=0 count 0, Class=1 count 6

\[
\begin{aligned}
\text{Gini} &= 1 - (0/6)^2 - (6/6)^2 = 0 \\
\text{Entropy} &= -(0/6)\log_2(0/6) - (6/6)\log_2(6/6) = 0 \\
\text{Error} &= 1 - \max[0/6,\ 6/6] = 0
\end{aligned}
\]

Node \(N_2\): Class=0 count 1, Class=1 count 5

\[
\begin{aligned}
\text{Gini} &= 1 - (1/6)^2 - (5/6)^2 = 0.278 \\
\text{Entropy} &= -(1/6)\log_2(1/6) - (5/6)\log_2(5/6) = 0.650 \\
\text{Error} &= 1 - \max[1/6,\ 5/6] = 0.167
\end{aligned}
\]

Node \(N_3\): Class=0 count 3, Class=1 count 3

\[
\begin{aligned}
\text{Gini} &= 1 - (3/6)^2 - (3/6)^2 = 0.5 \\
\text{Entropy} &= -(3/6)\log_2(3/6) - (3/6)\log_2(3/6) = 1 \\
\text{Error} &= 1 - \max[3/6,\ 3/6] = 0.5
\end{aligned}
\]

Based on these calculations, node \(N_1\) has the lowest impurity value, followed
by \(N_2\) and \(N_3\). This example, along with Figure 3.11, shows the
consistency among the impurity measures, i.e., if a node \(N_1\) has lower
entropy than node \(N_2\), then the Gini index and error rate of \(N_1\) will also be
lower than that of \(N_2\). Despite their agreement, the attribute chosen as
splitting criterion by the impurity measures can still be different (see Exercise
6 on page 187).

Collective Impurity of Child Nodes

Consider an attribute test condition that splits a node containing N training
instances into k children, \(\{v_1, v_2, \cdots, v_k\}\), where every child node represents a
partition of the data resulting from one of the k outcomes of the attribute test
condition. Let \(N(v_j)\) be the number of training instances associated with a child
node \(v_j\), whose impurity value is \(I(v_j)\). Since a training instance in the parent
node reaches node \(v_j\) for a fraction \(N(v_j)/N\) of times, the collective impurity of
the child nodes can be computed by taking a weighted sum of the impurities
of the child nodes, as follows:

\[ I(\text{children}) = \sum_{j=1}^{k} \frac{N(v_j)}{N} I(v_j), \tag{3.7} \]

Example 3.3. Weighted Entropy

Consider the candidate attribute test conditions shown in Figures 3.12(a)
and (b) for the loan borrower classification problem. Splitting on the
Home Owner attribute will generate two child nodes
whose weighted entropy can be calculated as follows:

\[
\begin{aligned}
I(\text{Home Owner} = \text{yes}) &= -\frac{0}{3}\log_2\frac{0}{3} - \frac{3}{3}\log_2\frac{3}{3} = 0 \\
I(\text{Home Owner} = \text{no}) &= -\frac{3}{7}\log_2\frac{3}{7} - \frac{4}{7}\log_2\frac{4}{7} = 0.985 \\
I(\text{Home Owner}) &= \frac{3}{10} \times 0 + \frac{7}{10} \times 0.985 = 0.690
\end{aligned}
\]

Splitting on Marital Status, on the other hand, leads to three child nodes
with a weighted entropy given by

\[
\begin{aligned}
I(\text{Marital Status} = \text{Single}) &= -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971 \\
I(\text{Marital Status} = \text{Married}) &= -\frac{0}{3}\log_2\frac{0}{3} - \frac{3}{3}\log_2\frac{3}{3} = 0 \\
I(\text{Marital Status} = \text{Divorced}) &= -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1.000 \\
I(\text{Marital Status}) &= \frac{5}{10} \times 0.971 + \frac{3}{10} \times 0 + \frac{2}{10} \times 1 = 0.686
\end{aligned}
\]

Thus, Marital Status has a lower weighted entropy than Home Owner.

Figure 3.12.

Examples of candidate attribute test conditions.

Identifying the best attribute test condition

To determine the goodness of an attribute test condition, we need to compare
the degree of impurity of the parent node (before splitting) with the weighted
degree of impurity of the child nodes (after splitting). The larger their

difference, the better the test condition. This difference, \(\Delta\), also termed the
gain in purity of an attribute test condition, can be defined as follows:

\[ \Delta = I(\text{parent}) - I(\text{children}), \tag{3.8} \]

where I(parent) is the impurity of a node before splitting and I(children) is the
weighted impurity measure after splitting. It can be shown that the gain is
non-negative since \(I(\text{parent}) \ge I(\text{children})\) for any reasonable measure such as those
presented above. The higher the gain, the purer are the classes in the child
nodes relative to the parent node. The splitting criterion in the decision tree
learning algorithm selects the attribute test condition that shows the maximum
gain. Note that maximizing the gain at a given node is equivalent to
minimizing the weighted impurity measure of its children since I(parent) is the
same for all candidate attribute test conditions. Finally, when entropy is used
as the impurity measure, the difference in entropy is commonly known as
information gain, \(\Delta_{\text{info}}\).

Figure 3.13.

Splitting criteria for the loan borrower classification problem using Gini index.
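Equations (3.7) and (3.8) can be sketched in a few lines of Python. The function names below are assumptions made for illustration, and the class counts are those that appear in the entropy calculations of Example 3.3, written as pairs of counts per child node (the order of the two classes within a pair does not affect the entropy).

```python
import math


def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)


def weighted_impurity(children, impurity=entropy):
    """Equation (3.7): child impurities weighted by N(v_j) / N."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * impurity(child) for child in children)


def gain(parent, children, impurity=entropy):
    """Equation (3.8): impurity of the parent minus that of its children."""
    return impurity(parent) - weighted_impurity(children, impurity)


parent = [3, 7]
home_owner = [[0, 3], [3, 4]]              # Home Owner = yes, no
marital_status = [[2, 3], [0, 3], [1, 1]]  # Single, Married, Divorced

gain_home_owner = gain(parent, home_owner)
gain_marital_status = gain(parent, marital_status)
```

Computed without intermediate rounding, the results agree with the hand calculations to within about 0.001, and Marital Status still shows the larger gain.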

In the following, we present illustrative approaches for identifying the best

attribute test condition given qualitative or quantitative attributes.

Splitting of Qualitative Attributes

Consider the first two candidate splits shown in Figure 3.12 involving
qualitative attributes Home Owner and Marital Status. The initial class
distribution at the parent node is (0.3, 0.7), since there are 3 instances of
class Yes and 7 instances of class No in the training data. Thus,

\[ I(\text{parent}) = -\frac{3}{10}\log_2\frac{3}{10} - \frac{7}{10}\log_2\frac{7}{10} = 0.881 \]

The information gains for Home Owner and Marital Status are each given by

\[
\begin{aligned}
\Delta_{\text{info}}(\text{Home Owner}) &= 0.881 - 0.690 = 0.191 \\
\Delta_{\text{info}}(\text{Marital Status}) &= 0.881 - 0.686 = 0.195
\end{aligned}
\]

The information gain for Marital Status is thus higher due to its lower weighted
entropy, so Marital Status will be considered for splitting.

Binary Splitting of Qualitative Attributes

Consider building a decision tree using only binary splits and the Gini index as
the impurity measure. Figure 3.13 shows examples of four candidate
splitting criteria for the Home Owner and Marital Status attributes. Since there
are 3 borrowers in the training set who defaulted and 7 others who repaid their
loan (see Table in Figure 3.13), the Gini index of the parent node before
splitting is

\[ 1 - \left(\frac{3}{10}\right)^2 - \left(\frac{7}{10}\right)^2 = 0.420. \]

If Home Owner is chosen as the splitting attribute, the Gini indices for the child
nodes \(N_1\) and \(N_2\) are 0 and 0.490, respectively. The weighted average Gini
index for the children is

\[ (3/10) \times 0 + (7/10) \times 0.490 = 0.343, \]

where the weights represent the proportion of training instances assigned to
each child. The gain using Home Owner as splitting attribute is
\(0.420 - 0.343 = 0.077\). Similarly, we can apply a binary split on the
Marital Status attribute. However, since Marital Status is a nominal attribute with
three outcomes, there are three possible ways to group the attribute values
into a binary split. The weighted average Gini index of the children for each
candidate binary split is shown in Figure 3.13. Based on these results,
Home Owner and the last binary split using Marital Status are clearly the best
candidates, since they both produce the lowest weighted average Gini index.
Binary splits can also be used for ordinal attributes, if the binary partitioning of
the attribute values does not violate the ordering property of the values.

Binary Splitting of Quantitative Attributes

Consider the problem of identifying the best binary split Annual Income ≤ τ for
the preceding loan approval classification problem. As discussed previously,
even though τ can take any value between the minimum and maximum values
of annual income in the training set, it is sufficient to only consider the annual
income values observed in the training set as candidate split positions. For
each candidate τ, the training set is scanned once to count the number of
borrowers with annual income less than or greater than τ along with their class
proportions. We can then compute the Gini index at each candidate split
position and choose the τ that produces the lowest value. Computing the Gini
index at each candidate split position requires O(N) operations, where N is the
number of training instances. Since there are at most N possible candidates,
the overall complexity of this brute-force method is \(O(N^2)\). It is possible to
reduce the complexity of this problem to O(N log N) by using a method
described as follows (see illustration in Figure 3.14). In this method, we
first sort the training instances based on their annual income, a one-time cost
that requires O(N log N) operations. The candidate split positions are given by
the midpoints between every two adjacent sorted values: $55,000, $65,000,
$72,500, and so on. For the first candidate, since none of the instances has
an annual income less than or equal to $55,000, the Gini index for the child
node with Annual Income ≤ $55,000 is equal to zero. In contrast, there are 3
training instances of class Yes and 7 instances of class No with annual
income greater than $55,000. The Gini index for this node is 0.420. The
weighted average Gini index for the first candidate split position, τ = $55,000, is
equal to 0 × 0 + 1 × 0.420 = 0.420.

Figure 3.14.

Splitting continuous attributes.

For the next candidate, τ = $65,000, the class distribution of its child nodes can
be obtained with a simple update of the distribution for the previous candidate.
This is because, as τ increases from $55,000 to $65,000, there is only one
training instance affected by the change. By examining the class label of the
affected training instance, the new class distribution is obtained. For example,
as τ increases to $65,000, there is only one borrower in the training set, with
an annual income of $60,000, affected by this change. Since the class label
for the borrower is No, the count for class No increases from 0 to 1 (for
Annual Income ≤ $65,000) and decreases from 7 to 6 (for
Annual Income > $65,000), as shown in Figure 3.14. The distribution for the
Yes class remains unaffected. The updated Gini index for this candidate split
position is 0.400.

This procedure is repeated until the Gini indices for all candidates are found.
The best split position corresponds to the one that produces the lowest Gini
index, which occurs at τ = $97,500. Since the Gini index at each candidate split
position can be computed in O(1) time, the complexity of finding the best split
position is O(N) once all the values are kept sorted, a one-time operation that
takes O(N log N) time. The overall complexity of this method is thus O(N log
N), which is much smaller than the \(O(N^2)\) time taken by the brute-force
method. The amount of computation can be further reduced by considering
only candidate split positions located between two adjacent sorted instances
with different class labels. For example, we do not need to consider candidate
split positions located between $60,000 and $75,000 because all three
instances with annual income in this range ($60,000, $70,000, and $75,000)
have the same class labels. Choosing a split position within this range only
increases the degree of impurity, compared to a split position located outside
this range. Therefore, the candidate split positions at τ = $65,000 and
τ = $72,500 can be ignored. Similarly, we do not need to consider the candidate
split positions at $87,500, $92,500, $110,000, $122,500, and $172,500
because they are located between two adjacent instances with the same
labels. This strategy reduces the number of candidate split positions to
consider from 9 to 2 (excluding the two boundary cases τ = $55,000 and
τ = $230,000).
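The sorted sweep described above can be sketched as follows. This is an illustrative Python version rather than a reference implementation: the function names are assumptions, attribute values are assumed to be distinct, and the income values are reconstructed from the midpoints quoted in the text, with class labels assumed to match the described distribution of 3 Yes and 7 No.

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0


def best_threshold(values, labels):
    """Return (threshold, weighted Gini) for the best split `value <= threshold`."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    pairs = sorted(zip(values, labels))   # one-time O(N log N) sort
    n = len(pairs)
    left = [0] * len(classes)             # class counts for value <= threshold
    right = [0] * len(classes)            # class counts for value > threshold
    for _, label in pairs:
        right[index[label]] += 1
    best = (None, float("inf"))
    for i in range(n - 1):
        value, label = pairs[i]
        left[index[label]] += 1           # incremental update: one instance moves
        right[index[label]] -= 1
        next_value, next_label = pairs[i + 1]
        if label == next_label:
            continue                      # skip midpoints between equal labels
        threshold = (value + next_value) / 2
        score = (i + 1) / n * gini(left) + (n - i - 1) / n * gini(right)
        if score < best[1]:
            best = (threshold, score)
    return best


# Assumed data: incomes reconstructed from the quoted midpoints, labels assumed.
incomes = [60_000, 70_000, 75_000, 85_000, 90_000,
           95_000, 100_000, 120_000, 125_000, 220_000]
defaulted = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
threshold, score = best_threshold(incomes, defaulted)
```

Under these assumptions only two interior midpoints are evaluated, and the sweep selects a threshold of $97,500 with a weighted Gini index of 0.300.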

Gain Ratio

One potential limitation of impurity measures such as entropy and Gini index
is that they tend to favor qualitative attributes with a large number of distinct
values. Figure 3.12 shows three candidate attributes for partitioning the
data set given in Table 3.3. As previously mentioned, the attribute
Marital Status is a better choice than the attribute Home Owner, because it provides a
larger information gain. However, if we compare them against the third candidate attribute,
the latter produces the purest partitions with the maximum information gain,
since the weighted entropy and Gini index are equal to zero for its children. Yet,
this attribute is not a good attribute for splitting because it has a unique value
for each instance. Even though a test condition involving it will
accurately classify every instance in the training data, we cannot use such a
test condition on new test instances whose values for this attribute haven't been
seen before during training. This example suggests that having a low impurity
value alone is insufficient to find a good attribute test condition for a node. As
we will see later in Section 3.4, having a larger number of child nodes can
make a decision tree more complex and consequently more susceptible to
overfitting. Hence, the number of children produced by the splitting attribute
should also be taken into consideration while deciding the best attribute test
condition.

There are two ways to overcome this problem. One way is to generate only

binary decision trees, thus avoiding the difficulty of handling attributes with

varying number of partitions. This strategy is employed by decision tree

classifiers such as CART. Another way is to modify the splitting criterion to

take into account the number of partitions produced by the attribute. For

example, in the C4.5 decision tree algorithm, a measure known as gain ratio

is used to compensate for attributes that produce a large number of child

nodes. This measure is computed as follows:

\[ \text{Gain ratio} = \frac{\Delta_{\text{info}}}{\text{Split Info}} = \frac{\text{Entropy(Parent)} - \sum_{i=1}^{k} \frac{N(v_i)}{N}\,\text{Entropy}(v_i)}{-\sum_{i=1}^{k} \frac{N(v_i)}{N} \log_2 \frac{N(v_i)}{N}} \tag{3.9} \]

where \(N(v_i)\) is the number of instances assigned to node \(v_i\) and k is the total
number of splits. The split information measures the entropy of splitting a
node into its child nodes and evaluates if the split results in a larger number of
equally-sized child nodes or not. For example, if every partition has the same
number of instances, then \(\forall i: N(v_i)/N = 1/k\) and the split information would be
equal to \(\log_2 k\). Thus, if an attribute produces a large number of splits, its split
information is also large, which in turn, reduces the gain ratio.

Example 3.4. Gain Ratio

Consider the data set given in Exercise 2 on page 185. We want to select
the best attribute test condition among three candidate attributes.
The entropy before splitting is

\[ \text{Entropy(parent)} = -\frac{10}{20}\log_2\frac{10}{20} - \frac{10}{20}\log_2\frac{10}{20} = 1. \]

If the first attribute, which produces two child nodes of 10 instances each, is
used as attribute test condition:

\[
\begin{aligned}
\text{Entropy(children)} &= \frac{10}{20}\left[-\frac{6}{10}\log_2\frac{6}{10} - \frac{4}{10}\log_2\frac{4}{10}\right] \times 2 = 0.971 \\
\text{Gain Ratio} &= \frac{1 - 0.971}{-\frac{10}{20}\log_2\frac{10}{20} - \frac{10}{20}\log_2\frac{10}{20}} = \frac{0.029}{1} = 0.029
\end{aligned}
\]

If the second attribute, which produces three child nodes of 4, 8, and 8
instances, is used as attribute test condition:

\[
\begin{aligned}
\text{Entropy(children)} &= \frac{4}{20}\left[-\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4}\right] + \frac{8}{20} \times 0 + \frac{8}{20}\left[-\frac{1}{8}\log_2\frac{1}{8} - \frac{7}{8}\log_2\frac{7}{8}\right] = 0.380 \\
\text{Gain Ratio} &= \frac{1 - 0.380}{-\frac{4}{20}\log_2\frac{4}{20} - \frac{8}{20}\log_2\frac{8}{20} - \frac{8}{20}\log_2\frac{8}{20}} = \frac{0.620}{1.52} \approx 0.41
\end{aligned}
\]

Finally, if the third attribute, which produces 20 child nodes of one instance
each, is used as attribute test condition:

\[
\begin{aligned}
\text{Entropy(children)} &= \frac{1}{20}\left[-\frac{1}{1}\log_2\frac{1}{1} - \frac{0}{1}\log_2\frac{0}{1}\right] \times 20 = 0 \\
\text{Gain Ratio} &= \frac{1 - 0}{-\frac{1}{20}\log_2\frac{1}{20} \times 20} = \frac{1}{4.32} = 0.23
\end{aligned}
\]

Thus, even though the third attribute has the highest information gain, its gain
ratio is lower than that of the second attribute since it produces a larger number
of splits.

3.3.4 Algorithm for Decision Tree Induction

Algorithm 3.1 presents pseudocode for the decision tree induction algorithm.
The input to this algorithm is a set of training instances E along with the
attribute set F. The algorithm works by recursively selecting the best attribute
to split the data (Step 7) and expanding the nodes of the tree (Steps 11 and
12) until the stopping criterion is met (Step 1). The details of this algorithm are
explained below.

Algorithm 3.1 A skeleton decision tree induction algorithm.

1. The node-creation function extends the decision tree by creating a new
node. A node in the decision tree either has a test condition, denoted
as node.test_cond, or a class label, denoted as node.label.

2. The split-selection function determines the attribute test condition
for partitioning the training instances associated with a node. The
splitting attribute chosen depends on the impurity measure used. The
popular measures include entropy and the Gini index.

3. The labeling function determines the class label to be assigned to a
leaf node. For each leaf node t, let \(p(i|t)\) denote the fraction of training
instances from class i associated with the node t. The label assigned to
the leaf node is typically the one that occurs most frequently in the
training instances that are associated with this node,

\[ \text{leaf.label} = \underset{i}{\operatorname{argmax}}\; p(i|t), \tag{3.10} \]

where the argmax operator returns the class i that maximizes \(p(i|t)\).
Besides providing the information needed to determine the class label
of a leaf node, \(p(i|t)\) can also be used as a rough estimate of the
probability that an instance assigned to the leaf node t belongs to class
i. Sections 4.11.2 and 4.11.4 in the next chapter describe how
such probability estimates can be used to determine the performance
of a decision tree under different cost functions.

4. The stopping-condition function is used to terminate the tree-growing
process by checking whether all the instances have identical class
labels or attribute values. Since decision tree classifiers employ a
top-down, recursive partitioning approach for building a model, the number
of training instances associated with a node decreases as the depth of
the tree increases. As a result, a leaf node may contain too few training
instances to make a statistically significant decision about its class
label. This is known as the data fragmentation problem. One way to
avoid this problem is to disallow splitting of a node when the number of
instances associated with the node falls below a certain threshold. A
more systematic way to control the size of a decision tree (number of
leaf nodes) will be discussed in Section 3.5.4.
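The recursive procedure described in the four points above can be sketched in Python. This is a hedged illustration, not a transcription of Algorithm 3.1: all function names (`tree_growth`, `find_best_split`, `stopping_cond`, `classify`), the dictionary-based node layout, and the `min_size` parameter are assumptions made here, and the sketch is restricted to multiway splits on categorical attributes scored with the Gini index.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def find_best_split(rows, labels, attributes):
    """Pick the attribute whose multiway split has the lowest weighted Gini."""
    def weighted_gini(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())
    return min(attributes, key=weighted_gini)


def stopping_cond(rows, labels, attributes, min_size):
    """Stop on identical labels, identical attribute values, or too few rows."""
    same_labels = len(set(labels)) == 1
    same_values = all(row[a] == rows[0][a] for row in rows for a in attributes)
    return same_labels or same_values or len(rows) < min_size


def tree_growth(rows, labels, attributes, min_size=1):
    majority = Counter(labels).most_common(1)[0][0]   # most frequent class
    if stopping_cond(rows, labels, attributes, min_size):
        return {"label": majority}
    attr = find_best_split(rows, labels, attributes)
    node = {"label": majority, "test_cond": attr, "children": {}}
    remaining = [a for a in attributes if a != attr]
    for value in sorted({row[attr] for row in rows}):
        keep = [i for i, row in enumerate(rows) if row[attr] == value]
        node["children"][value] = tree_growth(
            [rows[i] for i in keep], [labels[i] for i in keep], remaining, min_size)
    return node


def classify(node, row):
    """Follow test conditions to a leaf; fall back to the majority label."""
    while "children" in node:
        child = node["children"].get(row[node["test_cond"]])
        if child is None:          # attribute value unseen during training
            break
        node = child
    return node["label"]


# Hypothetical toy data, for illustration only.
rows = [{"owner": "yes", "status": "single"}, {"owner": "yes", "status": "married"},
        {"owner": "no", "status": "single"}, {"owner": "no", "status": "married"}]
labels = ["No", "No", "Yes", "No"]
tree = tree_growth(rows, labels, ["owner", "status"])
```

Raising `min_size` above 1 implements the threshold on the number of instances mentioned in connection with the data fragmentation problem, at the cost of leaving some leaves impure.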

3.3.5 Example Application: Web Robot

Detection

Consider the task of distinguishing the access patterns of web robots from

those generated by human users. A web robot (also known as a web crawler)

is a software program that automatically retrieves files from one or more

websites by following the hyperlinks extracted from an initial set of seed

URLs. These programs have been deployed for various purposes, from

gathering web pages on behalf of search engines to more malicious activities

such as spamming and committing click fraud in online advertisements.

Figure 3.15.

Input data for web robot detection.

The web robot detection problem can be cast as a binary classification task.

The input data for the classification task is a web server log, a sample of

which is shown in Figure 3.15(a) . Each line in the log file corresponds to a

request made by a client (i.e., a human user or a web robot) to the web

server. The fields recorded in the web log include the client’s IP address,

timestamp of the request, URL of the requested file, size of the file, and user

agent, which is a field that contains identifying information about the client.

For human users, the user agent field specifies the type of web browser or

mobile device used to fetch the files, whereas for web robots, it should

technically contain the name of the crawler program. However, web robots

may conceal their true identities by declaring their user agent fields to be

identical to known browsers. Therefore, user agent is not a reliable field to

detect web robots.

The first step toward building a classification model is to precisely define a

data instance and associated attributes. A simple approach is to consider

each log entry as a data instance and use the appropriate fields in the log file

as its attribute set. This approach, however, is inadequate for several reasons.

First, many of the attributes are nominal-valued and have a wide range of

domain values. For example, the number of unique client IP addresses, URLs,

and referrers in a log file can be very large. These attributes are undesirable

for building a decision tree because their split information is extremely high

(see Equation (3.9) ). In addition, it might not be possible to classify test

instances containing IP addresses, URLs, or referrers that are not present in

the training data. Finally, by considering each log entry as a separate data

instance, we disregard the sequence of web pages retrieved by the client—a

critical piece of information that can help distinguish web robot accesses from

those of a human user.
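The penalty that split information places on many-valued attributes such as IP addresses can be illustrated numerically. The sketch below evaluates Equation (3.9) in Python; the function names are assumptions, each split is assumed to have at least two non-empty children, and the class counts are the ones that appear in the calculations of Example 3.4.

```python
import math


def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)


def split_info(children):
    """Entropy of the partition sizes N(v_i)/N, the denominator of (3.9)."""
    return entropy([sum(child) for child in children])


def gain_ratio(parent, children):
    """Information gain divided by split information, as in Equation (3.9)."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return (entropy(parent) - weighted) / split_info(children)


parent = [10, 10]
two_way = [[6, 4], [4, 6]]
three_way = [[1, 3], [8, 0], [1, 7]]
one_per_instance = [[1, 0]] * 10 + [[0, 1]] * 10   # a unique value per instance

ratios = [gain_ratio(parent, split)
          for split in (two_way, three_way, one_per_instance)]
```

The split with one child per instance has the largest information gain, yet its split information of \(\log_2 20\) pulls its gain ratio below that of the three-way split.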

A better alternative is to consider each web session as a data instance. A web

session is a sequence of requests made by a client during a given visit to the

website. Each web session can be modeled as a directed graph, in which the

nodes correspond to web pages and the edges correspond to hyperlinks

connecting one web page to another. Figure 3.15(b) shows a graphical

representation of the first web session given in the log file. Every web session

can be characterized using some meaningful attributes about the graph that

contain discriminatory information. Figure 3.15(c) shows some of the

attributes extracted from the graph, including the depth and breadth of its

corresponding tree rooted at the entry point to the website. For example, the

depth and breadth of the tree shown in Figure 3.15(b) are both equal to

two.

The derived attributes shown in Figure 3.15(c) are more informative than

the original attributes given in the log file because they characterize the

behavior of the client at the website. Using this approach, a data set

containing 2916 instances was created, with equal numbers of sessions due

to web robots (class 1) and human users (class 0). 10% of the data were

reserved for training while the remaining 90% were used for testing. The

induced decision tree is shown in Figure 3.16 , which has an error rate

equal to 3.8% on the training set and 5.3% on the test set. In addition to its

low error rate, the tree also reveals some interesting properties that can help

discriminate web robots from human users:

1. Accesses by web robots tend to be broad but shallow, whereas

accesses by human users tend to be more focused (narrow but deep).

2. Web robots seldom retrieve the image pages associated with a web

page.

3. Sessions due to web robots tend to be long and contain a large number

of requested pages.

4. Web robots are more likely to make repeated requests for the same

web page than human users since the web pages retrieved by human

users are often cached by the browser.

3.3.6 Characteristics of Decision Tree

Classifiers

The following is a summary of the important characteristics of decision tree

induction algorithms.

1. Applicability: Decision trees are a nonparametric approach for

building classification models. This approach does not require any prior

assumption about the probability distribution governing the class and

attributes of the data, and thus, is applicable to a wide variety of data

sets. It is also applicable to both categorical and continuous data

without requiring the attributes to be transformed into a common

representation via binarization, normalization, or standardization.

Unlike some binary classifiers described in Chapter 4 , it can also

deal with multiclass problems without the need to decompose them into

multiple binary classification tasks. Another appealing feature of

decision tree classifiers is that the induced trees, especially the shorter

ones, are relatively easy to interpret. The accuracies of the trees are

also quite comparable to other classification techniques for many

simple data sets.

2. Expressiveness: A decision tree provides a universal representation

for discrete-valued functions. In other words, it can encode any function

of discrete-valued attributes. This is because every discrete-valued

function can be represented as an assignment table, where every

unique combination of discrete attributes is assigned a class label.

Since every combination of attributes can be represented as a leaf in

the decision tree, we can always find a decision tree whose label

assignments at the leaf nodes match the assignment table of

the original function. Decision trees can also help in providing compact

representations of functions when some of the unique combinations of

attributes can be represented by the same leaf node. For example,

Figure 3.17 shows the assignment table of the Boolean function
(A∧B)∨(C∧D) involving four binary attributes, resulting in a total of
\(2^4 = 16\) possible assignments. The tree shown in Figure 3.17 shows
a compressed encoding of this assignment table. Instead of requiring a
fully-grown tree with 16 leaf nodes, it is possible to encode the function
using a simpler tree with only 7 leaf nodes. Nevertheless, not all
decision trees for discrete-valued attributes can be simplified. One
notable example is the parity function, whose value is 1 when there is
an even number of true values among its Boolean attributes, and 0
otherwise. Accurate modeling of such a function requires a full decision
tree with \(2^d\) nodes, where d is the number of Boolean attributes (see
Exercise 1 on page 185).

Figure 3.16.

Decision tree model for web robot detection.

Figure 3.17.

Decision tree for the Boolean function (A∧B)∨(C∧D).

3. Computational Efficiency: Since the number of possible decision

trees can be very large, many decision tree algorithms employ a

heuristic-based approach to guide their search in the vast hypothesis

space. For example, the algorithm presented in Section 3.3.4 uses

a greedy, top-down, recursive partitioning strategy for growing a

decision tree. For many data sets, such techniques quickly construct a

reasonably good decision tree even when the training set size is very

large. Furthermore, once a decision tree has been built, classifying a

test record is extremely fast, with a worst-case complexity of O(w),

where w is the maximum depth of the tree.

4. Handling Missing Values: A decision tree classifier can handle

missing attribute values in a number of ways, both in the training and

the test sets. When there are missing values in the test set, the

classifier must decide which branch to follow if the value of a splitting

node attribute is missing for a given test instance. One approach,

known as the probabilistic split method, which is employed by the

C4.5 decision tree classifier, distributes the data instance to every child

of the splitting node according to the probability that the missing

attribute has a particular value. In contrast, the CART algorithm uses

the surrogate split method, where the instance whose splitting

attribute value is missing is assigned to one of the child nodes based

on the value of another non-missing surrogate attribute whose splits

most resemble the partitions made by the missing attribute. Another

approach, known as the separate class method, is used by the CHAID

algorithm, where the missing value is treated as a separate categorical

value distinct from other values of the splitting attribute. Figure 3.18

shows an example of the three different ways for handling missing

values in a decision tree classifier. Other strategies for dealing with

missing values are based on data preprocessing, where the instance

with missing value is either imputed with the mode (for categorical

attribute) or mean (for continuous attribute) value or discarded before

the classifier is trained.

Figure 3.18.

Methods for handling missing attribute values in decision tree classifier.

During training, if an attribute v has missing values in some of the

training instances associated with a node, we need a way to measure

the gain in purity if v is used for splitting. One simple way is to exclude

instances with missing values of v in the counting of instances

associated with every child node, generated for every possible

outcome of v. Further, if v is chosen as the attribute test condition at a

node, training instances with missing values of v can be propagated to

the child nodes using any of the methods described above for handling

missing values in test instances.

5. Handling Interactions among Attributes: Attributes are considered

interacting if they are able to distinguish between classes when used

together, but individually they provide little or no information. Due to the

greedy nature of the splitting criteria in decision trees, such attributes

could be passed over in favor of other attributes that are not as useful.

This could result in more complex decision trees than necessary.

Hence, decision trees can perform poorly when there are interactions

among attributes.

To illustrate this point, consider the three-dimensional data shown in

Figure 3.19(a), which contains 2000 data points from one of two

classes, denoted as + and ∘ in the diagram. Figure 3.19(b) shows

the distribution of the two classes in the two-dimensional space

involving attributes X and Y, which is a noisy version of the XOR

Boolean function. We can see that even though the two classes are

well-separated in this two-dimensional space, neither of the two

attributes contains sufficient information to distinguish between the two

classes when used alone. For example, the entropies of the following

attribute test conditions, X ≤ 10 and Y ≤ 10, are close to 1, indicating that

neither X nor Y provides any reduction in the impurity measure when

used individually. X and Y thus represent a case of interaction among

attributes. The data set also contains a third attribute, Z, in which both

classes are distributed uniformly, as shown in Figures 3.19(c) and

3.19(d), and hence, the entropy of any split involving Z is close to 1.

As a result, Z is as likely to be chosen for splitting as the interacting but

useful attributes, X and Y . For further illustration of this issue, readers

are referred to Example 3.7 in Section 3.4.1 and Exercise 7 at

the end of this chapter.

Figure 3.19.

Example of XOR data involving X and Y, along with an irrelevant

attribute Z.

6. Handling Irrelevant Attributes: An attribute is irrelevant if it is not

useful for the classification task. Since irrelevant attributes are poorly

associated with the target class labels, they will provide little or no gain

in purity and thus will be passed over in favor of other more relevant features.

Hence, the presence of a small number of irrelevant attributes will not

impact the decision tree construction process. However, not all

attributes that provide little to no gain are irrelevant (see Figure

3.19 ). Hence, if the classification problem is complex (e.g., involving

interactions among attributes) and there are a large number of

irrelevant attributes, then some of these attributes may be accidentally

chosen during the tree-growing process, since they may provide a

better gain than a relevant attribute just by random chance. Feature

selection techniques can help to improve the accuracy of decision trees

by eliminating the irrelevant attributes during preprocessing. We will

investigate the issue of too many irrelevant attributes in Section

3.4.1 .

7. Handling Redundant Attributes: An attribute is redundant if it is

strongly correlated with another attribute in the data. Since redundant

attributes show similar gains in purity if they are selected for splitting,

only one of them will be selected as an attribute test condition in the

decision tree algorithm. Decision trees can thus handle the presence of

redundant attributes.

8. Using Rectilinear Splits: The test conditions described so far in this

chapter involve using only a single attribute at a time. As a

consequence, the tree-growing procedure can be viewed as the

process of partitioning the attribute space into disjoint regions until

each region contains records of the same class. The border between

two neighboring regions of different classes is known as a decision

boundary. Figure 3.20 shows the decision tree as well as the

decision boundary for a binary classification problem. Since the test

condition involves only a single attribute, the decision boundaries are

rectilinear; i.e., parallel to the coordinate axes. This limits the

expressiveness of decision trees in representing decision boundaries of

data sets with continuous attributes. Figure 3.21 shows a two-

dimensional data set involving binary classes that cannot be perfectly

classified by a decision tree whose attribute test conditions are defined

based on single attributes. The binary classes in the data set are

generated from two skewed Gaussian distributions, centered at (8,8)

and (12,12), respectively. The true decision boundary is represented by

the diagonal dashed line, whereas the rectilinear decision boundary

produced by the decision tree classifier is shown by the thick solid line.

In contrast, an oblique decision tree may overcome this limitation by

allowing the test condition to be specified using more than one

attribute. For example, the binary classification data shown in Figure

3.21 can be easily represented by an oblique decision tree with a

single root node with test condition x + y < 20.

Figure 3.20.

Example of a decision tree and its decision boundaries for a two-dimensional data set.

Figure 3.21.

Example of a data set that cannot be partitioned optimally using a

decision tree with single attribute test conditions. The true decision

boundary is shown by the dashed line.

Although an oblique decision tree is more expressive and can produce

more compact trees, finding the optimal test condition is

computationally more expensive.
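The gap between a rectilinear and an oblique test condition can be illustrated with a small simulation. The sketch below is only an approximation of the setting in Figure 3.21: it assumes isotropic Gaussians (standard deviation 2, chosen arbitrarily) centered at (8, 8) and (12, 12) rather than the skewed ones in the figure, and it compares a single axis-parallel split against the single oblique test x + y < 20, not full trees. All names are illustrative.

```python
import random

random.seed(1)
SIGMA = 2.0  # assumed spread; the figure's actual distributions are skewed

def sample(center, n):
    """Draw n two-dimensional points around (center, center)."""
    return [(random.gauss(center, SIGMA), random.gauss(center, SIGMA))
            for _ in range(n)]

class_a = sample(8.0, 1000)    # expected to satisfy x + y < 20
class_b = sample(12.0, 1000)   # expected to violate it

def accuracy(test):
    """Accuracy of a rule that assigns class_a whenever test(point) is True."""
    correct = sum(test(p) for p in class_a) + sum(not test(p) for p in class_b)
    return correct / (len(class_a) + len(class_b))

# Best single-attribute test of the form x < t or y < t, t in {0.0, 0.1, ..., 20.0}.
axis_parallel = max(
    accuracy(lambda p, t=t, a=a: p[a] < t)
    for a in (0, 1)
    for t in [i / 10 for i in range(201)]
)

# Single oblique test using both attributes at once.
oblique = accuracy(lambda p: p[0] + p[1] < 20)

print(f"best axis-parallel split: {axis_parallel:.3f}")
print(f"oblique split x + y < 20: {oblique:.3f}")
```

Under these assumptions the oblique test should come out ahead, because the diagonal boundary uses the separation along both attributes at once, whereas a rectilinear tree needs many splits to approximate it.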

9. Choice of Impurity Measure: It should be noted that the choice of

impurity measure often has little effect on the performance of decision

tree classifiers since many of the impurity measures are quite

consistent with each other, as shown in Figure 3.11 on page 129.

Instead, the strategy used to prune the tree has a greater impact on the

final tree than the choice of impurity measure.
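The consistency among impurity measures mentioned in item 9 can be checked numerically for the two-class case. The sketch below assumes the standard definitions of entropy, Gini index, and classification error as functions of the fraction p of instances in one class. The function names are illustrative only.

```python
import math

def entropy(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(p):
    """Gini index for two classes."""
    return 1 - p ** 2 - (1 - p) ** 2

def classification_error(p):
    """Error of predicting the majority class."""
    return 1 - max(p, 1 - p)

measures = {"entropy": entropy, "gini": gini, "error": classification_error}
grid = [i / 20 for i in range(21)]   # p = 0.00, 0.05, ..., 1.00

for name, f in measures.items():
    peak = max(grid, key=f)
    print(f"{name:8s} peaks at p = {peak:.2f} with value {f(peak):.3f}")
```

All three measures are zero for a pure node and largest for a 50/50 node, so they tend to rank candidate splits similarly even though their numerical values differ.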

3.4 Model Overfitting

Methods presented so far try to learn classification models that show the

lowest error on the training set. However, as we will show in the following

example, even if a model fits well over the training data, it can still show poor

generalization performance, a phenomenon known as model overfitting.

Figure 3.22.

Examples of training and test sets of a two-dimensional classification problem.

Figure 3.23.

Effect of varying tree size (number of leaf nodes) on training and test errors.

Example 3.5. Overfitting and Underfitting of Decision Trees

Consider the two-dimensional data set shown in Figure 3.22(a) . The

data set contains instances that belong to two separate classes,

represented as + and ∘, respectively, where each class has 5400

instances. All instances belonging to the ∘ class were generated from a

uniform distribution. For the + class, 5000 instances were generated from

a Gaussian distribution centered at (10,10) with unit variance, while the

remaining 400 instances were sampled from the same uniform distribution

as the ∘ class. We can see from Figure 3.22(a) that the + class can be

largely distinguished from the ∘ class by drawing a circle of appropriate

size centered at (10,10). To learn a classifier using this two-dimensional

data set, we randomly sampled 10% of the data for training and used the

remaining 90% for testing. The training set, shown in Figure 3.22(b),

looks quite representative of the overall data. We used the Gini index as the

impurity measure to construct decision trees of increasing sizes (number of

leaf nodes), by recursively expanding a node into child nodes till every leaf

node was pure, as described in Section 3.3.4 .

Figure 3.23(a) shows changes in the training and test error rates as the

size of the tree varies from 1 to 8. Both error rates are initially large when

the tree has only one or two leaf nodes. This situation is known as model

underfitting. Underfitting occurs when the learned decision tree is too

simplistic, and thus, incapable of fully representing the true relationship

between the attributes and the class labels. As we increase the tree size

from 1 to 8, we can observe two effects. First, both the error rates

decrease since larger trees are able to represent more complex decision

boundaries. Second, the training and test error rates are quite close to

each other, which indicates that the performance on the training set is fairly

representative of the generalization performance. As we further increase

the size of the tree from 8 to 150, the training error continues to steadily

decrease till it eventually reaches zero, as shown in Figure 3.23(b) .

However, in a striking contrast, the test error rate ceases to decrease any

further beyond a certain tree size, and then it begins to increase. The

training error rate thus grossly under-estimates the test error rate once the

tree becomes too large. Further, the gap between the training and test

error rates keeps on widening as we increase the tree size. This behavior,

which may seem counter-intuitive at first, can be attributed to the

phenomenon of model overfitting.
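The behavior described in this example can be reproduced in miniature. The sketch below is not the authors' experiment: it assumes a scaled-down data set (1080 instances per class instead of 5400), a uniform range of [0, 20] on both attributes, and it controls tree size through a maximum depth rather than the number of leaf nodes. The tree learner is a bare-bones greedy Gini splitter written for illustration.

```python
import random

random.seed(0)

def make_data(n_gauss=1000, n_noise=80):
    """Class 1 (+): Gaussian at (10, 10) plus uniform noise. Class 0: uniform."""
    uniform = lambda: (random.uniform(0, 20), random.uniform(0, 20))
    plus = [(random.gauss(10, 1), random.gauss(10, 1)) for _ in range(n_gauss)]
    plus += [uniform() for _ in range(n_noise)]
    circle = [uniform() for _ in range(n_gauss + n_noise)]
    data = [(p, 1) for p in plus] + [(p, 0) for p in circle]
    random.shuffle(data)
    return data

def gini(labels):
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows):
    """Return the (attribute, threshold) pair with the lowest weighted Gini index."""
    best, best_score = None, float("inf")
    for a in (0, 1):
        rows_sorted = sorted(rows, key=lambda r: r[0][a])
        labels = [y for _, y in rows_sorted]
        for j in range(1, len(rows)):
            lo, hi = rows_sorted[j - 1][0][a], rows_sorted[j][0][a]
            if lo == hi:
                continue  # equal values cannot be separated by a threshold
            score = j * gini(labels[:j]) + (len(rows) - j) * gini(labels[j:])
            if score < best_score:
                best, best_score = (a, lo), score
    return best

def build(rows, max_depth):
    """Grow a tree; a leaf is an int label, an internal node is a tuple."""
    labels = [y for _, y in rows]
    majority = int(2 * sum(labels) >= len(labels))
    if max_depth == 0 or len(set(labels)) == 1:
        return majority
    split = best_split(rows)
    if split is None:
        return majority
    a, thr = split
    left = [r for r in rows if r[0][a] <= thr]
    right = [r for r in rows if r[0][a] > thr]
    return (a, thr, build(left, max_depth - 1), build(right, max_depth - 1))

def predict(tree, point):
    while isinstance(tree, tuple):
        a, thr, left, right = tree
        tree = left if point[a] <= thr else right
    return tree

def error_rate(tree, rows):
    return sum(predict(tree, p) != y for p, y in rows) / len(rows)

def count_leaves(tree):
    if not isinstance(tree, tuple):
        return 1
    return count_leaves(tree[2]) + count_leaves(tree[3])

data = make_data()
n_train = len(data) // 10                 # 10% for training, 90% for testing
train, test = data[:n_train], data[n_train:]

results = []
for depth in (1, 2, 4, 8, len(train)):    # the last value allows full growth
    tree = build(train, depth)
    results.append((depth, count_leaves(tree),
                    error_rate(tree, train), error_rate(tree, test)))

for depth, leaves, tr_err, te_err in results:
    print(f"depth<={depth:3d} leaves={leaves:3d} train={tr_err:.3f} test={te_err:.3f}")
```

The training error can only go down as the depth limit is relaxed and reaches zero for the fully grown tree, while the test error does not follow it down, which is the gap the example attributes to overfitting.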

3.4.1 Reasons for Model Overfitting

Model overfitting is the phenomenon where, in the pursuit of minimizing the

training error rate, an overly complex model is selected that captures specific

patterns in the training data but fails to learn the true nature of relationships

between attributes and class labels in the overall data. To illustrate this,

Figure 3.24 shows decision trees and their corresponding decision

boundaries (shaded rectangles represent regions assigned to the + class) for

two trees of sizes 5 and 50. We can see that the decision tree of size 5

appears quite simple and its decision boundaries provide a reasonable

approximation to the ideal decision boundary, which in this case corresponds

to a circle centered around the Gaussian distribution at (10, 10). Although its

training and test error rates are non-zero, they are very close to each other,

which indicates that the patterns learned in the training set should generalize

well over the test set. On the other hand, the decision tree of size 50 appears

much more complex than the tree of size 5, with complicated decision

boundaries. For example, some of its shaded rectangles (assigned the +

class) attempt to cover narrow regions in the input space that contain only one

or two + training instances. Note that the prevalence of + instances in such

regions is highly specific to the training set, as these regions are mostly

dominated by ∘ instances in the overall data. Hence, in an attempt to perfectly

fit the training data, the decision tree of size 50 starts fine-tuning itself to

specific patterns in the training data, leading to poor performance on an

independently chosen test set.

Figure 3.24.

Decision trees with different model complexities.

Figure 3.25.

Performance of decision trees using 20% of the data for training (twice the original

training size).

There are a number of factors that influence model overfitting. In the following,

we provide brief descriptions of two of the major factors: limited training size

and high model complexity. Though they are not exhaustive, the interplay

between them can help explain most of the common model overfitting

phenomena in real-world applications.

Limited Training Size

Note that a training set consisting of a finite number of instances can only

provide a limited representation of the overall data. Hence, it is possible that

the patterns learned from a training set do not fully represent the true patterns

in the overall data, leading to model overfitting. In general, as we increase the

size of a training set (number of training instances), the patterns learned from

the training set start resembling the true patterns in the overall data. Hence,

the effect of overfitting can be reduced by increasing the training size, as

illustrated in the following example.

Example 3.6. Effect of Training Size

Suppose that we use twice as many training instances as we

used in the experiments conducted in Example 3.5. Specifically, we

use 20% of the data for training and use the remainder for testing. Figure

3.25(b) shows the training and test error rates as the size of the tree is

varied from 1 to 150. There are two major differences in the trends shown

in this figure and those shown in Figure 3.23(b) (using only 10% of the

data for training). First, even though the training error rate decreases with

increasing tree size in both figures, its rate of decrease is much smaller

when we use twice the training size. Second, for a given tree size, the gap

between the training and test error rates is much smaller when we use

twice the training size. These differences suggest that the patterns learned

using 20% of data for training are more generalizable than those learned

using 10% of data for training.

Figure 3.25(a) shows the decision boundaries for the tree of size 50,

learned using 20% of data for training. In contrast to the tree of the same

size learned using 10% of the data for training (see Figure 3.24(d)), we can

see that the decision tree is not capturing specific patterns of noisy +

instances in the training set. Instead, the high model complexity of 50 leaf

nodes is being effectively used to learn the boundaries of the + instances

centered at (10, 10).

High Model Complexity

Generally, a more complex model has a better ability to represent complex

patterns in the data. For example, decision trees with a larger number of leaf

nodes can represent more complex decision boundaries than decision trees

with fewer leaf nodes. However, an overly complex model also has a tendency

to learn specific patterns in the training set that do not generalize well over

unseen instances. Models with high complexity should thus be judiciously

used to avoid overfitting.

One measure of model complexity is the number of “parameters” that need to

be inferred from the training set. For example, in the case of decision tree

induction, the attribute test conditions at internal nodes correspond to the

parameters of the model that need to be inferred from the training set. A

decision tree with a larger number of attribute test conditions (and consequently

more leaf nodes) thus involves more “parameters” and hence is more

complex.

Given a class of models with a certain number of parameters, a learning

algorithm attempts to select the best combination of parameter values that

maximizes an evaluation metric (e.g., accuracy) over the training set. If the

number of parameter value combinations (and hence the complexity) is large,

the learning algorithm has to select the best combination from a large number

of possibilities, using a limited training set. In such cases, there is a high

chance for the learning algorithm to pick a spurious combination of

parameters that maximizes the evaluation metric just by random chance. This

is similar to the multiple comparisons problem (also referred to as the multiple

testing problem) in statistics.

As an illustration of the multiple comparisons problem, consider the task of

predicting whether the stock market will rise or fall in the next ten trading days.

If a stock analyst simply makes random guesses, the probability that her

prediction is correct on any trading day is 0.5. However, the probability that

she will predict correctly at least nine out of ten times is

$$\frac{\binom{10}{9} + \binom{10}{10}}{2^{10}} = 0.0107,$$

which is extremely low.

Suppose we are interested in choosing an investment advisor from a pool of

200 stock analysts. Our strategy is to select the analyst who makes the most

correct predictions in the next ten trading days. The flaw in this

strategy is that even if all the analysts make their predictions in a random

fashion, the probability that at least one of them makes at least nine correct

predictions is

$$1 - (1 - 0.0107)^{200} = 0.8847,$$

which is very high. Although each analyst has a low probability of predicting at

least nine times correctly, considered together, we have a high probability of

finding at least one analyst who can do so. However, there is no guarantee

that such an analyst will continue to make accurate predictions in the future by

random guessing.

How does the multiple comparisons problem relate to model overfitting? In the

context of learning a classification model, each combination of parameter

values corresponds to an analyst, while the number of training instances

corresponds to the number of days. Analogous to the task of selecting the

best analyst who makes the most accurate predictions on consecutive days,

the task of a learning algorithm is to select the best combination of parameters

that results in the highest accuracy on the training set. If the number of

parameter combinations is large but the training size is small, it is highly likely

for the learning algorithm to choose a spurious parameter combination that

provides high training accuracy just by random chance. In the following

example, we illustrate the phenomenon of overfitting due to multiple

comparisons in the context of decision tree induction.
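The two probabilities in the stock analyst illustration can be verified directly. The short sketch below assumes independent fair-coin guesses for every analyst and every day, as in the text. The variable names are illustrative.

```python
from math import comb

n_days = 10
n_analysts = 200

# P(a single random guesser gets at least 9 of 10 days right)
p_single = sum(comb(n_days, k) for k in (9, 10)) / 2 ** n_days

# P(at least one of 200 independent random guessers does so)
p_any = 1 - (1 - p_single) ** n_analysts

print(f"single analyst: {p_single:.4f}")
print(f"any of {n_analysts}:     {p_any:.4f}")
```

The second value is computed from the unrounded single-analyst probability (11/1024), which is why it agrees with the 0.8847 quoted in the text.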

Figure 3.26.

Example of a two-dimensional (X-Y) data set.

Figure 3.27.

Training and test error rates illustrating the effect of multiple comparisons

problem on model overfitting.

Example 3.7. Multiple Comparisons and Overfitting

Consider the two-dimensional data set shown in Figure 3.26 containing

500 + and 500 ∘ instances, which is similar to the data shown in Figure

3.19. In this data set, the distributions of both classes are well-separated

in the two-dimensional (XY) attribute space, but neither of the two attributes

(X or Y) is sufficiently informative to be used alone for separating the two

classes. Hence, splitting the data set based on any value of an X or Y

attribute will provide close to zero reduction in an impurity measure.

However, if X and Y attributes are used together in the splitting criterion

(e.g., splitting X at 10 and Y at 10), the two classes can be effectively

separated.
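The claim that X and Y are uninformative alone but informative together can be checked on synthetic XOR-like data. The sketch below assumes 1000 points drawn uniformly from [0, 20] × [0, 20], labels given by the XOR of the tests X ≤ 10 and Y ≤ 10, and 5% label noise. These settings are guesses that mimic the figure and are not taken from it. Entropy is used as the impurity measure.

```python
import math
import random

random.seed(0)

def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def weighted_entropy(groups):
    """Size-weighted average entropy over the partitions of a split."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * entropy(g) for g in groups)

data = []
for _ in range(1000):
    x, y = random.uniform(0, 20), random.uniform(0, 20)
    label = int((x <= 10) != (y <= 10))   # XOR of the two test outcomes
    if random.random() < 0.05:            # assumed 5% label noise
        label = 1 - label
    data.append((x, y, label))

def split_on(attr_indices):
    """Group labels by the outcomes of 'attribute <= 10' for the given attributes."""
    groups = {}
    for row in data:
        key = tuple(row[i] <= 10 for i in attr_indices)
        groups.setdefault(key, []).append(row[2])
    return list(groups.values())

parent = entropy([row[2] for row in data])
gain_x = parent - weighted_entropy(split_on([0]))
gain_y = parent - weighted_entropy(split_on([1]))
gain_xy = parent - weighted_entropy(split_on([0, 1]))

print(f"gain from X <= 10 alone:       {gain_x:.3f}")
print(f"gain from Y <= 10 alone:       {gain_y:.3f}")
print(f"gain from both tests together: {gain_xy:.3f}")
```

A greedy tree learner only sees the first two numbers when it picks a root split, which is why interacting attributes like these are easy to pass over.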

Figure 3.28.

Decision tree with 6 leaf nodes using X and Y as attributes. Splits have

been numbered from 1 to 5 in order of their occurrence in the tree.

Figure 3.27(a) shows the training and test error rates for learning

decision trees of varying sizes, when 30% of the data is used for training

and the remainder of the data for testing. We can see that the two classes

can be separated using a small number of leaf nodes. Figure 3.28

shows the decision boundaries for the tree with six leaf nodes, where the

splits have been numbered according to their order of appearance in the

tree. Note that even though splits 1 and 3 provide trivial gains, their

consequent splits (2, 4, and 5) provide large gains, resulting in effective

discrimination of the two classes.

Assume we add 100 irrelevant attributes to the two-dimensional X-Y data.

Learning a decision tree from this resultant data will be challenging

because the number of candidate attributes to choose for splitting at every

internal node will increase from two to 102. With such a large number of

candidate attribute test conditions to choose from, it is quite likely that

spurious attribute test conditions will be selected at internal nodes because

of the multiple comparisons problem. Figure 3.27(b) shows the training

and test error rates after adding 100 irrelevant attributes to the training set.

We can see that the test error rate remains close to 0.5 even after using 50

leaf nodes, while the training error rate keeps on declining and eventually

becomes 0.
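The effect of many irrelevant attributes on a single split choice can be simulated in a few lines. The sketch below is a deliberate simplification of the example: it assumes 100 binary attributes that are pure noise, random class labels, only 50 training instances, and a one-level tree (a stump) that predicts the majority class in each branch. None of these settings come from the text.

```python
import random

random.seed(0)

def fit_stump(values, labels):
    """Majority class for each of the two attribute values (0 and 1)."""
    rule = {}
    for v in (0, 1):
        branch = [y for x, y in zip(values, labels) if x == v]
        rule[v] = int(2 * sum(branch) >= len(branch))
    return rule

def accuracy(rule, values, labels):
    return sum(rule[x] == y for x, y in zip(values, labels)) / len(labels)

n_train, n_test, n_attrs = 50, 5000, 100

def random_bits(n):
    return [random.randint(0, 1) for _ in range(n)]

y_train, y_test = random_bits(n_train), random_bits(n_test)
# One column of values per irrelevant attribute.
X_train = [random_bits(n_train) for _ in range(n_attrs)]
X_test = [random_bits(n_test) for _ in range(n_attrs)]

# Pick the attribute whose stump looks best on the training set.
scores = []
for j in range(n_attrs):
    rule = fit_stump(X_train[j], y_train)
    scores.append((accuracy(rule, X_train[j], y_train), j, rule))
best_train_acc, best_j, best_rule = max(scores)

best_test_acc = accuracy(best_rule, X_test[best_j], y_test)

print(f"best training accuracy over {n_attrs} noise attributes: {best_train_acc:.2f}")
print(f"test accuracy of that same stump: {best_test_acc:.2f}")
```

Because the winner is chosen from 100 candidates using only 50 instances, its training accuracy is inflated by chance, while its accuracy on fresh data stays near 0.5.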

3.5 Model Selection

There are many possible classification models with varying levels of model

complexity that can be used to capture patterns in the training data. Among

these possibilities, we want to select the model that shows the lowest

generalization error rate. The process of selecting a model with the right level

of complexity, which is expected to generalize well over unseen test

instances, is known as model selection. As described in the previous

section, the training error rate cannot be reliably used as the sole criterion for

model selection. In the following, we present three generic approaches to

estimate the generalization performance of a model that can be used for

model selection. We conclude this section by presenting specific strategies for

using these approaches in the context of decision tree induction.

3.5.1 Using a Validation Set

Note that we can always estimate the generalization error rate of a model by

using “out-of-sample” estimates, i.e. by evaluating the model on a separate

validation set that is not used for training the model. The error rate on the

validation set, termed as the validation error rate, is a better indicator of

generalization performance than the training error rate, since the validation set

has not been used for training the model. The validation error rate can be

used for model selection as follows.

Given a training set D.train, we can partition D.train into two smaller subsets,

D.tr and D.val, such that D.tr is used for training while D.val is used as the

validation set. For example, two-thirds of D.train can be reserved as D.tr for

training, while the remaining one-third is used as D.val for computing

validation error rate. For any choice of classification model m that is trained on

D.tr, we can estimate its validation error rate on D.val, $err_{val}(m)$. The model

that shows the lowest value of $err_{val}(m)$ can then be selected as the preferred

choice of model.

The use of a validation set provides a generic approach for model selection.

However, one limitation of this approach is that it is sensitive to the sizes of

D.tr and D.val, obtained by partitioning D.train. If the size of D.tr is too small, it

may result in the learning of a poor classification model with sub-standard

performance, since a smaller training set will be less representative of the

overall data. On the other hand, if the size of D.val is too small, the validation

error rate might not be reliable for selecting models, as it would be computed

over a small number of instances.

Figure 3.29.

Class distribution of validation data for the two decision trees shown in Figure

3.30.
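The validation-set procedure of Section 3.5.1 can be sketched as follows. The candidate models here are hypothetical stand-ins: one-dimensional classifiers that predict the majority class within each of n_bins equal-width bins, so that n_bins plays the role of model complexity. The synthetic data and all names are assumptions made for illustration.

```python
import random

random.seed(0)

def make_instance():
    """One-dimensional instance: class 1 if x > 0.5, with 20% label noise."""
    x = random.random()
    y = int(x > 0.5)
    if random.random() < 0.2:
        y = 1 - y
    return x, y

D_train = [make_instance() for _ in range(300)]

# Reserve two-thirds of D.train as D.tr and one-third as D.val.
cut = 2 * len(D_train) // 3
D_tr, D_val = D_train[:cut], D_train[cut:]

def bin_index(x, n_bins):
    return min(int(x * n_bins), n_bins - 1)

def fit(data, n_bins):
    """Learn the majority class of every bin from the given data."""
    votes = [0] * n_bins
    for x, y in data:
        votes[bin_index(x, n_bins)] += 1 if y else -1
    return [int(v >= 0) for v in votes]

def error_rate(model, data):
    n_bins = len(model)
    wrong = sum(model[bin_index(x, n_bins)] != y for x, y in data)
    return wrong / len(data)

candidates = [1, 2, 4, 8, 16, 32, 64]
err_tr = {k: error_rate(fit(D_tr, k), D_tr) for k in candidates}
err_val = {k: error_rate(fit(D_tr, k), D_val) for k in candidates}

# Select the candidate with the lowest validation error rate.
best_k = min(candidates, key=err_val.get)

for k in candidates:
    print(f"bins={k:2d} train={err_tr[k]:.3f} val={err_val[k]:.3f}")
print("selected:", best_k)
```

Every model is trained on D.tr only, so the error on D.val is an out-of-sample estimate. The training error is printed alongside to show that it alone would favor the most complex candidates.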

Example 3.8. Validation Error

In the following example, we illustrate one possible approach for using a

validation set in decision tree induction. Figure 3.29 shows the

predicted labels at the leaf nodes of the decision trees generated in Figure

3.30. The counts given beneath the leaf nodes represent the proportion

of data instances in the validation set that reach each of the nodes. Based

on the predicted labels of the nodes, the validation error rate for the left

tree is $err_{val}(T_L) = 6/16 = 0.375$, while the validation error rate for the right

tree is $err_{val}(T_R) = 4/16 = 0.25$. Based on their validation error rates, the right

tree is preferred over the left one.

3.5.2 Incorporating Model Complexity

Since the chance for model overfitting increases as the model becomes more

complex, a model selection approach should not only consider the training

error rate but also the model complexity. This strategy is inspired by a

well-known principle called Occam’s razor or the principle of parsimony,

which suggests that given two models with the same errors, the simpler model

is preferred over the more complex model. A generic approach to account for

model complexity while estimating generalization performance is formally

described as follows.

Given a training set D.train, let us consider learning a classification model m

that belongs to a certain class of models, $\mathcal{M}$. For example, if $\mathcal{M}$ represents the

set of all possible decision trees, then m can correspond to a specific decision

tree learned from the training set. We are interested in estimating the

generalization error rate of m, gen.error(m). As discussed previously, the

training error rate of m, train.error(m, D.train), can under-estimate

gen.error(m) when the model complexity is high. Hence, we represent

gen.error(m) as a function of not just the training error rate but also the model

complexity of $\mathcal{M}$, $\text{complexity}(\mathcal{M})$, as follows:

$$\text{gen.error}(m) = \text{train.error}(m, \text{D.train}) + \alpha \times \text{complexity}(\mathcal{M}), \tag{3.11}$$

where α is a hyper-parameter that strikes a balance between minimizing

training error and reducing model complexity. A higher value of α gives more

emphasis to the model complexity in the estimation of generalization

performance. To choose the right value of α, we can make use of the

validation set in a similar way as described in Section 3.5.1. For example, we can

iterate through a range of values of α and for every possible value, we can

learn a model on a subset of the training set, D.tr, and compute its validation

error rate on a separate subset, D.val. We can then select the value of α that

provides the lowest validation error rate.

Equation 3.11 provides one possible approach for incorporating model

complexity into the estimate of generalization performance. This approach is

at the heart of a number of techniques for estimating generalization

performance, such as the structural risk minimization principle, Akaike’s

Information Criterion (AIC), and the Bayesian Information Criterion (BIC). The

structural risk minimization principle serves as the building block for learning

support vector machines, which will be discussed later in Chapter 4. For

more details on AIC and BIC, see the Bibliographic Notes.

In the following, we present two different approaches for estimating the

complexity of a model, $\text{complexity}(\mathcal{M})$. While the former is specific to decision

trees, the latter is more generic and can be used with any class of models.

Estimating the Complexity of Decision Trees

In the context of decision trees, the complexity of a decision tree can be

estimated as the ratio of the number of leaf nodes to the number of training

instances. Let k be the number of leaf nodes and $N_{train}$ be the number of

training instances. The complexity of a decision tree can then be described as

$k/N_{train}$. This reflects the intuition that for a larger training size, we can learn a

decision tree with a larger number of leaf nodes without it becoming overly

complex. The generalization error rate of a decision tree T can then be

computed using Equation 3.11 as follows:

$$err_{gen}(T) = err(T) + \Omega \times \frac{k}{N_{train}},$$

where err(T) is the training error of the decision tree and Ω is a

hyper-parameter that makes a trade-off between reducing training error and

minimizing model complexity, similar to the use of α in Equation 3.11. Ω

can be viewed as the cost of adding a leaf node relative to incurring a

training error. In the literature on decision tree induction, the above approach

for estimating generalization error rate is also termed the pessimistic

error estimate. It is called pessimistic as it assumes the generalization error

rate to be worse than the training error rate (by adding a penalty term for

model complexity). On the other hand, simply using the training error rate as

an estimate of the generalization error rate is called the optimistic error

estimate or the resubstitution estimate.

Example 3.9. Generalization Error Estimates

Consider the two binary decision trees, $T_L$ and $T_R$, shown in Figure

3.30. Both trees are generated from the same training data and $T_L$ is

generated by expanding three leaf nodes of $T_R$. The counts shown in the

leaf nodes of the trees represent the class distribution of the training

instances. If each leaf node is labeled according to the majority class of

training instances that reach the node, the training error rate for the left

tree will be $err(T_L) = 4/24 = 0.167$, while the training error rate for the right

tree will be $err(T_R) = 6/24 = 0.25$. Based on their training error rates alone, $T_L$

would be preferred over $T_R$, even though $T_L$ is more complex (contains a

larger number of leaf nodes) than $T_R$.

Figure 3.30.

Example of two decision trees generated from the same training data.

Now, assume that the cost associated with each leaf node is Ω = 0.5. Then,

the generalization error estimate for $T_L$ will be

$$err_{gen}(T_L) = \frac{4}{24} + 0.5 \times \frac{7}{24} = \frac{7.5}{24} = 0.3125$$

and the generalization error estimate for $T_R$ will be

$$err_{gen}(T_R) = \frac{6}{24} + 0.5 \times \frac{4}{24} = \frac{8}{24} = 0.3333.$$

Since $T_L$ has a lower generalization error rate, it will still be preferred over

$T_R$. Note that Ω = 0.5 implies that a node should always be expanded into

its two child nodes if it improves the prediction of at least one training

instance, since expanding a node is less costly than misclassifying a

training instance. On the other hand, if Ω = 1, then the generalization error

rate for $T_L$ is $err_{gen}(T_L) = 11/24 = 0.458$ and for $T_R$ is

$err_{gen}(T_R) = 10/24 = 0.417$. In this case, $T_R$ will be preferred over $T_L$

because it has a lower generalization error rate. This example illustrates

that different choices of Ω can change our preference of decision trees

based on their generalization error estimates. However, for a given choice

of Ω, the pessimistic error estimate provides an approach for modeling the

generalization performance on unseen test instances. The value of Ω can

be selected with the help of a validation set.

Minimum Description Length Principle

Another way to incorporate model complexity is based on an information-

theoretic approach known as the minimum description length or MDL

principle. To illustrate this approach, consider the example shown in Figure

3.31. In this example, both person A and person B are given a set of

instances with known attribute values x. Assume person A knows the class

label y for every instance, while person B has no such information. A would

like to share the class information with B by sending a message containing

the labels. The message would contain Θ(N) bits of information, where N is

the number of instances.

Figure 3.31.

An illustration of the minimum description length principle.

Alternatively, instead of sending the class labels explicitly, A can build a

classification model from the instances and transmit it to B. B can then apply

the model to determine the class labels of the instances. If the model is 100%

accurate, then the cost for transmission is equal to the number of bits required

to encode the model. Otherwise, A must also transmit information about

which instances are misclassified by the model so that B can reproduce the

same class labels. Thus, the overall transmission cost, which is equal to the

total description length of the message, is

$$\text{Cost}(\text{model}, \text{data}) = \text{Cost}(\text{data} \mid \text{model}) + \alpha \times \text{Cost}(\text{model}), \tag{3.12}$$

where the first term on the right-hand side is the number of bits needed to

encode the misclassified instances, while the second term is the number of

bits required to encode the model. There is also a hyper-parameter α that

trades off the relative costs of the misclassified instances and the model.

Notice the similarity between this equation and the generic equation for

generalization error rate presented in Equation 3.11 . A good model must

have a total description length less than the number of bits required to encode

the entire sequence of class labels. Furthermore, given two competing

models, the model with lower total description length is preferred. An example

showing how to compute the total description length of a decision tree is given

in Exercise 10 on page 189.
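Equations 3.11 and 3.12 share the same shape: a data-fit term plus a weighted complexity term. As a small numerical companion, the sketch below recomputes the pessimistic error estimates of Example 3.9 from the counts given there (4 errors and 7 leaves for the left tree, 6 errors and 4 leaves for the right tree, 24 training instances). The function and variable names are illustrative.

```python
def pessimistic_error(n_errors, n_leaves, n_train, omega):
    """err_gen(T) = err(T) + omega * k / N_train, with err(T) = n_errors / n_train."""
    return (n_errors + omega * n_leaves) / n_train

N_TRAIN = 24
trees = {"T_L": (4, 7), "T_R": (6, 4)}   # (training errors, leaf nodes)

preferred = {}
for omega in (0.5, 1.0):
    estimates = {name: pessimistic_error(e, k, N_TRAIN, omega)
                 for name, (e, k) in trees.items()}
    preferred[omega] = min(estimates, key=estimates.get)
    summary = ", ".join(f"{name}={value:.4f}" for name, value in estimates.items())
    print(f"omega={omega}: {summary} -> prefer {preferred[omega]}")
```

Raising the per-leaf cost from 0.5 to 1 flips the preference from the larger tree to the smaller one, which is the sensitivity to the hyper-parameter that the example points out.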

3.5.3 Estimating Statistical Bounds

Instead of using Equation 3.11 to estimate the generalization error rate of a

model, an alternative way is to apply a statistical correction to the training

error rate of the model that is indicative of its model complexity. This can be

done if the probability distribution of training error is available or can be

assumed. For example, the number of errors committed by a leaf node in a

decision tree can be assumed to follow a binomial distribution. We can thus

compute an upper bound on the observed training error rate that can be

used for model selection, as illustrated in the following example.
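The upper bound used in Example 3.10 (Equation 3.13) is easy to evaluate numerically. The sketch below assumes that the standardized value is the upper α/2 quantile of the standard normal distribution, obtained here from Python's statistics.NormalDist. The function name e_upper is illustrative, and the inputs are the node counts quoted in the example.

```python
from math import sqrt
from statistics import NormalDist

def e_upper(n, e, alpha):
    """Upper bound on a node's error rate e observed over n training instances."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    numerator = e + z ** 2 / (2 * n) + z * sqrt(e * (1 - e) / n + z ** 2 / (4 * n ** 2))
    return numerator / (1 + z ** 2 / n)

ALPHA = 0.25

# Node before splitting: 2 errors out of 7 instances.
before = 7 * e_upper(7, 2 / 7, ALPHA)

# Child nodes after splitting: 1 error out of 4 and 1 error out of 3.
after = 4 * e_upper(4, 1 / 4, ALPHA) + 3 * e_upper(3, 1 / 3, ALPHA)

print(f"bound before split: {e_upper(7, 2 / 7, ALPHA):.3f} ({before:.2f} errors)")
print(f"bounds after split: {e_upper(4, 1 / 4, ALPHA):.3f}, "
      f"{e_upper(3, 1 / 3, ALPHA):.3f} ({after:.2f} errors)")
```

The bound widens as the number of instances at a node shrinks, so splitting a small node can raise the estimated number of errors even when the raw training error does not get worse.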

Example 3.10. Statistical Bounds on Training Error

Consider the left-most branch of the binary decision trees shown in Figure

3.30. Observe that the left-most leaf node of $T_R$ has been expanded

into two child nodes in $T_L$. Before splitting, the training error rate of the

node is 2/7 = 0.286. By approximating a binomial distribution with a normal

distribution, the following upper bound of the training error rate e can be

derived:

$$e_{upper}(N, e, \alpha) = \frac{e + \dfrac{z_{\alpha/2}^2}{2N} + z_{\alpha/2}\sqrt{\dfrac{e(1-e)}{N} + \dfrac{z_{\alpha/2}^2}{4N^2}}}{1 + \dfrac{z_{\alpha/2}^2}{N}}, \tag{3.13}$$

where α is the confidence level, $z_{\alpha/2}$ is the standardized value from a

standard normal distribution, and N is the total number of training

instances used to compute e. Substituting α = 25%, N = 7, and e = 2/7, the

upper bound for the error rate is $e_{upper}(7, 2/7, 0.25) = 0.503$, which

corresponds to 7 × 0.503 = 3.521 errors. If we expand the node into its child

nodes as shown in $T_L$, the training error rates for the child nodes are

1/4 = 0.250 and 1/3 = 0.333, respectively. Using Equation (3.13), the

upper bounds of these error rates are $e_{upper}(4, 1/4, 0.25) = 0.537$ and

$e_{upper}(3, 1/3, 0.25) = 0.650$, respectively. The overall training error of the

child nodes is 4 × 0.537 + 3 × 0.650 = 4.098, which is larger than the estimated

error for the corresponding node in $T_R$, suggesting that it should not be

split.

3.5.4 Model Selection for Decision Trees

Building on the generic approaches presented above, we present two

commonly used model selection strategies for decision tree induction.

Prepruning (Early Stopping Rule)

In this approach, the tree-growing algorithm is halted before generating a fully

grown tree that perfectly fits the entire training data. To do this, a more

restrictive stopping condition must be used; e.g., stop expanding a leaf node

when the observed gain in the generalization error estimate falls below a

certain threshold. This estimate of the generalization error rate can be

computed using any of the approaches presented in the preceding three

subsections, e.g., by using pessimistic error estimates, by using validation

error estimates, or by using statistical bounds. The advantage of prepruning is

that it avoids the computations associated with generating overly complex

subtrees that overfit the training data. However, one major drawback of this

method is that, even if no significant gain is obtained using one of the existing

splitting criterion, subsequent splitting may result in better subtrees. Such

subtrees would not be reached if prepruning is used because of the greedy

nature of decision tree induction.

Post-pruning

In this approach, the decision tree is initially grown to its maximum size. This

is followed by a tree-pruning step, which proceeds to trim the fully grown tree

in a bottom-up fashion. Trimming can be done by replacing a subtree with (1)

a new leaf node whose class label is determined from the majority class of

instances affiliated with the subtree (approach known as subtree

replacement), or (2) the most frequently used branch of the subtree

(approach known as subtree raising). The tree-pruning step terminates when

no further improvement in the generalization error estimate is observed

beyond a certain threshold. Again, the estimates of generalization error rate

can be computed using any of the approaches presented in the previous three

subsections. Post-pruning tends to give better results than prepruning

because it makes pruning decisions based on a fully grown tree, unlike

prepruning, which can suffer from premature termination of the tree-growing

process. However, for post-pruning, the additional computations needed to

grow the full tree may be wasted when the subtree is pruned.
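The bottom-up trimming described above can be sketched for the subtree replacement case. This is a minimal illustration, not the procedure of any particular decision tree package: the `Node` class is hypothetical, the generalization error is estimated pessimistically as training errors plus an assumed fixed penalty per leaf, and the improvement threshold is taken to be zero.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    counts: dict                                   # class label -> training instances at this node
    children: list = field(default_factory=list)   # empty list means the node is a leaf

def errors_as_leaf(node):
    """Training errors if the node predicts its majority class."""
    return sum(node.counts.values()) - max(node.counts.values())

def subtree_estimate(node, penalty):
    """Pessimistic error estimate: training errors plus a penalty per leaf."""
    if not node.children:
        return errors_as_leaf(node) + penalty
    return sum(subtree_estimate(child, penalty) for child in node.children)

def post_prune(node, penalty=0.5):
    """Prune bottom-up, replacing a subtree with a majority-class leaf
    whenever the leaf's estimate is no worse than the subtree's."""
    for child in node.children:
        post_prune(child, penalty)
    if node.children and errors_as_leaf(node) + penalty <= subtree_estimate(node, penalty):
        node.children = []  # subtree replacement
    return node

# A split that does not reduce training errors is pruned away.
noisy = Node({"+": 5, "-": 2}, [Node({"+": 3, "-": 1}), Node({"+": 2, "-": 1})])
post_prune(noisy)

# A split that separates the classes perfectly is kept.
useful = Node({"a": 5, "b": 5}, [Node({"a": 5}), Node({"b": 5})])
post_prune(useful)
```

Subtree raising would need the branch frequencies as well and is left out of this sketch.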

Figure 3.32 illustrates the simplified decision tree model for the web robot detection example given in Section 3.3.5. Notice that the subtree rooted at depth=1 has been replaced by one of its branches corresponding to breadth<=7, width>3, and MultiP=1, using subtree raising. On the other hand, the subtree corresponding to depth>1 and MultiAgent=0 has been replaced by a leaf node assigned to class 0, using subtree replacement. The subtree for depth>1 and MultiAgent=1 remains intact.

Figure 3.32. Post-pruning of the decision tree for web robot detection.

3.6 Model Evaluation

The previous section discussed several approaches for model selection that

can be used to learn a classification model from a training set D.train. Here we

discuss methods for estimating its generalization performance, i.e. its

performance on unseen instances outside of D.train. This process is known as

model evaluation.

Note that model selection approaches discussed in Section 3.5 also

compute an estimate of the generalization performance using the training set

D.train. However, these estimates are biased indicators of the performance on

unseen instances, since they were used to guide the selection of the classification

model. For example, if we use the validation error rate for model selection (as

described in Section 3.5.1 ), the resulting model would be deliberately

chosen to minimize the errors on the validation set. The validation error rate

may thus under-estimate the true generalization error rate, and hence cannot

be reliably used for model evaluation.

A correct approach for model evaluation would be to assess the performance of a learned model on a labeled test set that has not been used at any stage of model selection. This can be achieved by partitioning the entire set of labeled instances D into two disjoint subsets: D.train, which is used for model selection, and D.test, which is used for computing the test error rate, $err_{test}$. In the following, we present two different approaches for partitioning D into D.train and D.test, and computing the test error rate, $err_{test}$.

3.6.1 Holdout Method

The most basic technique for partitioning a labeled data set is the holdout method, where the labeled set D is randomly partitioned into two disjoint sets, called the training set D.train and the test set D.test. A classification model is then induced from D.train using the model selection approaches presented in Section 3.5, and its error rate on D.test, $err_{test}$, is used as an estimate of the generalization error rate. The proportion of data reserved for training and for testing is typically at the discretion of the analysts, e.g., two-thirds for training and one-third for testing.

Similar to the trade-off faced while partitioning D.train into D.tr and D.val in Section 3.5.1, choosing the right fraction of labeled data to be used for training and testing is non-trivial. If the size of D.train is small, the classification model may be improperly learned from an insufficient number of training instances, resulting in a biased estimation of generalization performance. On the other hand, if the size of D.test is small, $err_{test}$ may be less reliable as it would be computed over a small number of test instances. Moreover, $err_{test}$ can have a high variance as we change the random partitioning of D into D.train and D.test.

The holdout method can be repeated several times to obtain a distribution of the test error rates, an approach known as random subsampling or repeated holdout method. This method produces a distribution of the error rates that can be used to understand the variance of $err_{test}$.
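The holdout method and its repeated variant can be sketched in a few lines. The code below is only an illustration under stated assumptions: a data set is a list of `(x, y)` pairs, `learn` is any function that takes a training list and returns a prediction function, and `majority_learner` is a deliberately trivial stand-in classifier. None of these names come from the text.

```python
import random

def holdout_error(data, learn, test_fraction=1 / 3, rng=None):
    """Randomly split data into D.train and D.test and return the test error rate."""
    rng = rng or random.Random()
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    d_test, d_train = shuffled[:n_test], shuffled[n_test:]
    predict = learn(d_train)
    errors = sum(predict(x) != y for x, y in d_test)
    return errors / len(d_test)

def random_subsampling(data, learn, repeats=10, test_fraction=1 / 3, seed=0):
    """Repeat the holdout method to obtain a distribution of test error rates."""
    rng = random.Random(seed)
    return [holdout_error(data, learn, test_fraction, rng) for _ in range(repeats)]

def majority_learner(d_train):
    """Stand-in learner: always predict the most frequent training label."""
    labels = [y for _, y in d_train]
    majority = max(sorted(set(labels)), key=labels.count)
    return lambda x: majority

# Hypothetical data: 20 instances of class 0 and 10 instances of class 1.
data = [(i, 0) for i in range(20)] + [(i, 1) for i in range(10)]
error_rates = random_subsampling(data, majority_learner, repeats=5)
```

The spread of the values in `error_rates` gives a rough picture of how much the test error rate varies with the random partitioning.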

3.6.2 Cross-Validation

Cross-validation is a widely used model evaluation method that aims to make effective use of all labeled instances in D for both training and testing. To illustrate this method, suppose that we are given a labeled set that we have randomly partitioned into three equal-sized subsets, $S_1$, $S_2$, and $S_3$, as shown in Figure 3.33. For the first run, we train a model using subsets $S_2$ and $S_3$ (shown as empty blocks) and test the model on subset $S_1$. The test error rate on $S_1$, denoted as $err(S_1)$, is thus computed in the first run. Similarly, for the second run, we use $S_1$ and $S_3$ as the training set and $S_2$ as the test set, to compute the test error rate, $err(S_2)$, on $S_2$. Finally, we use $S_1$ and $S_2$ for training in the third run, while $S_3$ is used for testing, thus resulting in the test error rate $err(S_3)$ for $S_3$. The overall test error rate is obtained by summing up the number of errors committed in each test subset across all runs and dividing it by the total number of instances. This approach is called three-fold cross-validation.

Figure 3.33. Example demonstrating the technique of 3-fold cross-validation.

The k-fold cross-validation method generalizes this approach by segmenting the labeled data D (of size N) into k equal-sized partitions (or folds). During the $i^{th}$ run, one of the partitions of D is chosen as D.test(i) for testing, while the rest of the partitions are used as D.train(i) for training. A model m(i) is learned using D.train(i) and applied on D.test(i) to obtain the sum of test errors, $err_{sum}(i)$. This procedure is repeated k times. The total test error rate, $err_{test}$, is then computed as

$$err_{test} = \frac{\sum_{i=1}^{k} err_{sum}(i)}{N}. \tag{3.14}$$

Every instance in the data is thus used for testing exactly once and for training exactly $(k-1)$ times. Also, every run uses a $(k-1)/k$ fraction of the data for training and a $1/k$ fraction for testing.

The right choice of k in k-fold cross-validation depends on a number of characteristics of the problem. A small value of k will result in a smaller training set at every run, which will result in a larger estimate of generalization error rate than what is expected of a model trained over the entire labeled set. On the other hand, a high value of k results in a larger training set at every run, which reduces the bias in the estimate of generalization error rate. In the extreme case, when $k = N$, every run uses exactly one data instance for testing and the remainder of the data for training. This special case of k-fold cross-validation is called the leave-one-out approach. This approach has the advantage of utilizing as much data as possible for training. However, leave-one-out can produce quite misleading results in some special scenarios, as illustrated in Exercise 11. Furthermore, leave-one-out can be computationally expensive for large data sets as the cross-validation procedure needs to be repeated N times. For most practical applications, the choice of k between 5 and 10 provides a reasonable approach for estimating the generalization error rate, because each run is able to make use of 80% to 90% of the labeled data for training.

The k-fold cross-validation method, as described above, produces a single estimate of the generalization error rate, without providing any information about the variance of the estimate. To obtain this information, we can run k-fold cross-validation for every possible partitioning of the data into k partitions, and obtain a distribution of test error rates computed for every such partitioning. The average test error rate across all possible partitionings serves as a more robust estimate of generalization error rate. This approach of estimating the generalization error rate and its variance is known as the complete cross-validation approach. Even though such an estimate is quite robust, it is usually too expensive to consider all possible partitionings of a large data set into k partitions. A more practical solution is to repeat the cross-validation approach multiple times, using a different random partitioning of the data into k partitions every time, and use the average test error rate as the estimate of generalization error rate. Note that since there is only one possible partitioning for the leave-one-out approach, it is not possible to estimate the variance of generalization error rate, which is another limitation of this method.

The k-fold cross-validation does not guarantee that the fraction of positive and

negative instances in every partition of the data is equal to the fraction

observed in the overall data. A simple solution to this problem is to perform a

stratified sampling of the positive and negative instances into k partitions, an

approach called stratified cross-validation.
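The computation of the total test error rate in Equation 3.14 can be sketched as follows. This is a minimal, unstratified illustration: instances are `(x, y)` pairs with a numeric `x`, folds are formed by dealing shuffled indices round-robin (so their sizes differ by at most one), and the 1-nearest-neighbor learner is an assumed stand-in for any classifier.

```python
import random

def one_nn_learner(d_train):
    """Stand-in learner: predict the label of the nearest training instance."""
    def predict(x):
        return min(d_train, key=lambda item: abs(item[0] - x))[1]
    return predict

def k_fold_error(data, learn, k, seed=0):
    """Total test error rate: errors summed over all k test folds, divided by N."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # near-equal-sized partitions
    total_errors = 0
    for i in range(k):
        test_idx = set(folds[i])
        d_test = [data[j] for j in folds[i]]
        d_train = [data[j] for j in indices if j not in test_idx]
        predict = learn(d_train)                # model m(i) learned on D.train(i)
        total_errors += sum(predict(x) != y for x, y in d_test)  # err_sum(i)
    return total_errors / len(data)

# Hypothetical data: two well-separated groups of ten instances each.
data = [(x, 0) for x in range(10)] + [(x, 1) for x in range(100, 110)]
err_5_fold = k_fold_error(data, one_nn_learner, k=5)
err_leave_one_out = k_fold_error(data, one_nn_learner, k=len(data))
```

Setting `k` to the number of instances gives the leave-one-out approach. Repeating the call with different `seed` values corresponds to repeating cross-validation over different random partitionings.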

In k-fold cross-validation, a different model is learned at every run and the performance of every one of the k models on their respective test folds is then aggregated to compute the overall test error rate, $err_{test}$. Hence, $err_{test}$ does not reflect the generalization error rate of any of the k models. Instead, it reflects the expected generalization error rate of the model selection approach, when applied on a training set of the same size as one of the training folds $(N(k-1)/k)$. This is different from the $err_{test}$ computed in the holdout method, which exactly corresponds to the specific model learned over D.train. Hence, although effectively utilizing every data instance in D for training and testing, the $err_{test}$ computed in the cross-validation method does not represent the performance of a single model learned over a specific D.train.

Nonetheless, in practice, $err_{test}$ is typically used as an estimate of the generalization error of a model built on D. One motivation for this is that when the size of the training folds is closer to the size of the overall data (when k is large), then $err_{test}$ resembles the expected performance of a model learned over a data set of the same size as D. For example, when k is 10, every training fold is 90% of the overall data. The $err_{test}$ then should approach the expected performance of a model learned over 90% of the overall data, which will be close to the expected performance of a model learned over D.

3.7 Presence of Hyper-parameters

Hyper-parameters are parameters of learning algorithms that need to be determined before learning the classification model. For instance, consider the hyper-parameter $\alpha$ that appeared in Equation 3.11, which is repeated here for convenience. This equation was used for estimating the generalization error for a model selection approach that used an explicit representation of model complexity. (See Section 3.5.2.)

$$\text{gen.error}(m) = \text{train.error}(m, \text{D.train}) + \alpha \times \text{complexity}(M)$$

For other examples of hyper-parameters, see Chapter 4.

Unlike regular model parameters, such as the test conditions in the internal nodes of a decision tree, hyper-parameters such as $\alpha$ do not appear in the final classification model that is used to classify unlabeled instances. However, the values of hyper-parameters need to be determined during model selection (a process known as hyper-parameter selection) and must be taken into account during model evaluation. Fortunately, both tasks can be effectively accomplished via slight modifications of the cross-validation approach described in the previous section.

3.7.1 Hyper-parameter Selection

In Section 3.5.2, a validation set was used to select $\alpha$, and this approach is generally applicable for hyper-parameter selection. Let p be the hyper-parameter that needs to be selected from a finite range of values, $P = \{p_1, p_2, \ldots, p_n\}$. Partition D.train into D.tr and D.val. For every choice of hyper-parameter value $p_i$, we can learn a model $m_i$ on D.tr, and apply this model on D.val to obtain the validation error rate $err_{val}(p_i)$. Let $p^*$ be the hyper-parameter value that provides the lowest validation error rate. We can then use the model $m^*$ corresponding to $p^*$ as the final choice of classification model.

The above approach, although useful, uses only a subset of the data, D.tr, for training and a subset, D.val, for validation. The framework of cross-validation, presented in Section 3.6.2, addresses both of those issues, albeit in the context of model evaluation. Here we indicate how to use a cross-validation approach for hyper-parameter selection. To illustrate this approach, let us partition D.train into three folds as shown in Figure 3.34. At every run, one of the folds is used as D.val for validation, and the remaining two folds are used as D.tr for learning a model, for every choice of hyper-parameter value $p_i$. The overall validation error rate corresponding to each $p_i$ is computed by summing the errors across all the three folds. We then select the hyper-parameter value $p^*$ that provides the lowest validation error rate, and use it to learn a model $m^*$ on the entire training set D.train.

Figure 3.34. Example demonstrating the 3-fold cross-validation framework for hyper-parameter selection using D.train.

Algorithm 3.2 generalizes the above approach using a k-fold cross-validation framework for hyper-parameter selection. At the $i^{th}$ run of cross-validation, the data in the $i^{th}$ fold is used as D.val(i) for validation (Step 4), while the remainder of the data in D.train is used as D.tr(i) for training (Step 5). Then for every choice of hyper-parameter value $p_i$, a model is learned on D.tr(i) (Step 7), which is applied on D.val(i) to compute its validation error (Step 8). This is used to compute the validation error rate corresponding to models learned using $p_i$ over all the folds (Step 11). The hyper-parameter value $p^*$ that provides the lowest validation error rate (Step 12) is now used to learn the final model $m^*$ on the entire training set D.train (Step 13). Hence, at the end of this algorithm, we obtain the best choice of the hyper-parameter value as well as the final classification model (Step 14), both of which are obtained by making an effective use of every data instance in D.train.

Algorithm 3.2 Procedure model-select(k, $P$, D.train)
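The procedure described in the preceding paragraph can be sketched in Python. This is a reconstruction from the prose, not a transcription of the algorithm listing, so its structure and step order are assumptions. The learner interface `learn(d_tr, p)` and the k-nearest-neighbor stand-in, whose number of neighbors plays the role of the hyper-parameter, are hypothetical.

```python
import random

def knn_learner(d_tr, p):
    """Stand-in learner: majority vote among the p nearest training instances."""
    def predict(x):
        nearest = sorted(d_tr, key=lambda item: abs(item[0] - x))[:p]
        labels = [y for _, y in nearest]
        return max(sorted(set(labels)), key=labels.count)
    return predict

def model_select(k, P, d_train, learn, seed=0):
    """Select the hyper-parameter value with the lowest k-fold validation error."""
    rng = random.Random(seed)
    idx = list(range(len(d_train)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    err_sum = {p: 0 for p in P}
    for i in range(k):
        val_idx = set(folds[i])
        d_val = [d_train[j] for j in folds[i]]               # D.val(i)
        d_tr = [d_train[j] for j in idx if j not in val_idx]  # D.tr(i)
        for p in P:
            predict = learn(d_tr, p)                          # model learned on D.tr(i)
            err_sum[p] += sum(predict(x) != y for x, y in d_val)
    # Validation error rate of each hyper-parameter value over all folds.
    err_val = {p: err_sum[p] / len(d_train) for p in P}
    p_star = min(P, key=lambda p: err_val[p])
    m_star = learn(d_train, p_star)                           # final model on all of D.train
    return p_star, m_star

# Hypothetical data: two well-separated groups of ten instances each.
d_train = [(x, 0) for x in range(10)] + [(x, 1) for x in range(100, 110)]
p_star, m_star = model_select(k=4, P=[1, 3], d_train=d_train, learn=knn_learner)
```

Ties between hyper-parameter values are broken here in favor of the value listed first in `P`, which is a choice made for this sketch.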

3.7.2 Nested Cross-Validation

The approach of the previous section provides a way to effectively use all the instances in D.train to learn a classification model when hyper-parameter selection is required. This approach can be applied over the entire data set D to learn the final classification model. However, applying Algorithm 3.2 on D would only return the final classification model $m^*$ but not an estimate of its generalization performance, $err_{test}$. Recall that the validation error rates used in Algorithm 3.2 cannot be used as estimates of generalization performance, since they are used to guide the selection of the final model $m^*$. However, to compute $err_{test}$, we can again use a cross-validation framework for evaluating the performance on the entire data set D, as described originally in Section 3.6.2. In this approach, D is partitioned into D.train (for training) and D.test (for testing) at every run of cross-validation. When hyper-parameters are involved, we can use Algorithm 3.2 to train a model using D.train at every run, thus “internally” using cross-validation for model selection. This approach is called nested cross-validation or double cross-validation. Algorithm 3.3 describes the complete approach for estimating $err_{test}$ using nested cross-validation in the presence of hyper-parameters.

As an illustration of this approach, see Figure 3.35 where the labeled set D is partitioned into D.train and D.test, using a 3-fold cross-validation method.

Figure 3.35. Example demonstrating 3-fold nested cross-validation for computing $err_{test}$.

At the $i^{th}$ run of this method, one of the folds is used as the test set, D.test(i), while the remaining two folds are used as the training set, D.train(i). This is represented in Figure 3.35 as the $i^{th}$ “outer” run. In order to select a model using D.train(i), we again use an “inner” 3-fold cross-validation framework that partitions D.train(i) into D.tr and D.val at every one of the three inner runs (iterations). As described in Section 3.7, we can use the inner cross-validation framework to select the best hyper-parameter value $p^*(i)$ as well as its resulting classification model $m^*(i)$ learned over D.train(i). We can then apply $m^*(i)$ on D.test(i) to obtain the test error at the $i^{th}$ outer run. By repeating this process for every outer run, we can compute the average test error rate, $err_{test}$, over the entire labeled set D. Note that in the above approach, the inner cross-validation framework is being used for model selection while the outer cross-validation framework is being used for model evaluation.

Algorithm 3.3 The nested cross-validation approach for computing $err_{test}$.
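The nested scheme can be sketched in the same style. As before, this is a reconstruction from the prose rather than a transcription of the algorithm listing. The helper names, the learner interface `learn(d_tr, p)`, and the k-nearest-neighbor stand-in are assumptions, and the same number of folds is used for the inner and outer loops only for brevity.

```python
import random

def knn_learner(d_tr, p):
    """Stand-in learner: majority vote among the p nearest training instances."""
    def predict(x):
        nearest = sorted(d_tr, key=lambda item: abs(item[0] - x))[:p]
        labels = [y for _, y in nearest]
        return max(sorted(set(labels)), key=labels.count)
    return predict

def make_folds(n, k, rng):
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def split(data, folds, i):
    """Return (instances outside fold i, instances inside fold i)."""
    held = set(folds[i])
    inside = [data[j] for j in folds[i]]
    outside = [data[j] for j in range(len(data)) if j not in held]
    return outside, inside

def count_errors(predict, instances):
    return sum(predict(x) != y for x, y in instances)

def model_select(k, P, d_train, learn, rng):
    """Inner loop: choose p* by k-fold validation error, then refit on D.train."""
    folds = make_folds(len(d_train), k, rng)
    err_sum = {p: 0 for p in P}
    for i in range(k):
        d_tr, d_val = split(d_train, folds, i)
        for p in P:
            err_sum[p] += count_errors(learn(d_tr, p), d_val)
    p_star = min(P, key=lambda p: err_sum[p])
    return p_star, learn(d_train, p_star)

def nested_cv_error(k, P, data, learn, seed=0):
    """Outer loop: evaluate the selected model m*(i) on the untouched D.test(i)."""
    rng = random.Random(seed)
    outer = make_folds(len(data), k, rng)
    total_errors = 0
    for i in range(k):
        d_train, d_test = split(data, outer, i)
        _, m_star = model_select(k, P, d_train, learn, rng)
        total_errors += count_errors(m_star, d_test)
    return total_errors / len(data)

# Hypothetical data: two well-separated groups of ten instances each.
data = [(x, 0) for x in range(10)] + [(x, 1) for x in range(100, 110)]
err_test = nested_cv_error(k=3, P=[1, 3], data=data, learn=knn_learner)
```

The key property is that D.test(i) is never seen by `model_select`, so the hyper-parameter choice in each outer run cannot be influenced by the instances used to evaluate it.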

3.8 Pitfalls of Model Selection and

Evaluation

Model selection and evaluation, when used effectively, serve as excellent tools for learning classification models and assessing their generalization performance. However, when applying them in practical settings, there are several pitfalls that can result in improper and often misleading conclusions. Some of these pitfalls are simple to understand and easy to avoid, while others are quite subtle in nature and difficult to catch. In the following, we present two of these pitfalls and discuss best practices to avoid them.

3.8.1 Overlap between Training and

Test Sets

One of the basic requirements of a clean model selection and evaluation setup is that the data used for model selection (D.train) must be kept separate from the data used for model evaluation (D.test). If there is any overlap between the two, the test error rate $err_{test}$ computed over D.test cannot be considered representative of the performance on unseen instances. Comparing the effectiveness of classification models using $err_{test}$ can then be quite misleading, as an overly complex model can show an inaccurately low value of $err_{test}$ due to model overfitting (see Exercise 12 at the end of this chapter).

To illustrate the importance of ensuring no overlap between D.train and D.test,

consider a labeled data set where all the attributes are irrelevant, i.e. they

have no relationship with the class labels. Using such attributes, we should

expect no classification model to perform better than random guessing.

However, if the test set involves even a small number of data instances that

were used for training, there is a possibility for an overly complex model to

show better performance than random, even though the attributes are

completely irrelevant. As we will see later in Chapter 10 , this scenario can

actually be used as a criterion to detect overfitting due to improper setup of

experiment. If a model shows better performance than a random classifier

even when the attributes are irrelevant, it is an indication of a potential

feedback between the training and test sets.
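The effect of overlap can be illustrated with a small simulation. The setup below is an assumption made for this sketch: a single irrelevant attribute drawn uniformly at random, labels assigned by coin flips, and a 1-nearest-neighbor classifier standing in for an overly complex model that memorizes its training data.

```python
import random

def one_nn(d_train):
    """Overly complex stand-in model: memorizes every training instance."""
    def predict(x):
        return min(d_train, key=lambda item: abs(item[0] - x))[1]
    return predict

def error_rate(predict, instances):
    return sum(predict(x) != y for x, y in instances) / len(instances)

rng = random.Random(0)
# The attribute carries no information about the class label.
data = [(rng.random(), rng.randint(0, 1)) for _ in range(400)]
d_train, d_test = data[:200], data[200:]
model = one_nn(d_train)

# Clean setup: D.test is disjoint from D.train.
clean_error = error_rate(model, d_test)

# Leaky setup: half of the test set consists of training instances.
leaky_test = d_test[:100] + d_train[:100]
leaky_error = error_rate(model, leaky_test)
```

On the disjoint test set the error rate stays close to that of random guessing, while the leaked training instances are all classified correctly and pull `leaky_error` well below it, even though the attribute is irrelevant.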

3.8.2 Use of Validation Error as

Generalization Error

The validation error rate $err_{val}$ serves an important role during model selection, as it provides “out-of-sample” error estimates of models on D.val, which is not used for training the models. Hence, $err_{val}$ serves as a better metric than the training error rate for selecting models and hyper-parameter values, as described in Sections 3.5.1 and 3.7, respectively. However, once the validation set has been used for selecting a classification model $m^*$, $err_{val}$ no longer reflects the performance of $m^*$ on unseen instances.

To realize the pitfall in using validation error rate as an estimate of generalization performance, consider the problem of selecting a hyper-parameter value p from a range of values $P$, using a validation set D.val. If the number of possible values in $P$ is quite large and the size of D.val is small, it is possible to select a hyper-parameter value $p^*$ that shows favorable performance on D.val just by random chance. Notice the similarity of this problem with the multiple comparisons problem discussed in Section 3.4.1. Even though the classification model $m^*$ learned using $p^*$ would show a low validation error rate, it would lack generalizability on unseen test instances.

The correct approach for estimating the generalization error rate of a model $m^*$ is to use an independently chosen test set D.test that hasn’t been used in any way to influence the selection of $m^*$. As a rule of thumb, the test set should never be examined during model selection, to ensure the absence of any form of overfitting. If the insights gained from any portion of a labeled data set help in improving the classification model even in an indirect way, then that portion of data must be discarded during testing.
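The selection effect described in this section can be reproduced with a small simulation. Everything in it is an assumption chosen for illustration: 1000 candidate hyper-parameter values, each yielding a model that guesses labels at random, a validation set of 20 instances, and a test set of 1000 instances.

```python
import random

rng = random.Random(1)
val_labels = [rng.randint(0, 1) for _ in range(20)]     # small D.val
test_labels = [rng.randint(0, 1) for _ in range(1000)]  # independent D.test

def error_rate(predictions, labels):
    return sum(a != b for a, b in zip(predictions, labels)) / len(labels)

def evaluate(p):
    """The model for hyper-parameter value p guesses every label at random."""
    guess = random.Random(1000 + p)
    err_val = error_rate([guess.randint(0, 1) for _ in val_labels], val_labels)
    err_test = error_rate([guess.randint(0, 1) for _ in test_labels], test_labels)
    return err_val, err_test

# Try many hyper-parameter values and keep the one with the lowest validation error.
results = {p: evaluate(p) for p in range(1000)}
p_star = min(results, key=lambda p: results[p][0])
err_val_star, err_test_star = results[p_star]
```

The selected model shows a validation error rate far below 0.5 purely by chance, while its error rate on the independent test set remains close to 0.5, which is why the validation error cannot serve as the estimate of generalization error.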

3.9 Model Comparison

One difficulty when comparing the performance of different classification models is whether the observed difference in their performance is statistically significant. For example, consider a pair of classification models, $M_A$ and $M_B$. Suppose $M_A$ achieves 85% accuracy when evaluated on a test set containing 30 instances, while $M_B$ achieves 75% accuracy on a different test set containing 5000 instances. Based on this information, is $M_A$ a better model than $M_B$? This example raises two key questions regarding the statistical significance of a performance metric:

1. Although $M_A$ has a higher accuracy than $M_B$, it was tested on a smaller test set. How much confidence do we have that the accuracy for $M_A$ is actually 85%?

2. Is it possible to explain the difference in accuracies between $M_A$ and $M_B$ as a result of variations in the composition of their test sets?

The first question relates to the issue of estimating the confidence interval of model accuracy. The second question relates to the issue of testing the statistical significance of the observed deviation. These issues are investigated in the remainder of this section.

3.9.1 Estimating the Confidence Interval for Accuracy

To determine its confidence interval, we need to establish the probability distribution for sample accuracy. This section describes an approach for deriving the confidence interval by modeling the classification task as a binomial random experiment. The following describes characteristics of such an experiment:

1. The random experiment consists of N independent trials, where each trial has two possible outcomes: success or failure.

2. The probability of success, p, in each trial is constant.

An example of a binomial experiment is counting the number of heads that turn up when a coin is flipped N times. If X is the number of successes observed in N trials, then the probability that X takes a particular value is given by a binomial distribution with mean $Np$ and variance $Np(1-p)$:

$$P(X = v) = \binom{N}{v} p^{v} (1-p)^{N-v}.$$

For example, if the coin is fair $(p = 0.5)$ and is flipped fifty times, then the probability that the head shows up 20 times is

$$P(X = 20) = \binom{50}{20}\, 0.5^{20} (1-0.5)^{30} = 0.0419.$$

If the experiment is repeated many times, then the average number of heads expected to show up is $50 \times 0.5 = 25$, while its variance is $50 \times 0.5 \times 0.5 = 12.5$.

The task of predicting the class labels of test instances can also be considered as a binomial experiment. Given a test set that contains N instances, let X be the number of instances correctly predicted by a model and p be the true accuracy of the model. If the prediction task is modeled as a binomial experiment, then X has a binomial distribution with mean $Np$ and variance $Np(1-p)$. It can be shown that the empirical accuracy, $acc = X/N$, also has a binomial distribution with mean p and variance $p(1-p)/N$ (see Exercise 14). The binomial distribution can be approximated by a normal distribution when N is sufficiently large. Based on the normal distribution, the confidence interval for acc can be derived as follows:

$$P\left(-Z_{\alpha/2} \le \frac{acc - p}{\sqrt{p(1-p)/N}} \le Z_{1-\alpha/2}\right) = 1 - \alpha, \tag{3.15}$$

where $Z_{\alpha/2}$ and $Z_{1-\alpha/2}$ are the upper and lower bounds obtained from a standard normal distribution at confidence level $(1-\alpha)$. Since a standard normal distribution is symmetric around $Z = 0$, it follows that $Z_{\alpha/2} = Z_{1-\alpha/2}$. Rearranging this inequality leads to the following confidence interval for p:

$$\frac{2 \times N \times acc + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4N\,acc - 4N\,acc^2}}{2(N + Z_{\alpha/2}^2)}. \tag{3.16}$$

The following table shows the values of $Z_{\alpha/2}$ at different confidence levels:

$1-\alpha$       0.99   0.98   0.95   0.9    0.8    0.7    0.5
$Z_{\alpha/2}$   2.58   2.33   1.96   1.65   1.28   1.04   0.67

3.11. Example Confidence Interval for Accuracy

Consider a model that has an accuracy of 80% when evaluated on 100 test instances. What is the confidence interval for its true accuracy at a 95% confidence level? The confidence level of 95% corresponds to $Z_{\alpha/2} = 1.96$ according to the table given above. Inserting this term into Equation 3.16 yields a confidence interval between 71.1% and 86.7%. The following table shows the confidence interval when the number of instances, N, increases:

N                     20              50              100             500             1000            5000
Confidence Interval   0.584 − 0.919   0.670 − 0.888   0.711 − 0.867   0.763 − 0.833   0.774 − 0.824   0.789 − 0.811

Note that the confidence interval becomes tighter when N increases.

3.9.2 Comparing the Performance of Two Models

Consider a pair of models, $M_1$ and $M_2$, which are evaluated on two independent test sets, $D_1$ and $D_2$. Let $n_1$ denote the number of instances in $D_1$ and $n_2$ denote the number of instances in $D_2$. In addition, suppose the error rate for $M_1$ on $D_1$ is $e_1$ and the error rate for $M_2$ on $D_2$ is $e_2$. Our goal is to test whether the observed difference between $e_1$ and $e_2$ is statistically significant.

Assuming that $n_1$ and $n_2$ are sufficiently large, the error rates $e_1$ and $e_2$ can be approximated using normal distributions. If the observed difference in the error rate is denoted as $d = e_1 - e_2$, then d is also normally distributed with mean $d_t$, its true difference, and variance, $\sigma_d^2$. The variance of d can be computed as follows:

$$\sigma_d^2 \simeq \hat{\sigma}_d^2 = \frac{e_1(1-e_1)}{n_1} + \frac{e_2(1-e_2)}{n_2}, \tag{3.17}$$

where $e_1(1-e_1)/n_1$ and $e_2(1-e_2)/n_2$ are the variances of the error rates. Finally, at the $(1-\alpha)\%$ confidence level, it can be shown that the confidence interval for the true difference $d_t$ is given by the following equation:

$$d_t = d \pm z_{\alpha/2}\, \hat{\sigma}_d. \tag{3.18}$$

3.12. Example Significance Testing

Consider the problem described at the beginning of this section. Model $M_A$ has an error rate of $e_1 = 0.15$ when applied to $n_1 = 30$ test instances, while model $M_B$ has an error rate of $e_2 = 0.25$ when applied to $n_2 = 5000$ test instances. The observed difference in their error rates is $d = |0.15 - 0.25| = 0.1$. In this example, we are performing a two-sided test to check whether $d_t = 0$ or $d_t \ne 0$. The estimated variance of the observed difference in error rates can be computed as follows:

$$\hat{\sigma}_d^2 = \frac{0.15(1-0.15)}{30} + \frac{0.25(1-0.25)}{5000} = 0.0043$$

or $\hat{\sigma}_d = 0.0655$. Inserting this value into Equation 3.18, we obtain the following confidence interval for $d_t$ at 95% confidence level:

$$d_t = 0.1 \pm 1.96 \times 0.0655 = 0.1 \pm 0.128.$$

As the interval spans the value zero, we can conclude that the observed difference is not statistically significant at a 95% confidence level.

At what confidence level can we reject the hypothesis that $d_t = 0$? To do this, we need to determine the value of $Z_{\alpha/2}$ such that the confidence interval for $d_t$ does not span the value zero. We can reverse the preceding computation and look for the value $Z_{\alpha/2}$ such that $d > Z_{\alpha/2}\,\hat{\sigma}_d$. Replacing the values of d and $\hat{\sigma}_d$ gives $Z_{\alpha/2} < 1.527$. For a two-sided test, this value first occurs when $(1-\alpha) \lesssim 0.873$, which lies between the 0.8 and 0.9 confidence levels in the table of Section 3.9.1. The result suggests that the null hypothesis can be rejected at a confidence level of 87.3% or lower.
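The quantities used in Sections 3.9.1 and 3.9.2 can be recomputed with the Python standard library. The sketch below covers the binomial probability of the coin example, the confidence interval of Equation 3.16, and the interval of Equations 3.17 and 3.18. The function names are chosen for this illustration, and $Z_{\alpha/2}$ is assumed to be the $(1-\alpha/2)$ quantile of the standard normal distribution.

```python
from math import comb, sqrt
from statistics import NormalDist

def binomial_pmf(n, v, p):
    """P(X = v) for a binomial experiment with n trials and success probability p."""
    return comb(n, v) * p**v * (1 - p) ** (n - v)

def z_value(confidence):
    """Two-sided critical value Z_{alpha/2} for confidence level 1 - alpha."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def accuracy_interval(n, acc, confidence):
    """Confidence interval for the true accuracy p (Equation 3.16)."""
    z = z_value(confidence)
    center = 2 * n * acc + z**2
    spread = z * sqrt(z**2 + 4 * n * acc - 4 * n * acc**2)
    denom = 2 * (n + z**2)
    return (center - spread) / denom, (center + spread) / denom

def difference_interval(e1, n1, e2, n2, confidence):
    """Confidence interval for the true difference d_t (Equations 3.17 and 3.18)."""
    d = abs(e1 - e2)
    sigma_d = sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    z = z_value(confidence)
    return d - z * sigma_d, d + z * sigma_d

p_20_heads = binomial_pmf(50, 20, 0.5)
acc_low, acc_high = accuracy_interval(100, 0.80, 0.95)
d_low, d_high = difference_interval(0.15, 30, 0.25, 5000, 0.95)

# The difference is significant only if the interval excludes zero.
significant = not (d_low <= 0.0 <= d_high)
```

For the inputs of Example 3.11 the interval is approximately 0.711 to 0.867, and for the inputs of Example 3.12 the interval contains zero, so `significant` is `False` at the 95% confidence level.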

3.10 Bibliographic Notes

Early classification systems were developed to organize various collections of

objects, from living organisms to inanimate ones. Examples abound, from

Aristotle’s cataloguing of species to the Dewey Decimal and Library of

Congress classification systems for books. Such a task typically requires

considerable human efforts, both to identify properties of the objects to be

classified and to organize them into well distinguished categories.

With the development of statistics and computing, automated classification

has been a subject of intensive research. The study of classification in

classical statistics is sometimes known as discriminant analysis, where the

objective is to predict the group membership of an object based on its

corresponding features. A well-known classical method is Fisher’s linear

discriminant analysis [142], which seeks to find a linear projection of the data

that produces the best separation between objects from different classes.

Many pattern recognition problems also require the discrimination of objects

from different classes. Examples include speech recognition, handwritten

character identification, and image classification. Readers who are interested

in the application of classification techniques for pattern recognition may refer

to the survey articles by Jain et al. [150] and Kulkarni et al. [157] or classic

pattern recognition books by Bishop [125], Duda et al. [137], and Fukunaga

[143]. The subject of classification is also a major research topic in neural

networks, statistical learning, and machine learning. An in-depth treatment of

the topic of classification from the statistical and machine learning

perspectives can be found in the books by Bishop [126], Cherkassky and

Mulier [132], Hastie et al. [148], Michie et al. [162], Murphy [167], and Mitchell

[165]. Recent years have also seen the release of many publicly available

software packages for classification, which can be embedded in programming

languages such as Java (Weka [147]) and Python (scikit-learn [174]).

An overview of decision tree induction algorithms can be found in the survey

articles by Buntine [129], Moret [166], Murthy [168], and Safavian et al. [179].

Examples of some well-known decision tree algorithms include CART [127],

ID3 [175], C4.5 [177], and CHAID [153]. Both ID3 and C4.5 employ the

entropy measure as their splitting function. An in-depth discussion of the C4.5

decision tree algorithm is given by Quinlan [177]. The CART algorithm was

developed by Breiman et al. [127] and uses the Gini index as its splitting

function. CHAID [153] uses the statistical $\chi^2$ test to determine the best split

during the tree-growing process.

The decision tree algorithm presented in this chapter assumes that the

splitting condition at each internal node contains only one attribute. An oblique

decision tree can use multiple attributes to form the attribute test condition in a

single node [149, 187]. Breiman et al. [127] provide an option for using linear

combinations of attributes in their CART implementation. Other approaches

for inducing oblique decision trees were proposed by Heath et al. [149],

Murthy et al. [169], Cantú-Paz and Kamath [130], and Utgoff and Brodley

[187]. Although an oblique decision tree helps to improve the expressiveness

of the model representation, the tree induction process becomes

computationally challenging. Another way to improve the expressiveness of a

decision tree without using oblique decision trees is to apply a method known

as constructive induction [161]. This method simplifies the task of learning

complex splitting functions by creating compound features from the original

data.

Besides the top-down approach, other strategies for growing a decision tree

include the bottom-up approach by Landeweerd et al. [159] and Pattipati and

Alexandridis [173], as well as the bidirectional approach by Kim and

Landgrebe [154]. Schuermann and Doster [181] and Wang and Suen [193]

proposed using a soft splitting criterion to address the data fragmentation

problem. In this approach, each instance is assigned to different branches of

the decision tree with different probabilities.

Model overfitting is an important issue that must be addressed to ensure that

a decision tree classifier performs equally well on previously unlabeled data

instances. The model overfitting problem has been investigated by many

authors including Breiman et al. [127], Schaffer [180], Mingers [164], and

Jensen and Cohen [151]. While the presence of noise is often regarded as

one of the primary reasons for overfitting [164, 170], Jensen and Cohen [151]

viewed overfitting as an artifact of failure to compensate for the multiple

comparisons problem.

Bishop [126] and Hastie et al. [148] provide an excellent discussion of model

overfitting, relating it to a well-known framework of theoretical analysis, known

as bias-variance decomposition [146]. In this framework, the prediction of a

learning algorithm is considered to be a function of the training set, which

varies as the training set is changed. The generalization error of a model is

then described in terms of its bias (the error of the average prediction

obtained using different training sets), its variance (how different are the

predictions obtained using different training sets), and noise (the irreducible

error inherent to the problem). An underfit model is considered to have high

bias but low variance, while an overfit model is considered to have low bias

but high variance. Although the bias-variance decomposition was originally

proposed for regression problems (where the target attribute is a continuous

variable), a unified analysis that is applicable for classification has been

proposed by Domingos [136]. The bias-variance decomposition will be

discussed in more detail while introducing ensemble learning methods in

Chapter 4.
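For squared-error loss in regression, the decomposition described above is commonly written as follows. This is a sketch under stated assumptions: the target is $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$ independent of the training set $D$, and $\hat{f}_D$ denotes the model learned from $D$. These symbols are introduced here for illustration and do not appear in the text.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat{f}_D(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

The first term measures the error of the average prediction over training sets, the second measures how much predictions vary from one training set to another, and the third is the irreducible error.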

Various learning principles, such as the Probably Approximately Correct

(PAC) learning framework [188], have been developed to provide a theoretical

framework for explaining the generalization performance of learning

algorithms. In the field of statistics, a number of performance estimation

methods have been proposed that make a trade-off between the goodness of

fit of a model and the model complexity. Most noteworthy among them are

Akaike’s Information Criterion [120] and the Bayesian Information Criterion

[182]. They both apply corrective terms to the training error rate of a model, so

as to penalize more complex models. Another widely used approach for

measuring the complexity of any general model is the Vapnik-Chervonenkis

(VC) Dimension [190]. The VC dimension of a class of functions C is defined

as the maximum number of points that can be shattered by functions

belonging to C. A set of points is shattered if every possible assignment of

class labels to the points can be produced by some function in C; it is enough

that one configuration of that many points can be shattered.

The VC dimension lays the foundation of the structural

risk minimization principle [189], which is extensively used in many learning

algorithms, e.g., support vector machines, which will be discussed in detail in

Chapter 4.
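The notion of shattering can be made concrete with a brute-force check on a very small function class. The sketch below assumes, purely for illustration, the class of interval classifiers on the real line (label 1 inside $[a, b]$, label 0 outside); the function names are hypothetical. Two points can be shattered by this class, but no three points can, because the labeling (1, 0, 1) is never produced by a single interval.

```python
from itertools import product


def interval_labelings(points):
    """All labelings of `points` produced by classifiers 1[a <= x <= b].

    An interval selects a contiguous run of the sorted points, so it is
    enough to try endpoints taken from the points themselves. The empty
    interval (all labels 0) is added explicitly.
    """
    labelings = {tuple(0 for _ in points)}
    for a in points:
        for b in points:
            if a <= b:
                labelings.add(tuple(int(a <= x <= b) for x in points))
    return labelings


def is_shattered(points, labelings_fn):
    """True if every one of the 2^n labelings of `points` is realizable."""
    realized = labelings_fn(points)
    return all(lab in realized for lab in product((0, 1), repeat=len(points)))


print(is_shattered([1.0, 2.0], interval_labelings))       # True
print(is_shattered([1.0, 2.0, 3.0], interval_labelings))  # False
```

Under this assumption the VC dimension of the interval class is 2: there is a set of two points that is shattered, and no set of three points is.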

The Occam’s razor principle is often attributed to the philosopher William of

Occam. Domingos [135] cautioned against the pitfall of misinterpreting

Occam’s razor as comparing models with similar training errors, instead of

generalization errors. A survey on decision tree-pruning methods to avoid

overfitting is given by Breslow and Aha [128] and Esposito et al. [141]. Some

of the typical pruning methods include reduced error pruning [176], pessimistic

error pruning [176], minimum error pruning [171], critical value pruning [163],

cost-complexity pruning [127], and error-based pruning [177]. Quinlan and

Rivest proposed using the minimum description length principle for decision

tree pruning in [178].

The discussion in this chapter on the significance of cross-validation error

estimates is inspired by Chapter 7 in Hastie et al. [148]. It is also an

excellent resource for understanding “the right and wrong ways to do cross-

validation”, which is similar to the discussion on pitfalls in Section 3.8 of

this chapter. A comprehensive discussion of some of the common pitfalls in

using cross-validation for model selection and evaluation is provided in

Krstajic et al. [156].

The original cross-validation method was proposed independently by Allen

[121], Stone [184], and Geisser [145] for model assessment (evaluation).

Even though cross-validation can be used for model selection [194], its use

for model selection is quite different from its use for model evaluation,

as emphasized by Stone [184]. Over the years, the distinction between the

two usages has often been ignored, resulting in incorrect findings. One of the

common mistakes while using cross-validation is to perform pre-processing

operations (e.g., hyper-parameter tuning or feature selection) using the entire

data set and not “within” the training fold of every cross-validation run.

Ambroise and McLachlan [124], using a number of gene expression studies as

examples, provide an extensive discussion of the selection bias that arises when

feature selection is performed outside cross-validation. Useful guidelines for

evaluating models on microarray data have also been provided by Allison et

al. [122].

The use of the cross-validation protocol for hyper-parameter tuning has been

described in detail by Dudoit and van der Laan [138]. This approach has been

called “grid-search cross-validation.” The correct approach in using cross-

validation for both hyper-parameter selection and model evaluation, as

discussed in Section 3.7 of this chapter, is extensively covered by Varma

and Simon [191]. This combined approach has been referred to as “nested

cross-validation” or “double cross-validation” in the existing literature.
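The nested protocol described above can be sketched in a few lines. The following is a minimal illustration using only the Python standard library, not the procedure of any cited paper: the classifier is a one-dimensional k-nearest-neighbor rule, the hyper-parameter grid and the toy data are hypothetical, and all function names are made up for this example. The key point is that the inner loop chooses k using only the outer training fold, so the outer test fold never influences the choice.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D features)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes >= len(nearest) else 0


def error_rate(train, test, k):
    """Fraction of test instances misclassified by k-NN trained on `train`."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in test)
    return wrong / len(test)


def folds(data, n_folds):
    """Yield (train, test) pairs; fold i holds every n_folds-th instance."""
    for i in range(n_folds):
        test = data[i::n_folds]
        train = [p for j, p in enumerate(data) if j % n_folds != i]
        yield train, test


def cv_error(data, k, n_folds):
    """Plain cross-validation error for a fixed k (the grid-search criterion)."""
    errs = [error_rate(tr, te, k) for tr, te in folds(data, n_folds)]
    return sum(errs) / len(errs)


def nested_cv(data, k_grid, outer=5, inner=3):
    """Outer loop estimates error; inner loop selects k on outer-training data only."""
    outer_errs, chosen = [], []
    for train, test in folds(data, outer):
        best_k = min(k_grid, key=lambda k: cv_error(train, k, inner))
        chosen.append(best_k)
        outer_errs.append(error_rate(train, test, best_k))
    return sum(outer_errs) / len(outer_errs), chosen


# Hypothetical toy data: class 1 tends to have larger feature values.
rng = random.Random(0)
data = [(rng.gauss(label * 2.0, 1.0), label) for label in [0, 1] * 30]
rng.shuffle(data)

estimate, chosen_k = nested_cv(data, k_grid=[1, 3, 5, 7])
print(f"nested CV error estimate: {estimate:.3f}; k chosen per outer fold: {chosen_k}")
```

Selecting k once on the full data set and then reporting the cross-validation error of that k would reuse the test folds for selection, which is the mistake the nested protocol avoids.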

Recently, Tibshirani and Tibshirani [185] have proposed a new approach for

hyper-parameter selection and model evaluation. Tsamardinos et al. [186]

compared this approach to nested cross-validation. The experiments they

performed found that, on average, both approaches provide conservative

estimates of model performance with the Tibshirani and Tibshirani approach

being more computationally efficient.

Kohavi [155] has performed an extensive empirical study to compare the

performance metrics obtained using different estimation methods such as

random subsampling and k-fold cross-validation. His results suggest that the

best estimation method is ten-fold, stratified cross-validation.

An alternative approach for model evaluation is the bootstrap method, which

was presented by Efron in 1979 [139]. In this method, training instances are

sampled with replacement from the labeled set, i.e., an instance previously

selected to be part of the training set is equally likely to be drawn again. If the

original data has N instances, it can be shown that, on average, a bootstrap

sample of size N contains about 63.2% of the instances in the original data.

Instances that are not included in the bootstrap sample become part of the

test set. The bootstrap procedure for obtaining training and test sets is repeated b times, resulting in a different error rate on the test set, $err(i)$, at the $i^{th}$ run. To obtain the overall error rate, $err_{boot}$, the .632 bootstrap approach combines $err(i)$ with the error rate obtained from a training set containing all the labeled examples, $err_s$, as follows:

$$err_{boot} = \frac{1}{b}\sum_{i=1}^{b}\big(0.632 \times err(i) + 0.368 \times err_s\big). \qquad (3.19)$$

Efron and Tibshirani [140] provided a theoretical and empirical comparison between cross-validation and a bootstrap method known as the .632+ rule.

While the .632 bootstrap method presented above provides a robust estimate of the generalization performance with low variance in its estimate, it may produce misleading results for highly complex models in certain conditions, as demonstrated by Kohavi [155]. This is because the overall error rate is not truly an out-of-sample error estimate as it depends on the training error rate, $err_s$, which can be quite small if there is overfitting.
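The 63.2% figure follows from the fact that an instance is missed by all N draws with probability $(1 - 1/N)^N$, which approaches $e^{-1} \approx 0.368$ as N grows. The sketch below checks this by simulation and evaluates the combination rule of the .632 bootstrap on made-up numbers. It is a minimal illustration using only the Python standard library; the function names and the error values are hypothetical.

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement; indices never drawn form the test set."""
    train_idx = [rng.randrange(n) for _ in range(n)]
    seen = set(train_idx)
    test_idx = [i for i in range(n) if i not in seen]
    return train_idx, test_idx


def err_632(test_errors, err_s):
    """Average of 0.632 * err(i) + 0.368 * err_s over the b bootstrap runs."""
    b = len(test_errors)
    return sum(0.632 * e + 0.368 * err_s for e in test_errors) / b


rng = random.Random(42)
n, b = 1000, 200
fractions = []
for _ in range(b):
    train_idx, test_idx = bootstrap_split(n, rng)
    fractions.append(len(set(train_idx)) / n)
avg_unique = sum(fractions) / b
print(f"average fraction of distinct instances: {avg_unique:.3f}")

# Hypothetical per-run test errors and training error, for illustration only.
estimate = err_632([0.30, 0.28, 0.32], err_s=0.10)
print(f".632 bootstrap estimate: {estimate:.4f}")
```

With a small training error such as the 0.10 used here, the combined estimate is pulled well below the average out-of-sample error of 0.30, which illustrates why the estimate can be optimistic when the model overfits.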

Current techniques such as C4.5 require that the entire training data set fit

into main memory. There has been considerable effort to develop parallel and

scalable versions of decision tree induction algorithms. Some of the proposed

algorithms include SLIQ by Mehta et al. [160], SPRINT by Shafer et al. [183],

CMP by Wang and Zaniolo [192], CLOUDS by Alsabti et al. [123], RainForest

by Gehrke et al. [144], and ScalParC by Joshi et al. [152]. A survey of parallel

algorithms for classification and other data mining tasks is given in [158]. More

recently, there has been extensive research to implement large-scale

classifiers on the compute unified device architecture (CUDA) [131, 134] and

MapReduce [133, 172] platforms.


Bibliography

[120] H. Akaike. Information theory and an extension of the maximum

likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199–

213. Springer, 1998.

[121] D. M. Allen. The relationship between variable selection and data

augmentation and a method for prediction. Technometrics, 16(1):125–127,

1974.

[122] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour. Microarray data

analysis: from disarray to consolidation and consensus. Nature reviews

genetics, 7(1):55–65, 2006.

[123] K. Alsabti, S. Ranka, and V. Singh. CLOUDS: A Decision Tree Classifier

for Large Datasets. In Proc. of the 4th Intl. Conf. on Knowledge Discovery

and Data Mining, pages 2–8, New York, NY, August 1998.

[124] C. Ambroise and G. J. McLachlan. Selection bias in gene extraction on

the basis of microarray gene-expression data. Proceedings of the National

Academy of Sciences, 99(10):6562–6566, 2002.

[125] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford

University Press, Oxford, U.K., 1995.

[126] C. M. Bishop. Pattern Recognition and Machine Learning. Springer,

2006.

[127] L. Breiman, J. H. Friedman, R. Olshen, and C. J. Stone. Classification

and Regression Trees. Chapman & Hall, New York, 1984.

[128] L. A. Breslow and D. W. Aha. Simplifying Decision Trees: A Survey.

Knowledge Engineering Review, 12(1):1–40, 1997.

[129] W. Buntine. Learning classification trees. In Artificial Intelligence

Frontiers in Statistics, pages 182–201. Chapman & Hall, London, 1993.

[130] E. Cantú-Paz and C. Kamath. Using evolutionary algorithms to induce

oblique decision trees. In Proc. of the Genetic and Evolutionary

Computation Conf., pages 1053–1060, San Francisco, CA, 2000.

[131] B. Catanzaro, N. Sundaram, and K. Keutzer. Fast support vector

machine training and classification on graphics processors. In Proceedings

of the 25th International Conference on Machine Learning, pages 104–

111, 2008.

[132] V. Cherkassky and F. M. Mulier. Learning from Data: Concepts, Theory,

and Methods. Wiley, 2nd edition, 2007.

[133] C. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K.

Olukotun. Map-reduce for machine learning on multicore. Advances in

neural information processing systems, 19:281, 2007.

[134] A. Cotter, N. Srebro, and J. Keshet. A GPU-tailored Approach for

Training Kernelized SVMs. In Proceedings of the 17th ACM SIGKDD

International Conference on Knowledge Discovery and Data Mining, pages

805–813, San Diego, California, USA, 2011.

[135] P. Domingos. The Role of Occam’s Razor in Knowledge Discovery. Data

Mining and Knowledge Discovery, 3(4):409–425, 1999.

[136] P. Domingos. A unified bias-variance decomposition. In Proceedings of

17th International Conference on Machine Learning, pages 231–238,

2000.

[137] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John

Wiley & Sons, Inc., New York, 2nd edition, 2001.

[138] S. Dudoit and M. J. van der Laan. Asymptotics of cross-validated risk

estimation in estimator selection and performance assessment. Statistical

Methodology, 2(2):131–154, 2005.

[139] B. Efron. Bootstrap methods: another look at the jackknife. In

Breakthroughs in Statistics, pages 569–593. Springer, 1992.

[140] B. Efron and R. Tibshirani. Cross-validation and the Bootstrap:

Estimating the Error Rate of a Prediction Rule. Technical report, Stanford

University, 1995.

[141] F. Esposito, D. Malerba, and G. Semeraro. A Comparative Analysis of

Methods for Pruning Decision Trees. IEEE Trans. Pattern Analysis and

Machine Intelligence, 19(5):476–491, May 1997.

[142] R. A. Fisher. The use of multiple measurements in taxonomic problems.

Annals of Eugenics, 7:179–188, 1936.

[143] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic

Press, New York, 1990.

[144] J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest—A Framework

for Fast Decision Tree Construction of Large Datasets. Data Mining and

Knowledge Discovery, 4(2/3):127–162, 2000.

[145] S. Geisser. The predictive sample reuse method with applications.

Journal of the American Statistical Association, 70(350):320–328, 1975.

[146] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the

bias/variance dilemma. Neural computation, 4(1):1–58, 1992.

[147] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H.

Witten. The WEKA Data Mining Software: An Update. SIGKDD

Explorations, 11(1), 2009.

[148] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical

Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition,

2009.

[149] D. Heath, S. Kasif, and S. Salzberg. Induction of Oblique Decision

Trees. In Proc. of the 13th Intl. Joint Conf. on Artificial Intelligence, pages

1002–1007, Chambery, France, August 1993.

[150] A. K. Jain, R. P. W. Duin, and J. Mao. Statistical Pattern Recognition: A

Review. IEEE Tran. Patt. Anal. and Mach. Intellig., 22(1):4–37, 2000.

[151] D. Jensen and P. R. Cohen. Multiple Comparisons in Induction

Algorithms. Machine Learning, 38(3):309–338, March 2000.

[152] M. V. Joshi, G. Karypis, and V. Kumar. ScalParC: A New Scalable and

Efficient Parallel Classification Algorithm for Mining Large Datasets. In

Proc. of 12th Intl. Parallel Processing Symp. (IPPS/SPDP), pages 573–

579, Orlando, FL, April 1998.

[153] G. V. Kass. An Exploratory Technique for Investigating Large Quantities

of Categorical Data. Applied Statistics, 29:119–127, 1980.

[154] B. Kim and D. Landgrebe. Hierarchical decision classifiers in high-

dimensional and large class data. IEEE Trans. on Geoscience and Remote

Sensing, 29(4):518–528, 1991.

[155] R. Kohavi. A Study on Cross-Validation and Bootstrap for Accuracy

Estimation and Model Selection. In Proc. of the 15th Intl. Joint Conf. on

Artificial Intelligence, pages 1137–1145, Montreal, Canada, August 1995.

[156] D. Krstajic, L. J. Buturovic, D. E. Leahy, and S. Thomas. Cross-

validation pitfalls when selecting and assessing regression and

classification models. Journal of cheminformatics, 6(1):1, 2014.

[157] S. R. Kulkarni, G. Lugosi, and S. S. Venkatesh. Learning Pattern

Classification—A Survey. IEEE Tran. Inf. Theory, 44(6):2178–2206, 1998.

[158] V. Kumar, M. V. Joshi, E.-H. Han, P. N. Tan, and M. Steinbach. High

Performance Data Mining. In High Performance Computing for

Computational Science (VECPAR 2002), pages 111–125. Springer, 2002.

[159] G. Landeweerd, T. Timmers, E. Gersema, M. Bins, and M. Halic. Binary

tree versus single level tree classification of white blood cells. Pattern

Recognition, 16:571–577, 1983.

[160] M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A Fast Scalable Classifier

for Data Mining. In Proc. of the 5th Intl. Conf. on Extending Database

Technology, pages 18–32, Avignon, France, March 1996.

[161] R. S. Michalski. A theory and methodology of inductive learning. Artificial

Intelligence, 20:111–116, 1983.

[162] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning,

Neural and Statistical Classification. Ellis Horwood, Upper Saddle River,

NJ, 1994.

[163] J. Mingers. Expert Systems—Rule Induction with Statistical Data. J

Operational Research Society, 38:39–47, 1987.

[164] J. Mingers. An empirical comparison of pruning methods for decision

tree induction. Machine Learning, 4:227–243, 1989.

[165] T. Mitchell. Machine Learning. McGraw-Hill, Boston, MA, 1997.

[166] B. M. E. Moret. Decision Trees and Diagrams. Computing Surveys,

14(4):593–623, 1982.

[167] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press,

2012.

[168] S. K. Murthy. Automatic Construction of Decision Trees from Data: A

Multi-Disciplinary Survey. Data Mining and Knowledge Discovery,

2(4):345–389, 1998.

[169] S. K. Murthy, S. Kasif, and S. Salzberg. A system for induction of oblique

decision trees. J of Artificial Intelligence Research, 2:1–33, 1994.

[170] T. Niblett. Constructing decision trees in noisy domains. In Proc. of the

2nd European Working Session on Learning, pages 67–78, Bled,

Yugoslavia, May 1987.

[171] T. Niblett and I. Bratko. Learning Decision Rules in Noisy Domains. In

Research and Development in Expert Systems III, Cambridge, 1986.

Cambridge University Press.

[172] I. Palit and C. K. Reddy. Scalable and parallel boosting with mapreduce.

IEEE Transactions on Knowledge and Data Engineering, 24(10):1904–

1916, 2012.

[173] K. R. Pattipati and M. G. Alexandridis. Application of heuristic search

and information theory to sequential fault diagnosis. IEEE Trans. on

Systems, Man, and Cybernetics, 20(4):872–887, 1990.

[174] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O.

Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas,

A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay.

Scikit-learn: Machine Learning in Python. Journal of Machine Learning

Research, 12:2825–2830, 2011.

[175] J. R. Quinlan. Discovering rules by induction from large collection of

examples. In D. Michie, editor, Expert Systems in the Micro Electronic Age.

Edinburgh University Press, Edinburgh, UK, 1979.

[176] J. R. Quinlan. Simplifying Decision Trees. Intl. J. Man-Machine Studies,

27:221–234, 1987.

[177] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan-Kaufmann

Publishers, San Mateo, CA, 1993.

[178] J. R. Quinlan and R. L. Rivest. Inferring Decision Trees Using the

Minimum Description Length Principle. Information and Computation,

80(3):227–248, 1989.

[179] S. R. Safavian and D. Landgrebe. A Survey of Decision Tree Classifier

Methodology. IEEE Trans. Systems, Man and Cybernetics, 22:660–674,

May/June 1998.

[180] C. Schaffer. Overfitting avoidance as bias. Machine Learning, 10:153–

178, 1993.

[181] J. Schuermann and W. Doster. A decision-theoretic approach in

hierarchical classifier design. Pattern Recognition, 17:359–369, 1984.

[182] G. Schwarz. Estimating the dimension of a model. The Annals of

Statistics, 6(2):461–464, 1978.

[183] J. C. Shafer, R. Agrawal, and M. Mehta. SPRINT: A Scalable Parallel

Classifier for Data Mining. In Proc. of the 22nd VLDB Conf., pages 544–

555, Bombay, India, September 1996.

[184] M. Stone. Cross-validatory choice and assessment of statistical

predictions. Journal of the Royal Statistical Society. Series B

(Methodological), pages 111–147, 1974.

[185] R. J. Tibshirani and R. Tibshirani. A bias correction for the minimum

error rate in cross-validation. The Annals of Applied Statistics, pages 822–

829, 2009.

[186] I. Tsamardinos, A. Rakhshani, and V. Lagani. Performance-estimation

properties of cross-validation-based protocols with simultaneous hyper-

parameter optimization. In Hellenic Conference on Artificial Intelligence,

pages 1–14. Springer, 2014.

[187] P. E. Utgoff and C. E. Brodley. An incremental method for finding

multivariate splits for decision trees. In Proc. of the 7th Intl. Conf. on

Machine Learning, pages 58–65, Austin, TX, June 1990.

[188] L. Valiant. A theory of the learnable. Communications of the ACM,

27(11):1134–1142, 1984.

[189] V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.

[190] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of

relative frequencies of events to their probabilities. In Measures of

Complexity, pages 11–30. Springer, 2015.

[191] S. Varma and R. Simon. Bias in error estimation when using cross-

validation for model selection. BMC bioinformatics, 7(1):1, 2006.

[192] H. Wang and C. Zaniolo. CMP: A Fast Decision Tree Classifier Using

Multivariate Predictions. In Proc. of the 16th Intl. Conf. on Data

Engineering, pages 449–460, San Diego, CA, March 2000.

[193] Q. R. Wang and C. Y. Suen. Large tree classifier with heuristic search

and global training. IEEE Trans. on Pattern Analysis and Machine

Intelligence, 9(1):91–102, 1987.

[194] Y. Zhang and Y. Yang. Cross-validation for selecting a model selection

procedure. Journal of Econometrics, 187(1):95–112, 2015.

3.11 Exercises

1. Draw the full decision tree for the parity function of four Boolean attributes,

A, B, C, and D. Is it possible to simplify the tree?

2. Consider the training examples shown in Table 3.5 for a binary

classification problem.

Table 3.5. Data set for Exercise 2.

Customer ID   Gender   Car Type   Shirt Size    Class
1             M        Family     Small         C0
2             M        Sports     Medium        C0
3             M        Sports     Medium        C0
4             M        Sports     Large         C0
5             M        Sports     Extra Large   C0
6             M        Sports     Extra Large   C0
7             F        Sports     Small         C0
8             F        Sports     Small         C0
9             F        Sports     Medium        C0
10            F        Luxury     Large         C0
11            M        Family     Large         C1
12            M        Family     Extra Large   C1
13            M        Family     Medium        C1
14            M        Luxury     Extra Large   C1
15            F        Luxury     Small         C1
16            F        Luxury     Small         C1
17            F        Luxury     Medium        C1
18            F        Luxury     Medium        C1
19            F        Luxury     Medium        C1
20            F        Luxury     Large         C1

a. Compute the Gini index for the overall collection of training examples.

b. Compute the Gini index for the Customer ID attribute.

c. Compute the Gini index for the Gender attribute.

d. Compute the Gini index for the Car Type attribute using multiway split.

e. Compute the Gini index for the Shirt Size attribute using multiway split.

f. Which attribute is better, Gender, Car Type, or Shirt Size?

g. Explain why Customer ID should not be used as the attribute test

condition even though it has the lowest Gini.
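Hand computations of this kind can be checked with a small helper. The following is a minimal sketch using only the Python standard library; the function names and the toy data are hypothetical and deliberately not taken from Table 3.5.

```python
from collections import Counter


def gini(labels):
    """Gini index of a collection of class labels: 1 - sum_i p_i^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_split(values, labels):
    """Weighted Gini index of the multiway split induced by an attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return sum(len(g) / n * gini(g) for g in groups.values())


# Hypothetical toy data, for illustration only.
color = ["red", "red", "blue", "blue", "blue", "green"]
label = ["C0", "C0", "C0", "C1", "C1", "C1"]

print(gini(label))              # 0.5 for a 3/3 class balance
print(gini_split(color, label))

# An attribute with a unique value per instance yields single-instance
# partitions, so its weighted Gini index is always zero.
ids = list(range(len(label)))
print(gini_split(ids, label))   # 0.0
```

The last call shows that an identifier-like attribute always attains the minimum possible Gini index, regardless of the labels.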

3. Consider the training examples shown in Table 3.6 for a binary

classification problem.

Table 3.6. Data set for Exercise 3.

Instance   a1   a2   a3    Target Class
1          T    T    1.0   +
2          T    T    6.0   +
3          T    F    5.0   −
4          F    F    4.0   +
5          F    T    7.0   −
6          F    T    3.0   −
7          F    F    8.0   −
8          T    F    7.0   +
9          F    T    5.0   −

a. What is the entropy of this collection of training examples with respect to the class attribute?

b. What are the information gains of $a_1$ and $a_2$ relative to these training examples?

c. For $a_3$, which is a continuous attribute, compute the information gain for every possible split.

d. What is the best split (among $a_1$, $a_2$, and $a_3$) according to the information gain?

e. What is the best split (between $a_1$ and $a_2$) according to the misclassification error rate?

f. What is the best split (between $a_1$ and $a_2$) according to the Gini index?
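Entropy and information gain, including candidate splits on a continuous attribute, can be computed with a helper like the one below. This is a minimal sketch using only the Python standard library. It assumes that candidate thresholds are the midpoints between consecutive distinct values, which is one common convention; the function names and the toy data are hypothetical and not taken from Table 3.6.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Parent entropy minus the weighted entropy of the partitions by value."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    children = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - children


def candidate_thresholds(values):
    """Midpoints between consecutive distinct values of a continuous attribute."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


def best_threshold(values, labels):
    """Threshold t maximizing the information gain of the binary test v <= t."""
    return max(
        candidate_thresholds(values),
        key=lambda t: information_gain([v <= t for v in values], labels),
    )


# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = ["+", "+", "-", "-"]
t = best_threshold(x, y)
gain = information_gain([v <= t for v in x], y)
print(candidate_thresholds(x))  # [1.5, 2.5, 3.5]
print(t, gain)                  # 2.5 1.0
```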

4. Show that the entropy of a node never increases after splitting it into

smaller successor nodes.

5. Consider the following data set for a binary class problem.

A   B   Class Label
T   F   +
T   T   +
T   T   +
T   F   −
T   T   +
F   F   −
F   F   −
F   F   −
T   T   −
T   F   −

a. Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose?

b. Calculate the gain in the Gini index when splitting on A and B. Which attribute would the decision tree induction algorithm choose?

c. Figure 3.11 shows that entropy and the Gini index are both monotonically increasing on the range [0, 0.5] and they are both monotonically decreasing on the range [0.5, 1]. Is it possible that information gain and the gain in the Gini index favor different attributes? Explain.

6. Consider splitting a parent node P into two child nodes, $C_1$ and $C_2$, using some attribute test condition. The composition of labeled training instances at every node is summarized in the table below.

          P   C1   C2
Class 0   7   3    4
Class 1   3   0    3

a. Calculate the Gini index and misclassification error rate of the parent node P.

b. Calculate the weighted Gini index of the child nodes. Would you consider this attribute test condition if Gini is used as the impurity measure?

c. Calculate the weighted misclassification rate of the child nodes. Would you consider this attribute test condition if misclassification rate is used as the impurity measure?

7. Consider the following set of training examples.

X   Y   Z   No. of Class C1 Examples   No. of Class C2 Examples
0   0   0   5                          40
0   0   1   0                          15
0   1   0   10                         5
0   1   1   45                         0
1   0   0   10                         5
1   0   1   25                         0
1   1   0   5                          20
1   1   1   0                          15

a. Compute a two-level decision tree using the greedy approach described

in this chapter. Use the classification error rate as the criterion for

splitting. What is the overall error rate of the induced tree?

b. Repeat part (a) using X as the first splitting attribute and then choose the

best remaining attribute for splitting at each of the two successor nodes.

What is the error rate of the induced tree?

c. Compare the results of parts (a) and (b). Comment on the suitability of

the greedy heuristic used for splitting attribute selection.

8. The following table summarizes a data set with three attributes A, B, C and two class labels +, −. Build a two-level decision tree.

A   B   C   Number of Instances
            +     −
T   T   T   5     0
F   T   T   0     20
T   F   T   20    0
F   F   T   0     5
T   T   F   0     0
F   T   F   25    0
T   F   F   0     0
F   F   F   0     25

a. According to the classification error rate, which attribute would be chosen

as the first splitting attribute? For each attribute, show the contingency

table and the gains in classification error rate.

b. Repeat for the two children of the root node.

c. How many instances are misclassified by the resulting decision tree?

d. Repeat parts (a), (b), and (c) using C as the splitting attribute.

e. Use the results in parts (c) and (d) to conclude about the greedy nature of

the decision tree induction algorithm.

9. Consider the decision tree shown in Figure 3.36.

Figure 3.36. Decision tree and data sets for Exercise 9.

a. Compute the generalization error rate of the tree using the optimistic

approach.

b. Compute the generalization error rate of the tree using the pessimistic

approach. (For simplicity, use the strategy of adding a factor of 0.5 to

each leaf node.)

c. Compute the generalization error rate of the tree using the validation set

shown above. This approach is known as reduced error pruning.

10. Consider the decision trees shown in Figure 3.37. Assume they are generated from a data set that contains 16 binary attributes and 3 classes, $C_1$, $C_2$, and $C_3$.

Figure 3.37. Decision trees for Exercise 10.

Compute the total description length of each decision tree according to the following formulation of the minimum description length principle.

- The total description length of a tree is given by

  $$Cost(tree, data) = Cost(tree) + Cost(data|tree).$$

- Each internal node of the tree is encoded by the ID of the splitting attribute. If there are m attributes, the cost of encoding each attribute is $\log_2 m$ bits.

- Each leaf is encoded using the ID of the class it is associated with. If there are k classes, the cost of encoding a class is $\log_2 k$ bits.

- $Cost(tree)$ is the cost of encoding all the nodes in the tree. To simplify the computation, you can assume that the total cost of the tree is obtained by adding up the costs of encoding each internal node and each leaf node.

- $Cost(data|tree)$ is encoded using the classification errors the tree commits on the training set. Each error is encoded by $\log_2 n$ bits, where n is the total number of training instances.

Which decision tree is better, according to the MDL principle?
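The description-length formulation above reduces to a three-term sum once the number of internal nodes, leaves, and training errors of a tree are known. The sketch below is a minimal illustration; the function name and the example tree are hypothetical, and the values are chosen so that every logarithm is an integer. They are not the trees of Figure 3.37.

```python
from math import log2


def mdl_cost(n_internal, n_leaves, n_errors, m, k, n):
    """Cost(tree, data) = Cost(tree) + Cost(data|tree), in bits.

    m: number of attributes, k: number of classes,
    n: number of training instances.
    """
    cost_tree = n_internal * log2(m) + n_leaves * log2(k)
    cost_data = n_errors * log2(n)
    return cost_tree + cost_data


# Hypothetical tree: 2 internal nodes, 3 leaves, 4 training errors,
# with m = 8 attributes, k = 2 classes, and n = 64 training instances.
print(mdl_cost(2, 3, 4, m=8, k=2, n=64))  # 2*3 + 3*1 + 4*6 = 33.0
```

Under this formulation, a larger tree is preferred only if the reduction in the error term outweighs the extra bits spent encoding its nodes.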

11. This exercise, inspired by the discussions in [155], highlights one of the

known limitations of the leave-one-out model evaluation procedure. Let us

consider a data set containing 50 positive and 50 negative instances, where

the attributes are purely random and contain no information about the class

labels. Hence, the generalization error rate of any classification model learned

over this data is expected to be 0.5. Let us consider a classifier that assigns

the majority class label of training instances (ties resolved by using the

positive label as the default class) to any test instance, irrespective of its

attribute values. We call this approach the majority inducer classifier.

Determine the error rate of this classifier using the following methods.

a. Leave-one-out.

b. 2-fold stratified cross-validation, where the proportion of class labels at

every fold is kept the same as that of the overall data.

c. From the results above, which method provides a more reliable

evaluation of the classifier’s generalization error rate?

12. Consider a labeled data set containing 100 data instances, which is

randomly partitioned into two sets A and B, each containing 50 instances. We

use A as the training set to learn two decision trees, $T_{10}$ with 10 leaf nodes and $T_{100}$ with 100 leaf nodes. The accuracies of the two decision trees on data sets A and B are shown in Table 3.7.

Table 3.7. Comparing the test accuracy of decision trees $T_{10}$ and $T_{100}$.

| Data Set | Accuracy of $T_{10}$ | Accuracy of $T_{100}$ |
|---|---|---|
| A | 0.86 | 0.97 |
| B | 0.84 | 0.77 |

a. Based on the accuracies shown in Table 3.7 , which classification

model would you expect to have better performance on unseen

instances?

b. Now, you tested $T_{10}$ and $T_{100}$ on the entire data set $(A+B)$ and found that the classification accuracy of $T_{10}$ on data set $(A+B)$ is 0.85, whereas the classification accuracy of $T_{100}$ on the data set $(A+B)$ is 0.87. Based

on this new information and your observations from Table 3.7 , which

classification model would you finally choose for classification?

13. Consider the following approach for testing whether a classifier A beats another classifier B. Let N be the size of a given data set, $p_A$ be the accuracy of classifier A, $p_B$ be the accuracy of classifier B, and $p = (p_A + p_B)/2$ be the average accuracy for both classifiers. To test whether classifier A is significantly better than B, the following Z-statistic is used:

$$Z = \frac{p_A - p_B}{\sqrt{\dfrac{2p(1-p)}{N}}}.$$

Classifier A is assumed to be better than classifier B if $Z > 1.96$.

Table 3.8 compares the accuracies of three different classifiers, decision tree classifiers, naïve Bayes classifiers, and support vector machines, on various data sets. (The latter two classifiers are described in Chapter 4.)

Summarize the performance of the classifiers given in Table 3.8 using the following $3 \times 3$ table:

| win-loss-draw | Decision tree | Naïve Bayes | Support vector machine |
|---|---|---|---|
| Decision tree | 0 – 0 – 23 | | |
| Naïve Bayes | | 0 – 0 – 23 | |
| Support vector machine | | | 0 – 0 – 23 |

Table 3.8. Comparing the accuracy of various classification methods.

| Data Set | Size (N) | Decision Tree (%) | Naïve Bayes (%) | Support vector machine (%) |
|---|---|---|---|---|
| Anneal | 898 | 92.09 | 79.62 | 87.19 |
| Australia | 690 | 85.51 | 76.81 | 84.78 |
| Auto | 205 | 81.95 | 58.05 | 70.73 |
| Breast | 699 | 95.14 | 95.99 | 96.42 |
| Cleve | 303 | 76.24 | 83.50 | 84.49 |
| Credit | 690 | 85.80 | 77.54 | 85.07 |
| Diabetes | 768 | 72.40 | 75.91 | 76.82 |
| German | 1000 | 70.90 | 74.70 | 74.40 |
| Glass | 214 | 67.29 | 48.59 | 59.81 |
| Heart | 270 | 80.00 | 84.07 | 83.70 |
| Hepatitis | 155 | 81.94 | 83.23 | 87.10 |
| Horse | 368 | 85.33 | 78.80 | 82.61 |
| Ionosphere | 351 | 89.17 | 82.34 | 88.89 |
| Iris | 150 | 94.67 | 95.33 | 96.00 |
| Labor | 57 | 78.95 | 94.74 | 92.98 |
| Led7 | 3200 | 73.34 | 73.16 | 73.56 |
| Lymphography | 148 | 77.03 | 83.11 | 86.49 |
| Pima | 768 | 74.35 | 76.04 | 76.95 |
| Sonar | 208 | 78.85 | 69.71 | 76.92 |
| Tic-tac-toe | 958 | 83.72 | 70.04 | 98.33 |
| Vehicle | 846 | 71.04 | 45.04 | 74.94 |
| Wine | 178 | 94.38 | 96.63 | 98.88 |
| Zoo | 101 | 93.07 | 93.07 | 96.04 |

Each cell in the win-loss-draw table contains the number of wins, losses, and draws when comparing the classifier in a given row to the classifier in a given column.
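As an aid to reading the Z-statistic in Exercise 13, the following minimal sketch evaluates it for a single pair of accuracies. The function names `z_statistic` and `compare` are illustrative, and counting Z < −1.96 as a loss is an assumption made here by symmetry; the exercise itself only defines the winning condition.

```python
import math

def z_statistic(acc_a: float, acc_b: float, n: int) -> float:
    """Z = (pA - pB) / sqrt(2 p (1 - p) / N), where p is the average accuracy."""
    p = (acc_a + acc_b) / 2
    std_err = math.sqrt(2 * p * (1 - p) / n)
    if std_err == 0:
        # Happens only when both accuracies are 0 or both are 1, so there is no difference.
        return 0.0
    return (acc_a - acc_b) / std_err

def compare(acc_a: float, acc_b: float, n: int, threshold: float = 1.96) -> str:
    """Outcome for classifier A against classifier B on one data set."""
    z = z_statistic(acc_a, acc_b, n)
    if z > threshold:
        return "win"
    if z < -threshold:      # assumption: the test is applied symmetrically
        return "loss"
    return "draw"

# Accuracies are given as fractions, e.g. 92.09% becomes 0.9209 (Anneal row, N = 898).
print(compare(0.9209, 0.7962, 898))
```

Accuracies from Table 3.8 must be divided by 100 before being passed in, since the formula treats them as proportions.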

14. Let X be a binomial random variable with mean $Np$ and variance $Np(1-p)$. Show that the ratio X/N also has a binomial distribution with mean p and variance $p(1-p)/N$.

4 Classification: Alternative Techniques

The previous chapter introduced the classification

problem and presented a technique known as the

decision tree classifier. Issues such as model overfitting

and model evaluation were also discussed. This

chapter presents alternative techniques for building

classification models—from simple techniques such as

rule-based and nearest neighbor classifiers to more

sophisticated techniques such as artificial neural

networks, deep learning, support vector machines, and

ensemble methods. Other practical issues such as the

class imbalance and multiclass problems are also

discussed at the end of the chapter.

4.1 Types of Classifiers

Before presenting specific techniques, we first categorize the different types of

classifiers available. One way to distinguish classifiers is by considering the

characteristics of their output.

Binary versus Multiclass

Binary classifiers assign each data instance to one of two possible labels, typically denoted as $+1$ and $-1$. The positive class usually refers to the category we are more interested in predicting correctly compared to the negative class (e.g., the spam category in email classification problems). If

there are more than two possible labels available, then the technique is known

as a multiclass classifier. As some classifiers were designed for binary classes

only, they must be adapted to deal with multiclass problems. Techniques for

transforming binary classifiers into multiclass classifiers are described in

Section 4.12 .

Deterministic versus Probabilistic

A deterministic classifier assigns a discrete-valued label to each data instance it classifies, whereas a probabilistic classifier assigns a continuous score between 0 and 1 to indicate how likely it is that an instance belongs to a particular class, where the probability scores for all the classes sum up to 1. Some examples of probabilistic classifiers include the naïve Bayes classifier, Bayesian networks, and logistic regression. Probabilistic classifiers provide more information about the confidence in assigning an instance to a class than deterministic classifiers. A data instance is typically assigned to the class with the highest probability score, except when the cost of misclassifying the

class with lower probability is significantly higher. We will discuss the topic of

cost-sensitive classification with probabilistic outputs in Section 4.11.2 .

Another way to distinguish the different types of classifiers is based on their

technique for discriminating instances from different classes.

Linear versus Nonlinear

A linear classifier uses a linear separating hyperplane to discriminate instances from different classes, whereas a nonlinear classifier enables the construction of more complex, nonlinear decision surfaces. We illustrate an example of a linear classifier (perceptron) and its nonlinear counterpart (multi-layer neural network) in Section 4.7. Although the linearity assumption makes the model less flexible in terms of fitting complex data, it also makes linear classifiers less susceptible to model overfitting than nonlinear classifiers. Furthermore, one can transform the original set of attributes, $\mathbf{x} = (x_1, x_2, \cdots, x_d)$, into a more complex feature set, e.g., $\Phi(\mathbf{x}) = (x_1, x_2, x_1 x_2, x_1^2, x_2^2, \cdots)$, before applying the linear classifier. Such feature

transformation allows the linear classifier to fit data sets with nonlinear

decision surfaces (see Section 4.9.4 ).
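The effect of such a feature transformation can be sketched with a two-attribute example. The weights below are hand-picked for illustration rather than learned, and the circular class boundary is a hypothetical choice: a classifier that is linear in the transformed features separates points inside the unit circle from points outside it, which no straight line in the original attributes can do.

```python
def phi(x1: float, x2: float) -> tuple:
    """Quadratic feature map: (x1, x2, x1*x2, x1^2, x2^2)."""
    return (x1, x2, x1 * x2, x1 ** 2, x2 ** 2)

def linear_classifier(features, weights, bias) -> int:
    """Predict +1 if w . features + b >= 0, otherwise -1."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score >= 0 else -1

# Hand-picked (not learned) weights: score = 1 - x1^2 - x2^2,
# which is nonnegative exactly on and inside the unit circle.
WEIGHTS = (0.0, 0.0, 0.0, -1.0, -1.0)
BIAS = 1.0

for point in [(0.5, 0.5), (1.5, 0.0), (-0.2, 0.1), (-1.0, -1.0)]:
    label = linear_classifier(phi(*point), WEIGHTS, BIAS)
    print(point, label)
```

The decision rule is still a single weighted sum plus a bias; only the input representation has changed.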

Global versus Local

A global classifier fits a single model to the entire data set. Unless the model

is highly nonlinear, this one-size-fits-all strategy may not be effective when the

relationship between the attributes and the class labels varies over the input

space. In contrast, a local classifier partitions the input space into smaller

regions and fits a distinct model to training instances in each region. The k-nearest neighbor classifier (see Section 4.3) is a classic example of a local classifier. While local classifiers are more flexible in terms of fitting complex decision boundaries, they are also more susceptible to the model overfitting

problem, especially when the local regions contain few training examples.

Generative versus Discriminative

Given a data instance $\mathbf{x}$, the primary objective of any classifier is to predict

the class label, y, of the data instance. However, apart from predicting the

class label, we may also be interested in describing the underlying

mechanism that generates the instances belonging to every class label. For

example, in the process of classifying spam email messages, it may be useful

to understand the typical characteristics of email messages that are labeled

as spam, e.g., specific usage of keywords in the subject or the body of the

email. Classifiers that learn a generative model of every class in the process

of predicting class labels are known as generative classifiers. Some examples

of generative classifiers include the naïve Bayes classifier and Bayesian

networks. In contrast, discriminative classifiers directly predict the class labels

without explicitly describing the distribution of every class label. They solve a

simpler problem than generative models since they do not have the onus of

deriving insights about the generative mechanism of data instances. They are

thus sometimes preferred over generative models, especially when it is not

crucial to obtain information about the properties of every class. Some

examples of discriminative classifiers include decision trees, rule-based classifiers, nearest neighbor classifiers, artificial neural networks, and support vector machines.

4.2 Rule-Based Classifier

A rule-based classifier uses a collection of “if …then…” rules (also known as a

rule set) to classify data instances. Table 4.1 shows an example of a rule

set generated for the vertebrate classification problem described in the

previous chapter. Each classification rule in the rule set can be expressed in the following way:

$$r_i: (\text{Condition}_i) \rightarrow y_i. \tag{4.1}$$

The left-hand side of the rule is called the rule antecedent or precondition. It contains a conjunction of attribute test conditions:

$$\text{Condition}_i = (A_1 \text{ op } v_1) \wedge (A_2 \text{ op } v_2) \wedge \ldots \wedge (A_k \text{ op } v_k), \tag{4.2}$$

where $(A_j, v_j)$ is an attribute-value pair and op is a comparison operator chosen from the set $\{=, \neq, <, >, \leq, \geq\}$. Each attribute test $(A_j \text{ op } v_j)$ is also known as a conjunct. The right-hand side of the rule is called the rule consequent, which contains the predicted class $y_i$.

A rule r covers a data instance x if the precondition of r matches the attributes of x. r is also said to be fired or triggered whenever it covers a given instance. For an illustration, consider the rule $r_1$ given in Table 4.1 and the following attributes for two vertebrates: hawk and grizzly bear.

Table 4.1. Example of a rule set for the vertebrate classification problem.

r1: (Gives Birth=no)∧(Aerial Creature=yes)→Birds
r2: (Gives Birth=no)∧(Aquatic Creature=yes)→Fishes
r3: (Gives Birth=yes)∧(Body Temperature=warm-blooded)→Mammals
r4: (Gives Birth=no)∧(Aerial Creature=no)→Reptiles
r5: (Aquatic Creature=semi)→Amphibians

| Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates |
|---|---|---|---|---|---|---|---|
| hawk | warm-blooded | feather | no | no | yes | yes | no |
| grizzly bear | warm-blooded | fur | yes | no | no | yes | yes |

$r_1$ covers the first vertebrate because its precondition is satisfied by the hawk’s attributes. The rule does not cover the second vertebrate because grizzly bears give birth to their young and cannot fly, thus violating the precondition of $r_1$.

The quality of a classification rule can be evaluated using measures such as coverage and accuracy. Given a data set D and a classification rule $r: A \rightarrow y$, the coverage of the rule is the fraction of instances in D that trigger the rule r. On the other hand, its accuracy or confidence factor is the fraction of instances triggered by r whose class labels are equal to y. The formal definitions of these measures are

$$\text{Coverage}(r) = \frac{|A|}{|D|}, \qquad \text{Accuracy}(r) = \frac{|A \cap y|}{|A|}, \tag{4.3}$$

where $|A|$ is the number of instances that satisfy the rule antecedent, $|A \cap y|$ is the number of instances that satisfy both the antecedent and consequent, and $|D|$ is the total number of instances.

Example 4.1.

Consider the data set shown in Table 4.2. The rule

(Gives Birth=yes)∧(Body Temperature=warm-blooded)→Mammals

has a coverage of 33% since five of the fifteen instances support the rule antecedent. The rule accuracy is 100% because all five vertebrates covered by the rule are mammals.

Table 4.2. The vertebrate data set.

| Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class Label |
|---|---|---|---|---|---|---|---|---|
| human | warm-blooded | hair | yes | no | no | yes | no | Mammals |
| python | cold-blooded | scales | no | no | no | no | yes | Reptiles |
| salmon | cold-blooded | scales | no | yes | no | no | no | Fishes |
| whale | warm-blooded | hair | yes | yes | no | no | no | Mammals |
| frog | cold-blooded | none | no | semi | no | yes | yes | Amphibians |
| komodo dragon | cold-blooded | scales | no | no | no | yes | no | Reptiles |
| bat | warm-blooded | hair | yes | no | yes | yes | yes | Mammals |
| pigeon | warm-blooded | feathers | no | no | yes | yes | no | Birds |
| cat | warm-blooded | fur | yes | no | no | yes | no | Mammals |
| guppy | cold-blooded | scales | yes | yes | no | no | no | Fishes |
| alligator | cold-blooded | scales | no | semi | no | yes | no | Reptiles |
| penguin | warm-blooded | feathers | no | semi | no | yes | no | Birds |
| porcupine | warm-blooded | quills | yes | no | no | yes | yes | Mammals |
| eel | cold-blooded | scales | no | yes | no | no | no | Fishes |
| salamander | cold-blooded | none | no | semi | no | yes | yes | Amphibians |
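The coverage and accuracy reported in Example 4.1 can be checked with a short sketch. Only the columns needed by the rule are kept from the vertebrate data set (all fifteen vertebrates, including the salamander), and the helper names are illustrative.

```python
# (name, body temperature, gives birth, class label) for the vertebrate data set
ROWS = [
    ("human", "warm-blooded", "yes", "Mammals"),
    ("python", "cold-blooded", "no", "Reptiles"),
    ("salmon", "cold-blooded", "no", "Fishes"),
    ("whale", "warm-blooded", "yes", "Mammals"),
    ("frog", "cold-blooded", "no", "Amphibians"),
    ("komodo dragon", "cold-blooded", "no", "Reptiles"),
    ("bat", "warm-blooded", "yes", "Mammals"),
    ("pigeon", "warm-blooded", "no", "Birds"),
    ("cat", "warm-blooded", "yes", "Mammals"),
    ("guppy", "cold-blooded", "yes", "Fishes"),
    ("alligator", "cold-blooded", "no", "Reptiles"),
    ("penguin", "warm-blooded", "no", "Birds"),
    ("porcupine", "warm-blooded", "yes", "Mammals"),
    ("eel", "cold-blooded", "no", "Fishes"),
    ("salamander", "cold-blooded", "no", "Amphibians"),
]

def coverage_and_accuracy(rows, antecedent, label):
    """Coverage = |A| / |D| and accuracy = |A and y| / |A| for the rule antecedent -> label."""
    covered = [row for row in rows if antecedent(row)]
    if not covered:
        return 0.0, 0.0
    correct = [row for row in covered if row[3] == label]
    return len(covered) / len(rows), len(correct) / len(covered)

def gives_birth_and_warm_blooded(row):
    """Antecedent (Gives Birth=yes) AND (Body Temperature=warm-blooded)."""
    return row[2] == "yes" and row[1] == "warm-blooded"

coverage, accuracy = coverage_and_accuracy(ROWS, gives_birth_and_warm_blooded, "Mammals")
print(f"coverage={coverage:.2f}, accuracy={accuracy:.2f}")  # coverage=0.33, accuracy=1.00
```

The guppy gives birth but is cold-blooded, so it is not covered, which is why the rule reaches 100% accuracy.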

4.2.1 How a Rule-Based Classifier Works

A rule-based classifier classifies a test instance based on the rule triggered by

the instance. To illustrate how a rule-based classifier works, consider the rule

set shown in Table 4.1 and the following vertebrates:

| Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates |
|---|---|---|---|---|---|---|---|
| lemur | warm-blooded | fur | yes | no | no | yes | yes |
| turtle | cold-blooded | scales | no | semi | no | yes | no |
| dogfish shark | cold-blooded | scales | yes | yes | no | no | no |

The first vertebrate, which is a lemur, is warm-blooded and gives birth to its young. It triggers the rule $r_3$, and thus, is classified as a mammal.

The second vertebrate, which is a turtle, triggers the rules $r_4$ and $r_5$. Since the classes predicted by the rules are contradictory (reptiles versus amphibians), their conflicting classes must be resolved.

None of the rules are applicable to a dogfish shark. In this case, we need to determine what class to assign to such a test instance.

4.2.2 Properties of a Rule Set

The rule set generated by a rule-based classifier can be characterized by the

following two properties.

Definition 4.1 (Mutually Exclusive Rule Set).

The rules in a rule set R are mutually exclusive if no two rules in

R are triggered by the same instance. This property ensures that

every instance is covered by at most one rule in R.

Definition 4.2 (Exhaustive Rule Set).

A rule set R has exhaustive coverage if there is a rule for each

combination of attribute values. This property ensures that every

instance is covered by at least one rule in R.

Table 4.3. Example of a mutually exclusive and exhaustive rule set.

r1: (Body Temperature=cold-blooded)→Non-mammals
r2: (Body Temperature=warm-blooded)∧(Gives Birth=yes)→Mammals
r3: (Body Temperature=warm-blooded)∧(Gives Birth=no)→Non-mammals

Together, these two properties ensure that every instance is covered by exactly one rule. An example of a mutually exclusive and exhaustive rule set is shown in Table 4.3. Unfortunately, many rule-based classifiers, including the one shown in Table 4.1, do not have such properties. If the rule set is not exhaustive, then a default rule, $r_d: () \rightarrow y_d$, must be added to cover the remaining cases. A default rule has an empty antecedent and is triggered when all other rules have failed. $y_d$ is known as the default class and is typically assigned to the majority class of training instances not covered by the existing rules. If the rule set is not mutually exclusive, then an instance can be covered by more than one rule, some of which may predict conflicting classes.

Definition 4.3 (Ordered Rule Set).

The rules in an ordered rule set R are ranked in decreasing order of their priority. An ordered rule set is also known as a decision list.

The rank of a rule can be defined in many ways, e.g., based on its accuracy or total description length. When a test instance is presented, it will be classified by the highest-ranked rule that covers the instance. This avoids the problem of having conflicting classes predicted by multiple classification rules if the rule set is not mutually exclusive.

An alternative way to handle a non-mutually exclusive rule set without

ordering the rules is to consider the consequent of each rule triggered by a

test instance as a vote for a particular class. The votes are then tallied to

determine the class label of the test instance. The instance is usually

assigned to the class that receives the highest number of votes. The vote may

also be weighted by the rule’s accuracy. Using unordered rules to build a rule-

based classifier has both advantages and disadvantages. Unordered rules are

less susceptible to errors caused by the wrong rule being selected to classify

a test instance unlike classifiers based on ordered rules, which are sensitive

to the choice of rule-ordering criteria. Model building is also less expensive

because the rules do not need to be kept in sorted order. Nevertheless,

classifying a test instance can be quite expensive because the attributes of

the test instance must be compared against the precondition of every rule in

the rule set.
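The two ways of handling a rule set that is neither mutually exclusive nor exhaustive can be sketched on the rules of Table 4.1 and the lemur, turtle, and dogfish shark from Section 4.2.1. The attribute keys, the helper names, and the placeholder default class are assumptions made for this illustration; in practice the default class would be chosen from the training data as described above.

```python
from collections import Counter

# Rules of Table 4.1, listed in decision-list order: (name, antecedent, class)
RULES = [
    ("r1", {"gives_birth": "no", "aerial": "yes"}, "Birds"),
    ("r2", {"gives_birth": "no", "aquatic": "yes"}, "Fishes"),
    ("r3", {"gives_birth": "yes", "body_temp": "warm-blooded"}, "Mammals"),
    ("r4", {"gives_birth": "no", "aerial": "no"}, "Reptiles"),
    ("r5", {"aquatic": "semi"}, "Amphibians"),
]

ANIMALS = {
    "lemur": {"body_temp": "warm-blooded", "gives_birth": "yes",
              "aquatic": "no", "aerial": "no"},
    "turtle": {"body_temp": "cold-blooded", "gives_birth": "no",
               "aquatic": "semi", "aerial": "no"},
    "dogfish shark": {"body_temp": "cold-blooded", "gives_birth": "yes",
                      "aquatic": "yes", "aerial": "no"},
}

DEFAULT_CLASS = "default"  # placeholder for the default class y_d

def triggered(instance):
    """Names and classes of all rules whose antecedent matches the instance."""
    return [(name, label) for name, antecedent, label in RULES
            if all(instance[attr] == value for attr, value in antecedent.items())]

def classify_ordered(instance):
    """Decision list: the highest-ranked triggered rule decides, else the default rule."""
    hits = triggered(instance)
    return hits[0][1] if hits else DEFAULT_CLASS

def tally_votes(instance):
    """Unordered rules: every triggered rule casts one vote for its class."""
    return Counter(label for _, label in triggered(instance))

for name, attrs in ANIMALS.items():
    print(name, [r for r, _ in triggered(attrs)], classify_ordered(attrs), dict(tally_votes(attrs)))
```

For the turtle, the ordered scheme returns the class of r4 because it is ranked above r5, while the unweighted vote is tied and would need a tie-breaker such as weighting each vote by rule accuracy.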

In the next two sections, we present techniques for extracting an ordered rule

set from data. A rule-based classifier can be constructed using (1) direct

methods, which extract classification rules directly from data, and (2) indirect

methods, which extract classification rules from more complex classification

models, such as decision trees and neural networks. Detailed discussions of

these methods are presented in Sections 4.2.3 and 4.2.4 , respectively.

4.2.3 Direct Methods for Rule Extraction

To illustrate the direct method, we consider a widely used rule induction

algorithm called RIPPER. This algorithm scales almost linearly with the

number of training instances and is particularly suited for building models from

data sets with imbalanced class distributions. RIPPER also works well with

noisy data because it uses a validation set to prevent model overfitting.

RIPPER uses the sequential covering algorithm to extract rules directly from

data. Rules are grown in a greedy fashion one class at a time. For binary

class problems, RIPPER chooses the majority class as its default class and

learns the rules to detect instances from the minority class. For multiclass problems, the classes are ordered according to their prevalence in the training set. Let $(y_1, y_2, \ldots, y_c)$ be the ordered list of classes, where $y_1$ is the least prevalent class and $y_c$ is the most prevalent class. All training instances that belong to $y_1$ are initially labeled as positive examples, while those that belong to other classes are labeled as negative examples. The sequential covering algorithm learns a set of rules to discriminate the positive from negative examples. Next, all training instances from $y_2$ are labeled as positive, while those from classes $y_3, y_4, \cdots, y_c$ are labeled as negative. The sequential covering algorithm would learn the next set of rules to distinguish $y_2$ from other remaining classes. This process is repeated until we are left with only one class, $y_c$, which is designated as the default class.

Algorithm 4.1. Sequential covering algorithm.

A summary of the sequential covering algorithm is shown in Algorithm 4.1 .

The algorithm starts with an empty decision list, R, and extracts rules for each

class based on the ordering specified by the class prevalence. It iteratively

extracts the rules for a given class y using the Learn-One-Rule function. Once

such a rule is found, all the training instances covered by the rule are

eliminated. The new rule is added to the bottom of the decision list R. This

procedure is repeated until the stopping criterion is met. The algorithm then

proceeds to generate rules for the next class.

Figure 4.1 demonstrates how the sequential covering algorithm works for a

data set that contains a collection of positive and negative examples. The rule

R1, whose coverage is shown in Figure 4.1(b) , is extracted first because it

covers the largest fraction of positive examples. All the training instances

covered by R1 are subsequently removed and the algorithm proceeds to look

for the next best rule, which is R2.
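The loop described above can be sketched as follows. This is a simplified illustration and not RIPPER itself: it has no pruning, no validation set, and no description-length stopping rule, its `learn_one_rule` is a basic greedy learner driven by FOIL's information gain, and the eight-instance data set is a hypothetical reduction of the vertebrate example.

```python
import math

def covers(antecedent, x):
    """An antecedent is a dict {attribute: value}; it covers x if every conjunct matches."""
    return all(x[a] == v for a, v in antecedent.items())

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain; minus infinity if no positive example stays covered."""
    if p1 == 0:
        return float("-inf")
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

def learn_one_rule(data, target):
    """Greedily add the conjunct with the highest positive gain."""
    antecedent, covered = {}, list(data)
    while True:
        p0 = sum(1 for _, y in covered if y == target)
        n0 = len(covered) - p0
        if p0 == 0 or n0 == 0:
            break  # nothing to learn, or the rule already covers positives only
        best = None
        candidates = sorted({(a, v) for x, _ in covered
                             for a, v in x.items() if a not in antecedent})
        for a, v in candidates:
            subset = [(x, y) for x, y in covered if x[a] == v]
            p1 = sum(1 for _, y in subset if y == target)
            gain = foil_gain(p0, n0, p1, len(subset) - p1)
            if best is None or gain > best[0]:
                best = (gain, a, v, subset)
        if best is None or best[0] <= 0:
            break
        _, a, v, covered = best
        antecedent[a] = v
    return antecedent

def sequential_covering(data, classes_by_prevalence):
    """Return a decision list of (antecedent, class) pairs ending with the default rule."""
    rules, remaining = [], list(data)
    for y in classes_by_prevalence[:-1]:
        while any(label == y for _, label in remaining):
            antecedent = learn_one_rule(remaining, y)
            if not antecedent:
                break  # no useful conjunct found; leftovers fall to later rules
            rules.append((antecedent, y))
            remaining = [(x, label) for x, label in remaining
                         if not covers(antecedent, x)]
    rules.append(({}, classes_by_prevalence[-1]))  # default rule
    return rules

def inst(gives_birth, aerial, aquatic):
    return {"gives_birth": gives_birth, "aerial": aerial, "aquatic": aquatic}

DATA = [  # hypothetical toy data
    (inst("no", "yes", "no"), "Birds"),
    (inst("no", "no", "yes"), "Fishes"), (inst("no", "no", "yes"), "Fishes"),
    (inst("no", "no", "no"), "Reptiles"), (inst("no", "no", "no"), "Reptiles"),
    (inst("yes", "no", "no"), "Mammals"), (inst("yes", "no", "no"), "Mammals"),
    (inst("yes", "yes", "no"), "Mammals"),
]

RULE_LIST = sequential_covering(DATA, ["Birds", "Fishes", "Reptiles", "Mammals"])
for antecedent, label in RULE_LIST:
    print(antecedent, "->", label)
```

Classes are passed from least to most prevalent, so the most prevalent class (Mammals here) never gets explicit rules and becomes the default.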

Learn-One-Rule Function

Finding an optimal rule is computationally expensive due to the exponential

search space to explore. The Learn-One-Rule function addresses this

problem by growing the rules in a greedy fashion. It generates an initial rule $r: \{\} \rightarrow +$, where the left-hand side is an empty set and the right-hand side corresponds to the positive class. It then refines the rule until a certain stopping criterion is met. The accuracy of the initial rule may be poor because some of the training instances covered by the rule belong to the negative class. A new conjunct must be added to the rule antecedent to improve its accuracy.

Figure 4.1. An example of the sequential covering algorithm.

RIPPER uses FOIL’s information gain measure to choose the best conjunct to be added into the rule antecedent. The measure takes into consideration both the gain in accuracy and support of a candidate rule, where support is defined as the number of positive examples covered by the rule. For example, suppose the rule $r: A \rightarrow +$ initially covers $p_0$ positive examples and $n_0$ negative examples. After adding a new conjunct B, the extended rule $r': A \wedge B \rightarrow +$ covers $p_1$ positive examples and $n_1$ negative examples. FOIL’s information gain of the extended rule is computed as follows:

$$\text{FOIL's information gain} = p_1 \times \left( \log_2 \frac{p_1}{p_1 + n_1} - \log_2 \frac{p_0}{p_0 + n_0} \right). \tag{4.4}$$

RIPPER chooses the conjunct with the highest FOIL’s information gain to extend the rule, as illustrated in the next example.

Example 4.2. [FOIL’s Information Gain]

Consider the training set for the vertebrate classification problem shown in Table 4.2. Suppose the target class for the Learn-One-Rule function is mammals. Initially, the antecedent of the rule {}→Mammals covers 5 positive and 10 negative examples. Thus, the accuracy of the rule is only 0.333. Next, consider the following three candidate conjuncts to be added to the left-hand side of the rule: Skin cover=hair, Body temperature=warm, and Has legs=No. The number of positive and negative examples covered by the rule after adding each conjunct, along with their respective accuracy and FOIL’s information gain, are shown in the following table.

| Candidate rule | $p_1$ | $n_1$ | Accuracy | Info Gain |
|---|---|---|---|---|
| {Skin Cover=hair}→mammals | 3 | 0 | 1.000 | 4.755 |
| {Body temperature=warm}→mammals | 5 | 2 | 0.714 | 5.498 |
| {Has legs=No}→mammals | 1 | 4 | 0.200 | −0.737 |

Although Skin cover=hair has the highest accuracy among the three candidates, the conjunct Body temperature=warm has the highest FOIL’s information gain. Thus, it is chosen to extend the rule (see Figure 4.2).

This process continues until adding new conjuncts no longer improves the

information gain measure.
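The gains in Example 4.2 can be reproduced with a few lines of code. The counts $(p_1, n_1)$ below were recounted from the vertebrate data set in Table 4.2 (seven warm-blooded vertebrates, five of them mammals; five vertebrates without legs, one of them a mammal), and the function name is illustrative.

```python
import math

def foil_gain(p0: int, n0: int, p1: int, n1: int) -> float:
    """FOIL's information gain: p1 * (log2(p1/(p1+n1)) - log2(p0/(p0+n0)))."""
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

P0, N0 = 5, 10  # the initial rule {} -> Mammals covers 5 positives and 10 negatives

CANDIDATES = {
    "Skin cover=hair": (3, 0),
    "Body temperature=warm": (5, 2),
    "Has legs=No": (1, 4),
}

for conjunct, (p1, n1) in CANDIDATES.items():
    accuracy = p1 / (p1 + n1)
    gain = foil_gain(P0, N0, p1, n1)
    print(f"{conjunct:24s} accuracy={accuracy:.3f} gain={gain:.3f}")

best = max(CANDIDATES, key=lambda c: foil_gain(P0, N0, *CANDIDATES[c]))
print("chosen conjunct:", best)
```

Because the gain is multiplied by $p_1$, a conjunct that keeps all five mammals covered can beat a perfectly accurate conjunct that keeps only three.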

Rule Pruning

The rules generated by the Learn-One-Rule function can be pruned to reduce their generalization error. RIPPER prunes the rules based on their performance on the validation set. The following metric is computed to determine whether pruning is needed: $(p-n)/(p+n)$, where p (n) is the number of positive (negative) examples in the validation set covered by the rule. This metric is monotonically related to the rule’s accuracy on the validation set. If the metric improves after pruning, then the conjunct is removed. Pruning is done starting from the last conjunct added to the rule. For example, given a rule $ABCD \rightarrow y$, RIPPER checks whether D should be pruned first, followed by CD, BCD, etc. While the original rule covers only positive examples, the pruned rule may cover some of the negative examples in the training set.

Building the Rule Set

After generating a rule, all the positive and negative examples covered by the rule are eliminated. The rule is then added into the rule set as long as it does not violate the stopping condition, which is based on the minimum description length principle. If the new rule increases the total description length of the rule set by at least d bits, then RIPPER stops adding rules into its rule set (by default, d is chosen to be 64 bits). Another stopping condition used by RIPPER is that the error rate of the rule on the validation set must not exceed 50%.

Figure 4.2. General-to-specific and specific-to-general rule-growing strategies.

RIPPER also performs additional optimization steps to determine whether

some of the existing rules in the rule set can be replaced by better alternative

rules. Readers who are interested in the details of the optimization method

may refer to the reference cited at the end of this chapter.
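The pruning step described under Rule Pruning can be sketched as follows. This is a simplified reading, assuming that trailing conjuncts are dropped one at a time and that the prefix with the best $(p-n)/(p+n)$ on the validation set is kept; the attribute names and the seven validation instances are hypothetical.

```python
def covers(conjuncts, x):
    """A rule antecedent is a list of (attribute, value) conjuncts."""
    return all(x[a] == v for a, v in conjuncts)

def prune_metric(conjuncts, validation, target):
    """(p - n) / (p + n) over the validation examples covered by the rule."""
    covered = [y for x, y in validation if covers(conjuncts, x)]
    if not covered:
        return float("-inf")
    p = sum(1 for y in covered if y == target)
    n = len(covered) - p
    return (p - n) / (p + n)

def prune_rule(conjuncts, validation, target):
    """Try removing the last conjunct, then the last two, and so on; keep the best prefix."""
    best = list(conjuncts)
    best_score = prune_metric(best, validation, target)
    for keep in range(len(conjuncts) - 1, 0, -1):
        candidate = list(conjuncts[:keep])
        score = prune_metric(candidate, validation, target)
        if score > best_score:
            best, best_score = candidate, score
    return best

# Hypothetical validation set for a rule ABC -> +
VALIDATION = [
    ({"A": 1, "B": 1, "C": 1}, "+"),
    ({"A": 1, "B": 1, "C": 0}, "+"),
    ({"A": 1, "B": 1, "C": 0}, "+"),
    ({"A": 1, "B": 1, "C": 1}, "-"),
    ({"A": 1, "B": 0, "C": 0}, "-"),
    ({"A": 1, "B": 0, "C": 1}, "-"),
    ({"A": 0, "B": 1, "C": 1}, "-"),
]
RULE = [("A", 1), ("B", 1), ("C", 1)]

PRUNED = prune_rule(RULE, VALIDATION, "+")
print(PRUNED)
```

On this data the full rule scores 0, dropping C raises the metric to 0.5, and dropping B as well lowers it back to 0, so the pruned antecedent keeps A and B.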

Instance Elimination

After a rule is extracted, RIPPER eliminates the positive and negative

examples covered by the rule. The rationale for doing this is illustrated in the

next example.

Figure 4.3 shows three possible rules, R1, R2, and R3, extracted from a

training set that contains 29 positive examples and 21 negative examples.

The accuracies of R1, R2, and R3 are 12/15 (80%), 7/10 (70%), and 8/12

(66.7%), respectively. R1 is generated first because it has the highest

accuracy. After generating R1, the algorithm must remove the examples

covered by the rule so that the next rule generated by the algorithm is different from R1. The question is, should it remove the positive examples only,

negative examples only, or both? To answer this, suppose the algorithm must

choose between generating R2 or R3 after R1. Even though R2 has a higher

accuracy than R3 (70% versus 66.7%), observe that the region covered by R2

is disjoint from R1, while the region covered by R3 overlaps with R1. As a

result, R1 and R3 together cover 18 positive and 5 negative examples

(resulting in an overall accuracy of 78.3%), whereas R1 and R2 together

cover 19 positive and 6 negative examples (resulting in a lower overall

accuracy of 76%). If the positive examples covered by R1 are not removed,

then we may overestimate the effective accuracy of R3. If the negative

examples covered by R1 are not removed, then we may underestimate the

accuracy of R3. In the latter case, we might end up preferring R2 over R3

even though half of the false positive errors committed by R3 have already

been accounted for by the preceding rule, R1. This example shows that the

effective accuracy after adding R2 or R3 to the rule set becomes evident only

when both positive and negative examples covered by R1 are removed.

Figure 4.3. Elimination of training instances by the sequential covering algorithm. R1, R2, and R3 represent regions covered by three different rules.
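The arithmetic behind this example can be made explicit. The overlap between R1 and R3 is not stated directly; the sketch below derives it by inclusion-exclusion from the counts given in the text, and the variable names are illustrative.

```python
def accuracy(pos: int, neg: int) -> float:
    return pos / (pos + neg)

# (positive, negative) examples covered by each rule, as stated in the text
R1, R2, R3 = (12, 3), (7, 3), (8, 4)
R1_OR_R3 = (18, 5)   # examples covered by R1 and R3 together
R1_OR_R2 = (19, 6)   # examples covered by R1 and R2 together

# Inclusion-exclusion: overlap = |R1| + |R3| - |R1 or R3|, per class
overlap_r1_r3 = (R1[0] + R3[0] - R1_OR_R3[0], R1[1] + R3[1] - R1_OR_R3[1])
overlap_r1_r2 = (R1[0] + R2[0] - R1_OR_R2[0], R1[1] + R2[1] - R1_OR_R2[1])

# What R3 still covers once every example covered by R1 has been removed
r3_after_removal = (R3[0] - overlap_r1_r3[0], R3[1] - overlap_r1_r3[1])

print("overlap R1/R3:", overlap_r1_r3)
print("overlap R1/R2:", overlap_r1_r2)
print("R3 after removing R1's examples:", r3_after_removal,
      f"accuracy={accuracy(*r3_after_removal):.3f}")
print(f"R2 accuracy={accuracy(*R2):.3f}")
print(f"R1+R3 accuracy={accuracy(*R1_OR_R3):.3f}, R1+R2 accuracy={accuracy(*R1_OR_R2):.3f}")
```

After removing both the positive and the negative examples covered by R1, R3 covers 6 positives and 2 negatives (75%), which is higher than R2's 70% and agrees with the comparison of the combined rule sets.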

4.2.4 Indirect Methods for Rule Extraction

This section presents a method for generating a rule set from a decision tree.

In principle, every path from the root node to the leaf node of a decision tree

can be expressed as a classification rule. The test conditions encountered

along the path form the conjuncts of the rule antecedent, while the class label

at the leaf node is assigned to the rule consequent. Figure 4.4 shows an

example of a rule set generated from a decision tree. Notice that the rule set

is exhaustive and contains mutually exclusive rules. However, some of the

rules can be simplified as shown in the next example.

Figure 4.4. Converting a decision tree into classification rules.

Example 4.3.

Consider the following three rules from Figure 4.4:

r2: (P=No)∧(Q=Yes)→+
r3: (P=Yes)∧(R=No)→+
r5: (P=Yes)∧(R=Yes)∧(Q=Yes)→+.

Observe that the rule set always predicts a positive class when the value of Q is Yes. Therefore, we may simplify the rules as follows:

r2′: (Q=Yes)→+
r3: (P=Yes)∧(R=No)→+.

$r_3$ is retained to cover the remaining instances of the positive class. Although the rules obtained after simplification are no longer mutually exclusive, they are less complex and are easier to interpret.

In the following, we describe an approach used by the C4.5rules algorithm to generate a rule set from a decision tree. Figure 4.5 shows the decision tree and resulting classification rules obtained for the data set given in Table 4.2.
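The simplification in Example 4.3 can be checked by enumerating all eight combinations of P, Q, and R. The sketch assumes that any instance not covered by a positive rule is predicted negative, which follows from the rule set being mutually exclusive and exhaustive; the function names are illustrative.

```python
from itertools import product

def original_positive(p: bool, q: bool, r: bool) -> bool:
    """Positive if r2, r3, or r5 fires (True stands for Yes, False for No)."""
    r2 = (not p) and q
    r3 = p and (not r)
    r5 = p and r and q
    return r2 or r3 or r5

def simplified_positive(p: bool, q: bool, r: bool) -> bool:
    """Positive if r2' (Q=Yes) or r3 (P=Yes and R=No) fires."""
    return q or (p and not r)

MISMATCHES = [combo for combo in product([False, True], repeat=3)
              if original_positive(*combo) != simplified_positive(*combo)]

# Instances on which both simplified rules fire, i.e. the rules now overlap
OVERLAPS = [combo for combo in product([False, True], repeat=3)
            if combo[1] and combo[0] and not combo[2]]

print("mismatches:", MISMATCHES)
print("covered by both r2' and r3:", OVERLAPS)
```

The empty mismatch list confirms that the two rule sets predict the same class everywhere, while the overlap on (P=Yes, Q=Yes, R=No) shows why the simplified rules are no longer mutually exclusive.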

Rule Generation

Classification rules are extracted for every path from the root to one of the leaf nodes in the decision tree. Given a classification rule $r: A \rightarrow y$, we consider a simplified rule, $r': A' \rightarrow y$, where $A'$ is obtained by removing one of the conjuncts in A. The simplified rule with the lowest pessimistic error rate is retained provided its error rate is less than that of the original rule. The rule-pruning step is repeated until the pessimistic error of the rule cannot be improved further. Because some of the rules may become identical after pruning, the duplicate rules are discarded.

Figure 4.5. Classification rules extracted from a decision tree for the vertebrate classification problem.

Rule Ordering

After generating the rule set, C4.5rules uses the class-based ordering scheme

to order the extracted rules. Rules that predict the same class are grouped

together into the same subset. The total description length for each subset is

computed, and the classes are arranged in increasing order of their total

description length. The class that has the smallest description length is given

the highest priority because it is expected to contain the best set of rules. The

total description length for a class is given by $L_{\text{exception}} + g \times L_{\text{model}}$, where $L_{\text{exception}}$ is the number of bits needed to encode the misclassified examples, $L_{\text{model}}$ is the number of bits needed to encode the model, and g is a tuning parameter whose default value is 0.5. The tuning parameter depends on the number of redundant attributes present in the model. The value of the tuning parameter is small if the model contains many redundant attributes.

4.2.5 Characteristics of Rule-Based Classifiers

1. Rule-based classifiers have characteristics very similar to those of decision trees. The expressiveness of a rule set is almost equivalent to that of a decision tree because a decision tree can be represented by a set of mutually exclusive and exhaustive rules. Both rule-based and decision tree classifiers create rectilinear partitions of the attribute space and assign a class to each partition. However, a rule-based classifier can allow multiple rules to be triggered for a given instance, thus enabling the learning of more complex models than decision trees.

2. Like decision trees, rule-based classifiers can handle varying types of

categorical and continuous attributes and can easily work in multiclass

classification scenarios. Rule-based classifiers are generally used to

produce descriptive models that are easier to interpret but give

comparable performance to the decision tree classifier.

3. Rule-based classifiers can easily handle the presence of redundant

attributes that are highly correlated with one another. This is because

once an attribute has been used as a conjunct in a rule antecedent, the

remaining redundant attributes would show little to no FOIL’s

information gain and would thus be ignored.

4. Since irrelevant attributes show poor information gain, rule-based

classifiers can avoid selecting irrelevant attributes if there are other

relevant attributes that show better information gain. However, if the

problem is complex and there are interacting attributes that can

collectively distinguish between the classes but individually show poor

information gain, it is likely for an irrelevant attribute to be accidentally

favored over a relevant attribute just by random chance. Hence, rule-

based classifiers can show poor performance in the presence of

interacting attributes, when the number of irrelevant attributes is large.

5. The class-based ordering strategy adopted by RIPPER, which

emphasizes giving higher priority to rare classes, is well suited for

handling training data sets with imbalanced class distributions.

6. Rule-based classifiers are not well-suited for handling missing values in

the test set. This is because the position of rules in a rule set follows a

certain ordering strategy and even if a test instance is covered by

multiple rules, they can assign different class labels depending on their

position in the rule set. Hence, if a certain rule involves an attribute that

is missing in a test instance, it is difficult to ignore the rule and proceed

to the subsequent rules in the rule set, as such a strategy can result in

incorrect class assignments.

4.3 Nearest Neighbor Classifiers

The classification framework shown in Figure 3.3 involves a two-step

process:

(1) an inductive step for constructing a classification model from data, and

(2) a deductive step for applying the model to test examples. Decision tree

and rule-based classifiers are examples of eager learners because they are

designed to learn a model that maps the input attributes to the class label as

soon as the training data becomes available. An opposite strategy would be to

delay the process of modeling the training data until it is needed to classify the

test instances. Techniques that employ this strategy are known as lazy

learners. An example of a lazy learner is the Rote classifier, which

memorizes the entire training data and performs classification only if the

attributes of a test instance match one of the training examples exactly. An

obvious drawback of this approach is that some test instances may not be

classified because they do not match any training example.

One way to make this approach more flexible is to find all the training

examples that are relatively similar to the attributes of the test instances.

These examples, which are known as nearest neighbors, can be used to

determine the class label of the test instance. The justification for using

nearest neighbors is best exemplified by the following saying: “If it walks like a

duck, quacks like a duck, and looks like a duck, then it’s probably a duck.” A

nearest neighbor classifier represents each example as a data point in a d-

dimensional space, where d is the number of attributes. Given a test instance,

we compute its proximity to the training instances according to one of the

proximity measures described in Section 2.4 on page 71. The k-nearest

neighbors of a given test instance z refer to the k training examples that are

closest to z.

Figure 4.6 illustrates the 1-, 2-, and 3-nearest neighbors of a test instance

located at the center of each circle. The instance is classified based on the

class labels of its neighbors. In the case where the neighbors have more than

one label, the test instance is assigned to the majority class of its nearest

neighbors. In Figure 4.6(a), the 1-nearest neighbor of the instance is a

negative example. Therefore the instance is assigned to the negative class. If

the number of nearest neighbors is three, as shown in Figure 4.6(c), then

the neighborhood contains two positive examples and one negative example.

Using the majority voting scheme, the instance is assigned to the positive

class. In the case where there is a tie between the classes (see Figure

4.6(b)), we may randomly choose one of them to classify the data point.

Figure 4.6.

The 1-, 2-, and 3-nearest neighbors of an instance.

The preceding discussion underscores the importance of choosing the right

value for k. If k is too small, then the nearest neighbor classifier may be

susceptible to overfitting due to noise, i.e., mislabeled examples in the training

data. On the other hand, if k is too large, the nearest neighbor classifier may

misclassify the test instance because its list of nearest neighbors includes

training examples that are located far away from its neighborhood (see Figure

4.7).

Figure 4.7.

k-nearest neighbor classification with large k.

4.3.1 Algorithm

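A high-level summary of the nearest neighbor classification method is given in Algorithm 4.2. The algorithm computes the distance (or similarity) between each test instance $z = (\mathbf{x}', y')$ and all the training examples $(\mathbf{x}, y) \in D$ to determine its nearest neighbor list, $D_z$. Such computation can be costly if the number of training examples is large. However, efficient indexing techniques are available to reduce the computation needed to find the nearest neighbors of a test instance.

Algorithm 4.2 The k-nearest neighbor classifier.

1: Let k be the number of nearest neighbors and D be the set of training examples.
2: for each test instance $z = (\mathbf{x}', y')$ do
3:   Compute $d(\mathbf{x}', \mathbf{x})$, the distance between z and every example $(\mathbf{x}, y) \in D$.
4:   Select $D_z \subseteq D$, the set of the k closest training examples to z.
5:   $y' = \arg\max_{v} \sum_{(\mathbf{x}_i, y_i) \in D_z} I(v = y_i)$
6: end for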

Once the nearest neighbor list is obtained, the test instance is classified based on the majority class of its nearest neighbors:

$$\text{Majority Voting:}\quad y' = \arg\max_{v} \sum_{(\mathbf{x}_i, y_i) \in D_z} I(v = y_i), \tag{4.5}$$

where v is a class label, $y_i$ is the class label for one of the nearest neighbors, and $I(\cdot)$ is an indicator function that returns the value 1 if its argument is true and 0 otherwise.

In the majority voting approach, every neighbor has the same impact on the classification. This makes the algorithm sensitive to the choice of k, as shown in Figure 4.6. One way to reduce the impact of k is to weight the influence of each nearest neighbor $\mathbf{x}_i$ according to its distance: $w_i = 1/d(\mathbf{x}', \mathbf{x}_i)^2$. As a result, training examples that are located far away from z have a weaker impact on the classification compared to those that are located close to z. Using the distance-weighted voting scheme, the class label can be determined as follows:

$$\text{Distance-Weighted Voting:}\quad y' = \arg\max_{v} \sum_{(\mathbf{x}_i, y_i) \in D_z} w_i \times I(v = y_i). \tag{4.6}$$
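The two voting rules in Equations 4.5 and 4.6 can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes Euclidean distance as the proximity measure, a brute-force scan of the training set, and a small hypothetical data set (`TRAIN`, `QUERY`) invented for the example.

```python
import math
from collections import defaultdict

def knn_predict(train, x_query, k=3, weighted=False):
    """Classify x_query from its k nearest training examples.

    train is a list of (features, label) pairs; features are numeric tuples.
    """
    # Brute-force distance from the query to every training example
    # (Euclidean distance is an assumption made for this sketch).
    scored = sorted((math.dist(x, x_query), label) for x, label in train)
    neighbors = scored[:k]                      # the neighbor list D_z

    votes = defaultdict(float)
    for d, label in neighbors:
        if weighted:
            # Distance-weighted voting: w_i = 1 / d^2 (Equation 4.6).
            votes[label] += math.inf if d == 0 else 1.0 / d ** 2
        else:
            # Majority voting: every neighbor counts once (Equation 4.5).
            votes[label] += 1.0
    return max(votes, key=votes.get)

# Hypothetical two-attribute training set and test instance.
TRAIN = [
    ((1.0, 1.0), "neg"),
    ((2.0, 2.0), "pos"),
    ((2.0, 0.5), "pos"),
    ((5.0, 5.0), "neg"),
    ((6.0, 5.0), "neg"),
]
QUERY = (1.2, 1.1)

print(knn_predict(TRAIN, QUERY, k=1))                 # closest single neighbor
print(knn_predict(TRAIN, QUERY, k=3))                 # majority of three
print(knn_predict(TRAIN, QUERY, k=3, weighted=True))  # distance-weighted
```

With this toy data the prediction changes between k = 1 and k = 3 under majority voting, while distance weighting lets the one very close neighbor outweigh the two farther ones.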

4.3.2 Characteristics of Nearest

Neighbor Classifiers

1. Nearest neighbor classification is part of a more general technique

known as instance-based learning, which does not build a global

model, but rather uses the training examples to make predictions for a

test instance. (Thus, such classifiers are often said to be “model free.”)

Such algorithms require a proximity measure to determine the similarity

or distance between instances and a classification function that returns

the predicted class of a test instance based on its proximity to other

instances.

2. Although lazy learners, such as nearest neighbor classifiers, do not

require model building, classifying a test instance can be quite

expensive because we need to compute the proximity values

individually between the test and training examples. In contrast, eager

learners often spend the bulk of their computing resources for model

building. Once a model has been built, classifying a test instance is

extremely fast.

3. Nearest neighbor classifiers make their predictions based on local

information. (This is equivalent to building a local model for each test

instance.) By contrast, decision tree and rule-based classifiers attempt

to find a global model that fits the entire input space. Because the

classification decisions are made locally, nearest neighbor classifiers

(with small values of k) are quite susceptible to noise.

4. Nearest neighbor classifiers can produce decision boundaries of

arbitrary shape. Such boundaries provide a more flexible model

representation compared to decision tree and rule-based classifiers

that are often constrained to rectilinear decision boundaries. The

decision boundaries of nearest neighbor classifiers also have high

variability because they depend on the composition of training

examples in the local neighborhood. Increasing the number of nearest

neighbors may reduce such variability.

5. Nearest neighbor classifiers have difficulty handling missing values in

both the training and test sets since proximity computations normally

require the presence of all attributes. Although the subset of attributes

present in two instances can be used to compute a proximity, such an

approach may not produce good results since the proximity measures

may be different for each pair of instances and thus hard to compare.

6. Nearest neighbor classifiers can handle the presence of interacting

attributes, i.e., attributes that have more predictive power taken in

combination than by themselves, by using appropriate proximity

measures that can incorporate the effects of multiple attributes

together.

7. The presence of irrelevant attributes can distort commonly used

proximity measures, especially when the number of irrelevant attributes

is large. Furthermore, if there are a large number of redundant

attributes that are highly correlated with each other, then the proximity

measure can be overly biased toward such attributes, resulting in

improper estimates of distance. Hence, the presence of irrelevant and

redundant attributes can adversely affect the performance of nearest

neighbor classifiers.

8. Nearest neighbor classifiers can produce wrong predictions unless the

appropriate proximity measure and data preprocessing steps are taken.

For example, suppose we want to classify a group of people based on

attributes such as height (measured in meters) and weight (measured

in pounds). The height attribute has a low variability, ranging from 1.5

m to 1.85 m, whereas the weight attribute may vary from 90 lb. to 250

lb. If the scale of the attributes is not taken into consideration, the

proximity measure may be dominated by differences in the weights of the

people.
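The height and weight example in item 8 can be made concrete with a short sketch. The five records below are hypothetical values chosen inside the ranges quoted above, and z-score standardization is only one common way of putting attributes on a comparable scale; it is used here as an assumption, not as the preprocessing step prescribed by the text.

```python
import math
import statistics

# Hypothetical (height in meters, weight in pounds) records.
PEOPLE = [
    (1.50, 150.0),   # reference person
    (1.85, 152.0),   # very different height, almost the same weight
    (1.52, 170.0),   # almost the same height, somewhat different weight
    (1.70, 250.0),
    (1.65, 90.0),
]

def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

def nearest_to_first(rows):
    """Index of the row closest (Euclidean distance) to rows[0], excluding itself."""
    return min(range(1, len(rows)), key=lambda i: math.dist(rows[0], rows[i]))

raw_nn = nearest_to_first(PEOPLE)                  # weight differences dominate
scaled_nn = nearest_to_first(standardize(PEOPLE))  # both attributes contribute

print(raw_nn, scaled_nn)
```

On the raw values the nearest neighbor of the reference person is the record with nearly identical weight, even though the heights differ by 0.35 m; after rescaling, the record with nearly identical height becomes the nearest one.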

4.4 Naïve Bayes Classifier

Many classification problems involve uncertainty. First, the observed attributes

and class labels may be unreliable due to imperfections in the measurement

process, e.g., due to the limited precision of sensor devices. Second, the

set of attributes chosen for classification may not be fully representative of the

target class, resulting in uncertain predictions. To illustrate this, consider the

problem of predicting a person’s risk for heart disease based on a model that

uses their diet and workout frequency as attributes. Although most people

who eat healthily and exercise regularly have less chance of developing heart

disease, they may still be at risk due to other latent factors, such as heredity,

excessive smoking, and alcohol abuse, that are not captured in the model.

Third, a classification model learned over a finite training set may not be able

to fully capture the true relationships in the overall data, as discussed in the

context of model overfitting in the previous chapter. Finally, uncertainty in

predictions may arise due to the inherent random nature of real-world

systems, such as those encountered in weather forecasting problems.

In the presence of uncertainty, there is a need to not only make predictions of

class labels but also provide a measure of confidence associated with every

prediction. Probability theory offers a systematic way for quantifying and

manipulating uncertainty in data, and thus, is an appealing framework for

assessing the confidence of predictions. Classification models that make use

of probability theory to represent the relationship between attributes and class

labels are known as probabilistic classification models. In this section, we

present the naïve Bayes classifier, which is one of the simplest and most

widely-used probabilistic classification models.

4.4.1 Basics of Probability Theory

Before we discuss how the naïve Bayes classifier works, we first introduce

some basics of probability theory that will be useful in understanding the

probabilistic classification models presented in this chapter. This involves

defining the notion of probability and introducing some common approaches

for manipulating probability values.

Consider a variable X, which can take any discrete value from the set $\{x_1, \ldots, x_k\}$. When we have multiple observations of that variable, such as in a data set where the variable describes some characteristic of data objects, then we can compute the relative frequency with which each value occurs. Specifically, suppose that X has the value $x_i$ for $n_i$ data objects. The relative frequency with which we observe the event $X = x_i$ is then $n_i/N$, where N denotes the total number of occurrences ($N = \sum_{i=1}^{k} n_i$). These relative frequencies characterize the uncertainty that we have with respect to what value X may take for an unseen observation and motivate the notion of probability.

More formally, the probability of an event e, e.g., $P(X = x_i)$, measures how likely it is for the event e to occur. The most traditional view of probability is based on relative frequency of events (frequentist), while the Bayesian viewpoint (described later) takes a more flexible view of probabilities. In either case, a probability is always a number between 0 and 1. Further, the sum of probability values of all possible events, e.g., outcomes of a variable X, is equal to 1. Variables that have probabilities associated with each possible outcome (value) are known as random variables.

Now, let us consider two random variables, X and Y, that can each take k discrete values. Let $n_{ij}$ be the number of times we observe $X = x_i$ and $Y = y_j$, out of a total number of N occurrences. The joint probability of observing $X = x_i$ and $Y = y_j$ together can be estimated as

$$P(X = x_i, Y = y_j) = \frac{n_{ij}}{N}. \tag{4.7}$$

(This is an estimate since we typically have only a finite subset of all possible observations.) Joint probabilities can be used to answer questions such as “what is the probability that there will be a surprise quiz today and I will be late for the class.” Joint probabilities are symmetric, i.e., $P(X = x, Y = y) = P(Y = y, X = x)$. For joint probabilities, it is useful to consider their sum with respect to one of the random variables, as described in the following equation:

$$\sum_{j=1}^{k} P(X = x_i, Y = y_j) = \frac{\sum_{j=1}^{k} n_{ij}}{N} = \frac{n_i}{N} = P(X = x_i), \tag{4.8}$$

where $n_i$ is the total number of times we observe $X = x_i$ irrespective of the value of Y. Notice that $n_i/N$ is essentially the probability of observing $X = x_i$. Hence, by summing out the joint probabilities with respect to a random variable Y, we obtain the probability of observing the remaining variable X. This operation is called marginalization and the probability value $P(X = x_i)$ obtained by marginalizing out Y is sometimes called the marginal probability of X. As we will see later, joint probability and marginal probability form the basic building blocks of a number of probabilistic classification models discussed in this chapter.

Notice that in the previous discussions, we used $P(X = x_i)$ to denote the probability of a particular outcome of a random variable X. This notation can easily become cumbersome when a number of random variables are involved. Hence, in the remainder of this section, we will use P(X) to denote the probability of any generic outcome of the random variable X, while $P(x_i)$ will be used to represent the probability of the specific outcome $x_i$.
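The count-based estimates of joint and marginal probabilities (Equations 4.7 and 4.8) can be sketched as follows. The paired observations are hypothetical and serve only to show the bookkeeping; the names `observations`, `joint`, and `marginal_x` are introduced for this illustration.

```python
from collections import Counter

# Hypothetical paired observations of two discrete variables (X, Y).
observations = [
    ("a", 0), ("a", 1), ("b", 1), ("a", 1),
    ("b", 0), ("b", 1), ("a", 1), ("c", 0),
]

N = len(observations)
counts = Counter(observations)                    # n_ij for every observed pair
joint = {xy: n / N for xy, n in counts.items()}   # P(X = x_i, Y = y_j) = n_ij / N

def marginal_x(joint_probs):
    """Sum the joint probabilities over Y to obtain P(X = x_i)."""
    totals = Counter()
    for (x, _y), p in joint_probs.items():
        totals[x] += p
    return dict(totals)

p_x = marginal_x(joint)
print(joint[("a", 1)], p_x)
```

Summing the joint estimates over Y reproduces the relative frequency $n_i/N$ of each value of X, which is exactly the marginalization step described above.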

Bayes Theorem

Suppose you have invited two of your friends Alex and Martha to a dinner party. You know that Alex

attends 40% of the parties he is invited to. Further, if Alex is going to a party, there is an 80% chance

of Martha coming along. On the other hand, if Alex is not going to the party, the chance of Martha

coming to the party is reduced to 30%. If Martha has responded that she will be coming to your party,

what is the probability that Alex will also be coming?

Bayes theorem presents the statistical principle for answering questions like

the previous one, where evidence from multiple sources has to be combined

with prior beliefs to arrive at predictions. Bayes theorem can be briefly

described as follows.

Let $P(Y|X)$ denote the conditional probability of observing the random variable Y whenever the random variable X takes a particular value. $P(Y|X)$ is often read as the probability of observing Y conditioned on the outcome of X. Conditional probabilities can be used for answering questions such as “given that it is going to rain today, what will be the probability that I will go to the class.” Conditional probabilities of X and Y are related to their joint probability in the following way:

$$P(Y|X) = \frac{P(X, Y)}{P(X)}, \quad \text{which implies} \tag{4.9}$$

$$P(X, Y) = P(Y|X) \times P(X) = P(X|Y) \times P(Y). \tag{4.10}$$

Rearranging the last two expressions in Equation 4.10 leads to Equation 4.11, which is known as Bayes theorem:

$$P(Y|X) = \frac{P(X|Y)\,P(Y)}{P(X)}. \tag{4.11}$$

Bayes theorem provides a relationship between the conditional probabilities $P(Y|X)$ and $P(X|Y)$. Note that the denominator in Equation 4.11 involves the marginal probability of X, which can also be represented as

$$P(X) = \sum_{i=1}^{k} P(X, y_i) = \sum_{i=1}^{k} P(X|y_i) \times P(y_i).$$

Using the previous expression for P(X), we can obtain the following equation for $P(Y|X)$ solely in terms of $P(X|Y)$ and P(Y):

$$P(Y|X) = \frac{P(X|Y)\,P(Y)}{\sum_{i=1}^{k} P(X|y_i)\,P(y_i)}. \tag{4.12}$$

Example 4.4. [Bayes Theorem]

Bayes theorem can be used to solve a number of inferential questions about random variables. For example, consider the problem stated at the beginning of this discussion of inferring whether Alex will come to the party. Let $P(A = 1)$ denote the probability of Alex going to a party, while $P(A = 0)$ denotes the probability of him not going to a party. We know that

$$P(A = 1) = 0.4, \quad \text{and} \quad P(A = 0) = 1 - P(A = 1) = 0.6.$$

Further, let $P(M = 1|A)$ denote the conditional probability of Martha going to a party conditioned on whether Alex is going to the party. $P(M = 1|A)$ takes the following values:

$$P(M = 1|A = 1) = 0.8, \quad \text{and} \quad P(M = 1|A = 0) = 0.3.$$

We can use the above values of $P(M|A)$ and P(A) to compute the probability of Alex going to the party given Martha is going to the party, $P(A = 1|M = 1)$, as follows:

$$P(A = 1|M = 1) = \frac{P(M = 1|A = 1)\,P(A = 1)}{P(M = 1|A = 0)\,P(A = 0) + P(M = 1|A = 1)\,P(A = 1)} = \frac{0.8 \times 0.4}{0.3 \times 0.6 + 0.8 \times 0.4} = 0.64. \tag{4.13}$$

Notice that even though the prior probability P(A) of Alex going to the party is low, the observation that Martha is going, $M = 1$, affects the conditional probability $P(A = 1|M = 1)$. This shows the value of Bayes theorem in combining prior assumptions with observed outcomes to make predictions. Since $P(A = 1|M = 1) > 0.5$, it is more likely for Alex to join if Martha is going to the party.

Using Bayes Theorem for Classification

For the purpose of classification, we are interested in computing the probability of observing a class label y for a data instance given its set of attribute values $\mathbf{x}$. This can be represented as $P(y|\mathbf{x})$, which is known as the posterior probability of the target class. Using the Bayes Theorem, we can represent the posterior probability as

$$P(y|\mathbf{x}) = \frac{P(\mathbf{x}|y)\,P(y)}{P(\mathbf{x})}. \tag{4.14}$$

Note that the numerator of the previous equation involves two terms, $P(\mathbf{x}|y)$ and P(y), both of which contribute to the posterior probability $P(y|\mathbf{x})$. We describe both of these terms in the following.

The first term $P(\mathbf{x}|y)$ is known as the class-conditional probability of the attributes given the class label. $P(\mathbf{x}|y)$ measures the likelihood of observing $\mathbf{x}$ from the distribution of instances belonging to y. If $\mathbf{x}$ indeed belongs to class y, then we should expect $P(\mathbf{x}|y)$ to be high. From this point of view, the use of class-conditional probabilities attempts to capture the process from which the data instances were generated. Because of this interpretation, probabilistic classification models that involve computing class-conditional probabilities are known as generative classification models. Apart from their use in computing posterior probabilities and making predictions, class-conditional probabilities also provide insights about the underlying mechanism behind the generation of attribute values.
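As a small numerical illustration of the posterior in Equation 4.14, the sketch below multiplies each class prior by its class-conditional probability and divides by their sum over all classes. It reuses the numbers of Example 4.4, treating "Alex attends" as the class and "Martha attends" as the observed evidence; the function name `posterior` and its dictionary-based interface are assumptions made for this example.

```python
def posterior(prior, likelihood):
    """Return P(y | x) for every class y using Bayes theorem.

    prior[y] is P(y); likelihood[y] is P(x | y) for the observed x.
    """
    unnormalized = {y: likelihood[y] * prior[y] for y in prior}
    evidence = sum(unnormalized.values())   # P(x) = sum_i P(x | y_i) P(y_i)
    return {y: value / evidence for y, value in unnormalized.items()}

# Example 4.4: A = 1 means Alex attends; the evidence is M = 1 (Martha attends).
prior_alex = {1: 0.4, 0: 0.6}
likelihood_martha = {1: 0.8, 0: 0.3}        # P(M = 1 | A)

post = posterior(prior_alex, likelihood_martha)
print(post)
```

The value obtained for A = 1 agrees with the 0.64 computed by hand in Example 4.4, and the two posteriors sum to one because the denominator is shared by both classes.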

The second term in the numerator of Equation 4.14 is the prior probability P(y). The prior probability captures our prior beliefs about the distribution of class labels, independent of the observed attribute values. (This is the Bayesian viewpoint.) For example, we may have a prior belief that the likelihood of any person to suffer from a heart disease is $\alpha$, irrespective of their diagnosis reports. The prior probability can either be obtained using expert knowledge, or inferred from the historical distribution of class labels.

The denominator in Equation 4.14 involves the probability of evidence, $P(\mathbf{x})$. Note that this term does not depend on the class label and thus can be treated as a normalization constant in the computation of posterior probabilities. Further, the value of $P(\mathbf{x})$ can be calculated as $P(\mathbf{x}) = \sum_{i} P(\mathbf{x}|y_i)\,P(y_i)$.

Bayes theorem provides a convenient way to combine our prior beliefs with the likelihood of obtaining the observed attribute values. During the training phase, we are required to learn the parameters for P(y) and $P(\mathbf{x}|y)$. The prior probability P(y) can be easily estimated from the training set by computing the fraction of training instances that belong to each class. To compute the class-conditional probabilities, one approach is to consider the fraction of training instances of a given class for every possible combination of attribute values. For example, suppose that there are two attributes $X_1$ and $X_2$ that can each take a discrete value from $c_1$ to $c_k$. Let $n^0$ denote the number of training instances belonging to class 0, out of which $n^0_{ij}$ training instances have $X_1 = c_i$ and $X_2 = c_j$. The class-conditional probability can then be given as

$$P(X_1 = c_i, X_2 = c_j \mid Y = 0) = \frac{n^0_{ij}}{n^0}.$$

This approach can easily become computationally prohibitive as the number of attributes increases, due to the exponential growth in the number of attribute value combinations. For example, if every attribute can take k discrete values, then the number of attribute value combinations is equal to $k^d$, where d is the number of attributes. The large number of attribute value combinations can also result in poor estimates of class-conditional probabilities, since every combination will have fewer training instances when the size of the training set is small.

In the following, we present the naïve Bayes classifier, which makes a simplifying assumption about the class-conditional probabilities, known as the naïve Bayes assumption. The use of this assumption significantly helps in obtaining reliable estimates of class-conditional probabilities, even when the number of attributes is large.

4.4.2 Naïve Bayes Assumption

The naïve Bayes classifier assumes that the class-conditional probability of all attributes $\mathbf{x}$ can be factored as a product of class-conditional probabilities of every attribute $x_i$, as described in the following equation:

$$P(\mathbf{x}|y) = \prod_{i=1}^{d} P(x_i|y), \tag{4.15}$$

where every data instance $\mathbf{x}$ consists of d attributes, $\{x_1, x_2, \ldots, x_d\}$. The basic assumption behind the previous equation is that the attribute values $x_i$ are conditionally independent of each other, given the class label y. This means that the attributes are influenced only by the target class and if we know the class label, then we can consider the attributes to be independent of each other. The concept of conditional independence can be formally stated as follows.

Conditional Independence

Let $\mathbf{X}_1$, $\mathbf{X}_2$, and $\mathbf{Y}$ denote three sets of random variables. The variables in $\mathbf{X}_1$ are said to be conditionally independent of $\mathbf{X}_2$, given $\mathbf{Y}$, if the following condition holds:

$$P(\mathbf{X}_1 \mid \mathbf{X}_2, \mathbf{Y}) = P(\mathbf{X}_1 \mid \mathbf{Y}). \tag{4.16}$$

This means that conditioned on $\mathbf{Y}$, the distribution of $\mathbf{X}_1$ is not influenced by the outcomes of $\mathbf{X}_2$, and hence is conditionally independent of $\mathbf{X}_2$. To illustrate the notion of conditional independence, consider the relationship between a person’s arm length ($X_1$) and his or her reading skills ($X_2$). One might observe that people with longer arms tend to have higher levels of reading skills, and thus consider $X_1$ and $X_2$ to be related to each other. However, this relationship can be explained by another factor, which is the age of the person (Y). A young child tends to have short arms and lacks the reading skills of an adult. If the age of a person is fixed, then the observed relationship between arm length and reading skills disappears. Thus, we can conclude that arm length and reading skills are not directly related to each other and are conditionally independent when the age variable is fixed.

Another way of describing conditional independence is to consider the joint conditional probability, $P(\mathbf{X}_1, \mathbf{X}_2 \mid \mathbf{Y})$, as follows:

$$\begin{aligned}
P(\mathbf{X}_1, \mathbf{X}_2 \mid \mathbf{Y}) &= \frac{P(\mathbf{X}_1, \mathbf{X}_2, \mathbf{Y})}{P(\mathbf{Y})} \\
&= \frac{P(\mathbf{X}_1, \mathbf{X}_2, \mathbf{Y})}{P(\mathbf{X}_2, \mathbf{Y})} \times \frac{P(\mathbf{X}_2, \mathbf{Y})}{P(\mathbf{Y})} \\
&= P(\mathbf{X}_1 \mid \mathbf{X}_2, \mathbf{Y}) \times P(\mathbf{X}_2 \mid \mathbf{Y}) \\
&= P(\mathbf{X}_1 \mid \mathbf{Y}) \times P(\mathbf{X}_2 \mid \mathbf{Y}),
\end{aligned} \tag{4.17}$$

where Equation 4.16 was used to obtain the last line of Equation 4.17. The previous description of conditional independence is quite useful from an operational perspective. It states that the joint conditional probability of $\mathbf{X}_1$ and $\mathbf{X}_2$ given $\mathbf{Y}$ can be factored as the product of conditional probabilities of $\mathbf{X}_1$ and $\mathbf{X}_2$ considered separately. This forms the basis of the naïve Bayes assumption stated in Equation 4.15.
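The arm length and reading skill illustration can be checked numerically. The sketch below builds a joint distribution over two binary attributes and an age group in which the attributes are conditionally independent given age, and then compares the conditional and unconditional behavior. All probabilities are hypothetical values chosen for the illustration.

```python
from itertools import product

# Hypothetical distributions: Y is an age group, X1 = 1 means long arms,
# X2 = 1 means good reading skills.
p_y = {"child": 0.5, "adult": 0.5}
p_x1_given_y = {"child": 0.1, "adult": 0.9}   # P(X1 = 1 | Y)
p_x2_given_y = {"child": 0.2, "adult": 0.9}   # P(X2 = 1 | Y)

def bern(p, value):
    """Probability of a binary outcome when P(value = 1) = p."""
    return p if value == 1 else 1.0 - p

# Joint distribution constructed as P(Y) P(X1 | Y) P(X2 | Y).
joint = {
    (x1, x2, y): p_y[y] * bern(p_x1_given_y[y], x1) * bern(p_x2_given_y[y], x2)
    for x1, x2, y in product((0, 1), (0, 1), p_y)
}

def prob(**fixed):
    """Sum the joint over all entries that match the fixed values of x1, x2, y."""
    return sum(
        p for (x1, x2, y), p in joint.items()
        if all({"x1": x1, "x2": x2, "y": y}[name] == v for name, v in fixed.items())
    )

# Given Y, knowing X2 does not change the distribution of X1 (Equation 4.16).
cond_gap = {
    y: prob(x1=1, x2=1, y=y) / prob(x2=1, y=y) - prob(x1=1, y=y) / prob(y=y)
    for y in p_y
}
# Without conditioning on Y, the two attributes look related.
marginal_gap = prob(x1=1, x2=1) - prob(x1=1) * prob(x2=1)

print(cond_gap, marginal_gap)
```

The conditional gaps are zero (up to floating-point error) for both age groups, while the unconditional gap is clearly positive, mirroring the observation that the apparent relationship disappears once age is fixed.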

How a Naïve Bayes Classifier Works

Using the naïve Bayes assumption, we only need to estimate the conditional probability of each $x_i$ given Y separately, instead of computing the class-conditional probability for every combination of attribute values. For example, if $n^0_i$ and $n^0_j$ denote the number of training instances belonging to class 0 with $X_1 = c_i$ and $X_2 = c_j$, respectively, then the class-conditional probability can be estimated as

$$P(X_1 = c_i, X_2 = c_j \mid Y = 0) = \frac{n^0_i}{n^0} \times \frac{n^0_j}{n^0}.$$

In the previous equation, we only need to count the number of training instances for every one of the k values of an attribute X, irrespective of the values of other attributes. Hence, the number of parameters needed to learn class-conditional probabilities is reduced from $k^d$ to $dk$. This greatly simplifies the expression for the class-conditional probability and makes it more amenable to learning parameters and making predictions, even in high-dimensional settings.

The naïve Bayes classifier computes the posterior probability for a test instance $\mathbf{x}$ by using the following equation:

$$P(y|\mathbf{x}) = \frac{P(y) \prod_{i=1}^{d} P(x_i|y)}{P(\mathbf{x})}. \tag{4.18}$$

Since $P(\mathbf{x})$ is fixed for every y and only acts as a normalizing constant to ensure that $P(y|\mathbf{x}) \in [0, 1]$, we can write

$$P(y|\mathbf{x}) \propto P(y) \prod_{i=1}^{d} P(x_i|y).$$

Hence, it is sufficient to choose the class that maximizes $P(y) \prod_{i=1}^{d} P(x_i|y)$.

One of the useful properties of the naïve Bayes classifier is that it can easily work with incomplete information about data instances, when only a subset of attributes are observed at every instance. For example, if we only observe p out of d attributes at a data instance, then we can still compute $P(y) \prod_{i=1}^{p} P(x_i|y)$ using those p attributes and choose the class with the maximum value. The naïve Bayes classifier can thus naturally handle missing values in test instances. In fact, in the extreme case where no attributes are observed, we can still use the prior probability P(y) as an estimate of the posterior probability. As we observe more attributes, we can keep refining the posterior probability to better reflect the likelihood of observing the data instance.

In the next two subsections, we describe several approaches for estimating the conditional probabilities $P(x_i|y)$ for categorical and continuous attributes from the training set.

Estimating Conditional Probabilities for Categorical Attributes

For a categorical attribute $X_i$, the conditional probability $P(X_i = c|y)$ is estimated according to the fraction of training instances in class y where $X_i$ takes on a particular categorical value c:

$$P(X_i = c|y) = \frac{n_c}{n},$$

where n is the number of training instances belonging to class y, out of which $n_c$ instances have $X_i = c$. For example, in the training set given in Figure 4.8, seven people have the class label Defaulted Borrower = No, out of which three people have Home Owner = Yes while the remaining four have Home Owner = No. As a result, the conditional probability P(Home Owner = Yes | Defaulted Borrower = No) is equal to 3/7. Similarly, the conditional probability for defaulted borrowers with Marital Status = Single is given by P(Marital Status = Single | Defaulted Borrower = Yes) = 2/3. Note that the sum of conditional probabilities over all possible outcomes of $X_i$ is equal to one, i.e., $\sum_{c} P(X_i = c|y) = 1$.

Figure 4.8.

Training set for predicting the loan default problem.
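The fraction-based estimate $n_c/n$ can be sketched as follows. The records encode only the Home Owner counts quoted in the text for the class Defaulted Borrower = No (three owners and four non-owners); the record layout and the helper name `conditional_probs` are assumptions made for this illustration, and the full contents of Figure 4.8 are not reproduced.

```python
from collections import Counter

def conditional_probs(records, attribute, label):
    """Estimate P(attribute = c | label) = n_c / n for every observed value c.

    records is a list of (attributes_dict, class_label) pairs.
    """
    values = [attrs[attribute] for attrs, y in records if y == label]
    n = len(values)                                  # instances in the class
    return {c: n_c / n for c, n_c in Counter(values).items()}

# Seven non-defaulting borrowers: three home owners and four non-owners.
records = (
    [({"home_owner": "Yes"}, "No")] * 3
    + [({"home_owner": "No"}, "No")] * 4
)

p_owner_given_no = conditional_probs(records, "home_owner", "No")
print(p_owner_given_no)
```

The estimates come out as 3/7 and 4/7 and sum to one, as required of a conditional distribution over the values of one attribute.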

Estimating Conditional Probabilities for

Continuous Attributes

There are two ways to estimate the class-conditional probabilities for continuous attributes:

1. We can discretize each continuous attribute and then replace the continuous values with their corresponding discrete intervals. This approach transforms the continuous attributes into ordinal attributes, and the simple method described previously for computing the conditional probabilities of categorical attributes can be employed. Note that the estimation error of this method depends on the discretization strategy (as described in Section 2.3.6 on page 63), as well as the number of discrete intervals. If the number of intervals is too large, every interval may have an insufficient number of training instances to provide a reliable estimate of $P(X_i|Y)$. On the other hand, if the number of intervals is too small, then the discretization process may lose information about the true distribution of continuous values, and thus result in poor predictions.

2. We can assume a certain form of probability distribution for the continuous variable and estimate the parameters of the distribution using the training data. For example, we can use a Gaussian distribution to represent the conditional probability of continuous attributes. The Gaussian distribution is characterized by two parameters, the mean, $\mu$, and the variance, $\sigma^2$. For each class $y_j$, the class-conditional probability for attribute $X_i$ is

$$P(X_i = x_i \mid Y = y_j) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \exp\left[-\frac{(x_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right]. \tag{4.19}$$

The parameter $\mu_{ij}$ can be estimated using the sample mean of $X_i$ ($\bar{x}$) for all training instances that belong to $y_j$. Similarly, $\sigma_{ij}^2$ can be estimated from the sample variance ($s^2$) of such training instances. For example, consider the annual income attribute shown in Figure 4.8. The sample mean and variance for this attribute with respect to the class No are

$$\bar{x} = \frac{125 + 100 + 70 + \ldots + 75}{7} = 110,$$

$$s^2 = \frac{(125 - 110)^2 + (100 - 110)^2 + \ldots + (75 - 110)^2}{6} = 2975,$$

$$s = \sqrt{2975} = 54.54.$$

Given a test instance with taxable income equal to $120K, we can use the following value as its conditional probability given class No:

$$P(\text{Income} = 120 \mid \text{No}) = \frac{1}{\sqrt{2\pi}\,(54.54)} \exp\left[-\frac{(120 - 110)^2}{2 \times 2975}\right] = 0.0072.$$

Example 4.5. [Naïve Bayes Classifier]

Consider the data set shown in Figure 4.9(a), where the target class is Defaulted Borrower, which can take two values Yes and No. We can compute the class-conditional probability for each categorical attribute and the sample mean and variance for the continuous attribute, as summarized in Figure 4.9(b).

We are interested in predicting the class label of a test instance x = (Home Owner = No, Marital Status = Married, Annual Income = $120K). To do this, we first compute the prior probabilities by counting the number of training instances belonging to every class. We thus obtain P(Yes) = 0.3 and P(No) = 0.7. Next, we can compute the class-conditional probability as follows:

$$P(\mathbf{x} \mid \text{No}) = P(\text{Home Owner} = \text{No} \mid \text{No}) \times P(\text{Status} = \text{Married} \mid \text{No}) \times P(\text{Annual Income} = \$120\text{K} \mid \text{No}) = 0.0024,$$

$$P(\mathbf{x} \mid \text{Yes}) = P(\text{Home Owner} = \text{No} \mid \text{Yes}) \times P(\text{Status} = \text{Married} \mid \text{Yes}) \times P(\text{Annual Income} = \$120\text{K} \mid \text{Yes}) = 0.$$

Figure 4.9.

The naïve Bayes classifier for the loan classification problem.

Notice that the class-conditional probability for class Yes has become 0 because there are no instances belonging to class Yes with Status = Married in the training set. Using these class-conditional probabilities, we can estimate the posterior probabilities as

$$P(\text{No} \mid \mathbf{x}) = \frac{0.7 \times 0.0024}{P(\mathbf{x})} = 0.0016\,\alpha, \qquad P(\text{Yes} \mid \mathbf{x}) = \frac{0.3 \times 0}{P(\mathbf{x})} = 0,$$

where $\alpha = 1/P(\mathbf{x})$ is a normalizing constant. Since $P(\text{No} \mid \mathbf{x}) > P(\text{Yes} \mid \mathbf{x})$, the instance is classified as No.
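The Gaussian estimate of Equation 4.19 and the final normalization step of Example 4.5 can be reproduced with a short sketch. It uses only summary numbers quoted in the text (mean 110, variance 2975, priors 0.7 and 0.3, class-conditional probabilities 0.0024 and 0); the function names `gaussian_pdf` and `normalize` are introduced here for illustration.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Gaussian density for a single attribute value, as in Equation 4.19."""
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return math.exp(exponent) / math.sqrt(2.0 * math.pi * variance)

# Annual Income (in $K) for class No: sample mean 110 and sample variance 2975.
p_income_no = gaussian_pdf(120.0, mean=110.0, variance=2975.0)

def normalize(scores):
    """Rescale unnormalized scores P(y) * P(x | y) so that they sum to 1."""
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

# Prior times class-conditional probability for each class in Example 4.5.
posteriors = normalize({"No": 0.7 * 0.0024, "Yes": 0.3 * 0.0})

print(round(p_income_no, 4), posteriors)
```

Because the score for class Yes is exactly zero, normalization assigns the whole posterior mass to class No, which is the situation the next subsection addresses.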

Handling Zero Conditional Probabilities

The preceding example illustrates a potential problem with using the naïve

Bayes assumption in estimating class-conditional probabilities. If the

conditional probability for any of the attributes is zero, then the entire

expression for the class-conditional probability becomes zero. Note that zero

conditional probabilities arise when the number of training instances is small

and the number of possible values of an attribute is large. In such cases, it

may happen that a combination of attribute values and class labels is never

observed, resulting in a zero conditional probability.

In a more extreme case, if the training instances do not cover some

combinations of attribute values and class labels, then we may not be able to

even classify some of the test instances. For example, if $P(\text{Marital Status}=\text{Divorced} \mid \text{No})$ is zero instead of 1/7, then a data instance with attribute set $\mathbf{x}$ = (Home Owner = Yes, Marital Status = Divorced, Income = \$120K) has the following class-conditional probabilities:

$$P(\mathbf{x} \mid \text{No}) = 3/7 \times 0 \times 0.0072 = 0,$$

$$P(\mathbf{x} \mid \text{Yes}) = 0 \times 1/3 \times 1.2 \times 10^{-9} = 0.$$

Since both the class-conditional probabilities are 0, the naïve Bayes classifier will not be able to classify the instance. To address this problem, it is important to adjust the conditional probability estimates so that they are not as brittle as simply using fractions of training instances. This can be achieved by using the following alternate estimates of conditional probability:

$$\text{Laplace estimate:} \quad P(X_i = c \mid y) = \frac{n_c + 1}{n + v}, \tag{4.20}$$

$$\text{m-estimate:} \quad P(X_i = c \mid y) = \frac{n_c + mp}{n + m}, \tag{4.21}$$

where $n$ is the number of training instances belonging to class $y$, $n_c$ is the number of training instances with $X_i = c$ and $Y = y$, $v$ is the total number of attribute values that $X_i$ can take, $p$ is some initial estimate of $P(X_i = c \mid y)$ that is known a priori, and $m$ is a hyper-parameter that indicates our confidence in using $p$ when the fraction of training instances is too brittle. Note that even if $n_c = 0$, both Laplace and m-estimate provide non-zero values of conditional probabilities. Hence, they avoid the problem of vanishing class-conditional probabilities and thus generally provide more robust estimates of posterior probabilities.
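Equations 4.20 and 4.21 translate directly into code. The sketch below is illustrative only: the function names are ours, and the choice of $v = 3$ (assuming three possible marital statuses), $m = 3$, and $p = 1/3$ are assumptions made for the sake of the example, not values given in the text.

```python
def laplace_estimate(n_c, n, v):
    """Equation 4.20: add one pseudo-count for each of the v attribute values."""
    return (n_c + 1) / (n + v)

def m_estimate(n_c, n, m, p):
    """Equation 4.21: blend the observed fraction with a prior guess p, weighted by m."""
    return (n_c + m * p) / (n + m)

# Scenario from the text: no class-No instance is Divorced (n_c = 0, n = 7).
# Assumed for illustration: v = 3 marital statuses, m = 3, p = 1/3.
raw_fraction = 0 / 7
smoothed_laplace = laplace_estimate(n_c=0, n=7, v=3)
smoothed_m = m_estimate(n_c=0, n=7, m=3, p=1 / 3)
```

With these assumed settings both estimates equal 0.1 instead of 0. More generally, the m-estimate with $m = v$ and $p = 1/v$ reduces to the Laplace estimate.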

Characteristics of Naïve Bayes Classifiers

1. Naïve Bayes classifiers are probabilistic classification models that are

able to quantify the uncertainty in predictions by providing posterior

probability estimates. They are also generative classification models as

they treat the target class as the causative factor for generating the

data instances. Hence, apart from computing posterior probabilities,

naïve Bayes classifiers also attempt to capture the underlying

mechanism behind the generation of data instances belonging to every

class. They are thus useful for gaining predictive as well as descriptive

insights.

2. By using the naïve Bayes assumption, they can easily compute class-

conditional probabilities even in high-dimensional settings, provided

that the attributes are conditionally independent of each other given the

class labels. This property makes naïve Bayes classifier a simple and

effective classification technique that is commonly used in diverse

application problems, such as text classification.

3. Naïve Bayes classifiers are robust to isolated noise points because

such points are not able to significantly impact the conditional

probability estimates, as they are often averaged out during training.


4. Naïve Bayes classifiers can handle missing values in the training set by

ignoring the missing values of every attribute while computing its

conditional probability estimates. Further, naïve Bayes classifiers can

effectively handle missing values in a test instance, by using only the

non-missing attribute values while computing posterior probabilities. If

the frequency of missing values for a particular attribute value depends

on class label, then this approach will not accurately estimate posterior

probabilities.

5. Naïve Bayes classifiers are robust to irrelevant attributes. If $X_i$ is an irrelevant attribute, then $P(X_i \mid Y)$ becomes almost uniformly distributed for every class $y$. The class-conditional probabilities for every class thus receive similar contributions of $P(X_i \mid Y)$, resulting in negligible impact on the posterior probability estimates.

6. Correlated attributes can degrade the performance of naïve Bayes classifiers because the naïve Bayes assumption of conditional independence no longer holds for such attributes. For example, consider the following probabilities:

$$P(A=0 \mid Y=0) = 0.4, \quad P(A=1 \mid Y=0) = 0.6, \quad P(A=0 \mid Y=1) = 0.6, \quad P(A=1 \mid Y=1) = 0.4,$$

where A is a binary attribute and Y is a binary class variable. Suppose there is another binary attribute B that is perfectly correlated with A when $Y = 0$, but is independent of A when $Y = 1$. For simplicity, assume that the conditional probabilities for B are the same as for A. Given an instance with attributes $A = 0, B = 0$, and assuming conditional independence, we can compute its posterior probabilities as follows:

$$P(Y=0 \mid A=0, B=0) = \frac{P(A=0 \mid Y=0)\,P(B=0 \mid Y=0)\,P(Y=0)}{P(A=0, B=0)} = \frac{0.16 \times P(Y=0)}{P(A=0, B=0)},$$

$$P(Y=1 \mid A=0, B=0) = \frac{P(A=0 \mid Y=1)\,P(B=0 \mid Y=1)\,P(Y=1)}{P(A=0, B=0)} = \frac{0.36 \times P(Y=1)}{P(A=0, B=0)}.$$

If $P(Y=0) = P(Y=1)$, then the naïve Bayes classifier would assign the instance to class 1. However, the truth is

$$P(A=0, B=0 \mid Y=0) = P(A=0 \mid Y=0) = 0.4,$$

because A and B are perfectly correlated when $Y = 0$. As a result, the posterior probability for $Y = 0$ is

$$P(Y=0 \mid A=0, B=0) = \frac{P(A=0, B=0 \mid Y=0)\,P(Y=0)}{P(A=0, B=0)} = \frac{0.4 \times P(Y=0)}{P(A=0, B=0)},$$

which is larger than that for $Y = 1$. The instance should have been classified as class 0. Hence, the naïve Bayes classifier can produce incorrect results when the attributes are not conditionally independent given the class labels. Naïve Bayes classifiers are thus not well-suited for handling redundant or interacting attributes.
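The effect of the correlated attribute B in the last characteristic can be reproduced numerically. The following sketch uses the conditional probabilities given above and assumes equal priors, $P(Y=0) = P(Y=1) = 0.5$; the helper `normalize` and the variable names are our own.

```python
# p_a_given_y[y][a] = P(A = a | Y = y); B is assumed to have the same table.
p_a_given_y = {0: {0: 0.4, 1: 0.6}, 1: {0: 0.6, 1: 0.4}}
prior = {0: 0.5, 1: 0.5}  # assumed equal priors

def normalize(scores):
    """Divide by the evidence so that the posteriors sum to one."""
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

# Naive Bayes treats A and B as independent given every class.
naive_posterior = normalize(
    {y: p_a_given_y[y][0] * p_a_given_y[y][0] * prior[y] for y in prior}
)

# True model: B copies A when Y = 0, so P(A=0, B=0 | Y=0) = P(A=0 | Y=0);
# A and B remain independent when Y = 1.
true_posterior = normalize({
    0: p_a_given_y[0][0] * prior[0],
    1: p_a_given_y[1][0] * p_a_given_y[1][0] * prior[1],
})

naive_label = max(naive_posterior, key=naive_posterior.get)
true_label = max(true_posterior, key=true_posterior.get)
```

Under these assumptions the naïve computation favors class 1 (0.18 versus 0.08 before normalization), whereas the true joint distribution favors class 0 (0.20 versus 0.18).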

4.5 Bayesian Networks

The conditional independence assumption made by naïve Bayes classifiers

may seem too rigid, especially for classification problems where the attributes

are dependent on each other even after conditioning on the class labels. We

thus need an approach to relax the naïve Bayes assumption so that we can

capture more generic representations of conditional independence among

attributes.

In this section, we present a flexible framework for modeling probabilistic

relationships between attributes and class labels, known as Bayesian

Networks. By building on concepts from probability theory and graph theory,

Bayesian networks are able to capture more generic forms of conditional

independence using simple schematic representations. They also provide the

necessary computational structure to perform inferences over random

variables in an efficient way. In the following, we first describe the basic

representation of a Bayesian network, and then discuss methods for

performing inference and learning model parameters in the context of

classification.

4.5.1 Graphical Representation

Bayesian networks belong to a broader family of models for capturing

probabilistic relationships among random variables, known as probabilistic

graphical models. The basic concept behind these models is to use

graphical representations where the nodes of the graph correspond to random

variables and the edges between the nodes express probabilistic

relationships. Figures 4.10(a) and 4.10(b) show examples of

probabilistic graphical models using directed edges (with arrows) and

undirected edges (without arrows), respectively. Directed graphical models

are also known as Bayesian networks while undirected graphical models are

known as Markov random fields. The two approaches use different

semantics for expressing relationships among random variables and are thus

useful in different contexts. In the following, we briefly describe Bayesian

networks that are useful in the context of classification.

A Bayesian network (also referred to as a belief network) involves directed

edges between nodes, where every edge represents a direction of influence

among random variables. For example, Figure 4.10(a) shows a Bayesian

network where variable C depends upon the values of variables A and B, as

indicated by the arrows pointing toward C from A and B. Consequently, the

variable C influences the values of variables D and E. Every edge in a

Bayesian network thus encodes a dependence relationship between random

variables with a particular directionality.

Figure 4.10.

Illustrations of two basic types of graphical models.

Bayesian networks are directed acyclic graphs (DAG) because they do not

contain any directed cycles such that the influence of a node loops back to the

same node. Figure 4.11 shows some examples of Bayesian networks that

capture different types of dependence structures among random variables. In

a directed acyclic graph, if there is a directed edge from X to Y, then X is called the parent of Y and Y is called the child of X. Note that a node can have multiple parents in a Bayesian network, e.g., node D has two parent nodes, B and C, in Figure 4.11(a). Furthermore, if there is a directed path in the network from X to Z, then X is an ancestor of Z, while Z is a descendant of X. For example, in the diagram shown in Figure 4.11(b), A is a descendant of D and D is an ancestor of B. Note that there can be multiple directed paths between two nodes of a directed acyclic graph, as is the case for nodes A and D in Figure 4.11(a).
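The parent, child, ancestor, and descendant relations can be made concrete with a short sketch. Since the exact structures in Figure 4.11 are not reproduced here, the code assumes the network described for Figure 4.10(a), in which A and B point to C, and C points to D and E; the edge list and function names are illustrative.

```python
# Directed edges (parent, child), assumed from the description of Figure 4.10(a).
EDGES = [("A", "C"), ("B", "C"), ("C", "D"), ("C", "E")]
NODES = {node for edge in EDGES for node in edge}

def parents(node):
    return {u for u, v in EDGES if v == node}

def children(node):
    return {v for u, v in EDGES if u == node}

def descendants(node):
    """All nodes reachable from `node` by following directed edges."""
    found, stack = set(), list(children(node))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children(current))
    return found

def ancestors(node):
    """All nodes from which `node` is reachable."""
    return {other for other in NODES if node in descendants(other)}

def non_descendants(node):
    return NODES - descendants(node) - {node}
```

For example, `parents("C")` returns A and B, while `ancestors("D")` returns A, B, and C. The set returned by `non_descendants` is the one that appears in the local Markov property discussed later in this section.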

Figure 4.11.

Examples of Bayesian networks.

Conditional Independence

An important property of a Bayesian network is its ability to represent varying

forms of conditional independence among random variables. There are

several ways of describing the conditional independence assumptions

captured by Bayesian networks. One of the most generic ways of expressing

conditional independence is the concept of d-separation, which can be used

to determine if any two sets of nodes A and B are conditionally independent

given another set of nodes C. Another useful concept is that of the Markov

blanket of a node Y, which denotes the minimal set of nodes X that makes Y

independent of the other nodes in the graph, when conditioned on X. (See

Bibliographic Notes for more details on d-separation and Markov blanket.)

However, for the purpose of classification, it is sufficient to describe a simpler

expression of conditional independence in Bayesian networks, known as the

local Markov property.

Property 1 (Local Markov Property).

A node in a Bayesian network is conditionally independent of its

non-descendants, if its parents are known.

To illustrate the local Markov property, consider the Bayes network shown in

Figure 4.11(b) . We can state that A is conditionally independent of both B

and D given C, because C is the parent of A and nodes B and D are non-

descendants of A. The local Markov property helps in interpreting parent-child

relationships in Bayesian networks as representations of conditional

probabilities. Since a node is conditionally independent of its non-descendants given its parents, the conditional independence assumptions imposed by a Bayesian network are often sparse in structure. Nonetheless, Bayesian networks are able to express a richer class of conditional independence statements among attributes and class labels than the naïve Bayes classifier. In fact, the naïve Bayes classifier can be viewed as a special type of Bayesian network, where the target class Y is at the root of a tree and every attribute $X_i$ is connected to the root node by a directed edge, as shown in Figure 4.12(a).

Figure 4.12.

Comparing the graphical representation of a naïve Bayes classifier with that of

a generic Bayesian network.

Note that in a naïve Bayes classifier, every directed edge points from the target class to the observed attributes, suggesting that the class label is a factor behind the generation of attributes. Inferring the class label can thus be viewed as diagnosing the root cause behind the observed attributes. On the other hand, Bayesian networks provide a more generic structure of probabilistic relationships, since the target class is not required to be at the root of a tree but can appear anywhere in the graph, as shown in Figure 4.12(b). In this diagram, inferring Y not only helps in diagnosing the factors influencing $X_3$ and $X_4$, but also helps in predicting the influence of $X_1$ and $X_2$.

Joint Probability

The local Markov property can be used to succinctly express the joint probability of the set of random variables involved in a Bayesian network. To realize this, let us first consider a Bayesian network consisting of d nodes, $X_1$ to $X_d$, where the nodes have been numbered in such a way that $X_i$ is an ancestor of $X_j$ only if $i < j$. The joint probability of $\mathbf{X} = \{X_1, \ldots, X_d\}$ can be generically factorized using the chain rule of probability as

$$P(\mathbf{X}) = P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_1, X_2) \cdots P(X_d \mid X_1, \ldots, X_{d-1}) = \prod_{i=1}^{d} P(X_i \mid X_1, \ldots, X_{i-1}). \tag{4.22}$$

By the way we have constructed the graph, note that the set $\{X_1, \ldots, X_{i-1}\}$ contains only non-descendants of $X_i$. Hence, by using the local Markov property, we can write $P(X_i \mid X_1, \ldots, X_{i-1})$ as $P(X_i \mid pa(X_i))$, where $pa(X_i)$ denotes the parents of $X_i$. The joint probability can then be represented as

$$P(\mathbf{X}) = \prod_{i=1}^{d} P(X_i \mid pa(X_i)). \tag{4.23}$$

It is thus sufficient to represent the probability of every node $X_i$ in terms of its parent nodes, $pa(X_i)$, for computing $P(\mathbf{X})$. This is achieved with the help of probability tables that associate every node to its parent nodes as follows:

1. The probability table for node $X_i$ contains the conditional probability values $P(X_i \mid pa(X_i))$ for every combination of values in $X_i$ and $pa(X_i)$.

2. If $X_i$ has no parents $(pa(X_i) = \phi)$, then the table contains only the prior probability $P(X_i)$.
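The factorization in Equation 4.23 can be sketched as a product of probability table lookups. The three-node network below (A and B are parents of C) and all of its probability values are hypothetical, chosen only to illustrate the mechanics; the names `PARENTS`, `CPT`, and `joint_probability` are ours.

```python
from itertools import product

# Hypothetical network: A -> C <- B, with binary variables.
PARENTS = {"A": (), "B": (), "C": ("A", "B")}

# CPT[node][parent_values][value] = P(node = value | parents = parent_values).
# Nodes without parents store only their prior under the empty tuple.
CPT = {
    "A": {(): {0: 0.7, 1: 0.3}},
    "B": {(): {0: 0.4, 1: 0.6}},
    "C": {
        (0, 0): {0: 0.9, 1: 0.1},
        (0, 1): {0: 0.5, 1: 0.5},
        (1, 0): {0: 0.6, 1: 0.4},
        (1, 1): {0: 0.2, 1: 0.8},
    },
}

def joint_probability(assignment):
    """Equation 4.23: multiply P(X_i | pa(X_i)) over all nodes."""
    prob = 1.0
    for node, node_parents in PARENTS.items():
        parent_values = tuple(assignment[p] for p in node_parents)
        prob *= CPT[node][parent_values][assignment[node]]
    return prob

example = joint_probability({"A": 1, "B": 0, "C": 1})  # 0.3 * 0.4 * 0.4

# Sanity check: the joint probabilities of all assignments sum to one.
total = sum(
    joint_probability(dict(zip(PARENTS, values)))
    for values in product([0, 1], repeat=len(PARENTS))
)
```

Note that the full joint distribution over three binary variables has 8 entries, while the tables above store only 2 + 2 + 8 numbers organized by parent configuration; the savings grow quickly as the network becomes larger and sparser.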

Example 4.6. [Probability Tables]

Figure 4.13 shows an example of a Bayesian network for modeling the

relationships between a patient’s symptoms and risk factors. The

probability tables are shown at the side of every node in the figure. The

probability tables associated with the risk factors (Exercise and Diet)

contain only the prior probabilities, whereas the tables for heart disease,

heartburn, blood pressure, and chest pain, contain the conditional

probabilities.

Figure 4.13.

A Bayesian network for detecting heart disease and heartburn in patients.

Use of Hidden Variables

A Bayesian network typically involves two types of variables: observed

variables that are clamped to specific observed values, and unobserved

variables, whose values are not known and need to be inferred from the

network. To distinguish between these two types of variables, observed

variables are generally represented using shaded nodes while unobserved

variables are represented using empty nodes. Figure 4.14 shows an

example of a Bayesian network with observed variables (A, B, and E) and

unobserved variables (C and D).

Figure 4.14.

Observed and unobserved variables are represented using shaded and

unshaded circles, respectively.

In the context of classification, the observed variables correspond to the set of

attributes X, while the target class is represented using an unobserved

variable Y that needs to be inferred during testing. However, note that a

generic Bayesian network may contain many other unobserved variables

apart from the target class, as represented in Figure 4.15 as the set of

variables H. These unobserved variables represent hidden or confounding

factors that affect the probabilities of attributes and class labels, although they

are never directly observed. The use of hidden variables enhances the

expressive power of Bayesian networks in representing complex probabilistic

relationships between attributes and class labels. This is one of the key

distinguishing properties of Bayesian networks as compared to naïve Bayes

classifiers.

4.5.2 Inference and Learning

Given the probability tables corresponding to every node in a Bayesian network, the problem of inference corresponds to computing the probabilities of different sets of random variables. In the context of classification, one of the key inference problems is to compute the probability of a target class Y taking on a specific value y, given the set of observed attributes at a data instance, $\mathbf{x}$. This can be represented using the following conditional probability:

$$P(Y = y \mid \mathbf{x}) = \frac{P(y, \mathbf{x})}{P(\mathbf{x})} = \frac{P(y, \mathbf{x})}{\sum_{y'} P(y', \mathbf{x})}. \tag{4.24}$$

The previous equation involves marginal probabilities of the form $P(y, \mathbf{x})$. They can be computed by marginalizing out the hidden variables H from the joint probability as follows:

$$P(y, \mathbf{x}) = \sum_{H} P(y, \mathbf{x}, H), \tag{4.25}$$

where the joint probability $P(y, \mathbf{x}, H)$ can be obtained by using the factorization described in Equation 4.23. To understand the nature of computations involved in estimating $P(y, \mathbf{x})$, consider the example Bayesian network shown in Figure 4.15, which involves a target class, Y, three observed attributes, $X_1$ to $X_3$, and four hidden variables, $H_1$ to $H_4$. For this network, we can express $P(y, \mathbf{x})$ as

$$P(y, \mathbf{x}) = \sum_{h_1}\sum_{h_2}\sum_{h_3}\sum_{h_4} P(y, x_1, x_2, x_3, h_1, h_2, h_3, h_4)$$

$$= \sum_{h_1}\sum_{h_2}\sum_{h_3}\sum_{h_4} \Big[ P(h_1)\,P(h_2)\,P(x_2)\,P(h_4)\,P(x_1 \mid h_1, h_2) \times P(h_3 \mid x_2, h_2)\,P(y \mid x_1, h_3)\,P(x_3 \mid h_3, h_4) \Big], \tag{4.26}$$

$$= \sum_{h_1}\sum_{h_2}\sum_{h_3}\sum_{h_4} f(h_1, h_2, h_3, h_4), \tag{4.27}$$

where f is a factor that depends on the values of $h_1$ to $h_4$. In the previous simplistic expression of $P(y, \mathbf{x})$, a different summand is considered for every combination of values, $h_1$ to $h_4$, in the hidden variables, $H_1$ to $H_4$. If we assume that every variable in the network can take k discrete values, then the summation has to be carried out for a total number of $k^4$ times. The computational complexity of this approach is thus $O(k^4)$. Moreover, the number of computations grows exponentially with the number of hidden variables, making it difficult to use this approach with networks that have a large number of hidden variables. In the following, we present different computational techniques for efficiently performing inferences in Bayesian networks.

Figure 4.15.

An example of a Bayesian network with four hidden variables, $H_1$ to $H_4$, three observed attributes, $X_1$ to $X_3$, and one target class Y.

Variable Elimination

To reduce the number of computations involved in estimating $P(y, \mathbf{x})$, let us closely examine the expressions in Equations 4.26 and 4.27. Notice that although $f(h_1, h_2, h_3, h_4)$ depends on the values of all four hidden variables, it can be decomposed as a product of several smaller factors, where every factor involves only a small number of hidden variables. For example, the factor $P(h_4)$ depends only on the value of $h_4$, and thus acts as a constant multiplicative term when summations are performed over $h_1$, $h_2$, or $h_3$. Hence, if we place $P(h_4)$ outside the summations of $h_1$ to $h_3$, we can save some repeated multiplications occurring inside every summand.

In general, we can push every summation as far inside as possible, so that the factors that do not depend on the summing variable are placed outside the summation. This will help reduce the number of wasteful computations by using smaller factors at every summation. To illustrate this process, consider the following sequence of steps for computing $P(y, \mathbf{x})$, by rearranging the order of summations in Equation 4.26.

$$P(y, \mathbf{x}) = P(x_2) \sum_{h_4} P(h_4) \sum_{h_3} P(y \mid x_1, h_3)\,P(x_3 \mid h_3, h_4) \times \sum_{h_2} P(h_2)\,P(h_3 \mid x_2, h_2) \sum_{h_1} P(h_1)\,P(x_1 \mid h_1, h_2) \tag{4.28}$$

$$= P(x_2) \sum_{h_4} P(h_4) \sum_{h_3} P(y \mid x_1, h_3)\,P(x_3 \mid h_3, h_4) \times \sum_{h_2} P(h_2)\,P(h_3 \mid x_2, h_2)\,f_1(h_2) \tag{4.29}$$

$$= P(x_2) \sum_{h_4} P(h_4) \sum_{h_3} P(y \mid x_1, h_3)\,P(x_3 \mid h_3, h_4)\,f_2(h_3) \tag{4.30}$$

$$= P(x_2) \sum_{h_4} P(h_4)\,f_3(h_4) \tag{4.31}$$

where $f_i$ represents the intermediate factor term obtained by summing out $h_i$. To check if the previous rearrangements provide any improvements in computational efficiency, let us count the number of computations occurring at every step of the process. At the first step (Equation 4.28), we perform a summation over $h_1$ using factors that depend on $h_1$ and $h_2$. This requires considering every pair of values in $h_1$ and $h_2$, resulting in $O(k^2)$ computations. Similarly, the second step (Equation 4.29) involves summing out $h_2$ using factors of $h_2$ and $h_3$, leading to $O(k^2)$ computations. The third step (Equation 4.30) again requires $O(k^2)$ computations as it involves summing out $h_3$ over factors depending on $h_3$ and $h_4$. Finally, the fourth step (Equation 4.31) involves summing out $h_4$ using factors depending on $h_4$, resulting in $O(k)$ computations.

The overall complexity of the previous approach is thus $O(k^2)$, which is considerably smaller than the $O(k^4)$ complexity of the basic approach. Hence, by merely rearranging summations and using algebraic manipulations, we are able to improve the computational efficiency in computing $P(y, \mathbf{x})$. This procedure is known as variable elimination.

The basic concept that variable elimination exploits to reduce the number of computations is the distributive nature of multiplication over addition operations. For example, consider the following multiplication and addition operations:

$$a \cdot (b + c + d) = a \cdot b + a \cdot c + a \cdot d$$

Notice that the right-hand side of the previous equation involves three multiplications and two additions, while the left-hand side involves only one multiplication and two additions, thus saving two multiplications. This property is utilized by variable elimination in moving constant terms outside the summation, such that they are multiplied only once.

Note that the efficiency of variable elimination depends on the order of hidden variables used for performing summations. Hence, we would ideally like to find the optimal order of variables that results in the smallest number of computations. Unfortunately, finding the optimal order of summations for a generic Bayesian network is an NP-hard problem, i.e., no algorithm is known that finds the optimal ordering in polynomial time, and none exists unless P = NP. However, there exist efficient techniques for handling special types of Bayesian networks, e.g., those involving tree-like graphs, as described in the following.
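To see that the rearranged summations of Equations 4.28 to 4.31 produce the same value as the brute-force sum of Equation 4.26, consider the following sketch. It assumes the factorization stated in Equation 4.26, fills every probability table with random (hypothetical) values, and fixes an arbitrary observed instance; the constant `K`, the helper names, and the chosen values of y, x1, x2, and x3 are all illustrative assumptions.

```python
import itertools
import random

random.seed(0)
K = 3                      # assumed number of discrete values per variable
VALUES = range(K)

def random_distribution():
    weights = [random.random() for _ in VALUES]
    total = sum(weights)
    return [w / total for w in weights]

def random_cpt(num_parents):
    """Map each tuple of parent values to a distribution over the child."""
    return {pa: random_distribution()
            for pa in itertools.product(VALUES, repeat=num_parents)}

p_h1, p_h2, p_x2, p_h4 = (random_distribution() for _ in range(4))
p_x1 = random_cpt(2)       # p_x1[(h1, h2)][x1] = P(x1 | h1, h2)
p_h3 = random_cpt(2)       # p_h3[(x2, h2)][h3] = P(h3 | x2, h2)
p_y = random_cpt(2)        # p_y[(x1, h3)][y]   = P(y | x1, h3)
p_x3 = random_cpt(2)       # p_x3[(h3, h4)][x3] = P(x3 | h3, h4)

y, x1, x2, x3 = 0, 1, 2, 0  # an arbitrary observed instance and class value

def brute_force():
    """Equation 4.26: one summand for each of the K**4 hidden combinations."""
    total = 0.0
    for h1, h2, h3, h4 in itertools.product(VALUES, repeat=4):
        total += (p_h1[h1] * p_h2[h2] * p_x2[x2] * p_h4[h4]
                  * p_x1[(h1, h2)][x1] * p_h3[(x2, h2)][h3]
                  * p_y[(x1, h3)][y] * p_x3[(h3, h4)][x3])
    return total

def variable_elimination():
    """Equations 4.28 to 4.31: sum out h1, h2, h3, and h4 in turn."""
    f1 = [sum(p_h1[h1] * p_x1[(h1, h2)][x1] for h1 in VALUES) for h2 in VALUES]
    f2 = [sum(p_h2[h2] * p_h3[(x2, h2)][h3] * f1[h2] for h2 in VALUES)
          for h3 in VALUES]
    f3 = [sum(p_y[(x1, h3)][y] * p_x3[(h3, h4)][x3] * f2[h3] for h3 in VALUES)
          for h4 in VALUES]
    return p_x2[x2] * sum(p_h4[h4] * f3[h4] for h4 in VALUES)
```

Both functions return the same value of $P(y, \mathbf{x})$ up to floating-point error, but `brute_force` visits $K^4$ combinations whereas each intermediate factor in `variable_elimination` requires at most $K^2$ terms.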

Sum-Product Algorithm for Trees

Note that in Equations 4.28 and 4.29, whenever a variable $h_i$ is eliminated during marginalization, it results in the creation of a factor $f_i$ that depends on the neighboring nodes of $h_i$. $f_i$ is then absorbed in the factors of neighboring variables and the process is repeated until all unobserved variables have been marginalized. This phenomenon of variable elimination can be viewed as transmitting a local message from the variable being marginalized to its neighboring nodes. This idea of message passing utilizes the structure of the graph for performing computations, thus making it possible to use graph-theoretic approaches for making effective inferences. The sum-product algorithm builds on the concept of message passing for computing marginal and conditional probabilities on tree-based graphs.

Figure 4.16 shows an example of a tree involving five variables, $X_1$ to $X_5$. A key characteristic of a tree is that every node other than the root has exactly one parent, and there is only one directed edge between any two nodes in the tree. For the purpose of illustration, let us consider the problem of estimating the marginal probability of $X_2$, $P(X_2)$. This can be obtained by marginalizing out every variable in the graph except $X_2$ and rearranging the summations to obtain the following expression:

$$P(x_2) = \sum_{x_1}\sum_{x_3}\sum_{x_4}\sum_{x_5} P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_2)\,P(x_4 \mid x_3)\,P(x_5 \mid x_3)$$

$$= \underbrace{\Big(\sum_{x_1} P(x_1)\,P(x_2 \mid x_1)\Big)}_{m_{12}(x_2)} \; \underbrace{\Big(\sum_{x_3} P(x_3 \mid x_2) \underbrace{\Big(\sum_{x_4} P(x_4 \mid x_3)\Big)}_{m_{43}(x_3)} \underbrace{\Big(\sum_{x_5} P(x_5 \mid x_3)\Big)}_{m_{53}(x_3)}\Big)}_{m_{32}(x_2)},$$

where $m_{ij}(x_j)$ has been conveniently chosen to represent the factor of $x_j$ that is obtained by summing out $x_i$. We can view $m_{ij}(x_j)$ as a local message passed from node $x_i$ to node $x_j$, as shown using arrows in Figure 4.17(a). These local messages capture the influence of eliminating nodes on the marginal probabilities of neighboring nodes.

Figure 4.16.

An example of a Bayesian network with a tree structure.

Before we formally describe the formula for computing $m_{ij}(x_j)$ and $P(x_j)$, we first define a potential function $\psi(\cdot)$ that is associated with every node and edge of the graph. We can define the potential of a node $X_i$ as

$$\psi(X_i) = \begin{cases} P(X_i), & \text{if } X_i \text{ is the root node,} \\ 1, & \text{otherwise.} \end{cases} \tag{4.32}$$

Figure 4.17.

Illustration of message passing in the sum-product algorithm.

Similarly, we can define the potential of an edge between nodes $X_i$ and $X_j$ (where $X_i$ is the parent of $X_j$) as

$$\psi(X_i, X_j) = P(X_j \mid X_i).$$

Using $\psi(X_i)$ and $\psi(X_i, X_j)$, we can represent $m_{ij}(x_j)$ using the following equation:

$$m_{ij}(x_j) = \sum_{x_i} \Big( \psi(x_i)\,\psi(x_i, x_j) \prod_{k \in N(i) \setminus j} m_{ki}(x_i) \Big), \tag{4.33}$$

where N(i) represents the set of neighbors of node $X_i$. The message $m_{ij}$ that is transmitted from $X_i$ to $X_j$ can thus be recursively computed using the messages incident on $X_i$ from its neighboring nodes excluding $X_j$. Note that the formula for $m_{ij}$ involves taking a sum over all possible values of $X_i$, after multiplying the factors obtained from the neighbors of $X_i$. This approach of message passing is thus called the “sum-product” algorithm. Further, since $m_{ij}$ represents a notion of “belief” propagated from $X_i$ to $X_j$, this algorithm is also known as belief propagation. The marginal probability of a node $X_i$ is then given as

$$P(x_i) = \psi(x_i) \prod_{j \in N(i)} m_{ji}(x_i). \tag{4.34}$$

A useful property of the sum-product algorithm is that it allows the messages to be reused for computing a different marginal probability in the future. For example, if we had to compute the marginal probability for node $X_3$, we would require the following messages from its neighboring nodes: $m_{23}(x_3)$, $m_{43}(x_3)$, and $m_{53}(x_3)$. However, note that $m_{43}(x_3)$ and $m_{53}(x_3)$ have already been computed in the process of computing the marginal probability of $X_2$ and thus can be reused.

Notice that the basic operations of the sum-product algorithm resemble a message passing protocol over the edges of the network. A node sends out a message to all its neighboring nodes only after it has received incoming messages from all its neighbors. Hence, we can initialize the message passing protocol from the leaf nodes, and transmit messages till we reach the root node. We can then run a second pass of messages from the root node back to the leaf nodes. In this way, we can compute the messages for every edge in both directions, using just $O(2|E|)$ operations, where $|E|$ is the number of edges. Once we have transmitted all possible messages as shown in Figure 4.17(b), we can easily compute the marginal probability of every node in the graph using Equation 4.34.
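The message-passing scheme of Equations 4.32 to 4.34 can be sketched for the tree described for Figure 4.16, where the factorization above implies the edges $X_1 \to X_2 \to X_3$, $X_3 \to X_4$, and $X_3 \to X_5$. All variables are assumed to be binary and every probability value below is hypothetical. Caching the recursive function `message` stands in for the two-pass schedule, so that each directed message is computed only once.

```python
import itertools
from functools import lru_cache

VALUES = (0, 1)
ROOT = 1
PARENT = {2: 1, 3: 2, 4: 3, 5: 3}      # child -> parent, from the factorization

PRIOR = [0.6, 0.4]                      # hypothetical P(X1)
# CPT[child][parent_value][child_value] = P(child | parent), hypothetical values.
CPT = {
    2: [[0.7, 0.3], [0.2, 0.8]],
    3: [[0.9, 0.1], [0.4, 0.6]],
    4: [[0.5, 0.5], [0.1, 0.9]],
    5: [[0.3, 0.7], [0.8, 0.2]],
}

NEIGHBORS = {i: set() for i in range(1, 6)}
for child, parent in PARENT.items():
    NEIGHBORS[child].add(parent)
    NEIGHBORS[parent].add(child)

def node_potential(i, xi):
    """Equation 4.32: the prior for the root node, 1 for every other node."""
    return PRIOR[xi] if i == ROOT else 1.0

def edge_potential(i, xi, j, xj):
    """P(child | parent) for the edge joining i and j, whichever is the parent."""
    if PARENT.get(j) == i:
        return CPT[j][xi][xj]
    return CPT[i][xj][xi]

@lru_cache(maxsize=None)
def message(i, j, xj):
    """Equation 4.33: the message m_ij(x_j) sent from node i to node j."""
    total = 0.0
    for xi in VALUES:
        incoming = 1.0
        for k in NEIGHBORS[i] - {j}:
            incoming *= message(k, i, xi)
        total += node_potential(i, xi) * edge_potential(i, xi, j, xj) * incoming
    return total

def marginal(i):
    """Equation 4.34: node potential times all incoming messages."""
    result = []
    for xi in VALUES:
        belief = node_potential(i, xi)
        for j in NEIGHBORS[i]:
            belief *= message(j, i, xi)
        result.append(belief)
    return result

def brute_force_marginal(i):
    """Reference answer obtained by summing the full joint distribution."""
    result = [0.0, 0.0]
    for x in itertools.product(VALUES, repeat=5):
        value = dict(zip(range(1, 6), x))
        p = PRIOR[value[1]]
        for child, parent in PARENT.items():
            p *= CPT[child][value[parent]][value[child]]
        result[value[i]] += p
    return result
```

For every node, `marginal` agrees with `brute_force_marginal`. With these hypothetical tables, `marginal(2)` evaluates to a probability of 0.5 for each value of $X_2$.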

In the context of classification, the sum-product algorithm can be easily modified for computing the conditional probability of the class label y given the set of observed attributes $\hat{\mathbf{x}}$, i.e., $P(y \mid \hat{\mathbf{x}})$. This basically amounts to computing $P(y, \mathbf{X} = \hat{\mathbf{x}})$ in Equation 4.24, where X is clamped to the observed values $\hat{\mathbf{x}}$. To handle the scenario where some of the random variables are fixed and do not need to be marginalized, we consider the following modification.

If $X_i$ is a random variable that is fixed to a specific value $\hat{x}_i$, then we can simply modify $\psi(X_i)$ and $\psi(X_i, X_j)$ as follows:

$$\psi(X_i) = \begin{cases} 1, & \text{if } X_i = \hat{x}_i, \\ 0, & \text{otherwise.} \end{cases} \tag{4.35}$$

$$\psi(X_i, X_j) = \begin{cases} P(X_j \mid \hat{x}_i), & \text{if } X_i = \hat{x}_i, \\ 0, & \text{otherwise.} \end{cases} \tag{4.36}$$

We can run the sum-product algorithm using these modified values for every observed variable and thus compute $P(y, \mathbf{X} = \hat{\mathbf{x}})$.

Figure 4.18.

Example of a poly-tree and its corresponding factor graph.

Generalizations for Non-Tree Graphs

The sum-product algorithm is guaranteed to optimally converge in the case of

trees using a single run of message passing in both directions of every edge.

This is because any two nodes in a tree have a unique path for the

transmission of messages. Furthermore, since every node in a tree has a

single parent, the joint probability involves only factors of at most two

variables. Hence, it is sufficient to consider potentials over edges and not

other generic substructures in the graph.

Both of the previous properties are violated in graphs that are not trees, thus

making it difficult to directly apply the sum-product algorithm for making

inferences. However, a number of variants of the sum-product algorithm have

been devised to perform inferences on a broader family of graphs than trees.

Many of these variants transform the original graph into an alternative tree-

based representation, and then apply the sum-product algorithm on the

transformed tree. In this section, we briefly discuss one such transformation,

known as factor graphs.

Factor graphs are useful for making inferences over graphs that violate the

condition that every node has a single parent. Nonetheless, they still require

the absence of multiple paths between any two nodes, to guarantee

convergence. Such graphs are known as poly-trees. An example of a poly-

tree is shown in Figure 4.18(a) .

A poly-tree can be transformed into a tree-based representation with the help

of factor graphs. These graphs consist of two types of nodes, variable nodes

(that are represented using circles) and factor nodes (that are represented

using squares). The factor nodes represent conditional independence

relationships among the variables of the poly-tree. In particular, every

probability table can be represented as a factor node. The edges in a factor

graph are undirected in nature and relate a variable node to a factor node if

the variable is involved in the probability table corresponding to the factor

node. Figure 4.18(b) presents the factor graph representation of the poly-

tree shown in Figure 4.18(a) .

Note that the factor graph of a poly-tree always forms a tree-like structure,

where there is a unique path of influence between any two nodes in the factor

graph. Hence, we can apply a modified form of sum-product algorithm to

transmit messages between variable nodes and factor nodes, which is

guaranteed to converge to optimal values.
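The construction of a factor graph from a poly-tree can be sketched as follows. Since the poly-tree of Figure 4.18(a) is not reproduced here, the example assumes a small hypothetical poly-tree in which C has two parents, A and B, and D has the single parent C; the factor names `f_A`, `f_B`, and so on are also our own convention.

```python
def factor_graph_edges(parents):
    """Create one factor node per probability table and connect it, with
    undirected edges, to every variable that appears in that table."""
    edges = []
    for child, node_parents in parents.items():
        factor = "f_" + child              # factor for P(child | parents)
        for variable in (child, *node_parents):
            edges.append((variable, factor))
    return edges

# Hypothetical poly-tree: A -> C <- B, C -> D.
POLY_TREE = {"A": (), "B": (), "C": ("A", "B"), "D": ("C",)}

edges = factor_graph_edges(POLY_TREE)
variable_nodes = set(POLY_TREE)
factor_nodes = {factor for _, factor in edges}
```

In this example the factor graph has 8 nodes (4 variable nodes and 4 factor nodes) and 7 edges. Because it is connected and has one fewer edge than nodes, it is a tree, even though node C has two parents in the original graph.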

Learning Model Parameters

In all our previous discussions on Bayesian networks, we had assumed that

the topology of the Bayesian network and the values in the probability tables

of every node were already known. In this section, we discuss approaches for

learning both the topology and the probability table values of a Bayesian

network from the training data.

Let us first consider the case where the topology of the network is known and we are only required to compute the probability tables. If there are no unobserved variables in the training data, then we can easily compute the probability table for $P(X_i \mid pa(X_i))$ by counting the fraction of training instances for every value of $X_i$ and every combination of values in $pa(X_i)$. However, if there are unobserved variables in $X_i$ or $pa(X_i)$, then computing the fraction of training instances for such variables is non-trivial and requires the use of advanced techniques such as the Expectation-Maximization algorithm (described later in Chapter 8).
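When all variables are observed, estimating a probability table by counting can be sketched as shown below. The six training records, the abbreviated column names (E for Exercise, D for Diet, HD for heart disease), and the function `estimate_cpt` are hypothetical and serve only to illustrate the counting procedure.

```python
from collections import Counter, defaultdict

def estimate_cpt(records, child, parents):
    """Estimate P(child | parents) as the fraction of training records with
    each child value, computed separately for every parent configuration."""
    counts = defaultdict(Counter)
    for record in records:
        parent_values = tuple(record[p] for p in parents)
        counts[parent_values][record[child]] += 1
    return {
        parent_values: {value: count / sum(counter.values())
                        for value, count in counter.items()}
        for parent_values, counter in counts.items()
    }

# Hypothetical fully observed training records.
records = [
    {"E": "yes", "D": "healthy", "HD": "no"},
    {"E": "yes", "D": "healthy", "HD": "no"},
    {"E": "yes", "D": "healthy", "HD": "yes"},
    {"E": "no", "D": "unhealthy", "HD": "yes"},
    {"E": "no", "D": "unhealthy", "HD": "yes"},
    {"E": "no", "D": "healthy", "HD": "no"},
]

cpt_hd = estimate_cpt(records, "HD", ("E", "D"))   # conditional table
prior_e = estimate_cpt(records, "E", ())           # prior for a parentless node
```

Parent configurations that never occur in the training data, such as (yes, unhealthy) here, receive no entry at all. In practice, the smoothed estimates of Equations 4.20 and 4.21 can be applied to the counts to avoid such gaps.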

Learning the structure of the Bayesian network is a much more challenging

task than learning the probability tables. Although there are some scoring

approaches that attempt to find a graph structure that maximizes the training

likelihood, they are often computationally infeasible when the graph is large.

Hence, a common approach for constructing Bayesian networks is to use the

subjective knowledge of domain experts.

4.5.3 Characteristics of Bayesian Networks

1. Bayesian networks provide a powerful approach for representing

probabilistic relationships between attributes and class labels with the

help of graphical models. They are able to capture complex forms of

dependencies among variables. Apart from encoding prior beliefs, they

are also able to model the presence of latent (unobserved) factors as

hidden variables in the graph. Bayesian networks are thus quite

expressive and provide predictive as well as descriptive insights about

the behavior of attributes and class labels.

2. Bayesian networks can easily handle the presence of correlated or

redundant attributes, as opposed to the naïve Bayes classifier. This is

because Bayesian networks do not use the naïve Bayes assumption

about conditional independence, but instead are able to express richer

forms of conditional independence.

3. Similar to the naïve Bayes classifier, Bayesian networks are also quite robust to the presence of noise in the training data. Further, they can handle missing values during training as well as testing. If a test instance contains an attribute $X_i$ with a missing value, then a Bayesian network can perform inference by treating $X_i$ as an unobserved node and marginalizing out its effect on the target class. Hence, Bayesian networks are well-suited for handling incompleteness in the data, and can work with partial information. However, unless the pattern with which missing values occur is completely random, their presence will likely introduce some degree of error and/or bias into the analysis.

4. Bayesian networks are robust to irrelevant attributes that contain no

discriminatory information about the class labels. Such attributes show

no impact on the conditional probability of the target class, and are thus

rightfully ignored.

5. Learning the structure of a Bayesian network is a cumbersome task

that often requires assistance from expert knowledge. However, once

the structure has been decided, learning the parameters of the network

can be quite straightforward, especially if all the variables in the

network are observed.

6. Due to their additional ability to represent complex forms of

relationships, Bayesian networks are more susceptible to overfitting

than the naïve Bayes classifier. Furthermore, Bayesian

networks typically require more training instances for effectively

learning the probability tables than the naïve Bayes classifier.

7. Although the sum-product algorithm provides computationally efficient

techniques for performing inference over tree-like graphs, the

complexity of the approach increases significantly when dealing with

generic graphs of large sizes. In situations where exact inference is

computationally infeasible, it is quite common to use approximate

inference techniques.
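The handling of a missing test attribute described in item 3 can be illustrated with a small sketch. The network below (class Y with children X1 and X2, where X2 also depends on X1), all of its probability values, and the function names are hypothetical and chosen only for illustration. The posterior P(Y | X1) is obtained by summing the joint probability over both possible values of the unobserved X2.

```python
# Hypothetical network: Y -> X1, (Y, X1) -> X2; all variables are binary.
P_Y = {0: 0.6, 1: 0.4}
# P_X1_given_Y[y][x1]
P_X1_given_Y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
# P_X2_given_YX1[(y, x1)][x2]
P_X2_given_YX1 = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.5, 1: 0.5},
    (1, 0): {0: 0.4, 1: 0.6},
    (1, 1): {0: 0.1, 1: 0.9},
}


def joint(y, x1, x2):
    """Joint probability given by the factorization of the network."""
    return P_Y[y] * P_X1_given_Y[y][x1] * P_X2_given_YX1[(y, x1)][x2]


def posterior(x1, x2=None):
    """P(Y | x1, x2); if x2 is None (missing), it is marginalized out."""
    x2_values = [0, 1] if x2 is None else [x2]
    scores = {y: sum(joint(y, x1, v) for v in x2_values) for y in (0, 1)}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}


with_x2 = posterior(1, 1)      # both attributes observed
without_x2 = posterior(1)      # X2 missing: summed over its two values
```

In this sketch the missing attribute simply drops out of the posterior, because the conditional probabilities of X2 sum to 1 for every configuration of its parents.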

4.6 Logistic Regression

The naïve Bayes and the Bayesian network classifiers described in the previous sections provide different ways of estimating the conditional probability of an instance $\mathbf{x}$ given class $y$, $P(\mathbf{x}|y)$. Such models are known as probabilistic generative models. Note that the conditional probability $P(\mathbf{x}|y)$ essentially describes the behavior of instances in the attribute space that are generated from class $y$. However, for the purpose of making predictions, we are finally interested in computing the posterior probability $P(y|\mathbf{x})$. For example, computing the following ratio of posterior probabilities is sufficient for inferring class labels in a binary classification problem:

$$\frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})}$$

This ratio is known as the odds. If this ratio is greater than 1, then $\mathbf{x}$ is classified as $y=1$. Otherwise, it is assigned to class $y=0$. Hence, one may simply learn a model of the odds based on the attribute values of training instances, without having to compute $P(\mathbf{x}|y)$ as an intermediate quantity in the Bayes theorem.

Classification models that directly assign class labels without computing class-conditional probabilities are called discriminative models. In this section, we present a probabilistic discriminative model known as logistic regression, which directly estimates the odds of a data instance $\mathbf{x}$ using its attribute values. The basic idea of logistic regression is to use a linear predictor, $z = \mathbf{w}^T\mathbf{x} + b$, for representing the odds of $\mathbf{x}$ as follows:

$$\frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})} = e^{z} = e^{\mathbf{w}^T\mathbf{x}+b}, \tag{4.37}$$

where $\mathbf{w}$ and $b$ are the parameters of the model and $\mathbf{a}^T$ denotes the transpose of a vector $\mathbf{a}$. Note that if $\mathbf{w}^T\mathbf{x}+b > 0$, then $\mathbf{x}$ belongs to class 1 since its odds are greater than 1. Otherwise, $\mathbf{x}$ belongs to class 0.

Figure 4.19. Plot of sigmoid (logistic) function, $\sigma(z)$.

Since $P(y=0|\mathbf{x}) + P(y=1|\mathbf{x}) = 1$, we can re-write Equation 4.37 as

$$\frac{P(y=1|\mathbf{x})}{1-P(y=1|\mathbf{x})} = e^{z}.$$

This can be further simplified to express $P(y=1|\mathbf{x})$ as a function of $z$:

$$P(y=1|\mathbf{x}) = \frac{1}{1+e^{-z}} = \sigma(z), \tag{4.38}$$

where the function $\sigma(\cdot)$ is known as the logistic or sigmoid function. Figure 4.19 shows the behavior of the sigmoid function as we vary $z$. We can see that $\sigma(z) \geq 0.5$ only when $z \geq 0$. We can also derive $P(y=0|\mathbf{x})$ using $\sigma(z)$ as follows:

$$P(y=0|\mathbf{x}) = 1-\sigma(z) = \frac{1}{1+e^{z}}. \tag{4.39}$$

Hence, if we have learned a suitable value of parameters $\mathbf{w}$ and $b$, we can use Equations 4.38 and 4.39 to estimate the posterior probabilities of any data instance $\mathbf{x}$ and determine its class label.

4.6.1 Logistic Regression as a Generalized Linear Model

Since the posterior probabilities are real-valued, their estimation using the previous equations can be viewed as solving a regression problem. In fact, logistic regression belongs to a broader family of statistical regression models, known as generalized linear models (GLM). In these models, the target variable $y$ is considered to be generated from a probability distribution $P(y|\mathbf{x})$, whose mean $\mu$ can be estimated using a link function $g(\cdot)$ as follows:

$$g(\mu) = z = \mathbf{w}^T\mathbf{x} + b. \tag{4.40}$$

For binary classification using logistic regression, $y$ follows a Bernoulli distribution ($y$ can either be 0 or 1) and $\mu$ is equal to $P(y=1|\mathbf{x})$. The link function $g(\cdot)$ of logistic regression, called the logit function, can thus be represented as

$$g(\mu) = \log\left(\frac{\mu}{1-\mu}\right).$$

Depending on the choice of link function $g(\cdot)$ and the form of probability distribution $P(y|\mathbf{x})$, GLMs are able to represent a broad family of regression models, such as linear regression and Poisson regression. They require different approaches for estimating their model parameters, $(\mathbf{w}, b)$. In this chapter, we will only discuss approaches for estimating the model parameters of logistic regression, although methods for estimating parameters of other types of GLMs are often similar (and sometimes even simpler). (See Bibliographic Notes for more details on GLMs.)

Note that even though logistic regression has relationships with regression

models, it is a classification model since the computed posterior probabilities

are eventually used to determine the class label of a data instance.
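The relationships among the linear predictor, the odds, the sigmoid function, and the logit link can be checked numerically. The following is a minimal sketch; the parameter values `w` and `b`, the instance `x`, and the function names are hypothetical and serve only as an illustration.

```python
import math


def sigmoid(z):
    """Logistic function of Equation 4.38."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(mu):
    """Logit link function: the inverse of the sigmoid."""
    return math.log(mu / (1.0 - mu))


def posterior(w, b, x):
    """Return (P(y=1|x), P(y=0|x)) for the linear predictor z = w^T x + b."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    p1 = sigmoid(z)
    return p1, 1.0 - p1


w, b = [2.0, -1.0], 0.5        # hypothetical model parameters
x = [1.0, 0.5]                 # hypothetical instance; here z = 2.0
p1, p0 = posterior(w, b, x)
odds = p1 / p0                 # equals e^z (Equation 4.37)
label = 1 if p1 >= 0.5 else 0  # equivalent to testing z >= 0
```

Applying `logit` to `p1` recovers the linear predictor `z`, which is the sense in which the logit is the link function of Equation 4.40.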

4.6.2 Learning Model Parameters

The parameters of logistic regression, $(\mathbf{w}, b)$, are estimated during training using a statistical approach known as the maximum likelihood estimation (MLE) method. This method involves computing the likelihood of observing the training data given $(\mathbf{w}, b)$, and then determining the model parameters $(\mathbf{w}^*, b^*)$ that yield maximum likelihood.

Let $D.\mathit{train} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$ denote a set of $n$ training instances, where $y_i$ is a binary variable (0 or 1). For a given training instance $\mathbf{x}_i$, we can compute its posterior probabilities using Equations 4.38 and 4.39. We can then express the likelihood of observing $y_i$ given $\mathbf{x}_i$, $\mathbf{w}$, and $b$ as

$$\begin{aligned} P(y_i|\mathbf{x}_i, \mathbf{w}, b) &= P(y=1|\mathbf{x}_i)^{y_i} \times P(y=0|\mathbf{x}_i)^{1-y_i} \\ &= (\sigma(z_i))^{y_i} \times (1-\sigma(z_i))^{1-y_i} \\ &= (\sigma(\mathbf{w}^T\mathbf{x}_i+b))^{y_i} \times (1-\sigma(\mathbf{w}^T\mathbf{x}_i+b))^{1-y_i}, \end{aligned} \tag{4.41}$$

where $\sigma(\cdot)$ is the sigmoid function as described above. Equation 4.41 basically means that the likelihood $P(y_i|\mathbf{x}_i, \mathbf{w}, b)$ is equal to $P(y=1|\mathbf{x}_i)$ when $y_i=1$, and equal to $P(y=0|\mathbf{x}_i)$ when $y_i=0$. The likelihood of all training instances, $L(\mathbf{w}, b)$, can then be computed by taking the product of individual likelihoods (assuming independence among training instances) as follows:

$$L(\mathbf{w}, b) = \prod_{i=1}^{n} P(y_i|\mathbf{x}_i, \mathbf{w}, b) = \prod_{i=1}^{n} P(y=1|\mathbf{x}_i)^{y_i} \times P(y=0|\mathbf{x}_i)^{1-y_i}. \tag{4.42}$$

The previous equation involves multiplying a large number of probability values, each of which is smaller than or equal to 1. Since this naïve computation can easily become numerically unstable when $n$ is large, a more practical approach is to consider the negative logarithm (to base $e$) of the likelihood function, also known as the cross entropy function:

$$\begin{aligned} -\log L(\mathbf{w}, b) &= -\sum_{i=1}^{n} \Big[ y_i \log\big(P(y=1|\mathbf{x}_i)\big) + (1-y_i)\log\big(P(y=0|\mathbf{x}_i)\big) \Big] \\ &= -\sum_{i=1}^{n} \Big[ y_i \log\big(\sigma(\mathbf{w}^T\mathbf{x}_i+b)\big) + (1-y_i)\log\big(1-\sigma(\mathbf{w}^T\mathbf{x}_i+b)\big) \Big]. \end{aligned}$$

The cross entropy is a loss function that measures how unlikely it is for the training data to be generated from the logistic regression model with parameters $(\mathbf{w}, b)$. Intuitively, we would like to find model parameters $(\mathbf{w}^*, b^*)$ that result in the lowest cross entropy, $-\log L(\mathbf{w}^*, b^*)$:

$$(\mathbf{w}^*, b^*) = \operatorname*{argmin}_{(\mathbf{w}, b)} E(\mathbf{w}, b) = \operatorname*{argmin}_{(\mathbf{w}, b)} -\log L(\mathbf{w}, b), \tag{4.43}$$

where $E(\mathbf{w}, b) = -\log L(\mathbf{w}, b)$ is the loss function. It is worth emphasizing that $E(\mathbf{w}, b)$ is a convex function, i.e., any minimum of $E(\mathbf{w}, b)$ will be a global minimum. Hence, we can use any of the standard convex optimization techniques mentioned in Appendix E to solve Equation 4.43. Here, we briefly describe the Newton-Raphson method that is commonly used for estimating the parameters of logistic regression. For ease of representation, we will use a single vector $\tilde{\mathbf{w}} = (\mathbf{w}^T\ b)^T$ to describe $(\mathbf{w}, b)$, which is of size one greater than $\mathbf{w}$. Similarly, we will consider the concatenated feature vector $\tilde{\mathbf{x}} = (\mathbf{x}^T\ 1)^T$, such that the linear predictor $z = \mathbf{w}^T\mathbf{x} + b$ can be succinctly

written as $z = \tilde{\mathbf{w}}^T\tilde{\mathbf{x}}$. Also, the concatenation of all training labels, $y_1$ to $y_n$, will be represented as $\mathbf{y}$, the set consisting of $\sigma(z_1)$ to $\sigma(z_n)$ will be represented as $\boldsymbol{\sigma}$, and the concatenation of $\tilde{\mathbf{x}}_1$ to $\tilde{\mathbf{x}}_n$ will be represented as $\tilde{\mathbf{X}}$ (with $\tilde{\mathbf{x}}_i^T$ as its $i$th row).

The Newton-Raphson method is an iterative method for finding $\tilde{\mathbf{w}}^*$ that uses the following equation to update the model parameters at every iteration:

$$\tilde{\mathbf{w}}^{(new)} = \tilde{\mathbf{w}}^{(old)} - \mathbf{H}^{-1}\nabla E(\tilde{\mathbf{w}}), \tag{4.44}$$

where $\nabla E(\tilde{\mathbf{w}})$ and $\mathbf{H}$ are the first- and second-order derivatives of the loss function $E(\tilde{\mathbf{w}})$ with respect to $\tilde{\mathbf{w}}$, respectively. The key intuition behind Equation 4.44 is to move the model parameters in the direction of maximum gradient, such that $\tilde{\mathbf{w}}$ takes larger steps when $\nabla E(\tilde{\mathbf{w}})$ is large. When $\tilde{\mathbf{w}}$ arrives at a minimum after some number of iterations, $\nabla E(\tilde{\mathbf{w}})$ becomes equal to 0 and thus results in convergence. Hence, we start with some initial values of $\tilde{\mathbf{w}}$ (either randomly assigned or set to 0) and use Equation 4.44 to iteratively update $\tilde{\mathbf{w}}$ until there are no significant changes in its value (beyond a certain threshold).

The first-order derivative of $E(\tilde{\mathbf{w}})$ is given by

$$\begin{aligned} \nabla E(\tilde{\mathbf{w}}) &= -\sum_{i=1}^{n} \Big[ y_i\tilde{\mathbf{x}}_i\big(1-\sigma(\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}_i)\big) - (1-y_i)\tilde{\mathbf{x}}_i\,\sigma(\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}_i) \Big] \\ &= \sum_{i=1}^{n} \big(\sigma(\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}_i) - y_i\big)\tilde{\mathbf{x}}_i \\ &= \tilde{\mathbf{X}}^T(\boldsymbol{\sigma} - \mathbf{y}), \end{aligned} \tag{4.45}$$

where we have used the fact that $d\sigma(z)/dz = \sigma(z)(1-\sigma(z))$. Using $\nabla E(\tilde{\mathbf{w}})$, we can compute the second-order derivative of $E(\tilde{\mathbf{w}})$ as

$$\mathbf{H} = \nabla\nabla E(\tilde{\mathbf{w}}) = \sum_{i=1}^{n} \sigma(\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}_i)\big(1-\sigma(\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}_i)\big)\tilde{\mathbf{x}}_i\tilde{\mathbf{x}}_i^T = \tilde{\mathbf{X}}^T\mathbf{R}\tilde{\mathbf{X}}, \tag{4.46}$$

where $\mathbf{R}$ is a diagonal matrix whose $i$th diagonal element is $R_{ii} = \sigma_i(1-\sigma_i)$. We can now use the first- and second-order derivatives of $E(\tilde{\mathbf{w}})$ in Equation 4.44 to obtain the following update equation at the $k$th iteration:

$$\tilde{\mathbf{w}}^{(k+1)} = \tilde{\mathbf{w}}^{(k)} - (\tilde{\mathbf{X}}^T\mathbf{R}_k\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T(\boldsymbol{\sigma}_k - \mathbf{y}), \tag{4.47}$$

where the subscript $k$ under $\mathbf{R}_k$ and $\boldsymbol{\sigma}_k$ refers to using $\tilde{\mathbf{w}}^{(k)}$ to compute both terms.

4.6.3 Characteristics of Logistic Regression

1. Logistic regression is a discriminative model for classification that directly computes the posterior probabilities without making any assumption about the class-conditional probabilities. Hence, it is quite generic and can be applied in diverse applications. It can also be easily extended to multiclass classification, where it is known as multinomial logistic regression. However, its expressive power is limited to learning only linear decision boundaries.

2. Because there are different weights (parameters) for every attribute, the learned parameters of logistic regression can be analyzed to understand the relationships between attributes and class labels.

3. Because logistic regression does not involve computing densities and distances in the attribute space, it can work more robustly even in high-dimensional settings than distance-based methods such as nearest neighbor classifiers. However, the objective function of logistic regression does not involve any term relating to the complexity of the model. Hence, logistic regression does not provide a way to make a trade-off between model complexity and training performance, as compared to other classification models such as support vector machines. Nevertheless, variants of logistic regression can easily be developed to account for model complexity, by including appropriate terms in the objective function along with the cross entropy function.

4. Logistic regression can handle irrelevant attributes by learning weight

parameters close to 0 for attributes that do not provide any gain in

performance during training. It can also handle interacting attributes

since the learning of model parameters is achieved in a joint fashion by

considering the effects of all attributes together. Furthermore, if there

are redundant attributes that are duplicates of each other, then logistic

regression can learn equal weights for every redundant attribute,

without degrading classification performance. However, the presence of

a large number of irrelevant or redundant attributes in high-dimensional

settings can make logistic regression susceptible to model overfitting.

5. Logistic regression cannot handle data instances with missing values,

since the posterior probabilities are only computed by taking a

weighted sum of all the attributes. If there are missing values in a

training instance, it can be discarded from the training set. However, if

there are missing values in a test instance, then logistic regression

would fail to predict its class label.
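The Newton-Raphson update of Section 4.6.2 can be sketched in plain Python for a small problem. This is an illustrative sketch under stated assumptions, not a production implementation: the toy data set, the function names, and the small Gaussian-elimination solver (used in place of an explicit matrix inverse) are all hypothetical choices. The classes overlap, so a finite maximum likelihood solution exists.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predictor(w, x):
    """Linear predictor z = w~^T x~ (x already carries a trailing 1)."""
    return sum(wj * xj for wj, xj in zip(w, x))


def cross_entropy(X, y, w):
    """Negative log-likelihood E(w~) minimized in Equation 4.43."""
    total = 0.0
    for xi, yi in zip(X, y):
        s = sigmoid(predictor(w, xi))
        total -= yi * math.log(s) + (1 - yi) * math.log(1.0 - s)
    return total


def gradient(X, y, w):
    """First-order derivative X~^T (sigma - y), as in Equation 4.45."""
    sig = [sigmoid(predictor(w, xi)) for xi in X]
    return [sum((s - yi) * xi[j] for s, yi, xi in zip(sig, y, X))
            for j in range(len(w))]


def hessian(X, w):
    """Second-order derivative X~^T R X~, as in Equation 4.46."""
    r = [s * (1.0 - s) for s in (sigmoid(predictor(w, xi)) for xi in X)]
    d = len(w)
    return [[sum(ri * xi[a] * xi[c] for ri, xi in zip(r, X))
             for c in range(d)] for a in range(d)]


def solve(A, b):
    """Solve A v = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    v = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * v[k] for k in range(r + 1, n))
        v[r] = (M[r][n] - s) / M[r][r]
    return v


def newton_raphson(X, y, max_iter=25, tol=1e-10):
    """Iterate the update of Equation 4.47, starting from w~ = 0."""
    w = [0.0] * len(X[0])
    for _ in range(max_iter):
        step = solve(hessian(X, w), gradient(X, y, w))  # H^{-1} grad E
        w = [wj - sj for wj, sj in zip(w, step)]
        if max(abs(sj) for sj in step) < tol:
            break
    return w


# Hypothetical one-attribute data set with overlapping classes.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]
X = [[x, 1.0] for x in xs]          # append 1 to form x~ = (x, 1)
w_hat = newton_raphson(X, y)        # w_hat = [weight, bias]
```

Solving the linear system at each iteration avoids forming the inverse Hessian explicitly, which is a common design choice; at convergence the gradient is numerically zero and the cross entropy is lower than at the all-zero starting point.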

4.7 Artificial Neural Network (ANN)

Artificial neural networks (ANN) are powerful classification models that are

able to learn highly complex and nonlinear decision boundaries purely from

the data. They have gained widespread acceptance in several applications

such as vision, speech, and language processing, where they have been

repeatedly shown to outperform other classification models (and in some

cases even human performance). Historically, the study of artificial neural

networks was inspired by attempts to emulate biological neural systems. The

human brain consists primarily of nerve cells called neurons, linked together

with other neurons via strands of fiber called axons. Whenever a neuron is

stimulated (e.g., in response to a stimulus), it transmits nerve activations via

axons to other neurons. The receptor neurons collect these nerve activations

using structures called dendrites, which are extensions from the cell body of

the neuron. The strength of the contact point between a dendrite and an axon,

known as a synapse, determines the connectivity between neurons.

Neuroscientists have discovered that the human brain learns by changing the

strength of the synaptic connection between neurons upon repeated

stimulation by the same impulse.

The human brain consists of approximately 100 billion neurons that are inter-

connected in complex ways, making it possible for us to learn new tasks and

perform regular activities. Note that a single neuron only performs a simple

modular function, which is to respond to the nerve activations coming from

sender neurons connected at its dendrite, and transmit its activation to

receptor neurons via axons. However, it is the composition of these simple

functions that together is able to express complex functions. This idea is at the

basis of constructing artificial neural networks.

Analogous to the structure of a human brain, an artificial neural network is

composed of a number of processing units, called nodes, that are connected

with each other via directed links. The nodes correspond to neurons that

perform the basic units of computation, while the directed links correspond to

connections between neurons, consisting of axons and dendrites. Further, the

weight of a directed link between two neurons represents the strength of the

synaptic connection between neurons. As in biological neural systems, the

primary objective of ANN is to adapt the weights of the links until they fit the

input-output relationships of the underlying data.

The basic motivation behind using an ANN model is to extract useful features

from the original attributes that are most relevant for classification.

Traditionally, feature extraction has been achieved by using dimensionality

reduction techniques such as PCA (introduced in Chapter 2), which show

limited success in extracting nonlinear features, or by using hand-crafted

features provided by domain experts. By using a complex combination of

inter-connected nodes, ANN models are able to extract much richer sets of

features, resulting in good classification performance. Moreover, ANN models

provide a natural way of representing features at multiple levels of abstraction,

where complex features are seen as compositions of simpler features. In

many classification problems, modeling such a hierarchy of features turns out

to be very useful. For example, in order to detect a human face in an image,

we can first identify low-level features such as sharp edges with different

gradients and orientations. These features can then be combined to identify

facial parts such as eyes, nose, ears, and lips. Finally, an appropriate

arrangement of facial parts can be used to correctly identify a human face.

ANN models provide a powerful architecture to represent a hierarchical

abstraction of features, from lower levels of abstraction (e.g., edges) to higher

levels (e.g., facial parts).

Artificial neural networks have had a long history of developments spanning

over five decades of research. Although classical models of ANN suffered

from several challenges that hindered progress for a long time, they have re-

emerged with widespread popularity because of a number of recent

developments in the last decade, collectively known as deep learning. In this

section, we examine classical approaches for learning ANN models, starting

from the simplest model called perceptrons to more complex architectures

called multi-layer neural networks. In the next section, we discuss some of

the recent advancements in the area of ANN that have made it possible to

effectively learn modern ANN models with deep architectures.

4.7.1 Perceptron

A perceptron is a basic type of ANN model that involves two types of nodes:

input nodes, which are used to represent the input attributes, and an output

node, which is used to represent the model output. Figure 4.20 illustrates

the basic architecture of a perceptron that takes three input attributes, $x_1$, $x_2$,

and $x_3$, and produces a binary output $y$. The input node corresponding to an

attribute $x_i$ is connected via a weighted link $w_i$ to the output node. The

weighted link is used to emulate the strength of a synaptic connection

between neurons.

Figure 4.20.

Basic architecture of a perceptron.

The output node is a mathematical device that computes a weighted sum of its inputs, adds a bias factor $b$ to the sum, and then examines the sign of the result to produce the output $\hat{y}$ as follows:

$$\hat{y} = \begin{cases} 1, & \text{if } \mathbf{w}^T\mathbf{x} + b > 0. \\ -1, & \text{otherwise.} \end{cases} \tag{4.48}$$

To simplify notation, $\mathbf{w}$ and $b$ can be concatenated to form $\tilde{\mathbf{w}} = (\mathbf{w}^T\ b)^T$, while $\mathbf{x}$ can be appended with 1 at the end to form $\tilde{\mathbf{x}} = (\mathbf{x}^T\ 1)^T$. The output of the perceptron $\hat{y}$ can then be written:

$$\hat{y} = \operatorname{sign}(\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}),$$

where the sign function acts as an activation function by providing an output value of $+1$ if the argument is positive and $-1$ if its argument is negative.

Learning the Perceptron

Given a training set, we are interested in learning parameters $\tilde{\mathbf{w}}$ such that $\hat{y}$ closely resembles the true $y$ of training instances. This is achieved by using the perceptron learning algorithm given in Algorithm 4.3. The key computation for this algorithm is the iterative weight update formula given in Step 8 of the algorithm:

$$w_j^{(k+1)} = w_j^{(k)} + \lambda\,(y_i - \hat{y}_i^{(k)})\,x_{ij}, \tag{4.49}$$

where $w_j^{(k)}$ is the weight parameter associated with the $j$th input link after the $k$th iteration, $\lambda$ is a parameter known as the learning rate, and $x_{ij}$ is the value of the $j$th attribute of the training example $\mathbf{x}_i$. The justification for Equation 4.49 is rather intuitive. Note that $(y_i - \hat{y}_i)$ captures the discrepancy between $y_i$ and $\hat{y}_i$, such that its value is 0 only when the true label and the predicted output match. Assume $x_{ij}$ is positive. If $\hat{y} = -1$ and $y = +1$, then $w_j$ is increased at the next iteration so that $\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}_i$ can become positive. On the other hand, if $\hat{y} = +1$ and $y = -1$, then $w_j$ is decreased so that $\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}_i$ can become negative. Hence, the weights are modified at every iteration to reduce the discrepancies between $\hat{y}$ and $y$ across all training instances. The learning rate $\lambda$, a parameter whose value is between 0 and 1, can be used to control the amount of adjustment made in each iteration. The algorithm halts when the average number of discrepancies is smaller than a threshold $\gamma$.

Algorithm 4.3 Perceptron learning algorithm.

The perceptron is a simple classification model that is designed to learn linear decision boundaries in the attribute space. Figure 4.21 shows the decision

boundary obtained by applying the perceptron learning algorithm to the data

set provided on the left of the figure. However, note that there can be multiple

decision boundaries that can separate the two classes, and the perceptron

arbitrarily learns one of these boundaries depending on the random initial

values of parameters. (The selection of the optimal decision boundary is a

problem that will be revisited in the context of support vector machines in

Section 4.9 .) Further, the perceptron learning algorithm is only guaranteed

to converge when the classes are linearly separable. However, if the classes

are not linearly separable, the algorithm fails to converge. Figure 4.22

shows an example of a nonlinearly separable data set given by the XOR function.

The perceptron cannot find the right solution for this data because there is no

linear decision boundary that can perfectly separate the training instances.

Thus, the stopping condition at line 12 of Algorithm 4.3 would never be

met and hence, the perceptron learning algorithm would fail to converge. This

is a major limitation of perceptrons since real-world classification problems

often involve nonlinearly separable classes.
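The perceptron update rule of Equation 4.49 and its behavior on separable and non-separable data can be sketched as follows. This is an illustrative sketch rather than a transcription of Algorithm 4.3: the function names, the zero initialization, the stopping rule (an epoch with no misclassification), and the epoch limit are assumptions made here. Labels follow Equation 4.48 and take the values +1 and -1.

```python
def predict(w, x):
    """sign(w~^T x~) with x~ = (x, 1); returns +1 or -1 as in Equation 4.48."""
    z = sum(wj * xj for wj, xj in zip(w, x + [1.0]))
    return 1 if z > 0 else -1


def train_perceptron(X, y, lam=0.5, max_epochs=100):
    """Return (weights, converged). The last weight plays the role of the bias b."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = predict(w, xi)
            if y_hat != yi:
                errors += 1
                # Equation 4.49: w_j <- w_j + lambda * (y_i - y_hat_i) * x_ij
                w = [wj + lam * (yi - y_hat) * xj
                     for wj, xj in zip(w, xi + [1.0])]
        if errors == 0:          # every training instance classified correctly
            return w, True
    return w, False


inputs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
and_labels = [-1, -1, -1, 1]     # linearly separable
xor_labels = [-1, 1, 1, -1]      # not linearly separable

w_and, and_converged = train_perceptron(inputs, and_labels)
w_xor, xor_converged = train_perceptron(inputs, xor_labels)
```

On the AND labels the loop reaches an error-free epoch, whereas on the XOR labels it exhausts `max_epochs` without one, which mirrors the non-convergence described above.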

Figure 4.21.

Perceptron decision boundary for the data given on the left (+ represents a

positively labeled instance while o represents a negatively labeled instance).

Figure 4.22.

XOR classification problem. No linear hyperplane can separate the two

classes.

4.7.2 Multi-layer Neural Network

A multi-layer neural network generalizes the basic concept of a perceptron to

more complex architectures of nodes that are capable of learning nonlinear

decision boundaries. A generic architecture of a multi-layer neural network is

shown in Figure 4.23 where the nodes are arranged in groups called

layers. These layers are commonly organized in the form of a chain such that

every layer operates on the outputs of its preceding layer. In this way, the

layers represent different levels of abstraction that are applied on the input

features in a sequential manner. The composition of these abstractions

generates the final output at the last layer, which is used for making

predictions. In the following, we briefly describe the three types of layers used

in multi-layer neural networks.

Figure 4.23.

Example of a multi-layer artificial neural network (ANN).

The first layer of the network, called the input layer, is used for representing

inputs from attributes. Every numerical or binary attribute is typically

represented using a single node on this layer, while a categorical attribute is

either represented using a different node for each categorical value, or by

encoding the k-ary attribute using $\lceil \log_2 k \rceil$ input nodes. These inputs are fed into intermediary layers known as hidden layers, which are made up of processing units known as hidden nodes. Every hidden node operates on signals received from the input nodes or hidden nodes at the preceding layer, and produces an activation value that is transmitted to the next layer. The final layer is called the output layer and processes the activation values from its preceding layer to produce predictions of output variables. For binary classification, the output layer contains a single node representing the binary class label. In this architecture, since the signals are propagated only in the forward direction from the input layer to the output layer, they are also called feedforward neural networks.
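The two ways of representing a categorical attribute at the input layer can be sketched as follows. The function names and the example category list are hypothetical; the number of bits is computed with integer arithmetic, which equals ⌈log2 k⌉ for k ≥ 2.

```python
def one_hot(value, categories):
    """One input node per categorical value."""
    return [1 if c == value else 0 for c in categories]


def binary_code(value, categories):
    """Encode a k-ary attribute using ceil(log2 k) input nodes."""
    k = len(categories)
    n_bits = max(1, (k - 1).bit_length())   # ceil(log2 k) for k >= 2
    idx = categories.index(value)
    # Most significant bit first.
    return [(idx >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]


colors = ["red", "green", "blue", "black", "white"]   # k = 5
a = one_hot("blue", colors)        # 5 input nodes
b = binary_code("blue", colors)    # 3 input nodes
```

The one-hot form uses more input nodes but does not impose an arbitrary ordering on the categories, which is one reason it is often preferred in practice.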

A major difference between multi-layer neural networks and perceptrons is the

inclusion of hidden layers, which dramatically improves their ability to

represent arbitrarily complex decision boundaries. For example, consider the

XOR problem described in the previous section. The instances can be

classified using two hyperplanes that partition the input space into their

respective classes, as shown in Figure 4.24(a) . Because a perceptron can

create only one hyperplane, it cannot find the optimal solution. However, this

problem can be addressed by using a hidden layer consisting of two nodes,

as shown in Figure 4.24(b) . Intuitively, we can think of each hidden node

as a perceptron that tries to construct one of the two hyperplanes, while the

output node simply combines the results of the perceptrons to yield the

decision boundary shown in Figure 4.24(a) .

Figure 4.24.

A two-layer neural network for the XOR problem.
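The idea that two hidden nodes, each acting like a perceptron, suffice for XOR can be checked with a small hand-built network. The weights and thresholds below are hypothetical values picked for this sketch and are not taken from Figure 4.24; a 0/1 step activation is used in place of the sign function for readability.

```python
def step(z):
    """Threshold activation: 1 if the linear predictor is positive, else 0."""
    return 1 if z > 0 else 0


def xor_network(x1, x2):
    # Hidden node 1: one hyperplane, fires if at least one input is 1.
    h1 = step(1.0 * x1 + 1.0 * x2 - 0.5)
    # Hidden node 2: a second hyperplane, fires only if both inputs are 1.
    h2 = step(1.0 * x1 + 1.0 * x2 - 1.5)
    # Output node: fires when the instance lies between the two hyperplanes.
    return step(1.0 * h1 - 1.0 * h2 - 0.5)


truth_table = [xor_network(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Each hidden node contributes one of the two hyperplanes, and the output node combines them, so the network reproduces the XOR labels that a single perceptron cannot.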

The hidden nodes can be viewed as learning latent representations or

features that are useful for distinguishing between the classes. While the first

hidden layer directly operates on the input attributes and thus captures

simpler features, the subsequent hidden layers are able to combine them and

construct more complex features. From this perspective, multi-layer neural

networks learn a hierarchy of features at different levels of abstraction that are

finally combined at the output nodes to make predictions. Further, there are

combinatorially many ways we can combine the features learned at the

hidden layers of ANN, making them highly expressive. This property chiefly

distinguishes ANN from other classification models such as decision trees,

which can learn partitions in the attribute space but are unable to combine

them in exponential ways.

Figure 4.25.

Schematic illustration of the parameters of an ANN model with $(L-1)$ hidden layers.

To understand the nature of computations happening at the hidden and output nodes of ANN, consider the $i$th node at the $l$th layer of the network $(l>0)$, where the layers are numbered from 0 (input layer) to $L$ (output layer), as shown in Figure 4.25. The activation value generated at this node, $a_i^l$, can be represented as a function of the inputs received from nodes at the preceding layer. Let $w_{ij}^l$ represent the weight of the connection from the $j$th node at layer $(l-1)$ to the $i$th node at layer $l$. Similarly, let us denote the bias term at this node as $b_i^l$. The activation value $a_i^l$ can then be expressed as

$$a_i^l = f(z_i^l) = f\Big(\sum_j w_{ij}^l a_j^{l-1} + b_i^l\Big),$$

where $z$ is called the linear predictor and $f(\cdot)$ is the activation function that converts $z$ to $a$. Further, note that, by definition, $a_j^0 = x_j$ at the input layer and $a^L = \hat{y}$ at the output node.

There are a number of alternate activation functions apart from the sign function that can be used in multi-layer neural networks. Some examples include linear, sigmoid (logistic), and hyperbolic tangent functions, as shown in Figure 4.26. These functions are able to produce real-valued and nonlinear activation values. Among these activation functions, the sigmoid $\sigma(\cdot)$ has been widely used in many ANN models, although the use of other types of activation functions in the context of deep learning will be discussed in Section 4.8. We can thus represent $a_i^l$ as

$$a_i^l = \sigma(z_i^l) = \frac{1}{1+e^{-z_i^l}}. \tag{4.50}$$

Figure 4.26.

Types of activation functions used in multi-layer neural networks.

Learning Model Parameters

The weights and bias terms $(\mathbf{w}, \mathbf{b})$ of the ANN model are learned during training so that the predictions on training instances match the true labels. This is achieved by using a loss function

$$E(\mathbf{w}, \mathbf{b}) = \sum_{k=1}^{n} \text{Loss}(y_k, \hat{y}_k), \tag{4.51}$$

where $y_k$ is the true label of the $k$th training instance and $\hat{y}_k$ is equal to $a^L$, produced by using $\mathbf{x}_k$. A typical choice of the loss function is the squared loss function:

$$\text{Loss}(y_k, \hat{y}_k) = (y_k - \hat{y}_k)^2. \tag{4.52}$$

Note that $E(\mathbf{w}, \mathbf{b})$ is a function of the model parameters $(\mathbf{w}, \mathbf{b})$ because the output activation value $a^L$ depends on the weights and bias terms. We are interested in choosing $(\mathbf{w}, \mathbf{b})$ that minimizes the training loss $E(\mathbf{w}, \mathbf{b})$. Unfortunately, because of the use of hidden nodes with nonlinear activation functions, $E(\mathbf{w}, \mathbf{b})$ is not a convex function of $\mathbf{w}$ and $\mathbf{b}$, which means that $E(\mathbf{w}, \mathbf{b})$ can have local minima that are not globally optimal. However, we can still apply standard optimization techniques such as the gradient descent method to arrive at a locally optimal solution. In particular, the weight parameter $w_{ij}^l$ and the bias term $b_i^l$ can be iteratively updated using the following equations:

$$w_{ij}^l \leftarrow w_{ij}^l - \lambda\frac{\partial E}{\partial w_{ij}^l}, \tag{4.53}$$

$$b_i^l \leftarrow b_i^l - \lambda\frac{\partial E}{\partial b_i^l}, \tag{4.54}$$

where $\lambda$ is a hyper-parameter known as the learning rate. The intuition behind these equations is to move the weights in a direction that reduces the training loss. If we arrive at a minimum using this procedure, the gradient of the training loss will be close to 0, eliminating the second term and resulting in the convergence of weights. The weights are commonly initialized with values drawn randomly from a Gaussian or a uniform distribution.

A necessary tool for updating weights in Equation 4.53 is to compute the partial derivative of $E$ with respect to $w_{ij}^l$. This computation is nontrivial especially at hidden layers $(l<L)$, since $w_{ij}^l$ does not directly affect $\hat{y} = a^L$ (and

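hence the training loss), but has complex chains of influences via activation values at subsequent layers. To address this problem, a technique known as backpropagation was developed, which propagates the derivatives backward from the output layer to the hidden layers. This technique can be described as follows.

Recall that the training loss $E$ is simply the sum of individual losses at training instances. Hence the partial derivative of $E$ can be decomposed as a sum of partial derivatives of individual losses:

$$\frac{\partial E}{\partial w_{ij}^l} = \sum_{k=1}^{n} \frac{\partial\, \text{Loss}(y_k, \hat{y}_k)}{\partial w_{ij}^l}.$$

To simplify discussions, we will consider only the derivatives of the loss at the $k$th training instance, which will be generically represented as $\text{Loss}(y, a^L)$. By using the chain rule of differentiation, we can represent the partial derivatives of the loss with respect to $w_{ij}^l$ as

$$\frac{\partial\, \text{Loss}}{\partial w_{ij}^l} = \frac{\partial\, \text{Loss}}{\partial a_i^l} \times \frac{\partial a_i^l}{\partial z_i^l} \times \frac{\partial z_i^l}{\partial w_{ij}^l}. \tag{4.55}$$

The last term of the previous equation can be written as

$$\frac{\partial z_i^l}{\partial w_{ij}^l} = \frac{\partial\big(\sum_j w_{ij}^l a_j^{l-1} + b_i^l\big)}{\partial w_{ij}^l} = a_j^{l-1}.$$

Also, if we use the sigmoid activation function, then

$$\frac{\partial a_i^l}{\partial z_i^l} = \frac{\partial\, \sigma(z_i^l)}{\partial z_i^l} = a_i^l(1 - a_i^l).$$

Equation 4.55 can thus be simplified as

$$\frac{\partial\, \text{Loss}}{\partial w_{ij}^l} = \delta_i^l \times a_i^l(1 - a_i^l) \times a_j^{l-1}, \quad \text{where } \delta_i^l = \frac{\partial\, \text{Loss}}{\partial a_i^l}. \tag{4.56}$$

A similar formula for the partial derivatives with respect to the bias terms $b_i^l$ is given by

$$\frac{\partial\, \text{Loss}}{\partial b_i^l} = \delta_i^l \times a_i^l(1 - a_i^l). \tag{4.57}$$

Hence, to compute the partial derivatives, we only need to determine $\delta_i^l$. Using a squared loss function, we can easily write $\delta^L$ at the output node as

$$\delta^L = \frac{\partial\, \text{Loss}}{\partial a^L} = \frac{\partial\, (y - a^L)^2}{\partial a^L} = 2(a^L - y). \tag{4.58}$$

However, the approach for computing $\delta_j^l$ at hidden nodes $(l<L)$ is more involved. Notice that $a_j^l$ affects the activation values $a_i^{l+1}$ of all nodes at the next layer, which in turn influence the loss. Hence, again using the chain rule of differentiation, $\delta_j^l$ can be represented as

$$\begin{aligned} \delta_j^l = \frac{\partial\, \text{Loss}}{\partial a_j^l} &= \sum_i \left( \frac{\partial\, \text{Loss}}{\partial a_i^{l+1}} \times \frac{\partial a_i^{l+1}}{\partial a_j^l} \right) \\ &= \sum_i \left( \frac{\partial\, \text{Loss}}{\partial a_i^{l+1}} \times \frac{\partial a_i^{l+1}}{\partial z_i^{l+1}} \times \frac{\partial z_i^{l+1}}{\partial a_j^l} \right) \\ &= \sum_i \left( \delta_i^{l+1} \times a_i^{l+1}(1 - a_i^{l+1}) \times w_{ij}^{l+1} \right). \end{aligned} \tag{4.59}$$

The previous equation provides a concise representation of the $\delta_j^l$ values at layer $l$ in terms of the $\delta_i^{l+1}$ values computed at layer $l+1$. Hence, proceeding backward from the output layer $L$ to the hidden layers, we can recursively apply Equation 4.59 to compute $\delta_i^l$ at every hidden node. $\delta_i^l$ can then be used in Equations 4.56 and 4.57 to compute the partial derivatives of the loss with respect to $w_{ij}^l$ and $b_i^l$, respectively. Algorithm 4.4 summarizes the complete approach for learning the model parameters of ANN using backpropagation and the gradient descent method.

Algorithm 4.4 Learning ANN using backpropagation and gradient descent.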

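The backpropagation recursion of Equations 4.56 to 4.59 can be sketched for a single training instance as follows. This is a minimal illustration, not a reproduction of Algorithm 4.4: the network shape, the weight values, and all function names are hypothetical, sigmoid activations and the squared loss are assumed throughout, and the list index `l` in the code is zero-based (`W[l]` holds the weights into layer `l + 1` of the text). A finite-difference estimate is included only to check the analytic derivatives.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W, B):
    """Return the activations of every layer; W[l][i][j] links node j to node i."""
    acts = [list(x)]
    for Wl, Bl in zip(W, B):
        prev = acts[-1]
        acts.append([sigmoid(sum(w * a for w, a in zip(row, prev)) + b)
                     for row, b in zip(Wl, Bl)])
    return acts


def loss(x, y, W, B):
    """Squared loss (y - a^L)^2 for a single output node."""
    return (y - forward(x, W, B)[-1][0]) ** 2


def backprop(x, y, W, B):
    """Partial derivatives of the loss with respect to all weights and biases."""
    acts = forward(x, W, B)
    delta = [2.0 * (acts[-1][0] - y)]                      # Equation 4.58
    gW, gB = [None] * len(W), [None] * len(B)
    for l in range(len(W) - 1, -1, -1):
        a_out, a_in = acts[l + 1], acts[l]
        # delta_i * a_i * (1 - a_i): factor shared by Equations 4.56, 4.57, 4.59
        s = [d * a * (1.0 - a) for d, a in zip(delta, a_out)]
        gW[l] = [[si * aj for aj in a_in] for si in s]     # Equation 4.56
        gB[l] = s[:]                                       # Equation 4.57
        # Equation 4.59: delta values for the preceding layer
        delta = [sum(s[i] * W[l][i][j] for i in range(len(s)))
                 for j in range(len(a_in))]
    return gW, gB


def numeric_grad(x, y, W, B, l, i, j, eps=1e-5):
    """Central finite difference for dLoss/dW[l][i][j], used only as a check."""
    old = W[l][i][j]
    W[l][i][j] = old + eps
    up = loss(x, y, W, B)
    W[l][i][j] = old - eps
    down = loss(x, y, W, B)
    W[l][i][j] = old
    return (up - down) / (2.0 * eps)


# Hypothetical network: 2 inputs, 2 hidden nodes, 1 output node.
W = [[[0.5, -0.4], [0.3, 0.8]], [[1.0, -1.2]]]
B = [[0.1, -0.2], [0.05]]
x, y = [1.0, 0.0], 1.0
gW, gB = backprop(x, y, W, B)
```

A gradient descent step (Equations 4.53 and 4.54) would then subtract the learning rate times each entry of `gW` and `gB` from the corresponding parameter, summing over training instances as in Equation 4.51.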

4.7.3 Characteristics of ANN

1. Multi-layer neural networks with at least one hidden layer are universal

approximators; i.e., they can be used to approximate any target

function. They are thus highly expressive and can be used to learn

complex decision boundaries in diverse applications. ANN can also be

used for multiclass classification and regression problems, by

appropriately modifying the output layer. However, the high model

complexity of classical ANN models makes them susceptible to overfitting,

which can be overcome to some extent by using deep learning

techniques discussed in Section 4.8.3 .

2. ANN provides a natural way to represent a hierarchy of features at

multiple levels of abstraction. The outputs at the final hidden layer of

the ANN model thus represent features at the highest level of

abstraction that are most useful for classification. These features can

also be used as inputs in other supervised classification models, e.g.,

by replacing the output node of the ANN by any generic classifier.

3. ANN represents complex high-level features as compositions of simpler

lower-level features that are easier to learn. This provides ANN the

ability to gradually increase the complexity of representations, by

adding more hidden layers to the architecture. Further, since simpler

features can be combined in combinatorial ways, the number of

complex features learned by ANN is much larger than traditional

classification models. This is one of the main reasons behind the high

expressive power of deep neural networks.

4. ANN can easily handle irrelevant attributes, by using zero weights for

attributes that do not help in improving the training loss. Also,

redundant attributes receive similar weights and do not degrade the

quality of the classifier. However, if the number of irrelevant or

redundant attributes is large, the learning of the ANN model may suffer

from overfitting, leading to poor generalization performance.

5. Since the learning of ANN model involves minimizing a non-convex

function, the solutions obtained by gradient descent are not guaranteed

to be globally optimal. For this reason, ANN has a tendency to get

stuck in local minima, a challenge that can be addressed by using deep

learning techniques discussed in Section 4.8.4.

6. Training an ANN is a time-consuming process, especially when the

number of hidden nodes is large. Nevertheless, test examples can be

classified rapidly.

7. Just like logistic regression, ANN can learn in the presence of

interacting variables, since the model parameters are jointly learned

over all variables together. However, ANN cannot handle instances

with missing values in the training or testing phase.

4.8 Deep Learning

As described above, the use of hidden layers in ANN is based on the general

belief that complex high-level features can be constructed by combining

simpler lower-level features. Typically, the greater the number of hidden

layers, the deeper the hierarchy of features learned by the network. This

motivates the learning of ANN models with long chains of hidden layers,

known as deep neural networks. In contrast to “shallow” neural networks

that involve only a small number of hidden layers, deep neural networks are

able to represent features at multiple levels of abstraction and often require far

fewer nodes per layer to achieve generalization performance similar to

shallow networks.

Despite the huge potential in learning deep neural networks, it has remained

challenging to learn ANN models with a large number of hidden layers using

classical approaches. Apart from reasons related to limited computational

resources and hardware architectures, there have been a number of

algorithmic challenges in learning deep neural networks. First, learning a deep

neural network with low training error has been a daunting task because of the

saturation of sigmoid activation functions, resulting in slow convergence of

gradient descent. This problem becomes even more serious as we move

away from the output node to the hidden layers, because of the compounded

effects of saturation at multiple layers, known as the vanishing gradient

problem. For this reason, classical ANN models have suffered from

slow and ineffective learning, leading to poor training and test performance.

Second, the learning of deep neural networks is quite sensitive to the initial

values of model parameters, chiefly because of the non-convex nature of the

optimization function and the slow convergence of gradient descent. Third,

deep neural networks with a large number of hidden layers have high model

complexity, making them susceptible to overfitting. Hence, even if a deep

neural network has been trained to show low training error, it can still suffer

from poor generalization performance.

These challenges have deterred progress in building deep neural networks for

several decades and it is only recently that we have started to unlock their

immense potential with the help of a number of advances being made in the

area of deep learning. Although some of these advances have been around

for some time, they have only gained mainstream attention in the last decade,

with deep neural networks continually beating records in various competitions

and solving problems that were too difficult for other classification approaches.

There are two factors that have played a major role in the emergence of deep learning techniques. First, the availability of larger labeled data sets (e.g., the ImageNet data set contains more than 10 million labeled images) has made it possible to learn more complex ANN models than ever before, without falling easily into the traps of model overfitting. Second, advances in computational abilities and hardware infrastructures, such as the use of graphics processing units (GPUs) for parallel computing, have greatly helped in experimenting with deep neural networks with larger architectures that would not have been feasible with traditional resources.

In addition to the previous two factors, there have been a number of

algorithmic advancements to overcome the challenges faced by classical

methods in learning deep neural networks. Some examples include the use of

more responsive combinations of loss functions and activation functions,

better initialization of model parameters, novel regularization techniques, more

agile architecture designs, and better techniques for model learning and

hyper-parameter selection. In the following, we describe some of the deep

learning advances made to address the challenges in learning deep neural

networks. Further details on recent developments in deep learning can be

obtained from the Bibliographic Notes.

4.8.1 Using Synergistic Loss Functions

One of the major realizations leading to deep learning has been the

importance of choosing appropriate combinations of activation and loss

functions. Classical ANN models commonly made use of the sigmoid

activation function at the output layer, because of its ability to produce real-

valued outputs between 0 and 1, which was combined with a squared loss

objective to perform gradient descent. It was soon noticed that this particular

combination of activation and loss function resulted in the saturation of output

activation values, which can be described as follows.

Saturation of Outputs

Although the sigmoid has been widely used as an activation function, it easily saturates at high and low values of inputs that are far away from 0. Observe from Figure 4.27(a) that $\sigma(z)$ varies appreciably only when $z$ is close to 0. For this reason, $\partial\sigma(z)/\partial z$ is noticeably different from zero for only a small range of $z$ around 0, as shown in Figure 4.27(b). Since $\partial\sigma(z)/\partial z$ is one of the components in the gradient of loss (see Equation 4.55), we get a diminishing gradient value when the inputs $z$ are far from 0.

Figure 4.27.

Plots of sigmoid function and its derivative.

To illustrate the effect of saturation on the learning of model parameters at the output node, consider the partial derivative of loss with respect to the weight $w_j^L$ at the output node. Using the squared loss function, we can write this as

$$\frac{\partial\,\text{Loss}}{\partial w_j^L} = 2(a^L-y) \times \sigma(z^L)(1-\sigma(z^L)) \times a_j^{L-1}. \tag{4.60}$$

In the previous equation, notice that when $z^L$ is highly negative, $\sigma(z^L)$ (and hence the gradient) is close to 0. On the other hand, when $z^L$ is highly positive, $(1-\sigma(z^L))$ becomes close to 0, nullifying the value of the gradient. Hence, irrespective of whether the prediction $a^L$ matches the true label $y$ or not, the gradient of the loss with respect to the weights is close to 0 whenever $z^L$ is highly positive or negative. This causes an unnecessarily slow convergence of the model parameters of the ANN model, often resulting in poor learning.

Note that it is the combination of the squared loss function and the sigmoid activation function at the output node that together results in diminishing gradients (and thus poor learning) upon saturation of outputs. It is thus important to choose a synergistic combination of loss function and activation function that does not suffer from the saturation of outputs.
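To make the saturation effect concrete, the following sketch evaluates the gradient of Equation 4.60 for a single weight at several values of $z^L$. The function name `squared_loss_grad` and the chosen inputs are illustrative assumptions, not part of the original text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def squared_loss_grad(z_L, y, a_prev):
    """Eq. 4.60: gradient of the squared loss w.r.t. one output-node weight."""
    a_L = sigmoid(z_L)
    return 2.0 * (a_L - y) * a_L * (1.0 - a_L) * a_prev


if __name__ == "__main__":
    # True label y = 1: a very negative z_L is a badly wrong prediction,
    # yet the gradient is almost zero because sigma(z_L) is saturated.
    for z_L in (-10.0, -2.0, 0.0, 2.0, 10.0):
        print(z_L, squared_loss_grad(z_L, y=1.0, a_prev=1.0))
```

With the true label fixed at 1, the gradient magnitude is largest near $z^L = 0$ and shrinks toward zero at both extremes, including the extreme where the prediction is almost entirely wrong.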

Cross entropy loss function

The cross entropy loss function, which was described in the context of logistic regression in Section 4.6.2, largely avoids the problem of saturating outputs when used in combination with the sigmoid activation function. The cross entropy loss function of a real-valued prediction $\hat{y} \in (0, 1)$ on a data instance with binary label $y \in \{0, 1\}$ can be defined as

$$\text{Loss}(y, \hat{y}) = -y\log(\hat{y}) - (1-y)\log(1-\hat{y}), \tag{4.61}$$

where log represents the natural logarithm (to base e) and $0\log(0) = 0$ for convenience. The cross entropy function has foundations in information theory and measures the amount of disagreement between $y$ and $\hat{y}$. The partial derivative of this loss function with respect to $\hat{y} = a^L$ can be given as

$$\delta^L = \frac{\partial\,\text{Loss}}{\partial a^L} = \frac{-y}{a^L} + \frac{(1-y)}{(1-a^L)} = \frac{(a^L-y)}{a^L(1-a^L)}. \tag{4.62}$$

Using this value of $\delta^L$ in Equation 4.56, we can obtain the partial derivative of the loss with respect to the weight $w_j^L$ at the output node as

$$\frac{\partial\,\text{Loss}}{\partial w_j^L} = \frac{(a^L-y)}{a^L(1-a^L)} \times a^L(1-a^L) \times a_j^{L-1} = (a^L-y) \times a_j^{L-1}. \tag{4.63}$$

Notice the simplicity of the previous formula using the cross entropy loss function. The partial derivatives of the loss with respect to the weights at the output node depend only on the difference between the prediction $a^L$ and the true label $y$. In contrast to Equation 4.60, it does not involve terms such as $\sigma(z^L)(1-\sigma(z^L))$ that can be impacted by saturation of $z^L$. Hence, the gradients are high whenever $(a^L-y)$ is large, promoting effective learning of the model parameters at the output node. This has been a major breakthrough in the learning of modern ANN models and it is now a common practice to use the cross entropy loss function with sigmoid activations at the output node.
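The contrast between Equations 4.60 and 4.63 can be checked numerically. The sketch below is illustrative only: the function names and test values are assumptions made for this example, and the output node is reduced to a single weight so that $z^L = w \times a^{L-1}$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(y, y_hat):
    """Eq. 4.61 for y in {0, 1} and y_hat strictly between 0 and 1."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)


def cross_entropy_grad(z_L, y, a_prev):
    """Eq. 4.63: gradient w.r.t. an output-node weight under cross entropy."""
    return (sigmoid(z_L) - y) * a_prev


def squared_loss_grad(z_L, y, a_prev):
    """Eq. 4.60: the same gradient under squared loss, for comparison."""
    a_L = sigmoid(z_L)
    return 2.0 * (a_L - y) * a_L * (1.0 - a_L) * a_prev


if __name__ == "__main__":
    # A confidently wrong prediction: y = 1 but z_L = -10.
    print("cross entropy:", cross_entropy_grad(-10.0, 1.0, 1.0))
    print("squared loss :", squared_loss_grad(-10.0, 1.0, 1.0))
```

For the confidently wrong prediction in the usage lines, the cross entropy gradient stays close to $-1$ while the squared loss gradient is nearly zero, which is the behavior the derivation predicts.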

4.8.2 Using Responsive Activation Functions

Even though the cross entropy loss function helps in overcoming the problem

of saturating outputs, it still does not solve the problem of saturation at hidden

layers, arising due to the use of sigmoid activation functions at hidden nodes.

In fact, the effect of saturation on the learning of model parameters is even

more aggravated at hidden layers, a problem known as the vanishing gradient

problem. In the following, we describe the vanishing gradient problem and the

use of a more responsive activation function, called the rectified linear

unit (ReLU), to overcome this problem.

Vanishing Gradient Problem

The impact of saturating activation values on the learning of model

parameters increases at deeper hidden layers that are farther away from the

output node. Even if the activation in the output layer does not saturate, the

repeated multiplications performed as we backpropagate the gradients from

the output layer to the hidden layers may lead to decreasing gradients in the

hidden layers. This is called the vanishing gradient problem, which has been

one of the major hindrances in learning deep neural networks.

To illustrate the vanishing gradient problem, consider an ANN model that consists of a single node at every hidden layer of the network, as shown in Figure 4.28. This simplified architecture involves a single chain of hidden nodes where a single weighted link $w^l$ connects the node at layer $l-1$ to the node at layer $l$. Using Equations 4.56 and 4.59, we can represent the partial derivative of the loss with respect to $w^l$ as

$$\frac{\partial\,\text{Loss}}{\partial w^l} = \delta^l \times a^l(1-a^l) \times a^{l-1}, \quad \text{where } \delta^l = 2(a^L-y) \times \prod_{r=l}^{L-1}\left(a^{r+1}(1-a^{r+1}) \times w^{r+1}\right). \tag{4.64}$$

Notice that if any of the linear predictors $z^{r+1}$ saturates at subsequent layers, then the term $a^{r+1}(1-a^{r+1})$ becomes close to 0, thus diminishing the overall gradient. The saturation of activations thus gets compounded and has multiplicative effects on the gradients at hidden layers, making them highly unstable and thus unsuitable for use with gradient descent. Even though the previous discussion only pertains to the simplified architecture involving a single chain of hidden nodes, a similar argument can be made for any generic ANN architecture involving multiple chains of hidden nodes. Note that the vanishing gradient problem primarily arises because of the use of sigmoid activation function at hidden nodes, which is known to easily saturate, especially after repeated multiplications.

Figure 4.28.

An example of an ANN model with only one node at every hidden layer.
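The chain architecture of Figure 4.28 is small enough to simulate directly. The following sketch applies Equation 4.64 to a chain of sigmoid nodes with no bias terms (an assumption made to keep the example short) and returns the gradient for every link. The names `chain_loss` and `chain_gradients` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def chain_activations(x, weights):
    """Activations a^0..a^L of a chain with one sigmoid node per layer."""
    acts = [x]
    for w in weights:
        acts.append(sigmoid(w * acts[-1]))
    return acts


def chain_loss(x, y, weights):
    return (y - chain_activations(x, weights)[-1]) ** 2


def chain_gradients(x, y, weights):
    """Eq. 4.64: d Loss / d w^l for l = 1..L, returned as a list."""
    acts = chain_activations(x, weights)
    L = len(weights)
    grads = [0.0] * L
    delta = 2.0 * (acts[L] - y)  # delta^L from Eq. 4.58
    for l in range(L, 0, -1):
        local = acts[l] * (1.0 - acts[l])  # a^l (1 - a^l)
        grads[l - 1] = delta * local * acts[l - 1]
        delta = delta * local * weights[l - 1]  # delta^{l-1}
    return grads


if __name__ == "__main__":
    for l, g in enumerate(chain_gradients(1.0, 0.0, [1.0] * 10), start=1):
        print("layer", l, "gradient", g)
```

Because each factor $a^{r+1}(1-a^{r+1})$ is at most 0.25, the gradients printed for the layers closest to the input are several orders of magnitude smaller than the gradient at the output layer, even though no node in this example is strongly saturated.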

Figure 4.29.

Plot of the rectified linear unit (ReLU) activation function.

Rectified Linear Units (ReLU)

To overcome the vanishing gradient problem, it is important to use an activation function $f(z)$ at the hidden nodes that provides a stable and significant value of the gradient whenever a hidden node is active, i.e., $z>0$. This is achieved by using rectified linear units (ReLU) as activation functions at hidden nodes, which can be defined as

$$a = f(z) = \begin{cases} z, & \text{if } z>0, \\ 0, & \text{otherwise.} \end{cases} \tag{4.65}$$

The idea of ReLU has been inspired by biological neurons, which are either in an inactive state $(f(z)=0)$ or show an activation value proportional to the input. Figure 4.29 shows a plot of the ReLU function. We can see that it is linear with respect to $z$ when $z>0$. Hence, the gradient of the activation value with respect to $z$ can be written as

$$\frac{\partial a}{\partial z} = \begin{cases} 1, & \text{if } z>0, \\ 0, & \text{if } z<0. \end{cases} \tag{4.66}$$

Although $f(z)$ is not differentiable at 0, it is common practice to use $\partial a/\partial z = 0$ when $z=0$. Since the gradient of the ReLU activation function is equal to 1 whenever $z>0$, it avoids the problem of saturation at hidden nodes, even after repeated multiplications. Using ReLU, the partial derivatives of the loss with respect to the weight and bias parameters can be given by

$$\frac{\partial\,\text{Loss}}{\partial w_{ij}^l} = \delta_i^l \times I(z_i^l) \times a_j^{l-1}, \tag{4.67}$$

$$\frac{\partial\,\text{Loss}}{\partial b_i^l} = \delta_i^l \times I(z_i^l), \quad \text{where } \delta_j^l = \sum_{i=1}^{n}\left(\delta_i^{l+1} \times I(z_i^{l+1}) \times w_{ij}^{l+1}\right) \text{ and } I(z) = \begin{cases} 1, & \text{if } z>0, \\ 0, & \text{otherwise.} \end{cases} \tag{4.68}$$

Notice that ReLU shows a linear behavior in the activation values whenever a node is active, as compared to the nonlinear properties of the sigmoid function. This linearity promotes better flows of gradients during backpropagation, and thus simplifies the learning of ANN model parameters. The ReLU is also highly responsive at large values of $z$ away from 0, as opposed to the sigmoid activation function, making it more suitable for gradient descent. These differences give ReLU a major advantage over the sigmoid function. Indeed, ReLU is used as the preferred choice of activation function at hidden layers in most modern ANN models.
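A short sketch can compare how the local derivative of each activation function behaves under repeated multiplication. It isolates only the activation derivatives along a chain of nodes and deliberately leaves out the weights and the loss term, which is a simplification made for this illustration. The helper name `gradient_factor` is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    a = sigmoid(z)
    return a * (1.0 - a)


def relu(z):
    """Eq. 4.65."""
    return z if z > 0 else 0.0


def relu_grad(z):
    """Eq. 4.66, using the common convention that the gradient is 0 at z = 0."""
    return 1.0 if z > 0 else 0.0


def gradient_factor(pre_activations, grad_fn):
    """Product of the local activation derivatives along a chain of nodes."""
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product


if __name__ == "__main__":
    zs = [2.0] * 10  # ten active nodes in a row
    print("sigmoid:", gradient_factor(zs, sigmoid_grad))
    print("ReLU   :", gradient_factor(zs, relu_grad))
```

For a chain of active nodes the ReLU factor stays at exactly 1, whereas the sigmoid factor shrinks geometrically. The same helper also shows the other side of ReLU: a single inactive node (any $z \le 0$) makes the whole product zero, which matches the indicator $I(z)$ in Equations 4.67 and 4.68.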

4.8.3 Regularization

A major challenge in learning deep neural networks is the high model complexity of ANN models, which grows with the addition of hidden layers in the network. This can become a serious concern, especially when the training set is small, due to the phenomenon of model overfitting. To overcome this challenge, it is important to use techniques that can help in reducing the complexity of the learned model, known as regularization techniques. Classical approaches for learning ANN models did not have an effective way to promote regularization of the learned model parameters. Hence, they had often been sidelined by other classification methods, such as support vector machines (SVM), which have in-built regularization mechanisms. (SVMs will be discussed in more detail in Section 4.9.)

One of the major advancements in deep learning has been the development

of novel regularization techniques for ANN models that are able to offer

significant improvements in generalization performance. In the following, we

discuss one of the regularization techniques for ANN, known as the dropout

method, which has gained a lot of attention in several applications.

Dropout

The main objective of dropout is to avoid the learning of spurious features at

hidden nodes, occurring due to model overfitting. It uses the basic intuition

that spurious features often “co-adapt” themselves such that they show good

training performance only when used in highly selective combinations. On the

other hand, relevant features can be used in a diversity of feature

combinations and hence are quite resilient to the removal or modification of

other features. The dropout method uses this intuition to break complex “co-

adaptations” in the learned features by randomly dropping input and hidden

nodes in the network during training.

Dropout belongs to a family of regularization techniques that use the criterion

of resilience to random perturbations as a measure of the robustness (and

hence, simplicity) of a model. For example, one approach to regularization is

to inject noise in the input attributes of the training set and learn a model with

the noisy training instances. If a feature learned from the training data is

indeed generalizable, it should not be affected by the addition of noise.

Dropout can be viewed as a similar regularization approach that perturbs the

information content of the training set not only at the level of attributes but also

at multiple levels of abstractions, by dropping input and hidden nodes.

The dropout method draws inspiration from the biological process of gene

swapping in sexual reproduction, where half of the genes from both parents

are combined together to create the genes of the offspring. This favors the

selection of parent genes that are not only useful but can also inter-mingle

with diverse combinations of genes coming from the other parent. On the

other hand, co-adapted genes that function only in highly selective

combinations are soon eliminated in the process of evolution. This idea is

used in the dropout method for eliminating spurious co-adapted features. A

simplified description of the dropout method is provided in the rest of this

section.

Figure 4.30.

Examples of sub-networks generated in the dropout method using $\gamma=0.5$.

Let $(\mathbf{w}^k, \mathbf{b}^k)$ represent the model parameters of the ANN model at the $k^{\text{th}}$ iteration of the gradient descent method. At every iteration, we randomly select a fraction $\gamma$ of input and hidden nodes to be dropped from the network, where $\gamma \in (0, 1)$ is a hyper-parameter that is typically chosen to be 0.5. The weighted links and bias terms involving the dropped nodes are then eliminated, resulting in a "thinned" sub-network of smaller size. The model parameters of the sub-network $(\mathbf{w}_s^k, \mathbf{b}_s^k)$ are then updated by computing activation values and performing backpropagation on this smaller sub-network. These updated values are then added back into the original network to obtain the updated model parameters, $(\mathbf{w}^{k+1}, \mathbf{b}^{k+1})$, to be used in the next iteration.

Figure 4.30 shows some examples of sub-networks that can be generated at different iterations of the dropout method, by randomly dropping input and hidden nodes. Since every sub-network has a different architecture, it is difficult to learn complex co-adaptations in the features that can result in overfitting. Instead, the features at the hidden nodes are learned to be more resilient to random modifications in the network structure, thus improving their generalization ability. The model parameters are updated using a different random sub-network at every iteration, until the gradient descent method converges.

Let $(\mathbf{w}^{k_{\max}}, \mathbf{b}^{k_{\max}})$ denote the model parameters at the last iteration $k_{\max}$ of the gradient descent method. These parameters are finally scaled down by a factor of $(1-\gamma)$, to produce the weights and bias terms of the final ANN model, as follows:

$$(\mathbf{w}^*, \mathbf{b}^*) = \left((1-\gamma) \times \mathbf{w}^{k_{\max}},\ (1-\gamma) \times \mathbf{b}^{k_{\max}}\right)$$

We can now use the complete neural network with model parameters $(\mathbf{w}^*, \mathbf{b}^*)$ for testing. The dropout method has been shown to provide significant improvements in the generalization performance of ANN models in a number of applications. It is computationally cheap and can be applied in combination with any of the other deep learning techniques. It also has a number of similarities with a widely used ensemble learning method known as bagging, which learns multiple models using random subsets of the training set, and then uses the average output of all the models to make predictions. (Bagging will be presented in more detail later in Section 4.10.4.) In a similar vein, it can be shown that the predictions of the final network learned using dropout approximate the average output of all possible $2^n$ sub-networks that can be formed using $n$ nodes. This is one of the reasons behind the superior regularization abilities of dropout.
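The link between the $(1-\gamma)$ scaling and the average over sub-networks can be verified exactly in the simplest possible case. The sketch below assumes a single linear unit with no bias term and $\gamma = 0.5$, so that every one of the $2^n$ keep/drop patterns over the $n$ inputs is equally likely. Under these assumptions the match is exact. For networks with nonlinear activations it is only an approximation, as the text states. All function names are hypothetical.

```python
from itertools import product


def linear_output(weights, inputs, mask):
    """Output of a linear unit when input j is kept (1) or dropped (0)."""
    return sum(w * x * m for w, x, m in zip(weights, inputs, mask))


def average_over_subnetworks(weights, inputs):
    """Average output over all 2^n sub-networks of a unit with n inputs."""
    n = len(weights)
    outputs = [linear_output(weights, inputs, mask)
               for mask in product((0, 1), repeat=n)]
    return sum(outputs) / len(outputs)


def scaled_output(weights, inputs, gamma):
    """Output of the full unit after scaling its weights by (1 - gamma)."""
    return sum((1.0 - gamma) * w * x for w, x in zip(weights, inputs))


if __name__ == "__main__":
    w, x = [0.5, -1.0, 2.0], [1.0, 2.0, 3.0]
    print(average_over_subnetworks(w, x), scaled_output(w, x, gamma=0.5))
```

Each input appears in exactly half of the sub-networks, so averaging over all of them halves every term of the weighted sum, which is the same as scaling the weights by $(1-\gamma)$ with $\gamma = 0.5$.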

4.8.4 Initialization of Model Parameters

Because of the non-convex nature of the loss function used by ANN models, it

is possible to get stuck in locally optimal but globally inferior solutions. Hence,

the initial choice of model parameter values plays a significant role in the

learning of ANN by gradient descent. The impact of poor initialization is even

more aggravated when the model is complex, the network architecture is

deep, or the classification task is difficult. In such cases, it is often advisable to

first learn a simpler model for the problem, e.g., using a single hidden layer,

and then incrementally increase the complexity of the model, e.g., by adding

more hidden layers. An alternate approach is to train the model for a simpler

task and then use the learned model parameters as initial parameter choices

in the learning of the original task. The process of initializing ANN model

parameters before the actual training process is known as pretraining.

Pretraining helps in initializing the model to a suitable region in the parameter

space that would otherwise be inaccessible by random initialization.

Pretraining also reduces the variance in the model parameters by fixing the

starting point of gradient descent, thus reducing the chances of overfitting due

to multiple comparisons. The models learned by pretraining are thus more

consistent and provide better generalization performance.

Supervised Pretraining

A common approach for pretraining is to incrementally train the ANN model in

a layer-wise manner, by adding one hidden layer at a time. This approach,

known as supervised pretraining, ensures that the parameters learned at

every layer are obtained by solving a simpler problem, rather than learning all

model parameters together. These parameter values thus provide a good

choice for initializing the ANN model. The approach for supervised pretraining

can be briefly described as follows.

We start the supervised pretraining process by considering a reduced ANN

model with only a single hidden layer. By applying gradient descent on this

simple model, we are able to learn the model parameters of the first hidden

layer. At the next run, we add another hidden layer to the model and apply

gradient descent to learn the parameters of the newly added hidden layer,

while keeping the parameters of the first layer fixed. This procedure is recursively applied such that while learning the parameters of the $l^{\text{th}}$ hidden layer, we consider a reduced model with only $l$ hidden layers, whose first $(l-1)$ hidden layers are not updated on the $l^{\text{th}}$ run but are instead fixed using pretrained values from previous runs. In this way, we are able to learn the model parameters of all $(L-1)$ hidden layers. These pretrained values are used to initialize the hidden layers of the final ANN model, which is fine-tuned by applying a final round of gradient descent over all the layers.

Unsupervised Pretraining

Supervised pretraining provides a powerful way to initialize model parameters, by gradually growing the model complexity from shallower to deeper networks. However, supervised pretraining requires a sufficient number of labeled training instances for effective initialization of the ANN model. An alternate pretraining approach is unsupervised pretraining, which initializes model parameters by using unlabeled instances that are often abundantly available. The basic idea of unsupervised pretraining is to initialize the ANN model in such a way that the learned features capture the latent structure in the unlabeled data.

Figure 4.31.

The basic architecture of a single-layer autoencoder.

Unsupervised pretraining relies on the assumption that learning the

distribution of the input data can indirectly help in learning the classification

model. It is most helpful when the number of labeled examples is small and

the features for the supervised problem bear resemblance to the factors

generating the input data. Unsupervised pretraining can be viewed as a

different form of regularization, where the focus is not explicitly toward finding

simpler features but instead toward finding features that can best explain the

input data. Historically, unsupervised pretraining has played an important role

in reviving the area of deep learning, by making it possible to train any generic

deep neural network without requiring specialized architectures.

Use of Autoencoders

One simple and commonly used approach for unsupervised pretraining is to

use an unsupervised ANN model known as an autoencoder. The basic

architecture of an autoencoder is shown in Figure 4.31. An autoencoder attempts to learn a reconstruction of the input data by mapping the attributes $\mathbf{x}$ to latent features, and then re-projecting the latent features back to the original attribute space to create the reconstruction $\hat{\mathbf{x}}$. The latent features are represented using a hidden layer of nodes, while the input and output layers represent the attributes and contain the same number of nodes. During training, the goal is to learn an autoencoder model that provides the lowest reconstruction error, $RE(\mathbf{x}, \hat{\mathbf{x}})$, on all input data instances. A typical choice of the reconstruction error is the squared loss function:

$$RE(\mathbf{x}, \hat{\mathbf{x}}) = \lVert \mathbf{x} - \hat{\mathbf{x}} \rVert^2.$$

The model parameters of the autoencoder can be learned by using a similar gradient descent method as the one used for learning supervised ANN models for classification. The key difference is the use of the reconstruction error on all training instances as the training loss. Autoencoders that have multiple hidden layers are known as stacked autoencoders.

Autoencoders are able to capture complex representations of the input data by the use of hidden nodes. However, if the number of hidden nodes is large, it is possible for an autoencoder to learn the identity relationship, where the input $\mathbf{x}$ is just copied and returned as the output $\hat{\mathbf{x}}$, resulting in a trivial solution. For example, if we use as many hidden nodes as the number of attributes, then it is possible for every hidden node to copy an attribute and simply pass it along to an output node, without extracting any useful information. To avoid this problem, it is common practice to keep the number of hidden nodes smaller than the number of input attributes. This forces the autoencoder to learn a compact and useful encoding of the input data, similar to a dimensionality reduction technique. An alternate approach is to corrupt the input instances by adding random noise, and then learn the autoencoder to reconstruct the original instance from the noisy input. This approach is known as the denoising autoencoder, which offers strong regularization capabilities and is often used to learn complex features even in the presence of a large number of hidden nodes.
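The training loop of an autoencoder can be sketched with a deliberately tiny model. The example below assumes a linear autoencoder with two attributes, one hidden node, and no bias terms or activation functions. These simplifications are made only to keep the gradients short enough to write by hand and are not the architecture of Figure 4.31. The function names, data, and hyper-parameter values are hypothetical.

```python
def reconstruct(x, enc, dec):
    """Encode x into one latent feature c, then decode it back to x_hat."""
    c = sum(e_j * x_j for e_j, x_j in zip(enc, x))
    return c, [c * d_k for d_k in dec]


def reconstruction_error(data, enc, dec):
    """Sum of RE(x, x_hat) = ||x - x_hat||^2 over all instances."""
    total = 0.0
    for x in data:
        _, x_hat = reconstruct(x, enc, dec)
        total += sum((x_k - xh_k) ** 2 for x_k, xh_k in zip(x, x_hat))
    return total


def train(data, enc, dec, lr=0.01, epochs=500):
    """Batch gradient descent on the total reconstruction error."""
    enc, dec = list(enc), list(dec)
    for _ in range(epochs):
        g_enc = [0.0] * len(enc)
        g_dec = [0.0] * len(dec)
        for x in data:
            c, x_hat = reconstruct(x, enc, dec)
            resid = [x_k - xh_k for x_k, xh_k in zip(x, x_hat)]
            # d RE / d dec_k = -2 * resid_k * c
            for k in range(len(dec)):
                g_dec[k] += -2.0 * resid[k] * c
            # d RE / d c = -2 * sum_k resid_k * dec_k, and d c / d enc_j = x_j
            d_c = -2.0 * sum(r * d for r, d in zip(resid, dec))
            for j in range(len(enc)):
                g_enc[j] += d_c * x[j]
        enc = [e - lr * g for e, g in zip(enc, g_enc)]
        dec = [d - lr * g for d, g in zip(dec, g_dec)]
    return enc, dec


if __name__ == "__main__":
    # The second attribute is always twice the first, so one latent
    # feature is enough to reconstruct both attributes.
    data = [(0.5, 1.0), (-0.5, -1.0), (1.0, 2.0), (-1.0, -2.0), (0.25, 0.5)]
    enc, dec = train(data, [0.1, 0.1], [0.1, 0.1])
    print(reconstruction_error(data, enc, dec))
```

Because the hidden layer is narrower than the input, the model cannot copy both attributes through unchanged. It can only drive the reconstruction error down by discovering the one-dimensional structure in the data, which is the dimensionality reduction behavior described above.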

To use an autoencoder for unsupervised pretraining, we can follow a layer-wise approach similar to supervised pretraining. In particular, to pretrain the model parameters of the $l^{\text{th}}$ hidden layer, we can construct a reduced ANN model with only $l$ hidden layers and an output layer that contains the same number of nodes as the attributes and is used for reconstruction. The parameters of the $l^{\text{th}}$ hidden layer of this network are then learned using a gradient descent method to minimize the reconstruction error. The use of unlabeled data can be viewed as providing hints to the learning of parameters at every layer that aid in generalization. The final model parameters of the ANN model are then learned by applying gradient descent over all the layers, using the initial values of parameters obtained from pretraining.

Hybrid Pretraining

Unsupervised pretraining can also be combined with supervised pretraining by using two output layers at every run of pretraining, one for reconstruction and the other for supervised classification. The parameters of the $l^{\text{th}}$ hidden layer are then learned by jointly minimizing the losses on both output layers, usually weighted by a trade-off hyper-parameter $\alpha$. Such a combined approach often shows better generalization performance than either of the approaches, since it provides a way to balance between the competing objectives of representing the input data and improving classification performance.

4.8.5 Characteristics of Deep Learning

Apart from the basic characteristics of ANN discussed in Section 4.7.3, the

use of deep learning techniques provides the following additional

characteristics:

1. An ANN model trained for some task can be easily re-used for a

different task that involves the same attributes, by using pretraining

strategies. For example, we can use the learned parameters of the

original task as initial parameter choices for the target task. In this way,

ANN promotes re-usability of learning, which can be quite useful when

the target application has a smaller number of labeled training

instances.

2. Deep learning techniques for regularization, such as the dropout

method, help in reducing the model complexity of ANN and thus

promoting good generalization performance. The use of regularization

techniques is especially useful in high-dimensional settings, where the

number of training labels is small but the classification problem is

inherently difficult.

3. The use of an autoencoder for pretraining can help eliminate irrelevant

attributes that are not related to other attributes. Further, it can help

reduce the impact of redundant attributes by representing them as

copies of the same attribute.

4. Although the learning of an ANN model can succumb to finding inferior

and locally optimal solutions, there are a number of deep learning

techniques that have been proposed to ensure adequate learning of an

ANN. Apart from the methods discussed in this section, some other

techniques involve novel architecture designs such as skip connections

between the output layer and lower layers, which aid the easy flow of

gradients during backpropagation.

5. A number of specialized ANN architectures have been designed to

handle a variety of input data sets. Some examples include

convolutional neural networks (CNNs) for two-dimensional gridded

objects such as images, and recurrent neural networks (RNNs) for

sequences. While CNNs have been extensively used in the area of

computer vision, RNNs have found applications in processing speech

and language.

4.9 Support Vector Machine (SVM)

A support vector machine (SVM) is a discriminative classification model that

learns linear or nonlinear decision boundaries in the attribute space to

separate the classes. Apart from maximizing the separability of the two

classes, SVM offers strong regularization capabilities, i.e., it is able to control

the complexity of the model in order to ensure good generalization

performance. Due to its unique ability to innately regularize its learning, SVM

is able to learn highly expressive models without suffering from overfitting. It

has thus received considerable attention in the machine learning community

and is commonly used in several practical applications, ranging from

handwritten digit recognition to text categorization. SVM has strong roots in

statistical learning theory and is based on the principle of structural risk

minimization. Another unique aspect of SVM is that it represents the decision

boundary using only a subset of the training examples that are most difficult to

classify, known as the support vectors. Hence, it is a discriminative model

that is impacted only by training instances near the boundary of the two

classes, in contrast to learning the generative distribution of every class.

To illustrate the basic idea behind SVM, we first introduce the concept of the

margin of a separating hyperplane and the rationale for choosing such a

hyperplane with maximum margin. We then describe how a linear SVM can

be trained to explicitly look for this type of hyperplane. We conclude by

showing how the SVM methodology can be extended to learn nonlinear

decision boundaries by using kernel functions.

4.9.1 Margin of a Separating Hyperplane

The generic equation of a separating hyperplane can be written as

$$\mathbf{w}^T\mathbf{x} + b = 0,$$

where $\mathbf{x}$ represents the attributes and $(\mathbf{w}, b)$ represent the parameters of the hyperplane. A data instance $\mathbf{x}_i$ can belong to either side of the hyperplane depending on the sign of $(\mathbf{w}^T\mathbf{x}_i + b)$. For the purpose of binary classification, we are interested in finding a hyperplane that places instances of both classes on opposite sides of the hyperplane, thus resulting in a separation of the two classes. If there exists a hyperplane that can perfectly separate the classes in the data set, we say that the data set is linearly separable. Figure 4.32 shows an example of linearly separable data involving two classes, squares and circles. Note that there can be infinitely many hyperplanes that can separate the classes, two of which are shown in Figure 4.32 as lines $B_1$ and $B_2$. Even though every such hyperplane will have zero training error, they can provide different results on previously unseen instances. Which separating hyperplane should we thus finally choose to obtain the best generalization performance? Ideally, we would like to choose a simple hyperplane that is robust to small perturbations. This can be achieved by using the concept of the margin of a separating hyperplane, which can be briefly described as follows.

Figure 4.32.

Margin of a hyperplane in a two-dimensional data set.

For every separating hyperplane $B_i$, let us associate a pair of parallel hyperplanes, $b_{i1}$ and $b_{i2}$, such that they touch the closest instances of both classes, respectively. For example, if we move $B_1$ parallel to its direction, we can touch the first square using $b_{11}$ and the first circle using $b_{12}$. $b_{i1}$ and $b_{i2}$ are known as the margin hyperplanes of $B_i$, and the distance between them is known as the margin of the separating hyperplane $B_i$. From the diagram shown in Figure 4.32, notice that the margin for $B_1$ is considerably larger than that for $B_2$. In this example, $B_1$ turns out to be the separating hyperplane with the maximum margin, known as the maximum margin hyperplane.

Rationale for Maximum Margin

Hyperplanes with large margins tend to have better generalization

performance than those with small margins. Intuitively, if the margin is small,

then any slight perturbation in the hyperplane or the training instances located

at the boundary can have quite an impact on the classification performance.

Small margin hyperplanes are thus more susceptible to overfitting, as they are

barely able to separate the classes with a very narrow room to allow

perturbations. On the other hand, a hyperplane that is farther away from

training instances of both classes has sufficient leeway to be robust to minor

modifications in the data, and thus shows superior generalization

performance.

The idea of choosing the maximum margin separating hyperplane also has

strong foundations in statistical learning theory. It can be shown that the

margin of such a hyperplane is inversely related to the VC-dimension of the

classifier, which is a commonly used measure of the complexity of a model.

As discussed in Section 3.4 of the last chapter, a simpler model should be

preferred over a more complex model if they both show similar training

performance. Hence, maximizing the margin results in the selection of a

separating hyperplane with the lowest model complexity, which is expected to

show better generalization performance.
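The notion of a margin can be made concrete with a small computation. The following is a minimal sketch, not a prescribed procedure: it measures the margin of a candidate hyperplane as the distance to the closest instance of each class, added together. The toy data and the two candidate hyperplanes are hypothetical and are chosen only to mimic the situation of two separating lines with different margins.

```python
import math

def margin(w, b, points, labels):
    """Return the margin of the hyperplane w.x + b = 0 on a labeled data set.

    The margin is the distance to the closest instance of class +1 plus the
    distance to the closest instance of class -1. None is returned if the
    hyperplane does not place the two classes on opposite sides.
    """
    norm = math.sqrt(sum(wj * wj for wj in w))
    closest = {1: math.inf, -1: math.inf}
    for x, y in zip(points, labels):
        signed_dist = (sum(wj * xj for wj, xj in zip(w, x)) + b) / norm
        if y * signed_dist <= 0:  # instance is on the wrong side (or on the hyperplane)
            return None
        closest[y] = min(closest[y], abs(signed_dist))
    return closest[1] + closest[-1]

# Hypothetical toy data: three "squares" (+1) and three "circles" (-1).
points = [(3, 3), (4, 3), (3, 4), (1, 1), (0, 1), (1, 0)]
labels = [1, 1, 1, -1, -1, -1]

margin_b1 = margin((1, 1), -4, points, labels)  # candidate hyperplane x1 + x2 = 4
margin_b2 = margin((1, 0), -2, points, labels)  # candidate hyperplane x1 = 2
print(margin_b1, margin_b2)
```

Both candidates separate the toy data without error, but the first has the larger margin and would therefore be preferred under the maximum margin criterion.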

4.9.2 Linear SVM

A linear SVM is a classifier that searches for a separating hyperplane with the

largest margin, which is why it is often known as a maximal margin

classifier. The basic idea of SVM can be described as follows.

Consider a binary classification problem consisting of n training instances, where every training instance $\mathbf{x}_i$ is associated with a binary label $y_i \in \{-1, 1\}$. Let $\mathbf{w}^T\mathbf{x} + b = 0$ be the equation of a separating hyperplane that separates the two classes by placing them on opposite sides. This means that

$$\mathbf{w}^T\mathbf{x}_i + b > 0 \ \text{ if } y_i = 1, \qquad \mathbf{w}^T\mathbf{x}_i + b < 0 \ \text{ if } y_i = -1.$$

The distance of any point $\mathbf{x}$ from the hyperplane is then given by

$$D(\mathbf{x}) = \frac{|\mathbf{w}^T\mathbf{x} + b|}{\|\mathbf{w}\|},$$

where $|\cdot|$ denotes the absolute value and $\|\cdot\|$ denotes the length of a vector. Let the distance of the closest point from the hyperplane with $y = 1$ be $k_+ > 0$. Similarly, let $k_- > 0$ denote the distance of the closest point from class $-1$. This can be represented using the following constraints:

$$\frac{\mathbf{w}^T\mathbf{x}_i + b}{\|\mathbf{w}\|} \ge k_+ \ \text{ if } y_i = 1, \qquad \frac{\mathbf{w}^T\mathbf{x}_i + b}{\|\mathbf{w}\|} \le -k_- \ \text{ if } y_i = -1. \tag{4.69}$$

The previous equations can be succinctly represented by using the product of $y_i$ and $(\mathbf{w}^T\mathbf{x}_i + b)$ as

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge M\|\mathbf{w}\| \tag{4.70}$$

where M is a parameter related to the margin of the hyperplane, i.e., if $k_+ = k_- = M$, then margin $= k_+ + k_- = 2M$. In order to find the maximum margin hyperplane that adheres to the previous constraints, we can consider the following optimization problem:

$$\max_{\mathbf{w}, b} M \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge M\|\mathbf{w}\|. \tag{4.71}$$

To find the solution to the previous problem, note that if $\mathbf{w}$ and b satisfy the constraints of the previous problem, then any scaled version of $\mathbf{w}$ and b would

satisfy them too. Hence, we can conveniently choose $\|\mathbf{w}\| = 1/M$ to simplify the right-hand side of the inequalities. Furthermore, maximizing M amounts to minimizing $\|\mathbf{w}\|^2$. Hence, the optimization problem of SVM is commonly represented in the following form:

$$\min_{\mathbf{w}, b} \frac{\|\mathbf{w}\|^2}{2} \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1. \tag{4.72}$$

Learning Model Parameters

Equation 4.72 represents a constrained optimization problem with linear inequalities. Since the objective function is convex and quadratic with respect to $\mathbf{w}$, it is known as a quadratic programming problem (QPP), which can be solved using standard optimization techniques, as described in Appendix E. In the following, we present a brief sketch of the main ideas for learning the model parameters of SVM.

First, we rewrite the objective function in a form that takes into account the constraints imposed on its solutions. The new objective function is known as the Lagrangian primal problem, which can be represented as follows,

$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \lambda_i \left(y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1\right), \tag{4.73}$$

where the parameters $\lambda_i \ge 0$ correspond to the constraints and are called the Lagrange multipliers. Next, to minimize the Lagrangian, we take the derivative of $L_P$ with respect to $\mathbf{w}$ and b and set them equal to zero:

$$\frac{\partial L_P}{\partial \mathbf{w}} = 0 \ \Rightarrow \ \mathbf{w} = \sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i, \tag{4.74}$$

$$\frac{\partial L_P}{\partial b} = 0 \ \Rightarrow \ \sum_{i=1}^{n} \lambda_i y_i = 0. \tag{4.75}$$

Note that using Equation 4.74, we can represent $\mathbf{w}$ completely in terms of the Lagrange multipliers. There is another relationship between $(\mathbf{w}, b)$ and $\lambda_i$ that is derived from the Karush-Kuhn-Tucker (KKT) conditions, a set of optimality conditions commonly used in solving QPP. This relationship can be described as

$$\lambda_i\left[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1\right] = 0. \tag{4.76}$$

Equation 4.76 is known as the complementary slackness condition, which sheds light on a valuable property of SVM. It states that the Lagrange multiplier $\lambda_i$ is strictly greater than 0 only when $\mathbf{x}_i$ satisfies the equation $y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$, which means that $\mathbf{x}_i$ lies exactly on a margin hyperplane. However, if $\mathbf{x}_i$ is farther away from the margin hyperplanes such that $y_i(\mathbf{w}^T\mathbf{x}_i + b) > 1$, then $\lambda_i$ is necessarily 0. Hence, $\lambda_i > 0$ for only a small number of instances that are closest to the separating hyperplane, which are known as support vectors. Figure 4.33 shows the support vectors of a hyperplane as filled circles and squares. Further, if we look at Equation 4.74, we will observe that training instances with $\lambda_i = 0$ do not contribute to the weight parameter $\mathbf{w}$. This suggests that $\mathbf{w}$ can be concisely represented only in terms of the support vectors in the training data, which are far fewer than the overall number of training instances. This ability to represent the decision function only in terms of the support vectors is what gives this classifier the name support vector machines.

Figure 4.33.

Support vectors of a hyperplane shown as filled circles and squares.
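The step from the Lagrangian primal problem to a problem stated purely in terms of the multipliers is worth writing out. As a sketch, expanding Equation 4.73 and substituting the stationarity conditions of Equations 4.74 and 4.75 leaves an expression that depends only on $\lambda_i$. The symbol $L_D$ for the resulting objective is notation assumed here and is not taken from the text.

```latex
\begin{align*}
L_P &= \tfrac{1}{2}\|\mathbf{w}\|^2
       - \sum_{i=1}^{n} \lambda_i y_i \mathbf{w}^T\mathbf{x}_i
       - b \sum_{i=1}^{n} \lambda_i y_i
       + \sum_{i=1}^{n} \lambda_i \\
    &= \tfrac{1}{2}\mathbf{w}^T\mathbf{w} - \mathbf{w}^T\mathbf{w} - 0 + \sum_{i=1}^{n} \lambda_i
       && \text{(by Equations 4.74 and 4.75)} \\
    &= \sum_{i=1}^{n} \lambda_i
       - \tfrac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j
       \;=\; L_D .
\end{align*}
```

The second line uses $\sum_i \lambda_i y_i \mathbf{w}^T\mathbf{x}_i = \mathbf{w}^T \sum_i \lambda_i y_i \mathbf{x}_i = \mathbf{w}^T\mathbf{w}$, and the last line expands $\mathbf{w}^T\mathbf{w}$ with Equation 4.74.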

Using Equations 4.74, 4.75, and 4.76 in Equation 4.73, we obtain the following optimization problem in terms of the Lagrange multipliers $\lambda_i$:

$$\max_{\lambda_i} \ \sum_{i=1}^{n} \lambda_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \quad \text{subject to} \quad \sum_{i=1}^{n} \lambda_i y_i = 0, \ \ \lambda_i \ge 0. \tag{4.77}$$

The previous optimization problem is called the dual optimization problem. Maximizing the dual problem with respect to $\lambda_i$ is equivalent to minimizing the primal problem with respect to $\mathbf{w}$ and b.

The key differences between the dual and primal problems are as follows:

1. Solving the dual problem helps us identify the support vectors in the data that have non-zero values of $\lambda_i$. Further, the solution of the dual problem is influenced only by the support vectors that are closest to the decision boundary of SVM. This helps in summarizing the learning of SVM solely in terms of its support vectors, which are easier to manage computationally. Further, it represents a unique ability of SVM to be dependent only on the instances closest to the boundary, which are harder to classify, rather than the distribution of instances farther away from the boundary.

2. The objective of the dual problem involves only terms of the form $\mathbf{x}_i^T\mathbf{x}_j$, which are basically inner products in the attribute space. As we will see later in Section 4.9.4, this property will prove to be quite useful in learning nonlinear decision boundaries using SVM.

Because of these differences, it is useful to solve the dual optimization problem using any of the standard solvers for QPP. Having found an optimal solution for $\lambda_i$, we can use Equation 4.74 to solve for $\mathbf{w}$. We can then use Equation 4.76 on the support vectors to solve for b as follows:

$$b = \frac{1}{n_S} \sum_{i \in S} \frac{1 - y_i \mathbf{w}^T\mathbf{x}_i}{y_i} \tag{4.78}$$

where S represents the set of support vectors $(S = \{i \mid \lambda_i > 0\})$ and $n_S$ is the number of support vectors. The maximum margin hyperplane can then be expressed as

$$f(\mathbf{x}) = \left(\sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i^T\mathbf{x}\right) + b = 0. \tag{4.79}$$

Using this separating hyperplane, a test instance $\mathbf{x}$ can be assigned a class label using the sign of $f(\mathbf{x})$.

Example 4.7.

Consider the two-dimensional data set shown in Figure 4.34, which contains eight training instances. Using quadratic programming, we can solve the optimization problem stated in Equation 4.77 to obtain the Lagrange multiplier $\lambda_i$ for each training instance. The Lagrange multipliers are depicted in the last column of the table. Notice that only the first two instances have non-zero Lagrange multipliers. These instances correspond to the support vectors for this data set.

Let $\mathbf{w} = (w_1, w_2)$ and b denote the parameters of the decision boundary. Using Equation 4.74, we can solve for $w_1$ and $w_2$ in the following way:

$$w_1 = \sum_i \lambda_i y_i x_{i1} = 65.5261 \times 1 \times 0.3858 + 65.5261 \times (-1) \times 0.4871 = -6.64,$$

$$w_2 = \sum_i \lambda_i y_i x_{i2} = 65.5261 \times 1 \times 0.4687 + 65.5261 \times (-1) \times 0.611 = -9.32.$$

Figure 4.34.

Example of a linearly separable data set.
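The arithmetic of this example can be checked with a few lines of Python. The sketch below takes the two support vectors and their multipliers as quoted in the example, applies Equation 4.74 to obtain $\mathbf{w}$, and then uses the fact that a support vector satisfies $y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$ to obtain b. The helper names `dot` and `predict` are introduced here for illustration only, and the six instances with zero multipliers are omitted because they do not contribute to $\mathbf{w}$.

```python
# Support vectors, labels, and Lagrange multipliers as quoted in Example 4.7.
support_vectors = [(0.3858, 0.4687), (0.4871, 0.611)]
labels = [1, -1]
lambdas = [65.5261, 65.5261]

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

# Equation 4.74: w = sum_i lambda_i * y_i * x_i
w = [sum(lam * y * x[j] for lam, y, x in zip(lambdas, labels, support_vectors))
     for j in range(2)]

# On a support vector, y_i (w.x_i + b) = 1; since y_i is +1 or -1, b = y_i - w.x_i.
b_values = [y - dot(w, x) for x, y in zip(support_vectors, labels)]
b = sum(b_values) / len(b_values)

def predict(x):
    """Class label from the sign of f(x) = w.x + b."""
    return 1 if dot(w, x) + b >= 0 else -1

print([round(wj, 2) for wj in w], round(b, 2))
```

Because the sketch keeps $\mathbf{w}$ unrounded, the individual bias values differ slightly in the later decimals from a hand calculation that uses rounded weights, but the parameters agree with the example to two decimal places.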

The bias term b can be computed using Equation 4.76 for each support vector:

$$b^{(1)} = 1 - \mathbf{w} \cdot \mathbf{x}_1 = 1 - (-6.64)(0.3858) - (-9.32)(0.4687) = 7.9300,$$

$$b^{(2)} = -1 - \mathbf{w} \cdot \mathbf{x}_2 = -1 - (-6.64)(0.4871) - (-9.32)(0.611) = 7.9289.$$

Averaging these values, we obtain $b = 7.93$. The decision boundary corresponding to these parameters is shown in Figure 4.34.

4.9.3 Soft-margin SVM

Figure 4.35 shows a data set that is similar to Figure 4.32, except it has two new examples, P and Q. Although the decision boundary $B_1$ misclassifies the new examples, while $B_2$ classifies them correctly, this does not mean that $B_2$ is a better decision boundary than $B_1$ because the new examples may correspond to noise in the training data. $B_1$ should still be preferred over $B_2$ because it has a wider margin, and thus, is less susceptible to overfitting. However, the SVM formulation presented in the previous section only constructs decision boundaries that are mistake-free.

Figure 4.35.

Decision boundary of SVM for the non-separable case.

This section examines how the formulation of SVM can be modified to learn a

separating hyperplane that is tolerant of a small number of training errors using

a method known as the soft-margin approach. More importantly, the method

presented in this section allows SVM to learn linear hyperplanes even in

situations where the classes are not linearly separable. To do this, the learning

algorithm in SVM must consider the trade-off between the width of the margin

and the number of training errors committed by the linear hyperplane.

To introduce the concept of training errors in the SVM formulation, let us relax the inequality constraints to accommodate some violations on a small number of training instances. This can be done by introducing a slack variable $\xi_i \ge 0$ for every training instance $\mathbf{x}_i$ as follows:

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i \tag{4.80}$$

The variable $\xi_i$ allows for some slack in the inequalities of the SVM such that every instance $\mathbf{x}_i$ does not need to strictly satisfy $y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1$. Further, $\xi_i$ is non-zero only if the margin hyperplanes are not able to place $\mathbf{x}_i$ on the same side as the rest of the instances belonging to $y_i$. To illustrate this, Figure 4.36 shows a circle P that falls on the opposite side of the separating hyperplane as the rest of the circles, and thus satisfies $\mathbf{w}^T\mathbf{x} + b = -1 + \xi$. The distance between P and the margin hyperplane $\mathbf{w}^T\mathbf{x} + b = -1$ is equal to $\xi / \|\mathbf{w}\|$. Hence, $\xi_i$ provides a measure of the error of SVM in representing $\mathbf{x}_i$ using soft inequality constraints.

Figure 4.36.
Slack variables used in soft-margin SVM.
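To see how slack variables behave, it helps to evaluate them for a fixed hyperplane. The sketch below is illustrative only: it computes, for each instance, the smallest non-negative slack that satisfies the relaxed constraint, together with the distance $\xi / \|\mathbf{w}\|$ to the instance's own margin hyperplane. The hyperplane, the instances, and the helper names `slack` and `describe` are hypothetical.

```python
import math

def slack(w, b, x, y):
    """Smallest xi >= 0 that satisfies y * (w.x + b) >= 1 - xi."""
    return max(0.0, 1.0 - y * (sum(wj * xj for wj, xj in zip(w, x)) + b))

def describe(xi):
    """Qualitative position of an instance relative to its margin hyperplane."""
    if xi == 0.0:
        return "on or beyond its margin hyperplane"
    if xi < 1.0:
        return "inside the margin, correct side"
    return "on or across the separating hyperplane"

# Hypothetical hyperplane and instances, chosen only for illustration.
w, b = (0.5, 0.5), -2.0
instances = [((3.0, 3.0), 1), ((1.0, 1.0), -1), ((1.5, 1.5), -1), ((2.5, 2.0), -1)]

norm_w = math.sqrt(sum(wj * wj for wj in w))
for x, y in instances:
    xi = slack(w, b, x, y)
    # xi / ||w|| is the distance from x to the margin hyperplane of its class
    print(x, y, xi, xi / norm_w, describe(xi))
```

Only the last two instances need non-zero slack. The final one plays the role of a point that falls on the opposite side of the separating hyperplane from the rest of its class.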

In the presence of slack variables, it is important to learn a separating hyperplane that jointly maximizes the margin (ensuring good generalization performance) and minimizes the values of slack variables (ensuring low training error). This can be achieved by modifying the optimization problem of SVM as follows:

$$\min_{\mathbf{w}, b, \xi_i} \ \frac{\|\mathbf{w}\|^2}{2} + C\sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0. \tag{4.81}$$

where C is a hyper-parameter that makes a trade-off between maximizing the margin and minimizing the training error. A large value of C places more emphasis on minimizing the training error than on maximizing the margin. Notice the similarity of the previous equation with the generic formula of generalization error rate introduced in Section 3.4 of the previous chapter. Indeed, SVM provides a natural way to balance between model complexity and training error in order to maximize generalization performance.

To solve Equation 4.81 we apply the Lagrange multiplier method and convert the primal problem to its corresponding dual problem, similar to the approach described in the previous section. The Lagrangian primal problem of Equation 4.81 can be written as follows:

$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \lambda_i\left(y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\right) - \sum_{i=1}^{n} \mu_i \xi_i, \tag{4.82}$$

where $\lambda_i \ge 0$ and $\mu_i \ge 0$ are the Lagrange multipliers corresponding to the inequality constraints of Equation 4.81. Setting the derivative of $L_P$ with respect to $\mathbf{w}$, b, and $\xi_i$ equal to 0, we obtain the following equations:

$$\frac{\partial L_P}{\partial \mathbf{w}} = 0 \ \Rightarrow \ \mathbf{w} = \sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i. \tag{4.83}$$

$$\frac{\partial L_P}{\partial b} = 0 \ \Rightarrow \ \sum_{i=1}^{n} \lambda_i y_i = 0. \tag{4.84}$$

$$\frac{\partial L_P}{\partial \xi_i} = 0 \ \Rightarrow \ \lambda_i + \mu_i = C. \tag{4.85}$$

We can also obtain the complementary slackness conditions by using the following KKT conditions:

$$\lambda_i\left(y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\right) = 0, \tag{4.86}$$

$$\mu_i \xi_i = 0. \tag{4.87}$$

Equation 4.86 suggests that $\lambda_i$ is zero for all training instances except those that reside on the margin hyperplanes $\mathbf{w}^T\mathbf{x}_i + b = \pm 1$, or have $\xi_i > 0$. These instances with $\lambda_i > 0$ are known as support vectors. On the other hand, $\mu_i$ given in Equation 4.87 is zero for any training instance that is misclassified, i.e., $\xi_i > 0$. Further, $\lambda_i$ and $\mu_i$ are related with each other by Equation 4.85. This results in the following three configurations of $(\lambda_i, \mu_i)$:

1. If $\lambda_i = 0$ and $\mu_i = C$, then $\mathbf{x}_i$ does not reside on the margin hyperplanes and is correctly classified on the same side as other instances belonging to $y_i$.

2. If $\lambda_i = C$ and $\mu_i = 0$, then $\mathbf{x}_i$ is misclassified and has a non-zero slack variable $\xi_i$.

3. If $0 < \lambda_i < C$ and $0 < \mu_i < C$, then $\mathbf{x}_i$ resides on one of the margin hyperplanes.

Substituting Equations 4.83 to 4.87 into Equation 4.82, we obtain the following dual optimization problem:

$$\max_{\lambda_i} \ \sum_{i=1}^{n} \lambda_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \quad \text{subject to} \quad \sum_{i=1}^{n} \lambda_i y_i = 0, \ \ 0 \le \lambda_i \le C. \tag{4.88}$$

Notice that the previous problem looks almost identical to the dual problem of SVM for the linearly separable case (Equation 4.77), except that $\lambda_i$ is

required to be not only non-negative but also no larger than a constant value C. Clearly, as C approaches infinity, the previous optimization problem becomes equivalent to Equation 4.77, where the learned hyperplane perfectly separates the classes (with no training errors). However, by capping the values of $\lambda_i$ to C, the learned hyperplane is able to tolerate a few training errors that have $\xi_i > 0$.

Figure 4.37.
Hinge loss as a function of $y\hat{y}$.

As before, Equation 4.88 can be solved by using any of the standard solvers for QPP, and the optimal value of $\mathbf{w}$ can be obtained by using Equation 4.83. To solve for b, we can use Equation 4.86 on the support vectors that reside on the margin hyperplanes as follows:

$$b = \frac{1}{n_S} \sum_{i \in S} \frac{1 - y_i \mathbf{w}^T\mathbf{x}_i}{y_i} \tag{4.89}$$

where S represents the set of support vectors residing on the margin hyperplanes $(S = \{i \mid 0 < \lambda_i < C\})$ and $n_S$ is the number of elements in S.

SVM as a Regularizer of Hinge Loss

SVM belongs to a broad class of regularization techniques that use a loss function to represent the training errors and a norm of the model parameters to represent the model complexity. To realize this, notice that the slack variable $\xi$, used for measuring the training errors in SVM, is equivalent to the hinge loss function, which can be defined as follows:

$$\text{Loss}(y, \hat{y}) = \max(0, 1 - y\hat{y}),$$

where $y \in \{+1, -1\}$. In the case of SVM, $\hat{y}$ corresponds to $\mathbf{w}^T\mathbf{x} + b$. Figure 4.37 shows a plot of the hinge loss function as we vary $y\hat{y}$. We can see that the hinge loss is equal to 0 as long as y and $\hat{y}$ have the same sign and $|\hat{y}| \ge 1$. However, the hinge loss is positive whenever y and $\hat{y}$ are of the opposite sign or $|\hat{y}| < 1$, and it grows linearly as $y\hat{y}$ decreases below 1. This is similar to the notion of the slack variable $\xi$, which is used to measure the distance of a point from its margin hyperplane. Hence, the optimization problem of SVM can be represented in the following equivalent form:

$$\min_{\mathbf{w}, b} \ \frac{\|\mathbf{w}\|^2}{2} + C\sum_{i=1}^{n} \text{Loss}(y_i, \mathbf{w}^T\mathbf{x}_i + b) \tag{4.90}$$

Note that using the hinge loss ensures that the optimization problem is convex and can be solved using standard optimization techniques. However, if we use a different loss function, such as the squared loss function that was introduced in Section 4.7 on ANN, it will result in a different optimization problem that may or may not remain convex. Nevertheless, different loss functions can be explored to capture varying notions of training error, depending on the characteristics of the problem.
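The role of C in Equation 4.90 can be illustrated by evaluating the objective for two fixed candidate hyperplanes. This is only a sketch: it compares two hand-picked candidates rather than solving the optimization problem, and the data, the candidates `wide` and `narrow`, and the function names are hypothetical.

```python
def hinge_loss(y, y_hat):
    """Hinge loss max(0, 1 - y * y_hat) for a label y in {+1, -1}."""
    return max(0.0, 1.0 - y * y_hat)

def svm_objective(w, b, points, labels, C):
    """Value of Equation 4.90: ||w||^2 / 2 + C * (sum of hinge losses)."""
    complexity = 0.5 * sum(wj * wj for wj in w)
    training_loss = sum(
        hinge_loss(y, sum(wj * xj for wj, xj in zip(w, x)) + b)
        for x, y in zip(points, labels)
    )
    return complexity + C * training_loss

# Hypothetical data: one square (+1), one circle (-1), and a second circle
# that lies close to the square, as a noisy instance might.
points = [(3.0, 3.0), (1.0, 1.0), (2.5, 2.5)]
labels = [1, -1, -1]

wide = ((0.5, 0.5), -2.0)     # small ||w||, but the noisy circle violates its constraint
narrow = ((2.0, 2.0), -11.0)  # large ||w||, every constraint is satisfied

for C in (1.0, 10.0):
    print(C, svm_objective(*wide, points, labels, C),
          svm_objective(*narrow, points, labels, C))
```

With the smaller value of C the wide-margin candidate has the lower objective despite its training error, while with the larger value the mistake-free but narrow candidate is preferred, which matches the trade-off described for C.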

Another interesting property of SVM that relates it to a broader class of regularization techniques is the concept of a margin. Although minimizing $\|\mathbf{w}\|^2$ has the geometric interpretation of maximizing the margin of a separating hyperplane, it is essentially the squared $L_2$ norm of the model parameters, $\|\mathbf{w}\|_2^2$. In general, the $L_q$ norm of $\mathbf{w}$, $\|\mathbf{w}\|_q$, is equal to the Minkowski distance of order q from $\mathbf{w}$ to the origin, i.e.,

$$\|\mathbf{w}\|_q = \left(\sum_{i}^{p} |w_i|^q\right)^{1/q}$$

Minimizing the $L_q$ norm of $\mathbf{w}$ to achieve lower model complexity is a generic regularization concept that has several interpretations. For example, minimizing the $L_2$ norm amounts to finding a solution on a hypersphere of smallest radius that shows suitable training performance. To visualize this in two-dimensions, Figure 4.38(a) shows the plot of a circle with constant radius r, where every point has the same $L_2$ norm. On the other hand, using the $L_1$ norm ensures that the solution lies on the surface of a hypercube with smallest size, with vertices along the axes. This is illustrated in Figure 4.38(b) as a square with vertices on the axes at a distance of r from the origin. The $L_1$ norm is commonly used as a regularizer to obtain sparse model parameters with only a small number of non-zero parameter values, such as the use of Lasso in regression problems (see Bibliographic Notes).
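As a small numerical illustration of the norm definition, the sketch below evaluates the $L_1$ and $L_2$ norms of a few two-dimensional parameter vectors. The function name `lq_norm` and the example vectors are assumptions made for this illustration.

```python
def lq_norm(w, q):
    """L_q norm of w: the q-th root of the sum of |w_i|^q."""
    return sum(abs(wi) ** q for wi in w) ** (1.0 / q)

# Three points on the square with vertices on the axes at distance r = 1:
# they share the same L1 norm but not the same L2 norm.
for w in [(1.0, 0.0), (0.5, 0.5), (0.25, -0.75)]:
    print(w, lq_norm(w, 1), lq_norm(w, 2))
```

Points with equal $L_1$ norm trace out the square described for the two-dimensional case, while their $L_2$ norms vary, so the two norms rank candidate solutions differently.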

Figure 4.38.
Plots showing the behavior of two-dimensional solutions with constant $L_2$ and $L_1$ norms.

In general, depending on the characteristics of the problem, different combinations of $L_q$ norms and training loss functions can be used for learning the model parameters, each requiring a different optimization solver. This forms the backbone of a wide range of modeling techniques that attempt to improve the generalization performance by jointly minimizing training error and model complexity. However, in this section, we focus only on the squared $L_2$ norm and the hinge loss function, resulting in the classical formulation of SVM.

4.9.4 Nonlinear SVM

The SVM formulations described in the previous sections construct a linear

decision boundary to separate the training examples into their respective

classes. This section presents a methodology for applying SVM to data sets

that have nonlinear decision boundaries. The basic idea is to transform the data from its original attribute space in $\mathbf{x}$ into a new space $\varphi(\mathbf{x})$ so that a linear hyperplane can be used to separate the instances in the transformed space, using the SVM approach. The learned hyperplane can then be projected back to the original attribute space, resulting in a nonlinear decision boundary.

Figure 4.39.
Classifying data with a nonlinear decision boundary.

Attribute Transformation

To illustrate how attribute transformation can lead to a linear decision boundary, Figure 4.39(a) shows an example of a two-dimensional data set consisting of squares (classified as $y = 1$) and circles (classified as $y = -1$). The

data set is generated in such a way that all the circles are clustered near the center of the diagram and all the squares are distributed farther away from the center. Instances of the data set can be classified using the following equation:

$$y = \begin{cases} 1 & \text{if } \sqrt{(x_1 - 0.5)^2 + (x_2 - 0.5)^2} > 0.2, \\ -1 & \text{otherwise.} \end{cases} \tag{4.91}$$

The decision boundary for the data can therefore be written as follows:

$$\sqrt{(x_1 - 0.5)^2 + (x_2 - 0.5)^2} = 0.2,$$

which can be further simplified into the following quadratic equation:

$$x_1^2 - x_1 + x_2^2 - x_2 = -0.46.$$

A nonlinear transformation $\varphi$ is needed to map the data from its original attribute space into a new space such that a linear hyperplane can separate the classes. This can be achieved by using the following simple transformation:

$$\varphi : (x_1, x_2) \to (x_1^2 - x_1, \ x_2^2 - x_2). \tag{4.92}$$

Figure 4.39(b) shows the points in the transformed space, where we can see that all the circles are located in the lower left-hand side of the diagram. A linear hyperplane with parameters $\mathbf{w}$ and b can therefore be constructed in the transformed space, to separate the instances into their respective classes.

One may think that because the nonlinear transformation possibly increases the dimensionality of the input space, this approach can suffer from the curse of dimensionality that is often associated with high-dimensional data.

However, as we will see in the following section, nonlinear SVM is able to

avoid this problem by using kernel functions.
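The effect of the attribute transformation in Equation 4.92 can be verified numerically. The sketch below assumes that the threshold 0.2 in Equation 4.91 is a radius, i.e., a bound on the distance from (0.5, 0.5), which is the reading under which the constant $-0.46$ in the simplified boundary follows. It labels a grid of points with the nonlinear rule and checks that a linear rule in the transformed space reproduces every label. The grid and the function names are hypothetical.

```python
import math

def true_label(x1, x2):
    """Equation 4.91: +1 if the distance from (0.5, 0.5) exceeds 0.2, else -1."""
    return 1 if math.hypot(x1 - 0.5, x2 - 0.5) > 0.2 else -1

def phi(x1, x2):
    """Equation 4.92: map (x1, x2) to (x1^2 - x1, x2^2 - x2)."""
    return (x1 * x1 - x1, x2 * x2 - x2)

def linear_label(z1, z2):
    """Linear rule in the transformed space: sign of z1 + z2 + 0.46."""
    return 1 if z1 + z2 + 0.46 > 0 else -1

# Hypothetical evaluation grid on the unit square, offset by half a step so
# that no grid point lies exactly on the decision boundary.
grid = [((2 * i + 1) / 20, (2 * j + 1) / 20) for i in range(10) for j in range(10)]

agree = all(true_label(*p) == linear_label(*phi(*p)) for p in grid)
n_circles = sum(1 for p in grid if true_label(*p) == -1)
print(agree, n_circles)
```

The linear rule uses the hyperplane $z_1 + z_2 + 0.46 = 0$ in the transformed coordinates $(z_1, z_2)$, which is the quadratic boundary rewritten after the transformation.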

Learning a Nonlinear SVM Model

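Using a suitable function, $\varphi(\cdot)$, we can transform any data instance $\mathbf{x}$ to $\varphi(\mathbf{x})$. (The details on how to choose $\varphi(\cdot)$ will become clear later.) The linear hyperplane in the transformed space can be expressed as $\mathbf{w}^T\varphi(\mathbf{x}) + b = 0$. To learn the optimal separating hyperplane, we can substitute $\varphi(\mathbf{x})$ for $\mathbf{x}$ in the formulation of SVM to obtain the following optimization problem:

$$\min_{\mathbf{w}, b, \xi_i} \ \frac{\|\mathbf{w}\|^2}{2} + C\sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(\mathbf{w}^T\varphi(\mathbf{x}_i) + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0. \tag{4.93}$$

Using Lagrange multipliers $\lambda_i$, this can be converted into a dual optimization problem:

$$\max_{\lambda_i} \ \sum_{i=1}^{n} \lambda_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \langle \varphi(\mathbf{x}_i), \varphi(\mathbf{x}_j) \rangle \quad \text{subject to} \quad \sum_{i=1}^{n} \lambda_i y_i = 0, \ \ 0 \le \lambda_i \le C, \tag{4.94}$$

where $\langle \mathbf{a}, \mathbf{b} \rangle$ denotes the inner product between vectors $\mathbf{a}$ and $\mathbf{b}$. Also, the equation of the hyperplane in the transformed space can be represented using $\lambda_i$ as follows:

$$\sum_{i=1}^{n} \lambda_i y_i \langle \varphi(\mathbf{x}_i), \varphi(\mathbf{x}) \rangle + b = 0. \tag{4.95}$$

Further, b is given by

$$b = \frac{1}{n_S}\left(\sum_{i \in S} \frac{1}{y_i} - \sum_{i \in S} \frac{\sum_{j=1}^{n} \lambda_j y_i y_j \langle \varphi(\mathbf{x}_i), \varphi(\mathbf{x}_j) \rangle}{y_i}\right), \tag{4.96}$$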

where $S = \{i \mid 0 < \lambda_i < C\}$ is the set of support vectors residing on the margin hyperplanes and $n_S$ is the number of elements in S.

Note that in order to solve the dual optimization problem in Equation 4.94, or to use the learned model parameters to make predictions using Equations 4.95 and 4.96, we need only inner products of $\varphi(\mathbf{x})$. Hence, even though $\varphi(\mathbf{x})$ may be nonlinear and high-dimensional, it suffices to use a function of the inner products of $\varphi(\mathbf{x})$ in the transformed space. This can be achieved by using a kernel trick, which can be described as follows.

The inner product between two vectors is often regarded as a measure of similarity between the vectors. For example, the cosine similarity described in Section 2.4.5 on page 79 can be defined as the dot product between two vectors that are normalized to unit length. Analogously, the inner product $\langle \varphi(\mathbf{x}_i), \varphi(\mathbf{x}_j) \rangle$ can also be regarded as a measure of similarity between two instances, $\mathbf{x}_i$ and $\mathbf{x}_j$, in the transformed space. The kernel trick is a method for computing this similarity as a function of the original attributes. Specifically, the kernel function $K(\mathbf{u}, \mathbf{v})$ between two instances $\mathbf{u}$ and $\mathbf{v}$ can be defined as follows:

$$K(\mathbf{u}, \mathbf{v}) = \langle \varphi(\mathbf{u}), \varphi(\mathbf{v}) \rangle = f(\mathbf{u}, \mathbf{v}) \tag{4.97}$$

where $f(\cdot)$ is a function that follows certain conditions as stated by Mercer's theorem. Although the details of this theorem are outside the scope of the book, we provide a list of some of the commonly used kernel functions:

Polynomial kernel: $$K(\mathbf{u}, \mathbf{v}) = (\mathbf{u}^T\mathbf{v} + 1)^p \tag{4.98}$$

Radial Basis Function kernel: $$K(\mathbf{u}, \mathbf{v}) = e^{-\|\mathbf{u} - \mathbf{v}\|^2 / (2\sigma^2)} \tag{4.99}$$

Sigmoid kernel: $$K(\mathbf{u}, \mathbf{v}) = \tanh(k\,\mathbf{u}^T\mathbf{v} - \delta) \tag{4.100}$$
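A small check makes the kernel trick tangible. For two-dimensional inputs and $p = 2$, the polynomial kernel of Equation 4.98 equals the inner product under one particular six-dimensional feature map. That explicit map (`phi_poly2` below) is a standard construction assumed here for illustration, and it is not given in the text. The sketch compares the two computations and also evaluates the RBF kernel of Equation 4.99.

```python
import math

def poly_kernel(u, v, p=2):
    """Polynomial kernel of Equation 4.98: (u.v + 1)^p."""
    return (sum(ui * vi for ui, vi in zip(u, v)) + 1) ** p

def phi_poly2(x):
    """Explicit feature map for 2-D inputs whose inner product is (u.v + 1)^2."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

def rbf_kernel(u, v, sigma=1.0):
    """RBF kernel of Equation 4.99: exp(-||u - v||^2 / (2 sigma^2))."""
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

u, v = (1.0, 2.0), (3.0, -1.0)
via_kernel = poly_kernel(u, v)  # computed in the original 2-D space
via_features = sum(a * b for a, b in zip(phi_poly2(u), phi_poly2(v)))  # in 6-D
print(via_kernel, via_features, rbf_kernel(u, v))
```

The kernel needs only one inner product in the original space, whereas the explicit route first builds six-dimensional vectors. The gap widens quickly as p or the number of attributes grows, and for the RBF kernel the corresponding feature space is infinite-dimensional.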

By using a kernel function, we can directly work with inner products in the transformed space without dealing with the exact forms of the nonlinear transformation function $\varphi$. Specifically, this allows us to use high-dimensional transformations (sometimes even involving infinitely many dimensions), while performing calculations only in the original attribute space. Computing the inner products using kernel functions is also considerably cheaper than using the transformed attribute set $\varphi(\mathbf{x})$. Hence, the use of kernel functions provides a significant advantage in representing nonlinear decision boundaries, without suffering from the curse of dimensionality. This has been one of the major reasons behind the widespread usage of SVM in highly complex and nonlinear problems.

Figure 4.40.
Decision boundary produced by a nonlinear SVM with polynomial kernel.

Figure 4.40 shows the nonlinear decision boundary obtained by SVM using the polynomial kernel function given in Equation 4.98. We can see that the

learned decision boundary is quite close to the true decision boundary shown

in Figure 4.39(a) . Although the choice of kernel function depends on the

characteristics of the input data, a commonly used kernel function is the radial

basis function (RBF) kernel, which involves a single hyper-parameter $\sigma$, known as the standard deviation of the RBF kernel.

4.9.5 Characteristics of SVM

1. The SVM learning problem can be formulated as a convex optimization problem, for which efficient algorithms are available to find the global minimum of the objective function. Other classification methods, such as rule-based classifiers and artificial neural networks, employ a greedy strategy to search the hypothesis space. Such methods tend to find only locally optimum solutions.

2. SVM provides an effective way of regularizing the model parameters by maximizing the margin of the decision boundary. Furthermore, it is able to create a balance between model complexity and training errors by using a hyper-parameter C. This trade-off is generic to a broader class of model learning techniques that capture the model complexity and the training loss using different formulations.

3. Linear SVM can handle irrelevant attributes by learning zero weights corresponding to such attributes. It can also handle redundant attributes by learning similar weights for the duplicate attributes. Furthermore, the ability of SVM to regularize its learning makes it more robust to the presence of a large number of irrelevant and redundant attributes than other classifiers, even in high-dimensional settings. For this reason, nonlinear SVMs are less impacted by irrelevant and redundant attributes than other highly expressive classifiers that can learn nonlinear decision boundaries such as decision trees.

To compare the effect of irrelevant attributes on the performance of nonlinear SVMs and decision trees, consider the two-dimensional data set shown in Figure 4.41(a) containing 500 + and 500 o instances, where the two classes can be easily separated using a nonlinear decision boundary. We incrementally add irrelevant attributes to this data set and compare the performance of two classifiers: decision tree and nonlinear SVM (using radial basis function kernel), using 70% of the data for training and the rest for testing. Figure 4.41(b) shows the test error rates of the two classifiers as we increase the number of irrelevant attributes. We can see that the test error rate of decision trees swiftly reaches 0.5 (same as random guessing) in the presence of even a small number of irrelevant attributes. This can be attributed to the problem of multiple comparisons while choosing splitting attributes at internal nodes as discussed in Example 3.7 of the previous chapter. On the other hand, nonlinear SVM shows a more robust and steady performance even after adding a moderately large number of irrelevant attributes. Its test error rate gradually increases and eventually reaches close to 0.5 after adding 125 irrelevant attributes, at which point it becomes difficult to discern the discriminative information in the original two attributes from the noise in the remaining attributes for learning nonlinear decision boundaries.

Figure 4.41.

Comparing the effect of adding irrelevant attributes on the performance

of nonlinear SVMs and decision trees.

4. SVM can be applied to categorical data by introducing dummy variables for each categorical attribute value present in the data. For example, if an attribute such as marital status has three values {Single, Married, Divorced}, we can introduce a binary variable for each of the attribute values.

5. The SVM formulation presented in this chapter is for binary class problems. However, multiclass extensions of SVM have also been proposed.

6. Although the training time of an SVM model can be large, the learned parameters can be succinctly represented with the help of a small number of support vectors, making the classification of test instances quite fast.
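The dummy-variable encoding mentioned for categorical data in the list above can be sketched in a few lines. The function name `one_hot` and the sample values are assumptions made for illustration. Each categorical value becomes a vector with one binary variable per attribute value, which can then be used alongside numeric attributes as SVM input.

```python
def one_hot(value, categories):
    """Return one binary dummy variable per category, with a 1 for `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

categories = ["Single", "Married", "Divorced"]
encoded = [one_hot(v, categories) for v in ["Married", "Single", "Divorced"]]
print(encoded)
```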

4.10 Ensemble Methods

This section presents techniques for improving classification accuracy by

aggregating the predictions of multiple classifiers. These techniques are

known as ensemble or classifier combination methods. An ensemble

method constructs a set of base classifiers from training data and performs

classification by taking a vote on the predictions made by each base classifier.

This section explains why ensemble methods tend to perform better than any

single classifier and presents techniques for constructing the classifier

ensemble.

4.10.1 Rationale for Ensemble Method

The following example illustrates how an ensemble method can improve a

classifier’s performance.

Example 4.8.

Consider an ensemble of 25 binary classifiers, each of which has an error rate of $\epsilon = 0.35$. The ensemble classifier predicts the class label of a test example by taking a majority vote on the predictions made by the base classifiers. If the base classifiers are identical, then all the base classifiers will commit the same mistakes. Thus, the error rate of the ensemble remains 0.35. On the other hand, if the base classifiers are independent (i.e., their errors are uncorrelated), then the ensemble makes a wrong prediction only if more than half of the base classifiers predict incorrectly. In this case, the error rate of the ensemble classifier is

which is considerably lower than the error rate of the base classifiers.

Figure 4.42 shows the error rate of an ensemble of 25 binary classifiers

for different base classifier error rates . The diagonal line

represents the case in which the base classifiers are identical, while the solid

line represents the case in which the base classifiers are independent.

Observe that the ensemble classifier performs worse than the base classifiers

when is larger than 0.5.

The preceding example illustrates two necessary conditions for an ensemble

classifier to perform better than a single classifier: (1) the base classifiers

should be independent of each other, and (2) the base classifiers should do

better than a classifier that performs random guessing. In practice, it is difficult

to ensure total independence among the base classifiers. Nevertheless,

improvements in classification accuracies have been observed in ensemble

methods in which the base classifiers are somewhat correlated.

4.10.2 Methods for Constructing an

Ensemble Classifier

A logical view of the ensemble method is presented in Figure 4.43 . The

basic idea is to construct multiple classifiers from the original data and then

aggregate their predictions when classifying unknown examples. The

ensemble of classifiers can be constructed in many ways:

eensemble=∑i=1325(25i)∈i(1−∈)25−i=0.06, (4.101)

(eensemble) (∈)

∈

1. By manipulating the training set. In this approach, multiple training

sets are created by resampling the original data according to some

sampling distribution and constructing a classifier from each training

set. The sampling distribution determines how likely it is that an

example will be selected for training, and it may vary from one trial to

another. Bagging and boosting are two examples of ensemble

methods that manipulate their training sets. These methods are

described in further detail in Sections 4.10.4 and 4.10.5.

Figure 4.42.

Comparison between errors of base classifiers and errors of the

ensemble classifier.

Figure 4.43.

A logical view of the ensemble learning method.

2. By manipulating the input features. In this approach, a subset of

input features is chosen to form each training set. The subset can be

either chosen randomly or based on the recommendation of domain

experts. Some studies have shown that this approach works very well

with data sets that contain highly redundant features. Random forest,

which is described in Section 4.10.6, is an ensemble method that

manipulates its input features and uses decision trees as its base

classifiers.

3. By manipulating the class labels. This method can be used when the
number of classes is sufficiently large. The training data is transformed
into a binary class problem by randomly partitioning the class labels
into two disjoint subsets, $A_0$ and $A_1$. Training examples whose class
label belongs to the subset $A_0$ are assigned to class 0, while those that
belong to the subset $A_1$ are assigned to class 1. The relabeled
examples are then used to train a base classifier. By repeating this
process multiple times, an ensemble of base classifiers is obtained.
When a test example is presented, each base classifier $C_i$ is used to
predict its class label. If the test example is predicted as class 0, then
all the classes that belong to $A_0$ will receive a vote. Conversely, if it is
predicted to be class 1, then all the classes that belong to $A_1$ will
receive a vote. The votes are tallied and the class that receives the
highest vote is assigned to the test example. An example of this
approach is the error-correcting output coding method described on
page 331.

4. By manipulating the learning algorithm. Many learning algorithms
can be manipulated in such a way that applying the algorithm several
times on the same training data will result in the construction of
different classifiers. For example, an artificial neural network can
change its network topology or the initial weights of the links between
neurons. Similarly, an ensemble of decision trees can be constructed
by injecting randomness into the tree-growing procedure. For example,
instead of choosing the best splitting attribute at each node, we can
randomly choose one of the top k attributes for splitting.

The first three approaches are generic methods that are applicable to any
classifier, whereas the fourth approach depends on the type of classifier used.
The base classifiers for most of these approaches can be generated
sequentially (one after another) or in parallel (all at once). Once an ensemble
of classifiers has been learned, a test example $x$ is classified by combining
the predictions made by the base classifiers $C_i(x)$:

$$C^*(x) = f(C_1(x), C_2(x), \ldots, C_k(x)),$$

where f is the function that combines the ensemble responses. One simple
approach for obtaining $C^*(x)$ is to take a majority vote of the individual
predictions. An alternative approach is to take a weighted majority vote, where
the weight of a base classifier denotes its accuracy or relevance.
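
As a rough illustration of the combination function f, the sketch below implements a plain majority vote and a weighted vote. It also evaluates the binomial sum of Equation 4.101 under the independence assumption of Example 4.8. The function names are assumptions introduced here for illustration.

```python
from collections import Counter
from math import comb


def majority_vote(predictions):
    """Return the label predicted by the largest number of base classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Return the label whose supporting classifiers have the largest total weight."""
    totals = {}
    for label, w in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)


def independent_ensemble_error(eps, k=25):
    """Probability that more than half of k independent base classifiers,
    each with error rate eps, are wrong at the same time."""
    first = k // 2 + 1                       # 13 wrong votes when k = 25
    return sum(comb(k, i) * eps**i * (1 - eps) ** (k - i)
               for i in range(first, k + 1))


print(majority_vote([+1, -1, +1]))
print(weighted_vote(["a", "b", "b"], [0.9, 0.3, 0.3]))
for eps in (0.35, 0.5, 0.6):
    print(eps, round(independent_ensemble_error(eps), 3))
```

The loop over `eps` mirrors the two conditions stated after Example 4.8: the ensemble helps only while the base error rate stays below 0.5, and the calculation is valid only to the extent that the base classifiers really are independent.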

Ensemble methods show the most improvement when used with unstable

classifiers, i.e., base classifiers that are sensitive to minor perturbations in

the training set, because of high model complexity. Although unstable

classifiers may have a low bias in finding the optimal decision boundary, their

predictions have a high variance for minor changes in the training set or

model selection. This trade-off between bias and variance is discussed in

detail in the next section. By aggregating the responses of multiple unstable

classifiers, ensemble learning attempts to minimize their variance without

worsening their bias.

4.10.3 Bias-Variance Decomposition

Bias-variance decomposition is a formal method for analyzing the

generalization error of a predictive model. Although the analysis is slightly

different for classification than regression, we first discuss the basic intuition of

this decomposition by using an analogue of a regression problem.

Consider the illustrative task of reaching a target $y$ by firing projectiles from a
starting position $x$, as shown in Figure 4.44. The target corresponds to the
desired output at a test instance, while the starting position corresponds to its
observed attributes. In this analogy, the projectile represents the model used
for predicting the target using the observed attributes. Let $\hat{y}$ denote the point
where the projectile hits the ground, which is analogous to the prediction of
the model.

Figure 4.44.

Bias-variance decomposition.

Ideally, we would like our predictions to be as close to the true target as
possible. However, note that different trajectories of projectiles are possible
based on differences in the training data or in the approach used for model
selection. Hence, we can observe a variance in the predictions $\hat{y}$ over
different runs of the projectile. Further, the target in our example is not fixed but
has some freedom to move around, resulting in a noise component in the true
target. This can be understood as the non-deterministic nature of the output
variable, where the same set of attributes can have different output values.
Let $\hat{y}_{avg}$ represent the average prediction of the projectile over multiple runs,
and $y_{avg}$ denote the average target value. The difference between $\hat{y}_{avg}$ and
$y_{avg}$ is known as the bias of the model.

In the context of classification, it can be shown that the generalization error of
a classification model m can be decomposed into terms involving the bias,
variance, and noise components of the model in the following way:

$$\text{gen.error}(m) = c_1 \times \text{noise} + \text{bias}(m) + c_2 \times \text{variance}(m),$$

where $c_1$ and $c_2$ are constants that depend on the characteristics of training
and test sets. Note that while the noise term is intrinsic to the target class, the
bias and variance terms depend on the choice of the classification model. The
bias of a model represents how close the average prediction of the model is to
the average target. Models that are able to learn complex decision
boundaries, e.g., models produced by k-nearest neighbor and multi-layer
ANN, generally show low bias. The variance of a model captures the stability
of its predictions in response to minor perturbations in the training set or the
model selection approach.

We can say that a model shows better generalization performance if it has a

lower bias and lower variance. However, if the complexity of a model is high

but the training size is small, we generally expect to see a lower bias but

higher variance, resulting in the phenomenon of overfitting. This phenomenon is

pictorially represented in Figure 4.45(a). On the other hand, an overly

simplistic model that suffers from underfitting may show a lower variance but

would suffer from a high bias, as shown in Figure 4.45(b). Hence, the

trade-off between bias and variance provides a useful way for interpreting the

effects of underfitting and overfitting on the generalization performance of a

model.

Figure 4.45.

Plots showing the behavior of two-dimensional solutions with constant $L_2$ and
$L_1$ norms.

The bias-variance trade-off can be used to explain why ensemble learning
improves the generalization performance of unstable classifiers. If a base
classifier shows low bias but high variance, it can become susceptible to
overfitting, as even a small change in the training set will result in different
predictions. However, by combining the responses of multiple base classifiers,
we can expect to reduce the overall variance. Hence, ensemble learning
methods show better performance primarily by lowering the variance in the
predictions, although they can even help in reducing the bias. One of the
simplest approaches for combining predictions and reducing their variance is
to compute their average. This forms the basis of the bagging method,
described in the following subsection.
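
The variance-reduction argument can be illustrated with a small simulation. This is a sketch under a strong assumption: each base prediction is modeled as an unbiased, independent Gaussian draw around a hypothetical true value of 1.0, which real base classifiers only approximate. Correlated base classifiers would show a smaller reduction.

```python
import random
import statistics


def noisy_prediction(rng, true_value=1.0, noise_sd=1.0):
    """One unstable base prediction: unbiased, but with high variance."""
    return rng.gauss(true_value, noise_sd)


def averaged_prediction(rng, k):
    """Average of k independent base predictions."""
    return statistics.fmean([noisy_prediction(rng) for _ in range(k)])


rng = random.Random(0)
single = [noisy_prediction(rng) for _ in range(5000)]
ensemble = [averaged_prediction(rng, k=25) for _ in range(5000)]

var_single = statistics.pvariance(single)
var_ensemble = statistics.pvariance(ensemble)
print("variance of a single prediction:", round(var_single, 3))
print("variance of a 25-way average:   ", round(var_ensemble, 3))
print("mean of the averaged prediction:", round(statistics.fmean(ensemble), 3))
```

Under these assumptions the average stays centered on the same value (the bias is unchanged) while its spread shrinks roughly in proportion to 1/k.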

4.10.4 Bagging

Bagging, which is also known as bootstrap aggregating, is a technique that
repeatedly samples (with replacement) from a data set according to a uniform
probability distribution. Each bootstrap sample has the same size as the
original data. Because the sampling is done with replacement, some
instances may appear several times in the same training set, while others may
be omitted from the training set. On average, a bootstrap sample $D_i$ contains
approximately 63% of the original training data because each instance has a
probability $1 - (1 - 1/N)^N$ of being selected in each $D_i$. If N is sufficiently large,
this probability converges to $1 - 1/e \simeq 0.632$. The basic procedure for bagging is
summarized in Algorithm 4.5. After training the k classifiers, a test
instance is assigned to the class that receives the highest number of votes.

To illustrate how bagging works, consider the data set shown in Table 4.4.
Let x denote a one-dimensional attribute and y denote the class label.
Suppose we use only one-level binary decision trees, with a test condition
$x \le k$, where k is a split point chosen to minimize the entropy of the leaf nodes.
Such a tree is also known as a decision stump.

Table 4.4. Example of data set used to construct an ensemble of bagging
classifiers.

x   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1
y     1     1     1    −1    −1    −1    −1     1     1   1

Without bagging, the best decision stump we can produce splits the instances
at either $x \le 0.35$ or $x \le 0.75$. Either way, the accuracy of the tree is at most 70%.
Suppose we apply the bagging procedure on the data set using 10 bootstrap
samples. The examples chosen for training in each bagging round are shown
in Figure 4.46. On the right-hand side of each table, we also describe the
decision stump being used in each round.
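
The 63% figure can be checked numerically. The sketch below evaluates $1 - (1 - 1/N)^N$ for a few values of N and compares it with the fraction of distinct instances in one simulated bootstrap sample. The function names and the chosen values of N are illustrative assumptions.

```python
import math
import random


def inclusion_probability(n):
    """Chance that a given instance appears at least once in a
    bootstrap sample of size n drawn from n instances."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def unique_fraction(n, rng):
    """Fraction of distinct instances in one simulated bootstrap sample."""
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n


rng = random.Random(42)
for n in (10, 100, 10_000):
    print(n, round(inclusion_probability(n), 4), round(unique_fraction(n, rng), 4))
print("limit 1 - 1/e:", round(1.0 - 1.0 / math.e, 4))
```

For small N the probability is slightly above the limit, so the 63% rule of thumb is most accurate for large training sets.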

We classify the entire data set given in Table 4.4 by taking a majority vote
among the predictions made by each base classifier. The results of the
predictions are shown in Figure 4.47. Since the class labels are either $-1$
or $+1$, taking the majority vote is equivalent to summing up the predicted
values of y and examining the sign of the resulting sum (refer to the second to
last row in Figure 4.47). Notice that the ensemble classifier perfectly
classifies all 10 examples in the original data.

Algorithm 4.5 Bagging algorithm.
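
A minimal Python sketch of bagging with decision stumps on the data of Table 4.4 follows. It is not the book's pseudocode, and it makes three assumptions. The stump minimizes the number of training errors rather than entropy. Candidate thresholds are the observed x values, so `x <= 0.3` plays the role of the split $x \le 0.35$. A tied vote is resolved in favor of +1. The ensemble's accuracy depends on the random bootstrap draws, so no particular value is claimed for it.

```python
import random

X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]


def fit_stump(xs, ys):
    """Return (threshold, left_label, right_label) with the fewest training errors."""
    best = None
    for t in sorted(set(xs)):
        for left, right in ((1, -1), (-1, 1)):
            errors = sum((left if x <= t else right) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    return best[1:]


def stump_predict(stump, x):
    t, left, right = stump
    return left if x <= t else right


def bagging_fit(xs, ys, k, rng):
    """Train k stumps, each on a bootstrap sample of the same size as the data."""
    n = len(xs)
    stumps = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def bagging_predict(stumps, x):
    """Majority vote for labels in {-1, +1}: the sign of the summed predictions."""
    total = sum(stump_predict(s, x) for s in stumps)
    return 1 if total >= 0 else -1


single = fit_stump(X, Y)
single_accuracy = sum(stump_predict(single, x) == y for x, y in zip(X, Y)) / len(Y)

stumps = bagging_fit(X, Y, k=10, rng=random.Random(0))
predictions = [bagging_predict(stumps, x) for x in X]
accuracy = sum(p == y for p, y in zip(predictions, Y)) / len(Y)
print("single stump:", single, single_accuracy)
print("bagged stumps accuracy:", accuracy)
```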

Figure 4.46.

Example of bagging.

The preceding example illustrates another advantage of using ensemble

methods in terms of enhancing the representation of the target function. Even

though each base classifier is a decision stump, combining the classifiers can

lead to a decision boundary that mimics a decision tree of depth 2.

Bagging improves generalization error by reducing the variance of the base

classifiers. The performance of bagging depends on the stability of the base

classifier. If a base classifier is unstable, bagging helps to reduce the errors

associated with random fluctuations in the training data. If a base classifier is

stable, i.e., robust to minor perturbations in the training set, then the error of

the ensemble is primarily caused by bias in the base classifier. In this

situation, bagging may not be able to improve the performance of the base

classifiers significantly. It may even degrade the classifier’s performance

because the effective size of each training set is about 37% smaller than the

original data.

Figure 4.47.

Example of combining classifiers constructed using the bagging approach.

4.10.5 Boosting

Boosting is an iterative procedure used to adaptively change the distribution of

training examples for learning base classifiers so that they increasingly focus

on examples that are hard to classify. Unlike bagging, boosting assigns a

weight to each training example and may adaptively change the weight at the

end of each boosting round. The weights assigned to the training examples

can be used in the following ways:

1. They can be used to inform the sampling distribution used to draw a set

of bootstrap samples from the original data.

2. They can be used to learn a model that is biased toward examples with

higher weight.

This section describes an algorithm that uses weights of examples to

determine the sampling distribution of its training set. Initially, the examples

are assigned equal weights, 1/N, so that they are equally likely to be chosen

for training. A sample is drawn according to the sampling distribution of the

training examples to obtain a new training set. Next, a classifier is built from

the training set and used to classify all the examples in the original data. The

weights of the training examples are updated at the end of each boosting

round. Examples that are classified incorrectly will have their weights

increased, while those that are classified correctly will have their weights

decreased. This forces the classifier to focus on examples that are difficult to

classify in subsequent iterations.
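
The sampling step described above can be sketched with the standard library. The example starts from uniform weights 1/N, then applies a purely hypothetical update in which one example's weight is multiplied by 5 and the weights are renormalized. The factor 5 and the chosen index are illustrative only, since the actual update rule depends on the boosting algorithm.

```python
import random


def draw_weighted_sample(weights, rng):
    """Draw len(weights) example indices with replacement,
    each index chosen with probability proportional to its weight."""
    n = len(weights)
    return rng.choices(range(n), weights=weights, k=n)


n = 10
weights = [1.0 / n] * n          # initially every example is equally likely
rng = random.Random(0)
round_1 = draw_weighted_sample(weights, rng)

# Hypothetical update: suppose the example at index 3 was misclassified,
# so its weight is increased and all weights are renormalized to sum to 1.
weights[3] *= 5.0
total = sum(weights)
weights = [w / total for w in weights]
round_2 = draw_weighted_sample(weights, rng)

print("round 1 indices:", round_1)
print("round 2 indices:", round_2)
```

After the update, index 3 is drawn with probability 5/14 per draw instead of 1/10, so it tends to appear several times in the next training set.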

The following table shows the examples chosen during each boosting round,

when applied to the data shown in Table 4.4.

Boosting (Round 1): 7 3 2 8 7 9 4 10 6 3

Boosting (Round 2): 5 4 9 4 2 5 1 7 4 2

Boosting (Round 3): 4 4 8 10 4 5 4 6 3 4

Initially, all the examples are assigned the same weights. However, some

examples may be chosen more than once, e.g., examples 3 and 7, because

the sampling is done with replacement. A classifier built from the data is then

used to classify all the examples. Suppose example 4 is difficult to classify.

The weight for this example will be increased in future iterations as it gets

misclassified repeatedly. Meanwhile, examples that were not chosen in the

previous round, e.g., examples 1 and 5, also have a better chance of being

selected in the next round since their predictions in the previous round were

likely to be wrong. As the boosting rounds proceed, examples that are the

hardest to classify tend to become even more prevalent. The final ensemble is

obtained by aggregating the base classifiers obtained from each boosting

round.

Over the years, several implementations of the boosting algorithm have been

developed. These algorithms differ in terms of (1) how the weights of the

training examples are updated at the end of each boosting round, and (2) how

the predictions made by each classifier are combined. An implementation

called AdaBoost is explored in the next section.

AdaBoost

Let $\{(x_j, y_j) \mid j = 1, 2, \ldots, N\}$ denote a set of N training examples. In the AdaBoost
algorithm, the importance of a base classifier $C_i$ depends on its
error rate, which is defined as

$$\epsilon_i = \frac{1}{N} \left[ \sum_{j=1}^{N} w_j \, I\big(C_i(x_j) \neq y_j\big) \right], \tag{4.102}$$

where $I(p) = 1$ if the predicate p is true, and 0 otherwise. The importance of a
classifier $C_i$ is given by the following parameter,

$$\alpha_i = \frac{1}{2} \ln \left( \frac{1 - \epsilon_i}{\epsilon_i} \right).$$

Note that $\alpha_i$ has a large positive value if the error rate is close to 0 and a large
negative value if the error rate is close to 1, as shown in Figure 4.48.

Figure 4.48.

Plot of $\alpha$ as a function of training error $\epsilon$.

The $\alpha_i$ parameter is also used to update the weight of the training examples.
To illustrate, let $w_i^{(j)}$ denote the weight assigned to example $(x_i, y_i)$ during the
$j$th boosting round. The weight update mechanism for AdaBoost is given by the
equation:

$$w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} e^{-\alpha_j} & \text{if } C_j(x_i) = y_i, \\ e^{\alpha_j} & \text{if } C_j(x_i) \neq y_i, \end{cases} \tag{4.103}$$

where $Z_j$ is the normalization factor used to ensure that $\sum_i w_i^{(j+1)} = 1$. The
weight update formula given in Equation 4.103 increases the weights of
incorrectly classified examples and decreases the weights of those classified
correctly.

Instead of using a majority voting scheme, the prediction made by each
classifier $C_j$ is weighted according to $\alpha_j$. This approach allows AdaBoost to
penalize models that have poor accuracy, e.g., those generated at the earlier
boosting rounds. In addition, if any intermediate rounds produce an error rate
higher than 50%, the weights are reverted to their original uniform
values, $w_i = 1/N$, and the resampling procedure is repeated. The AdaBoost
algorithm is summarized in Algorithm 4.6.

Algorithm 4.6 AdaBoost algorithm.
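
The following is a minimal sketch of AdaBoost with decision stumps on the data of Table 4.4, not a transcription of Algorithm 4.6. It departs from the resampling variant described in this section in three ways, all of them assumptions. The weights are passed directly to the stump learner (the second use of weights listed earlier), which makes the run deterministic. The error of a round is the weighted sum of misclassified examples, with weights normalized to sum to 1. The reset to uniform weights is omitted, because a stump that may predict either polarity never exceeds 50% weighted error.

```python
import math

X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]


def stump_output(stump, x):
    threshold, left, right = stump
    return left if x <= threshold else right


def fit_weighted_stump(xs, ys, w):
    """Return (weighted_error, stump) for the stump with the lowest weighted error."""
    best_err, best_stump = None, None
    for threshold in sorted(set(xs)):
        for left, right in ((1, -1), (-1, 1)):
            stump = (threshold, left, right)
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if stump_output(stump, x) != y)
            if best_err is None or err < best_err:
                best_err, best_stump = err, stump
    return best_err, best_stump


def adaboost(xs, ys, rounds):
    n = len(xs)
    w = [1.0 / n] * n                              # uniform initial weights
    model = []
    for _ in range(rounds):
        err, stump = fit_weighted_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)  # importance of this classifier
        model.append((alpha, stump))
        # Shrink weights of correctly classified examples, grow the others.
        w = [wi * math.exp(-alpha if stump_output(stump, x) == y else alpha)
             for x, y, wi in zip(xs, ys, w)]
        z = sum(w)                                 # normalization factor
        w = [wi / z for wi in w]
    return model


def predict(model, x):
    """Sign of the alpha-weighted sum of the base predictions."""
    score = sum(alpha * stump_output(stump, x) for alpha, stump in model)
    return 1 if score >= 0 else -1


model = adaboost(X, Y, rounds=3)
predictions = [predict(model, x) for x in X]
for alpha, stump in model:
    print(round(alpha, 4), stump)
print(predictions == Y)
```

In this sketch two rounds are not enough, but the three-round ensemble reproduces all ten training labels, even though no single stump exceeds 70% accuracy on this data.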

Let us examine how the boosting approach works on the data set shown in

Table 4.4. Initially, all the examples have identical weights. After three

boosting rounds, the examples chosen for training are shown in Figure

4.49(a). The weights for each example are updated at the end of each

boosting round using Equation 4.103, as shown in Figure 4.50(b).

Without boosting, the accuracy of the decision stump is, at best, 70%. With

AdaBoost, the results of the predictions are given in Figure 4.50(b). The

final prediction of the ensemble classifier is obtained by taking a weighted

average of the predictions made by each base classifier, which is shown in the

last row of Figure 4.50(b). Notice that AdaBoost perfectly classifies all the

examples in the training data.

Figure 4.49.

Example of boosting.

An important analytical result of boosting shows that the training error of the
ensemble is bounded by the following expression:

$$e_{\text{ensemble}} \le \prod_i \left[ 2\sqrt{\epsilon_i (1 - \epsilon_i)} \right], \tag{4.104}$$

where $\epsilon_i$ is the error rate of each base classifier i. If the error rate of the base
classifier is less than 50%, we can write $\epsilon_i = 0.5 - \gamma_i$, where $\gamma_i$ measures how
much better the classifier is than random guessing. The bound on the training
error of the ensemble becomes

$$e_{\text{ensemble}} \le \prod_i \sqrt{1 - 4\gamma_i^2} \le \exp\left(-2 \sum_i \gamma_i^2\right). \tag{4.105}$$

Hence, the training error of the ensemble decreases exponentially, which
leads to the fast convergence of the algorithm. By focusing on examples that
are difficult to classify by base classifiers, boosting is able to reduce the bias of the
final predictions along with the variance. AdaBoost has been shown to provide
significant improvements in performance over base classifiers on a range of
data sets. Nevertheless, because of its tendency to focus on training
examples that are wrongly classified, the boosting technique can be
susceptible to overfitting, resulting in poor generalization performance in some
scenarios.

Figure 4.50.

Example of combining classifiers constructed using the AdaBoost approach.
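
The two bounds in Equation 4.105 can be compared numerically. The sketch assumes the square-root form $\prod_i \sqrt{1 - 4\gamma_i^2}$ used in the standard AdaBoost analysis, and the three base error rates are hypothetical values chosen only for illustration.

```python
import math


def training_error_bounds(error_rates):
    """Return (product bound, exponential bound) for base error rates below 0.5."""
    product = 1.0
    gamma_sq_sum = 0.0
    for eps in error_rates:
        gamma = 0.5 - eps                     # margin over random guessing
        product *= math.sqrt(1.0 - 4.0 * gamma * gamma)
        gamma_sq_sum += gamma * gamma
    return product, math.exp(-2.0 * gamma_sq_sum)


rates = [0.30, 0.21, 0.18]                    # hypothetical base error rates
product_bound, exp_bound = training_error_bounds(rates)
print(round(product_bound, 4), round(exp_bound, 4))
```

Each extra round with a margin $\gamma_i > 0$ multiplies the bound by a factor below 1, which is the source of the exponential decrease described above.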

4.10.6 Random Forests

Random forests attempt to improve the generalization performance by

constructing an ensemble of decorrelated decision trees. Random forests

build on the idea of bagging to use a different bootstrap sample of the training

data for learning decision trees. However, a key distinguishing feature of

random forests from bagging is that at every internal node of a tree, the best

splitting criterion is chosen among a small set of randomly selected attributes.

In this way, random forests construct ensembles of decision trees by not only

manipulating training instances (by using bootstrap samples similar to

bagging), but also the input attributes (by using different subsets of attributes

at every internal node).

Given a training set D consisting of n instances and d attributes, the basic

procedure of training a random forest classifier can be summarized using the

following steps:

1. Construct a bootstrap sample $D_i$ of the training set by randomly
sampling n instances (with replacement) from D.

2. Use $D_i$ to learn a decision tree $T_i$ as follows. At every internal node of
$T_i$, randomly sample a set of p attributes and choose an attribute from
this subset that shows the maximum reduction in an impurity measure
for splitting. Repeat this procedure till every leaf is pure, i.e., containing
instances from the same class.

Once an ensemble of decision trees has been constructed, their average
prediction (majority vote) on a test instance is used as the final prediction of
the random forest. Note that the decision trees involved in a random forest are
unpruned trees, as they are allowed to grow to their largest possible size till
every leaf is pure. Hence, the base classifiers of random forest represent
unstable classifiers that have low bias but high variance, because of their
large size.
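
The node-level step that distinguishes random forests from bagging can be sketched as follows. The code samples p attribute indices and returns the best split among them. It assumes numeric attributes, threshold tests of the form `value <= t`, and the Gini index as the impurity measure. The names `best_split` and `gini` and the toy data are illustrative assumptions, and growing the full tree is left out.

```python
import random


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels, p, rng):
    """Sample p attribute indices and return (gain, attribute, threshold)
    for the split with the largest impurity reduction among them.
    Returns None if no sampled attribute has two distinct values."""
    d = len(rows[0])
    candidates = rng.sample(range(d), p)      # random subset of attributes
    parent = gini(labels)
    best = None
    for a in candidates:
        # Skip the largest value so that the right branch is never empty.
        for t in sorted({row[a] for row in rows})[:-1]:
            left = [y for row, y in zip(rows, labels) if row[a] <= t]
            right = [y for row, y in zip(rows, labels) if row[a] > t]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            gain = parent - weighted
            if best is None or gain > best[0]:
                best = (gain, a, t)
    return best


# Toy data: attribute 0 separates the classes, attributes 1 and 2 are weak.
rows = [[0.1, 5.0, 1.0], [0.2, 3.0, 0.0], [0.8, 4.0, 1.0], [0.9, 2.0, 0.0]]
labels = [0, 0, 1, 1]
print(best_split(rows, labels, p=3, rng=random.Random(0)))
print(best_split(rows, labels, p=1, rng=random.Random(1)))
```

With p equal to the number of attributes the strong attribute is always found. With p = 1 the node must sometimes split on a weak attribute, which is how different trees end up using different attributes.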

Another property of the base classifiers learned in random forests is the lack
of correlation among their model parameters and test predictions. This can be
attributed to the use of an independently sampled data set $D_i$ for learning
every decision tree $T_i$, similar to the bagging approach. However, random
forests have the additional advantage of choosing a splitting criterion at every
internal node using a different (and randomly selected) subset of attributes.
This property significantly helps in breaking the correlation structure, if any,
among the decision trees $T_i$.

To see this, consider a training set involving a large number of attributes,
where only a small subset of attributes are strong predictors of the target
class, whereas other attributes are weak indicators. Given such a training set,
even if we consider different bootstrap samples $D_i$ for learning $T_i$, we would
mostly be choosing the same attributes for splitting at internal nodes, because
the weak attributes would be largely overlooked when compared with the
strong predictors. This can result in a considerable correlation among the
trees. However, if we restrict the choice of attributes at every internal node to
a random subset of attributes, we can ensure the selection of both strong and
weak predictors, thus promoting diversity among the trees. This principle is
utilized by random forests for creating decorrelated decision trees.

By aggregating the predictions of an ensemble of strong and decorrelated
decision trees, random forests are able to reduce the variance of the trees
without negatively impacting their low bias. This makes random forests quite
robust to overfitting. Additionally, because of their ability to consider only a
small subset of attributes at every internal node, random forests are
computationally fast and robust even in high-dimensional settings.

The number of attributes to be selected at every node, p, is a hyper-parameter
of the random forest classifier. A small value of p can reduce the correlation
among the classifiers but may also reduce their strength. A large value can
improve their strength but may result in correlated trees similar to bagging.
Although common suggestions for p in the literature include $\sqrt{d}$ and $\log_2 d + 1$, a
suitable value of p for a given training set can always be selected by tuning it
over a validation set, as described in the previous chapter. However, there is
an alternative way for selecting hyper-parameters in random forests, which
does not require using a separate validation set. It involves computing a
reliable estimate of the generalization error rate directly during training, known
as the out-of-bag (oob) error estimate. The oob estimate can be computed
for any generic ensemble learning method that builds independent base
classifiers using bootstrap samples of the training set, e.g., bagging and
random forests. The approach for computing the oob estimate can be described
as follows.

Consider an ensemble learning method that uses an independent base
classifier $T_i$ built on a bootstrap sample of the training set $D_i$. Since every
training instance $x$ will be used for training approximately 63% of base
classifiers, we can call $x$ an out-of-bag sample for the remaining 37% of
base classifiers that did not use it for training. If we use these remaining 37% of
the classifiers to make predictions on $x$, we can obtain the oob error on $x$ by
taking their majority vote and comparing it with its class label. Note that the
oob error estimates the error of 37% of the classifiers on an instance that was not
used for training those classifiers. Hence, the oob error can be considered as
a reliable estimate of generalization error. By taking the average of oob errors
of all training instances, we can compute the overall oob error estimate. This
can be used as an alternative to the validation error rate for selecting
hyper-parameters. Hence, random forests do not need to use a separate partition of
the training set for validation, as they can simultaneously train the base
classifiers and compute generalization error estimates on the same data set.
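
The out-of-bag procedure can be sketched with bagged decision stumps on the ten-instance data of Table 4.4. The stump learner, the tie-breaking rule (ties go to +1), and the choice of 200 base classifiers are assumptions made for illustration. With only ten instances the out-of-bag fraction is about $0.9^{10} \approx 0.35$ rather than the large-N value of 37%.

```python
import random

X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]


def fit_stump(xs, ys):
    """Return (threshold, left_label, right_label) with the fewest training errors."""
    best = None
    for t in sorted(set(xs)):
        for left, right in ((1, -1), (-1, 1)):
            errors = sum((left if x <= t else right) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    return best[1:]


def oob_error(xs, ys, k, rng):
    """Return (oob error estimate, average out-of-bag fraction)."""
    n = len(xs)
    votes = [0] * n      # summed predictions of classifiers that did not see j
    counted = [0] * n    # number of such classifiers for instance j
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]
        in_bag = set(idx)
        t, left, right = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        for j in range(n):
            if j not in in_bag:               # j is out-of-bag for this stump
                votes[j] += left if xs[j] <= t else right
                counted[j] += 1
    errors = [(1 if votes[j] >= 0 else -1) != ys[j]
              for j in range(n) if counted[j] > 0]
    if not errors:
        raise ValueError("no instance was ever out-of-bag; increase k")
    return sum(errors) / len(errors), sum(counted) / (k * n)


err, oob_fraction = oob_error(X, Y, k=200, rng=random.Random(0))
print("oob error estimate:", err)
print("average out-of-bag fraction:", round(oob_fraction, 3))
```

Because every prediction that enters the estimate comes from stumps that never saw the instance, no separate validation split is needed.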

Random forests have been empirically found to provide significant

improvements in generalization performance that are often comparable, if not

superior, to the improvements provided by the AdaBoost algorithm. Random

forests are also more robust to overfitting and run much faster than the

AdaBoost algorithm.

4.10.7 Empirical Comparison among

Ensemble Methods

Table 4.5 shows the empirical results obtained when comparing the

performance of a decision tree classifier against bagging, boosting, and

random forest. The base classifiers used in each ensemble method consist of

50 decision trees. The classification accuracies reported in this table are

obtained from tenfold cross-validation. Notice that the ensemble classifiers

generally outperform a single decision tree classifier on many of the data sets.

Table 4.5. Comparing the accuracy of a decision tree classifier against
three ensemble methods.

Data Set      (Attributes, Classes, Instances)  Decision Tree (%)  Bagging (%)  Boosting (%)  RF (%)
Anneal        (39, 6, 898)                      92.09              94.43        95.43         95.43
Australia     (15, 2, 690)                      85.51              87.10        85.22         85.80
Auto          (26, 7, 205)                      81.95              85.37        85.37         84.39
Breast        (11, 2, 699)                      95.14              96.42        97.28         96.14
Cleve         (14, 2, 303)                      76.24              81.52        82.18         82.18
Credit        (16, 2, 690)                      85.80              86.23        86.09         85.80
Diabetes      (9, 2, 768)                       72.40              76.30        73.18         75.13
German        (21, 2, 1000)                     70.90              73.40        73.00         74.50
Glass         (10, 7, 214)                      67.29              76.17        77.57         78.04
Heart         (14, 2, 270)                      80.00              81.48        80.74         83.33
Hepatitis     (20, 2, 155)                      81.94              81.29        83.87         83.23
Horse         (23, 2, 368)                      85.33              85.87        81.25         85.33
Ionosphere    (35, 2, 351)                      89.17              92.02        93.73         93.45
Iris          (5, 3, 150)                       94.67              94.67        94.00         93.33
Labor         (17, 2, 57)                       78.95              84.21        89.47         84.21
Led7          (8, 10, 3200)                     73.34              73.66        73.34         73.06
Lymphography  (19, 4, 148)                      77.03              79.05        85.14         82.43
Pima          (9, 2, 768)                       74.35              76.69        73.44         77.60
Sonar         (61, 2, 208)                      78.85              78.85        84.62         85.58
Tic-tac-toe   (10, 2, 958)                      83.72              93.84        98.54         95.82
Vehicle       (19, 4, 846)                      71.04              74.11        78.25         74.94
Waveform      (22, 3, 5000)                     76.44              83.30        83.90         84.04
Wine          (14, 3, 178)                      94.38              96.07        97.75         97.75
Zoo           (17, 7, 101)                      93.07              93.07        95.05         97.03

4.11 Class Imbalance Problem

In many data sets there are a disproportionate number of instances that

belong to different classes, a property known as skew or class

imbalance. For example, consider a health-care application where diagnostic

reports are used to decide whether a person has a rare disease. Because of

the infrequent nature of the disease, we can expect to observe a smaller

number of subjects who are positively diagnosed. Similarly, in credit card

fraud detection, fraudulent transactions are greatly outnumbered by legitimate

transactions.

The degree of imbalance between the classes varies across different

applications and even across different data sets from the same application.

For example, the risk for a rare disease may vary across different populations

of subjects depending on their dietary and lifestyle choices. However, despite

their infrequent occurrences, a correct classification of the rare class often has

greater value than a correct classification of the majority class. For example, it

may be more dangerous to ignore a patient suffering from a disease than to

misdiagnose a healthy person.

More generally, class imbalance poses two challenges for classification. First,

it can be difficult to find sufficiently many labeled samples of a rare class. Note

that many of the classification methods discussed so far work well only when

the training set has a balanced representation of both classes. Although some

classifiers are more effective at handling imbalance in the training data than

others, e.g., rule-based classifiers and k-NN, they are all impacted if the

minority class is not well-represented in the training set. In general, a classifier

trained over an imbalanced data set shows a bias toward improving its

performance over the majority class, which is often not the desired behavior.

As a result, many existing classification models, when trained on an

imbalanced data set, may not effectively detect instances of the rare class.

Second, accuracy, which is the traditional measure for evaluating

classification performance, is not well-suited for evaluating models in the

presence of class imbalance in the test data. For example, if 1% of the credit

card transactions are fraudulent, then a trivial model that predicts every

transaction as legitimate will have an accuracy of 99% even though it fails to

detect any of the fraudulent activities. Thus, there is a need to use alternative

evaluation metrics that are sensitive to the skew and can capture different

criteria of performance than accuracy.
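
The fraud example can be reproduced in a few lines. The sketch builds a hypothetical set of 1000 transactions with 1% positives and a trivial model that always predicts the negative class. It then reports accuracy next to recall, used here as one example of a skew-sensitive measure. The helper names are assumptions for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model detects."""
    predicted_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in predicted_for_positives) / len(predicted_for_positives)


y_true = [1] * 10 + [0] * 990      # 1 = fraudulent, 0 = legitimate
y_pred = [0] * 1000                # trivial model: everything is legitimate

print("accuracy:", accuracy(y_true, y_pred))
print("recall:  ", recall(y_true, y_pred))
```

The trivial model scores 0.99 on accuracy and 0.0 on recall, which is why accuracy alone is a poor guide when the test set is skewed.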

In this section, we first present some of the generic methods for building

classifiers when there is class imbalance in the training set. We then discuss

methods for evaluating classification performance and adapting classification

decisions in the presence of a skewed test set. In the remainder of this

section, we will consider binary classification problems for simplicity, where

the minority class is referred as the positive class while the majority class

is referred as the negative class.

4.11.1 Building Classifiers with Class

Imbalance

There are two primary considerations for building classifiers in the presence of

class imbalance in the training set. First, we need to ensure that the learning

algorithm is trained over a data set that has adequate representation of both

the majority as well as the minority classes. Some common approaches for

ensuring this include the methodologies of oversampling and undersampling

the training set. Second, having learned a classification model, we need a way

to adapt its classification decisions (and thus create an appropriately tuned

classifier) to best match the requirements of the imbalanced test set. This is

typically done by converting the outputs of the classification model to real-

valued scores, and then selecting a suitable threshold on the classification

score to match the needs of a test set. Both these considerations are

discussed in detail in the following.

Oversampling and Undersampling

The first step in learning with imbalanced data is to transform the training set

to a balanced training set, where both classes have nearly equal

representation. The balanced training set can then be used with any of the

existing classification techniques (without making any modifications in the

learning algorithm) to learn a model that gives equal emphasis to both

classes. In the following, we present some of the common techniques for

transforming an imbalanced training set to a balanced one.

A basic approach for creating balanced training sets is to generate a sample

of training instances where the rare class has adequate representation. There

are two types of sampling methods that can be used to enhance the

representation of the minority class: (a) undersampling, where the frequency

of the majority class is reduced to match the frequency of the minority class,

and (b) oversampling, where artificial examples of the minority class are

created to make them equal in proportion to the number of negative instances.

To illustrate undersampling, consider a training set that contains 100 positive

examples and 1000 negative examples. To overcome the skew among the

classes, we can select a random sample of 100 examples from the negative

class and use them with the 100 positive examples to create a balanced

training set. A classifier built over the resultant balanced set will then be

unbiased toward both classes. However, one limitation of undersampling is

that some of the useful negative examples (e.g., those closer to the actual

decision boundary) may not be chosen for training, thereby resulting in an

inferior classification model. Another limitation is that the smaller sample of

100 negative instances may have a higher variance than the larger set of

1000.
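The undersampling step described above can be sketched as follows. This is only an illustration under simple assumptions: instances are stored as hypothetical `(id, label)` pairs, the function name `undersample` is made up for this example, and the negatives are chosen uniformly at random without replacement.

```python
import random

def undersample(positives, negatives, seed=0):
    """Randomly keep as many negatives as there are positives.

    Assumes len(positives) <= len(negatives).
    """
    rng = random.Random(seed)
    kept_negatives = rng.sample(negatives, k=len(positives))
    return positives + kept_negatives

# Toy data: 100 positive and 1000 negative instances, stored as (id, label).
positives = [(i, 1) for i in range(100)]
negatives = [(i, 0) for i in range(100, 1100)]

balanced = undersample(positives, negatives)
n_pos = sum(1 for _, label in balanced if label == 1)
n_neg = sum(1 for _, label in balanced if label == 0)
print(n_pos, n_neg)  # 100 100
```

Because the 100 negatives are drawn at random, a different seed gives a different training set, which is one way to see the higher variance mentioned above.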

Oversampling attempts to create a balanced training set by artificially

generating new positive examples. A simple approach for oversampling is to

duplicate every positive instance $n_-/n_+$ times, where $n_+$ and $n_-$ are the
numbers of positive and negative training instances, respectively. Figure 4.51
illustrates the effect of oversampling on the learning of a decision boundary
using a classifier such as a decision tree. Without oversampling, only the
positive examples at the bottom right-hand side of Figure 4.51(a) are classified
correctly. The positive example in the middle of the diagram is misclassified
because there are not enough examples to justify the creation of a new decision
boundary to separate the positive and negative instances. Oversampling provides
the additional examples needed to ensure that the decision boundary surrounding
the positive example is not pruned, as illustrated in Figure 4.51(b). Note that
duplicating a positive instance is analogous to doubling its weight during the
training stage. Hence, the effect of oversampling can be alternatively achieved
by assigning higher weights to positive instances than negative instances. This
method of weighting instances can be used with a number of classifiers such as
logistic regression, ANN, and SVM.

Figure 4.51.

Illustrating the effect of oversampling of the rare class.
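As a rough illustration of oversampling by duplication, the sketch below replicates every positive instance about $n_-/n_+$ times and also shows the equivalent instance weights. The function names and the `(id, label)` representation are assumptions made for this example, and rounding the ratio to an integer is a simplification.

```python
def oversample_by_duplication(positives, negatives):
    """Replicate each positive instance round(n_neg / n_pos) times."""
    copies = max(1, round(len(negatives) / len(positives)))
    return positives * copies + negatives

def class_weights(n_pos, n_neg):
    """Equivalent weighting: positives weigh n_neg / n_pos, negatives weigh 1."""
    return {1: n_neg / n_pos, 0: 1.0}

# Toy data: 100 positive and 1000 negative instances, stored as (id, label).
positives = [(i, 1) for i in range(100)]
negatives = [(i, 0) for i in range(100, 1100)]

balanced = oversample_by_duplication(positives, negatives)
n_pos = sum(1 for _, label in balanced if label == 1)
n_neg = sum(1 for _, label in balanced if label == 0)
weights = class_weights(100, 1000)
print(n_pos, n_neg)  # 1000 1000
print(weights)       # {1: 10.0, 0: 1.0}
```

Passing weights to a learner avoids physically copying instances, but whether and how a given classifier accepts instance weights depends on the implementation being used.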

One limitation of the duplication method for oversampling is that the replicated

positive examples have an artificially lower variance when compared with their

true distribution in the overall data. This can bias the classifier to the specific

distribution of training instances, which may not be representative of the

overall distribution of test instances, leading to poor generalizability. To

overcome this limitation, an alternative approach for oversampling is to

generate synthetic positive instances in the neighborhood of existing positive

instances. In this approach, called the Synthetic Minority Oversampling

Technique (SMOTE), we first determine the k-nearest positive neighbors of
every positive instance $\mathbf{x}$, and then generate a synthetic positive
instance at some intermediate point along the line segment joining $\mathbf{x}$
to one of its randomly chosen k-nearest neighbors, $\mathbf{x}_k$. This process
is repeated until the desired number of positive instances is reached. However,
one limitation of this approach is that it can only generate new positive
instances in the convex hull of the existing positive class. Hence, it does not
help improve the representation of the positive class outside the boundary of
existing positive

instances. Despite their complementary strengths and weaknesses,

undersampling and oversampling provide useful directions for generating

balanced training sets in the presence of class imbalance.
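The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration, not a reference implementation: it picks the seed instance at random rather than cycling through every positive instance, uses Euclidean distance, and the function name `smote` and its parameters are assumptions for this example.

```python
import math
import random

def smote(positives, k=3, n_new=10, seed=0):
    """Create n_new synthetic points between positives and their k nearest positive neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(positives)
        # k nearest positive neighbors of x (excluding x itself).
        others = [p for p in positives if p is not x]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        x_k = rng.choice(neighbors)
        # Random point on the line segment joining x and x_k.
        lam = rng.random()
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, x_k)))
    return synthetic

# Toy positive class: corners and center of the unit square.
positives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
new_points = smote(positives, k=3, n_new=10)
print(len(new_points))  # 10
```

Every synthetic point lies on a segment between two existing positives, so none of them can fall outside the unit square here, which illustrates the convex-hull limitation noted above.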

Assigning Scores to Test Instances

If a classifier returns an ordinal score $s(\mathbf{x})$ for every test instance
$\mathbf{x}$ such that a higher score denotes a greater likelihood of belonging
to the positive class, then for every possible value of score threshold, $s_T$,
we can create a new binary classifier where a test instance $\mathbf{x}$ is
classified positive only if $s(\mathbf{x}) > s_T$. Thus, every choice of $s_T$
can potentially lead to a different classifier, and we are interested in finding
the classifier that is best suited for our needs.

Ideally, we would like the classification score to vary monotonically with the
actual posterior probability of the positive class, i.e., if $s(\mathbf{x}_1)$
and $s(\mathbf{x}_2)$ are the scores of any two instances, $\mathbf{x}_1$ and
$\mathbf{x}_2$, then

$$s(\mathbf{x}_1) \ge s(\mathbf{x}_2) \Rightarrow P(y=1 \mid \mathbf{x}_1) \ge P(y=1 \mid \mathbf{x}_2).$$

However, this is difficult to guarantee in practice as the properties of the
classification score depend on several factors such as the complexity of the
classification algorithm and the representative power of the training set. In
general, we can only expect the classification score of a reasonable algorithm
to be weakly related to the actual posterior probability of the positive class,
even though the relationship may not be strictly monotonic. Most classifiers can
be easily modified to produce such a real-valued score. For example, the signed
distance of an instance from the positive margin hyperplane of SVM can be used
as a classification score. As another example, test instances belonging to a
leaf in a decision tree can be assigned a score based on the fraction of
training instances labeled as positive in the leaf. Also, probabilistic
classifiers such as naïve Bayes, Bayesian networks, and logistic regression
naturally output estimates of posterior probabilities,
$P(y=1 \mid \mathbf{x})$. Next, we discuss some

evaluation measures for assessing the goodness of a classifier in the

presence of class imbalance.

Table 4.6. A confusion matrix for a binary classification problem in which
the classes are not equally important.

|          | Predicted + | Predicted − |
|----------|-------------|-------------|
| Actual + | $f_{++}$ (TP) | $f_{+-}$ (FN) |
| Actual − | $f_{-+}$ (FP) | $f_{--}$ (TN) |

4.11.2 Evaluating Performance with Class Imbalance

The most basic approach for representing a classifier's performance on a test
set is to use a confusion matrix, as shown in Table 4.6. This table is
essentially the same as Table 3.4, which was introduced in the context of
evaluating classification performance in Section 3.2. A confusion matrix
summarizes the number of instances predicted correctly or incorrectly by a
classifier using the following four counts:

- True positive (TP) or $f_{++}$, which corresponds to the number of positive
  examples correctly predicted by the classifier.
- False positive (FP) or $f_{-+}$ (also known as Type I error), which
  corresponds to the number of negative examples wrongly predicted as positive
  by the classifier.
- False negative (FN) or $f_{+-}$ (also known as Type II error), which
  corresponds to the number of positive examples wrongly predicted as negative
  by the classifier.
- True negative (TN) or $f_{--}$, which corresponds to the number of negative
  examples correctly predicted by the classifier.

The confusion matrix provides a concise representation of classification

performance on a given test data set. However, it is often difficult to interpret

and compare the performance of classifiers using the four-dimensional

representations (corresponding to the four counts) provided by their confusion

matrices. Hence, the counts in the confusion matrix are often summarized

using a number of evaluation measures. Accuracy is an example of one

such measure that combines these four counts into a single value, which is

used extensively when classes are balanced. However, the accuracy measure

is not suitable for handling data sets with imbalanced class distributions as it

tends to favor classifiers that correctly classify the majority class. In the

following, we describe other possible measures that capture different criteria

of performance when working with imbalanced classes.
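The four counts of the confusion matrix can be tallied directly from the actual and predicted labels. The sketch below assumes labels are encoded as 1 (positive) and 0 (negative); the function name `confusion_counts` and the toy labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels, where 1 is the positive class."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

# Toy test set with 3 positives and 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fn + fp + tn)
print(tp, fn, fp, tn)  # 2 1 1 6
print(accuracy)        # 0.8
```

Accuracy collapses the four counts into one number, so the single missed positive out of three is hidden behind the six correctly classified negatives.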

A basic evaluation measure is the true positive rate (TPR), which is defined
as the fraction of positive test instances correctly predicted by the classifier:

$$\mathrm{TPR} = \frac{TP}{TP + FN}.$$

In the medical community, TPR is also known as sensitivity, while in the
information retrieval literature, it is also called recall (r). A classifier
with a high TPR has a high chance of correctly identifying the positive
instances of the data.

Analogously to TPR, the true negative rate (TNR) (also known as specificity) is
defined as the fraction of negative test instances correctly predicted by the
classifier, i.e.,

$$\mathrm{TNR} = \frac{TN}{FP + TN}.$$

A high TNR value signifies that the classifier is likely to correctly classify a
randomly chosen negative instance in the test set. A commonly used evaluation
measure that is closely related to TNR is the false positive rate (FPR), which
is defined as $1 - \mathrm{TNR}$:

$$\mathrm{FPR} = \frac{FP}{FP + TN}.$$

Similarly, we can define the false negative rate (FNR) as $1 - \mathrm{TPR}$:

$$\mathrm{FNR} = \frac{FN}{FN + TP}.$$

Note that the evaluation measures defined above do not take into account the
skew among the classes, which can be formally defined as $\alpha = P/(P+N)$,
where P and N denote the number of actual positives and actual negatives,
respectively. As a result, changing the relative numbers of P and N will have no
effect on TPR, TNR, FPR, or FNR, since they depend only on the fraction of
correct classifications for every class, independently of the other class.
Furthermore, knowing the values of TPR and TNR (and consequently FNR and FPR)
does not by itself help us uniquely determine all four entries of the confusion
matrix. However, together with information about the skew factor, $\alpha$, and
the total number of instances, N, we can compute the entire confusion matrix
using TPR and TNR, as shown in Table 4.7.

Table 4.7. Entries of the confusion matrix in terms of the TPR, TNR, skew,
$\alpha$, and total number of instances, N.

|          | Predicted + | Predicted − |   |
|----------|-------------|-------------|---|
| Actual + | $\mathrm{TPR} \times \alpha \times N$ | $(1-\mathrm{TPR}) \times \alpha \times N$ | $\alpha \times N$ |
| Actual − | $(1-\mathrm{TNR}) \times (1-\alpha) \times N$ | $\mathrm{TNR} \times (1-\alpha) \times N$ | $(1-\alpha) \times N$ |
|          |             |             | $N$ |

An evaluation measure that is sensitive to the skew is precision, which can be
defined as the fraction of correct predictions of the positive class over the
total number of positive predictions, i.e.,

$$\text{Precision},\ p = \frac{TP}{TP + FP}.$$

Precision is also referred to as the positive predictive value (PPV). A
classifier that has a high precision is likely to have most of its positive
predictions correct. Precision is a useful measure for highly skewed test sets
where the positive predictions, even though small in numbers, are required to be
mostly correct. A measure that is closely related to precision is the false
discovery rate (FDR), which can be defined as $1 - p$:

$$\mathrm{FDR} = \frac{FP}{TP + FP}.$$

Although both FDR and FPR focus on FP, they are designed to capture different
evaluation objectives and thus can take quite contrasting values, especially in
the presence of class imbalance. To illustrate this, consider a classifier with
the following confusion matrix.

|          | Predicted + | Predicted − |
|----------|-------------|-------------|
| Actual + | 100         | 0           |
| Actual − | 100         | 900         |

Since half of the positive predictions made by the classifier are incorrect, it
has an FDR value of $100/(100+100) = 0.5$. However, its FPR is equal to
$100/(100+900) = 0.1$, which is quite low. This example shows that in the
presence of high skew (i.e., very small value of $\alpha$), even a small FPR can
result in high FDR. See Section 10.6 for further discussion of this issue.

Note that the evaluation measures defined above provide an incomplete
representation of performance, because they capture either the effect of false
positives (e.g., FPR and precision) or the effect of false negatives (e.g., TPR
or recall), but not both. Hence, if we optimize only one of these evaluation
measures, we may end up with a classifier that shows low FN but high FP, or
vice versa. For example, a classifier that declares every instance to be
positive will have a perfect recall, but high FPR and very poor precision. On
the other hand, a classifier that is very conservative in classifying an
instance as positive (to reduce FP) may end up having high precision but very
poor recall. We thus need evaluation measures that account for both types of
misclassifications, FP and FN. Some examples of such evaluation measures are
summarized by the following definitions.

$$\text{Positive Likelihood Ratio} = \frac{\mathrm{TPR}}{\mathrm{FPR}}.$$

$$F_1\ \text{measure} = \frac{2rp}{r+p} = \frac{2 \times TP}{2 \times TP + FP + FN}.$$

$$G\ \text{measure} = \sqrt{rp} = \frac{TP}{\sqrt{(TP+FP)(TP+FN)}}.$$

While some of these evaluation measures are invariant to the skew (e.g., the
positive likelihood ratio), others (e.g., precision and the $F_1$ measure) are
sensitive to skew. Further, different evaluation measures capture the effects of
different types of misclassification errors in various ways. For example, the
$F_1$ measure represents a harmonic mean between recall and precision, i.e.,

$$F_1 = \frac{2}{\frac{1}{r} + \frac{1}{p}}.$$

Because the harmonic mean of two numbers tends to be closer to the smaller of
the two numbers, a high value of $F_1$-measure ensures that both precision and
recall are reasonably high. Similarly, the G measure represents the geometric
mean between recall and precision. A comparison among harmonic, geometric, and
arithmetic means is given in the next example.
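The measures defined in this section can all be computed from the four confusion-matrix counts. The sketch below is illustrative only: the function names are made up, the total number of instances is called `n_total` to keep it distinct from the number of negatives, and no guard is included for zero denominators. It uses the counts of the FDR versus FPR illustration (TP = 100, FN = 0, FP = 100, TN = 900) and then rebuilds the matrix from TPR, TNR, and the skew, following Table 4.7.

```python
import math

def rates(tp, fn, fp, tn):
    """Evaluation measures computed from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)        # recall, sensitivity
    tnr = tn / (fp + tn)        # specificity
    fpr = fp / (fp + tn)        # equals 1 - TNR
    fnr = fn / (fn + tp)        # equals 1 - TPR
    precision = tp / (tp + fp)
    fdr = fp / (tp + fp)        # equals 1 - precision
    f1 = 2 * tp / (2 * tp + fp + fn)
    g = math.sqrt(tpr * precision)
    return {"TPR": tpr, "TNR": tnr, "FPR": fpr, "FNR": fnr,
            "precision": precision, "FDR": fdr, "F1": f1, "G": g}

def confusion_from_rates(tpr, tnr, alpha, n_total):
    """Rebuild (TP, FN, FP, TN) from TPR, TNR, skew alpha, and total size."""
    tp = tpr * alpha * n_total
    fn = (1 - tpr) * alpha * n_total
    fp = (1 - tnr) * (1 - alpha) * n_total
    tn = tnr * (1 - alpha) * n_total
    return tp, fn, fp, tn

m = rates(tp=100, fn=0, fp=100, tn=900)
print(m["FDR"], m["FPR"])  # 0.5 0.1

alpha = 100 / 1100  # fraction of actual positives in this example
rebuilt = confusion_from_rates(m["TPR"], m["TNR"], alpha, n_total=1100)
print([round(v) for v in rebuilt])  # [100, 0, 100, 900]
```

The same FPR of 0.1 would produce a much lower FDR if the positives were not so rare, which can be explored by changing `alpha` in the last call.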

Example 4.9.

Consider two positive numbers $a = 1$ and $b = 5$. Their arithmetic mean is
$\mu_a = (a+b)/2 = 3$ and their geometric mean is $\mu_g = \sqrt{ab} = 2.236$.
Their harmonic mean is $\mu_h = (2 \times 1 \times 5)/6 = 1.667$, which is
closer to the smaller value between a and b than the arithmetic and geometric
means.

A generic extension of the $F_1$ measure is the $F_\beta$ measure, which can be
defined as follows.

$$F_\beta = \frac{(\beta^2+1)rp}{r + \beta^2 p} = \frac{(\beta^2+1) \times TP}{(\beta^2+1)TP + \beta^2 FN + FP}. \qquad (4.106)$$

Both precision and recall can be viewed as special cases of $F_\beta$ by setting
$\beta = 0$ and $\beta = \infty$, respectively. Low values of $\beta$ make
$F_\beta$ closer to precision, and high values make it closer to recall.

A more general measure that captures $F_\beta$ as well as accuracy is the
weighted accuracy measure, which is defined by the following equation:

$$\text{Weighted accuracy} = \frac{w_1 TP + w_4 TN}{w_1 TP + w_2 FN + w_3 FP + w_4 TN}. \qquad (4.107)$$

The relationship between weighted accuracy and other performance measures is
summarized in the following table:

| Measure   | $w_1$       | $w_2$     | $w_3$ | $w_4$ |
|-----------|-------------|-----------|-------|-------|
| Recall    | 1           | 1         | 0     | 0     |
| Precision | 1           | 0         | 1     | 0     |
| $F_\beta$ | $\beta^2+1$ | $\beta^2$ | 1     | 0     |
| Accuracy  | 1           | 1         | 1     | 1     |
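The link between weighted accuracy and the other measures can be verified numerically. In the sketch below, the weight $w_2$ is applied to FN and $w_3$ to FP, which is the assignment under which the tabulated weights reproduce recall, precision, $F_\beta$, and accuracy; this convention, the function names, and the counts are assumptions made for illustration.

```python
def f_beta(tp, fp, fn, beta):
    """F_beta = (beta^2 + 1) r p / (r + beta^2 p), written in terms of counts."""
    b2 = beta ** 2
    return (b2 + 1) * tp / ((b2 + 1) * tp + b2 * fn + fp)

def weighted_accuracy(tp, fp, fn, tn, w1, w2, w3, w4):
    """Weighted accuracy with w2 applied to FN and w3 applied to FP."""
    return (w1 * tp + w4 * tn) / (w1 * tp + w2 * fn + w3 * fp + w4 * tn)

# Hypothetical counts.
tp, fp, fn, tn = 40, 10, 20, 930
recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + fp + fn + tn)

beta = 2.0
checks = {
    "recall": (weighted_accuracy(tp, fp, fn, tn, 1, 1, 0, 0), recall),
    "precision": (weighted_accuracy(tp, fp, fn, tn, 1, 0, 1, 0), precision),
    "f_beta": (weighted_accuracy(tp, fp, fn, tn, beta**2 + 1, beta**2, 1, 0),
               f_beta(tp, fp, fn, beta)),
    "accuracy": (weighted_accuracy(tp, fp, fn, tn, 1, 1, 1, 1), accuracy),
}
for name, (from_weights, direct) in checks.items():
    print(name, round(from_weights, 4), round(direct, 4))

# beta = 0 recovers precision; a very large beta approaches recall.
print(f_beta(tp, fp, fn, 0.0), f_beta(tp, fp, fn, 1e6))
```

Each row of the loop prints the same value twice, once from the weighted form and once from the direct definition.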

4.11.3 Finding an Optimal Score Threshold

Given a suitably chosen evaluation measure E and a distribution of
classification scores, $s(\mathbf{x})$, on a validation set, we can obtain the
optimal score threshold $s^*$ on the validation set using the following
approach:

1. Sort the scores in increasing order of their values.
2. For every unique value of score, s, consider the classification model that
   assigns an instance $\mathbf{x}$ as positive only if $s(\mathbf{x}) > s$. Let
   E(s) denote the performance of this model on the validation set.
3. Find $s^*$ that maximizes the evaluation measure, E(s):

$$s^* = \arg\max_s E(s).$$

Note that $s^*$ can be treated as a hyper-parameter of the classification
algorithm that is learned during model selection. Using $s^*$, we can assign a
positive label to a future test instance $\mathbf{x}$ only if
$s(\mathbf{x}) > s^*$. If the evaluation measure E is skew invariant (e.g.,
Positive Likelihood Ratio), then we can select $s^*$ without knowing the skew of
the test set, and the resultant classifier formed using $s^*$ can be expected to
show optimal performance on the test set (with respect to the evaluation measure
E). On the other hand, if E is sensitive to the skew (e.g., precision or
$F_1$-measure), then we need to ensure that the skew of the validation set used
for selecting $s^*$ is similar to that of the test set, so that the classifier
formed using $s^*$ shows optimal test performance with respect to E.
Alternatively, given an estimate of the skew of the test data, $\alpha$, we can
use it along with the TPR and TNR on the validation set to estimate all entries
of the confusion matrix (see Table 4.7), and thus the estimate of any evaluation
measure E on the test set. The score threshold $s^*$ selected using this
estimate of E can then be expected to produce optimal test performance with
respect to E. Furthermore, the methodology of selecting $s^*$ on the validation
set can help in comparing the test performance of different classification
algorithms, by using the optimal values of $s^*$ for each algorithm.

4.11.4 Aggregate Evaluation of Performance

Although the above approach helps in finding a score threshold $s^*$ that
provides optimal performance with respect to a desired evaluation measure and a
certain amount of skew, $\alpha$, sometimes we are interested in evaluating the
performance of a classifier on a number of possible score thresholds, each
corresponding to a different choice of evaluation measure and skew value.
Assessing the performance of a classifier over a range of score thresholds is
called aggregate evaluation of performance. In this style of analysis, the
emphasis is not on evaluating the performance of a single classifier
corresponding to the optimal score threshold, but on assessing the general
quality of ranking produced by the classification scores on the test set. In
general, this helps in obtaining robust estimates of classification performance
that are not sensitive to specific choices of score thresholds.
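The threshold-selection procedure of Section 4.11.3 can be sketched as a simple search over the unique score values. The example below uses the $F_1$ measure as the evaluation measure E; the function names, the validation scores, and the labels are hypothetical and chosen only for illustration.

```python
def f1_score(tp, fp, fn):
    """F1 measure from counts; returns 0.0 when it is undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(scores, labels, measure=f1_score):
    """Return (s_star, E(s_star)) over thresholds taken from the unique scores."""
    best_s, best_e = None, float("-inf")
    for s in sorted(set(scores)):
        # Classify an instance as positive only if its score exceeds s.
        tp = sum(1 for sc, y in zip(scores, labels) if sc > s and y == 1)
        fp = sum(1 for sc, y in zip(scores, labels) if sc > s and y == 0)
        fn = sum(1 for sc, y in zip(scores, labels) if sc <= s and y == 1)
        e = measure(tp, fp, fn)
        if e > best_e:
            best_s, best_e = s, e
    return best_s, best_e

# Hypothetical validation scores and labels (1 = positive).
scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1]

s_star, e_star = best_threshold(scores, labels)
print(s_star, e_star)  # 0.6 0.75
```

Since $F_1$ is sensitive to skew, the threshold found this way is only expected to carry over to test data whose class proportions resemble those of the validation set.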

ROC Curve

One of the widely-used tools for aggregate evaluation is the receiver

operating characteristic (ROC) curve. An ROC curve is a graphical

approach for displaying the trade-off between TPR and FPR of a classifier,

over varying score thresholds. In an ROC curve, the TPR is plotted along the

y-axis and the FPR is shown on the x-axis. Each point along the curve

corresponds to a classification model generated by placing a threshold on the

test scores produced by the classifier. The following procedure describes the

generic approach for computing an ROC curve:

1. Sort the test instances in increasing order of their scores.
2. Select the lowest ranked test instance (i.e., the instance with lowest
   score). Assign the selected instance and those ranked above it to the
   positive class. This approach is equivalent to classifying all the test
   instances as positive class. Because all the positive examples are classified
   correctly and the negative examples are misclassified, TPR = FPR = 1.
3. Select the next test instance from the sorted list. Classify the selected
   instance and those ranked above it as positive, while those ranked below it
   as negative. Update the counts of TP and FP by examining the actual class
   label of the selected instance. If this instance belongs to the positive
   class, the TP count is decremented and the FP count remains the same as
   before. If the instance belongs to the negative class, the FP count is
   decremented and TP count remains the same as before.
4. Repeat Step 3 and update the TP and FP counts accordingly until the highest
   ranked test instance is selected. At this final threshold, TPR = FPR = 0, as
   all instances are labeled as negative.
5. Plot the TPR against FPR of the classifier.
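The procedure for computing an ROC curve can be sketched as follows. This is a simplified illustration that assumes all scores are distinct (tied scores would need to be moved across the threshold together); the function name `roc_points` and the ten scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per threshold, from (1, 1) down to (0, 0)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Step 1: sort instances by increasing score.
    ranked = sorted(zip(scores, labels))
    # Step 2: start with every instance predicted positive.
    tp, fp = n_pos, n_neg
    points = [(fp / n_neg, tp / n_pos)]
    # Steps 3-4: move the threshold past one instance at a time.
    for _, label in ranked:
        if label == 1:
            tp -= 1   # a positive instance is now predicted negative
        else:
            fp -= 1   # a negative instance is now predicted negative
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical test set with five positives and five negatives.
scores = [0.25, 0.43, 0.53, 0.61, 0.70, 0.76, 0.85, 0.87, 0.93, 0.95]
labels = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

points = roc_points(scores, labels)
print(points[0], points[1], points[-1])  # (1.0, 1.0) (1.0, 0.8) (0.0, 0.0)
```

Plotting the second coordinate against the first gives the staircase-shaped curve (Step 5); any plotting library can be used for that.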

Example 4.10. [Generating ROC Curve]

Figure 4.52 shows an example of how to compute the TPR and FPR

values for every choice of score threshold. There are five positive

examples and five negative examples in the test set. The class labels of

the test instances are shown in the first row of the table, while the second

row corresponds to the sorted score values for each instance. The next six

rows contain the counts of TP, FP, TN, and FN, along with their
corresponding TPR and FPR. The table is then filled from left to right.
Initially, all the instances are predicted to be positive. Thus, TP = FP = 5
and TPR = FPR = 1. Next, we assign the test instance with the lowest score as
the negative class. Because the selected instance is actually a positive
example, the TP count decreases from 5 to 4 and the FP count is the same as
before. The FPR and TPR are updated accordingly. This process is repeated until
we reach the end of the list, where TPR = 0 and FPR = 0. The ROC curve for this
example is shown in Figure 4.53.

Figure 4.52.

Computing the TPR and FPR at every score threshold.

Figure 4.53.

ROC curve for the data shown in Figure 4.52 .

Note that in an ROC curve, the TPR monotonically increases with FPR,

because the inclusion of a test instance in the set of predicted positives can

either increase the TPR or the FPR. The ROC curve thus has a staircase

pattern. Furthermore, there are several critical points along an ROC curve that

have well-known interpretations:

- (TPR=0, FPR=0): Model predicts every instance to be a negative class.
- (TPR=1, FPR=1): Model predicts every instance to be a positive class.
- (TPR=1, FPR=0): The perfect model with zero misclassifications.

A good classification model should be located as close as possible to the
upper left corner of the diagram, while a model that makes random guesses
should reside along the main diagonal, connecting the points (TPR=0, FPR=0)
and (TPR=1, FPR=1). Random guessing means that an instance is classified as a
positive class with a fixed probability p, irrespective of its attribute set.

For example, consider a data set that contains $n_+$ positive instances and
$n_-$ negative instances. The random classifier is expected to correctly
classify $pn_+$ of the positive instances and to misclassify $pn_-$ of the
negative instances. Therefore, the TPR of the classifier is $(pn_+)/n_+ = p$,
while its FPR is $(pn_-)/n_- = p$. Hence, this random classifier will reside at
the point (p, p) in the ROC curve along the main diagonal.

Figure 4.54.

ROC curves for two different classifiers.

Since every point on the ROC curve represents the performance of a classifier
generated using a particular score threshold, they can be viewed as different
operating points of the classifier. One may choose one of these operating
points depending on the requirements of the application. Hence, an ROC curve
facilitates the comparison of classifiers over a range of operating points. For
example, Figure 4.54 compares the ROC curves of two classifiers, $M_1$ and
$M_2$, generated by varying the score thresholds. We can see that $M_1$ is
better than $M_2$ when FPR is less than 0.36, as $M_1$ shows better TPR than
$M_2$ for this range of operating points. On the other hand, $M_2$ is superior
when FPR is greater than 0.36, since the TPR of $M_2$ is higher than that of
$M_1$ for this range. Clearly, neither of the two classifiers dominates (is
strictly better than) the other, i.e., shows higher values of TPR and lower
values of FPR over all operating points.

To summarize the aggregate behavior across all operating points, one of the
commonly used measures is the area under the ROC curve (AUC). If the classifier
is perfect, then its area under the ROC curve will be equal to 1. If the
algorithm simply performs random guessing, then its area under the ROC curve
will be equal to 0.5.

Although the AUC provides a useful summary of aggregate performance, there are
certain caveats in using the AUC for comparing classifiers. First, even if the
AUC of algorithm A is higher than the AUC of another algorithm B, this does not
mean that algorithm A is always better than B, i.e., that the ROC curve of A
dominates that of B across all operating points. For example, even though $M_1$
shows a slightly lower AUC than $M_2$ in Figure 4.54, we can see that both
$M_1$ and $M_2$ are useful over different ranges of operating points and neither
of them is strictly better than the other across all possible operating points.
Hence, we cannot use the AUC to determine which algorithm is better, unless we
know that the ROC curve of one of the algorithms dominates the other.

Second, although the AUC summarizes the aggregate performance over all
operating points, we are often interested in only a small range of operating
points in most applications. For example, even though $M_1$ shows slightly lower
AUC than $M_2$, it shows higher TPR values than $M_2$ for small FPR values
(smaller than 0.36). In the presence of class imbalance, the behavior of an
algorithm over small FPR values (also termed as early retrieval) is often more
meaningful for comparison than the performance over all FPR values. This is
because, in many applications, it is important to assess the TPR achieved by a
classifier in the first few instances with highest scores, without incurring a
large FPR. Hence, in Figure 4.54, due to the high TPR values of $M_1$ during
early retrieval (FPR < 0.36), we may prefer $M_1$ over $M_2$ for imbalanced test
sets, despite the lower AUC of $M_1$. Thus, care must be taken while comparing
the AUC values of different classifiers, usually by visualizing their ROC curves
rather than just reporting their AUC.

A key characteristic of ROC curves is that they are agnostic to the skew in the
test set, because both the evaluation measures used in constructing ROC curves
(TPR and FPR) are invariant to class imbalance. Hence, ROC curves are not
suitable for measuring the impact of skew on classification performance. In
particular, we will obtain the same ROC curve for two test data sets that have
very different skew.
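The area under the ROC curve and its insensitivity to skew can be illustrated with a short sketch. It relies on a standard equivalence that is not stated in the text: the AUC equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as one half. The function names and the data are hypothetical.

```python
def auc_from_points(points):
    """Area under a curve given (FPR, TPR) points, using the trapezoidal rule."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

def auc_by_ranking(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

perfect_auc = auc_from_points([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)])
random_auc = auc_from_points([(0.0, 0.0), (1.0, 1.0)])
print(perfect_auc, random_auc)  # 1.0 0.5

# Hypothetical test set with five positives and five negatives.
scores = [0.25, 0.43, 0.53, 0.61, 0.70, 0.76, 0.85, 0.87, 0.93, 0.95]
labels = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
balanced_auc = auc_by_ranking(scores, labels)

# Same scores, but every negative instance replicated 10 times.
skewed_scores = [s for s, y in zip(scores, labels) for _ in range(1 if y else 10)]
skewed_labels = [y for y in labels for _ in range(1 if y else 10)]
skewed_auc = auc_by_ranking(skewed_scores, skewed_labels)

print(round(balanced_auc, 4), round(skewed_auc, 4))  # 0.6 0.6
```

Replicating the negatives changes the skew from 0.5 to 5/55 but leaves the AUC unchanged, in line with the skew invariance of TPR and FPR.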

Figure 4.55.

PR curves for two different classifiers.

Precision-Recall Curve

An alternate tool for aggregate evaluation is the precision-recall curve (PR
curve). The PR curve plots the precision and recall values of a classifier on
the y and x axes respectively, by varying the threshold on the test scores.
Figure 4.55 shows an example of PR curves for two hypothetical classifiers,
$M_1$ and $M_2$. The approach for generating a PR curve is similar to the
approach described above for generating an ROC curve. However, there are some
key distinguishing features in the PR curve:

1. PR curves are sensitive to the skew factor $\alpha = P/(P+N)$, and different
   PR curves are generated for different values of $\alpha$.
2. When the score threshold is lowest (every instance is labeled as positive),
   the precision is equal to $\alpha$ while recall is 1. As we increase the
   score threshold, the number of predicted positives can stay the same or
   decrease. Hence, the recall monotonically declines as the score threshold
   increases. In general, the precision may increase or decrease for the same
   value of recall, upon addition of an instance into the set of predicted
   positives. For example, if the $k^{\text{th}}$ ranked instance belongs to the
   negative class, then including it will result in a drop in the precision
   without affecting the recall. The precision may improve at the next step,
   which adds the $(k+1)^{\text{th}}$ ranked instance, if this instance belongs
   to the positive class. Hence, the PR curve is not a smooth, monotonically
   increasing curve like the ROC curve, and generally has a zigzag pattern. This
   pattern is more prominent in the left part of the curve, where even a small
   change in the number of false positives can cause a large change in
   precision.
3. As we increase the imbalance among the classes (reduce the value of
   $\alpha$), the rightmost points of all PR curves will move downwards. At and
   near the leftmost point on the PR curve (corresponding to larger values of
   score threshold), the recall is close to zero, while the precision is equal
   to the fraction of positives in the top ranked instances of the algorithm.
   Hence, different classifiers can have different values of precision at the
   leftmost points of the PR curve. Also, if the classification score of an
   algorithm monotonically varies with the posterior probability of the positive
   class, we can expect the PR curve to gradually decrease from a high value of
   precision on the leftmost point to a constant value of $\alpha$ at the
   rightmost point, albeit with some ups and downs. This can be observed in the
   PR curve of algorithm $M_1$ in Figure 4.55, which starts from a higher value
   of precision on the left that gradually decreases as we move towards the
   right. On the other hand, the PR curve of algorithm $M_2$ starts from a lower
   value of precision on the left and shows more drastic ups and downs as we
   move right, suggesting that the classification score of $M_2$ shows a weaker
   monotonic relationship with the posterior probability of the positive class.
4. A random classifier that assigns an instance to be positive with a fixed
   probability p has a precision of $\alpha$ and a recall of p. Hence, a
   classifier that performs random guessing has a horizontal PR curve with
   $y = \alpha$, as shown using a dashed line in Figure 4.55. Note that the
   random baseline in PR curves depends on the skew in the test set, in contrast
   to the fixed main diagonal of ROC curves that represents random classifiers.
5. Note that the precision of an algorithm is impacted more strongly by false
   positives in the top ranked test instances than the FPR of the algorithm. For
   this reason, the PR curve generally helps to magnify the differences between
   classifiers in the left portion of the PR curve. Hence, in the presence of
   class imbalance in the test data, analyzing the PR curves generally provides
   a deeper insight into the performance of classifiers than the ROC curves,
   especially in the early retrieval range of operating points.
6. The classifier corresponding to (precision=1, recall=1) represents the
   perfect classifier. Similar to AUC, we can also compute the area under the PR
   curve of an algorithm, known as AUC-PR. The AUC-PR of a random classifier is
   equal to $\alpha$, while that of a perfect algorithm is equal to 1. Note that
   AUC-PR varies with changing skew in the test set, in contrast to the area
   under the ROC curve, which is insensitive to the skew. The AUC-PR helps in
   accentuating the differences between classification algorithms in the early
   retrieval range of operating points. Hence, it is more suited for evaluating
   classification performance in the presence of class imbalance than the area
   under the ROC curve. However, similar to ROC curves, a higher value of AUC-PR
   does not guarantee the superiority of a classification algorithm over
   another. This is because the PR curves of two algorithms can easily cross
   each other, such that they both show better performances in different ranges
   of operating points. Hence, it is important to visualize the PR curves before
   comparing their AUC-PR values, in order to ensure a meaningful evaluation.
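The dependence of the PR curve on skew can be seen in a small sketch. The code below computes (recall, precision) points by admitting one instance at a time into the set of predicted positives, starting from the highest score; it assumes distinct scores at the points being inspected, and the function name `pr_points` and the data are hypothetical.

```python
def pr_points(scores, labels):
    """Return (recall, precision) points, lowering the threshold one instance at a time."""
    n_pos = sum(labels)
    ranked = sorted(zip(scores, labels), reverse=True)  # highest score first
    tp = fp = 0
    points = []
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / n_pos, tp / (tp + fp)))
    return points

# Hypothetical test set with five positives and five negatives (alpha = 0.5).
scores = [0.25, 0.43, 0.53, 0.61, 0.70, 0.76, 0.85, 0.87, 0.93, 0.95]
labels = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
balanced = pr_points(scores, labels)

# Same scores, but every negative instance replicated 10 times (alpha = 5/55).
skewed_scores = [s for s, y in zip(scores, labels) for _ in range(1 if y else 10)]
skewed_labels = [y for y in labels for _ in range(1 if y else 10)]
skewed = pr_points(skewed_scores, skewed_labels)

print(balanced[0], balanced[-1])  # (0.2, 1.0) (1.0, 0.5)
print(skewed[-1][0], round(skewed[-1][1], 4))  # 1.0 0.0909
```

In both cases the rightmost point has recall 1 and precision equal to the skew, so increasing the imbalance pulls the right end of the curve downwards.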

4.12 Multiclass Problem

Some of the classification techniques described in this chapter are originally

designed for binary classification problems. Yet there are many real-world

problems, such as character recognition, face identification, and text

classification, where the input data is divided into more than two categories.

This section presents several approaches for extending the binary classifiers

to handle multiclass problems. To illustrate these approaches, let

$Y = \{y_1, y_2, \ldots, y_K\}$ be the set of classes of the input data.

The first approach decomposes the multiclass problem into K binary

problems. For each class $y_i \in Y$, a binary problem is created where all

instances that belong to $y_i$ are considered positive examples, while the

remaining instances are considered negative examples. A binary classifier is

then constructed to separate instances of class $y_i$ from the rest of the classes.

This is known as the one-against-rest (1-r) approach.

The second approach, which is known as the one-against-one (1-1) approach,

constructs $K(K-1)/2$ binary classifiers, where each classifier is used to

distinguish between a pair of classes, $(y_i, y_j)$. Instances that do not belong to

either $y_i$ or $y_j$ are ignored when constructing the binary classifier for $(y_i, y_j)$. In

both 1-r and 1-1 approaches, a test instance is classified by combining the

predictions made by the binary classifiers. A voting scheme is typically

employed to combine the predictions, where the class that receives the

highest number of votes is assigned to the test instance. In the 1-r approach,

if an instance is classified as negative, then all classes except for the positive

class receive a vote. This approach, however, may lead to ties among the

different classes. Another possibility is to transform the outputs of the binary

classifiers into probability estimates and then assign the test instance to the

class that has the highest probability.
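The two voting schemes described above can be sketched as follows for a hypothetical four-class problem. The function names, the '+'/'-' encoding of the binary outputs, and the particular outputs used here are assumptions made for illustration; the binary classifiers themselves are not modeled, only the way their predictions are combined.

```python
from collections import Counter


def vote_one_vs_rest(classes, outputs):
    """outputs[i] is '+' or '-' from the classifier that treats classes[i]
    as the positive class. A '+' gives one vote to classes[i]; a '-' gives
    one vote to every class other than classes[i]."""
    votes = Counter({c: 0 for c in classes})
    for positive, out in zip(classes, outputs):
        if out == '+':
            votes[positive] += 1
        else:
            for c in classes:
                if c != positive:
                    votes[c] += 1
    return votes


def vote_one_vs_one(classes, outputs):
    """outputs[(a, b)] is '+' if the pairwise classifier predicts a,
    and '-' if it predicts b."""
    votes = Counter({c: 0 for c in classes})
    for (a, b), out in outputs.items():
        votes[a if out == '+' else b] += 1
    return votes


def winners(votes):
    """Return all classes that share the highest vote count (ties included)."""
    top = max(votes.values())
    return sorted(c for c, v in votes.items() if v == top)


classes = ['y1', 'y2', 'y3', 'y4']

# 1-r: only the classifier for y1 predicts positive.
ovr = vote_one_vs_rest(classes, ['+', '-', '-', '-'])
print(winners(ovr))

# 1-1: one output for each of the K(K-1)/2 = 6 pairs.
ovo = vote_one_vs_one(classes, {
    ('y1', 'y2'): '+', ('y1', 'y3'): '+', ('y1', 'y4'): '-',
    ('y2', 'y3'): '+', ('y2', 'y4'): '-', ('y3', 'y4'): '+',
})
print(winners(ovo))
```

Returning every class with the top vote count, rather than a single label, leaves the tie-breaking policy to the caller.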

Example 4.11.

Consider a multiclass problem where $Y = \{y_1, y_2, y_3, y_4\}$. Suppose a test

instance is classified as $(+, -, -, -)$ according to the 1-r approach. In other

words, it is classified as positive when $y_1$ is used as the positive class and

negative when $y_2$, $y_3$, and $y_4$ are used as the positive class. Using a

simple majority vote, notice that $y_1$ receives the highest number of votes,

which is four, while the remaining classes receive only two votes each. The

test instance is therefore classified as $y_1$.

Example 4.12.

Suppose the test instance is classified using the 1-1 approach as follows:

Binary pair      +: $y_1$   +: $y_1$   +: $y_1$   +: $y_2$   +: $y_2$   +: $y_3$
of classes       −: $y_2$   −: $y_3$   −: $y_4$   −: $y_3$   −: $y_4$   −: $y_4$

Classification      +          +          −          +          −          +

The first two rows in this table correspond to the pair of classes $(y_i, y_j)$

chosen to build the classifier and the last row represents the predicted

class for the test instance. After combining the predictions, $y_1$ and $y_4$ each

receive two votes, while $y_2$ and $y_3$ each receive only one vote. The test

instance is therefore classified as either $y_1$ or $y_4$, depending on the

tie-breaking procedure.

Error-Correcting Output Coding

A potential problem with the previous two approaches is that they may be

sensitive to binary classification errors. For the 1-r approach given in Example

4.11, if at least one of the binary classifiers makes a mistake in its

prediction, then the classifier may end up declaring a tie between classes or

making a wrong prediction. For example, suppose the test instance is

classified as $(+, -, +, -)$ due to misclassification by the third classifier. In this

case, it will be difficult to tell whether the instance should be classified as $y_1$ or

$y_3$, unless the probability associated with each class prediction is taken into

account.

The error-correcting output coding (ECOC) method provides a more robust

way for handling multiclass problems. The method is inspired by an

information-theoretic approach for sending messages across noisy channels.

The idea behind this approach is to add redundancy into the transmitted

message by means of a codeword, so that the receiver may detect errors in

the received message and perhaps recover the original message if the

number of errors is small.

For multiclass learning, each class $y_i$ is represented by a unique bit string of

length n known as its codeword. We then train n binary classifiers to predict

each bit of the codeword string. The predicted class of a test instance is the

class whose codeword is closest in Hamming distance to the codeword

produced by the binary classifiers. Recall that the Hamming distance between

a pair of bit strings is given by the number of bits that differ.

Example 4.13.

Consider a multiclass problem where $Y = \{y_1, y_2, y_3, y_4\}$. Suppose we

encode the classes using the following seven-bit codewords:

Class    Codeword

$y_1$    1 1 1 1 1 1 1

$y_2$    0 0 0 0 1 1 1

$y_3$    0 0 1 1 0 0 1

$y_4$    0 1 0 1 0 1 0

Each bit of the codeword is used to train a binary classifier. If a test

instance is classified as (0,1,1,1,1,1,1) by the binary classifiers, then the

Hamming distance between the codeword and $y_1$ is 1, while the Hamming

distance to the remaining classes is 3. The test instance is therefore

classified as $y_1$.

An interesting property of an error-correcting code is that if the minimum

Hamming distance between any pair of codewords is d, then any $\lfloor (d-1)/2 \rfloor$

errors in the output code can be corrected using its nearest codeword. In

Example 4.13 , because the minimum Hamming distance between any pair

of codewords is 4, the classifier may tolerate errors made by one of the seven

binary classifiers. If there is more than one classifier that makes a mistake,

then the classifier may not be able to compensate for the error.
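The decoding step and the error-correcting property can be checked with a short sketch. The codewords below are the seven-bit codewords of Example 4.13; the names `hamming`, `decode`, and `min_distance` are assumptions introduced for illustration, and the binary classifiers are replaced by a fixed output vector.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Codewords of Example 4.13, one row per class.
CODEWORDS = {
    "y1": (1, 1, 1, 1, 1, 1, 1),
    "y2": (0, 0, 0, 0, 1, 1, 1),
    "y3": (0, 0, 1, 1, 0, 0, 1),
    "y4": (0, 1, 0, 1, 0, 1, 0),
}


def decode(bits, codewords=CODEWORDS):
    """Return the class with the nearest codeword and all distances."""
    distances = {c: hamming(bits, w) for c, w in codewords.items()}
    return min(distances, key=distances.get), distances


def min_distance(codewords=CODEWORDS):
    """Minimum Hamming distance d between any pair of codewords."""
    words = list(codewords.values())
    return min(hamming(a, b)
               for i, a in enumerate(words)
               for b in words[i + 1:])


# Output of the seven binary classifiers, with an error in the first bit.
label, distances = decode((0, 1, 1, 1, 1, 1, 1))
d = min_distance()
correctable = (d - 1) // 2   # floor((d - 1) / 2)

print(label, distances)
print(d, correctable)
```

With d = 4 the code corrects a single bit error; when two bits are wrong, the output may be equally close to two codewords, so the nearest codeword is no longer guaranteed to be the right one.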

An important issue is how to design the appropriate set of codewords for

different classes. From coding theory, a vast number of algorithms have been

developed for generating n-bit codewords with bounded Hamming distance.

However, the discussion of these algorithms is beyond the scope of this book.

It is worthwhile mentioning that there is a significant difference between the

design of error-correcting codes for communication tasks compared to those

used for multiclass learning. For communication, the codewords should

maximize the Hamming distance between the rows so that error correction

can be performed. Multiclass learning, however, requires that both the row-wise

and column-wise distances of the codewords must be well separated. A

larger column-wise distance ensures that the binary classifiers are mutually

independent, which is an important requirement for ensemble learning

methods.

4.13 Bibliographic Notes

Mitchell [278] provides excellent coverage on many classification techniques

from a machine learning perspective. Extensive coverage on classification can

also be found in Aggarwal [195], Duda et al. [229], Webb [307], Fukunaga

[237], Bishop [204], Hastie et al. [249], Cherkassky and Mulier [215], Witten

and Frank [310], Hand et al. [247], Han and Kamber [244], and Dunham [230].

Direct methods for rule-based classifiers typically employ the sequential

covering scheme for inducing classification rules. Holte’s 1R [255] is the

simplest form of a rule-based classifier because its rule set contains only a

single rule. Despite its simplicity, Holte found that for some data sets that

exhibit a strong one-to-one relationship between the attributes and the class

label, 1R performs just as well as other classifiers. Other examples of rule-

based classifiers include IREP [234], RIPPER [218], CN2 [216, 217], AQ

[276], RISE [224], and ITRULE [296]. Table 4.8 shows a comparison of the

characteristics of four of these classifiers.

Table 4.8. Comparison of various rule-based classifiers.

| | RIPPER | CN2 (unordered) | CN2 (ordered) | AQR |
|---|---|---|---|---|
| Rule-growing strategy | General-to-specific | General-to-specific | General-to-specific | General-to-specific (seeded by a positive example) |
| Evaluation metric | FOIL’s Info gain | Laplace | Entropy and likelihood ratio | Number of true positives |
| Stopping condition for rule-growing | All examples belong to the same class | No performance gain | No performance gain | Rules cover only positive class |
| Rule pruning | Reduced error pruning | None | None | None |
| Instance elimination | Positive and negative | Positive only | Positive only | Positive and negative |
| Stopping condition for adding rules | Error > 50% or based on MDL | No performance gain | No performance gain | All positive examples are covered |
| Rule set pruning | Replace or modify rules | Statistical tests | None | None |
| Search strategy | Greedy | Beam search | Beam search | Beam search |

For rule-based classifiers, the rule antecedent can be generalized to include

any propositional or first-order logical expression (e.g., Horn clauses).

Readers who are interested in first-order logic rule-based classifiers may refer

to references such as [278] or the vast literature on inductive logic

programming [279]. Quinlan [287] proposed the C4.5rules algorithm for

extracting classification rules from decision trees. An indirect method for

extracting rules from artificial neural networks was given by Andrews et al. in

[198].

Cover and Hart [220] presented an overview of the nearest neighbor

classification method from a Bayesian perspective. Aha provided both

theoretical and empirical evaluations for instance-based methods in [196].

PEBLS, which was developed by Cost and Salzberg [219], is a nearest

neighbor classifier that can handle data sets containing nominal attributes.

Each training example in PEBLS is also assigned a weight factor that

depends on the number of times the example helps make a correct prediction.

Han et al. [243] developed a weight-adjusted nearest neighbor algorithm, in

which the feature weights are learned using a greedy, hill-climbing

optimization algorithm. A more recent survey of k-nearest neighbor

classification is given by Steinbach and Tan [298].

Naïve Bayes classifiers have been investigated by many authors, including

Langley et al. [267], Ramoni and Sebastiani [288], Lewis [270], and Domingos

and Pazzani [227]. Although the independence assumption used in naïve

Bayes classifiers may seem rather unrealistic, the method has worked

surprisingly well for applications such as text classification. Bayesian networks

provide a more flexible approach by allowing some of the attributes to be

interdependent. An excellent tutorial on Bayesian networks is given by

Heckerman in [252] and Jensen in [258]. Bayesian networks belong to a

broader class of models known as probabilistic graphical models. A formal

introduction to the relationships between graphs and probabilities was

presented in Pearl [283]. Other great resources on probabilistic graphical

models include books by Bishop [205], and Jordan [259]. Detailed discussions

of concepts such as d-separation and Markov blankets are provided in Geiger

et al. [238] and Russell and Norvig [291].

Generalized linear models (GLM) are a rich class of regression models that

have been extensively studied in the statistical literature. They were

formulated by Nelder and Wedderburn in 1972 [280] to unify a number of

regression models such as linear regression, logistic regression, and Poisson

regression, which share some similarities in their formulations. An extensive

discussion of GLMs is provided in the book by McCullagh and Nelder [274].

Artificial neural networks (ANN) have witnessed a long and winding history of

developments, involving multiple phases of stagnation and resurgence. The

idea of a mathematical model of a neural network was first introduced in 1943

by McCulloch and Pitts [275]. This led to a series of computational machines

to simulate a neural network based on the theory of neural plasticity [289].

The perceptron, which is the simplest prototype of modern ANNs, was

developed by Rosenblatt in 1958 [290]. The perceptron uses a single layer of

processing units that can perform basic mathematical operations such as

addition and multiplication. However, the perceptron can only learn linear

decision boundaries and is guaranteed to converge only when the classes are

linearly separable. Despite the interest in learning multi-layer networks to

overcome the limitations of the perceptron, progress in this area remained halted

until the invention of the backpropagation algorithm by Werbos in 1974 [309],

which allowed for the quick training of multi-layer ANNs using the gradient

descent method. This led to an upsurge of interest in the artificial intelligence

(AI) community to develop multi-layer ANN models, a trend that continued for

more than a decade. Historically, ANNs mark a paradigm shift in AI from

approaches based on expert systems (where knowledge is encoded using if-

then rules) to machine learning approaches (where the knowledge is encoded

in the parameters of a computational model). However, there were still a

number of algorithmic and computational challenges in learning large ANN

models, which remained unresolved for a long time. This hindered the

development of ANN models to the scale necessary for solving real-world

problems. Slowly, ANNs started getting outpaced by other classification

models such as support vector machines, which provided better performance

as well as theoretical guarantees of convergence and optimality. It is only

recently that the challenges in learning deep neural networks have been

circumvented, owing to better computational resources and a number of

algorithmic improvements in ANNs since 2006. This re-emergence of ANNs

has been dubbed “deep learning,” which has often outperformed existing

classification models and gained wide-spread popularity.

Deep learning is a rapidly evolving area of research with a number of

potentially impactful contributions being made every year. Some of the

landmark advancements in deep learning include the use of large-scale

restricted Boltzmann machines for learning generative models of data [201,

253], the use of autoencoders and their variants (denoising autoencoders) for

learning robust feature representations [199, 305, 306], and sophisticated

architectures to promote sharing of parameters across nodes such as

convolutional neural networks for images [265, 268] and recurrent neural

networks for sequences [241, 242, 277]. Other major improvements include

the approach of unsupervised pretraining for initializing ANN models [232], the

dropout technique for regularization [254, 297], batch normalization for fast

learning of ANN parameters [256], and maxout networks for effective usage of

the dropout technique [240]. Even though the discussions in this chapter on

learning ANN models were centered around the gradient descent method,

most of the modern ANN models involving a large number of hidden layers

are trained using the stochastic gradient descent method since it is highly

scalable [207]. An extensive survey of deep learning approaches has been

presented in review articles by Bengio [200], LeCun et al. [269], and

Schmidhuber [293]. An excellent summary of deep learning approaches can

also be obtained from recent books by Goodfellow et al. [239] and Nielsen

[281].

Vapnik [303, 304] has written two authoritative books on Support Vector

Machines (SVM). Other useful resources on SVM and kernel methods include

the books by Cristianini and Shawe-Taylor [221] and Schölkopf and Smola

[294]. There are several survey articles on SVM, including those written by

Burges [212], Bennett and Campbell [202], Hearst [251], and Mangasarian [272]. SVM

can also be viewed as an $L_2$ norm regularizer of the hinge loss function, as

described in detail by Hastie et al. [249]. The $L_1$ norm regularizer of the

square loss function can be obtained using the least absolute shrinkage and

selection operator (Lasso), which was introduced by Tibshirani in 1996 [301].

The Lasso has several interesting properties such as the ability to

simultaneously perform feature selection as well as regularization, so that only

a subset of features are selected in the final model. An excellent review of

Lasso can be obtained from a book by Hastie et al. [250].

A survey of ensemble methods in machine learning was given by Dietterich

[222]. The bagging method was proposed by Breiman [209]. Freund and

Schapire [236] developed the AdaBoost algorithm. Arcing, which stands for

adaptive resampling and combining, is a variant of the boosting algorithm

proposed by Breiman [210]. It uses the non-uniform weights assigned to

training examples to resample the data for building an ensemble of training

sets. Unlike AdaBoost, the votes of the base classifiers are not weighted when

determining the class label of test examples. The random forest method was

introduced by Breiman in [211]. The concept of bias-variance decomposition is

explained in detail by Hastie et al. [249]. While the bias-variance

decomposition was initially proposed for regression problems with squared

loss function, a unified framework for classification problems involving 0–1

losses was introduced by Domingos [226].

Related work on mining rare and imbalanced data sets can be found in the

survey papers written by Chawla et al. [214] and Weiss [308]. Sampling-based

methods for mining imbalanced data sets have been investigated by many

authors, such as Kubat and Matwin [266], Japkowicz [257], and Drummond

and Holte [228]. Joshi et al. [261] discussed the limitations of boosting

algorithms for rare class modeling. Other algorithms developed for mining rare

classes include SMOTE [213], PNrule [260], and CREDOS [262].

Various alternative metrics that are well-suited for class imbalanced problems

are available. The precision, recall, and $F_1$-measure are widely used metrics

in information retrieval [302]. ROC analysis was originally used in signal

detection theory for performing aggregate evaluation over a range of score

thresholds. A method for comparing classifier performance using the convex

hull of ROC curves was suggested by Provost and Fawcett in [286]. Bradley

[208] investigated the use of area under the ROC curve (AUC) as a

performance metric for machine learning algorithms. Despite the vast body of

literature on optimizing the AUC measure in machine learning models, it is

well-known that AUC suffers from certain limitations. For example, the AUC

can be used to compare the quality of two classifiers only if the ROC curve of

one classifier strictly dominates the other. However, if the ROC curves of two

classifiers intersect at any point, then it is difficult to assess the relative quality

of classifiers using the AUC measure. An in-depth discussion of the pitfalls in

using AUC as a performance measure can be obtained in works by Hand

[245, 246], and Powers [284]. The AUC has also been considered to be an

incoherent measure of performance, i.e., it uses different scales while

comparing the performance of different classifiers, although a coherent

interpretation of AUC has been provided by Ferri et al. [235]. Berrar and Flach

[203] describe some of the common caveats in using the ROC curve for

clinical microarray research. An alternate approach for measuring the

aggregate performance of a classifier is the precision-recall (PR) curve, which

is especially useful in the presence of class imbalance [292].

An excellent tutorial on cost-sensitive learning can be found in a review article

by Ling and Sheng [271]. The properties of a cost matrix were studied by

Elkan in [231]. Margineantu and Dietterich [273] examined various methods

for incorporating cost information into the C4.5 learning algorithm, including

wrapper methods, class distribution-based methods, and loss-based methods.

Other cost-sensitive learning methods that are algorithm-independent include

AdaCost [233], MetaCost [225], and costing [312].

Extensive literature is also available on the subject of multiclass learning. This

includes the works of Hastie and Tibshirani [248], Allwein et al. [197], Kong

and Dietterich [264], and Tax and Duin [300]. The error-correcting output

coding (ECOC) method was proposed by Dietterich and Bakiri [223]. They

also investigated techniques for designing codes that are suitable for

solving multiclass problems.

Apart from exploring algorithms for traditional classification settings where

every instance has a single set of features with a unique categorical label,

there has been a lot of recent interest in non-traditional classification

paradigms, involving complex forms of inputs and outputs. For example, the

paradigm of multi-label learning allows for an instance to be assigned multiple

class labels rather than just one. This is useful in applications such as object

recognition in images, where a photo may include more than one

classification object, such as grass, sky, trees, and mountains. A survey on

multi-label learning can be found in [313]. As another example, the paradigm

of multi-instance learning considers the problem where the instances are

available in the form of groups called bags, and training labels are available at

the level of bags rather than individual instances. Multi-instance learning is

useful in applications where an object can exist as multiple instances in

different states (e.g., the different isomers of a chemical compound), and even

if a single instance shows a specific characteristic, the entire bag of instances

associated with the object needs to be assigned the relevant class. A survey

on multi-instance learning is provided in [314].

In a number of real-world applications, it is often the case that the training

labels are scarce in quantity, because of the high costs associated with

obtaining gold-standard supervision. However, we almost always have

abundant access to unlabeled test instances, which do not have supervised

labels but contain valuable information about the structure or distribution of

instances. Traditional learning algorithms, which only make use of the labeled

instances in the training set for learning the decision boundary, are unable to

exploit the information contained in unlabeled instances. In contrast, learning

algorithms that make use of the structure in the unlabeled data for learning the

classification model are known as semi-supervised learning algorithms [315,

316]. The use of unlabeled data is also explored in the paradigm of multi-view

learning [299, 311], where every object is observed in multiple views of the

data, involving diverse sets of features. A common strategy used by multi-view

learning algorithms is co-training [206], where a different model is learned for

every view of the data, but the model predictions from every view are

constrained to be identical to each other on the unlabeled test instances.

Another learning paradigm that is commonly explored in the paucity of training

data is the framework of active learning, which attempts to seek the smallest

set of label annotations to learn a reasonable classification model. Active

learning expects the annotator to be involved in the process of model learning,

so that the labels are requested incrementally over the most relevant set of

instances, given a limited budget of label annotations. For example, it may be

useful to obtain labels over instances closer to the decision boundary that can

play a bigger role in fine-tuning the boundary. A review on active learning

approaches can be found in [285, 295].
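One simple way to pick instances near the decision boundary is uncertainty sampling for a binary classifier, sketched below. The function name `select_queries`, the use of predicted positive-class probabilities as the measure of closeness to the boundary, and the toy pool are assumptions made for illustration; the surveys cited above cover many other query strategies.

```python
def select_queries(probabilities, budget):
    """Return the indices of the `budget` unlabeled instances whose
    predicted positive-class probability is closest to 0.5, i.e., the
    instances the current model is least certain about."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:budget]


# Hypothetical model outputs on a pool of six unlabeled instances.
pool_probs = [0.95, 0.52, 0.10, 0.45, 0.80, 0.30]

# With a budget of two labels, request the two most uncertain instances.
print(select_queries(pool_probs, budget=2))  # [1, 3]
```

In an active learning loop, the selected instances would be sent to the annotator, added to the training set, and the model retrained before the next round of queries.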

In some applications, it is important to simultaneously solve multiple learning

tasks together, where some of the tasks may be similar to one another. For

example, if we are interested in translating a passage written in English into

different languages, the tasks involving lexically similar languages (such as

Spanish and Portuguese) would require similar learning of models. The

paradigm of multi-task learning helps in simultaneously learning across all

tasks while sharing the learning among related tasks. This is especially useful

when some of the tasks do not contain sufficiently many training samples, in

which case borrowing the learning from other related tasks helps in the

learning of robust models. A special case of multi-task learning is transfer

learning, where the learning from a source task (with a sufficient number of

training samples) has to be transferred to a destination task (with a paucity of

training data). An extensive survey of transfer learning approaches is provided

by Pan et al. [282].

Most classifiers assume every data instance must belong to a class, which is

not always true for some applications. For example, in malware detection, due

to the ease with which new malware is created, a classifier trained on existing

classes may fail to detect new ones even if the features of the new malware

are considerably different from those of existing malware. Another example

is in critical applications such as medical diagnosis, where prediction errors

are costly and can have severe consequences. In this situation, it would be

better for the classifier to refrain from making any prediction on a data

instance if it is unsure of its class. This approach, known as a classifier with a

reject option, does not need to classify every data instance unless it

determines the prediction is reliable (e.g., if the class probability exceeds a

user-specified threshold). Instances that are unclassified can be presented to

domain experts for further determination of their true class labels.
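A minimal sketch of the reject option, assuming the classifier already produces a probability estimate for each class: the function name `predict_with_reject`, the default threshold of 0.8, and the class labels are hypothetical choices for this example.

```python
def predict_with_reject(class_probs, threshold=0.8):
    """class_probs maps each class label to its estimated probability.
    Return the most probable label if its probability exceeds the
    user-specified threshold; otherwise return None to signal that the
    instance should be referred to a domain expert."""
    label = max(class_probs, key=class_probs.get)
    return label if class_probs[label] > threshold else None


print(predict_with_reject({"benign": 0.95, "malignant": 0.05}))  # benign
print(predict_with_reject({"benign": 0.55, "malignant": 0.45}))  # None
```

Raising the threshold reduces the number of costly errors at the price of sending more instances to the experts.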

Classifiers can also be distinguished in terms of how the classification model

is trained. A batch classifier assumes all the labeled instances are available

during training. This strategy is applicable when the training set size is not too

large and for stationary data, where the relationship between the attributes

and classes does not vary over time. An online classifier, on the other hand,

trains an initial model using a subset of the labeled data [263]. The model is

then updated incrementally as more labeled instances become available. This

strategy is effective when the training set is too large or when there is concept

drift due to changes in the distribution of the data over time.
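The difference between batch and online training can be illustrated with a perceptron that is updated one labeled instance at a time. The perceptron update rule is chosen here only for simplicity; the class name `OnlinePerceptron`, the learning rate, and the toy data are assumptions for this sketch, not a method prescribed by the text.

```python
class OnlinePerceptron:
    """Linear classifier for labels in {-1, +1}, updated incrementally."""

    def __init__(self, n_features, learning_rate=1.0):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score >= 0 else -1

    def update(self, x, y):
        """Adjust the weights only if the current model misclassifies (x, y)."""
        if self.predict(x) != y:
            self.w = [wi + self.lr * y * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * y


model = OnlinePerceptron(n_features=2)

# Train an initial model on a small labeled subset.
initial_batch = [((2.0, 1.0), 1), ((-1.0, -2.0), -1)]
for _ in range(5):
    for x, y in initial_batch:
        model.update(x, y)

# Update the same model as further labeled instances arrive.
stream = [((1.5, 0.5), 1), ((-2.0, -0.5), -1), ((3.0, 2.0), 1)]
for x, y in stream:
    model.update(x, y)

print(model.w, model.b)
```

Because each update touches only the current instance, the full training set never has to be held in memory, and later instances can gradually shift the model if the data distribution drifts.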

Bibliography

[195] C. C. Aggarwal. Data classification: algorithms and applications. CRC

Press, 2014.

[196] D. W. Aha. A study of instance-based algorithms for supervised learning

tasks: mathematical, empirical, and psychological evaluations. PhD thesis,

University of California, Irvine, 1990.

[197] E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing Multiclass to

Binary: A Unifying Approach to Margin Classifiers. Journal of Machine

Learning Research, 1: 113–141, 2000.

[198] R. Andrews, J. Diederich, and A. Tickle. A Survey and Critique of

Techniques For Extracting Rules From Trained Artificial Neural Networks.

Knowledge Based Systems, 8(6):373–389, 1995.

[199] P. Baldi. Autoencoders, unsupervised learning, and deep architectures.

ICML unsupervised and transfer learning, 27(37-50):1, 2012.

[200] Y. Bengio. Learning deep architectures for AI. Foundations and Trends

in Machine Learning, 2(1):1–127, 2009.

[201] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A

review and new perspectives. IEEE transactions on pattern analysis and

machine intelligence, 35(8): 1798–1828, 2013.

[202] K. Bennett and C. Campbell. Support Vector Machines: Hype or

Hallelujah. SIGKDD Explorations, 2(2):1–13, 2000.

[203] D. Berrar and P. Flach. Caveats and pitfalls of ROC analysis in clinical

microarray research (and how to avoid them). Briefings in bioinformatics,

page bbr008, 2011.

[204] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford

University Press, Oxford, U.K., 1995.

[205] C. M. Bishop. Pattern Recognition and Machine Learning. Springer,

2006.

[206] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-

training. In Proceedings of the eleventh annual conference on

Computational learning theory, pages 92–100. ACM, 1998.

[207] L. Bottou. Large-scale machine learning with stochastic gradient

descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer,

2010.

[208] A. P. Bradley. The use of the area under the ROC curve in the

Evaluation of Machine Learning Algorithms. Pattern Recognition,

30(7):1145–1149, 1997.

[209] L. Breiman. Bagging Predictors. Machine Learning, 24(2):123–140,

1996.

[210] L. Breiman. Bias, Variance, and Arcing Classifiers. Technical Report

486, University of California, Berkeley, CA, 1996.

[211] L. Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.

[212] C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern

Recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

[213] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE:

Synthetic Minority Over-sampling Technique. Journal of Artificial

Intelligence Research, 16: 321–357, 2002.

[214] N. V. Chawla, N. Japkowicz, and A. Kolcz. Editorial: Special Issue on

Learning from Imbalanced Data Sets. SIGKDD Explorations, 6(1):1–6,

2004.

[215] V. Cherkassky and F. Mulier. Learning from Data: Concepts, Theory, and

Methods. Wiley Interscience, 1998.

[216] P. Clark and R. Boswell. Rule Induction with CN2: Some Recent

Improvements. In Machine Learning: Proc. of the 5th European Conf.

(EWSL-91), pages 151–163, 1991.

[217] P. Clark and T. Niblett. The CN2 Induction Algorithm. Machine Learning,

3(4): 261–283, 1989.

[218] W. W. Cohen. Fast Effective Rule Induction. In Proc. of the 12th Intl.

Conf. on Machine Learning, pages 115–123, Tahoe City, CA, July 1995.

[219] S. Cost and S. Salzberg. A Weighted Nearest Neighbor Algorithm for

Learning with Symbolic Features. Machine Learning, 10:57–78, 1993.

[220] T. M. Cover and P. E. Hart. Nearest Neighbor Pattern Classification.

IEEE Transactions on Information Theory, 13(1):21–27, 1967.

[221] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector

Machines and Other Kernel-based Learning Methods. Cambridge

University Press, 2000.

[222] T. G. Dietterich. Ensemble Methods in Machine Learning. In First Intl.

Workshop on Multiple Classifier Systems, Cagliari, Italy, 2000.

[223] T. G. Dietterich and G. Bakiri. Solving Multiclass Learning Problems via

Error-Correcting Output Codes. Journal of Artificial Intelligence Research,

2:263–286, 1995.

[224] P. Domingos. The RISE system: Conquering without separating. In Proc.

of the 6th IEEE Intl. Conf. on Tools with Artificial Intelligence, pages 704–

707, New Orleans, LA, 1994.

[225] P. Domingos. MetaCost: A General Method for Making Classifiers Cost-

Sensitive. In Proc. of the 5th Intl. Conf. on Knowledge Discovery and Data

Mining, pages 155–164, San Diego, CA, August 1999.

[226] P. Domingos. A unified bias-variance decomposition. In Proceedings of

17th International Conference on Machine Learning, pages 231–238, 2000.

[227] P. Domingos and M. Pazzani. On the Optimality of the Simple Bayesian

Classifier under Zero-One Loss. Machine Learning, 29(2-3):103–130,

1997.

[228] C. Drummond and R. C. Holte. C4.5, Class imbalance, and Cost

sensitivity: Why under-sampling beats over-sampling. In ICML’2003

Workshop on Learning from Imbalanced Data Sets II, Washington, DC,

August 2003.

[229] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John

Wiley & Sons, Inc., New York, 2nd edition, 2001.

[230] M. H. Dunham. Data Mining: Introductory and Advanced Topics.

Prentice Hall, 2006.

[231] C. Elkan. The Foundations of Cost-Sensitive Learning. In Proc. of the

17th Intl. Joint Conf. on Artificial Intelligence, pages 973–978, Seattle, WA,

August 2001.

[232] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S.

Bengio. Why does unsupervised pre-training help deep learning? Journal

of Machine Learning Research, 11(Feb):625–660, 2010.

[233] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan. AdaCost:

misclassification cost-sensitive boosting. In Proc. of the 16th Intl. Conf. on

Machine Learning, pages 97–105, Bled, Slovenia, June 1999.

[234] J. Fürnkranz and G. Widmer. Incremental reduced error pruning. In

Proc. of the 11th Intl. Conf. on Machine Learning, pages 70–77, New

Brunswick, NJ, July 1994.

[235] C. Ferri, J. Hernández-Orallo, and P. A. Flach. A coherent interpretation

of AUC as a measure of aggregated classification performance. In

Proceedings of the 28th International Conference on Machine Learning

(ICML-11), pages 657–664, 2011.

[236] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-

line learning and an application to boosting. Journal of Computer and

System Sciences, 55(1): 119–139, 1997.

[237] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic

Press, New York, 1990.

[238] D. Geiger, T. S. Verma, and J. Pearl. d-separation: From theorems to

algorithms. arXiv preprint arXiv:1304.1505, 2013.

[239] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. Book in

preparation for MIT Press, 2016.

[240] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. C. Courville, and Y.

Bengio. Maxout networks. ICML (3), 28:1319–1327, 2013.

[241] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J.

Schmidhuber. A novel connectionist system for unconstrained handwriting

recognition. IEEE transactions on pattern analysis and machine

intelligence, 31(5):855–868, 2009.

[242] A. Graves and J. Schmidhuber. Offline handwriting recognition with

multidimensional recurrent neural networks. In Advances in neural

information processing systems, pages 545–552, 2009.

[243] E.-H. Han, G. Karypis, and V. Kumar. Text Categorization Using Weight

Adjusted k-Nearest Neighbor Classification. In Proc. of the 5th Pacific-Asia

Conf. on Knowledge Discovery and Data Mining, Lyon, France, 2001.

[244] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan

Kaufmann Publishers, San Francisco, 2001.

[245] D. J. Hand. Measuring classifier performance: a coherent alternative to

the area under the ROC curve. Machine learning, 77(1):103–123, 2009.

[246] D. J. Hand. Evaluating diagnostic tests: the area under the ROC curve

and the balance of errors. Statistics in medicine, 29(14):1502–1510, 2010.

[247] D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT

Press, 2001.

[248] T. Hastie and R. Tibshirani. Classification by pairwise coupling. Annals

of Statistics, 26(2):451–471, 1998.

[249] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical

Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition,

2009.

[250] T. Hastie, R. Tibshirani, and M. Wainwright. Statistical learning with

sparsity: the lasso and generalizations. CRC Press, 2015.

[251] M. Hearst. Trends & Controversies: Support Vector Machines. IEEE

Intelligent Systems, 13(4):18–28, 1998.

[252] D. Heckerman. Bayesian Networks for Data Mining. Data Mining and

Knowledge Discovery, 1(1):79–119, 1997.

[253] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of

data with neural networks. Science, 313(5786):504–507, 2006.

[254] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R.

Salakhutdinov. Improving neural networks by preventing co-adaptation of

feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[255] R. C. Holte. Very Simple Classification Rules Perform Well on Most

Commonly Used Data sets. Machine Learning, 11:63–91, 1993.

[256] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep

network training by reducing internal covariate shift. arXiv preprint

arXiv:1502.03167, 2015.

[257] N. Japkowicz. The Class Imbalance Problem: Significance and

Strategies. In Proc. of the 2000 Intl. Conf. on Artificial Intelligence: Special

Track on Inductive Learning, volume 1, pages 111–117, Las Vegas, NV,

June 2000.

[258] F. V. Jensen. An introduction to Bayesian networks, volume 210. UCL

press London, 1996.

[259] M. I. Jordan. Learning in graphical models, volume 89. Springer Science

& Business Media, 1998.

[260] M. V. Joshi, R. C. Agarwal, and V. Kumar. Mining Needles in a Haystack:

Classifying Rare Classes via Two-Phase Rule Induction. In Proc. of 2001

ACM-SIGMOD Intl. Conf. on Management of Data, pages 91–102, Santa

Barbara, CA, June 2001.

[261] M. V. Joshi, R. C. Agarwal, and V. Kumar. Predicting rare classes: can

boosting make any weak learner strong? In Proc. of the 8th Intl. Conf. on

Knowledge Discovery and Data Mining, pages 297–306, Edmonton,

Canada, July 2002.

[262] M. V. Joshi and V. Kumar. CREDOS: Classification Using Ripple Down

Structure (A Case for Rare Classes). In Proc. of the SIAM Intl. Conf. on

Data Mining, pages 321–332, Orlando, FL, April 2004.

[263] J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with

kernels. IEEE transactions on signal processing, 52(8):2165–2176, 2004.

[264] E. B. Kong and T. G. Dietterich. Error-Correcting Output Coding Corrects

Bias and Variance. In Proc. of the 12th Intl. Conf. on Machine Learning,

pages 313–321, Tahoe City, CA, July 1995.

[265] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification

with deep convolutional neural networks. In Advances in neural information

processing systems, pages 1097–1105, 2012.

[266] M. Kubat and S. Matwin. Addressing the Curse of Imbalanced Training

Sets: One Sided Selection. In Proc. of the 14th Intl. Conf. on Machine

Learning, pages 179–186, Nashville, TN, July 1997.

[267] P. Langley, W. Iba, and K. Thompson. An analysis of Bayesian

classifiers. In Proc. of the 10th National Conf. on Artificial Intelligence,

pages 223–228, 1992.

[268] Y. LeCun and Y. Bengio. Convolutional networks for images, speech,

and time series. The handbook of brain theory and neural networks,

3361(10):1995, 1995.

[269] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature,

521(7553):436–444, 2015.

[270] D. D. Lewis. Naive Bayes at Forty: The Independence Assumption in

Information Retrieval. In Proc. of the 10th European Conf. on Machine

Learning (ECML 1998), pages 4–15, 1998.

[271] C. X. Ling and V. S. Sheng. Cost-sensitive learning. In Encyclopedia of

Machine Learning, pages 231–235. Springer, 2011.

[272] O. Mangasarian. Data Mining via Support Vector Machines. Technical

Report 01-05, Data Mining Institute, May 2001.

[273] D. D. Margineantu and T. G. Dietterich. Learning Decision Trees for

Loss Minimization in Multi-Class Problems. Technical Report 99-30-03,

Oregon State University, 1999.

[274] P. McCullagh and J. A. Nelder. Generalized linear models, volume 37.

CRC press, 1989.

[275] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent

in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133,

1943.

[276] R. S. Michalski, I. Mozetic, J. Hong, and N. Lavrac. The Multi-Purpose

Incremental Learning System AQ15 and Its Testing Application to Three

Medical Domains. In Proc. of 5th National Conf. on Artificial Intelligence,

Orlando, August 1986.

[277] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur.

Recurrent neural network based language model. In Interspeech, volume

2, page 3, 2010.

[278] T. Mitchell. Machine Learning. McGraw-Hill, Boston, MA, 1997.

[279] S. Muggleton. Foundations of Inductive Logic Programming. Prentice

Hall, Englewood Cliffs, NJ, 1995.

[280] J. A. Nelder and R. J. Baker. Generalized linear models. Encyclopedia of

statistical sciences, 1972.

[281] M. A. Nielsen. Neural networks and deep learning. Published online:

http://neuralnetworksanddeeplearning.com/ (visited: 10.15.2016), 2015.

[282] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions

on knowledge and data engineering, 22(10):1345–1359, 2010.

[283] J. Pearl. Probabilistic reasoning in intelligent systems: networks of

plausible inference. Morgan Kaufmann, 2014.

[284] D. M. Powers. The problem of area under the curve. In 2012 IEEE

International Conference on Information Science and Technology, pages

567–573. IEEE, 2012.

[285] M. Prince. Does active learning work? A review of the research. Journal

of engineering education, 93(3):223–231, 2004.

[286] F. J. Provost and T. Fawcett. Analysis and Visualization of Classifier

Performance: Comparison under Imprecise Class and Cost Distributions.

In Proc. of the 3rd Intl. Conf. on Knowledge Discovery and Data Mining,

pages 43–48, Newport Beach, CA, August 1997.

[287] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan-Kaufmann

Publishers, San Mateo, CA, 1993.

[288] M. Ramoni and P. Sebastiani. Robust Bayes classifiers. Artificial

Intelligence, 125: 209–226, 2001.

[289] N. Rochester, J. Holland, L. Haibt, and W. Duda. Tests on a cell

assembly theory of the action of the brain, using a large digital computer.

IRE Transactions on information Theory, 2(3):80–93, 1956.

[290] F. Rosenblatt. The perceptron: a probabilistic model for information

storage and organization in the brain. Psychological review, 65(6):386,

1958.

[291] S. J. Russell, P. Norvig, J. F. Canny, J. M. Malik, and D. D. Edwards.

Artificial intelligence: a modern approach, volume 2. Prentice hall Upper

Saddle River, 2003.

[292] T. Saito and M. Rehmsmeier. The precision-recall plot is more

informative than the ROC plot when evaluating binary classifiers on

imbalanced datasets. PloS one, 10(3): e0118432, 2015.

[293] J. Schmidhuber. Deep learning in neural networks: An overview. Neural

Networks, 61:85–117, 2015.

[294] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector

Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

[295] B. Settles. Active learning literature survey. University of Wisconsin,

Madison, 52 (55-66):11, 2010.

[296] P. Smyth and R. M. Goodman. An Information Theoretic Approach to

Rule Induction from Databases. IEEE Trans. on Knowledge and Data

Engineering, 4(4):301–316, 1992.

[297] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R.

Salakhutdinov. Dropout: a simple way to prevent neural networks from

overfitting. Journal of Machine Learning Research, 15(1):1929–1958,

2014.

[298] M. Steinbach and P.-N. Tan. kNN: k-Nearest Neighbors. In X. Wu and V.

Kumar, editors, The Top Ten Algorithms in Data Mining. Chapman and

Hall/CRC Reference, 1st edition, 2009.

[299] S. Sun. A survey of multi-view machine learning. Neural Computing and

Applications, 23(7-8):2031–2038, 2013.

[300] D. M. J. Tax and R. P. W. Duin. Using Two-Class Classifiers for

Multiclass Classification. In Proc. of the 16th Intl. Conf. on Pattern

Recognition (ICPR 2002), pages 124–127, Quebec, Canada, August 2002.

[301] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal

of the Royal Statistical Society. Series B (Methodological), pages 267–288,

1996.

[302] C. J. van Rijsbergen. Information Retrieval. Butterworth-Heinemann,

Newton, MA, 1978.

[303] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag,

New York, 1995.

[304] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York,

1998.

[305] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and

composing robust features with denoising autoencoders. In Proceedings of

the 25th international conference on Machine learning, pages 1096–1103.

ACM, 2008.

[306] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol.

Stacked denoising autoencoders: Learning useful representations in a

deep network with a local denoising criterion. Journal of Machine Learning

Research, 11(Dec):3371–3408, 2010.

[307] A. R. Webb. Statistical Pattern Recognition. John Wiley & Sons, 2nd

edition, 2002.

[308] G. M. Weiss. Mining with Rarity: A Unifying Framework. SIGKDD

Explorations, 6 (1):7–19, 2004.

[309] P. Werbos. Beyond regression: new tools for prediction and analysis in

the behavioral sciences. PhD thesis, Harvard University, 1974.

[310] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools

and Techniques with Java Implementations. Morgan Kaufmann, 1999.

[311] C. Xu, D. Tao, and C. Xu. A survey on multi-view learning. arXiv preprint

arXiv:1304.5634, 2013.

[312] B. Zadrozny, J. C. Langford, and N. Abe. Cost-Sensitive Learning by

Cost-Proportionate Example Weighting. In Proc. of the 2003 IEEE Intl.

Conf. on Data Mining, pages 435–442, Melbourne, FL, August 2003.

[313] M.-L. Zhang and Z.-H. Zhou. A review on multi-label learning algorithms.

IEEE transactions on knowledge and data engineering, 26(8):1819–1837,

2014.

[314] Z.-H. Zhou. Multi-instance learning: A survey. Department of Computer

Science & Technology, Nanjing University, Tech. Rep, 2004.

[315] X. Zhu. Semi-supervised learning. In Encyclopedia of machine learning,

pages 892–897. Springer, 2011.

[316] X. Zhu and A. B. Goldberg. Introduction to semi-supervised learning.

Synthesis lectures on artificial intelligence and machine learning, 3(1):1–

130, 2009.

4.14 Exercises

1. Consider a binary classification problem with the following set of attributes and attribute values:

Air Conditioner = {Working, Broken}

Engine = {Good, Bad}

Mileage = {High, Medium, Low}

Rust = {Yes, No}

Suppose a rule-based classifier produces the following rule set:

Mileage = High → …

Mileage = Low → Value = High

Air Conditioner = Working, … → …

a. Are the rules mutually exclusive?

b. Is the rule set exhaustive?

c. Is ordering needed for this set of rules?

d. Do you need a default class for the rule set?

2. The RIPPER algorithm (by Cohen [218]) is an extension of an earlier algorithm called IREP (by Fürnkranz and Widmer [234]). Both algorithms apply the reduced-error pruning method to determine whether a rule needs to be pruned. The reduced-error pruning method uses a validation set to estimate the generalization error of a classifier. Consider the following pair of rules:

R1: A → C

R2: A ∧ B → C

R2 is obtained by adding a new conjunct, B, to the left-hand side of R1. For this question, you will be asked to determine whether R2 is preferred over R1 from the perspectives of rule-growing and rule-pruning. To determine whether a rule should be pruned, IREP computes the following measure:

$$v_{IREP} = \frac{p + (N - n)}{P + N},$$

where P is the total number of positive examples in the validation set, N is the total number of negative examples in the validation set, p is the number of positive examples in the validation set covered by the rule, and n is the number of negative examples in the validation set covered by the rule. $v_{IREP}$ is actually similar to classification accuracy for the validation set. IREP favors rules that have higher values of $v_{IREP}$. On the other hand, RIPPER applies the following measure to determine whether a rule should be pruned:

$$v_{RIPPER} = \frac{p - n}{p + n}.$$

a. Suppose R1 is covered by 350 positive examples and 150 negative examples, while R2 is covered by 300 positive examples and 50 negative examples. Compute the FOIL’s information gain for the rule R2 with respect to R1.

b. Consider a validation set that contains 500 positive examples and 500 negative examples. For R1, suppose the number of positive examples covered by the rule is 200, and the number of negative examples covered by the rule is 50. For R2, suppose the number of positive examples covered by the rule is 100 and the number of negative examples is 5. Compute $v_{IREP}$ for both rules. Which rule does IREP prefer?

c. Compute $v_{RIPPER}$ for the previous problem. Which rule does RIPPER prefer?

3. C4.5rules is an implementation of an indirect method for generating rules

from a decision tree. RIPPER is an implementation of a direct method for

generating rules directly from data.

a. Discuss the strengths and weaknesses of both methods.

b. Consider a data set that has a large difference in the class size (i.e.,

some classes are much bigger than others). Which method (between

C4.5rules and RIPPER) is better in terms of finding high accuracy rules

for the small classes?

4. Consider a training set that contains 100 positive examples and 400 negative examples. For each of the following candidate rules,

R1: A → + (covers 4 positive and 1 negative examples),

R2: B → + (covers 30 positive and 10 negative examples),

R3: C → + (covers 100 positive and 90 negative examples),

determine which is the best and worst candidate rule according to:

a. Rule accuracy.

b. FOIL’s information gain.

c. The likelihood ratio statistic.

d. The Laplace measure.

e. The m-estimate measure (with k=2 and p+=0.2).

5. Figure 4.3 illustrates the coverage of the classification rules R1, R2, and R3. Determine which is the best and worst rule according to:

a. The likelihood ratio statistic.

b. The Laplace measure.

c. The m-estimate measure (with k=2 and p+=0.58).

d. The rule accuracy after R1 has been discovered, where none of the examples covered by R1 are discarded.

e. The rule accuracy after R1 has been discovered, where only the positive examples covered by R1 are discarded.

f. The rule accuracy after R1 has been discovered, where both positive and negative examples covered by R1 are discarded.

6.

a. Suppose the fraction of undergraduate students who smoke is 15% and the fraction of graduate students who smoke is 23%. If one-fifth of the college students are graduate students and the rest are undergraduates, what is the probability that a student who smokes is a graduate student?

b. Given the information in part (a), is a randomly chosen college student more likely to be a graduate or undergraduate student?

c. Repeat part (b) assuming that the student is a smoker.

d. Suppose 30% of the graduate students live in a dorm but only 10% of the undergraduate students live in a dorm. If a student smokes and lives in the dorm, is he or she more likely to be a graduate or undergraduate student? You can assume independence between students who live in a dorm and those who smoke.

7. Consider the data set shown in Table 4.9.

Table 4.9. Data set for Exercise 7.

Instance   A   B   C   Class
1          0   0   0   +
2          0   0   1   −
3          0   1   1   −
4          0   1   1   −
5          0   0   1   +
6          1   0   1   +
7          1   0   1   −
8          1   0   1   −
9          1   1   1   +
10         1   0   1   +

a. Estimate the conditional probabilities for P(A|+), P(B|+), P(C|+), P(A|−), P(B|−), and P(C|−).

b. Use the estimate of conditional probabilities given in the previous question to predict the class label for a test sample (A=0, B=1, C=0) using the naïve Bayes approach.

c. Estimate the conditional probabilities using the m-estimate approach, with p=1/2 and m=4.

d. Repeat part (b) using the conditional probabilities given in part (c).

e. Compare the two methods for estimating probabilities. Which method is better and why?

8. Consider the data set shown in Table 4.10.

Table 4.10. Data set for Exercise 8.

Instance   A   B   C   Class
1          0   0   1   −
2          1   0   1   +
3          0   1   0   −
4          1   0   0   −
5          1   0   1   +
6          0   0   1   +
7          1   1   0   −
8          0   0   0   −
9          0   1   0   +
10         1   1   1   +

a. Estimate the conditional probabilities for P(A=1|+), P(B=1|+), P(C=1|+), P(A=1|−), P(B=1|−), and P(C=1|−) using the same approach as in the previous problem.

b. Use the conditional probabilities in part (a) to predict the class label for a test sample (A=1, B=1, C=1) using the naïve Bayes approach.

c. Compare P(A=1), P(B=1), and P(A=1, B=1). State the relationships between A and B.

d. Repeat the analysis in part (c) using P(A=1), P(B=0), and P(A=1, B=0).

e. Compare P(A=1, B=1|Class=+) against P(A=1|Class=+) and P(B=1|Class=+). Are the variables conditionally independent given the class?

9.

a. Explain how naïve Bayes performs on the data set shown in Figure 4.56.

b. If each class is further divided such that there are four classes (A1, A2, B1, and B2), will naïve Bayes perform better?

c. How will a decision tree perform on this data set (for the two-class problem)? What if there are four classes?

Figure 4.56. Data set for Exercise 9.

10. Figure 4.57 illustrates the Bayesian network for the data set shown in Table 4.11. (Assume that all the attributes are binary.)

a. Draw the probability table for each node in the network.

b. Use the Bayesian network to compute P(Engine=Bad, Air Conditioner=Broken).

Figure 4.57. Bayesian network.

Table 4.11. Data set for Exercise 10.

Mileage   Engine   Air Conditioner   Number of Instances with Car Value=Hi   Number of Instances with Car Value=Lo
Hi        Good     Working           3                                       4
Hi        Good     Broken            1                                       2
Hi        Bad      Working           1                                       5
Hi        Bad      Broken            0                                       4
Lo        Good     Working           9                                       0
Lo        Good     Broken            5                                       1
Lo        Bad      Working           1                                       2
Lo        Bad      Broken            0                                       2

11. Given the Bayesian network shown in Figure 4.58, compute the following probabilities:

a. P(B=good, F=empty, G=empty, S=yes).

b. P(B=bad, F=empty, G=not empty, S=no).

c. Given that the battery is bad, compute the probability that the car will start.

Figure 4.58. Bayesian network for Exercise 11.

12. Consider the one-dimensional data set shown in Table 4.12.

a. Classify the data point x=5.0 according to its 1-, 3-, 5-, and 9-nearest neighbors (using majority vote).

b. Repeat the previous analysis using the distance-weighted voting approach described in Section 4.3.1.

Table 4.12. Data set for Exercise 12.

x   0.5   3.0   4.5   4.6   4.9   5.2   5.3   5.5   7.0   9.5
y   −     −     +     +     +     −     −     +     −     −

13. The nearest neighbor algorithm described in Section 4.3 can be extended to handle nominal attributes. A variant of the algorithm called PEBLS (Parallel Exemplar-Based Learning System) by Cost and Salzberg [219] measures the distance between two values of a nominal attribute using the modified value difference metric (MVDM). Given a pair of nominal attribute values, $V_1$ and $V_2$, the distance between them is defined as follows:

$$d(V_1, V_2) = \sum_{i=1}^{k} \left| \frac{n_{i1}}{n_1} - \frac{n_{i2}}{n_2} \right|, \tag{4.108}$$

where $n_{ij}$ is the number of examples from class i with attribute value $V_j$ and $n_j$ is the number of examples with attribute value $V_j$.

Consider the training set for the loan classification problem shown in Figure 4.8. Use the MVDM measure to compute the distance between every pair of attribute values for the two nominal attributes.

14. For each of the Boolean functions given below, state whether the problem is linearly separable.

a. A AND B AND C

b. NOT A AND B

c. (A OR B) AND (A OR C)

d. (A XOR B) AND (A OR B)

15.

a. Demonstrate how the perceptron model can be used to represent the AND and OR functions between a pair of Boolean variables.

b. Comment on the disadvantage of using linear functions as activation functions for multi-layer neural networks.

16. You are asked to evaluate the performance of two classification models, M1 and M2. The test set you have chosen contains 26 binary attributes, labeled as A through Z. Table 4.13 shows the posterior probabilities obtained by applying the models to the test set. (Only the posterior probabilities for the positive class are shown.) As this is a two-class problem, P(−)=1−P(+) and P(−|A, …, Z)=1−P(+|A, …, Z). Assume that we are mostly interested in detecting instances from the positive class.

a. Plot the ROC curve for both M1 and M2. (You should plot them on the same graph.) Which model do you think is better? Explain your reasons.

b. For model M1, suppose you choose the cutoff threshold to be t=0.5. In other words, any test instances whose posterior probability is greater than t will be classified as a positive example. Compute the precision, recall, and F-measure for the model at this threshold value.

c. Repeat the analysis for part (b) using the same cutoff threshold on model M2. Compare the F-measure results for both models. Which model is better? Are the results consistent with what you expect from the ROC curve?

d. Repeat part (b) for model M1 using the threshold t=0.1. Which threshold do you prefer, t=0.5 or t=0.1? Are the results consistent with what you expect from the ROC curve?

Table 4.13. Posterior probabilities for Exercise 16.

Instance   True Class   P(+|A, …, Z, M1)   P(+|A, …, Z, M2)
1          +            0.73               0.61
2          +            0.69               0.03
3          −            0.44               0.68
4          −            0.55               0.31
5          +            0.67               0.45
6          +            0.47               0.09
7          −            0.08               0.38
8          −            0.15               0.05
9          +            0.45               0.01
10         −            0.35               0.04

17. Following is a data set that contains two attributes, X and Y, and two class labels, “+” and “−”. Each attribute can take three different values: 0, 1, or 2.

X   Y   Number of + Instances   Number of − Instances
0   0   0                       100
1   0   0                       0
2   0   0                       100
0   1   10                      100
1   1   10                      0
2   1   10                      100
0   2   0                       100
1   2   0                       0
2   2   0                       100

The concept for the “+” class is Y=1 and the concept for the “−” class is X=0 ∨ X=2.

a. Build a decision tree on the data set. Does the tree capture the “+” and “−” concepts?

b. What are the accuracy, precision, recall, and $F_1$-measure of the decision tree? (Note that precision, recall, and $F_1$-measure are defined with respect to the “+” class.)

c. Build a new decision tree with the following cost function:

$$C(i, j) = \begin{cases} 0, & \text{if } i = j; \\ 1, & \text{if } i = +,\ j = -; \\ \dfrac{\text{Number of } - \text{ instances}}{\text{Number of } + \text{ instances}}, & \text{if } i = -,\ j = +. \end{cases}$$

(Hint: only the leaves of the old decision tree need to be changed.) Does the decision tree capture the “+” concept?

d. What are the accuracy, precision, recall, and $F_1$-measure of the new decision tree?

18. Consider the task of building a classifier from random data, where the attribute values are generated randomly irrespective of the class labels. Assume the data set contains instances from two classes, “+” and “−.” Half of the data set is used for training while the remaining half is used for testing.

a. Suppose there are an equal number of positive and negative instances in the data and the decision tree classifier predicts every test instance to be positive. What is the expected error rate of the classifier on the test data?

b. Repeat the previous analysis assuming that the classifier predicts each test instance to be positive class with probability 0.8 and negative class with probability 0.2.

c. Suppose two-thirds of the data belong to the positive class and the remaining one-third belong to the negative class. What is the expected error of a classifier that predicts every test instance to be positive?

d. Repeat the previous analysis assuming that the classifier predicts each test instance to be positive class with probability 2/3 and negative class with probability 1/3.

19. Derive the dual Lagrangian for the linear SVM with non-separable data where the objective function is

$$f(\mathbf{w}) = \frac{\|\mathbf{w}\|^2}{2} + C\left(\sum_{i=1}^{N} \xi_i\right)^2.$$

20. Consider the XOR problem where there are four training points:

(1, 1, −), (1, 0, +), (0, 1, +), (0, 0, −).

Transform the data into the following feature space:

$$\varphi = (1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_1x_2, x_1^2, x_2^2).$$

Find the maximum margin linear decision boundary in the transformed space.

21. Given the data sets shown in Figure 4.59, explain how the decision tree, naïve Bayes, and k-nearest neighbor classifiers would perform on these data sets.

Figure 4.59. Data set for Exercise 21.

5 Association Analysis: Basic Concepts and Algorithms

Many business enterprises accumulate large quantities

of data from their day-to-day operations. For example,

huge amounts of customer purchase data are collected

daily at the checkout counters of grocery stores. Table

5.1 gives an example of such data, commonly known

as market basket transactions. Each row in this table

corresponds to a transaction, which contains a unique

identifier labeled TID and a set of items bought by a

given customer. Retailers are interested in analyzing

the data to learn about the purchasing behavior of their

customers. Such valuable information can be used to

support a variety of business-related applications such

as marketing promotions, inventory management, and

customer relationship management.

Table 5.1. An example of market basket

transactions.

TID Items

1 {Bread, Milk}

2 {Bread, Diapers, Beer, Eggs}

3 {Milk, Diapers, Beer, Cola}

4 {Bread, Milk, Diapers, Beer}

5 {Bread, Milk, Diapers, Cola}

This chapter presents a methodology known as

association analysis, which is useful for discovering

interesting relationships hidden in large data sets. The

uncovered relationships can be represented in the form

of sets of items present in many transactions, which are

known as frequent itemsets, or association rules,

that represent relationships between two itemsets. For

example, the following rule can be extracted from the

data set shown in Table 5.1:

{Diapers}→{Beer}.

The rule suggests a relationship between the sale of

diapers and beer because many customers who buy

diapers also buy beer. Retailers can use these types of

rules to help them identify new opportunities for

cross-selling their products to the customers.

Besides market basket data, association analysis is

also applicable to data from other application domains

such as bioinformatics, medical diagnosis, web mining,

and scientific data analysis. In the analysis of Earth

science data, for example, association patterns may

reveal interesting connections among the ocean, land,

and atmospheric processes. Such information may help

Earth scientists develop a better understanding of how

the different elements of the Earth system interact with

each other. Even though the techniques presented here

are generally applicable to a wider variety of data sets,

for illustrative purposes, our discussion will focus

mainly on market basket data.

There are two key issues that need to be addressed

when applying association analysis to market basket

data. First, discovering patterns from a large transaction

data set can be computationally expensive. Second,

some of the discovered patterns may be spurious

(happen simply by chance) and even for non-spurious

patterns, some are more interesting than others. The

remainder of this chapter is organized around these two

issues. The first part of the chapter is devoted to

explaining the basic concepts of association analysis

and the algorithms used to efficiently mine such

patterns. The second part of the chapter deals with the

issue of evaluating the discovered patterns in order to

help prevent the generation of spurious results and to

rank the patterns in terms of some interestingness

measure.

5.1 Preliminaries

This section reviews the basic terminology used in association analysis and

presents a formal description of the task.

Binary Representation

Market basket data can be represented in a binary format as shown in Table

5.2, where each row corresponds to a transaction and each column

corresponds to an item. An item can be treated as a binary variable whose

value is one if the item is present in a transaction and zero otherwise.

Because the presence of an item in a transaction is often considered more

important than its absence, an item is an asymmetric binary variable. This

representation is a simplistic view of real market basket data because it

ignores important aspects of the data such as the quantity of items sold or the

price paid to purchase them. Methods for handling such non-binary data will

be explained in Chapter 6.

Table 5.2. A binary 0/1 representation of market basket data.

TID Bread Milk Diapers Beer Eggs Cola

1 1 1 0 0 0 0

2 1 0 1 1 1 0

3 0 1 1 1 0 1

4 1 1 1 1 0 0

5 1 1 1 0 0 1
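The conversion from market basket transactions to the binary format can be sketched in a few lines of Python. This is only an illustration: the names `transactions`, `items`, and `to_binary` are assumptions made for this example, and the data are the five transactions of Table 5.1.

```python
# Transactions from Table 5.1, keyed by TID (illustrative encoding).
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diapers", "Beer", "Eggs"},
    3: {"Milk", "Diapers", "Beer", "Cola"},
    4: {"Bread", "Milk", "Diapers", "Beer"},
    5: {"Bread", "Milk", "Diapers", "Cola"},
}

# Column order used in Table 5.2.
items = ["Bread", "Milk", "Diapers", "Beer", "Eggs", "Cola"]


def to_binary(transactions, items):
    """Map each TID to a 0/1 row: 1 if the item is present, 0 otherwise."""
    return {
        tid: [1 if item in basket else 0 for item in items]
        for tid, basket in transactions.items()
    }


binary = to_binary(transactions, items)
for tid, row in binary.items():
    print(tid, *row)
```

As the text notes, this encoding discards quantities and prices; only the presence or absence of each item is kept.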

Itemset and Support Count

Let $I = \{i_1, i_2, \ldots, i_d\}$ be the set of all items in a market basket data set and $T = \{t_1, t_2, \ldots, t_N\}$ be the set of all transactions. Each transaction $t_i$ contains a subset of items chosen from $I$. In association analysis, a collection of zero or more items is termed an itemset. If an itemset contains k items, it is called a k-itemset. For instance, {Beer, Diapers, Milk} is an example of a 3-itemset. The null (or empty) set is an itemset that does not contain any items.

A transaction $t_j$ is said to contain an itemset X if X is a subset of $t_j$. For example, the second transaction shown in Table 5.2 contains the itemset {Bread, Diapers} but not {Bread, Milk}. An important property of an itemset is its support count, which refers to the number of transactions that contain a particular itemset. Mathematically, the support count, $\sigma(X)$, for an itemset X can be stated as follows:

$$\sigma(X) = |\{t_i \mid X \subseteq t_i,\ t_i \in T\}|,$$

where the symbol $|\cdot|$ denotes the number of elements in a set. In the data set shown in Table 5.2, the support count for {Beer, Diapers, Milk} is equal to two because there are only two transactions that contain all three items.

Often, the property of interest is the support, which is the fraction of transactions in which an itemset occurs:

$$s(X) = \sigma(X)/N.$$

An itemset X is called frequent if s(X) is greater than some user-defined threshold, minsup.
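The two definitions above translate almost directly into code. The following is a minimal sketch, assuming the transactions of Table 5.2 are held in memory as Python sets; the names `TRANSACTIONS`, `support_count`, and `support` are illustrative and do not come from the text.

```python
# Transactions of Table 5.2, one set of items per TID (1 through 5).
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def support_count(itemset, transactions):
    """sigma(X): the number of transactions t with X a subset of t."""
    x = set(itemset)
    return sum(1 for t in transactions if x <= t)


def support(itemset, transactions):
    """s(X) = sigma(X) / N."""
    return support_count(itemset, transactions) / len(transactions)


itemset = {"Beer", "Diapers", "Milk"}
print(support_count(itemset, TRANSACTIONS))  # 2
print(support(itemset, TRANSACTIONS))        # 0.4
```

Under this definition the empty itemset is contained in every transaction, so its support count equals N.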

Association Rule

An association rule is an implication expression of the form $X \rightarrow Y$, where X and Y are disjoint itemsets, i.e., $X \cap Y = \emptyset$. The strength of an association rule can be measured in terms of its support and confidence. Support determines how often a rule is applicable to a given data set, while confidence determines how frequently items in Y appear in transactions that contain X. The formal definitions of these metrics are

$$\text{Support, } s(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{N}; \tag{5.1}$$

$$\text{Confidence, } c(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}. \tag{5.2}$$

Example 5.1.

Consider the rule {Milk, Diapers} → {Beer}. Because the support count for {Milk, Diapers, Beer} is 2 and the total number of transactions is 5, the rule’s support is 2/5 = 0.4. The rule’s confidence is obtained by dividing the support count for {Milk, Diapers, Beer} by the support count for {Milk, Diapers}. Since there are 3 transactions that contain milk and diapers, the confidence for this rule is 2/3 = 0.67.

Why Use Support and Confidence?

Support is an important measure because a rule that has very low support might occur simply by chance. Also, from a business perspective a low support rule is unlikely to be interesting because it might not be profitable to promote items that customers seldom buy together (with the exception of the situation described in Section 5.8). For these reasons, we are interested in finding rules whose support is greater than some user-defined threshold. As will be shown in Section 5.2.1, support also has a desirable property that can be exploited for the efficient discovery of association rules.
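The arithmetic of Example 5.1 can be reproduced with a short sketch of Equations 5.1 and 5.2. It assumes the Table 5.2 transactions are stored as Python sets; the function name `rule_metrics` is hypothetical.

```python
# Transactions of Table 5.2, one set of items per TID (1 through 5).
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule X -> Y (Equations 5.1, 5.2)."""
    x, y = set(antecedent), set(consequent)
    if x & y:
        raise ValueError("X and Y must be disjoint")
    n_xy = sum(1 for t in transactions if (x | y) <= t)   # sigma(X ∪ Y)
    n_x = sum(1 for t in transactions if x <= t)          # sigma(X)
    support = n_xy / len(transactions)
    confidence = n_xy / n_x if n_x else 0.0               # guard: X never occurs
    return support, confidence


s, c = rule_metrics({"Milk", "Diapers"}, {"Beer"}, TRANSACTIONS)
print(f"support={s:.2f} confidence={c:.2f}")  # support=0.40 confidence=0.67
```

Returning a confidence of 0.0 when the antecedent never occurs is a choice made for this sketch; the text does not define confidence in that case.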

Confidence, on the other hand, measures the reliability of the inference made by a rule. For a given rule $X \rightarrow Y$, the higher the confidence, the more likely it is for Y to be present in transactions that contain X. Confidence also provides an estimate of the conditional probability of Y given X.

Association analysis results should be interpreted with caution. The inference made by an association rule does not necessarily imply causality. Instead, it can sometimes suggest a strong co-occurrence relationship between items in the antecedent and consequent of the rule. Causality, on the other hand, requires knowledge about which attributes in the data capture cause and effect, and typically involves relationships occurring over time (e.g., greenhouse gas emissions lead to global warming). See Section 5.7.1 for additional discussion.

Formulation of the Association Rule Mining Problem

The association rule mining problem can be formally stated as follows:

Definition 5.1. (Association Rule Discovery.)

Given a set of transactions T, find all the rules having support ≥ minsup and confidence ≥ minconf, where minsup and minconf are the corresponding support and confidence thresholds.

A brute-force approach for mining association rules is to compute the support and confidence for every possible rule. This approach is prohibitively expensive because there are exponentially many rules that can be extracted from a data set. More specifically, assuming that neither the left nor the right-hand side of the rule is an empty set, the total number of possible rules, R, extracted from a data set that contains d items is

$$R = 3^d - 2^{d+1} + 1. \tag{5.3}$$

The proof for this equation is left as an exercise to the readers (see Exercise 5 on page 440). Even for the small data set shown in Table 5.1, this approach requires us to compute the support and confidence for $3^6 - 2^7 + 1 = 602$ rules. More than 80% of the rules are discarded after applying minsup = 20% and minconf = 50%, thus wasting most of the computations. To avoid performing needless computations, it would be useful to prune the rules early without having to compute their support and confidence values.

An initial step toward improving the performance of association rule mining algorithms is to decouple the support and confidence requirements. From Equation 5.1, notice that the support of a rule $X \rightarrow Y$ is the same as the support of its corresponding itemset, $X \cup Y$. For example, the following rules have identical support because they involve items from the same itemset, {Beer, Diapers, Milk}:

{Beer, Diapers} → {Milk}, {Beer, Milk} → {Diapers}, {Diapers, Milk} → {Beer}, {Beer} → {Diapers, Milk}, {Milk} → {Beer, Diapers}, {Diapers} → {Beer, Milk}.

If the itemset is infrequent, then all six candidate rules can be pruned

immediately without our having to compute their confidence values.

Therefore, a common strategy adopted by many association rule mining

algorithms is to decompose the problem into two major subtasks:

1. Frequent Itemset Generation, whose objective is to find all the

itemsets that satisfy the minsup threshold.

2. Rule Generation, whose objective is to extract all the high confidence

rules from the frequent itemsets found in the previous step. These rules

are called strong rules.

The computational requirements for frequent itemset generation are generally

more expensive than those of rule generation. Efficient techniques for

generating frequent itemsets and association rules are discussed in Sections

5.2 and 5.3 , respectively.
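As a sanity check on Equation 5.3 (not a proof), the rule count can be compared against a direct enumeration for small d. The sketch below uses hypothetical function names; it enumerates every itemset of two or more items and every way of choosing a non-empty proper subset of it as the antecedent.

```python
from itertools import combinations


def count_rules(d):
    """Count rules X -> Y with X and Y non-empty and disjoint, over d items."""
    total = 0
    for size in range(2, d + 1):                   # size of the itemset X ∪ Y
        for itemset in combinations(range(d), size):
            for r in range(1, size):               # size of the antecedent X
                # each r-subset is an antecedent; the rest is the consequent
                total += sum(1 for _ in combinations(itemset, r))
    return total


def closed_form(d):
    """Equation 5.3."""
    return 3 ** d - 2 ** (d + 1) + 1


print(count_rules(6), closed_form(6))  # 602 602
```

Grouping the enumeration by itemset mirrors the decoupling described above: all rules drawn from the same itemset share one support value, so the support test only has to be made once per itemset.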

5.2 Frequent Itemset Generation

A lattice structure can be used to enumerate the list of all possible itemsets. Figure 5.1 shows an itemset lattice for $I = \{a, b, c, d, e\}$. In general, a data set that contains k items can potentially generate up to $2^k - 1$ frequent itemsets, excluding the null set. Because k can be very large in many practical applications, the search space of itemsets that need to be explored is exponentially large.

Figure 5.1.

An itemset lattice.

A brute-force approach for finding frequent itemsets is to determine the support count for every candidate itemset in the lattice structure. To do this, we need to compare each candidate against every transaction, an operation that is shown in Figure 5.2. If the candidate is contained in a transaction, its support count will be incremented. For example, the support for {Bread, Milk} is incremented three times because the itemset is contained in transactions 1, 4, and 5. Such an approach can be very expensive because it requires O(NMw) comparisons, where N is the number of transactions, $M = 2^k - 1$ is the number of candidate itemsets, and w is the maximum transaction width. Transaction width is the number of items present in a transaction.

Figure 5.2.

Counting the support of candidate itemsets.

There are three main approaches for reducing the computational complexity of frequent itemset generation.

1. Reduce the number of candidate itemsets (M). The Apriori principle, described in the next Section, is an effective way to eliminate some of the candidate itemsets without counting their support values.

2. Reduce the number of comparisons. Instead of matching each

candidate itemset against every transaction, we can reduce the number

of comparisons by using more advanced data structures, either to store

the candidate itemsets or to compress the data set. We will discuss

these strategies in Sections 5.2.4 and 5.6 , respectively.

3. Reduce the number of transactions (N). As the size of candidate

itemsets increases, fewer transactions will be supported by the

itemsets. For instance, since the width of the first transaction in Table

5.1 is 2, it would be advantageous to remove this transaction before

searching for frequent itemsets of size 3 and larger. Algorithms that

employ such a strategy are discussed in the Bibliographic Notes.
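For reference, the brute-force baseline that these three strategies improve on can be sketched as follows. This is an illustration only, assuming the five transactions of the running example and a minimum support count of 3; the function name and the returned candidate counter are hypothetical.

```python
from itertools import combinations

# Transactions of the running example, one set of items per TID.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def brute_force_frequent_itemsets(transactions, minsup_count):
    """Count every non-empty itemset in the lattice against every transaction."""
    items = sorted(set().union(*transactions))
    frequent, n_candidates = {}, 0
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            n_candidates += 1                       # M grows to 2**d - 1
            count = sum(1 for t in transactions if t.issuperset(candidate))
            if count >= minsup_count:
                frequent[candidate] = count
    return frequent, n_candidates


frequent, m = brute_force_frequent_itemsets(TRANSACTIONS, minsup_count=3)
print(m)                            # 63
print(frequent[("Bread", "Milk")])  # 3
```

With six items the lattice holds 63 candidates, each compared against all five transactions, even though only a handful turn out to be frequent.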

5.2.1 The Apriori Principle

This Section describes how the support measure can be used to reduce the

number of candidate itemsets explored during frequent itemset generation.

The use of support for pruning candidate itemsets is guided by the following

principle.

Theorem 5.1 (Apriori Principle).

If an itemset is frequent, then all of its subsets must also be

frequent.

To illustrate the idea behind the Apriori principle, consider the itemset lattice

shown in Figure 5.3 . Suppose {c, d, e} is a frequent itemset. Clearly, any

transaction that contains {c, d, e} must also contain its subsets, {c, d}, {c, e},

{d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent, then all subsets of {c,

d, e} (i.e., the shaded itemsets in this figure) must also be frequent.

Figure 5.3.

An illustration of the Apriori principle. If {c, d, e} is frequent, then all subsets of

this itemset are frequent.

Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets

must be infrequent too. As illustrated in Figure 5.4 , the entire subgraph

containing the supersets of {a, b} can be pruned immediately once {a, b} is

found to be infrequent. This strategy of trimming the exponential search space

based on the support measure is known as support-based pruning. Such a

pruning strategy is made possible by a key property of the support measure,

namely, that the support for an itemset never exceeds the support for its

subsets. This property is also known as the anti-monotone property of the

support measure.
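The anti-monotone behavior of support can be checked empirically on the running example. The sketch below is an illustration, not a proof: it assumes the five market basket transactions are stored as Python sets and uses hypothetical helper names. Comparing each itemset only with its immediate subsets is enough, since the inequality then extends to smaller subsets by transitivity.

```python
from itertools import combinations

# Transactions of the running example, one set of items per TID.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
ITEMS = sorted(set().union(*TRANSACTIONS))


def support_count(itemset, transactions):
    x = set(itemset)
    return sum(1 for t in transactions if x <= t)


def anti_monotone_violations(transactions):
    """Return pairs (X, Y), X a subset of Y, where sigma(Y) > sigma(X)."""
    bad = []
    for k in range(2, len(ITEMS) + 1):
        for y in combinations(ITEMS, k):
            for x in combinations(y, k - 1):        # immediate subsets of Y
                if support_count(y, transactions) > support_count(x, transactions):
                    bad.append((x, y))
    return bad


print(anti_monotone_violations(TRANSACTIONS))  # []

# Support-based pruning: {Beer, Bread} occurs twice, so no superset can occur
# more often than that.
supersets = [y for k in range(3, len(ITEMS) + 1)
             for y in combinations(ITEMS, k) if {"Beer", "Bread"} <= set(y)]
print(max(support_count(y, TRANSACTIONS) for y in supersets))  # 2
```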

Figure 5.4.

An illustration of support-based pruning. If {a, b} is infrequent, then all

supersets of {a, b} are infrequent.

Definition 5.2. (Anti-monotone Property.)

A measure f possesses the anti-monotone property if for every itemset X that is a proper subset of itemset Y, i.e., $X \subset Y$, we have $f(Y) \le f(X)$.

More generally, a large number of measures (see Section 5.7.1) can be applied to itemsets to evaluate various properties of itemsets. As will be shown in the next Section, any measure that has the anti-monotone property can be incorporated directly into an itemset mining algorithm to effectively prune the exponential search space of candidate itemsets.

5.2.2 Frequent Itemset Generation in the Apriori Algorithm

Apriori is the first association rule mining algorithm that pioneered the use of support-based pruning to systematically control the exponential growth of candidate itemsets. Figure 5.5 provides a high-level illustration of the frequent itemset generation part of the Apriori algorithm for the transactions shown in Table 5.1. We assume that the support threshold is 60%, which is equivalent to a minimum support count equal to 3.

Figure 5.5.

Illustration of frequent itemset generation using the Apriori algorithm.

Initially, every item is considered as a candidate 1-itemset. After counting their supports, the candidate itemsets {Cola} and {Eggs} are discarded because they appear in fewer than three transactions. In the next iteration, candidate 2-itemsets are generated using only the frequent 1-itemsets because the Apriori principle ensures that all supersets of the infrequent 1-itemsets must be infrequent. Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets generated by the algorithm is $\binom{4}{2} = 6$. Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are subsequently found to be infrequent after computing their support values. The remaining four candidates are frequent, and thus will be used to generate candidate 3-itemsets. Without support-based pruning, there are $\binom{6}{3} = 20$ candidate 3-itemsets that can be formed using the six items given in this example. With the Apriori principle, we only need to keep candidate 3-itemsets whose subsets are frequent. The only candidate that has this property is {Bread, Diapers, Milk}. However, even though the subsets of {Bread, Diapers, Milk} are frequent, the itemset itself is not.

The effectiveness of the Apriori pruning strategy can be shown by counting the number of candidate itemsets generated. A brute-force strategy of enumerating all itemsets (up to size 3) as candidates will produce $\binom{6}{1} + \binom{6}{2} + \binom{6}{3} = 6 + 15 + 20 = 41$ candidates. With the Apriori principle, this number decreases to $\binom{6}{1} + \binom{4}{2} + 1 = 6 + 6 + 1 = 13$ candidates, which represents a 68% reduction in the number of candidate itemsets even in this simple example.

The pseudocode for the frequent itemset generation part of the Apriori algorithm is shown in Algorithm 5.1. Let $C_k$ denote the set of candidate k-itemsets and $F_k$ denote the set of frequent k-itemsets:

The algorithm initially makes a single pass over the data set to determine the support of each item. Upon completion of this step, the set of all frequent 1-itemsets, $F_1$, will be known (steps 1 and 2).

Next, the algorithm will iteratively generate new candidate k-itemsets and prune unnecessary candidates that are guaranteed to be infrequent given the frequent $(k-1)$-itemsets found in the previous iteration (steps 5 and 6). Candidate generation and pruning is implemented using the functions candidate-gen and candidate-prune, which are described in Section 5.2.3.

To count the support of the candidates, the algorithm needs to make an additional pass over the data set (steps 7–12). The subset function is used to determine all the candidate itemsets in $C_k$ that are contained in each transaction t. The implementation of this function is described in Section 5.2.4.

After counting their supports, the algorithm eliminates all candidate itemsets whose support counts are less than $N \times minsup$ (step 13).

The algorithm terminates when there are no new frequent itemsets generated, i.e., $F_k = \emptyset$ (step 14).

The frequent itemset generation part of the Apriori algorithm has two important characteristics. First, it is a level-wise algorithm; i.e., it traverses the itemset lattice one level at a time, from frequent 1-itemsets to the maximum size of frequent itemsets. Second, it employs a generate-and-test strategy for finding frequent itemsets. At each iteration (level), new candidate itemsets are generated from the frequent itemsets found in the previous iteration. The support for each candidate is then counted and tested against the minsup threshold. The total number of iterations needed by the algorithm is $k_{max} + 1$, where $k_{max}$ is the maximum size of the frequent itemsets.

5.2.3 Candidate Generation and Pruning

The candidate-gen and candidate-prune functions shown in Steps 5 and 6 of Algorithm 5.1 generate candidate itemsets and prune unnecessary ones by performing the following two operations, respectively:

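1. Candidate Generation. This operation generates new candidate k-itemsets based on the frequent $(k-1)$-itemsets found in the previous iteration.

2. Candidate Pruning. This operation eliminates some of the candidate k-itemsets using support-based pruning, i.e., by removing k-itemsets whose subsets are known to be infrequent in previous iterations. Note that this pruning is done without computing the actual support of these k-itemsets (which could have required comparing them against each transaction).

Algorithm 5.1 Frequent itemset generation of the Apriori algorithm.

 1: $k = 1$.
 2: $F_k = \{\, i \mid i \in I \wedge \sigma(\{i\}) \ge N \times minsup \,\}$. {Find all frequent 1-itemsets}
 3: repeat
 4:   $k = k + 1$.
 5:   $C_k$ = candidate-gen($F_{k-1}$). {Generate candidate itemsets}
 6:   $C_k$ = candidate-prune($C_k$, $F_{k-1}$). {Prune candidate itemsets}
 7:   for each transaction $t \in T$ do
 8:     $C_t$ = subset($C_k$, t). {Identify all candidates that belong to t}
 9:     for each candidate itemset $c \in C_t$ do
10:       $\sigma(c) = \sigma(c) + 1$. {Increment support count}
11:     end for
12:   end for
13:   $F_k = \{\, c \mid c \in C_k \wedge \sigma(c) \ge N \times minsup \,\}$. {Extract the frequent k-itemsets}
14: until $F_k = \emptyset$
15: Result = $\bigcup F_k$.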
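The level-wise loop of Algorithm 5.1 can be sketched in Python as follows. This is a minimal illustration under several assumptions rather than a reference implementation: itemsets are sorted tuples, the threshold is passed as an absolute count (`minsup_count`, a hypothetical parameter standing in for N × minsup), candidates are generated by merging pairs of frequent (k−1)-itemsets that share their first k−2 items, and the subset function is replaced by a naive scan over all candidates.

```python
from itertools import combinations

TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def candidate_gen(prev_frequent, k):
    """Merge pairs of sorted (k-1)-itemsets that agree on their first k-2 items."""
    prev = sorted(prev_frequent)
    out = []
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:k - 2] != b[:k - 2]:
                break               # sorted input: no later itemset shares the prefix
            out.append(a + (b[-1],))
    return out


def candidate_prune(candidates, prev_frequent, k):
    """Keep a candidate only if all of its (k-1)-subsets are frequent."""
    prev = set(prev_frequent)
    return [c for c in candidates
            if all(sub in prev for sub in combinations(c, k - 1))]


def apriori(transactions, minsup_count):
    counts = {}
    for t in transactions:                          # steps 1-2: frequent 1-itemsets
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= minsup_count}
    result, k = dict(frequent), 1
    while frequent:                                 # steps 3-14
        k += 1
        candidates = candidate_prune(candidate_gen(frequent, k), frequent, k)
        counts = {c: 0 for c in candidates}
        for t in transactions:                      # steps 7-12, naive subset()
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= minsup_count}
        result.update(frequent)                     # step 15: union of all F_k
    return result


for itemset, n in sorted(apriori(TRANSACTIONS, minsup_count=3).items()):
    print(itemset, n)
```

On the running example with a minimum support count of 3, the sketch reports the four frequent items and four frequent 2-itemsets; the single candidate 3-itemset survives pruning but fails the support test.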

Candidate Generation

In principle, there are many ways to generate candidate itemsets. An effective candidate generation procedure must be complete and non-redundant. A candidate generation procedure is said to be complete if it does not omit any frequent itemsets. To ensure completeness, the set of candidate itemsets must subsume the set of all frequent itemsets, i.e., $\forall k: F_k \subseteq C_k$. A candidate generation procedure is non-redundant if it does not generate the same candidate itemset more than once. For example, the candidate itemset {a, b, c, d} can be generated in many ways: by merging {a, b, c} with {d}, {b, d} with {a, c}, {c} with {a, b, d}, etc. Generation of duplicate candidates leads to wasted computations and thus should be avoided for efficiency reasons. Also, an effective candidate generation procedure should avoid generating too many unnecessary candidates. A candidate itemset is unnecessary if at least one of its subsets is infrequent, and thus, eliminated in the candidate pruning step.

Next, we will briefly describe several candidate generation procedures, including the one used by the candidate-gen function.

Brute-Force Method

The brute-force method considers every k-itemset as a potential candidate and then applies the candidate pruning step to remove any unnecessary candidates whose subsets are infrequent (see Figure 5.6). The number of candidate itemsets generated at level k is equal to $\binom{d}{k}$, where d is the total number of items. Although candidate generation is rather trivial, candidate pruning becomes extremely expensive because a large number of itemsets must be examined.

Figure 5.6.

A brute-force method for generating candidate 3-itemsets.

$F_{k-1} \times F_1$ Method

An alternative method for candidate generation is to extend each frequent $(k-1)$-itemset with frequent items that are not part of the $(k-1)$-itemset. Figure 5.7 illustrates how a frequent 2-itemset such as {Beer, Diapers} can be augmented with a frequent item such as Bread to produce a candidate 3-itemset {Beer, Diapers, Bread}.

Figure 5.7.

Generating and pruning candidate k-itemsets by merging a frequent $(k-1)$-itemset with a frequent item. Note that some of the candidates are unnecessary because their subsets are infrequent.

The procedure is complete because every frequent k-itemset is composed of a frequent $(k-1)$-itemset and a frequent 1-itemset. Therefore, all frequent k-itemsets are part of the candidate k-itemsets generated by this procedure. Figure 5.7 shows that the $F_{k-1} \times F_1$ candidate generation method only produces four candidate 3-itemsets, instead of the $\binom{6}{3} = 20$ itemsets produced by the brute-force method. The $F_{k-1} \times F_1$ method generates a lower number of candidates because every candidate is guaranteed to contain at least one frequent $(k-1)$-itemset. While this procedure is a substantial improvement over the brute-force method, it can still produce a large number of unnecessary candidates, as the remaining subsets of a candidate itemset can still be infrequent.

Note that the approach discussed above does not prevent the same candidate itemset from being generated more than once. For instance, {Bread, Diapers, Milk} can be generated by merging {Bread, Diapers} with {Milk}, {Bread, Milk} with {Diapers}, or {Diapers, Milk} with {Bread}. One way to avoid generating duplicate candidates is by ensuring that the items in each frequent itemset are kept sorted in their lexicographic order. For example, itemsets such as {Bread, Diapers}, {Bread, Diapers, Milk}, and {Diapers, Milk} follow lexicographic order as the items within every itemset are arranged alphabetically. Each frequent $(k-1)$-itemset X is then extended with frequent items that are lexicographically larger than the items in X. For example, the itemset {Bread, Diapers} can be augmented with {Milk} because Milk is lexicographically larger than Bread and Diapers. However, we should not augment {Diapers, Milk} with {Bread} nor {Bread, Milk} with {Diapers} because they violate the lexicographic ordering condition. Every candidate k-itemset is thus generated exactly once, by merging the lexicographically largest item with the remaining $k-1$ items in the itemset. If the $F_{k-1} \times F_1$ method is used in conjunction with lexicographic ordering, then only two candidate 3-itemsets will be produced in the example illustrated in Figure 5.7. {Beer, Bread, Diapers} and {Beer, Bread, Milk} will not be generated because {Beer, Bread} is not a frequent 2-itemset.

$F_{k-1} \times F_{k-1}$ Method

This candidate generation procedure, which is used in the candidate-gen function of the Apriori algorithm, merges a pair of frequent $(k-1)$-itemsets only if their first $k-2$ items, arranged in lexicographic order, are identical. Let $A = \{a_1, a_2, \ldots, a_{k-1}\}$ and $B = \{b_1, b_2, \ldots, b_{k-1}\}$ be a pair of frequent $(k-1)$-itemsets, arranged lexicographically. A and B are merged if they satisfy the following conditions:

$$a_i = b_i \quad (\text{for } i = 1, 2, \ldots, k-2).$$

Note that in this case, $a_{k-1} \ne b_{k-1}$ because A and B are two distinct itemsets. The candidate k-itemset generated by merging A and B consists of the first $k-2$ common items followed by $a_{k-1}$ and $b_{k-1}$ in lexicographic order. This candidate generation procedure is complete, because for every lexicographically ordered frequent k-itemset, there exist two lexicographically ordered frequent $(k-1)$-itemsets that have identical items in the first $k-2$ positions.

In Figure 5.8, the frequent itemsets {Bread, Diapers} and {Bread, Milk} are merged to form a candidate 3-itemset {Bread, Diapers, Milk}. The algorithm does not have to merge {Beer, Diapers} with {Diapers, Milk} because the first item in both itemsets is different. Indeed, if {Beer, Diapers, Milk} is a viable candidate, it would have been obtained by merging {Beer, Diapers} with {Beer, Milk} instead. This example illustrates both the completeness of the candidate generation procedure and the advantages of using lexicographic ordering to prevent duplicate candidates. Also, if we order the frequent $(k-1)$-itemsets according to their lexicographic rank, itemsets with identical first $k-2$ items would take consecutive ranks. As a result, the $F_{k-1} \times F_{k-1}$ candidate generation method would consider merging a frequent itemset only with ones that occupy the next few ranks in the sorted list, thus saving some computations.

Figure 5.8.

Generating and pruning candidate k-itemsets by merging pairs of frequent $(k-1)$-itemsets.

Figure 5.8 shows that the $F_{k-1} \times F_{k-1}$ candidate generation procedure results in only one candidate 3-itemset. This is a considerable reduction from the four candidate 3-itemsets generated by the $F_{k-1} \times F_1$ method. This is because the $F_{k-1} \times F_{k-1}$ method ensures that every candidate k-itemset contains at least two frequent $(k-1)$-itemsets, thus greatly reducing the number of candidates that are generated in this step.

Note that there can be multiple ways of merging two frequent $(k-1)$-itemsets in the $F_{k-1} \times F_{k-1}$ procedure, one of which is merging if their first $k-2$ items are identical. An alternate approach could be to merge two frequent $(k-1)$-itemsets A and B if the last $k-2$ items of A are identical to the first $k-2$ items of B. For example, {Bread, Diapers} and {Diapers, Milk} could be merged using this approach to generate the candidate 3-itemset {Bread, Diapers, Milk}. As we will see later, this alternate $F_{k-1} \times F_{k-1}$ procedure is useful in generating sequential patterns, which will be discussed in Chapter 6.
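The three candidate generation procedures discussed in this Section can be compared side by side on the running example. The sketch below hard-codes the four frequent items and four frequent 2-itemsets that result from a minimum support count of 3; the function names are hypothetical, and itemsets are represented as lexicographically sorted tuples.

```python
from itertools import combinations

ITEMS = ["Beer", "Bread", "Cola", "Diapers", "Eggs", "Milk"]
F1 = ["Beer", "Bread", "Diapers", "Milk"]                 # frequent items
F2 = [("Beer", "Diapers"), ("Bread", "Diapers"),
      ("Bread", "Milk"), ("Diapers", "Milk")]             # frequent 2-itemsets


def brute_force(items, k):
    """Every k-itemset over all items is a candidate."""
    return list(combinations(sorted(items), k))


def extend_with_items(prev_frequent, frequent_items, ordered):
    """F_{k-1} x F_1: extend each frequent (k-1)-itemset with a frequent item."""
    out = []
    for itemset in prev_frequent:
        for item in frequent_items:
            if item in itemset:
                continue
            if ordered and item < itemset[-1]:
                continue        # only lexicographically larger items are allowed
            out.append(tuple(sorted(itemset + (item,))))
    return out


def merge_pairs(prev_frequent):
    """F_{k-1} x F_{k-1}: merge two itemsets that agree on their first k-2 items."""
    return [a + (b[-1],)
            for a, b in combinations(sorted(prev_frequent), 2)
            if a[:-1] == b[:-1]]


unordered = extend_with_items(F2, F1, ordered=False)
print(len(brute_force(ITEMS, 3)))                  # 20
print(len(unordered), len(set(unordered)))         # 8 4
print(extend_with_items(F2, F1, ordered=True))
print(merge_pairs(F2))                             # [('Bread', 'Diapers', 'Milk')]
```

Without the ordering condition, the $F_{k-1} \times F_1$ sketch performs eight merges that yield only four distinct candidates, which makes the duplicate generation visible; with ordering, each surviving candidate is produced once.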

Candidate Pruning

To illustrate the candidate pruning operation for a candidate k-itemset, $X = \{i_1, i_2, \ldots, i_k\}$, consider its k proper subsets, $X - \{i_j\}$ ($\forall j = 1, 2, \ldots, k$). If any of them are infrequent, then X is immediately pruned by using the Apriori principle. Note that we don’t need to explicitly ensure that all subsets of X of size less than $k-1$ are frequent (see Exercise 7). This approach greatly reduces the number of candidate itemsets considered during support counting. For the brute-force candidate generation method, candidate pruning requires checking only k subsets of size $k-1$ for each candidate k-itemset. However, since the $F_{k-1} \times F_1$ candidate generation strategy ensures that at least one of the $(k-1)$-size subsets of every candidate k-itemset is frequent, we only need to check for the remaining $k-1$ subsets. Likewise, the $F_{k-1} \times F_{k-1}$ strategy requires examining only $k-2$ subsets of every candidate k-itemset, since two of its $(k-1)$-size subsets are already known to be frequent in the candidate generation step.

5.2.4 Support Counting

Support counting is the process of determining the frequency of occurrence for every candidate itemset that survives the candidate pruning step. Support counting is implemented in steps 7 through 12 of Algorithm 5.1. A brute-force approach for doing this is to compare each transaction against every candidate itemset (see Figure 5.2) and to update the support counts of candidates contained in a transaction. This approach is computationally expensive, especially when the numbers of transactions and candidate itemsets are large.

An alternative approach is to enumerate the itemsets contained in each transaction and use them to update the support counts of their respective candidate itemsets. To illustrate, consider a transaction t that contains five items, {1, 2, 3, 5, 6}. There are $\binom{5}{3} = 10$ itemsets of size 3 contained in this transaction. Some of the itemsets may correspond to the candidate 3-itemsets under investigation, in which case, their support counts are incremented. Other subsets of t that do not correspond to any candidates can be ignored.

Figure 5.9 shows a systematic way for enumerating the 3-itemsets contained in t. Assuming that each itemset keeps its items in increasing lexicographic order, an itemset can be enumerated by specifying the smallest item first, followed by the larger items. For instance, given $t = \{1, 2, 3, 5, 6\}$, all the 3-itemsets contained in t must begin with item 1, 2, or 3. It is not possible to construct a 3-itemset that begins with items 5 or 6 because there are only two items in t whose labels are greater than or equal to 5. The number of ways to specify the first item of a 3-itemset contained in t is illustrated by the Level 1 prefix tree structure depicted in Figure 5.9. For instance, 1 [2 3 5 6] represents a 3-itemset that begins with item 1, followed by two more items chosen from the set {2, 3, 5, 6}.

Figure 5.9.

Enumerating subsets of three items from a transaction t.

After fixing the first item, the prefix tree structure at Level 2 represents the number of ways to select the second item. For example, 1 2 [3 5 6] corresponds to itemsets that begin with the prefix (1 2) and are followed by the items 3, 5, or 6. Finally, the prefix tree structure at Level 3 represents the complete set of 3-itemsets contained in t. For example, the 3-itemsets that begin with prefix {1 2} are {1, 2, 3}, {1, 2, 5}, and {1, 2, 6}, while those that begin with prefix {2 3} are {2, 3, 5} and {2, 3, 6}.

The prefix tree structure shown in Figure 5.9 demonstrates how itemsets contained in a transaction can be systematically enumerated, i.e., by specifying their items one by one, from the leftmost item to the rightmost item. We still have to determine whether each enumerated 3-itemset corresponds to an existing candidate itemset. If it matches one of the candidates, then the support count of the corresponding candidate is incremented. In the next Section, we illustrate how this matching operation can be performed efficiently using a hash tree structure.
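The level-by-level enumeration of Figure 5.9 can be written as a short recursive generator. The sketch below is illustrative: the function name and the three candidate itemsets are hypothetical, and a plain dictionary lookup stands in for the matching step until the hash tree is introduced.

```python
def enumerate_itemsets(transaction, k):
    """Yield the k-itemsets contained in a transaction, fixing items left to right."""
    items = sorted(transaction)

    def extend(prefix, start):
        remaining = k - len(prefix)
        if remaining == 0:
            yield tuple(prefix)
            return
        # Stop early enough that `remaining` items are still available to the right.
        for i in range(start, len(items) - remaining + 1):
            yield from extend(prefix + [items[i]], i + 1)

    yield from extend([], 0)


t = {1, 2, 3, 5, 6}
subsets = list(enumerate_itemsets(t, 3))
print(len(subsets))                             # 10
print([s for s in subsets if s[:2] == (1, 2)])  # [(1, 2, 3), (1, 2, 5), (1, 2, 6)]

# Hypothetical candidate 3-itemsets with their running support counts.
counts = {(1, 2, 5): 0, (2, 3, 6): 0, (1, 4, 5): 0}
for s in subsets:
    if s in counts:
        counts[s] += 1
print(counts)   # {(1, 2, 5): 1, (2, 3, 6): 1, (1, 4, 5): 0}
```

The loop bound in `extend` encodes the observation made above: no 3-itemset contained in t can begin with item 5 or 6, because too few larger items remain.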

Support Counting Using a Hash Tree*

In the Apriori algorithm, candidate itemsets are partitioned into different

buckets and stored in a hash tree. During support counting, itemsets

contained in each transaction are also hashed into their appropriate buckets.

That way, instead of comparing each itemset in the transaction with every

candidate itemset, it is matched only against candidate itemsets that belong to

the same bucket, as shown in Figure 5.10 .

Figure 5.10.

Counting the support of itemsets using hash structure.

Figure 5.11 shows an example of a hash tree structure. Each internal node of the tree uses the following hash function, $h(p) = (p - 1) \bmod 3$, where mod refers to the modulo (remainder) operator, to determine which branch of the current node should be followed next. For example, items 1, 4, and 7 are hashed to the same branch (i.e., the leftmost branch) because they have the same remainder after dividing the number by 3. All candidate itemsets are stored at the leaf nodes of the hash tree. The hash tree shown in Figure 5.11 contains 15 candidate 3-itemsets, distributed across 9 leaf nodes.

Figure 5.11.

Hashing a transaction at the root node of a hash tree.

Consider the transaction $t = \{1, 2, 3, 5, 6\}$. To update the support counts of the candidate itemsets, the hash tree must be traversed in such a way that all

the leaf nodes containing candidate 3-itemsets belonging to t must be visited

at least once. Recall that the 3-itemsets contained in t must begin with items

1, 2, or 3, as indicated by the Level 1 prefix tree structure shown in Figure

5.9 . Therefore, at the root node of the hash tree, the items 1, 2, and 3 of the

transaction are hashed separately. Item 1 is hashed to the left child of the root

node, item 2 is hashed to the middle child, and item 3 is hashed to the right

child. At the next level of the tree, the transaction is hashed on the second

item listed in the Level 2 tree structure shown in Figure 5.9 . For example,

after hashing on item 1 at the root node, items 2, 3, and 5 of the transaction

are hashed. Based on the hash function, items 2 and 5 are hashed to the

middle child, while item 3 is hashed to the right child, as shown in Figure

5.12 . This process continues until the leaf nodes of the hash tree are

reached. The candidate itemsets stored at the visited leaf nodes are

compared against the transaction. If a candidate is a subset of the transaction,

its support count is incremented. Note that not all the leaf nodes are visited

while traversing the hash tree, which helps in reducing the computational cost.

In this example, 5 out of the 9 leaf nodes are visited and 9 out of the 15

itemsets are compared against the transaction.

Figure 5.12.

Subset operation on the leftmost subtree of the root of a candidate hash tree.
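A compact hash tree along the lines described above might look as follows. This is a sketch under stated assumptions, not the structure drawn in Figures 5.11 and 5.12: the class and method names, the leaf capacity of 3, and the 15 candidate 3-itemsets are all illustrative choices, so the leaf layout and the number of leaves visited need not match the figures. Only the hash function h(p) = (p − 1) mod 3 and the traversal idea are taken from the text.

```python
class Node:
    def __init__(self, depth):
        self.depth = depth      # position of the item hashed at this node
        self.children = {}      # branch index -> Node (used by internal nodes)
        self.bucket = []        # candidate itemsets (used by leaf nodes)
        self.is_leaf = True


class HashTree:
    def __init__(self, k, max_leaf_size=3, branches=3):
        self.k, self.max_leaf_size, self.branches = k, max_leaf_size, branches
        self.root = Node(0)

    def _hash(self, item):
        return (item - 1) % self.branches          # h(p) = (p - 1) mod 3

    def _child(self, node, item):
        return node.children.setdefault(self._hash(item), Node(node.depth + 1))

    def insert(self, itemset):
        node = self.root
        while not node.is_leaf:
            node = self._child(node, itemset[node.depth])
        node.bucket.append(itemset)
        self._split_if_full(node)

    def _split_if_full(self, node):
        if len(node.bucket) <= self.max_leaf_size or node.depth >= self.k:
            return
        node.is_leaf = False
        bucket, node.bucket = node.bucket, []
        for itemset in bucket:
            self._child(node, itemset[node.depth]).bucket.append(itemset)
        for child in node.children.values():
            self._split_if_full(child)

    def candidates_in(self, transaction):
        """Return the stored candidates that are subsets of the transaction."""
        items, tset, found = sorted(transaction), set(transaction), set()

        def visit(node, start):
            if node.is_leaf:
                found.update(c for c in node.bucket if tset.issuperset(c))
                return
            last = len(items) - (self.k - node.depth)   # leave room for later items
            for i in range(start, last + 1):
                child = node.children.get(self._hash(items[i]))
                if child is not None:
                    visit(child, i + 1)

        visit(self.root, 0)
        return found


# Illustrative candidate 3-itemsets (an assumption, not read off Figure 5.11).
CANDIDATES = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
              (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
              (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
tree = HashTree(k=3)
for c in CANDIDATES:
    tree.insert(c)
print(sorted(tree.candidates_in({1, 2, 3, 5, 6})))
```

Candidates stored in a visited leaf are still compared against the whole transaction, because several items hash to the same branch and a leaf may therefore hold itemsets that the transaction does not contain.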

5.2.5 Computational Complexity

The computational complexity of the Apriori algorithm, which includes both its

runtime and storage, can be affected by the following factors.

Support Threshold

Lowering the support threshold often results in more itemsets being declared

as frequent. This has an adverse effect on the computational complexity of the

algorithm because more candidate itemsets must be generated and counted

at every level, as shown in Figure 5.13 . The maximum size of frequent

itemsets also tends to increase with lower support thresholds. This increases

the total number of iterations to be performed by the Apriori algorithm, further

increasing the computational cost.

Figure 5.13.

Effect of support threshold on the number of candidate and frequent itemsets

obtained from a benchmark data set.

Number of Items (Dimensionality)

As the number of items increases, more space will be needed to store the

support counts of items. If the number of frequent items also grows with the

dimensionality of the data, the runtime and storage requirements will increase

because of the larger number of candidate itemsets generated by the

algorithm.

Number of Transactions

Because the Apriori algorithm makes repeated passes over the transaction

data set, its run time increases with a larger number of transactions.

Average Transaction Width

For dense data sets, the average transaction width can be very large. This

affects the complexity of the Apriori algorithm in two ways. First, the maximum

size of frequent itemsets tends to increase as the average transaction width

increases. As a result, more candidate itemsets must be examined during

candidate generation and support counting, as illustrated in Figure 5.14 .

Second, as the transaction width increases, more itemsets are contained in

the transaction. This will increase the number of hash tree traversals

performed during support counting.

A detailed analysis of the time complexity for the Apriori algorithm is presented

next.

Figure 5.14.

Effect of average transaction width on the number of candidate and frequent

itemsets obtained from a synthetic data set.

Generation of frequent 1-itemsets

For each transaction, we need to update the support count for every item

present in the transaction. Assuming that w is the average transaction width,

this operation requires O(Nw) time, where N is the total number of

transactions.

Candidate generation

To generate candidate k-itemsets, pairs of frequent $(k-1)$-itemsets are merged to determine whether they have at least $k-2$ items in common. Each merging operation requires at most $k-2$ equality comparisons. Every merging step can produce at most one viable candidate k-itemset, while in the worst case, the algorithm must try to merge every pair of frequent $(k-1)$-itemsets found in the previous iteration. Therefore, the overall cost of merging frequent itemsets is

$$\sum_{k=2}^{w} (k-2)|C_k| < \text{Cost of merging} < \sum_{k=2}^{w} (k-2)|F_{k-1}|^2,$$

where w is the maximum transaction width. A hash tree is also constructed during candidate generation to store the candidate itemsets. Because the maximum depth of the tree is k, the cost for populating the hash tree with candidate itemsets is $O\left(\sum_{k=2}^{w} k|C_k|\right)$. During candidate pruning, we need to verify that the $k-2$ subsets of every candidate k-itemset are frequent. Since the cost for looking up a candidate in a hash tree is O(k), the candidate pruning step requires $O\left(\sum_{k=2}^{w} k(k-2)|C_k|\right)$ time.

Support counting

Each transaction of width produces itemsets of size k. This is also the

effective number of hash tree traversals performed for each transaction. The

cost for support counting is , where w is the maximum

transaction width and is the cost for updating the support count of a

candidate k-itemset in the hash tree.

|t| (|t|k)

O(N∑k(wk)αk)

αk
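As a small numerical check of the support-counting term, the following Python sketch enumerates the k-itemsets contained in one transaction and compares their number with the binomial coefficient. The transaction and the helper name `k_subsets` are assumptions made for illustration only.

```python
from itertools import combinations
from math import comb

def k_subsets(transaction, k):
    """Enumerate every k-itemset contained in one transaction."""
    return [frozenset(c) for c in combinations(sorted(transaction), k)]

# Hypothetical transaction of width |t| = 6 (not taken from the text).
t = {"a", "b", "c", "d", "e", "f"}

# Number of k-subsets for each k; the text equates this with the effective
# number of hash tree traversals performed for the transaction.
subset_counts = {k: len(k_subsets(t, k)) for k in range(1, len(t) + 1)}

for k, n in subset_counts.items():
    print(f"k={k}: {n} itemsets, C(|t|, k) = {comb(len(t), k)}")
```

The counts peak near k = |t|/2, which is one way to see why wide transactions make support counting expensive.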

5.3 Rule Generation

This Section describes how to extract association rules efficiently from a given frequent itemset. Each frequent k-itemset, Y, can produce up to $2^k - 2$ association rules, ignoring rules that have empty antecedents or consequents (∅→Y or Y→∅). An association rule can be extracted by partitioning the itemset Y into two non-empty subsets, X and Y−X, such that X→Y−X satisfies the confidence threshold. Note that all such rules must have already met the support threshold because they are generated from a frequent itemset.

Example 5.2.

Let X={a, b, c} be a frequent itemset. There are six candidate association rules that can be generated from X: {a, b}→{c}, {a, c}→{b}, {b, c}→{a}, {a}→{b, c}, {b}→{a, c}, and {c}→{a, b}. As each of their support is identical to the support for X, all the rules satisfy the support threshold.

Computing the confidence of an association rule does not require additional scans of the transaction data set. Consider the rule {1, 2}→{3}, which is generated from the frequent itemset X={1, 2, 3}. The confidence for this rule is σ({1, 2, 3})/σ({1, 2}). Because {1, 2, 3} is frequent, the anti-monotone property of support ensures that {1, 2} must be frequent, too. Since the support counts for both itemsets were already found during frequent itemset generation, there is no need to read the entire data set again.

5.3.1 Confidence-Based Pruning

Confidence does not show the anti-monotone property in the same way as the support measure. For example, the confidence for $X \rightarrow Y$ can be larger, smaller, or equal to the confidence for another rule $\tilde{X} \rightarrow \tilde{Y}$, where $\tilde{X} \subseteq X$ and $\tilde{Y} \subseteq Y$ (see Exercise 3 on page 439). Nevertheless, if we compare rules generated from the same frequent itemset Y, the following theorem holds for the confidence measure.

Theorem 5.2.

Let Y be an itemset and X a subset of Y. If a rule $X \rightarrow Y - X$ does not satisfy the confidence threshold, then any rule $\tilde{X} \rightarrow Y - \tilde{X}$, where $\tilde{X}$ is a subset of X, must not satisfy the confidence threshold as well.

To prove this theorem, consider the following two rules: $\tilde{X} \rightarrow Y - \tilde{X}$ and $X \rightarrow Y - X$, where $\tilde{X} \subset X$. The confidences of the rules are $\sigma(Y)/\sigma(\tilde{X})$ and $\sigma(Y)/\sigma(X)$, respectively. Since $\tilde{X}$ is a subset of X, $\sigma(\tilde{X}) \geq \sigma(X)$. Therefore, the former rule cannot have a higher confidence than the latter rule.

5.3.2 Rule Generation in Apriori Algorithm

The Apriori algorithm uses a level-wise approach for generating association rules, where each level corresponds to the number of items that belong to the rule consequent. Initially, all the high confidence rules that have only one item in the rule consequent are extracted. These rules are then used to generate new candidate rules. For example, if {acd}→{b} and {abd}→{c} are high confidence rules, then the candidate rule {ad}→{bc} is generated by merging the consequents of both rules. Figure 5.15 shows a lattice structure for the association rules generated from the frequent itemset {a, b, c, d}. If any node in the lattice has low confidence, then according to Theorem 5.2, the entire subgraph spanned by the node can be pruned immediately. Suppose the confidence for {bcd}→{a} is low. All the rules containing item a in their consequent, including {cd}→{ab}, {bd}→{ac}, {bc}→{ad}, and {d}→{abc}, can be discarded.

Figure 5.15.

Pruning of association rules using the confidence measure.

Pseudocode for the rule generation step is shown in Algorithms 5.2 and 5.3. Note the similarity between the procedure given in Algorithm 5.3 and the frequent itemset generation procedure given in Algorithm 5.1. The only difference is that, in rule generation, we do not

have to make additional passes over the data set to compute the confidence

of the candidate rules. Instead, we determine the confidence of each rule by

using the support counts computed during frequent itemset generation.

Algorithm 5.2 Rule generation of the Apriori algorithm.

Algorithm 5.3 Procedure ap-genrules(f_k, H_m).
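The level-wise procedure described in this Section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a transcription of Algorithms 5.2 and 5.3: the function names (`frequent_itemsets`, `merge_consequents`, `generate_rules`) and the five-transaction data set are hypothetical, and the support table is built by brute force so that the example is self-contained. Confidence is computed only from stored support counts, and a consequent is extended to the next level only if its rule met the confidence threshold, which is the pruning justified by Theorem 5.2.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup_count):
    """Brute-force support counting; suitable for a toy data set only."""
    items = sorted(set().union(*transactions))
    support = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            n = sum(1 for t in transactions if s <= t)
            if n >= minsup_count:
                support[s] = n
    return support

def merge_consequents(consequents):
    """Join m-item consequents into (m+1)-item candidates and keep a
    candidate only if every m-item subset of it survived the last level."""
    m = len(next(iter(consequents)))
    merged = {a | b for a in consequents for b in consequents
              if len(a | b) == m + 1}
    return {c for c in merged
            if all(frozenset(s) in consequents for s in combinations(c, m))}

def generate_rules(support, minconf):
    """Return (antecedent, consequent, confidence) triples. Confidence comes
    from the stored support counts, with no extra pass over the data."""
    rules = []
    for itemset, count in support.items():
        level = {frozenset([i]) for i in itemset} if len(itemset) > 1 else set()
        while level:
            kept = set()
            for h in level:
                conf = count / support[itemset - h]
                if conf >= minconf:
                    rules.append((itemset - h, h, conf))
                    kept.add(h)   # low-confidence consequents are not extended
            m = len(next(iter(level)))
            # Stop when nothing survived or the antecedent would become empty.
            if kept and m + 1 < len(itemset):
                level = merge_consequents(kept)
            else:
                level = set()
    return rules

# Assumed toy data set (hypothetical, chosen only for illustration).
transactions = [frozenset(t) for t in ("abc", "abcd", "bce", "acde", "de")]
support = frequent_itemsets(transactions, minsup_count=2)
rules = generate_rules(support, minconf=0.9)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), f"{conf:.2f}")
```

For the itemset {a, c, d} in this toy data, the consequents {a} and {c} both pass the threshold, so the merged consequent {a, c} is tested, whereas {d} fails and no consequent containing d is ever generated.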

5.3.3 An Example: Congressional Voting Records

This Section demonstrates the results of applying association analysis to the

voting records of members of the United States House of Representatives.

The data is obtained from the 1984 Congressional Voting Records Database,

which is available at the UCI machine learning data repository. Each

transaction contains information about the party affiliation for a representative

along with his or her voting record on 16 key issues. There are 435

transactions and 34 items in the data set. The items are listed in Table 5.3.

Table 5.3. List of binary attributes from the 1984 United States Congressional Voting Records. Source: The UCI machine learning repository.

1. Republican
2. Democrat
3. handicapped-infants=yes
4. handicapped-infants=no
5. water project cost sharing=yes
6. water project cost sharing=no
7. budget-resolution=yes
8. budget-resolution=no
9. physician fee freeze=yes
10. physician fee freeze=no
11. aid to El Salvador=yes
12. aid to El Salvador=no
13. religious groups in schools=yes
14. religious groups in schools=no
15. anti-satellite test ban=yes
16. anti-satellite test ban=no
17. aid to Nicaragua=yes
18. aid to Nicaragua=no
19. MX-missile=yes
20. MX-missile=no
21. immigration=yes
22. immigration=no
23. synfuel corporation cutback=yes
24. synfuel corporation cutback=no
25. education spending=yes
26. education spending=no
27. right-to-sue=yes
28. right-to-sue=no
29. crime=yes
30. crime=no
31. duty-free-exports=yes
32. duty-free-exports=no
33. export administration act=yes
34. export administration act=no

The Apriori algorithm is then applied to the data set with minsup=30% and minconf=90%. Some of the high confidence rules extracted by the algorithm are shown in Table 5.4. The first two rules suggest that most of the members who voted yes for aid to El Salvador and no for budget resolution and MX missile are Republicans; while those who voted no for aid to El Salvador and yes for budget resolution and MX missile are Democrats. These high confidence rules show the key issues that divide members from both political parties.

Table 5.4. Association rules extracted from the 1984 United States Congressional Voting Records.

Association Rule                                                                Confidence
{budget resolution=no, MX-missile=no, aid to El Salvador=yes}→{Republican}      91.0%
{budget resolution=yes, MX-missile=yes, aid to El Salvador=no}→{Democrat}       97.5%
{crime=yes, right-to-sue=yes, physician fee freeze=yes}→{Republican}            93.5%
{crime=no, right-to-sue=no, physician fee freeze=no}→{Democrat}                 100%

5.4 Compact Representation of Frequent Itemsets

In practice, the number of frequent itemsets produced from a transaction data

set can be very large. It is useful to identify a small representative set of

frequent itemsets from which all other frequent itemsets can be derived. Two

such representations are presented in this Section in the form of maximal and

closed frequent itemsets.

5.4.1 Maximal Frequent Itemsets

Definition 5.3. (Maximal Frequent Itemset.)

A frequent itemset is maximal if none of its immediate supersets

are frequent.

To illustrate this concept, consider the itemset lattice shown in Figure 5.16 .

The itemsets in the lattice are divided into two groups: those that are frequent

and those that are infrequent. A frequent itemset border, which is represented

by a dashed line, is also illustrated in the diagram. Every itemset located

above the border is frequent, while those located below the border (the

shaded nodes) are infrequent. Among the itemsets residing near the border,

{a, d}, {a, c, e}, and {b, c, d, e} are maximal frequent itemsets because all of

their immediate supersets are infrequent. For example, the itemset {a, d} is

maximal frequent because all of its immediate supersets, {a, b, d}, {a, c, d},

and {a, d, e}, are infrequent. In contrast, {a, c} is non-maximal because one of

its immediate supersets, {a, c, e}, is frequent.

Figure 5.16.

Maximal frequent itemset.

Maximal frequent itemsets effectively provide a compact representation of

frequent itemsets. In other words, they form the smallest set of itemsets from

which all frequent itemsets can be derived. For example, every frequent

itemset in Figure 5.16 is a subset of one of the three maximal frequent

itemsets, {a, d}, {a, c, e}, and {b, c, d, e}. If an itemset is not a proper subset of

any of the maximal frequent itemsets, then it is either infrequent (e.g., {a, d,

e}) or maximal frequent itself (e.g., {b, c, d, e}). Hence, the maximal frequent

itemsets {a, c, e}, {a, d}, and {b, c, d, e} provide a compact representation of

the frequent itemsets shown in Figure 5.16 . Enumerating all the subsets of

maximal frequent itemsets generates the complete list of all frequent itemsets.
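The relationship between frequent and maximal frequent itemsets can be checked on a small example. The sketch below is illustrative only: the five transactions are an assumed toy data set (not the lattice of Figure 5.16), the function names are hypothetical, and supports are counted by brute force. It relies on the anti-monotone property: a frequent itemset has no frequent immediate superset exactly when it has no frequent proper superset at all.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup_count):
    """Brute-force support counting; suitable for a toy data set only."""
    items = sorted(set().union(*transactions))
    support = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            n = sum(1 for t in transactions if s <= t)
            if n >= minsup_count:
                support[s] = n
    return support

def maximal_frequent(frequent):
    """Keep the frequent itemsets that have no frequent proper superset."""
    return {x for x in frequent if not any(x < y for y in frequent)}

def all_subsets(itemsets):
    """Enumerate every non-empty subset of the given itemsets."""
    out = set()
    for s in itemsets:
        for k in range(1, len(s) + 1):
            out.update(frozenset(c) for c in combinations(s, k))
    return out

# Assumed toy data set (hypothetical, chosen only for illustration).
transactions = [frozenset(t) for t in ("abc", "abcd", "bce", "acde", "de")]
frequent = frequent_itemsets(transactions, minsup_count=2)
maximal = maximal_frequent(frequent)
recovered = all_subsets(maximal)

print(sorted("".join(sorted(m)) for m in maximal))
print(recovered == set(frequent))
```

Here four maximal itemsets stand in for fourteen frequent ones, but the support counts of the subsets are not recoverable from them.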

Maximal frequent itemsets provide a valuable representation for data sets that

can produce very long, frequent itemsets, as there are exponentially many

frequent itemsets in such data. Nevertheless, this approach is practical only if

an efficient algorithm exists to explicitly find the maximal frequent itemsets.

We briefly describe one such approach in Section 5.5 .

Despite providing a compact representation, maximal frequent itemsets do not

contain the support information of their subsets. For example, the supports of the maximal frequent itemsets {a, c, e}, {a, d}, and {b, c, d, e} do not provide any information about the supports of their subsets, except that each subset meets the support threshold. An additional pass over the data set is therefore needed to

determine the support counts of the non-maximal frequent itemsets. In some

cases, it is desirable to have a minimal representation of itemsets that

preserves the support information. We describe such a representation in the

next Section.

5.4.2 Closed Itemsets

Closed itemsets provide a minimal representation of all itemsets without losing

their support information. A formal definition of a closed itemset is presented

below.

Definition 5.4. (Closed Itemset.)

An itemset X is closed if none of its immediate supersets has

exactly the same support count as X.

Put another way, X is not closed if at least one of its immediate supersets has

the same support count as X. Examples of closed itemsets are shown in

Figure 5.17 . To better illustrate the support count of each itemset, we have

associated each node (itemset) in the lattice with a list of its corresponding

transaction IDs. For example, since the node {b, c} is associated with

transaction IDs 1, 2, and 3, its support count is equal to three. From the

transactions given in this diagram, notice that the support for {b} is identical to that of {b, c}. This is because every transaction that contains b also contains c.

Hence, {b} is not a closed itemset. Similarly, since c occurs in every

transaction that contains both a and d, the itemset {a, d} is not closed as it has

the same support as its superset {a, c, d}. On the other hand, {b, c} is a closed

itemset because it does not have the same support count as any of its

supersets.

Figure 5.17.

An example of the closed frequent itemsets (with minimum support equal to

40%).

An interesting property of closed itemsets is that if we know their support

counts, we can derive the support count of every other itemset in the itemset

lattice without making additional passes over the data set. For example,

consider the 2-itemset {b, e} in Figure 5.17 . Since {b, e} is not closed, its

support must be equal to the support of one of its immediate supersets, {a, b,

e}, {b, c, e}, and {b, d, e}. Further, none of the supersets of {b, e} can have a

support greater than the support of {b, e}, due to the anti-monotone nature of

the support measure. Hence, the support of {b, e} can be computed by

examining the support counts of all of its immediate supersets of size three

and taking their maximum value. If an immediate superset is closed (e.g., {b,

c, e}), we would know its support count. Otherwise, we can recursively

compute its support by examining the supports of its immediate supersets of

size four. In general, the support count of any non-closed (k−1)-itemset can be determined as long as we know the support counts of all k-itemsets. Hence, one can devise an iterative algorithm that computes the support counts of itemsets at level k−1 using the support counts of itemsets at level k, starting from the level $k_{\max}$, where $k_{\max}$ is the size of the largest closed itemset.

Even though closed itemsets provide a compact representation of the support counts of all itemsets, they can still be exponentially large in number. Moreover, for most practical applications, we only need to determine the support count of all frequent itemsets. In this regard, closed frequent itemsets provide a compact representation of the support counts of all frequent itemsets, which can be defined as follows.

Definition 5.5. (Closed Frequent Itemset.)

An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to minsup.

In the previous example, assuming that the support threshold is 40%, {b, c} is a closed frequent itemset because its support is 60%. In Figure 5.17, the closed frequent itemsets are indicated by the shaded nodes.

Algorithms are available to explicitly extract closed frequent itemsets from a given data set. Interested readers may refer to the Bibliographic Notes at the end of this chapter for further discussions of these algorithms. We can use

closed frequent itemsets to determine the support counts for all non-closed

frequent itemsets. For example, consider the frequent itemset {a, d} shown in

Figure 5.17 . Because this itemset is not closed, its support count must be

equal to the maximum support count of its immediate supersets, {a, b, d}, {a,

c, d}, and {a, d, e}. Also, since {a, d} is frequent, we only need to consider the

support of its frequent supersets. In general, the support count of every non-closed frequent k-itemset can be obtained by considering the support of all its frequent supersets of size k+1. For example, since the only frequent superset of {a, d} is {a, c, d}, its support is equal to the support of {a, c, d}, which is 2. Using this methodology, an algorithm can be developed to compute the support for every frequent itemset. The pseudocode for this algorithm is shown in Algorithm 5.4. The algorithm proceeds in a specific-to-general fashion, i.e., from the largest to the smallest frequent itemsets. This is because, in order to find the support for a non-closed frequent itemset, the support for all of its supersets must be known. Note that the set of all frequent itemsets can be easily computed by taking the union of all subsets of frequent closed itemsets.

Algorithm 5.4 Support counting using closed frequent itemsets.
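The specific-to-general procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a transcription of Algorithm 5.4: the function name `supports_from_closed` is hypothetical, and the table of closed frequent itemsets is an assumed input chosen to agree with the supports quoted in the text (3 for {b, c} and 2 for {a, c, d}).

```python
from itertools import combinations

def supports_from_closed(closed):
    """Derive the support count of every frequent itemset from the closed
    frequent itemsets, working from the largest size down to size 1.

    closed: dict mapping frozenset -> support count.
    """
    kmax = max(len(c) for c in closed)
    support = {c: n for c, n in closed.items() if len(c) == kmax}
    level = set(support)                  # frequent itemsets of size k+1
    for k in range(kmax - 1, 0, -1):
        # Frequent k-itemsets: k-subsets of the level above plus closed ones.
        candidates = {frozenset(s) for f in level for s in combinations(f, k)}
        candidates |= {c for c in closed if len(c) == k}
        for f in candidates:
            if f in closed:
                support[f] = closed[f]
            else:
                # Non-closed: largest support among its frequent supersets
                # of size k+1.
                support[f] = max(support[g] for g in level if f < g)
        level = candidates
    return support

# Assumed closed frequent itemsets with their support counts (hypothetical
# values, consistent with the supports quoted in the text).
closed = {
    frozenset("c"): 4, frozenset("d"): 3, frozenset("e"): 3,
    frozenset("ac"): 3, frozenset("bc"): 3,
    frozenset("ce"): 2, frozenset("de"): 2,
    frozenset("abc"): 2, frozenset("acd"): 2,
}
support_all = supports_from_closed(closed)
print(support_all[frozenset("ad")])
print(len(support_all))
```

The non-closed itemset {a, d} inherits its support from {a, c, d}, its only frequent superset of size three, without any pass over the transactions.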

To illustrate the advantage of using closed frequent itemsets, consider the data set shown in Table 5.5, which contains ten transactions and fifteen items. The items can be divided into three groups: (1) Group A, which contains items a1 through a5; (2) Group B, which contains items b1 through b5; and (3) Group C, which contains items c1 through c5. Assuming that the support threshold is 20%, itemsets involving items from the same group are frequent, but itemsets involving items from different groups are infrequent. The total number of frequent itemsets is thus $3 \times (2^5 - 1) = 93$. However, there are only four closed frequent itemsets in the data: {a3, a4}, {a1, a2, a3, a4, a5}, {b1, b2, b3, b4, b5}, and {c1, c2, c3, c4, c5}. It is often sufficient to present only the closed frequent itemsets to the analysts instead of the entire set of frequent itemsets.

Table 5.5. A transaction data set for mining closed itemsets.

TID  a1 a2 a3 a4 a5 b1 b2 b3 b4 b5 c1 c2 c3 c4 c5
1    1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
2    1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
3    1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
4    0  0  1  1  0  1  1  1  1  1  0  0  0  0  0
5    0  0  0  0  0  1  1  1  1  1  0  0  0  0  0
6    0  0  0  0  0  1  1  1  1  1  0  0  0  0  0
7    0  0  0  0  0  0  0  0  0  0  1  1  1  1  1
8    0  0  0  0  0  0  0  0  0  0  1  1  1  1  1
9    0  0  0  0  0  0  0  0  0  0  1  1  1  1  1
10   0  0  0  0  0  0  0  0  0  0  1  1  1  1  1
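The counts quoted for Table 5.5 can be verified with a short brute-force computation. The sketch below encodes the ten rows of the table as bit strings; the variable names are assumptions made for this illustration. Closedness is tested against frequent proper supersets, which is sufficient here because any superset with the same support as a frequent itemset is itself frequent.

```python
from collections import Counter
from itertools import combinations

# Column order of Table 5.5: a1..a5, b1..b5, c1..c5.
ITEMS = [g + str(i) for g in "abc" for i in range(1, 6)]
ROWS = [
    "111110000000000",  # TID 1
    "111110000000000",  # TID 2
    "111110000000000",  # TID 3
    "001101111100000",  # TID 4
    "000001111100000",  # TID 5
    "000001111100000",  # TID 6
    "000000000011111",  # TID 7
    "000000000011111",  # TID 8
    "000000000011111",  # TID 9
    "000000000011111",  # TID 10
]
transactions = [frozenset(i for i, bit in zip(ITEMS, row) if bit == "1")
                for row in ROWS]

# Count every itemset that occurs in at least one transaction.
counts = Counter()
for t in transactions:
    for k in range(1, len(t) + 1):
        counts.update(frozenset(c) for c in combinations(sorted(t), k))

minsup_count = 2   # 20% of ten transactions
frequent = {s: n for s, n in counts.items() if n >= minsup_count}

# Closed frequent: no proper superset has exactly the same support count.
closed_frequent = {s: n for s, n in frequent.items()
                   if not any(s < t and n == m for t, m in frequent.items())}

print(len(frequent), len(closed_frequent))
for s, n in closed_frequent.items():
    print(sorted(s), n)
```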

Finally, note that all maximal frequent itemsets are closed because none of

the maximal frequent itemsets can have the same support count as their

immediate supersets. The relationships among frequent, closed, closed

frequent, and maximal frequent itemsets are shown in Figure 5.18 .

Figure 5.18.

Relationships among frequent, closed, closed frequent, and maximal frequent

itemsets.

5.5 Alternative Methods for Generating Frequent Itemsets*

Apriori is one of the earliest algorithms to have successfully addressed the

combinatorial explosion of frequent itemset generation. It achieves this by

applying the Apriori principle to prune the exponential search space. Despite

its significant performance improvement, the algorithm still incurs considerable

I/O overhead since it requires making several passes over the transaction

data set. In addition, as noted in Section 5.2.5 , the performance of the

Apriori algorithm may degrade significantly for dense data sets because of the

increasing width of transactions. Several alternative methods have been

developed to overcome these limitations and improve upon the efficiency of

the Apriori algorithm. The following is a high-level description of these

methods.

Traversal of Itemset Lattice

A search for frequent itemsets can be conceptually viewed as a traversal on

the itemset lattice shown in Figure 5.1 . The search strategy employed by

an algorithm dictates how the lattice structure is traversed during the frequent

itemset generation process. Some search strategies are better than others,

depending on the configuration of frequent itemsets in the lattice. An overview

of these strategies is presented next.

General-to-Specific versus Specific-to-General: The Apriori algorithm uses a general-to-specific search strategy, where pairs of frequent (k−1)-itemsets are merged to obtain candidate k-itemsets. This general-to-specific search strategy is effective, provided the maximum length of a frequent itemset is not too long. The configuration of frequent itemsets that works best with this strategy is shown in Figure 5.19(a), where the darker nodes represent infrequent itemsets. Alternatively, a specific-to-general search strategy looks for more specific frequent itemsets first, before finding the more general frequent itemsets. This strategy is useful to discover maximal frequent itemsets in dense transactions, where the frequent itemset border is located near the bottom of the lattice, as shown in Figure 5.19(b). The Apriori principle can be applied to prune all subsets of maximal frequent itemsets. Specifically, if a candidate k-itemset is maximal frequent, we do not have to examine any of its subsets of size k−1. However, if the candidate k-itemset is infrequent, we need to check all of its subsets of size k−1 in the next iteration. Another approach is to combine both general-to-specific and specific-to-general search strategies. This bidirectional approach requires more space to store the candidate itemsets, but it can help to rapidly identify the frequent itemset border, given the configuration shown in Figure 5.19(c).

Figure 5.19.

General-to-specific, specific-to-general, and bidirectional search.

Equivalence Classes: Another way to envision the traversal is to first

partition the lattice into disjoint groups of nodes (or equivalence classes). A

frequent itemset generation algorithm searches for frequent itemsets within

a particular equivalence class first before moving to another equivalence

class. As an example, the level-wise strategy used in the Apriori algorithm

can be considered to be partitioning the lattice on the basis of itemset

sizes; i.e., the algorithm discovers all frequent 1-itemsets first before

proceeding to larger-sized itemsets. Equivalence classes can also be

defined according to the prefix or suffix labels of an itemset. In this case,

two itemsets belong to the same equivalence class if they share a

common prefix or suffix of length k. In the prefix-based approach, the

algorithm can search for frequent itemsets starting with the prefix a before

looking for those starting with prefixes b, c, and so on. Both prefix-based

and suffix-based equivalence classes can be demonstrated using the tree-

like structure shown in Figure 5.20 .

Figure 5.20.

Equivalence classes based on the prefix and suffix labels of itemsets.

Breadth-First versus Depth-First: The Apriori algorithm traverses the

lattice in a breadth-first manner, as shown in Figure 5.21(a) . It first

discovers all the frequent 1-itemsets, followed by the frequent 2-itemsets,

and so on, until no new frequent itemsets are generated. The itemset

lattice can also be traversed in a depth-first manner, as shown in Figures

5.21(b) and 5.22 . The algorithm can start from, say, node a in Figure

5.22 , and count its support to determine whether it is frequent. If so, the

algorithm progressively expands the next level of nodes, i.e., ab, abc, and

so on, until an infrequent node is reached, say, abcd. It then backtracks to

another branch, say, abce, and continues the search from there.

Figure 5.21.

Breadth-first and depth-first traversals.

Figure 5.22.

Generating candidate itemsets using the depth-first approach.

The depth-first approach is often used by algorithms designed to find

maximal frequent itemsets. This approach allows the frequent itemset

border to be detected more quickly than using a breadth-first approach.

Once a maximal frequent itemset is found, substantial pruning can be

performed on its subsets. For example, if the node bcde shown in Figure

5.22 is maximal frequent, then the algorithm does not have to visit the

subtrees rooted at bd, be, c, d, and e because they will not contain any

maximal frequent itemsets. However, if abc is maximal frequent, only the

nodes such as ac and bc are not maximal frequent (but the subtrees of ac

and bc may still contain maximal frequent itemsets). The depth-first

approach also allows a different kind of pruning based on the support of

itemsets. For example, suppose the support for {a, b, c} is identical to the

support for {a, b}. The subtrees rooted at abd and abe can be skipped

because they are guaranteed not to have any maximal frequent itemsets.

The proof of this is left as an exercise to the readers.
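A depth-first traversal over prefix-based equivalence classes can be sketched as follows. This is a minimal illustration, assuming a toy data set of five transactions and a hypothetical function name `dfs_frequent`; supports are counted by scanning the transactions directly, and none of the maximal-itemset pruning discussed above is applied.

```python
def dfs_frequent(transactions, minsup_count):
    """Enumerate frequent itemsets depth-first, extending each prefix with
    items that come later in a fixed (alphabetical) order."""
    items = sorted(set().union(*transactions))
    found = {}   # insertion order records the order in which nodes are visited

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    def expand(prefix, start):
        for i in range(start, len(items)):
            node = prefix | {items[i]}
            count = support(node)
            if count >= minsup_count:
                found[node] = count
                expand(node, i + 1)   # go deeper before trying siblings
            # Otherwise backtrack: no extension of an infrequent node
            # can be frequent.

    expand(frozenset(), 0)
    return found

# Assumed toy data set (hypothetical, chosen only for illustration).
transactions = [frozenset(t) for t in ("abc", "abcd", "bce", "acde", "de")]
found = dfs_frequent(transactions, minsup_count=2)
print(["".join(sorted(s)) for s in found])
```

The visiting order starts a, ab, abc; the search then reaches the infrequent node abcd and backtracks to other branches, mirroring the traversal described in the text.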

Representation of Transaction Data Set

There are many ways to represent a transaction data set. The choice of

representation can affect the I/O costs incurred when computing the support

of candidate itemsets. Figure 5.23 shows two different ways of

representing market basket transactions. The representation on the left is

called a horizontal data layout, which is adopted by many association rule

mining algorithms, including Apriori. Another possibility is to store the list of

transaction identifiers (TID-list) associated with each item. Such a

representation is known as the vertical data layout. The support for each

candidate itemset is obtained by intersecting the TID-lists of its subset items.

The length of the TID-lists shrinks as we progress to larger sized itemsets.

However, one problem with this approach is that the initial set of TID-lists

might be too large to fit into main memory, thus requiring more sophisticated

techniques to compress the TID-lists. We describe another effective approach

to represent the data in the next Section.
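The vertical layout and support counting by TID-list intersection can be illustrated with a few lines of Python. The data set and the function names `to_vertical` and `support_count` are assumptions made for this sketch.

```python
def to_vertical(horizontal):
    """Convert a horizontal layout {tid: items} into TID-lists {item: tids}."""
    tidlists = {}
    for tid, items in horizontal.items():
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

def support_count(tidlists, itemset):
    """Support of an itemset = size of the intersection of its TID-lists."""
    lists = [tidlists.get(item, set()) for item in itemset]
    return len(set.intersection(*lists)) if lists else 0

# Assumed toy data set in horizontal layout (hypothetical).
horizontal = {1: "abc", 2: "abcd", 3: "bce", 4: "acde", 5: "de"}
tidlists = to_vertical(horizontal)

print(sorted(tidlists["b"]))            # TID-list of item b
print(support_count(tidlists, "bc"))    # TIDs shared by b and c
print(support_count(tidlists, "ad"))
```

Each intersection can only shrink the TID-list, which is why the lists get shorter as the itemsets grow.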

Figure 5.23.

Horizontal and vertical data format.


5.6 FP-Growth Algorithm*

This Section presents an alternative algorithm called FP-growth that takes a

radically different approach to discovering frequent itemsets. The algorithm

does not subscribe to the generate-and-test paradigm of Apriori. Instead, it

encodes the data set using a compact data structure called an FP-tree and

extracts frequent itemsets directly from this structure. The details of this

approach are presented next.

5.6.1 FP-Tree Representation

An FP-tree is a compressed representation of the input data. It is constructed

by reading the data set one transaction at a time and mapping each

transaction onto a path in the FP-tree. As different transactions can have

several items in common, their paths might overlap. The more the paths

overlap with one another, the more compression we can achieve using the

FP-tree structure. If the size of the FP-tree is small enough to fit into main

memory, this will allow us to extract frequent itemsets directly from the

structure in memory instead of making repeated passes over the data stored

on disk.

Figure 5.24 shows a data set that contains ten transactions and five items.

The structures of the FP-tree after reading the first three transactions are also

depicted in the diagram. Each node in the tree contains the label of an item

along with a counter that shows the number of transactions mapped onto the

given path. Initially, the FP-tree contains only the root node represented by the

null symbol. The FP-tree is subsequently extended in the following way:

Figure 5.24.

Construction of an FP-tree.

1. The data set is scanned once to determine the support count of each item. Infrequent items are discarded, while the frequent items are sorted in decreasing support counts inside every transaction of the data set. For the data set shown in Figure 5.24, a is the most frequent item, followed by b, c, d, and e.

2. The algorithm makes a second pass over the data to construct the FP-tree. After reading the first transaction, {a, b}, the nodes labeled as a and b are created. A path is then formed from null→a→b to encode the transaction. Every node along the path has a frequency count of 1.

3. After reading the second transaction, {b, c, d}, a new set of nodes is created for items b, c, and d. A path is then formed to represent the transaction by connecting the nodes null→b→c→d. Every node along this path also has a frequency count equal to one. Although the first two transactions have an item in common, which is b, their paths are disjoint because the transactions do not share a common prefix.

4. The third transaction, {a, c, d, e}, shares a common prefix item (which is a) with the first transaction. As a result, the path for the third transaction, null→a→c→d→e, overlaps with the path for the first transaction, null→a→b. Because of their overlapping path, the frequency count for node a is incremented to two, while the frequency counts for the newly created nodes, c, d, and e, are equal to one.

5. This process continues until every transaction has been mapped onto one of the paths given in the FP-tree. The resulting FP-tree after reading all the transactions is shown at the bottom of Figure 5.24.

The size of an FP-tree is typically smaller than the size of the uncompressed data because many transactions in market basket data often share a few items in common. In the best-case scenario, where all the transactions have the same set of items, the FP-tree contains only a single branch of nodes. The worst-case scenario happens when every transaction has a unique set of items. As none of the transactions have any items in common, the size of the FP-tree is effectively the same as the size of the original data. However, the physical storage requirement for the FP-tree is higher because it requires additional space to store pointers between nodes and counters for each item.
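The two-pass construction described in the steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the class `Node`, the function `build_fp_tree`, and its `descending` flag are hypothetical names. Only the first three transactions are quoted in the text; the remaining seven are assumptions chosen to be consistent with the item ordering and the counts stated in this Section.

```python
from collections import Counter

class Node:
    """One FP-tree node: an item label, a counter, and child links."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, minsup_count, descending=True):
    # First pass: support count of each item; infrequent items are dropped.
    counts = Counter(i for t in transactions for i in t)
    frequent = {i for i, c in counts.items() if c >= minsup_count}
    sign = -1 if descending else 1

    root = Node(None, None)      # the null root
    links = {}                   # item -> nodes carrying that item
    # Second pass: map each transaction onto a path from the root.
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (sign * counts[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:    # no shared prefix here: start a new branch
                child = Node(item, node)
                node.children[item] = child
                links.setdefault(item, []).append(child)
            child.count += 1     # shared prefix: only the counter grows
            node = child
    return root, links

# First three transactions as in the text; the other seven are assumed.
TRANSACTIONS = ["ab", "bcd", "acde", "ade", "abc",
                "abcd", "a", "abc", "abd", "bce"]

root, links = build_fp_tree(TRANSACTIONS, minsup_count=2)
root_rev, links_rev = build_fp_tree(TRANSACTIONS, minsup_count=2,
                                    descending=False)

print(len(root.children), len(links["a"]) + len(links["b"]))
print(len(root_rev.children), len(links_rev["a"]) + len(links_rev["b"]))
```

The `links` dictionary plays the role of the pointers that connect nodes carrying the same item. Building the tree a second time with `descending=False` gives a quick way to compare the effect of the item ordering on the branching at the root and on the number of nodes.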

The size of an FP-tree also depends on how the items are ordered. The

notion of ordering items in decreasing order of support counts relies on the

possibility that the high support items occur more frequently across all paths

and hence must be used as most commonly occurring prefixes. For example,

if the ordering scheme in the preceding example is reversed, i.e., from lowest

to highest support item, the resulting FP-tree is shown in Figure 5.25 . The

tree appears to be denser because the branching factor at the root node has

increased from 2 to 5 and the number of nodes containing the high support

items such as a and b has increased from 3 to 12. Nevertheless, ordering by

decreasing support counts does not always lead to the smallest tree,

especially when the high support items do not occur frequently together with

the other items. For example, suppose we augment the data set given in

Figure 5.24 with 100 transactions that contain {e}, 80 transactions that

contain {d}, 60 transactions that contain {c}, and 40 transactions that contain

{b}. Item e is now most frequent, followed by d, c, b, and a. With the

augmented transactions, ordering by decreasing support counts will result in

an FP-tree similar to Figure 5.25 , while a scheme based on increasing

support counts produces a smaller FP-tree similar to Figure 5.24(iv) .

Figure 5.25.

An FP-tree representation for the data set shown in Figure 5.24 with a

different item ordering scheme.

An FP-tree also contains a list of pointers connecting nodes that have the

same items. These pointers, represented as dashed lines in Figures 5.24

and 5.25 , help to facilitate the rapid access of individual items in the tree.

We explain how to use the FP-tree and its corresponding pointers for frequent

itemset generation in the next Section.

5.6.2 Frequent Itemset Generation in FP-Growth Algorithm

FP-growth is an algorithm that generates frequent itemsets from an FP-tree by

exploring the tree in a bottom-up fashion. Given the example tree shown in

Figure 5.24 , the algorithm looks for frequent itemsets ending in e first,

followed by d, c, b, and finally, a. This bottom-up strategy for finding frequent

itemsets ending with a particular item is equivalent to the suffix-based

approach described in Section 5.5 . Since every transaction is mapped onto

a path in the FP-tree, we can derive the frequent itemsets ending with a

particular item, say, e, by examining only the paths containing node e. These

paths can be accessed rapidly using the pointers associated with node e. The

extracted paths are shown in Figure 5.26(a). Similar paths for itemsets ending in d, c, b, and a are shown in Figures 5.26(b), (c), (d), and (e), respectively.

Figure 5.26.

Decomposing the frequent itemset generation problem into multiple

subproblems, where each subproblem involves finding frequent itemsets

ending in e, d, c, b, and a.

FP-growth finds all the frequent itemsets ending with a particular suffix by

employing a divide-and-conquer strategy to split the problem into smaller

subproblems. For example, suppose we are interested in finding all frequent

itemsets ending in e. To do this, we must first check whether the itemset {e}

itself is frequent. If it is frequent, we consider the subproblem of finding

frequent itemsets ending in de, followed by ce, be, and ae. In turn, each of these subproblems is further decomposed into smaller subproblems. By

merging the solutions obtained from the subproblems, all the frequent

itemsets ending in e can be found. Finally, the set of all frequent itemsets can

be generated by merging the solutions to the subproblems of finding frequent

itemsets ending in e, d, c, b, and a. This divide-and-conquer approach is the

key strategy employed by the FP-growth algorithm.

For a more concrete example on how to solve the subproblems, consider the

task of finding frequent itemsets ending with e.

1. The first step is to gather all the paths containing node e. These initial

paths are called prefix paths and are shown in Figure 5.27(a) .

Figure 5.27.

Example of applying the FP-growth algorithm to find frequent itemsets

ending in e.

2. From the prefix paths shown in Figure 5.27(a) , the support count for

e is obtained by adding the support counts associated with node e.

Assuming that the minimum support count is 2, {e} is declared a

frequent itemset because its support count is 3.

3. Because {e} is frequent, the algorithm has to solve the subproblems of

finding frequent itemsets ending in de, ce, be, and ae. Before solving

these subproblems, it must first convert the prefix paths into a

conditional FP-tree, which is structurally similar to an FP-tree, except

it is used to find frequent itemsets ending with a particular suffix. A

conditional FP-tree is obtained in the following way:

a. First, the support counts along the prefix paths must be updated

because some of the counts include transactions that do not

contain item e. For example, the rightmost path shown in Figure

5.27(a), null → b:2 → c:2 → e:1, includes a transaction {b, c}

that does not contain item e. The counts along the prefix path

must therefore be adjusted to 1 to reflect the actual number of

transactions containing {b, c, e}.

b. The prefix paths are truncated by removing the nodes for e.

These nodes can be removed because the support counts along

the prefix paths have been updated to reflect only transactions

that contain e and the subproblems of finding frequent itemsets

ending in de, ce, be, and ae no longer need information about

node e.

c. After updating the support counts along the prefix paths, some

of the items may no longer be frequent. For example, the node b

appears only once and has a support count equal to 1, which

means that there is only one transaction that contains both b

and e. Item b can be safely ignored in subsequent analysis

because all itemsets ending in be must be infrequent.

The conditional FP-tree for e is shown in Figure 5.27(b) . The tree

looks different than the original prefix paths because the frequency

counts have been updated and the nodes b and e have been

eliminated.

4. FP-growth uses the conditional FP-tree for e to solve the subproblems

of finding frequent itemsets ending in de, ce, and ae. To find the

frequent itemsets ending in de, the prefix paths for d are gathered from

the conditional FP-tree for e (Figure 5.27(c) ). By adding the

frequency counts associated with node d, we obtain the support count

for {d, e}. Since the support count is equal to 2, {d, e} is declared a

frequent itemset. Next, the algorithm constructs the conditional FP-tree

for de using the approach described in step 3. After updating the

support counts and removing the infrequent item c, the conditional FP-

tree for de is shown in Figure 5.27(d) . Since the conditional FP-tree

contains only one item, a, whose support is equal to minsup, the

algorithm extracts the frequent itemset {a, d, e} and moves on to the

next subproblem, which is to generate frequent itemsets ending in ce.

After processing the prefix paths for c, {c, e} is found to be frequent.

However, the conditional FP-tree for ce will have no frequent items and

thus will be eliminated. The algorithm proceeds to solve the next

subproblem and finds {a, e} to be the only frequent itemset remaining.

This example illustrates the divide-and-conquer approach used in the FP-

growth algorithm. At each recursive step, a conditional FP-tree is constructed

by updating the frequency counts along the prefix paths and removing all

infrequent items. Because the subproblems are disjoint, FP-growth will not

generate any duplicate itemsets. In addition, the counts associated with the

nodes allow the algorithm to perform support counting while generating the

common suffix itemsets.
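The recursion described in this section can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the ten transactions are assumed to be the data set behind Figure 5.24 (the figure is not reproduced here), and the names `Node`, `build_tree`, and `fp_growth` are hypothetical. Each recursive call rebuilds a small conditional FP-tree from the weighted prefix paths of one item, which mirrors steps 1 to 4 above.

```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item, a count, and links to parent and children."""

    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_tree(weighted_paths, min_count):
    """Build an FP-tree from (items, count) pairs.

    Returns the header table (item -> list of nodes holding that item,
    playing the role of the dashed pointers) and the item support counts.
    """
    support = defaultdict(int)
    for items, count in weighted_paths:
        for item in items:
            support[item] += count

    root = Node(None, None)
    header = defaultdict(list)
    for items, count in weighted_paths:
        # Drop infrequent items, then order by decreasing support
        # (ties broken alphabetically) so that paths share prefixes.
        ordered = sorted(
            (i for i in items if support[i] >= min_count),
            key=lambda i: (-support[i], i),
        )
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += count
    return header, support


def fp_growth(weighted_paths, min_count, suffix=()):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    header, support = build_tree(weighted_paths, min_count)
    found = {}
    for item, nodes in header.items():
        new_suffix = (item,) + suffix
        found[tuple(sorted(new_suffix))] = support[item]
        # Prefix paths of `item`, each weighted by the count of the node
        # they end in, so only transactions containing `item` are counted.
        prefix_paths = []
        for node in nodes:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                prefix_paths.append((path, node.count))
        # Recurse on the conditional FP-tree for the extended suffix.
        found.update(fp_growth(prefix_paths, min_count, new_suffix))
    return found


# Assumed data set: ten transactions over the items a-e.
transactions = ["ab", "bcd", "acde", "ade", "abc", "abcd", "a", "abc", "abd", "bce"]
frequent = fp_growth([(list(t), 1) for t in transactions], min_count=2)
ending_in_e = {k: v for k, v in frequent.items() if "e" in k}
print(ending_in_e)
```

With a minimum support count of 2, the itemsets ending in e returned by this sketch are {e}, {d, e}, {a, d, e}, {c, e}, and {a, e}, which agrees with the walk-through above. Because every suffix is extended by a different item, no itemset is produced twice.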

FP-growth is an interesting algorithm because it illustrates how a compact

representation of the transaction data set helps to efficiently generate frequent

itemsets. In addition, for certain transaction data sets, FP-growth outperforms

the standard Apriori algorithm by several orders of magnitude. The run-time

performance of FP-growth depends on the compaction factor of the data set.

If the resulting conditional FP-trees are very bushy (in the worst case, a full

prefix tree), then the performance of the algorithm degrades significantly

because it has to generate a large number of subproblems and merge the

results returned by each subproblem.

5.7 Evaluation of Association Patterns

Although the Apriori principle significantly reduces the exponential search

space of candidate itemsets, association analysis algorithms still have the

potential to generate a large number of patterns. For example, although the

data set shown in Table 5.1 contains only six items, it can produce

hundreds of association rules at particular support and confidence thresholds.

As the size and dimensionality of real commercial databases can be very

large, we can easily end up with thousands or even millions of patterns, many

of which might not be interesting. Identifying the most interesting patterns from

the multitude of all possible ones is not a trivial task because “one person’s

trash might be another person’s treasure.” It is therefore important to establish

a set of well-accepted criteria for evaluating the quality of association patterns.

The first set of criteria can be established through a data-driven approach to

define objective interestingness measures. These measures can be used

to rank patterns—itemsets or rules—and thus provide a straightforward way of

dealing with the enormous number of patterns that are found in a data set.

Some of the measures can also provide statistical information, e.g., itemsets

that involve a set of unrelated items or cover very few transactions are

considered uninteresting because they may capture spurious relationships in

the data and should be eliminated. Examples of objective interestingness

measures include support, confidence, and correlation.

The second set of criteria can be established through subjective arguments. A

pattern is considered subjectively uninteresting unless it reveals unexpected

information about the data or provides useful knowledge that can lead to

profitable actions. For example, the rule {Butter}→{Bread} may not be

interesting, despite having high support and confidence values, because the

relationship represented by the rule might seem rather obvious. On the other

hand, the rule {Diapers}→{Beer} is interesting because the relationship is

quite unexpected and may suggest a new cross-selling opportunity for

retailers. Incorporating subjective knowledge into pattern evaluation is a

difficult task because it requires a considerable amount of prior information

from domain experts. Readers interested in subjective interestingness

measures may refer to resources listed in the bibliography at the end of this

chapter.

5.7.1 Objective Measures of

Interestingness

An objective measure is a data-driven approach for evaluating the quality of

association patterns. It is domain-independent and requires only that the user

specifies a threshold for filtering low-quality patterns. An objective measure is

usually computed based on the frequency counts tabulated in a contingency

table. Table 5.6 shows an example of a contingency table for a pair of

binary variables, A and B. We use the notation $\bar{A}$ ($\bar{B}$) to indicate that A (B) is absent from a transaction. Each entry $f_{ij}$ in this 2×2 table denotes a frequency count. For example, $f_{11}$ is the number of times A and B appear together in the same transaction, while $f_{01}$ is the number of transactions that contain B but not A. The row sum $f_{1+}$ represents the support count for A, while the column sum $f_{+1}$ represents the support count for B. Finally, even though our discussion focuses mainly on asymmetric binary variables, note that contingency tables are also applicable to other attribute types such as symmetric binary, nominal, and ordinal variables.

Table 5.6. A 2-way contingency table for variables A and B.

              B            $\bar{B}$
A             $f_{11}$     $f_{10}$     $f_{1+}$
$\bar{A}$     $f_{01}$     $f_{00}$     $f_{0+}$
              $f_{+1}$     $f_{+0}$     N
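As a small illustration of how the entries of such a table are tabulated, the following Python sketch counts the four cells for a pair of items. The function name `contingency_table` and the five baskets are hypothetical and serve only as an example.

```python
def contingency_table(transactions, a, b):
    """Return (f11, f10, f01, f00) for items a and b over a list of item sets."""
    f11 = f10 = f01 = f00 = 0
    for t in transactions:
        has_a, has_b = a in t, b in t
        if has_a and has_b:
            f11 += 1      # both items present
        elif has_a:
            f10 += 1      # a present, b absent
        elif has_b:
            f01 += 1      # b present, a absent
        else:
            f00 += 1      # both absent
    return f11, f10, f01, f00


# Hypothetical baskets, used only for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
f11, f10, f01, f00 = contingency_table(baskets, "diapers", "beer")
row_sum = f11 + f10            # f1+, the support count of "diapers"
col_sum = f11 + f01            # f+1, the support count of "beer"
n = f11 + f10 + f01 + f00      # N, the number of transactions
print(f11, f10, f01, f00, row_sum, col_sum, n)
```

The row and column sums are derived from the four cells, so only the cells need to be stored.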

Limitations of the Support-Confidence Framework

The classical association rule mining formulation relies on the support and

confidence measures to eliminate uninteresting patterns. The drawback of

support, which is described more fully in Section 5.8 , is that many

potentially interesting patterns involving low support items might be eliminated

by the support threshold. The drawback of confidence is more subtle and is

best demonstrated with the following example.

Example 5.3.

Suppose we are interested in analyzing the relationship between people

who drink tea and coffee. We may gather information about the beverage

preferences among a group of people and summarize their responses into

a contingency table such as the one shown in Table 5.7 .

Table 5.7. Beverage preferences among a group of 1000 people.

                          Coffee    $\overline{Coffee}$
Tea                       150       50                     200
$\overline{Tea}$          650       150                    800
                          800       200                    1000

The information given in this table can be used to evaluate the association

rule {Tea}→{Coffee}. At first glance, it may appear that people who drink

tea also tend to drink coffee because the rule’s support (15%) and

confidence (75%) values are reasonably high. This argument would have

been acceptable except that the fraction of people who drink coffee,

regardless of whether they drink tea, is 80%, while the fraction of tea

drinkers who drink coffee is only 75%. Thus knowing that a person is a tea

drinker actually decreases her probability of being a coffee drinker from

80% to 75%! The rule {Tea}→{Coffee} is therefore misleading despite its

high confidence value.

Now consider a similar problem where we are interested in analyzing the

relationship between people who drink tea and people who use honey in

their beverage. Table 5.8 summarizes the information gathered over the

same group of people about their preferences for drinking tea and using

honey. If we evaluate the association rule {Tea}→{Honey} using this

information, we will find that the confidence value of this rule is merely

50%, which might be easily rejected using a reasonable threshold on the

confidence value, say 70%. One thus might consider that the preference of

a person for drinking tea has no influence on her preference for using

honey. However, the fraction of people who use honey, regardless of

whether they drink tea, is only 12%. Hence, knowing that a person drinks

tea significantly increases her probability of using honey from 12% to 50%.

Further, the fraction of people who do not drink tea but use honey is only

2.5%! This suggests that there is definitely some information in the

preference of a person for using honey given that she drinks tea. The rule

{Tea}→{Honey} may therefore be falsely rejected if confidence is used as

the evaluation measure.

Table 5.8. Information about people who drink tea and people who use honey in their beverage.

                          Honey     $\overline{Honey}$
Tea                       100       100                    200
$\overline{Tea}$          20        780                    800
                          120       880                    1000

Note that if we take the support of coffee drinkers into account, we would not

be surprised to find that many of the people who drink tea also drink coffee,

since the overall number of coffee drinkers is quite large by itself. What is

more surprising is that the fraction of tea drinkers who drink coffee is actually

less than the overall fraction of people who drink coffee, which points to an

inverse relationship between tea drinkers and coffee drinkers. Similarly, if we

account for the fact that the support of using honey is inherently small, it is

easy to understand that the fraction of tea drinkers who use honey will

naturally be small. Instead, what is important to measure is the change in the

fraction of honey users, given the information that they drink tea.
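The comparison made in Example 5.3 can be reproduced with a short Python sketch. The helper name `support_confidence_baseline` is hypothetical; the counts are taken from Tables 5.7 and 5.8, with each table given as (f11, f10, f01, f00) and tea as the antecedent.

```python
def support_confidence_baseline(f11, f10, f01, f00):
    """Return support and confidence of A -> B, plus the baseline s(B)."""
    n = f11 + f10 + f01 + f00
    support = f11 / n                # s(A, B)
    confidence = f11 / (f11 + f10)   # fraction of A-transactions that contain B
    baseline = (f11 + f01) / n       # s(B), ignoring A altogether
    return support, confidence, baseline


tea_coffee = support_confidence_baseline(150, 50, 650, 150)
tea_honey = support_confidence_baseline(100, 100, 20, 780)

for name, (s, c, base) in [("{Tea}->{Coffee}", tea_coffee),
                           ("{Tea}->{Honey}", tea_honey)]:
    direction = "raises" if c > base else "lowers"
    print(f"{name}: support={s:.2f}, confidence={c:.2f}, "
          f"baseline={base:.2f} (tea {direction} the probability)")
```

Confidence alone ranks {Tea}→{Coffee} (0.75) above {Tea}→{Honey} (0.50), whereas comparing each confidence with its baseline (0.80 and 0.12, respectively) reverses the conclusion.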

The limitations of the confidence measure are well-known and can be understood from a statistical perspective as follows. The support of a variable measures the probability of its occurrence, while the support s(A, B) of a pair of variables A and B measures the probability of the two variables occurring together. Hence, the joint probability P(A, B) can be written as

$$P(A, B) = s(A, B) = \frac{f_{11}}{N}.$$

If we assume A and B are statistically independent, i.e., there is no relationship between the occurrences of A and B, then $P(A, B) = P(A) \times P(B)$. Hence, under the assumption of statistical independence between A and B, the support $s_{indep}(A, B)$ of A and B can be written as

$$s_{indep}(A, B) = s(A) \times s(B) \quad \text{or equivalently,} \quad s_{indep}(A, B) = \frac{f_{1+}}{N} \times \frac{f_{+1}}{N}. \tag{5.4}$$

If the support between two variables, s(A, B), is equal to $s_{indep}(A, B)$, then A and B can be considered to be unrelated to each other. However, if s(A, B) is widely different from $s_{indep}(A, B)$, then A and B are most likely dependent. Hence, any deviation of s(A, B) from $s(A) \times s(B)$ can be seen as an indication of a statistical relationship between A and B. Since the confidence measure only considers the deviance of s(A, B) from s(A) and not from $s(A) \times s(B)$, it fails to account for the support of the consequent, namely s(B). This results in the detection of spurious patterns (e.g., {Tea}→{Coffee}) and the rejection of truly interesting patterns (e.g., {Tea}→{Honey}), as illustrated in the previous example.

Various objective measures that are not susceptible to the limitations of the confidence measure have been used to capture the deviance of s(A, B) from $s_{indep}(A, B)$. Below, we provide a brief description of some of these measures and discuss some of their properties.

Interest Factor

The interest factor, which is also called the “lift,” can be defined as follows:

$$I(A, B) = \frac{s(A, B)}{s(A) \times s(B)} = \frac{N f_{11}}{f_{1+} f_{+1}}. \tag{5.5}$$

Notice that $s(A) \times s(B) = s_{indep}(A, B)$. Hence, the interest factor measures the ratio of the support of a pattern s(A, B) against its baseline support $s_{indep}(A, B)$ computed under the statistical independence assumption. Using Equations 5.5 and 5.4, we can interpret the measure as follows:

$$I(A, B) \begin{cases} = 1, & \text{if A and B are independent;} \\ > 1, & \text{if A and B are positively related;} \\ < 1, & \text{if A and B are negatively related.} \end{cases} \tag{5.6}$$

For the tea-coffee example shown in Table 5.7, $I = \frac{0.15}{0.2 \times 0.8} = 0.9375$, thus suggesting a slight negative relationship between tea drinkers and coffee drinkers. Also, for the tea-honey example shown in Table 5.8, $I = \frac{0.1}{0.12 \times 0.2} = 4.1667$, suggesting a strong positive relationship between people who drink tea and people who use honey in their beverage. We can thus see that the interest factor is able to detect meaningful patterns in the tea-coffee and tea-honey examples. Indeed, the interest factor has a number of statistical advantages over the confidence measure that make it a suitable measure for analyzing statistical independence between variables.

Piatetsky-Shapiro (PS) Measure

Instead of computing the ratio between s(A, B) and $s_{indep}(A, B) = s(A) \times s(B)$, the PS measure considers the difference between s(A, B) and $s(A) \times s(B)$ as follows.

$$PS = s(A, B) - s(A) \times s(B) = \frac{f_{11}}{N} - \frac{f_{1+} f_{+1}}{N^2} \tag{5.7}$$

The PS value is 0 when A and B are mutually independent of each other. Otherwise, $PS > 0$ when there is a positive relationship between the two variables, and $PS < 0$ when there is a negative relationship.

Correlation Analysis

Correlation analysis is one of the most popular techniques for analyzing relationships between a pair of variables. For continuous variables, correlation is defined using Pearson’s correlation coefficient (see Equation 2.10 on page 83). For binary variables, correlation can be measured using the ϕ-coefficient, which is defined as

$$\phi = \frac{f_{11} f_{00} - f_{01} f_{10}}{\sqrt{f_{1+} f_{+1} f_{0+} f_{+0}}}. \tag{5.8}$$

If we rearrange the terms in Equation 5.8, we can show that the ϕ-coefficient can be rewritten in terms of the support measures of A, B, and {A, B} as follows:

$$\phi = \frac{s(A, B) - s(A) \times s(B)}{\sqrt{s(A) \times (1 - s(A)) \times s(B) \times (1 - s(B))}}. \tag{5.9}$$

Note that the numerator in the above equation is identical to the PS measure. Hence, the ϕ-coefficient can be understood as a normalized version of the PS measure, where the value of the ϕ-coefficient ranges from -1 to +1. From a statistical viewpoint, the correlation captures the normalized difference between s(A, B) and $s_{indep}(A, B)$. A correlation value of 0 means no relationship, while a value of +1 suggests a perfect positive relationship and a value of -1 suggests a perfect negative relationship. The correlation measure has a statistical meaning and hence is widely used to evaluate the strength of statistical dependence among variables. For instance, the correlation between tea and coffee drinkers in Table 5.7 is -0.0625, which is slightly less than 0. On the other hand, the correlation between people who drink tea and people who use honey in Table 5.8 is 0.5847, suggesting a positive relationship.

IS Measure

IS is an alternative measure for capturing the relationship between s(A, B) and $s(A) \times s(B)$. The IS measure is defined as follows:

$$IS(A, B) = \sqrt{I(A, B) \times s(A, B)} = \frac{s(A, B)}{\sqrt{s(A) s(B)}} = \frac{f_{11}}{\sqrt{f_{1+} f_{+1}}}. \tag{5.10}$$

Although the definition of IS looks quite similar to the interest factor, the two measures differ in some interesting ways. Since IS is the geometric mean between the interest factor and the support of a pattern, IS is large when both the interest factor and support are large. Hence, if the interest factors of two patterns are identical, IS prefers the pattern with higher support. It is also possible to show that IS is mathematically equivalent to the cosine measure for binary variables (see Equation 2.6 on page 81). The value of IS thus varies from 0 to 1, where an IS value of 0 corresponds to no co-occurrence of the two variables, while an IS value of 1 denotes a perfect relationship, since they occur in exactly the same transactions. For the tea-coffee example shown in Table 5.7, the value of IS is equal to 0.375, while the value of IS for the tea-honey example in Table 5.8 is 0.6455. The IS measure thus gives a higher value for the {Tea}→{Honey} rule than the {Tea}→{Coffee} rule, which is consistent with our understanding of the two rules.
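The four measures discussed so far can be computed from the same support values. The Python sketch below follows Equations 5.5, 5.7, 5.9, and 5.10; the function name `measures` and the dictionary keys are hypothetical, and the counts come from Tables 5.7 and 5.8.

```python
from math import sqrt


def measures(f11, f10, f01, f00):
    """Interest factor, PS, phi-coefficient, and IS for a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00
    s_ab = f11 / n                # s(A, B)
    s_a = (f11 + f10) / n         # s(A)
    s_b = (f11 + f01) / n         # s(B)
    s_indep = s_a * s_b           # baseline support under independence
    return {
        "I": s_ab / s_indep,                                       # Eq. 5.5
        "PS": s_ab - s_indep,                                      # Eq. 5.7
        "phi": (s_ab - s_indep)
               / sqrt(s_a * (1 - s_a) * s_b * (1 - s_b)),          # Eq. 5.9
        "IS": s_ab / sqrt(s_indep),                                # Eq. 5.10
    }


tea_coffee = measures(150, 50, 650, 150)   # Table 5.7
tea_honey = measures(100, 100, 20, 780)    # Table 5.8

for name, m in [("tea-coffee", tea_coffee), ("tea-honey", tea_honey)]:
    print(name, {k: round(v, 4) for k, v in m.items()})
```

For these two tables the sketch reproduces the values quoted in the text: I = 0.9375 and 4.1667, ϕ = -0.0625 and 0.5847, and IS = 0.375 and 0.6455. All four measures place the tea-honey pair above the tea-coffee pair.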

Alternative Objective Interestingness Measures

Note that all of the measures defined in the previous section use different techniques to capture the deviance between s(A, B) and $s_{indep} = s(A) \times s(B)$. Some measures use the ratio between s(A, B) and $s_{indep}(A, B)$, e.g., the interest factor and IS, while some other measures consider the difference between the two, e.g., the PS and the ϕ-coefficient. Some measures are bounded in a particular range, e.g., the IS and the ϕ-coefficient, while others are unbounded and do not have a defined maximum or minimum value, e.g., the interest factor. Because of such differences, these measures behave differently when applied to different types of patterns. Indeed, the measures defined above are not exhaustive and there exist many alternative measures for capturing different properties of relationships between pairs of binary variables. Table 5.9 provides the definitions for some of these measures in terms of the frequency counts of a 2×2 contingency table.

Table 5.9. Examples of objective measures for the itemset {A, B}.

Measure (Symbol)            Definition
Correlation (ϕ)             $\frac{N f_{11} - f_{1+} f_{+1}}{\sqrt{f_{1+} f_{+1} f_{0+} f_{+0}}}$
Odds ratio (α)              $(f_{11} f_{00}) / (f_{10} f_{01})$
Kappa (κ)                   $\frac{N f_{11} + N f_{00} - f_{1+} f_{+1} - f_{0+} f_{+0}}{N^2 - f_{1+} f_{+1} - f_{0+} f_{+0}}$
Interest (I)                $(N f_{11}) / (f_{1+} f_{+1})$
Cosine (IS)                 $f_{11} / \sqrt{f_{1+} f_{+1}}$
Piatetsky-Shapiro (PS)      $\frac{f_{11}}{N} - \frac{f_{1+} f_{+1}}{N^2}$
Collective strength (S)     $\frac{f_{11} + f_{00}}{f_{1+} f_{+1} + f_{0+} f_{+0}} \times \frac{N - f_{1+} f_{+1} - f_{0+} f_{+0}}{N - f_{11} - f_{00}}$
Jaccard (ζ)                 $f_{11} / (f_{1+} + f_{+1} - f_{11})$
All-confidence (h)          $\min\left[\frac{f_{11}}{f_{1+}}, \frac{f_{11}}{f_{+1}}\right]$

Consistency among Objective Measures

Given the wide variety of measures available, it is reasonable to question whether the measures can produce similar ordering results when applied to a set of association patterns. If the measures are consistent, then we can choose any one of them as our evaluation metric. Otherwise, it is important to understand what their differences are in order to determine which measure is more suitable for analyzing certain types of patterns.

Suppose the measures defined in Table 5.9 are applied to rank the ten contingency tables shown in Table 5.10. These contingency tables are chosen to illustrate the differences among the existing measures. The ordering produced by these measures is shown in Table 5.11 (with 1 as the most interesting and 10 as the least interesting table). Although some of the measures appear to be consistent with each other, others produce quite different ordering results. For example, the rankings given by the ϕ-coefficient agree mostly with those provided by κ and collective strength, but are quite different from the rankings produced by interest factor. Furthermore, a contingency table such as E10 is ranked lowest according to the ϕ-coefficient, but highest according to interest factor.

Table 5.10. Example of contingency tables.

Example    $f_{11}$    $f_{10}$    $f_{01}$    $f_{00}$
E1         8123        83          424         1370
E2         8330        2           622         1046
E3         3954        3080        5           2961
E4         2886        1363        1320        4431
E5         1500        2000        500         6000
E6         4000        2000        1000        3000
E7         9481        298         127         94
E8         4000        2000        2000        2000
E9         7450        2483        4           63
E10        61          2483        4           7452

Table 5.11. Rankings of contingency tables using the measures given in Table 5.9.

         ϕ     α     κ     I     IS    PS    S     ζ     h
E1       1     3     1     6     2     2     1     2     2
E2       2     1     2     7     3     5     2     3     3
E3       3     2     4     4     5     1     3     6     8
E4       4     8     3     3     7     3     4     7     5
E5       5     7     6     2     9     6     6     9     9
E6       6     9     5     5     6     4     5     5     7
E7       7     6     7     9     1     8     7     1     1
E8       8     10    8     8     8     7     8     8     7
E9       9     4     9     10    4     9     9     4     4
E10      10    5     10    1     10    10    10    10    10
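The disagreement between measures can be checked directly from the counts in Table 5.10. The following Python sketch ranks the ten tables by the ϕ-coefficient and by the interest factor only; the helper names `phi`, `interest`, and `rank` are hypothetical, and the remaining measures of Table 5.9 could be added in the same way.

```python
from math import sqrt

# Counts (f11, f10, f01, f00) copied from Table 5.10.
tables = {
    "E1": (8123, 83, 424, 1370),
    "E2": (8330, 2, 622, 1046),
    "E3": (3954, 3080, 5, 2961),
    "E4": (2886, 1363, 1320, 4431),
    "E5": (1500, 2000, 500, 6000),
    "E6": (4000, 2000, 1000, 3000),
    "E7": (9481, 298, 127, 94),
    "E8": (4000, 2000, 2000, 2000),
    "E9": (7450, 2483, 4, 63),
    "E10": (61, 2483, 4, 7452),
}


def phi(f11, f10, f01, f00):
    """phi-coefficient, Equation 5.8."""
    denom = sqrt((f11 + f10) * (f11 + f01) * (f01 + f00) * (f10 + f00))
    return (f11 * f00 - f01 * f10) / denom


def interest(f11, f10, f01, f00):
    """Interest factor, Equation 5.5."""
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))


def rank(measure):
    """Map each table name to its rank (1 = highest value of the measure)."""
    ordered = sorted(tables, key=lambda name: measure(*tables[name]), reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


phi_rank = rank(phi)
interest_rank = rank(interest)
for name in tables:
    print(name, phi_rank[name], interest_rank[name])
```

In this sketch E10 receives rank 10 under the ϕ-coefficient and rank 1 under the interest factor, the conflict pointed out in the text.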

Properties of Objective Measures

The results shown in Table 5.11 suggest that the measures greatly differ

from each other and can provide conflicting information about the quality of a

pattern. In fact, no measure is universally best for all applications. In the

following, we describe some properties of the measures that play an important

role in determining if they are suited for a certain application.

Inversion Property

Consider the binary vectors shown in Figure 5.28 . The 0/1 value in each

column vector indicates whether a transaction (row) contains a particular item

(column). For example, the vector A indicates that the item appears in the first

and last transactions, whereas the vector B indicates that the item is

contained only in the fifth transaction. The vectors $\bar{A}$ and $\bar{B}$ are the inverted versions of A and B, i.e., their 1 values have been changed to 0 values (presence to absence) and vice versa. Applying this transformation to a binary vector is called inversion. If a measure is invariant under the inversion operation, then its value for the vector pair $\{\bar{A}, \bar{B}\}$ should be identical to its value for {A, B}. The inversion property of a measure can be tested as follows.

Figure 5.28.

Effect of the inversion operation. The vectors $\bar{A}$ and $\bar{B}$ are inversions of vectors A and B, respectively.

Definition 5.6. (Inversion Property.)

An objective measure M is invariant under the inversion operation if its value remains the same when exchanging the frequency counts $f_{11}$ with $f_{00}$ and $f_{10}$ with $f_{01}$.

Measures that are invariant under inversion include the correlation (ϕ-coefficient), odds ratio, κ, and collective strength. These measures are especially useful in scenarios where the presence (1’s) of a variable is as

important as its absence (0’s). For example, if we compare two sets of

answers to a series of true/false questions where 0’s (true) and 1’s (false) are

equally meaningful, we should use a measure that gives equal importance to

occurrences of 0–0’s and 1–1’s. For the vectors shown in Figure 5.28, the ϕ-coefficient is equal to -0.1667 regardless of whether we consider the pair {A, B} or the pair $\{\bar{A}, \bar{B}\}$. Similarly, the odds ratio for both pairs of vectors is equal to a constant value of 0. Note that even though the ϕ-coefficient and the odds ratio are invariant to inversion, they can still show different results, as will be shown later.

Measures that do not remain invariant under the inversion operation include the interest factor and the IS measure. For example, the IS value for the pair $\{\bar{A}, \bar{B}\}$ in Figure 5.28 is 0.825, which reflects the fact that the 1’s in $\bar{A}$ and $\bar{B}$ occur frequently together. However, the IS value of its inverted pair {A, B} is equal to 0, since A and B do not have any co-occurrence of 1’s. For asymmetric binary variables, e.g., the occurrence of words in documents, this is indeed the desired behavior. A desired similarity measure between asymmetric variables should not be invariant to inversion, since for these variables, it is more meaningful to capture relationships based on the presence of a variable rather than its absence. On the other hand, if we are dealing with symmetric binary variables where the relationships between 0’s and 1’s are equally meaningful, care should be taken to ensure that the chosen measure is invariant to inversion.

Although the values of the interest factor and IS change with the inversion operation, they can still be inconsistent with each other. To illustrate this, consider Table 5.12, which shows the contingency tables for two pairs of variables, {p, q} and {r, s}. Note that r and s are inverted transformations of p and q, respectively, where the roles of 0’s and 1’s have just been reversed. The interest factor for {p, q} is 1.02 and for {r, s} is 4.08, which means that the interest factor finds the inverted pair {r, s} more related than the original pair {p, q}. On the contrary, the IS value decreases upon inversion from 0.946 for

{p, q} to 0.286 for {r, s}, suggesting quite an opposite trend to that of the

interest factor. Even though these measures conflict with each other for this

example, they may be the right choice of measure in different applications.
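The inversion test can be illustrated with a short Python sketch. The vectors below are an assumption based on the description of Figure 5.28: ten transactions, with A present in the first and last transaction and B present only in the fifth. The helper names are hypothetical.

```python
from math import sqrt


def counts(x, y):
    """Return (f11, f10, f01, f00) for two equal-length 0/1 vectors."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return f11, f10, f01, f00


def invert(x):
    """Swap presence (1) and absence (0)."""
    return [1 - v for v in x]


def phi(f11, f10, f01, f00):
    denom = sqrt((f11 + f10) * (f11 + f01) * (f01 + f00) * (f10 + f00))
    return (f11 * f00 - f01 * f10) / denom


def is_measure(f11, f10, f01, f00):
    return f11 / sqrt((f11 + f10) * (f11 + f01))


# Assumed vectors: A occurs in transactions 1 and 10, B only in transaction 5.
A = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
B = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

original = counts(A, B)
inverted = counts(invert(A), invert(B))

print("phi:", round(phi(*original), 4), round(phi(*inverted), 4))
print("IS: ", round(is_measure(*original), 4), round(is_measure(*inverted), 4))
```

Inversion swaps $f_{11}$ with $f_{00}$ and $f_{10}$ with $f_{01}$, so the ϕ-coefficient is unchanged at about -0.1667, while IS moves from 0 for the original pair to about 0.825 for the inverted pair.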

Table 5.12. Contingency tables for the pairs {p, q} and {r, s}.

               p       $\bar{p}$
q              880     50          930
$\bar{q}$      50      20          70
               930     70          1000

               r       $\bar{r}$
s              20      50          70
$\bar{s}$      50      880         930
               70      930         1000

Scaling Property

Table 5.13 shows two contingency tables for gender and the grades achieved by students enrolled in a particular course. These tables can be used to study the relationship between gender and performance in the course. The second contingency table has data from the same population but has twice as many males and three times as many females. The actual number of males or females can depend upon the samples available for study, but the relationship between gender and grade should not change just because of differences in sample sizes. Similarly, if the numbers of students with high and low grades are changed in a new study, a measure of association between

gender and grades should remain unchanged. Hence, we need a measure

that is invariant to scaling of rows or columns. The process of multiplying a

row or column of a contingency table by a constant value is called a row or

column scaling operation. A measure that is invariant to scaling does not

change its value after any row or column scaling operation.

Table 5.13. The grade-gender example.

(a) Sample data of size 100.

          Male    Female
High      30      20        50
Low       40      10        50
          70      30        100

(b) Sample data of size 230.

          Male    Female
High      60      60        120
Low       80      30        110
          140     90        230

Definition 5.7. (Scaling Invariance Property.)

Let T be a contingency table with frequency counts $[f_{11}; f_{10}; f_{01}; f_{00}]$. Let $T'$ be the transformed contingency table with scaled frequency counts $[k_1 k_3 f_{11}; k_2 k_3 f_{10}; k_1 k_4 f_{01}; k_2 k_4 f_{00}]$, where $k_1, k_2, k_3, k_4$ are positive constants used to scale the two rows and the two columns of T. An objective measure M is invariant under the row/column scaling operation if $M(T) = M(T')$.

Note that the use of the term ‘scaling’ here should not be confused with the scaling operation for continuous variables introduced in Chapter 2 on page 23, where all the values of a variable were being multiplied by a constant factor, instead of scaling a row or column of a contingency table.

Scaling of rows and columns in contingency tables occurs in multiple ways in different applications. For example, if we are measuring the effect of a particular medical procedure on two sets of subjects, healthy and diseased, the ratio of healthy and diseased subjects can widely vary across different studies involving different groups of participants. Further, the fraction of healthy and diseased subjects chosen for a controlled study can be quite different from the true fraction observed in the complete population. These differences can result in a row or column scaling in the contingency tables for different populations of subjects. In general, the frequencies of items in a contingency table closely depend on the sample of transactions used to generate the table. Any change in the sampling procedure may result in a row or column scaling transformation. A measure that is expected to be invariant to differences in the sampling procedure must not change with row or column scaling.

Of all the measures introduced in Table 5.9, only the odds ratio (α) is invariant to row and column scaling operations. For example, the value of odds ratio for both the tables in Table 5.13 is equal to 0.375. All other measures such as the ϕ-coefficient, κ, IS, interest factor, and collective strength (S) change their values when the rows and columns of the contingency table are rescaled. Indeed, the odds ratio is a preferred choice of measure in the medical domain, where it is important to find relationships that do not change with differences in the population sample chosen for a study.
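A short Python sketch makes the scaling behavior concrete. The function `scale` applies the transformation of Definition 5.7; its name, and the choice of treating High/Low as the first variable and Male/Female as the second, are assumptions made for this illustration. The counts come from Table 5.13(a).

```python
from math import sqrt


def scale(table, k1, k2, k3, k4):
    """Apply the scaling of Definition 5.7 to (f11, f10, f01, f00)."""
    f11, f10, f01, f00 = table
    return (k1 * k3 * f11, k2 * k3 * f10, k1 * k4 * f01, k2 * k4 * f00)


def odds_ratio(f11, f10, f01, f00):
    return (f11 * f00) / (f10 * f01)


def phi(f11, f10, f01, f00):
    denom = sqrt((f11 + f10) * (f11 + f01) * (f01 + f00) * (f10 + f00))
    return (f11 * f00 - f01 * f10) / denom


# Table 5.13(a): (High & Male, High & Female, Low & Male, Low & Female).
table_a = (30, 20, 40, 10)
# Twice as many males and three times as many females.
table_b = scale(table_a, k1=2, k2=3, k3=1, k4=1)

print("scaled table:", table_b)
print("odds ratio:", odds_ratio(*table_a), odds_ratio(*table_b))
print("phi:       ", round(phi(*table_a), 4), round(phi(*table_b), 4))
```

The scaled counts equal those of Table 5.13(b), and the odds ratio stays at 0.375, whereas the ϕ-coefficient changes from about -0.218 to about -0.233.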

Null Addition Property

Suppose we are interested in analyzing the relationship between a pair of words in a set of documents. If a collection of articles about ice fishing is added to the data set, should the association between the two words be affected? This process of adding unrelated data (in this case, documents) to a given data set is known as the null addition operation.

Definition 5.8. (Null Addition Property.)

An objective measure M is invariant under the null addition operation if it is not affected by increasing $f_{00}$, while all other frequencies in the contingency table stay the same.

For applications such as document analysis or market basket analysis, we would like to use a measure that remains invariant under the null addition operation. Otherwise, the relationship between words can be made to change simply by adding enough documents that contain neither word! Examples of measures that satisfy this property include cosine (IS) and

Jaccard measures, while those that violate this property include interest

factor, PS, odds ratio, and the .

To demonstrate the effect of null addition, consider the two contingency tables

and shown in Table 5.14 . Table has been obtained from by

adding 1000 extra transactions with both A and B absent. This operation only

affects the entry of Table , which has increased from 100 to 1100,

whereas all the other frequencies in the table , and remain the

same. Since IS is invariant to null addition, it gives a constant value of 0.875

to both the tables. However, the addition of 1000 extra transactions with

occurrences of 0–0’s changes the value of interest factor from 0.972 for

(denoting a slightly negative correlation) to 1.944 for (positive correlation).

Similarly, the value of odds ratio increases from 7 for to 77 for . Hence,

when the interest factor or odds ratio are used as the association measure,

the relationships between variables changes by the addition of null

transactions where both the variables are absent. In contrast, the IS measure

is invariant to null addition, since it considers two variables to be related only if

they frequently occur together. Indeed, the IS measure (cosine measure) is

widely used to measure similarity among documents, which is expected to

depend only on the joint occurrences (1’s) of words in documents, but not

their absences (0’s).

Table 5.14. An example demonstrating the effect of null addition.

(a) Table .

B

A 700 100 800

100 100 200

800 200 1000

(ξ)

ϕ-coefficient

T1 T2 T2 T1

f00 T2

(f11, f10 f01)

T1

T2

T1 T2

T1

B¯

A¯

(b) Table .

B

A 700 100 800

10 1100 1200

800 1200 2000
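The effect of null addition on Table 5.14 can be checked numerically. The sketch below is only an illustration: the function name `measures` and its argument order are assumptions, and the formulas are the standard 2x2 definitions of cosine (IS), interest factor, and odds ratio.

```python
from math import sqrt

def measures(f11, f10, f01, f00):
    """Return (cosine/IS, interest factor, odds ratio) for a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00
    f1p = f11 + f10          # row total for A
    fp1 = f11 + f01          # column total for B
    cosine = f11 / sqrt(f1p * fp1)           # does not involve f00
    interest = n * f11 / (f1p * fp1)         # depends on N, hence on f00
    odds_ratio = (f11 * f00) / (f10 * f01)   # depends on f00 directly
    return cosine, interest, odds_ratio

t1 = measures(700, 100, 100, 100)    # Table T1
t2 = measures(700, 100, 100, 1100)   # Table T2: 1000 null transactions added
print("T1:", t1)
print("T2:", t2)
```

Only the first component (cosine) is the same for both tables; the other two grow as $f_{00}$ grows.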

Table 5.15 provides a summary of properties for the measures defined in

Table 5.9 . Even though this list of properties is not exhaustive, it can serve

as a useful guide for selecting the right choice of measure for an application.

Ideally, if we know the specific requirements of a certain application, we can

ensure that the selected measure shows properties that adhere to those

requirements. For example, if we are dealing with asymmetric variables, we

would prefer to use a measure that is invariant to null addition but not to inversion.

On the other hand, if we require the measure to remain invariant to changes in

the sample size, we would like to use a measure that does not change with

scaling.

Table 5.15. Properties of symmetric measures.

Symbol   Measure               Inversion   Null Addition   Scaling
ϕ        ϕ-coefficient         Yes         No              No
α        odds ratio            Yes         No              Yes
κ        Cohen's κ             Yes         No              No
I        Interest              No          No              No
IS       Cosine                No          Yes             No
PS       Piatetsky-Shapiro's   Yes         No              No
S        Collective strength   Yes         No              No
ζ        Jaccard               No          Yes             No
h        All-confidence        No          Yes             No
s        Support               No          No              No

Asymmetric Interestingness Measures

Note that in the discussion so far, we have only considered measures that do
not change their value when the order of the variables is reversed. More
specifically, if M is a measure and A and B are two variables, then M(A, B) is
equal to M(B, A) if the order of the variables does not matter. Such measures
are called symmetric. On the other hand, measures that depend on the order
of variables ($M(A, B) \neq M(B, A)$) are called asymmetric measures. For
example, the interest factor is a symmetric measure because its value is
identical for the rules A→B and B→A. In contrast, confidence is an
asymmetric measure since the confidence for A→B and B→A may not be the
same. Note that the use of the term 'asymmetric' to describe a particular type
of measure of relationship (one in which the order of the variables is
important) should not be confused with the use of 'asymmetric' to describe a
binary variable for which only 1's are important. Asymmetric measures are
more suitable for analyzing association rules, since the items in a rule do have
a specific order. Even though we only considered symmetric measures to
discuss the different properties of association measures, the above discussion
is also relevant for the asymmetric measures. See Bibliographic Notes for
more information about different kinds of asymmetric measures and their
properties.
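The difference between a symmetric and an asymmetric measure can be seen with a few lines of code. This is a minimal sketch with made-up counts (`n`, `f_a`, `f_b`, `f_ab` are hypothetical and not taken from any table in the text).

```python
def interest(f_ab, f_a, f_b, n):
    """Interest factor: N * f(A,B) / (f(A) * f(B)); swapping A and B changes nothing."""
    return n * f_ab / (f_a * f_b)

def confidence(f_ab, f_antecedent):
    """Confidence of antecedent -> consequent: f(A,B) / f(antecedent)."""
    return f_ab / f_antecedent

# Hypothetical counts: 1000 transactions, A in 800, B in 200, both in 150.
n, f_a, f_b, f_ab = 1000, 800, 200, 150

interest_ab = interest(f_ab, f_a, f_b, n)   # rule A -> B
interest_ba = interest(f_ab, f_b, f_a, n)   # rule B -> A
conf_ab = confidence(f_ab, f_a)             # A -> B
conf_ba = confidence(f_ab, f_b)             # B -> A

print(interest_ab, interest_ba)   # equal: symmetric measure
print(conf_ab, conf_ba)           # differ: asymmetric measure
```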

5.7.2 Measures beyond Pairs of Binary Variables

The measures shown in Table 5.9 are defined for pairs of binary variables
(e.g., 2-itemsets or association rules). However, many of them, such as
support and all-confidence, are also applicable to larger-sized itemsets. Other
measures, such as interest factor, IS, PS, and Jaccard coefficient, can be
extended to more than two variables using the frequencies tabulated in a
multidimensional contingency table. An example of a three-dimensional
contingency table for a, b, and c is shown in Table 5.16. Each entry $f_{ijk}$
in this table represents the number of transactions that contain a particular
combination of items a, b, and c. For example, $f_{101}$ is the number of
transactions that contain a and c, but not b. On the other hand, a marginal
frequency such as $f_{1+1}$ is the number of transactions that contain a and c,
irrespective of whether b is present in the transaction.

Table 5.16. Example of a three-dimensional contingency table.

c        b        b̄
a        f111     f101     f1+1
ā        f011     f001     f0+1
         f+11     f+01     f++1

c̄        b        b̄
a        f110     f100     f1+0
ā        f010     f000     f0+0
         f+10     f+00     f++0

Given a k-itemset $\{i_1, i_2, \ldots, i_k\}$, the condition for statistical
independence can be stated as follows:

$$f_{i_1 i_2 \ldots i_k} = \frac{f_{i_1 + \ldots +} \times f_{+ i_2 \ldots +} \times \ldots \times f_{+ + \ldots i_k}}{N^{k-1}}. \qquad (5.11)$$

With this definition, we can extend objective measures such as interest factor
and PS, which are based on deviations from statistical independence, to more
than two variables:

$$I = \frac{N^{k-1} \times f_{i_1 i_2 \ldots i_k}}{f_{i_1 + \ldots +} \times f_{+ i_2 \ldots +} \times \ldots \times f_{+ + \ldots i_k}}$$

$$PS = \frac{f_{i_1 i_2 \ldots i_k}}{N} - \frac{f_{i_1 + \ldots +} \times f_{+ i_2 \ldots +} \times \ldots \times f_{+ + \ldots i_k}}{N^k}$$

Another approach is to define the objective measure as the maximum,
minimum, or average value for the associations between pairs of items in a
pattern. For example, given a k-itemset $X = \{i_1, i_2, \ldots, i_k\}$, we may
define the ϕ-coefficient for X as the average ϕ-coefficient between every pair
of items $(i_p, i_q)$ in X. However, because the measure considers only pairwise
associations, it may not capture all the underlying relationships within a
pattern. Also, care should be taken in using such alternate measures for more
than two variables, since they may not always show the anti-monotone
property in the same way as the support measure, making them unsuitable for
mining patterns using the Apriori principle.

Analysis of multidimensional contingency tables is more complicated because
of the presence of partial associations in the data. For example, some
associations may appear or disappear when conditioned upon the value of
certain variables. This problem is known as Simpson's paradox and is
described in Section 5.7.3. More sophisticated statistical techniques are
available to analyze such relationships, e.g., loglinear models, but these
techniques are beyond the scope of this book.
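The k-way interest factor and PS measures defined earlier in this section can be computed directly from a list of transactions. The following is a minimal sketch; the function name `interest_and_ps` and the eight-transaction data set are hypothetical and serve only to exercise the formulas.

```python
from math import prod

def interest_and_ps(transactions, itemset):
    """Interest factor and PS of a k-itemset, measured against independence (Eq. 5.11)."""
    n = len(transactions)
    items = set(itemset)
    k = len(items)
    # Joint frequency: transactions that contain every item of the itemset.
    joint = sum(1 for t in transactions if items <= t)
    # Marginal frequency of each individual item.
    singles = [sum(1 for t in transactions if i in t) for i in items]
    interest = (n ** (k - 1)) * joint / prod(singles)
    ps = joint / n - prod(singles) / n ** k
    return interest, ps

# Hypothetical data: each of a, b, c occurs in 5 of 8 transactions,
# and all three occur together in 2 transactions.
transactions = [
    {"a", "b", "c"}, {"a", "b", "c"},
    {"a", "b"}, {"a", "c"}, {"b", "c"},
    {"a"}, {"b"}, {"c"},
]
interest, ps = interest_and_ps(transactions, {"a", "b", "c"})
print(interest, ps)
```

An interest factor close to 1 and a PS close to 0 indicate that the joint frequency is close to what independence would predict.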

5.7.3 Simpson’s Paradox

It is important to exercise caution when interpreting the association between

variables because the observed relationship may be influenced by the

presence of other confounding factors, i.e., hidden variables that are not

included in the analysis. In some cases, the hidden variables may cause the

observed relationship between a pair of variables to disappear or reverse its

direction, a phenomenon that is known as Simpson’s paradox. We illustrate

the nature of this paradox with the following example.

Consider the relationship between the sale of high-definition televisions
(HDTV) and exercise machines, as shown in Table 5.17. The rule
{HDTV=Yes}→{Exercise machine=Yes} has a confidence of 99/180=55% and
the rule {HDTV=No}→{Exercise machine=Yes} has a confidence of
54/120=45%. Together, these rules suggest that customers who buy
high-definition televisions are more likely to buy exercise machines than those
who do not buy high-definition televisions.

Table 5.17. A two-way contingency table between the sale of high-definition
television and exercise machine.

              Buy Exercise Machine
Buy HDTV      Yes      No       Total
Yes           99       81       180
No            54       66       120
Total         153      147      300

However, a deeper analysis reveals that the sales of these items depend on
whether the customer is a college student or a working adult. Table 5.18
summarizes the relationship between the sale of HDTVs and exercise
machines among college students and working adults. Notice that the support
counts given in the table for college students and working adults sum up to
the frequencies shown in Table 5.17. Furthermore, there are more working
adults than college students who buy these items.

Table 5.18. Example of a three-way contingency table.

                               Buy Exercise Machine
Customer Group      Buy HDTV   Yes      No       Total
College Students    Yes        1        9        10
                    No         4        30       34
Working Adult       Yes        98       72       170
                    No         50       36       86

For college students:

c({HDTV=Yes}→{Exercise machine=Yes}) = 1/10 = 10%,
c({HDTV=No}→{Exercise machine=Yes}) = 4/34 = 11.8%,

while for working adults:

c({HDTV=Yes}→{Exercise machine=Yes}) = 98/170 = 57.7%,
c({HDTV=No}→{Exercise machine=Yes}) = 50/86 = 58.1%.

The rules suggest that, for each group, customers who do not buy high-

definition televisions are more likely to buy exercise machines, which

contradicts the previous conclusion when data from the two customer groups

are pooled together. Even if alternative measures such as correlation, odds

ratio, or interest are applied, we still find that the sale of HDTV and exercise

machine is positively related in the combined data but is negatively related in

the stratified data (see Exercise 21 on page 449). The reversal in the direction

of association is known as Simpson’s paradox.

The paradox can be explained in the following way. First, notice that most
customers who buy HDTVs are working adults. This is reflected in the high
confidence of the rule {HDTV=Yes}→{Working Adult} (170/180=94.4%).
Second, the high confidence of the rule
{Exercise machine=Yes}→{Working Adult} (148/153=96.7%)
suggests that most customers who buy exercise machines are also working
adults. Since working adults form the largest fraction of customers for both
HDTVs and exercise machines, they both look related and the rule
{HDTV=Yes}→{Exercise machine=Yes} turns out to be stronger in the
combined data than what it would have been if the data were stratified. Hence,
customer group acts as a hidden variable that affects both the fraction of
customers who buy HDTVs and those who buy exercise machines. If we factor
out the effect of the hidden variable by stratifying the data, we see that the
relationship between buying HDTVs and buying exercise machines is not
direct, but shows up as an indirect consequence of the effect of the hidden
variable.

Simpson's paradox can also be illustrated mathematically as follows.
Suppose

$$a/b < c/d \quad \text{and} \quad p/q < r/s,$$

where a/b and p/q may represent the confidence of the rule A→B in two
different strata, while c/d and r/s may represent the confidence of the rule
Ā→B in the two strata. When the data is pooled together, the confidence
values of the rules in the combined data are (a+p)/(b+q) and (c+r)/(d+s),
respectively. Simpson's paradox occurs when

$$\frac{a+p}{b+q} > \frac{c+r}{d+s},$$

thus leading to the wrong conclusion about the relationship between the
variables. The lesson here is that proper stratification is needed to avoid
generating spurious patterns resulting from Simpson's paradox. For example,
market basket data from a major supermarket chain should be stratified
according to store locations, while medical records from various patients
should be stratified according to confounding factors such as age and gender.
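The inequalities above can be verified with the counts from Tables 5.17 and 5.18. The sketch below uses exact fractions; the dictionary layout and the variable names are illustrative assumptions.

```python
from fractions import Fraction

# (buyers of exercise machines, group size) per stratum, from Table 5.18.
strata = {
    "college students": {"hdtv_yes": (1, 10), "hdtv_no": (4, 34)},
    "working adults": {"hdtv_yes": (98, 170), "hdtv_no": (50, 86)},
}

def pooled(key):
    """Confidence in the combined data: sum numerators and denominators over strata."""
    num = sum(s[key][0] for s in strata.values())
    den = sum(s[key][1] for s in strata.values())
    return Fraction(num, den)

# Within each stratum, HDTV buyers are less likely to buy an exercise machine.
within = [Fraction(*s["hdtv_yes"]) < Fraction(*s["hdtv_no"]) for s in strata.values()]

pooled_yes = pooled("hdtv_yes")   # (a + p) / (b + q) = 99/180
pooled_no = pooled("hdtv_no")     # (c + r) / (d + s) = 54/120

# Simpson's paradox: the direction reverses once the strata are pooled.
paradox = all(within) and pooled_yes > pooled_no
print(float(pooled_yes), float(pooled_no), paradox)
```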

5.8 Effect of Skewed Support Distribution

The performances of many association analysis algorithms are influenced by

properties of their input data. For example, the computational complexity of

the Apriori algorithm depends on properties such as the number of items in

the data, the average transaction width, and the support threshold used. This

section examines another important property that has significant influence on

the performance of association analysis algorithms as well as the quality of

extracted patterns. More specifically, we focus on data sets with skewed

support distributions, where most of the items have relatively low to moderate

frequencies, but a small number of them have very high frequencies.

Figure 5.29.

A transaction data set containing three items, p, q, and r, where p is a high

support item and q and r are low support items.

Figure 5.29 shows an illustrative example of a data set that has a skewed

support distribution of its items. While p has a high support of 83.3% in the

data, q and r are low-support items with a support of 16.7%. Despite their low

support, q and r always occur together in the limited number of transactions

in which they appear and hence are strongly related. A pattern mining algorithm

therefore should report {q, r} as interesting.

However, note that choosing the right support threshold for mining itemsets

such as {q, r} can be quite tricky. If we set the threshold too high (e.g., 20%),

then we may miss many interesting patterns involving low-support items such

as {q, r}. Conversely, setting the support threshold too low can be detrimental

to the pattern mining process for the following reasons. First, the

computational and memory requirements of existing association analysis

algorithms increase considerably with low support thresholds. Second, the

number of extracted patterns also increases substantially with low support

thresholds, which makes their analysis and interpretation difficult. In particular,

we may extract many spurious patterns that relate a high-frequency item such

as p to a low-frequency item such as q. Such patterns, which are called

cross-support patterns, are likely to be spurious because the association

between p and q is largely influenced by the frequent occurrence of p instead

of the joint occurrence of p and q together. Because the support of {p, q} is

quite close to the support of {q, r}, we may easily select {p, q} if we set the

support threshold low enough to include {q, r}.

An example of a real data set that exhibits a skewed support distribution is
shown in Figure 5.30. The data, taken from the PUMS (Public Use
Microdata Sample) census data, contains 49,046 records and 2113
asymmetric binary variables. We shall treat the asymmetric binary variables
as items and records as transactions. While more than 80% of the items have
support less than 1%, a handful of them have support greater than 90%. To
understand the effect of skewed support distribution on frequent itemset
mining, we divide the items into three groups, $G_1$, $G_2$, and $G_3$,
according to their support levels, as shown in Table 5.19. We can see that more
than 82% of items belong to $G_1$ and have a support less than 1%. In market
basket analysis, such low support items may correspond to expensive products
(such as jewelry) that are seldom bought by customers, but whose patterns are
still interesting to retailers. Patterns involving such low-support items,
though meaningful, can easily be rejected by a frequent pattern mining
algorithm with a high support threshold. On the other hand, setting a low
support threshold may result in the extraction of spurious patterns that relate
a high-frequency item in $G_3$ to a low-frequency item in $G_1$. For example,
at a support threshold equal to 0.05%, there are 18,847 frequent pairs
involving items from $G_1$ and $G_3$. Out of these, 93% of them are
cross-support patterns; i.e., the patterns contain items from both $G_1$ and
$G_3$.

Figure 5.30.

Support distribution of items in the census data set.

Table 5.19. Grouping the items in the census data set based on their
support values.

Group              G1       G2         G3
Support            <1%      1%–90%     >90%
Number of Items    1735     358        20

This example shows that a large number of weakly related cross-support

patterns can be generated when the support threshold is sufficiently low. Note

that finding interesting patterns in data sets with skewed support distributions

is not just a challenge for the support measure, but similar statements can be

made about many other objective measures discussed in the previous

sections. Before presenting a methodology for finding interesting patterns and

pruning spurious ones, we formally define the concept of cross-support

patterns.

Definition 5.9. (Cross-support Pattern.)

Let us define the support ratio, r(X), of an itemset
$X = \{i_1, i_2, \ldots, i_k\}$ as

$$r(X) = \frac{\min[s(i_1), s(i_2), \ldots, s(i_k)]}{\max[s(i_1), s(i_2), \ldots, s(i_k)]}. \qquad (5.12)$$

Given a user-specified threshold $h_c$, an itemset X is a cross-support
pattern if $r(X) < h_c$.

Example 5.4.

Suppose the support for milk is 70%, while the support for sugar is 10%
and caviar is 0.04%. Given $h_c = 0.01$, the frequent itemset {milk, sugar,
caviar} is a cross-support pattern because its support ratio is

$$r = \frac{\min[0.7, 0.1, 0.0004]}{\max[0.7, 0.1, 0.0004]} = \frac{0.0004}{0.7} \approx 0.00057 < 0.01.$$
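Definition 5.9 translates into a two-line check. The sketch below reproduces Example 5.4; the function names `support_ratio` and `is_cross_support` are illustrative choices, not names used in the text.

```python
def support_ratio(item_supports):
    """Support ratio r(X): smallest item support divided by the largest (Eq. 5.12)."""
    return min(item_supports) / max(item_supports)

def is_cross_support(item_supports, hc):
    """An itemset is a cross-support pattern if its support ratio is below hc."""
    return support_ratio(item_supports) < hc

# Example 5.4: milk 70%, sugar 10%, caviar 0.04%, threshold hc = 0.01.
r = support_ratio([0.7, 0.1, 0.0004])
flagged = is_cross_support([0.7, 0.1, 0.0004], hc=0.01)
print(r, flagged)
```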

Existing measures such as support and confidence may not be sufficient to
eliminate cross-support patterns. For example, if we assume $h_c = 0.3$ for the
data set presented in Figure 5.29, the itemsets {p, q}, {p, r}, and {p, q, r}
are cross-support patterns because their support ratios, which are equal to
0.2, are less than the threshold $h_c$. However, their supports are comparable
to that of {q, r}, making it difficult to eliminate cross-support patterns
without losing interesting ones using a support-based pruning strategy.
Confidence pruning also does not help because the confidence of the rules
extracted from cross-support patterns can be very high. For example, the
confidence for {q}→{p} is 80% even though {p, q} is a cross-support pattern.
The fact that the cross-support pattern can produce a high confidence rule
should not come as a surprise because one of its items (p) appears very
frequently in the data. Therefore, p is expected to appear in many of the
transactions that contain q. Meanwhile, the rule {q}→{r} also has high
confidence even though {q, r} is not a cross-support pattern. This example
demonstrates the difficulty of using the confidence measure to distinguish
between rules extracted from cross-support patterns and interesting patterns
involving strongly connected but low-support items.

Even though the rule {q}→{p} has very high confidence, notice that the rule
{p}→{q} has very low confidence because most of the transactions that contain
p do not contain q. In contrast, the rule {r}→{q}, which is derived from
{q, r}, has very high confidence. This observation suggests that cross-support
patterns can be detected by examining the lowest confidence rule that can be
extracted from a given itemset. An approach for finding the rule with the
lowest confidence given an itemset can be described as follows.

1. Recall the following anti-monotone property of confidence:

$$\mathrm{conf}(\{i_1, i_2\} \rightarrow \{i_3, i_4, \ldots, i_k\}) \le \mathrm{conf}(\{i_1, i_2, i_3\} \rightarrow \{i_4, i_5, \ldots, i_k\}).$$

This property suggests that confidence never increases as we shift
more items from the left- to the right-hand side of an association rule.
Because of this property, the lowest confidence rule extracted from a
frequent itemset contains only one item on its left-hand side. We
denote the set of all rules with only one item on their left-hand side as
$R_1$.

2. Given a frequent itemset $\{i_1, i_2, \ldots, i_k\}$, the rule

$$\{i_j\} \rightarrow \{i_1, i_2, \ldots, i_{j-1}, i_{j+1}, \ldots, i_k\}$$

has the lowest confidence in $R_1$ if
$s(i_j) = \max[s(i_1), s(i_2), \ldots, s(i_k)]$. This follows directly from the
definition of confidence as the ratio between the rule's support and the
support of the rule antecedent. Hence, the confidence of a rule will be lowest
when the support of the antecedent is highest.

3. Summarizing the previous points, the lowest confidence attainable from
a frequent itemset $\{i_1, i_2, \ldots, i_k\}$ is

$$\frac{s(\{i_1, i_2, \ldots, i_k\})}{\max[s(i_1), s(i_2), \ldots, s(i_k)]}.$$

This expression is also known as the h-confidence or all-confidence
measure. Because of the anti-monotone property of support, the
numerator of the h-confidence measure is bounded by the minimum
support of any item that appears in the frequent itemset. In other
words, the h-confidence of an itemset $X = \{i_1, i_2, \ldots, i_k\}$ must not
exceed the following expression:

$$\text{h-confidence}(X) \le \frac{\min[s(i_1), s(i_2), \ldots, s(i_k)]}{\max[s(i_1), s(i_2), \ldots, s(i_k)]}.$$

Note that the upper bound of h-confidence in the above equation is exactly
the same as the support ratio (r) given in Equation 5.12. Because the support
ratio for a cross-support pattern is always less than $h_c$, the h-confidence
of the pattern is also guaranteed to be less than $h_c$. Therefore,
cross-support patterns can be eliminated by ensuring that the h-confidence
values for the patterns exceed $h_c$. As a final note, the advantages of using
h-confidence go beyond eliminating cross-support patterns. The measure is also
anti-monotone, i.e.,

$$\text{h-confidence}(\{i_1, i_2, \ldots, i_k\}) \ge \text{h-confidence}(\{i_1, i_2, \ldots, i_{k+1}\}),$$

and thus can be incorporated directly into the mining algorithm. Furthermore,
h-confidence ensures that the items contained in an itemset are strongly
associated with each other. For example, suppose the h-confidence of an
itemset X is 80%. If one of the items in X is present in a transaction, there
is at least an 80% chance that the rest of the items in X also belong to the
same transaction. Such strongly associated patterns involving low-support
items are called hyperclique patterns.

Definition 5.10. (Hyperclique Pattern.)

An itemset X is a hyperclique pattern if $\text{h-confidence}(X) > h_c$,
where $h_c$ is a user-specified threshold.
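The h-confidence measure and the hyperclique test can be sketched as follows. The 30-transaction data set is a hypothetical reconstruction, chosen only so that it matches the supports quoted for Figure 5.29 (p at 83.3%, q and r at 16.7%, q and r always together, and an 80% confidence for {q}→{p}); the four empty transactions are padding, and the function names are assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def h_confidence(transactions, itemset):
    """Itemset support divided by the largest single-item support in the itemset."""
    max_item_support = max(support(transactions, [i]) for i in itemset)
    return support(transactions, itemset) / max_item_support

def is_hyperclique(transactions, itemset, hc):
    """Definition 5.10: h-confidence(X) > hc."""
    return h_confidence(transactions, itemset) > hc

# Hypothetical data consistent with the supports quoted for Figure 5.29.
transactions = (
    [{"p"}] * 21              # p alone
    + [{"p", "q", "r"}] * 4   # all three items
    + [{"q", "r"}]            # q and r without p
    + [set()] * 4             # padding so that s(p) = 25/30
)

h_pq = h_confidence(transactions, {"p", "q"})
h_qr = h_confidence(transactions, {"q", "r"})
print(h_pq, h_qr)
print(is_hyperclique(transactions, {"p", "q"}, hc=0.3),
      is_hyperclique(transactions, {"q", "r"}, hc=0.3))
```

With a threshold of 0.3, {q, r} is kept while the cross-support pattern {p, q} is pruned, even though the two itemsets have similar supports.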

5.9 Bibliographic Notes

The association rule mining task was first introduced by Agrawal et al. [324,

325] to discover interesting relationships among items in market basket

transactions. Since its inception, extensive research has been conducted to

address the various issues in association rule mining, from its fundamental

concepts to its implementation and applications. Figure 5.31 shows a

taxonomy of the various research directions in this area, which is generally

known as association analysis. As much of the research focuses on finding

patterns that appear significantly often in the data, the area is also known as

frequent pattern mining. A detailed review on some of the research topics in

this area can be found in [362] and in [319].

Figure 5.31.

An overview of the various research directions in association analysis.

Conceptual Issues

Research on the conceptual issues of association analysis has focused on

developing a theoretical formulation of association analysis and extending the

formulation to new types of patterns and going beyond asymmetric binary

attributes.

Following the pioneering work by Agrawal et al. [324, 325], there has been a

vast amount of research on developing a theoretical formulation for the

association analysis problem. In [357], Gunopoulos et al. showed the

connection between finding maximal frequent itemsets and the hypergraph

transversal problem. An upper bound on the complexity of the association

analysis task was also derived. Zaki et al. [454, 456] and Pasquier et al. [407]

have applied formal concept analysis to study the frequent itemset generation

problem. More importantly, such research has led to the development of a

class of patterns known as closed frequent itemsets [456]. Friedman et al.

[355] have studied the association analysis problem in the context of bump

hunting in multidimensional space. Specifically, they consider frequent

itemset generation as the task of finding high density regions in

multidimensional space. Formalizing association analysis in a statistical

learning framework is another active research direction [414, 435, 444] as it

can help address issues related to identifying statistically significant patterns

and dealing with uncertain data [320, 333, 343].

Over the years, the association rule mining formulation has been expanded to

encompass other rule-based patterns, such as profile association rules [321],

cyclic association rules [403], fuzzy association rules [379], exception rules

[431], negative association rules [336, 418], weighted association rules [338,

413], dependence rules [422], peculiar rules [462], inter-transaction

association rules [353, 440], and partial classification rules [327, 397].

Additionally, the concept of frequent itemset has been extended to other types

of patterns including closed itemsets [407, 456], maximal itemsets [330],

hyperclique patterns [449], support envelopes [428], emerging patterns [347],

contrast sets [329], high-utility itemsets [340, 390], approximate or

error-tolerant itemsets [358, 389, 451], and discriminative patterns [352, 401, 430].

Association analysis techniques have also been successfully applied to

sequential [326, 426], spatial [371], and graph-based [374, 380, 406, 450,

455] data.

Substantial research has been conducted to extend the original association

rule formulation to nominal [425], ordinal [392], interval [395], and ratio [356,

359, 425, 443, 461] attributes. One of the key issues is how to define the

support measure for these attributes. A methodology was proposed by

Steinbach et al. [429] to extend the traditional notion of support to more

general patterns and attribute types.

Implementation Issues

Research activities in this area revolve around (1) integrating the mining

capability into existing database technology, (2) developing efficient and

scalable mining algorithms, (3) handling user-specified or domain-specific

constraints, and (4) post-processing the extracted patterns.

There are several advantages to integrating association analysis into existing

database technology. First, it can make use of the indexing and query

processing capabilities of the database system. Second, it can also exploit the

DBMS support for scalability, check-pointing, and parallelization [415]. The

SETM algorithm developed by Houtsma et al. [370] was one of the earliest

algorithms to support association rule discovery via SQL queries. Since then,

numerous methods have been developed to provide capabilities for mining

association rules in database systems. For example, the DMQL [363] and M-

SQL [373] query languages extend the basic SQL with new operators for

mining association rules. The Mine Rule operator [394] is an expressive SQL

operator that can handle both clustered attributes and item hierarchies. Tsur et

al. [439] developed a generate-and-test approach called query flocks for

mining association rules. A distributed OLAP-based infrastructure was

developed by Chen et al. [341] for mining multilevel association rules.

Despite its popularity, the Apriori algorithm is computationally expensive

because it requires making multiple passes over the transaction database. Its

runtime and storage complexities were investigated by Dunkel and Soparkar

[349]. The FP-growth algorithm was developed by Han et al. in [364]. Other

algorithms for mining frequent itemsets include the DHP (dynamic hashing

and pruning) algorithm proposed by Park et al. [405] and the Partition

algorithm developed by Savasere et al. [417]. A sampling-based frequent

itemset generation algorithm was proposed by Toivonen [436]. The algorithm

requires only a single pass over the data, but it can produce more candidate

itemsets than necessary. The Dynamic Itemset Counting (DIC) algorithm

[337] makes only 1.5 passes over the data and generates fewer candidate

itemsets than the sampling-based algorithm. Other notable algorithms include

the tree-projection algorithm [317] and H-Mine [408]. Survey articles on

frequent itemset generation algorithms can be found in [322, 367]. A

repository of benchmark data sets and software implementation of association

rule mining algorithms is available at the Frequent Itemset Mining

Implementations (FIMI) repository (http://fimi.cs.helsinki.fi).

Parallel algorithms have been developed to scale up association rule mining

for handling big data [318, 360, 399, 420, 457]. A survey of such algorithms

can be found in [453]. Online and incremental association rule mining

algorithms have also been proposed by Hidber [365] and Cheung et al. [342].

More recently, new algorithms have been developed to speed up frequent

itemset mining by exploiting the processing power of GPUs [459] and the

MapReduce/Hadoop distributed computing framework [382, 384, 396]. For

example, an implementation of frequent itemset mining for the Hadoop

framework is available in the Apache Mahout software (http://mahout.apache.org).

Srikant et al. [427] have considered the problem of mining association rules in
the presence of Boolean constraints such as the following:

(Cookies ∧ Milk) ∨ (descendants(Cookies) ∧ ¬ancestors(Wheat Bread))

Given such a constraint, the algorithm looks for rules that contain both cookies
and milk, or rules that contain the descendant items of cookies but not
ancestor items of wheat bread. Singh et al. [424] and Ng et al. [400] have also
developed alternative techniques for constraint-based association rule
mining. Constraints can also be imposed on the support for different itemsets.
This problem was investigated by Wang et al. [442], Liu et al. [387], and
Seno et al. [419]. In addition, constraints arising from privacy concerns of
mining sensitive data have led to the development of privacy-preserving
frequent pattern mining techniques [334, 350, 441, 458].

One potential problem with association analysis is the large number of
patterns that can be generated by current algorithms. To overcome this
problem, methods to rank, summarize, and filter patterns have been
developed. Toivonen et al. [437] proposed the idea of eliminating redundant
rules using structural rule covers and grouping the remaining rules using
clustering. Liu et al. [388] applied the statistical chi-square test to prune
spurious patterns and summarized the remaining patterns using a subset of
the patterns called direction setting rules. The use of objective measures to
filter patterns has been investigated by many authors, including Brin et al.
[336], Bayardo and Agrawal [331], Aggarwal and Yu [323], and DuMouchel
and Pregibon [348]. The properties for many of these measures were analyzed
by Piatetsky-Shapiro [410], Kamber and Singhal [376], Hilderman and
Hamilton [366], and Tan et al. [433]. The grade-gender example used to
highlight the importance of the row and column scaling invariance property
was heavily influenced by the discussion given in [398] by Mosteller.

Meanwhile, the tea-coffee example illustrating the limitation of confidence was
motivated by an example given in [336] by Brin et al. Because of the limitation
of confidence, Brin et al. [336] proposed the idea of using interest factor
as a measure of interestingness. The all-confidence measure was proposed
by Omiecinski [402]. Xiong et al. [449] introduced the cross-support property
and showed that the all-confidence measure can be used to eliminate
cross-support patterns. A key difficulty in using alternative objective
measures besides support is their lack of a monotonicity property, which makes
it difficult to incorporate the measures directly into the mining algorithms.
Xiong et al. [447] have proposed an efficient method for mining correlations by
introducing an upper bound function to the ϕ-coefficient. Although the measure
is non-monotone, it has an upper bound expression that can be exploited for the
efficient mining of strongly correlated item pairs.

Fabris and Freitas [351] have proposed a method for discovering interesting
associations by detecting the occurrences of Simpson's paradox [423].
Megiddo and Srikant [393] described an approach for validating the extracted
patterns using hypothesis testing methods. A resampling-based technique
was also developed to avoid generating spurious patterns because of the
multiple comparison problem. Bolton et al. [335] have applied the
Benjamini-Hochberg [332] and Bonferroni correction methods to adjust the
p-values of discovered patterns in market basket data. Alternative methods for
handling the multiple comparison problem were suggested by Webb [445], Zhang
et al. [460], and Llinares-Lopez et al. [391].

Application of subjective measures to association analysis has been
investigated by many authors. Silberschatz and Tuzhilin [421] presented two
principles in which a rule can be considered interesting from a subjective point
of view. The concept of unexpected condition rules was introduced by Liu et
al. in [385]. Cooley et al. [344] analyzed the idea of combining soft belief sets
using the Dempster-Shafer theory and applied this approach to identify

contradictory and novel association patterns in web data. Alternative

approaches include using Bayesian networks [375] and neighborhood-based

information [346] to identify subjectively interesting patterns.

Visualization also helps the user to quickly grasp the underlying structure of

the discovered patterns. Many commercial data mining tools display the

complete set of rules (which satisfy both support and confidence threshold

criteria) as a two-dimensional plot, with each axis corresponding to the

antecedent or consequent itemsets of the rule. Hofmann et al. [368] proposed

using Mosaic plots and Double Decker plots to visualize association rules.

This approach can visualize not only a particular rule, but also the overall

contingency table between itemsets in the antecedent and consequent parts

of the rule. Nevertheless, this technique assumes that the rule consequent

consists of only a single attribute.

Application Issues

Association analysis has been applied to a variety of application domains

such as web mining [409, 432], document analysis [369], telecommunication

alarm diagnosis [377], network intrusion detection [328, 345, 381], and

bioinformatics [416, 446]. Applications of association and correlation pattern

analysis to Earth Science studies have been investigated in [411, 412, 434].

Trajectory pattern mining [339, 372, 438] is another application of spatio-

temporal association analysis to identify frequently traversed paths of moving

objects.

Association patterns have also been applied to other learning problems such

as classification [383, 386], regression [404], and clustering [361, 448, 452]. A

comparison between classification and association rule mining was made by

Freitas in his position paper [354]. The use of association patterns for

clustering has been studied by many authors including Han et al. [361],

Kosters et al. [378], Yang et al. [452], and Xiong et al. [448].

Bibliography

[317] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. A Tree Projection

Algorithm for Generation of Frequent Itemsets. Journal of Parallel and

Distributed Computing (Special Issue on High Performance Data Mining),

61(3):350–371, 2001.

[318] R. C. Agarwal and J. C. Shafer. Parallel Mining of Association Rules.

IEEE Transactions on Knowledge and Data Engineering, 8(6):962–969,

March 1998.

[319] C. Aggarwal and J. Han. Frequent Pattern Mining. Springer, 2014.

[320] C. C. Aggarwal, Y. Li, J. Wang, and J. Wang. Frequent pattern mining

with uncertain data. In Proceedings of the 15th ACM SIGKDD International

Conference on Knowledge Discovery and Data Mining, pages 29–38,

Paris, France, 2009.

[321] C. C. Aggarwal, Z. Sun, and P. S. Yu. Online Generation of Profile

Association Rules. In Proc. of the 4th Intl. Conf. on Knowledge Discovery

and Data Mining, pages 129–133, New York, NY, August 1996.

[322] C. C. Aggarwal and P. S. Yu. Mining Large Itemsets for Association

Rules. Data Engineering Bulletin, 21(1):23–31, March 1998.

[323] C. C. Aggarwal and P. S. Yu. Mining Associations with the Collective

Strength Approach. IEEE Trans. on Knowledge and Data Engineering,

13(6):863–873, January/February 2001.

[324] R. Agrawal, T. Imielinski, and A. Swami. Database mining: A

performance perspective. IEEE Transactions on Knowledge and Data

Engineering, 5:914–925, 1993.

[325] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules

between sets of items in large databases. In Proc. ACM SIGMOD Intl.

Conf. Management of Data, pages 207–216, Washington, DC, 1993.

[326] R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proc. of Intl.

Conf. on Data Engineering, pages 3–14, Taipei, Taiwan, 1995.

[327] K. Ali, S. Manganaris, and R. Srikant. Partial Classification using

Association Rules. In Proc. of the 3rd Intl. Conf. on Knowledge Discovery

and Data Mining, pages 115–118, Newport Beach, CA, August 1997.

[328] D. Barbará, J. Couto, S. Jajodia, and N. Wu. ADAM: A Testbed for

Exploring the Use of Data Mining in Intrusion Detection. SIGMOD Record,

30(4):15–24, 2001.

[329] S. D. Bay and M. Pazzani. Detecting Group Differences: Mining Contrast

Sets. Data Mining and Knowledge Discovery, 5(3):213–246, 2001.

[330] R. Bayardo. Efficiently Mining Long Patterns from Databases. In Proc. of

1998 ACM-SIGMOD Intl. Conf. on Management of Data, pages 85–93,

Seattle, WA, June 1998.

[331] R. Bayardo and R. Agrawal. Mining the Most Interesting Rules. In Proc.

of the 5th Intl. Conf. on Knowledge Discovery and Data Mining, pages 145–

153, San Diego, CA, August 1999.

[332] Y. Benjamini and Y. Hochberg. Controlling the False Discovery Rate: A

Practical and Powerful Approach to Multiple Testing. Journal Royal

Statistical Society B, 57(1):289–300, 1995.

[333] T. Bernecker, H. Kriegel, M. Renz, F. Verhein, and A. Züfle. Probabilistic

frequent itemset mining in uncertain databases. In Proceedings of the 15th

ACM SIGKDD International Conference on Knowledge Discovery and

Data Mining, pages 119–128, Paris, France, 2009.

[334] R. Bhaskar, S. Laxman, A. D. Smith, and A. Thakurta. Discovering

frequent patterns in sensitive data. In Proceedings of the 16th ACM

SIGKDD International Conference on Knowledge Discovery and Data

Mining, pages 503–512, Washington, DC, 2010.

[335] R. J. Bolton, D. J. Hand, and N. M. Adams. Determining Hit Rate in

Pattern Search. In Proc. of the ESF Exploratory Workshop on Pattern

Detection and Discovery in Data Mining, pages 36–48, London, UK,

September 2002.

[336] S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets:

Generalizing association rules to correlations. In Proc. ACM SIGMOD Intl.

Conf. Management of Data, pages 265–276, Tucson, AZ, 1997.

[337] S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic Itemset Counting

and Implication Rules for market basket data. In Proc. of 1997 ACM-

SIGMOD Intl. Conf. on Management of Data, pages 255–264, Tucson, AZ,

June 1997.

[338] C. H. Cai, A. Fu, C. H. Cheng, and W. W. Kwong. Mining Association

Rules with Weighted Items. In Proc. of IEEE Intl. Database Engineering

and Applications Symp., pages 68–77, Cardiff, Wales, 1998.

[339] H. Cao, N. Mamoulis, and D. W. Cheung. Mining Frequent Spatio-

Temporal Sequential Patterns. In Proceedings of the 5th IEEE International

Conference on Data Mining, pages 82–89, Houston, TX, 2005.

[340] R. Chan, Q. Yang, and Y. Shen. Mining High Utility Itemsets. In

Proceedings of the 3rd IEEE International Conference on Data Mining,

pages 19–26, Melbourne, FL, 2003.

[341] Q. Chen, U. Dayal, and M. Hsu. A Distributed OLAP infrastructure for E-

Commerce. In Proc. of the 4th IFCIS Intl. Conf. on Cooperative Information

Systems, pages 209–220, Edinburgh, Scotland, 1999.

[342] D. C. Cheung, S. D. Lee, and B. Kao. A General Incremental Technique

for Maintaining Discovered Association Rules. In Proc. of the 5th Intl. Conf.

on Database Systems for Advanced Applications, pages 185–194,

Melbourne, Australia, 1997.

[343] C. K. Chui, B. Kao, and E. Hung. Mining Frequent Itemsets from

Uncertain Data. In Proceedings of the 11th Pacific-Asia Conference on

Knowledge Discovery and Data Mining, pages 47–58, Nanjing, China,

2007.

[344] R. Cooley, P. N. Tan, and J. Srivastava. Discovery of Interesting Usage

Patterns from Web Data. In M. Spiliopoulou and B. Masand, editors,

Advances in Web Usage Analysis and User Profiling, volume 1836, pages

163–182. Lecture Notes in Computer Science, 2000.

[345] P. Dokas, L. Ertöz, V. Kumar, A. Lazarevic, J. Srivastava, and P. N. Tan.

Data Mining for Network Intrusion Detection. In Proc. NSF Workshop on

Next Generation Data Mining, Baltimore, MD, 2002.

[346] G. Dong and J. Li. Interestingness of discovered association rules in

terms of neighborhood-based unexpectedness. In Proc. of the 2nd Pacific-

Asia Conf. on Knowledge Discovery and Data Mining, pages 72–86,

Melbourne, Australia, April 1998.

[347] G. Dong and J. Li. Efficient Mining of Emerging Patterns: Discovering

Trends and Differences. In Proc. of the 5th Intl. Conf. on Knowledge

Discovery and Data Mining, pages 43–52, San Diego, CA, August 1999.

[348] W. DuMouchel and D. Pregibon. Empirical Bayes Screening for Multi-

Item Associations. In Proc. of the 7th Intl. Conf. on Knowledge Discovery

and Data Mining, pages 67–76, San Francisco, CA, August 2001.

[349] B. Dunkel and N. Soparkar. Data Organization and Access for Efficient

Data Mining. In Proc. of the 15th Intl. Conf. on Data Engineering, pages

522–529, Sydney, Australia, March 1999.

[350] A. V. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy

preserving mining of association rules. In Proceedings of the Eighth ACM

SIGKDD International Conference on Knowledge Discovery and Data

Mining, pages 217–228, Edmonton, Canada, 2002.

[351] C. C. Fabris and A. A. Freitas. Discovering surprising patterns by

detecting occurrences of Simpson’s paradox. In Proc. of the 19th SGES

Intl. Conf. on Knowledge-Based Systems and Applied Artificial

Intelligence, pages 148–160, Cambridge, UK, December 1999.

[352] G. Fang, G. Pandey, W. Wang, M. Gupta, M. Steinbach, and V. Kumar.

Mining Low-Support Discriminative Patterns from Dense and High-

Dimensional Data. IEEE Trans. Knowl. Data Eng., 24(2):279–294, 2012.

[353] L. Feng, H. J. Lu, J. X. Yu, and J. Han. Mining inter-transaction

associations with templates. In Proc. of the 8th Intl. Conf. on Information

and Knowledge Management, pages 225–233, Kansas City, Missouri, Nov

1999.

[354] A. A. Freitas. Understanding the crucial differences between

classification and discovery of association rules—a position paper.

SIGKDD Explorations, 2(1):65–69, 2000.

[355] J. H. Friedman and N. I. Fisher. Bump hunting in high-dimensional data.

Statistics and Computing, 9(2):123–143, April 1999.

[356] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Mining

Optimized Association Rules for Numeric Attributes. In Proc. of the 15th

Symp. on Principles of Database Systems, pages 182–191, Montreal,

Canada, June 1996.

[357] D. Gunopulos, R. Khardon, H. Mannila, and H. Toivonen. Data Mining,

Hypergraph Transversals, and Machine Learning. In Proc. of the 16th

Symp. on Principles of Database Systems, pages 209–216, Tucson, AZ,

May 1997.

[358] R. Gupta, G. Fang, B. Field, M. Steinbach, and V. Kumar. Quantitative

evaluation of approximate frequent pattern mining algorithms. In

Proceedings of the 14th ACM SIGKDD International Conference on

Knowledge Discovery and Data Mining, pages 301–309, Las Vegas, NV,

2008.

[359] E. Han, G. Karypis, and V. Kumar. Min-apriori: An algorithm for finding

association rules in data with continuous attributes. Department of

Computer Science and Engineering, University of Minnesota, Tech. Rep,

1997.

[360] E.-H. Han, G. Karypis, and V. Kumar. Scalable Parallel Data Mining for

Association Rules. In Proc. of 1997 ACM-SIGMOD Intl. Conf. on

Management of Data, pages 277–288, Tucson, AZ, May 1997.

[361] E.-H. Han, G. Karypis, V. Kumar, and B. Mobasher. Clustering Based on

Association Rule Hypergraphs. In Proc. of the 1997 ACM SIGMOD

Workshop on Research Issues in Data Mining and Knowledge Discovery,

Tucson, AZ, 1997.

[362] J. Han, H. Cheng, D. Xin, and X. Yan. Frequent pattern mining: current

status and future directions. Data Mining and Knowledge Discovery,

15(1):55–86, 2007.

[363] J. Han, Y. Fu, K. Koperski, W. Wang, and O. R. Zaïane. DMQL: A data

mining query language for relational databases. In Proc. of the 1996 ACM

SIGMOD Workshop on Research Issues in Data Mining and Knowledge

Discovery, Montreal, Canada, June 1996.

[364] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate

Generation. In Proc. ACM-SIGMOD Int. Conf. on Management of Data

(SIGMOD’00), pages 1–12, Dallas, TX, May 2000.

[365] C. Hidber. Online Association Rule Mining. In Proc. of 1999 ACM-

SIGMOD Intl. Conf. on Management of Data, pages 145–156, Philadelphia,

PA, 1999.

[366] R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and

Measures of Interest. Kluwer Academic Publishers, 2001.

[367] J. Hipp, U. Guntzer, and G. Nakhaeizadeh. Algorithms for Association

Rule Mining – A General Survey. SigKDD Explorations, 2(1):58–64, June

2000.

[368] H. Hofmann, A. P. J. M. Siebes, and A. F. X. Wilhelm. Visualizing

Association Rules with Interactive Mosaic Plots. In Proc. of the 6th Intl.

Conf. on Knowledge Discovery and Data Mining, pages 227–235, Boston,

MA, August 2000.

[369] J. D. Holt and S. M. Chung. Efficient Mining of Association Rules in Text

Databases. In Proc. of the 8th Intl. Conf. on Information and Knowledge

Management, pages 234–242, Kansas City, Missouri, 1999.

[370] M. Houtsma and A. Swami. Set-oriented Mining for Association Rules in

Relational Databases. In Proc. of the 11th Intl. Conf. on Data Engineering,

pages 25–33, Taipei, Taiwan, 1995.

[371] Y. Huang, S. Shekhar, and H. Xiong. Discovering Co-location Patterns

from Spatial Datasets: A General Approach. IEEE Trans. on Knowledge

and Data Engineering, 16(12):1472–1485, December 2004.

[372] S. Hwang, Y. Liu, J. Chiu, and E. Lim. Mining Mobile Group Patterns: A

Trajectory-Based Approach. In Proceedings of the 9th Pacific-Asia

Conference on Knowledge Discovery and Data Mining, pages 713–718,

Hanoi, Vietnam, 2005.

[373] T. Imielinski, A. Virmani, and A. Abdulghani. DataMine: Application

Programming Interface and Query Language for Database Mining. In Proc.

of the 2nd Intl. Conf. on Knowledge Discovery and Data Mining, pages

256–262, Portland, Oregon, 1996.

[374] A. Inokuchi, T. Washio, and H. Motoda. An Apriori-based Algorithm for

Mining Frequent Substructures from Graph Data. In Proc. of the 4th

European Conf. of Principles and Practice of Knowledge Discovery in

Databases, pages 13–23, Lyon, France, 2000.

[375] S. Jaroszewicz and D. Simovici. Interestingness of Frequent Itemsets

Using Bayesian Networks as Background Knowledge. In Proc. of the 10th

Intl. Conf. on Knowledge Discovery and Data Mining, pages 178–186,

Seattle, WA, August 2004.

[376] M. Kamber and R. Shinghal. Evaluating the Interestingness of

Characteristic Rules. In Proc. of the 2nd Intl. Conf. on Knowledge

Discovery and Data Mining, pages 263–266, Portland, Oregon, 1996.

[377] M. Klemettinen. A Knowledge Discovery Methodology for

Telecommunication Network Alarm Databases. PhD thesis, University of

Helsinki, 1999.

[378] W. A. Kosters, E. Marchiori, and A. Oerlemans. Mining Clusters with

Association Rules. In The 3rd Symp. on Intelligent Data Analysis (IDA99),

pages 39–50, Amsterdam, August 1999.

[379] C. M. Kuok, A. Fu, and M. H. Wong. Mining Fuzzy Association Rules in

Databases. ACM SIGMOD Record, 27(1):41–46, March 1998.

[380] M. Kuramochi and G. Karypis. Frequent Subgraph Discovery. In Proc. of

the 2001 IEEE Intl. Conf. on Data Mining, pages 313–320, San Jose, CA,

November 2001.

[381] W. Lee, S. J. Stolfo, and K. W. Mok. Adaptive Intrusion Detection: A

Data Mining Approach. Artificial Intelligence Review, 14(6):533–567, 2000.

[382] N. Li, L. Zeng, Q. He, and Z. Shi. Parallel Implementation of Apriori

Algorithm Based on MapReduce. In Proceedings of the 13th ACIS

International Conference on Software Engineering, Artificial Intelligence,

Networking and Parallel/Distributed Computing, pages 236–241, Kyoto,

Japan, 2012.

[383] W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification

Based on Multiple Class-association Rules. In Proc. of the 2001 IEEE Intl.

Conf. on Data Mining, pages 369–376, San Jose, CA, 2001.

[384] M. Lin, P. Lee, and S. Hsueh. Apriori-based frequent itemset mining

algorithms on MapReduce. In Proceedings of the 6th International

Conference on Ubiquitous Information Management and Communication,

pages 26–30, Kuala Lumpur, Malaysia, 2012.

[385] B. Liu, W. Hsu, and S. Chen. Using General Impressions to Analyze

Discovered Classification Rules. In Proc. of the 3rd Intl. Conf. on

Knowledge Discovery and Data Mining, pages 31–36, Newport Beach, CA,

August 1997.

[386] B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association

Rule Mining. In Proc. of the 4th Intl. Conf. on Knowledge Discovery and

Data Mining, pages 80–86, New York, NY, August 1998.

[387] B. Liu, W. Hsu, and Y. Ma. Mining association rules with multiple

minimum supports. In Proc. of the 5th Intl. Conf. on Knowledge Discovery

and Data Mining, pages 125–134, San Diego, CA, August 1999.

[388] B. Liu, W. Hsu, and Y. Ma. Pruning and Summarizing the Discovered

Associations. In Proc. of the 5th Intl. Conf. on Knowledge Discovery and

Data Mining, pages 125–134, San Diego, CA, August 1999.

[389] J. Liu, S. Paulsen, W. Wang, A. B. Nobel, and J. Prins. Mining

Approximate Frequent Itemsets from Noisy Data. In Proceedings of the 5th

IEEE International Conference on Data Mining, pages 721–724, Houston,

TX, 2005.

[390] Y. Liu, W.-K. Liao, and A. Choudhary. A two-phase algorithm for fast

discovery of high utility itemsets. In Proceedings of the 9th Pacific-Asia

Conference on Knowledge Discovery and Data Mining, pages 689–695,

Hanoi, Vietnam, 2005.

[391] F. Llinares-López, M. Sugiyama, L. Papaxanthos, and K. M. Borgwardt.

Fast and Memory-Efficient Significant Pattern Mining via Permutation

Testing. In Proceedings of the 21st ACM SIGKDD International

Conference on Knowledge Discovery and Data Mining, pages 725–734,

Sydney, Australia, 2015.

[392] A. Marcus, J. I. Maletic, and K.-I. Lin. Ordinal association rules for error

identification in data sets. In Proc. of the 10th Intl. Conf. on Information and

Knowledge Management, pages 589–591, Atlanta, GA, October 2001.

[393] N. Megiddo and R. Srikant. Discovering Predictive Association Rules. In

Proc. of the 4th Intl. Conf. on Knowledge Discovery and Data Mining,

pages 274–278, New York, August 1998.

[394] R. Meo, G. Psaila, and S. Ceri. A New SQL-like Operator for Mining

Association Rules. In Proc. of the 22nd VLDB Conf., pages 122–133,

Bombay, India, 1996.

[395] R. J. Miller and Y. Yang. Association Rules over Interval Data. In Proc. of

1997 ACM-SIGMOD Intl. Conf. on Management of Data, pages 452–461,

Tucson, AZ, May 1997.

[396] S. Moens, E. Aksehirli, and B. Goethals. Frequent Itemset Mining for Big

Data. In Proceedings of the 2013 IEEE International Conference on Big

Data, pages 111–118, Santa Clara, CA, 2013.

[397] Y. Morimoto, T. Fukuda, H. Matsuzawa, T. Tokuyama, and K. Yoda.

Algorithms for mining association rules for binary segmentations of huge

categorical databases. In Proc. of the 24th VLDB Conf., pages 380–391,

New York, August 1998.

[398] F. Mosteller. Association and Estimation in Contingency Tables. JASA,

63:1–28, 1968.

[399] A. Mueller. Fast sequential and parallel algorithms for association rule

mining: A comparison. Technical Report CS-TR-3515, University of

Maryland, August 1995.

[400] R. T. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory Mining

and Pruning Optimizations of Constrained Association Rules. In Proc. of

1998 ACM-SIGMOD Intl. Conf. on Management of Data, pages 13–24,

Seattle, WA, June 1998.

[401] P. K. Novak, N. Lavrač, and G. I. Webb. Supervised descriptive rule

discovery: A unifying survey of contrast set, emerging pattern and

subgroup mining. Journal of Machine Learning Research, 10(Feb):377–

403, 2009.

[402] E. Omiecinski. Alternative Interest Measures for Mining Associations in

Databases. IEEE Trans. on Knowledge and Data Engineering, 15(1):57–

69, January/February 2003.

[403] B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic Association

Rules. In Proc. of the 14th Intl. Conf. on Data Eng., pages 412–421,

Orlando, FL, February 1998.

[404] A. Ozgur, P. N. Tan, and V. Kumar. RBA: An Integrated Framework for

Regression based on Association Rules. In Proc. of the SIAM Intl. Conf. on

Data Mining, pages 210–221, Orlando, FL, April 2004.

[405] J. S. Park, M.-S. Chen, and P. S. Yu. An effective hash-based algorithm

for mining association rules. SIGMOD Record, 25(2):175–186, 1995.

[406] S. Parthasarathy and M. Coatney. Efficient Discovery of Common

Substructures in Macromolecules. In Proc. of the 2002 IEEE Intl. Conf. on

Data Mining, pages 362–369, Maebashi City, Japan, December 2002.

[407] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent

closed itemsets for association rules. In Proc. of the 7th Intl. Conf. on

Database Theory (ICDT’99), pages 398–416, Jerusalem, Israel, January

1999.

[408] J. Pei, J. Han, H. J. Lu, S. Nishio, and S. Tang. H-Mine: Hyper-Structure

Mining of Frequent Patterns in Large Databases. In Proc. of the 2001 IEEE

Intl. Conf. on Data Mining, pages 441–448, San Jose, CA, November

2001.

[409] J. Pei, J. Han, B. Mortazavi-Asl, and H. Zhu. Mining Access Patterns

Efficiently from Web Logs. In Proc. of the 4th Pacific-Asia Conf. on

Knowledge Discovery and Data Mining, pages 396–407, Kyoto, Japan,

April 2000.

[410] G. Piatetsky-Shapiro. Discovery, Analysis and Presentation of Strong

Rules. In G. Piatetsky-Shapiro and W. Frawley, editors, Knowledge

Discovery in Databases, pages 229–248. MIT Press, Cambridge, MA,

1991.

[411] C. Potter, S. Klooster, M. Steinbach, P. N. Tan, V. Kumar, S. Shekhar,

and C. Carvalho. Understanding Global Teleconnections of Climate to

Regional Model Estimates of Amazon Ecosystem Carbon Fluxes. Global

Change Biology, 10(5):693–703, 2004.

[412] C. Potter, S. Klooster, M. Steinbach, P. N. Tan, V. Kumar, S. Shekhar, R.

Myneni, and R. Nemani. Global Teleconnections of Ocean Climate to

Terrestrial Carbon Flux. Journal of Geophysical Research, 108(D17),

2003.

[413] G. D. Ramkumar, S. Ranka, and S. Tsur. Weighted association rules:

Model and algorithm. In Proc. ACM SIGKDD, 1998.

[414] M. Riondato and F. Vandin. Finding the True Frequent Itemsets. In

Proceedings of the 2014 SIAM International Conference on Data Mining,

pages 497–505, Philadelphia, PA, 2014.

[415] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating Mining with

Relational Database Systems: Alternatives and Implications. In Proc. of

1998 ACM-SIGMOD Intl. Conf. on Management of Data, pages 343–354,

Seattle, WA, 1998.

[416] K. Satou, G. Shibayama, T. Ono, Y. Yamamura, E. Furuichi, S. Kuhara,

and T. Takagi. Finding Association Rules on Heterogeneous Genome

Data. In Proc. of the Pacific Symp. on Biocomputing, pages 397–408,

Hawaii, January 1997.

[417] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for

mining association rules in large databases. In Proc. of the 21st Int. Conf.

on Very Large Databases (VLDB’95), pages 432–444, Zurich, Switzerland,

September 1995.

[418] A. Savasere, E. Omiecinski, and S. Navathe. Mining for Strong Negative

Associations in a Large Database of Customer Transactions. In Proc. of

the 14th Intl. Conf. on Data Engineering, pages 494–502, Orlando, Florida,

February 1998.

[419] M. Seno and G. Karypis. LPMiner: An Algorithm for Finding Frequent

Itemsets Using Length-Decreasing Support Constraint. In Proc. of the

2001 IEEE Intl. Conf. on Data Mining, pages 505–512, San Jose, CA,

November 2001.

[420] T. Shintani and M. Kitsuregawa. Hash based parallel algorithms for

mining association rules. In Proc of the 4th Intl. Conf. on Parallel and

Distributed Info. Systems, pages 19–30, Miami Beach, FL, December

1996.

[421] A. Silberschatz and A. Tuzhilin. What makes patterns interesting in

knowledge discovery systems. IEEE Trans. on Knowledge and Data

Engineering, 8(6):970–974, 1996.

[422] C. Silverstein, S. Brin, and R. Motwani. Beyond market baskets:

Generalizing association rules to dependence rules. Data Mining and

Knowledge Discovery, 2(1):39–68, 1998.

[423] E. H. Simpson. The Interpretation of Interaction in Contingency Tables.

Journal of the Royal Statistical Society, B(13):238–241, 1951.

[424] L. Singh, B. Chen, R. Haight, and P. Scheuermann. An Algorithm for

Constrained Association Rule Mining in Semi-structured Data. In Proc. of

the 3rd Pacific-Asia Conf. on Knowledge Discovery and Data Mining,

pages 148–158, Beijing, China, April 1999.

[425] R. Srikant and R. Agrawal. Mining Quantitative Association Rules in

Large Relational Tables. In Proc. of 1996 ACM-SIGMOD Intl. Conf. on

Management of Data, pages 1–12, Montreal, Canada, 1996.

[426] R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations

and Performance Improvements. In Proc. of the 5th Intl Conf. on Extending

Database Technology (EDBT’96), pages 18–32, Avignon, France, 1996.

[427] R. Srikant, Q. Vu, and R. Agrawal. Mining Association Rules with Item

Constraints. In Proc. of the 3rd Intl. Conf. on Knowledge Discovery and

Data Mining, pages 67–73, Newport Beach, CA, August 1997.

[428] M. Steinbach, P. N. Tan, and V. Kumar. Support Envelopes: A Technique

for Exploring the Structure of Association Patterns. In Proc. of the 10th Intl.

Conf. on Knowledge Discovery and Data Mining, pages 296–305, Seattle,

WA, August 2004.

[429] M. Steinbach, P. N. Tan, H. Xiong, and V. Kumar. Extending the Notion

of Support. In Proc. of the 10th Intl. Conf. on Knowledge Discovery and

Data Mining, pages 689–694, Seattle, WA, August 2004.

[430] M. Steinbach, H. Yu, G. Fang, and V. Kumar. Using constraints to

generate and explore higher order discriminative patterns. Advances in

Knowledge Discovery and Data Mining, pages 338–350, 2011.

[431] E. Suzuki. Autonomous Discovery of Reliable Exception Rules. In Proc.

of the 3rd Intl. Conf. on Knowledge Discovery and Data Mining, pages 259–

262, Newport Beach, CA, August 1997.

[432] P. N. Tan and V. Kumar. Mining Association Patterns in Web Usage

Data. In Proc. of the Intl. Conf. on Advances in Infrastructure for e-

Business, e-Education, e-Science and e-Medicine on the Internet,

L’Aquila, Italy, January 2002.

[433] P. N. Tan, V. Kumar, and J. Srivastava. Selecting the Right

Interestingness Measure for Association Patterns. In Proc. of the 8th Intl.

Conf. on Knowledge Discovery and Data Mining, pages 32–41, Edmonton,

Canada, July 2002.

[434] P. N. Tan, M. Steinbach, V. Kumar, S. Klooster, C. Potter, and A.

Torregrosa. Finding Spatio-Temporal Patterns in Earth Science Data. In

KDD 2001 Workshop on Temporal Data Mining, San Francisco, CA, 2001.

[435] N. Tatti. Probably the best itemsets. In Proceedings of the 16th ACM

SIGKDD International Conference on Knowledge Discovery and Data

Mining, pages 293–302, Washington, DC, 2010.

[436] H. Toivonen. Sampling Large Databases for Association Rules. In Proc.

of the 22nd VLDB Conf., pages 134–145, Bombay, India, 1996.

[437] H. Toivonen, M. Klemettinen, P. Ronkainen, K. Hatonen, and H. Mannila.

Pruning and Grouping Discovered Association Rules. In ECML-95

Workshop on Statistics, Machine Learning and Knowledge Discovery in

Databases, pages 47–52, Heraklion, Greece, April 1995.

[438] I. Tsoukatos and D. Gunopulos. Efficient mining of spatiotemporal

patterns. In Proceedings of the 7th International Symposium on Advances

in Spatial and Temporal Databases, pages 425–442, 2001.

[439] S. Tsur, J. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov, and

A. Rosenthal. Query Flocks: A Generalization of Association Rule Mining.

In Proc. of 1998 ACM-SIGMOD Intl. Conf. on Management of Data, pages

1–12, Seattle, WA, June 1998.

[440] A. Tung, H. J. Lu, J. Han, and L. Feng. Breaking the Barrier of

Transactions: Mining Inter-Transaction Association Rules. In Proc. of the

5th Intl. Conf. on Knowledge Discovery and Data Mining, pages 297–301,

San Diego, CA, August 1999.

[441] J. Vaidya and C. Clifton. Privacy preserving association rule mining in

vertically partitioned data. In Proceedings of the Eighth ACM SIGKDD

International Conference on Knowledge Discovery and Data Mining, pages

639–644, Edmonton, Canada, 2002.

[442] K. Wang, Y. He, and J. Han. Mining Frequent Itemsets Using Support

Constraints. In Proc. of the 26th VLDB Conf., pages 43–52, Cairo, Egypt,

September 2000.

[443] K. Wang, S. H. Tay, and B. Liu. Interestingness-Based Interval Merger

for Numeric Association Rules. In Proc. of the 4th Intl. Conf. on Knowledge

Discovery and Data Mining, pages 121–128, New York, NY, August 1998.

[444] L. Wang, R. Cheng, S. D. Lee, and D. W. Cheung. Accelerating

probabilistic frequent itemset mining: a model-based approach. In

Proceedings of the 19th ACM Conference on Information and Knowledge

Management, pages 429–438, 2010.

[445] G. I. Webb. Preliminary investigations into statistically valid exploratory

rule discovery. In Proc. of the Australasian Data Mining Workshop

(AusDM03), Canberra, Australia, December 2003.

[446] H. Xiong, X. He, C. Ding, Y. Zhang, V. Kumar, and S. R. Holbrook.

Identification of Functional Modules in Protein Complexes via Hyperclique

Pattern Discovery. In Proc. of the Pacific Symposium on Biocomputing,

(PSB 2005), Maui, January 2005.

[447] H. Xiong, S. Shekhar, P. N. Tan, and V. Kumar. Exploiting a Support-

based Upper Bound of Pearson’s Correlation Coefficient for Efficiently

Identifying Strongly Correlated Pairs. In Proc. of the 10th Intl. Conf. on

Knowledge Discovery and Data Mining, pages 334–343, Seattle, WA,

August 2004.

[448] H. Xiong, M. Steinbach, P. N. Tan, and V. Kumar. HICAP: Hierarchial

Clustering with Pattern Preservation. In Proc. of the SIAM Intl. Conf. on

Data Mining, pages 279–290, Orlando, FL, April 2004.

[449] H. Xiong, P. N. Tan, and V. Kumar. Mining Strong Affinity Association

Patterns in Data Sets with Skewed Support Distribution. In Proc. of the

2003 IEEE Intl. Conf. on Data Mining, pages 387–394, Melbourne, FL,

2003.

[450] X. Yan and J. Han. gSpan: Graph-based Substructure Pattern Mining. In

Proc. of the 2002 IEEE Intl. Conf. on Data Mining, pages 721–724,

Maebashi City, Japan, December 2002.

[451] C. Yang, U. M. Fayyad, and P. S. Bradley. Efficient discovery of error-

tolerant frequent itemsets in high dimensions. In Proceedings of the

seventh ACM SIGKDD international conference on Knowledge discovery

and data mining, pages 194–203, San Francisco, CA, 2001.

[452] C. Yang, U. M. Fayyad, and P. S. Bradley. Efficient discovery of error-

tolerant frequent itemsets in high dimensions. In Proc. of the 7th Intl. Conf.

on Knowledge Discovery and Data Mining, pages 194–203, San

Francisco, CA, August 2001.

[453] M. J. Zaki. Parallel and Distributed Association Mining: A Survey. IEEE

Concurrency, special issue on Parallel Mechanisms for Data Mining,

7(4):14–25, December 1999.

[454] M. J. Zaki. Generating Non-Redundant Association Rules. In Proc. of

the 6th Intl. Conf. on Knowledge Discovery and Data Mining, pages 34–43,

Boston, MA, August 2000.

[455] M. J. Zaki. Efficiently mining frequent trees in a forest. In Proc. of the 8th

Intl. Conf. on Knowledge Discovery and Data Mining, pages 71–80,

Edmonton, Canada, July 2002.

[456] M. J. Zaki and M. Ogihara. Theoretical foundations of association rules.

In Proc. of the 1998 ACM SIGMOD Workshop on Research Issues in Data

Mining and Knowledge Discovery, Seattle, WA, June 1998.

[457] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New Algorithms for

Fast Discovery of Association Rules. In Proc. of the 3rd Intl. Conf. on

Knowledge Discovery and Data Mining, pages 283–286, Newport Beach,

CA, August 1997.

[458] C. Zeng, J. F. Naughton, and J. Cai. On differentially private frequent

itemset mining. Proceedings of the VLDB Endowment, 6(1):25–36, 2012.

[459] F. Zhang, Y. Zhang, and J. Bakos. GPApriori: GPU-Accelerated

Frequent Itemset Mining. In Proceedings of the 2011 IEEE International

Conference on Cluster Computing, pages 590–594, Austin, TX, 2011.

[460] H. Zhang, B. Padmanabhan, and A. Tuzhilin. On the Discovery of

Significant Statistical Quantitative Rules. In Proc. of the 10th Intl. Conf. on

Knowledge Discovery and Data Mining, pages 374–383, Seattle, WA,

August 2004.

[461] Z. Zhang, Y. Lu, and B. Zhang. An Effective Partitioning-Combining

Algorithm for Discovering Quantitative Association Rules. In Proc. of the

1st Pacific-Asia Conf. on Knowledge Discovery and Data Mining,

Singapore, 1997.

[462] N. Zhong, Y. Y. Yao, and S. Ohsuga. Peculiarity Oriented Multi-database

Mining. In Proc. of the 3rd European Conf. of Principles and Practice of

Knowledge Discovery in Databases, pages 136–146, Prague, Czech

Republic, 1999.

5.10 Exercises

1. For each of the following questions, provide an example of an association

rule from the market basket domain that satisfies the following conditions.

Also, describe whether such rules are subjectively interesting.

a. A rule that has high support and high confidence.

b. A rule that has reasonably high support but low confidence.

c. A rule that has low support and low confidence.

d. A rule that has low support and high confidence.

2. Consider the data set shown in Table 5.20.

Table 5.20. Example of market basket transactions.

Customer ID   Transaction ID   Items Bought
1             0001             {a, d, e}
1             0024             {a, b, c, e}
2             0012             {a, b, d, e}
2             0031             {a, c, d, e}
3             0015             {b, c, e}
3             0022             {b, d, e}
4             0029             {c, d}
4             0040             {a, b, c}
5             0033             {a, d, e}
5             0038             {a, b, e}

a. Compute the support for itemsets {e}, {b, d}, and {b, d, e} by treating each

transaction ID as a market basket.

b. Use the results in part (a) to compute the confidence for the association rules $\{b, d\} \rightarrow \{e\}$ and $\{e\} \rightarrow \{b, d\}$. Is confidence a symmetric measure?

c. Repeat part (a) by treating each customer ID as a market basket. Each item should be treated as a binary variable (1 if an item appears in at least one transaction bought by the customer, and 0 otherwise).

d. Use the results in part (c) to compute the confidence for the association rules $\{b, d\} \rightarrow \{e\}$ and $\{e\} \rightarrow \{b, d\}$.

e. Suppose $s_1$ and $c_1$ are the support and confidence values of an association rule r when treating each transaction ID as a market basket. Also, let $s_2$ and $c_2$ be the support and confidence values of r when treating each customer ID as a market basket. Discuss whether there are any relationships between $s_1$ and $s_2$ or $c_1$ and $c_2$.

3.

a. What is the confidence for the rules $\emptyset \rightarrow A$ and $A \rightarrow \emptyset$?

b. Let $c_1$, $c_2$, and $c_3$ be the confidence values of the rules $\{p\} \rightarrow \{q\}$, $\{p\} \rightarrow \{q, r\}$, and $\{p, r\} \rightarrow \{q\}$, respectively. If we assume that $c_1$, $c_2$, and $c_3$ have different values, what are the possible relationships that may exist among $c_1$, $c_2$, and $c_3$? Which rule has the lowest confidence?

c. Repeat the analysis in part (b) assuming that the rules have identical support. Which rule has the highest confidence?

d. Transitivity: Suppose the confidences of the rules $A \rightarrow B$ and $B \rightarrow C$ are larger than some threshold, minconf. Is it possible that $A \rightarrow C$ has a confidence less than minconf?

4. For each of the following measures, determine whether it is monotone, anti-monotone, or non-monotone (i.e., neither monotone nor anti-monotone).

Example: Support, $s = \sigma(X)/|T|$, is anti-monotone because $s(X) \geq s(Y)$ whenever $X \subset Y$.

a. A characteristic rule is a rule of the form $\{p\} \rightarrow \{q_1, q_2, \ldots, q_n\}$, where the rule antecedent contains only a single item. An itemset of size k can produce up to k characteristic rules. Let $\zeta$ be the minimum confidence of all characteristic rules generated from a given itemset:

$$\zeta(\{p_1, p_2, \ldots, p_k\}) = \min\big[c(\{p_1\} \rightarrow \{p_2, p_3, \ldots, p_k\}), \ldots, c(\{p_k\} \rightarrow \{p_1, p_2, \ldots, p_{k-1}\})\big]$$

Is $\zeta$ monotone, anti-monotone, or non-monotone?

b. A discriminant rule is a rule of the form $\{p_1, p_2, \ldots, p_n\} \rightarrow \{q\}$, where the rule consequent contains only a single item. An itemset of size k can produce up to k discriminant rules. Let $\eta$ be the minimum confidence of all discriminant rules generated from a given itemset:

$$\eta(\{p_1, p_2, \ldots, p_k\}) = \min\big[c(\{p_2, p_3, \ldots, p_k\} \rightarrow \{p_1\}), \ldots, c(\{p_1, p_2, \ldots, p_{k-1}\} \rightarrow \{p_k\})\big]$$

Is $\eta$ monotone, anti-monotone, or non-monotone?

c. Repeat the analysis in parts (a) and (b) by replacing the min function with a max function.
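The support and confidence definitions that Exercises 2 through 4 rely on can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `support` and `confidence` and the toy baskets are hypothetical, and the data is deliberately different from Table 5.20.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    count = sum(1 for t in transactions if itemset <= set(t))
    return count / len(transactions)


def confidence(antecedent, consequent, transactions):
    """c(X -> Y) = s(X union Y) / s(X); undefined when X never occurs."""
    s_x = support(antecedent, transactions)
    if s_x == 0:
        raise ValueError("antecedent never occurs, so confidence is undefined")
    return support(set(antecedent) | set(consequent), transactions) / s_x


# Hypothetical toy baskets (not taken from any table in the text).
toy = [
    {"bread", "milk"},
    {"bread", "jam"},
    {"bread", "milk", "jam"},
    {"milk"},
    {"bread"},
]

s_pair = support({"bread", "milk"}, toy)
c_bread_milk = confidence({"bread"}, {"milk"}, toy)
c_milk_bread = confidence({"milk"}, {"bread"}, toy)
print(s_pair, c_bread_milk, c_milk_bread)

# Adding an item can never raise support (the anti-monotone property).
print(support({"bread"}, toy) >= s_pair)
```

The same two helpers can be pointed at either view of a data set (one basket per transaction, or one basket per customer) simply by changing how the list of baskets is built.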

5. Prove Equation 5.3. (Hint: First, count the number of ways to create an itemset that forms the left-hand side of the rule. Next, for each size k itemset selected for the left-hand side, count the number of ways to choose the remaining $d-k$ items to form the right-hand side of the rule.) Assume that neither of the itemsets of a rule is empty.

6. Consider the market basket transactions shown in Table 5.21.

Table 5.21. Market basket transactions.

Transaction ID   Items Bought
1                {Milk, Beer, Diapers}
2                {Bread, Butter, Milk}
3                {Milk, Diapers, Cookies}
4                {Bread, Butter, Cookies}
5                {Beer, Cookies, Diapers}
6                {Milk, Diapers, Bread, Butter}
7                {Bread, Butter, Diapers}
8                {Beer, Diapers}
9                {Milk, Diapers, Bread, Butter}
10               {Beer, Cookies}

a. What is the maximum number of association rules that can be extracted from this data (including rules that have zero support)?

b. What is the maximum size of frequent itemsets that can be extracted (assuming $\mathit{minsup} > 0$)?

c. Write an expression for the maximum number of size-3 itemsets that can

be derived from this data set.

d. Find an itemset (of size 2 or larger) that has the largest support.

e. Find a pair of items, a and b, such that the rules $\{a\} \rightarrow \{b\}$ and $\{b\} \rightarrow \{a\}$ have the same confidence.

7. Show that if a candidate k-itemset X has a subset of size less than $k-1$ that is infrequent, then at least one of the $(k-1)$-size subsets of X is necessarily infrequent.

8. Consider the following set of frequent 3-itemsets:

{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}, {2, 3, 5}, {3, 4, 5}.

Assume that there are only five items in the data set.

a. List all candidate 4-itemsets obtained by a candidate generation procedure using the $F_{k-1} \times F_1$ merging strategy.

b. List all candidate 4-itemsets obtained by the candidate generation procedure in Apriori.

c. List all candidate 4-itemsets that survive the candidate pruning step of the Apriori algorithm.

9. The Apriori algorithm uses a generate-and-count strategy for deriving frequent itemsets. Candidate itemsets of size $k+1$ are created by joining a pair of frequent itemsets of size k (this is known as the candidate generation step). A candidate is discarded if any one of its subsets is found to be infrequent during the candidate pruning step. Suppose the Apriori algorithm is applied to the data set shown in Table 5.22 with $\mathit{minsup} = 30\%$, i.e., any itemset occurring in fewer than 3 transactions is considered to be infrequent.

Table 5.22. Example of market basket transactions.

Transaction ID   Items Bought
1                {a, b, d, e}
2                {b, c, d}
3                {a, b, d, e}
4                {a, c, d, e}
5                {b, c, d, e}
6                {b, d, e}
7                {c, d}
8                {a, b, c}
9                {a, d, e}
10               {b, d}

a. Draw an itemset lattice representing the data set given in Table 5.22.

Label each node in the lattice with the following letter(s):

N: If the itemset is not considered to be a candidate itemset by the

Apriori algorithm. There are two reasons for an itemset not to be

considered as a candidate itemset: (1) it is not generated at all during

the candidate generation step, or (2) it is generated during the

candidate generation step but is subsequently removed during the

candidate pruning step because one of its subsets is found to be

infrequent.

F: If the candidate itemset is found to be frequent by the Apriori

algorithm.

I: If the candidate itemset is found to be infrequent after support

counting.

b. What is the percentage of frequent itemsets (with respect to all itemsets

in the lattice)?

c. What is the pruning ratio of the Apriori algorithm on this data set?

(Pruning ratio is defined as the percentage of itemsets not considered to

be a candidate because (1) they are not generated during candidate

generation or (2) they are pruned during the candidate pruning step.)

d. What is the false alarm rate (i.e., percentage of candidate itemsets that

are found to be infrequent after performing support counting)?
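The generate-and-count strategy described in Exercise 9 can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name `apriori`, the parameter `minsup_count`, and the toy transactions are assumptions made for the example, and the prefix-based join shown here is only one common way to realize "joining a pair of frequent itemsets of size k".

```python
from itertools import combinations


def apriori(transactions, minsup_count):
    """Return {frozenset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})

    def count(candidates):
        # Support counting: scan the transactions once per candidate.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    singletons = [frozenset([i]) for i in items]
    current = {c: n for c, n in count(singletons).items() if n >= minsup_count}
    frequent = dict(current)
    k = 1
    while current:
        ordered = sorted(tuple(sorted(c)) for c in current)
        # Candidate generation: join two frequent k-itemsets that agree
        # on their first k-1 items, giving a (k+1)-itemset.
        candidates = set()
        for a, b in combinations(ordered, 2):
            if a[:-1] == b[:-1]:
                candidates.add(frozenset(a) | frozenset(b))
        # Candidate pruning: discard a candidate if any k-subset is infrequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k))
        }
        current = {c: n for c, n in count(candidates).items() if n >= minsup_count}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy data (not Table 5.22); items must be sortable.
toy = [{"x", "y", "z"}, {"x", "y"}, {"x", "z"}, {"y", "z"}, {"x", "y", "z"}]
result = apriori(toy, minsup_count=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

In this toy run the 3-itemset is generated and survives pruning, but it is rejected during support counting, which is the situation the false alarm rate in part (d) measures.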

10. The Apriori algorithm uses a hash tree data structure to efficiently count the support of candidate itemsets. Consider the hash tree for candidate 3-itemsets shown in Figure 5.32.

Figure 5.32.

An example of a hash tree structure.

a. Given a transaction that contains items {1, 3, 4, 5, 8}, which of the hash

tree leaf nodes will be visited when finding the candidates of the

transaction?

b. Use the visited leaf nodes in part (a) to determine the candidate itemsets

that are contained in the transaction {1, 3, 4, 5, 8}.

11. Consider the following set of candidate 3-itemsets:

{1, 2, 3}, {1, 2, 6}, {1, 3, 4}, {2, 3, 4}, {2, 4, 5}, {3, 4, 6}, {4, 5, 6}

a. Construct a hash tree for the above candidate 3-itemsets. Assume the

tree uses a hash function where all odd-numbered items are hashed to

the left child of a node, while the even-numbered items are hashed to the

right child. A candidate k-itemset is inserted into the tree by hashing on

each successive item in the candidate and then following the appropriate

branch of the tree according to the hash value. Once a leaf node is

reached, the candidate is inserted based on one of the following

conditions:

Condition 1: If the depth of the leaf node is equal to k (the root is

assumed to be at depth 0), then the candidate is inserted regardless of

the number of itemsets already stored at the node.

Condition 2: If the depth of the leaf node is less than k, then the candidate can be inserted as long as the number of itemsets stored at the node is less than maxsize. Assume $\mathit{maxsize} = 2$ for this question.

Condition 3: If the depth of the leaf node is less than k and the number of itemsets stored at the node is equal to maxsize, then the leaf node is converted into an internal node. New leaf nodes are created as children of the old leaf node. Candidate itemsets previously stored in the old leaf node are distributed to the children based on their hash values. The new candidate is also hashed to its appropriate leaf node.

b. How many leaf nodes are there in the candidate hash tree? How many

internal nodes are there?

c. Consider a transaction that contains the following items: {1, 2, 3, 5, 6}.

Using the hash tree constructed in part (a), which leaf nodes will be

checked against the transaction? What are the candidate 3-itemsets

contained in the transaction?
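The three insertion conditions in Exercise 11 can be sketched in Python as shown below. This is a hedged illustration, not the text's own implementation: the class `Node`, the helpers `branch`, `insert`, and `leaves`, and the candidate list are all hypothetical. The sketch also assumes that the root starts out as an empty leaf and that a child is created only when a candidate is first hashed to it, so counts of empty leaves may differ from other conventions.

```python
class Node:
    def __init__(self, depth):
        self.depth = depth
        self.is_leaf = True
        self.itemsets = []   # candidates stored while the node is a leaf
        self.children = {}   # branch index -> Node, used once internal


def branch(item):
    """Odd items hash to the left child (0), even items to the right (1)."""
    return 0 if item % 2 == 1 else 1


def insert(root, itemset, maxsize):
    k = len(itemset)
    node = root
    # Hash on successive items until a leaf is reached.
    while not node.is_leaf:
        b = branch(itemset[node.depth])
        if b not in node.children:
            node.children[b] = Node(node.depth + 1)
        node = node.children[b]
    # Condition 1 (leaf at depth k) or Condition 2 (leaf still has room).
    if node.depth == k or len(node.itemsets) < maxsize:
        node.itemsets.append(itemset)
        return
    # Condition 3: turn the full leaf into an internal node and redistribute.
    pending = node.itemsets + [itemset]
    node.itemsets = []
    node.is_leaf = False
    for stored in pending:
        insert(node, stored, maxsize)


def leaves(node):
    if node.is_leaf:
        return [node]
    return [leaf for child in node.children.values() for leaf in leaves(child)]


# Hypothetical candidate 3-itemsets (not the ones listed in the exercise).
root = Node(depth=0)
for candidate in [(1, 4, 7), (1, 5, 9), (2, 3, 6), (3, 5, 8)]:
    insert(root, candidate, maxsize=2)

for leaf in leaves(root):
    print(leaf.depth, leaf.itemsets)
```

Each candidate is stored as a sorted tuple so that `itemset[node.depth]` is the item to hash on at that level of the tree.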

12. Given the lattice structure shown in Figure 5.33 and the transactions given in Table 5.22, label each node with the following letter(s):

M if the node is a maximal frequent itemset,
C if it is a closed frequent itemset,
N if it is frequent but neither maximal nor closed, and
I if it is infrequent.

Assume that the support threshold is equal to 30%.

Figure 5.33. An itemset lattice.
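The labeling scheme in Exercise 12 can be sketched as a brute-force pass over the lattice. The code below is an assumption-laden illustration: the function name `label_itemsets`, the parameter `minsup_count` (assumed to be at least 1), and the toy transactions are hypothetical and are not Table 5.22. Because every maximal frequent itemset is also closed, the sketch writes "MC" for such nodes.

```python
from itertools import combinations


def label_itemsets(transactions, minsup_count):
    """Label every non-empty itemset as 'MC', 'C', 'N', or 'I'."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    labels = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            n = count(itemset)
            if n < minsup_count:
                labels[itemset] = "I"   # infrequent
                continue
            # Only immediate supersets need checking, because support
            # can never increase when an item is added.
            super_counts = [count(itemset | {i}) for i in items if i not in itemset]
            if all(c < minsup_count for c in super_counts):
                labels[itemset] = "MC"  # maximal, and therefore closed
            elif all(c < n for c in super_counts):
                labels[itemset] = "C"   # closed but not maximal
            else:
                labels[itemset] = "N"   # frequent, neither maximal nor closed
    return labels


# Hypothetical toy data, chosen so that all four labels appear.
toy = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"a", "c"}]
labels = label_itemsets(toy, minsup_count=2)
for itemset in sorted(labels, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), labels[itemset])
```

Enumerating the whole lattice is exponential in the number of items, so this approach is only meant for lattices as small as the one in the exercise.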

13. The original association rule mining formulation uses the support and

confidence measures to prune uninteresting rules.

a. Draw a contingency table for each of the following rules using the transactions shown in Table 5.23.

Table 5.23. Example of market basket transactions.

Transaction ID   Items Bought
1                {a, b, d, e}
2                {b, c, d}
3                {a, b, d, e}
4                {a, c, d, e}
5                {b, c, d, e}
6                {b, d, e}
7                {c, d}
8                {a, b, c}
9                {a, d, e}
10               {b, d}

Rules: $\{b\} \rightarrow \{c\}$, $\{a\} \rightarrow \{d\}$, $\{b\} \rightarrow \{d\}$, $\{e\} \rightarrow \{c\}$, $\{c\} \rightarrow \{a\}$.

b. Use the contingency tables in part (a) to compute and rank the rules in decreasing order according to the following measures.

i. Support.

ii. Confidence.

iii. $\text{Interest}(X \rightarrow Y) = \frac{P(X, Y)}{P(X)\,P(Y)}$.

iv. $\text{IS}(X \rightarrow Y) = \frac{P(X, Y)}{\sqrt{P(X)\,P(Y)}}$.

v. $\text{Klosgen}(X \rightarrow Y) = \sqrt{P(X, Y)} \times \max\big(P(Y|X) - P(Y),\ P(X|Y) - P(X)\big)$, where $P(Y|X) = \frac{P(X, Y)}{P(X)}$.

vi. $\text{Odds ratio}(X \rightarrow Y) = \frac{P(X, Y)\,P(\bar{X}, \bar{Y})}{P(X, \bar{Y})\,P(\bar{X}, Y)}$.

14. Given the rankings you obtained in Exercise 13, compute the correlation between the rankings of confidence and the other five measures. Which measure is most highly correlated with confidence? Which measure is least correlated with confidence?

15. Answer the following questions using the data sets shown in Figure 5.34. Note that each data set contains 1000 items and 10,000 transactions. Dark cells indicate the presence of items and white cells indicate the absence of items. We will apply the Apriori algorithm to extract frequent itemsets with $\mathit{minsup} = 10\%$ (i.e., itemsets must be contained in at least 1000 transactions).

Figure 5.34.

Figures for Exercise 15.

a. Which data set(s) will produce the largest number of frequent itemsets?

b. Which data set(s) will produce the smallest number of frequent itemsets?

c. Which data set(s) will produce the longest frequent itemset?

d. Which data set(s) will produce frequent itemsets with the highest maximum support?

e. Which data set(s) will produce frequent itemsets containing items with widely varying support levels (i.e., items with mixed support, ranging from less than 20% to more than 70%)?

16.

a. Prove that the $\phi$ coefficient is equal to 1 if and only if $f_{11} = f_{1+} = f_{+1}$.

b. Show that if A and B are independent, then $P(A, B) \times P(\bar{A}, \bar{B}) = P(A, \bar{B}) \times P(\bar{A}, B)$.

c. Show that Yule's Q and Y coefficients

$$Q = \frac{f_{11} f_{00} - f_{10} f_{01}}{f_{11} f_{00} + f_{10} f_{01}}, \qquad Y = \frac{\sqrt{f_{11} f_{00}} - \sqrt{f_{10} f_{01}}}{\sqrt{f_{11} f_{00}} + \sqrt{f_{10} f_{01}}}$$

are normalized versions of the odds ratio.

d. Write a simplified expression for the value of each measure shown in Table 5.9 when the variables are statistically independent.

17. Consider the interestingness measure, $M = \frac{P(B|A) - P(B)}{1 - P(B)}$, for an association rule $A \rightarrow B$.

a. What is the range of this measure? When does the measure attain its

maximum and minimum values?

b. How does M behave when P(A, B) is increased while P(A) and P(B) remain unchanged?

c. How does M behave when P(A) is increased while P(A, B) and P(B) remain unchanged?

d. How does M behave when P(B) is increased while P(A, B) and P(A) remain unchanged?

e. Is the measure symmetric under variable permutation?

f. What is the value of the measure when A and B are statistically

independent?

g. Is the measure null-invariant?

h. Does the measure remain invariant under row or column scaling

operations?

i. How does the measure behave under the inversion operation?
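Several of the measures that appear in Exercises 13, 16, and 17 can be computed directly from the four cells of a 2×2 contingency table. The sketch below is illustrative only: the function name `measures`, the dictionary keys, and the example counts are assumptions, and the code assumes that every marginal count and every denominator is nonzero.

```python
from math import sqrt


def measures(f11, f10, f01, f00):
    """Interestingness measures for the rule A -> B.

    f11: A and B both present, f10: A without B,
    f01: B without A, f00: neither present.
    """
    n = f11 + f10 + f01 + f00
    p_ab = f11 / n
    p_a = (f11 + f10) / n
    p_b = (f11 + f01) / n
    p_b_given_a = p_ab / p_a
    return {
        "interest": p_ab / (p_a * p_b),
        "IS": p_ab / sqrt(p_a * p_b),
        "phi": (p_ab - p_a * p_b) / sqrt(p_a * p_b * (1 - p_a) * (1 - p_b)),
        "odds_ratio": (f11 * f00) / (f10 * f01),
        "yule_q": (f11 * f00 - f10 * f01) / (f11 * f00 + f10 * f01),
        "M": (p_b_given_a - p_b) / (1 - p_b),
    }


# Hypothetical counts, not taken from any table in the text.
example = measures(f11=30, f10=10, f01=20, f00=40)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Re-running `measures` on a modified table (for example, with `f00` scaled up, or with rows and columns swapped) is one way to explore the invariance questions in parts (e) through (i) numerically before attempting a proof.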

18. Suppose we have market basket data consisting of 100 transactions and

20 items. Assume the support for item a is 25%, the support for item b is 90%

and the support for itemset {a, b} is 20%. Let the support and confidence

thresholds be 10% and 60%, respectively.

a. Compute the confidence of the association rule $\{a\} \rightarrow \{b\}$. Is the rule interesting according to the confidence measure?

b. Compute the interest measure for the association pattern {a, b}. Describe the nature of the relationship between item a and item b in terms of the interest measure.

c. What conclusions can you draw from the results of parts (a) and (b)?

d. Prove that if the confidence of the rule $\{a\} \rightarrow \{b\}$ is less than the support of {b}, then:

i. $c(\{\bar{a}\} \rightarrow \{b\}) > c(\{a\} \rightarrow \{b\})$,

ii. $c(\{\bar{a}\} \rightarrow \{b\}) > s(\{b\})$,

where $c(\cdot)$ denotes the rule confidence and $s(\cdot)$ denotes the support of an itemset.

19. Table 5.24 shows a $2 \times 2 \times 2$ contingency table for the binary variables A and B at different values of the control variable C.

Table 5.24. A Contingency Table.

                  A = 1   A = 0
C = 0   B = 1       0      15
        B = 0      15      30
C = 1   B = 1       5       0
        B = 0       0      15

a. Compute the $\phi$ coefficient for A and B when $C = 0$, $C = 1$, and $C = 0$ or 1. Note that

$$\phi(\{A, B\}) = \frac{P(A, B) - P(A)P(B)}{\sqrt{P(A)P(B)(1 - P(A))(1 - P(B))}}.$$

b. What conclusions can you draw from the above result?

20. Consider the contingency tables shown in Table 5.25.

a. For table I, compute support, the interest measure, and the correlation coefficient for the association pattern {A, B}. Also, compute the confidence of rules $A \rightarrow B$ and $B \rightarrow A$.

b. For table II, compute support, the interest measure, and the correlation coefficient for the association pattern {A, B}. Also, compute the confidence of rules $A \rightarrow B$ and $B \rightarrow A$.