Data Mining Jiawei Han

Slides related to:
Data Mining: Concepts and Techniques
Chapters 1 and 2: Introduction and Data Preprocessing
    Jiawei Han and Micheline Kamber
    Department of Computer Science
    University of Illinois at Urbana-Champaign
    www.cs.uiuc.edu/~hanj
    ©2006 Jiawei Han and Micheline Kamber. All rights reserved.

Why Data Mining?
    The Explosive Growth of Data: from terabytes to petabytes
        Data collection and data availability
            Automated data collection tools, database systems, Web, computerized society
        Major sources of abundant data
            Business: Web, e-commerce, transactions, stocks, …
            Science: Remote sensing, bioinformatics, scientific simulation, …
            Society and everyone: news, digital cameras, YouTube
    We are drowning in data, but starving for knowledge!
    “Necessity is the mother of invention”: data mining, the automated analysis of massive data sets
                                                                 Data Mining: Concepts and Techniques                           1                                                  Data Mining: Concepts and Techniques                           2
Ex. 1: Market Analysis and Management
    Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
    Target marketing
        Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
        Determine customer purchasing patterns over time
    Cross-market analysis: find associations/co-relations between product sales, and predict based on such associations
    Customer profiling: what types of customers buy what products (clustering or classification)
    Customer requirement analysis
        Identify the best products for different groups of customers
        Predict what factors will attract new customers
    Provision of summary information
        Multidimensional summary reports
        Statistical summary information (data central tendency and variation)

Ex. 2: Corporate Analysis & Risk Management
    Finance planning and asset evaluation
        cash flow analysis and prediction
        contingent claim analysis to evaluate assets
        cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
    Resource planning
        summarize and compare the resources and spending
    Competition
        monitor competitors and market directions
        group customers into classes and a class-based pricing procedure
        set pricing strategy in a highly competitive market
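The statistical summary information listed under Ex. 1 (central tendency and variation) can be illustrated with Python's standard `statistics` module. This is only a minimal sketch: the purchase amounts are made-up values, not data from the slides.

```python
import statistics

# Hypothetical purchase amounts for one customer segment (illustrative only).
purchases = [12.0, 15.5, 9.0, 22.0, 15.5, 30.0, 11.0]

summary = {
    "mean": statistics.mean(purchases),      # central tendency
    "median": statistics.median(purchases),  # central tendency, robust to extremes
    "mode": statistics.mode(purchases),      # most frequent value
    "stdev": statistics.stdev(purchases),    # variation (sample standard deviation)
    "range": max(purchases) - min(purchases),  # variation (spread of the values)
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Reporting the median next to the mean is a common choice because a few very large purchases can pull the mean away from the typical customer.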
Ex. 3: Fraud Detection & Mining Unusual Patterns
    Approaches: Clustering & model construction for frauds, outlier analysis
    Applications: Health care, retail, credit card service, telecomm.
        Auto insurance: ring of collisions
        Money laundering: suspicious monetary transactions
        Medical insurance
            Professional patients, ring of doctors, and ring of references
            Unnecessary or correlated screening tests
        Telecommunications: phone-call fraud
            Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
        Retail industry
            Analysts estimate that 38% of retail shrink is due to dishonest employees
        Anti-terrorism

Evolution of Database Technology
    1960s:
        Data collection, database creation, IMS and network DBMS
    1970s:
        Relational data model, relational DBMS implementation
    1980s:
        Advanced data models (extended-relational, OO, deductive, etc.)
        Application-oriented DBMS (spatial, temporal, multimedia, etc.)
    1990s:
        Data mining, data warehousing, multimedia databases, and Web databases
    2000s:
        Stream data management and mining
        Data mining and its applications
        Web technology (XML, data integration) and global information systems
What Is Data Mining?
    Data mining (knowledge discovery from data)
        Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
        Data mining: a misnomer?
    Alternative names
        Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
    Watch out: Is everything “data mining”?
        Simple search and query processing
        (Deductive) expert systems

Knowledge Discovery (KDD) Process
    Data mining: core of the knowledge discovery process
    [Figure, stages from bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection and transformation; Task-relevant Data; Data Mining; Pattern evaluation and presentation]
Why Data Preprocessing?
    Data in the real world is dirty
        incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
            e.g., occupation=“ ”
        noisy: containing errors or outliers
            e.g., Salary=“-10”
        inconsistent: containing discrepancies in codes or names
            e.g., Age=“42” Birthdate=“03/07/1997”
            e.g., Was rating “1,2,3”, now rating “A, B, C”
            e.g., discrepancy between duplicate records

Why Is Data Dirty?
    Incomplete data may come from
        “Not applicable” data value when collected
        Different considerations between the time when the data was collected and when it is analyzed
        Human/hardware/software problems
    Noisy data (incorrect values) may come from
        Faulty data collection instruments
        Human or computer error at data entry
        Errors in data transmission
    Inconsistent data may come from
        Different data sources
        Functional dependency violation (e.g., modify some linked data)
    Duplicate records also need data cleaning
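The three kinds of dirty data named on these slides (incomplete, noisy, inconsistent) can be checked for mechanically. The sketch below mirrors the slide examples (empty occupation, negative salary, age that disagrees with birthdate). The record layout, field names, and the reference date are assumptions made for illustration, not part of the original material.

```python
from datetime import date

# Hypothetical records; field names and values are illustrative assumptions.
records = [
    {"id": 1, "occupation": "engineer", "salary": 52000, "age": 30, "birthdate": date(1976, 5, 1)},
    {"id": 2, "occupation": "", "salary": 48000, "age": 41, "birthdate": date(1965, 2, 9)},
    {"id": 3, "occupation": "teacher", "salary": -10, "age": 35, "birthdate": date(1971, 8, 20)},
    {"id": 4, "occupation": "nurse", "salary": 39000, "age": 42, "birthdate": date(1997, 3, 7)},
]

def find_problems(record, today):
    """Return a list of data-quality issues found in one record."""
    problems = []
    if not record["occupation"].strip():
        problems.append("incomplete: occupation is missing")
    if record["salary"] < 0:
        problems.append("noisy: salary is negative")
    # Age implied by the birthdate on the reference date.
    born = record["birthdate"]
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    implied_age = today.year - born.year - (0 if had_birthday else 1)
    # Allow one year of slack, since the stored age may be slightly stale.
    if abs(implied_age - record["age"]) > 1:
        problems.append("inconsistent: age does not match birthdate")
    return problems

# Reference date chosen to match the 2006 slides (an assumption).
today = date(2006, 6, 1)
report = {r["id"]: find_problems(r, today) for r in records}
for record_id, problems in report.items():
    print(record_id, problems or "ok")
```

Checks like these only detect problems. Deciding how to repair them (fill in, correct, or drop the record) is a separate cleaning step.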
Why Is Data Preprocessing Important?
    No quality data, no quality mining results!
        Quality decisions must be based on quality data
            e.g., duplicate or missing data may cause incorrect or even misleading statistics
        Data warehouse needs consistent integration of quality data
    Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse

Forms of Data Preprocessing
    [Figure not captured in the text]
Architecture: Typical Data Mining System
    [Figure, layers from top to bottom: Graphical User Interface; Pattern Evaluation; Data Mining Engine; Database or Data Warehouse Server; data cleaning, integration, and selection; data sources (Database, Data Warehouse, World-Wide Web, Other Info Repositories). A Knowledge-Base sits beside the Pattern Evaluation and Data Mining Engine layers.]

Why Not Traditional Data Analysis?
    Tremendous amount of data
        Algorithms must be highly scalable to handle large amounts of data
    High-dimensionality of data
        Micro-array may have tens of thousands of dimensions
    High complexity of data
        Data streams and sensor data
        Time-series data, temporal data, sequence data
        Structure data, graphs, social networks and multi-linked data
        Heterogeneous databases and legacy databases
        Spatial, spatiotemporal, multimedia, text and Web data
    New and sophisticated applications
Data Mining: Classification Schemes
    General functionality
        Descriptive data mining
        Predictive data mining
    Different views lead to different classifications
        Data view: Kinds of data to be mined
        Knowledge view: Kinds of knowledge to be discovered
        Method view: Kinds of techniques utilized
        Application view: Kinds of applications adapted

Data Mining: on what kinds of data?
    Database-oriented data sets and applications
        Relational database, data warehouse, transactional database
    Advanced data sets and advanced applications
        Object-relational databases
        Time-series data, temporal data, sequence data (incl. bio-sequences)
        Spatial data and spatiotemporal data
        Text databases and Multimedia databases
        Data streams and sensor data
        The World-Wide Web
        Heterogeneous databases and legacy databases
Data Mining – what kinds of patterns?
    Concept/class description:
        Characterization: summarizing the data of the class under study in general terms
            E.g. Characteristics of customers spending more than 10000 SEK per year
        Discrimination: comparing target class with other (contrasting) classes
            E.g. Compare the characteristics of products that had a sales increase to products that had a sales decrease last year

Data Mining – what kinds of patterns?
    Frequent patterns, association, correlations
        Frequent itemset
        Frequent sequential pattern
        Frequent structured pattern
        E.g. buys(X, “Diaper”) → buys(X, “Beer”) [support=0.5%, confidence=75%]
            confidence: if X buys a diaper, then there is a 75% chance that X buys beer
            support: of all transactions under consideration, 0.5% showed that diaper and beer were bought together
        E.g. age(X, “20..29”) and income(X, “20k..29k”) → buys(X, “cd-player”) [support=2%, confidence=60%]
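Support and confidence for a rule such as diaper → beer can be computed directly from a list of transactions. The following is a minimal sketch using the definitions given on the slide. The transactions are invented, so the resulting numbers differ from the 0.5% and 75% quoted above.

```python
# Hypothetical market-basket transactions (illustrative only).
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"diaper", "beer", "chips"},
    {"milk", "bread"},
    {"diaper", "milk"},
    {"bread", "chips"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also hold the consequent.

    Assumes the antecedent occurs at least once; otherwise this divides by zero.
    """
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"diaper", "beer"}, transactions)
rule_confidence = confidence({"diaper"}, {"beer"}, transactions)
print(f"diaper -> beer  support={rule_support:.1%}  confidence={rule_confidence:.1%}")
```

Here 3 of the 8 transactions contain both items and 5 contain a diaper, so the rule has support 3/8 and confidence 3/5.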
Data Mining – what kinds of patterns?
    Classification and prediction
        Construct models (functions) that describe and distinguish classes or concepts for future prediction. The derived model is based on analyzing training data, i.e., data whose class labels are known.
            E.g., classify countries based on (climate), or classify cars based on (gas mileage)
        Predict some unknown or missing numerical values

Data Mining – what kinds of patterns?
    Cluster analysis
        Class label is unknown: Group data to form new classes, e.g., cluster customers to find target groups for marketing
        Maximizing intra-class similarity & minimizing interclass similarity
    Outlier analysis
        Outlier: Data object that does not comply with the general behavior of the data
        Noise or exception? Useful in fraud detection, rare events analysis
    Trend and evolution analysis
        Trend and deviation
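One simple way to make "does not comply with the general behavior of the data" concrete is a z-score test: flag any value that lies more than a chosen number of standard deviations from the mean. The slides do not prescribe a particular method, so the technique, the threshold of 2.0, and the call durations below are all assumptions for illustration.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds threshold."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical phone-call durations in minutes (illustrative only).
durations = [3, 4, 5, 4, 6, 5, 3, 4, 5, 60]
outliers = find_outliers(durations)
print(outliers)
```

A single extreme value inflates the standard deviation it is measured against, which is why a modest threshold is used for this small sample. Whether a flagged value is noise or a genuine exception, such as fraud, still has to be judged separately.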
Are All the “Discovered” Patterns Interesting?
    Data mining may generate thousands of patterns: Not all of them are interesting
        Suggested approach: Human-centered, query-based, focused mining
    Interestingness measures
        A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
    Objective vs. subjective interestingness measures
        Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
        Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Find All and Only Interesting Patterns?
    Find all the interesting patterns: Completeness
        Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
        Heuristic vs. exhaustive search
        Association vs. classification vs. clustering
    Search for only interesting patterns: An optimization problem
        Can a data mining system find only the interesting patterns?
        Approaches
            First generate all the patterns and then filter out the uninteresting ones
            Generate only the interesting patterns: mining query optimization
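The first approach, generating all patterns and then filtering out the uninteresting ones, can be sketched with itemsets as the patterns and support as the objective interestingness measure. This is a brute-force illustration under assumed data and an assumed `min_support` of 0.5. It enumerates every candidate exhaustively and is not an implementation of Apriori or any other optimized miner.

```python
from itertools import combinations

# Hypothetical transactions (illustrative only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def all_itemsets(transactions, max_size=2):
    """Step 1: enumerate every candidate itemset with up to max_size items."""
    items = sorted(set().union(*transactions))
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def support(itemset, transactions):
    """Fraction of transactions that contain the whole itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def interesting_itemsets(transactions, min_support=0.5, max_size=2):
    """Step 2: keep only the candidates that meet the support threshold."""
    scored = ((s, support(s, transactions)) for s in all_itemsets(transactions, max_size))
    return {s: sup for s, sup in scored if sup >= min_support}

result = interesting_itemsets(transactions)
for itemset, sup in sorted(result.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), f"{sup:.0%}")
```

The number of candidates grows exponentially with the number of distinct items, which is the motivation for the second approach of generating only the interesting patterns.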
Data Mining – what techniques used?
    [Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, and Other Disciplines]

Top-10 Most Popular DM Algorithms: 18 Identified Candidates (I)
    Classification
        #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
        #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984.
        #3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. TPAMI. 18(6)
        #4. Naive Bayes: Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385-398.
    Statistical Learning
        #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag.
        #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York.
    Association Analysis
        #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
        #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD '00.