Slides related to:
Data Mining: Concepts and Techniques
Chapter 1 and 2
Introduction and Data preprocessing
Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber. All rights reserved.

Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
  Data collection and data availability
    Automated data collection tools, database systems, Web, computerized society
  Major sources of abundant data
    Business: Web, e-commerce, transactions, stocks, …
    Science: Remote sensing, bioinformatics, scientific simulation, …
    Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: data mining, the automated analysis of massive data sets
Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques 2
Ex. 1: Market Analysis and Management
Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
  Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
  Determine customer purchasing patterns over time
Cross-market analysis: find associations/co-relations between product sales, and predict based on such associations
Customer profiling: what types of customers buy what products (clustering or classification)
Customer requirement analysis
  Identify the best products for different groups of customers
  Predict what factors will attract new customers
Provision of summary information
  Multidimensional summary reports
  Statistical summary information (data central tendency and variation)

Ex. 2: Corporate Analysis & Risk Management
Finance planning and asset evaluation
  cash flow analysis and prediction
  contingent claim analysis to evaluate assets
  cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
Resource planning
  summarize and compare the resources and spending
Competition
  monitor competitors and market directions
  group customers into classes and a class-based pricing procedure
  set pricing strategy in a highly competitive market
Ex. 3: Fraud Detection & Mining Unusual Patterns
Approaches: Clustering & model construction for frauds, outlier analysis
Applications: Health care, retail, credit card service, telecomm.
  Auto insurance: ring of collisions
  Money laundering: suspicious monetary transactions
  Medical insurance
    Professional patients, ring of doctors, and ring of references
    Unnecessary or correlated screening tests
  Telecommunications: phone-call fraud
    Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
  Retail industry
    Analysts estimate that 38% of retail shrink is due to dishonest employees
  Anti-terrorism

Evolution of Database Technology
1960s:
  Data collection, database creation, IMS and network DBMS
1970s:
  Relational data model, relational DBMS implementation
1980s:
  Advanced data models (extended-relational, OO, deductive, etc.)
  Application-oriented DBMS (spatial, temporal, multimedia, etc.)
1990s:
  Data mining, data warehousing, multimedia databases, and Web databases
2000s:
  Stream data management and mining
  Data mining and its applications
  Web technology (XML, data integration) and global information systems
What Is Data Mining?
Data mining (knowledge discovery from data)
  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  Data mining: a misnomer?
Alternative names
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”?
  Simple search and query processing
  (Deductive) expert systems

Knowledge Discovery (KDD) Process
Data mining: core of the knowledge discovery process
[Figure: KDD process. Stages, from bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection and transformation; Task-relevant Data; Data Mining; Pattern evaluation and presentation]
Why Data Preprocessing?
Data in the real world is dirty
  incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    e.g., occupation=“ ”
  noisy: containing errors or outliers
    e.g., Salary=“-10”
  inconsistent: containing discrepancies in codes or names
    e.g., Age=“42” Birthdate=“03/07/1997”
    e.g., Was rating “1,2,3”, now rating “A, B, C”
    e.g., discrepancy between duplicate records

Why Is Data Dirty?
Incomplete data may come from
  “Not applicable” data value when collected
  Different considerations between the time when the data was collected and when it is analyzed
  Human/hardware/software problems
Noisy data (incorrect values) may come from
  Faulty data collection instruments
  Human or computer error at data entry
  Errors in data transmission
Inconsistent data may come from
  Different data sources
  Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
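The three kinds of dirty data named on the preprocessing slides (incomplete, noisy, inconsistent) can be illustrated with a small check over a single record. This is only a sketch: the function name `find_quality_issues`, the record layout, the reading of “03/07/1997” as March 7, 1997, and the reference date of January 1, 2006 are all assumptions made for the example, not part of the slides.

```python
from datetime import date

def find_quality_issues(record, today):
    """Return (kind, message) pairs describing problems in one record.

    The field names and rules are illustrative assumptions.
    """
    issues = []

    # Incomplete: an attribute value is absent or blank.
    occupation = record.get("occupation")
    if occupation is None or not occupation.strip():
        issues.append(("incomplete", "occupation is missing"))

    # Noisy: a value that cannot be correct, such as a negative salary.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        issues.append(("noisy", f"salary is negative: {salary}"))

    # Inconsistent: the stored age disagrees with the stored birthdate.
    age = record.get("age")
    birthdate = record.get("birthdate")
    if age is not None and birthdate is not None:
        had_birthday = (today.month, today.day) >= (birthdate.month, birthdate.day)
        expected = today.year - birthdate.year - (0 if had_birthday else 1)
        if age != expected:
            issues.append(
                ("inconsistent", f"age {age} does not match birthdate (expected {expected})")
            )
    return issues

# The slide's own examples combined into one record (assumed layout).
record = {"occupation": " ", "salary": -10, "age": 42, "birthdate": date(1997, 3, 7)}
issues = find_quality_issues(record, today=date(2006, 1, 1))
for kind, message in issues:
    print(f"{kind}: {message}")
```

Real cleaning rules depend on the domain; the point here is only that each category of dirty data maps to a different kind of test.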
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
  Quality decisions must be based on quality data
    e.g., duplicate or missing data may cause incorrect or even misleading statistics.
  Data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

Forms of Data Preprocessing
[Figure]
Architecture: Typical Data Mining System
[Figure: layered architecture, top to bottom: Graphical User Interface; Pattern Evaluation; Data Mining Engine; Database or Data Warehouse Server; data cleaning, integration, and selection; and the sources Database, Data Warehouse, World-Wide Web, and Other Info Repositories. A Knowledge-Base sits alongside Pattern Evaluation and the Data Mining Engine.]

Why Not Traditional Data Analysis?
Tremendous amount of data
  Algorithms must be highly scalable to handle large amounts of data
High-dimensionality of data
  Micro-array may have tens of thousands of dimensions
High complexity of data
  Data streams and sensor data
  Time-series data, temporal data, sequence data
  Structure data, graphs, social networks and multi-linked data
  Heterogeneous databases and legacy databases
  Spatial, spatiotemporal, multimedia, text and Web data
New and sophisticated applications
Data Mining: Classification Schemes
General functionality
  Descriptive data mining
  Predictive data mining
Different views lead to different classifications
  Data view: Kinds of data to be mined
  Knowledge view: Kinds of knowledge to be discovered
  Method view: Kinds of techniques utilized
  Application view: Kinds of applications adapted

Data Mining: on what kinds of data?
Database-oriented data sets and applications
  Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
  Object-relational databases
  Time-series data, temporal data, sequence data (incl. bio-sequences)
  Spatial data and spatiotemporal data
  Text databases and Multimedia databases
  Data streams and sensor data
  The World-Wide Web
  Heterogeneous databases and legacy databases
Data Mining – what kinds of patterns?
Concept/class description:
  Characterization: summarizing the data of the class under study in general terms
    E.g. Characteristics of customers spending more than 10000 SEK per year
  Discrimination: comparing target class with other (contrasting) classes
    E.g. Compare the characteristics of products that had a sales increase to products that had a sales decrease last year

Data Mining – what kinds of patterns?
Frequent patterns, association, correlations
  Frequent itemset
  Frequent sequential pattern
  Frequent structured pattern
E.g. buy(X, “Diaper”) → buy(X, “Beer”) [support=0.5%, confidence=75%]
  confidence: if X buys a diaper, then there is a 75% chance that X buys beer
  support: of all transactions under consideration, 0.5% showed that diaper and beer were bought together
E.g. Age(X, “20..29”) and income(X, “20k..29k”) → buys(X, “cd-player”) [support=2%, confidence=60%]
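The support and confidence figures attached to the association rules above can be computed directly from a list of transactions: support is the fraction of all transactions that contain every item in the rule, and confidence is that support divided by the support of the left-hand side alone. The sketch below uses a made-up basket list, and the function names `support` and `confidence` are chosen for illustration only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(both) / support(antecedent)."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante > 0 else 0.0

# Hypothetical baskets, not data from the slides.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "beer", "bread"},
    {"diaper", "milk"},
    {"milk", "bread"},
    {"bread"},
    {"milk"},
    {"beer"},
]

rule_support = support(transactions, {"diaper", "beer"})
rule_confidence = confidence(transactions, {"diaper"}, {"beer"})
print(f"support={rule_support:.1%}, confidence={rule_confidence:.1%}")
# prints: support=37.5%, confidence=75.0%
```

In this toy data, 3 of 8 baskets contain both items and 4 contain a diaper, so the rule diaper → beer has 75% confidence, as in the slide's example, but a much higher support than the slide's 0.5%.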
Data Mining – what kinds of patterns?
Classification and prediction
  Construct models (functions) that describe and distinguish classes or concepts for future prediction. The derived model is based on analyzing training data – data whose class labels are known.
    E.g., classify countries based on (climate), or classify cars based on (gas mileage)
  Predict some unknown or missing numerical values

Data Mining – what kinds of patterns?
Cluster analysis
  Class label is unknown: Group data to form new classes, e.g., cluster customers to find target groups for marketing
  Maximizing intra-class similarity & minimizing interclass similarity
Outlier analysis
  Outlier: Data object that does not comply with the general behavior of the data
  Noise or exception? Useful in fraud detection, rare events analysis
Trend and evolution analysis
  Trend and deviation
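As one concrete reading of the outlier definition above (a data object that does not comply with the general behavior of the data), the sketch below flags values that fall far outside the interquartile range. The 1.5 × IQR rule is a common statistical convention substituted here for illustration, not a method named in the slides, and the call durations are invented.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # first quartile, median, third quartile
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical phone-call durations in minutes; one call is unusually long.
call_minutes = [3, 5, 4, 6, 5, 4, 7, 5, 6, 240]
outliers = iqr_outliers(call_minutes)
print(outliers)  # prints: [240]
```

Whether a flagged value is noise or a meaningful exception (for example, a fraudulent call) is a separate question that the rule itself cannot answer.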
Are All the “Discovered” Patterns Interesting?
Data mining may generate thousands of patterns: Not all of them are interesting
  Suggested approach: Human-centered, query-based, focused mining
Interestingness measures
  A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
  Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Find All and Only Interesting Patterns?
Find all the interesting patterns: Completeness
  Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
  Heuristic vs. exhaustive search
  Association vs. classification vs. clustering
Search for only interesting patterns: An optimization problem
  Can a data mining system find only the interesting patterns?
  Approaches
    First generate all the patterns and then filter out the uninteresting ones
    Generate only the interesting patterns: mining query optimization
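The first approach listed above, generating all patterns and then filtering out the uninteresting ones, can be sketched for itemsets by using support as the objective interestingness measure. This is a deliberately naive illustration: the function names, the sample baskets, and the 50% threshold are assumptions, and the exhaustive enumeration is only workable for a handful of items.

```python
from itertools import combinations

def all_itemsets(transactions):
    """Yield every non-empty combination of the items seen in the data."""
    items = sorted(set().union(*transactions))
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def frequent_itemsets(transactions, min_support):
    """Generate all candidate itemsets, then keep those meeting min_support."""
    n = len(transactions)
    kept = {}
    for itemset in all_itemsets(transactions):
        count = sum(1 for t in transactions if itemset <= t)
        if count / n >= min_support:
            kept[itemset] = count / n
    return kept

# Hypothetical baskets, not data from the slides.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "milk"},
    {"milk", "bread"},
]

frequent = frequent_itemsets(transactions, min_support=0.5)
for itemset, value in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), value)
```

With 4 distinct items there are 15 candidate itemsets, of which 5 pass the filter. The number of candidates doubles with every added item, which is why the second approach (generating only the interesting patterns) matters in practice.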
Data Mining – what techniques used?
[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, and Other Disciplines]

Top-10 Most Popular DM Algorithms: 18 Identified Candidates (I)
Classification
  #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
  #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984.
  #3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. TPAMI. 18(6)
  #4. Naive Bayes: Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385-398.
Statistical Learning
  #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag.
  #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York.
Association Analysis
  #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
  #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD '00.