Introduction to Statistical and Machine Learning Methods for Data Science
Carlos Andre Reis Pinheiro
Mike Patetta
The correct bibliographic citation for this manual is as follows: Pinheiro, Carlos Andre Reis and Mike Patetta. 2021.
Introduction to Statistical and Machine Learning Methods for Data Science. Cary, NC: SAS Institute Inc.
Introduction to Statistical and Machine Learning Methods for Data Science
Copyright © 2021, SAS Institute Inc., Cary, NC, USA
ISBN 978-1-953329-64-6 (Hardcover)
ISBN 978-1-953329-60-8 (Paperback)
ISBN 978-1-953329-61-5 (Web PDF)
ISBN 978-1-953329-62-2 (EPUB)
ISBN 978-1-953329-63-9 (Kindle)
All Rights Reserved. Produced in the United States of America.
For a hard copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted,
in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written
permission of the publisher, SAS Institute Inc.
For a web download or e-book: Your use of this publication shall be governed by the terms established by the
vendor at the time you acquire this publication.
The scanning, uploading, and distribution of this book via the Internet or any other means without the permission
of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not
participate in or encourage electronic piracy of copyrighted materials. Your support of others’ rights is appreciated.
U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer
software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government.
Use, duplication, or disclosure of the Software by the United States Government is subject to the license terms
of this Agreement pursuant to, as applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a), and DFAR
227.7202-4, and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR
52.227-19 (DEC 2007). If FAR 52.227-19 is applicable, this provision serves as notice under clause (c) thereof and
no other notice is required to be affixed to the Software or documentation. The Government’s rights in Software
and documentation shall be only those set forth in this Agreement.
SAS Institute Inc., SAS Campus Drive, Cary, NC 27513-2414
August 2021
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
SAS software may be provided with certain third-party software, including but not limited to open-source
software, which is licensed under its applicable third-party software license agreement. For license information
about third-party software distributed with SAS software, refer to http://support.sas.com/thirdpartylicenses.
Contents
About This Book
About These Authors
Acknowledgments
Foreword
Chapter 1: Introduction to Data Science
Chapter Overview
Data Science
Mathematics and Statistics
Computer Science
Domain Knowledge
Communication and Visualization
Hard and Soft Skills
Data Science Applications
Data Science Lifecycle and the Maturity Framework
Understand the Question
Collect the Data
Explore the Data
Model the Data
Provide an Answer
Advanced Analytics in Data Science
Data Science Practical Examples
Customer Experience
Revenue Optimization
Network Analytics
Data Monetization
Summary
Additional Reading
Chapter 2: Data Exploration and Preparation
Chapter Overview
Introduction to Data Exploration
Nonlinearity
High Cardinality
Unstructured Data
Sparse Data
Outliers
Mis-scaled Input Variables
Introduction to Data Preparation
Representative Sampling
Event-based Sampling
Partitioning
Imputation
Replacement
Transformation
Feature Extraction
Feature Selection
Model Selection
Model Generalization
Bias–Variance Tradeoff
Summary
Chapter 3: Supervised Models – Statistical Approach
Chapter Overview
Classification and Estimation
Linear Regression
Use Case: Customer Value
Logistic Regression
Use Case: Collecting Predictive Model
Decision Tree
Use Case: Subscription Fraud
Summary
Chapter 4: Supervised Models – Machine Learning Approach
Chapter Overview
Supervised Machine Learning Models
Ensemble of Trees
Random Forest
Gradient Boosting
Use Case: Usage Fraud
Neural Network
Use Case: Bad Debt
Summary
Chapter 5: Advanced Topics in Supervised Models
Chapter Overview
Advanced Machine Learning Models and Methods
Support Vector Machines
Use Case: Fraud in Prepaid Subscribers
Factorization Machines
Use Case: Recommender Systems Based on Customer Ratings in Retail