Statistical Learning with Python

Statistical Learning in Big Data with Python

Welcome

Welcome to this course! Statistical methods span a range of techniques, from simple summary statistics that help you better understand data, to statistical hypothesis tests and estimation statistics that help you interpret the results of experiments and the predictions of models. This course teaches the basics of statistical methods step by step, with concrete, executable Python examples drawn from real business cases in data analysis and modeling, the routine work of a data analyst. One point deserves emphasis: statistics not only equips you for regular data analysis work, it also paves the way to becoming a data scientist.

Statistics is Important

• Statistics is important to data analysis and machine learning practitioners.

• Statistical learning is the process of not only extracting regularities from data but also interpreting the insights those regularities provide.

• Statistics is a prerequisite for data analysis and applied machine learning in data science.

• Statistical learning plays an essential role in many aspects of data analysis and applied machine learning, including the evaluation of sample and effect sizes, effect comparisons, driver interpretation, segmentation, and prediction.

• Statistical learning is the applied statistics equivalent of predictive modeling in machine learning.

What About This Course

• This course consists of three parts: Fundamental, Intermediate, and Advanced Statistical Learning. Currently, the first two parts are in place, and the third will be ready in about 3 months.

• This is not a math-heavy class, so we try to describe the methods without heavy reliance on formulas and complex mathematics. We focus on what we consider to be the important elements of modern data analysis in the banking, insurance, telecom, and retail industries. Computing is done in Python. Some lectures are devoted to Python itself, starting with tutorials from the ground up and progressing to more detailed sessions that implement the techniques from each class.

• This course is not meant to be taken passively or used as a reference for pursuing an academic degree. It is more like a workshop: you learn by doing, then apply your new understanding with working Python examples drawn from real business cases.

• To get the most out of the course, we recommend that you play with the examples in each tutorial. Extend them, break them, then fix them. Try some of the extensions presented at the end of each lesson and let me know how you do.

Who Is This Course For?

This course is for data analysts, reporting analysts, modelers, and data scientists who may already know some statistical data analysis or applied machine learning. The lessons in this course do assume a few things about you, such as:

• You know your way around basic Python for programming.

• You may know some of the basic SciPy stack (NumPy, pandas, Matplotlib) for array and tabular manipulation and visualization.

• You (as a data analyst, reporting analyst, data scientist, or modeler) want to learn statistical methods to deepen your understanding and application of data analysis and machine learning.

A Better Way

• This course is designed for data analysis and machine learning practitioners. It gives them only those parts of statistics that they need to know in order to work through data analysis and predictive modeling projects.

• The statistical methods are presented the way practitioners learn, that is, with simple language and working code examples drawn from real business cases.

• The course is taught at the right level for practitioners, so that statistics becomes a fascinating, fun, directly applicable, and highly useful area of study.

About Your Outcomes from this Course

This course will teach you the basics of statistical methods that you need to know as a data analysis and machine learning practitioner. After working through this course, you will:

• Know how to sample data and estimate the power of a sample, to ensure the validity of statistical significance tests and effect tests for mean and proportion comparisons

• Be able to calculate and interpret common summary statistics for distributions, population parameters, and observations, and to present data using standard data visualization techniques

• Understand the common types of statistical distributions, to solidify the foundation of parametric statistical analysis and the proper use of parametric A/B tests

• Be able to conduct and interpret parametric and non-parametric statistical hypothesis tests that compare the means or proportions of two or more data samples, for both statistical significance tests and effect tests

• Be skilled in the tabular analysis that makes up a data analyst’s routine work, including χ² goodness-of-fit tests, χ² tests of independence, χ² tests of homogeneity, and the interaction effects of confounders and effect modifiers

• Be able to calculate a non-parametric confidence interval for classification accuracy and a prediction interval for regression, given a significance level

• Understand common statistical resampling approaches, and how to use them to make economical use of available data when evaluating your analysis results and predictive models

• Understand the field of applied statistics, how it relates to data analysis and machine learning, and how to harness statistical methods on data analysis and machine learning projects
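
As a taste of the first outcome in the list above (sampling and power), here is a minimal sketch of a sample-size calculation for comparing two proportions, as in an A/B test. It is an illustration, not the course's own material: the function name `sample_size_two_proportions` and the conversion rates 10% and 12% are assumptions made up for this example, and the formula is the usual normal-approximation one for a two-sided test with equal group sizes.

```python
from math import ceil
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion z-test.

    Normal approximation with equal group sizes:
        n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
    """
    if p1 == p2:
        raise ValueError("p1 and p2 must differ to define an effect size")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # critical value of the test
    z_beta = NormalDist().inv_cdf(power)               # quantile for the target power
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)       # sum of the two Bernoulli variances
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Hypothetical business case: baseline conversion of 10%, hoped-for lift to 12%.
n_per_group = sample_size_two_proportions(0.10, 0.12)
print(f"Customers needed per group: {n_per_group}")
```

Try changing the hoped-for lift or the target power: a smaller difference between the two rates, or a higher power, pushes the required sample size up quickly.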

This new basic understanding of statistical methods will impact your practice of data analysis and machine learning in the following ways:

• Use descriptive statistics and data visualizations to quickly and more deeply understand the shape and relationships in data

• Use inferential statistical tests to quickly and effectively quantify the relationships between samples to interpret your results, such as the results of experiments with different analysis or predictive algorithms or differing configurations

• Use estimation statistics to quickly and effectively quantify the confidence in your analysis results, estimated model skill and model predictions
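
To make the last point concrete, the sketch below uses estimation statistics to put a confidence interval around a model's accuracy. It is a minimal illustration under stated assumptions: the percentile bootstrap is one of several possible resampling approaches, the helper name `bootstrap_accuracy_ci` is hypothetical, and the data (85 correct predictions out of 100) are invented for the example.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for classification accuracy.

    `correct` is a list of 1s (correct prediction) and 0s (incorrect prediction).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(correct)
    # Resample the outcomes with replacement and record the accuracy each time.
    estimates = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lower, upper


# Hypothetical test-set results: 85 of 100 predictions were correct.
outcomes = [1] * 85 + [0] * 15
low, high = bootstrap_accuracy_ci(outcomes)
accuracy = sum(outcomes) / len(outcomes)
print(f"accuracy = {accuracy:.2f}, approximate 95% interval = ({low:.2f}, {high:.2f})")
```

The interval makes no assumption about the shape of the accuracy distribution, which is why resampling is a handy default when the test set is small.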